Understanding Fast RCNN

Naveen Vinayak
9 min read · May 2, 2023


Deep convolutional networks have significantly improved image classification and object detection accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Prior to the arrival of Fast R-CNN, most approaches trained models in multi-stage pipelines that are slow and inelegant. In this article I will give a detailed review of the Fast R-CNN paper by Ross Girshick. We will divide our review into 7 parts:

  1. Drawbacks of previous state-of-the-art techniques (R-CNN and SPP-Net)
  2. Fast R-CNN Architecture
  3. Training
  4. Detection
  5. Some Further Observations and Evaluations
  6. Comparison with state-of-the-art results
  7. Main Results

1. Drawbacks of previous state-of-the-art techniques (R-CNN and SPP-Net)

Fast R-CNN proposes a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations. It came as an improvement over R-CNN and SPP-Net.

Some drawbacks of R-CNN were:

  • Training is a multi-stage pipeline: In R-CNN, a ConvNet is first fine-tuned on object proposals using log loss. Then, SVMs are fit to the ConvNet features, replacing the softmax classifier. In the third stage, bounding-box regressors are trained.
  • Training is expensive in space and time.
  • Detection is also very slow.

The main drawback of R-CNN is that it is slow, mainly because of the ConvNet forward pass for each object proposal. There are about 2000 object proposals generated from each image. SPP-Net solves this issue by computing the feature map of the entire image once and then mapping RoI proposals onto this feature map through RoI projection. SPP-Net accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction. Even then, SPP-Net has several drawbacks. The main ones are:

  • Training is a multi-stage pipeline.
  • The fine-tuning algorithm proposed in SPP-Net cannot update the convolution layers that precede the spatial pyramid pooling layer (this is possible in R-CNN).

Fast R-CNN was proposed as a new training algorithm that fixes the disadvantages of R-CNN and SPP-Net. Some of its notable features are:

  • Training is a single-stage pipeline.
  • Higher detection quality (mAP) than R-CNN and SPP-Net.
  • Training can update all layers.
  • No disk storage is required for feature caching.

2. Fast R-CNN Architecture

Fast R-CNN takes an image and a set of object proposals as input. The network passes the image through several convolution and max pooling layers to produce a feature map. The object proposals are then mapped onto the feature map using RoI projection. RoI projection simply finds the coordinates on the feature map that correspond to each region proposal in the original image; I will explain it in detail later.

For each object proposal, a Region of Interest (RoI) pooling layer extracts a fixed-length feature vector and passes it through fully connected layers. These fully connected layers branch into two output layers:

  • One that produces softmax probabilities over K+1 classes (K object classes plus 1 background class).
  • Another that outputs bounding-box coordinates for each of the K classes.
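As a rough sketch of these two heads, here is a minimal NumPy illustration. The layer sizes, random weights, and variable names are my own assumptions for illustration; a real implementation would use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 20    # number of object classes (e.g. PASCAL VOC)
D = 4096  # feature dimension after the fully connected layers (assumed)
feat = rng.standard_normal(D)  # pooled feature vector for one RoI

# Classification head: K + 1 scores (K classes plus background), softmaxed.
W_cls = rng.standard_normal((K + 1, D)) * 0.01
scores = W_cls @ feat
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Regression head: 4 box offsets (tx, ty, tw, th) per class.
W_bbox = rng.standard_normal((4 * K, D)) * 0.01
offsets = (W_bbox @ feat).reshape(K, 4)

print(probs.shape, offsets.shape)   # (21,) (20, 4)
```

Note that the regression head predicts a separate set of 4 offsets for every class, not a single shared box.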

Before explaining further, it is better to cover some concepts you will need to understand.

2.1 ROI projection

In the Fast R-CNN approach, region proposals in the original image are projected onto the final convolution feature map. This projection is used later by RoI pooling.

You may wonder: the input image and the generated feature map have different dimensions, so how do we translate RoI proposals from the image to the feature map?

Don't worry, I will explain it with an example.

For that, you first need to know about the subsampling ratio.

Suppose we have an 18x18 image. After passing through some convolution and max pooling layers, suppose we get a 1x1 feature map. Then we say we have a subsampling ratio of 1/18. It is the ratio of the scale of the output feature map to that of the input image.

Here is one more example. In the figure below, we have an input of size 18x18 and an output feature map of size 3x3. Then we have a subsampling ratio of 3/18 = 1/6.

Now that we understand the subsampling ratio, let us see how it helps in RoI projection. Let our input image be of size 688x920 and the feature map be of size 43x58. We have a region proposal of size 320x128.

sub sampling ratio = 58/920 ≈ 1/16

New bounding box size = (320/16, 128/16) = (20, 8)

New bounding box center = (340/16, 450/16) ≈ (21, 28)

This is how RoI projection is done for region proposals. I think the concept is now clear. Next, we will see how RoI pooling is done.
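The projection arithmetic above can be sketched in a few lines of Python. The `project_roi` helper is my own illustration, not from the paper; it just scales coordinates by the subsampling ratio and rounds to the nearest feature-map cell.

```python
def project_roi(box, subsampling_ratio):
    """Scale a box (any sequence of pixel coordinates) from image space
    to feature-map space and round to the nearest cell."""
    return tuple(round(c * subsampling_ratio) for c in box)

# From the example above: a 688x920 image mapped to a 43x58 feature map
# gives a subsampling ratio of roughly 58/920 = 1/16.
ratio = 1 / 16

# A 320x128 proposal scales to 20x8 on the feature map.
print(320 * ratio, 128 * ratio)        # 20.0 8.0

# The proposal centre (340, 450) lands near (21, 28).
print(project_roi((340, 450), ratio))  # (21, 28)
```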

2.2 ROI pooling

Usually, during the proposal phase, we generate a lot of regions. This is because if an object is not captured by any proposal in the first stage, it cannot be classified in any later stage. We cannot compromise on recall; our network should have high recall, so a large number of proposals must be generated. But this has some disadvantages.

  • Generating a large number of regions of interest can lead to performance problems. This would make real-time object detection difficult to implement.
  • We can’t train all the components of the system in one run.

RoI pooling arises as a solution to this. The RoI pooling layer is simply a special case of the spatial pyramid pooling layer used in SPP-Net, with only one pyramid level. It also speeds up both training and testing. It takes 2 inputs:

  1. A fixed-size feature map produced by a deep convolution network.
  2. An N x 5 matrix representing a list of regions of interest, where N is the number of RoIs. The first column is the image index and the remaining four are the coordinates of the top-left and bottom-right corners of the region.

For every region of interest in the input list, it takes the corresponding region of the feature map and scales it to some predefined size (e.g. 7x7). The scaling is done by:

  1. Dividing the region proposal into equal-sized sections (the number of which is the same as the dimension of the output)
  2. Finding the largest value in each section
  3. Copying these max values to the output buffer

I will explain it through an example. Let our feature map be as follows, with the RoI being the dark square inside the feature map. Here we will reduce the RoI to a size of 2x2.

Note: The size of the region of interest doesn’t have to be perfectly divisible by the number of pooling sections (in this case our RoI is 7×5 and we have 2×2 pooling sections).
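The three steps above can be sketched in NumPy, using the 7×5 RoI and 2×2 output from the note. This is an illustration under my own assumptions (integer RoI coordinates, a single feature channel), not the paper's implementation; note how the section boundaries are uneven because 7 and 5 do not divide by 2.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool the RoI (x1, y1, x2, y2) on the feature map down to
    out_size, splitting it into roughly equal sections."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    out_h, out_w = out_size
    # Section boundaries; the RoI need not divide evenly into sections.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty(out_size)
    for i in range(out_h):
        for j in range(out_w):
            # Largest value in each section is copied to the output.
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

# An 8x8 feature map with a 7x5 RoI (height 7, width 5) pooled to 2x2.
fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_max_pool(fmap, (0, 0, 5, 7)))
# [[17. 20.]
#  [49. 52.]]
```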

3. Training

Training all network weights with backpropagation is an important capability of Fast R-CNN. Before getting into it, it is good to ask:

why is SPP-Net unable to update weights below the spatial pyramid pooling layer?

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPP-Net networks are trained.

Fast R-CNN proposes a more efficient way: during training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPP-Net strategy).
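A toy sketch of this hierarchical sampling scheme (the function, names, and data are made up for illustration):

```python
import random

def sample_minibatch(images, rois_per_image, N=2, R=128):
    """Hierarchical sampling: pick N images, then R/N RoIs from each, so
    all RoIs in a mini-batch share the same convolution computation."""
    batch = []
    for img in random.sample(images, N):
        for roi in random.sample(rois_per_image[img], R // N):
            batch.append((img, roi))
    return batch

# Toy data: 4 images, each with 200 candidate RoIs (indices 0..199).
images = [f"img{i}" for i in range(4)]
rois = {img: list(range(200)) for img in images}

batch = sample_minibatch(images, rois, N=2, R=128)
print(len(batch), len({img for img, _ in batch}))   # 128 2
```

The speed-up comes from the second printed number: the 128 RoIs in the batch come from only 2 images, so only 2 ConvNet forward passes are needed instead of 128.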

3.1 Multi-task Loss:

The Fast R-CNN architecture has two output layers: one predicting the probability of K+1 classes, and another that outputs bounding-box regression offsets,

t^k = (t^k_x, t^k_y, t^k_w, t^k_h)

where t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.

The multi-task loss on each labeled RoI is

L(p, u, t^u, v) = Lcls(p, u) + λ[u ≥ 1] Lloc(t^u, v)

where:

Lcls(p, u) = −log p_u is the log loss for the true class u.
Lloc is the loss for the bounding box. We use a smooth L1 loss rather than the L2 loss used in R-CNN and SPP-Net.
[u ≥ 1] is equal to 1 when u ≥ 1 and 0 otherwise (u = 0 is the background class).
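A minimal NumPy sketch of this multi-task loss. The helper names and example numbers are my own assumptions; λ = 1, and for simplicity Lloc sums the smooth L1 over the four raw offsets.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise.
    Less sensitive to outliers than the L2 loss used by R-CNN/SPP-Net."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def multitask_loss(probs, u, t_u, v, lam=1.0):
    """L = Lcls + lam * [u >= 1] * Lloc, where u = 0 is background."""
    l_cls = -np.log(probs[u])  # log loss for the true class u
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()
    return l_cls + lam * (u >= 1) * l_loc

probs = np.array([0.1, 0.7, 0.2])  # softmax output: background + 2 classes
t_u = [0.5, 0.5, 0.2, 0.1]         # predicted offsets (tx, ty, tw, th)
v = [0.4, 0.3, 0.2, 0.1]           # ground-truth regression targets
print(round(float(multitask_loss(probs, 1, t_u, v)), 4))   # 0.3817
```

For a background RoI (u = 0) the indicator zeroes out Lloc, so only the classification term contributes.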

4. Detection

Once the network is trained, detection is just a forward pass (assuming object proposals are precomputed). The network takes an input image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. R is typically around 2000. When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224² pixels in area.

Now the image is forward passed and class probabilities and bounding-box predictions are obtained. We then perform non-maximum suppression independently for each class, and the best bounding boxes are obtained.
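The non-maximum suppression step can be sketched as a simple greedy loop (this is my own illustrative implementation; the boxes, scores, and threshold are made up):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop any remaining box
    that overlaps it by more than thresh, and repeat."""
    order = np.argsort(scores)[::-1].tolist()
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

# Boxes 0 and 1 overlap heavily; box 2 is separate.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]
```

In Fast R-CNN this loop runs once per class, over that class's boxes and scores only.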

5. Some Further Observations and Evaluations

5.1 SVM vs Soft max for classification

Softmax performs slightly better than SVM.

5.2 Truncated SVD for faster detection

For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection, the number of RoIs to process is large, and nearly half of the forward pass time is spent computing the fully connected layers. Large fully connected layers are easily accelerated by compressing them with truncated SVD.
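A sketch of compressing a fully connected layer with truncated SVD (the layer sizes and rank here are made up for illustration; biases are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical fully connected layer: y = W x, with W of size 256 x 1024.
W = rng.standard_normal((256, 1024))
x = rng.standard_normal(1024)

# Truncated SVD keeps the top t singular values: W ~= U_t diag(s_t) V_t^T.
# The single big layer is replaced by two thin layers.
t = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(s[:t]) @ Vt[:t]   # first thin layer:  t x 1024
W2 = U[:, :t]                  # second thin layer: 256 x t

y_full = W @ x
y_trunc = W2 @ (W1 @ x)

# Parameter count drops from 256*1024 = 262144 to t*(256 + 1024) = 81920.
print(W.size, W1.size + W2.size)   # 262144 81920
```

The forward cost falls in the same proportion as the parameter count, which is why this speeds up detection substantially while reducing mAP only slightly.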

5.3 Training at different scales

Two approaches were used: a brute-force approach and image pyramids.

In the brute-force approach, each image is processed at a predefined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.

During multi-scale training, a pyramid scale is randomly sampled each time an image is sampled, following SPP-Net, as a form of data augmentation. Multi-scale training was tried only for the smaller networks, due to GPU memory limits.

During testing, an input image is tested at 5 different scales, and the following results were obtained for VOC07.

5.4 Fine-tuning layers

For SPP-Net and R-CNN, fine-tuning only the fully connected layers appeared to be sufficient for good accuracy. This was tested for Fast R-CNN as well: the thirteen convolution layers were frozen and only the fully connected layers were allowed to learn. But this reduced mAP from 66.9% to 61.4%.

Does this mean that all convolution layers should be fine-tuned?

In short, no. In the smaller networks (S and M), conv1 was found to be generic and task-independent: whether or not we fine-tune conv1, performance stays the same. For VGG16, it was found necessary to update only the layers from conv3_1 and up (9 of the 13 conv layers). This improves the mAP.

5.5 Region proposals

Are more proposals always better?

VOC07 test mAP and avg recall

We find that mAP rises and then falls slightly as the proposal count increases. Increasing the number of proposals does not always increase mAP.

6. Main Results

Three main results support this paper’s contributions:

1. State-of-the-art mAP on VOC07, 2010, and 2012.

2. Fast training and testing compared to R-CNN and SPP-Net.

3. Fine-tuning convolution layers in VGG16 improves mAP.
