# Fast R-CNN

Reference: Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

## 0. Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

## 1. Introduction

• We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

### 1.1 R-CNN and SPPnet

• The Region-based Convolutional Network method (R-CNN) has notable drawbacks: (1) training is a multi-stage pipeline; (2) training is expensive in space and time; (3) object detection is slow.
• Although SPPnet speeds up R-CNN by sharing convolutional computation, it still trains in a multi-stage pipeline, and its fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling layer, which limits the accuracy of very deep networks.

### 1.2 Contributions

• The Fast R-CNN method has several advantages: (1) higher detection quality (mAP) than R-CNN and SPPnet; (2) training is single-stage, using a multi-task loss; (3) training can update all network layers; (4) no disk storage is required for feature caching.

## 2. Fast R-CNN Architecture and Training

### 2.1 The RoI Pooling Layer

• The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H x W (e.g., 7 x 7), where H and W are layer hyper-parameters that are independent of any particular RoI.
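As a concrete illustration, pooling a single RoI can be sketched in NumPy: the RoI window is divided into an H x W grid of sub-windows and each cell is max-pooled over its sub-window. This is a simplified sketch (function and variable names are illustrative, not from the paper's implementation), and it omits the argmax bookkeeping the real layer keeps for back-propagation:

```python
import numpy as np

def roi_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool the features inside one RoI into a fixed out_h x out_w grid.

    feature_map: (C, H, W) array of conv features.
    roi: (x1, y1, x2, y2) in feature-map coordinates, inclusive.
    """
    x1, y1, x2, y2 = roi
    roi_feat = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    c, h, w = roi_feat.shape
    # Split the RoI into a roughly equal out_h x out_w grid of sub-windows.
    y_edges = np.linspace(0, h, out_h + 1).astype(int)
    x_edges = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Guarantee every sub-window is non-empty for small RoIs.
            ys, ye = y_edges[i], max(y_edges[i + 1], y_edges[i] + 1)
            xs, xe = x_edges[j], max(x_edges[j + 1], x_edges[j] + 1)
            out[:, i, j] = roi_feat[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```

Note that the output size is always (C, H, W) regardless of the RoI's shape, which is what lets arbitrarily sized proposals feed fixed-size fully connected layers.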

### 2.2 Initializing from Pre-trained Networks

• When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations: (1) the last max pooling layer is replaced by a RoI pooling layer; (2) the network's last fully connected layer and softmax are replaced with two sibling output layers: a softmax classifier over K object classes plus background, and class-specific bounding-box regressors; (3) the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
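A minimal sketch of the two sibling output layers, assuming a per-RoI feature matrix `fc7` of dimension D and K object classes (the weight shapes and names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sibling_heads(fc7, W_cls, b_cls, W_bbox, b_bbox):
    """The two sibling outputs: per-RoI class posteriors over K classes
    plus background, and 4 box-regression offsets per object class.

    fc7:    (R, D) per-RoI features.
    W_cls:  (D, K+1), W_bbox: (D, 4K) -- assumed shapes for illustration.
    """
    cls_scores = softmax(fc7 @ W_cls + b_cls)   # (R, K+1)
    bbox_deltas = fc7 @ W_bbox + b_bbox         # (R, 4K)
    return cls_scores, bbox_deltas
```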

### 2.3 Fine-tuning for Detection

• In Fast R-CNN training, stochastic gradient descent (SGD) minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. RoIs from the same image share computation and memory in the forward and backward passes.

• We normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. All experiments use $\lambda = 1$, the hyper-parameter that balances the classification and localization terms of the multi-task loss.
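The loss in question is the paper's multi-task objective $L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^u, v)$, where $L_{cls}(p, u) = -\log p_u$ and $L_{loc}$ sums a smooth $L_1$ penalty over the four box coordinates. A per-RoI sketch in NumPy:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 from the paper: 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x * x, ax - 0.5)

def multi_task_loss(cls_probs, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * sum_i smooth_l1(t_i^u - v_i).

    cls_probs: (K+1,) softmax output p; u: true class (0 = background);
    t_u: (4,) predicted offsets for class u; v: (4,) regression targets.
    """
    l_cls = -np.log(cls_probs[u])
    # The localization term only applies to foreground RoIs (u >= 1).
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```

The smooth $L_1$ form is less sensitive to outliers than the $L_2$ loss used for regression in R-CNN, which is part of why target normalization and $\lambda = 1$ suffice.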

#### Mini-batch Sampling

• We take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5, which comprise the examples labeled with a foreground object class, i.e. $u \ge 1$.
• The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), which are the background examples and are labeled with u = 0.
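The two sampling rules above can be sketched as a function of each proposal's maximum IoU with the ground truth. With the paper's N = 2 and R = 128, `batch_rois` would be 64 per image; foreground labels are simplified to 1 here, whereas the real implementation assigns the matched object class:

```python
import numpy as np

def sample_rois(max_ious, rng, batch_rois=64, fg_fraction=0.25):
    """Pick 25% foreground RoIs (max IoU >= 0.5) and fill the rest with
    background RoIs (max IoU in [0.1, 0.5)), per the paper's sampling rules.

    max_ious: (P,) max IoU of each proposal with any ground-truth box.
    Returns the kept proposal indices and their labels (1 = fg, 0 = bg).
    """
    fg = np.where(max_ious >= 0.5)[0]
    bg = np.where((max_ious >= 0.1) & (max_ious < 0.5))[0]
    n_fg = min(int(batch_rois * fg_fraction), len(fg))
    n_bg = min(batch_rois - n_fg, len(bg))
    keep = np.concatenate([rng.choice(fg, n_fg, replace=False),
                           rng.choice(bg, n_bg, replace=False)])
    labels = np.concatenate([np.ones(n_fg, dtype=int),   # u >= 1 in the paper
                             np.zeros(n_bg, dtype=int)])  # u = 0: background
    return keep, labels
```

The lower threshold of 0.1 acts as a heuristic for hard example mining: proposals with almost no overlap are presumed too easy to be informative.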

#### Back-propagation through the RoI Pooling Layer
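In the paper, the backward pass routes each pooled output's gradient to the single input location that won the max in the forward pass, summing over all RoIs and output cells that selected that location (RoIs from the same image overlap, so one input can win several cells). Assuming the winning (argmax) indices were recorded during the forward pass, the accumulation can be sketched as:

```python
import numpy as np

def roi_pool_backward(grad_out, argmax, num_inputs):
    """Backward pass sketch for RoI max pooling.

    grad_out: (M,) gradients w.r.t. the M pooled output cells.
    argmax:   (M,) flat indices of the input locations that produced each
              max, recorded in the forward pass.
    Gradients from cells sharing a winning location accumulate by summation.
    """
    grad_in = np.zeros(num_inputs)
    np.add.at(grad_in, argmax, grad_out)  # unbuffered in-place accumulation
    return grad_in
```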

#### SGD Hyper-parameters

• The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0.
• All layers use per-layer learning rate multipliers of 1 for weights and 2 for biases, with a global learning rate of 0.001.
• A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
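Under these hyper-parameters, a single SGD update can be sketched in the Caffe-style form (momentum velocity, with weight decay folded into the gradient), which matches the solver the paper's Caffe implementation relies on; the effective learning rate is the global rate times the per-parameter multiplier:

```python
import numpy as np

def sgd_step(param, grad, velocity, lr_mult, base_lr=0.001,
             momentum=0.9, decay=0.0005):
    """One SGD update with the paper's settings: base_lr 0.001, per-layer
    lr multiplier (1 for weights, 2 for biases), momentum 0.9, and
    parameter decay 0.0005 applied to the gradient.
    """
    lr = base_lr * lr_mult
    velocity[:] = momentum * velocity - lr * (grad + decay * param)
    param += velocity  # update in place
    return param, velocity
```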

### 2.4 Scale Invariance

• Brute-force learning: each image is processed at a single pre-defined scale during both training and testing.
• Image pyramids: approximate scale invariance is provided by presenting each image at several scales.

## 3. Fast R-CNN Detection

### 3.1 Truncated SVD for Faster Detection

• For detection, the number of RoIs to process is large, and nearly half of the forward-pass time is spent computing the fully connected layers.
• Large fully connected layers are easily accelerated by compressing them with truncated SVD.
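In this scheme a fully connected layer with weight matrix $W$ (u x v) is factored as $W \approx U_t \Sigma_t V_t^T$ using the top $t$ singular values, and replaced by two smaller layers with no non-linearity between them: the first with weights $\Sigma_t V_t^T$ (and no biases), the second with weights $U_t$ (and $W$'s original biases). The parameter count drops from $uv$ to $t(u + v)$. A sketch:

```python
import numpy as np

def truncated_svd_compress(W, t):
    """Factor a fully connected layer's weights W (u x v) into two smaller
    layers via truncated SVD, keeping the top-t singular values.

    Returns L1 (t x v) and L2 (u x t) such that W ~= L2 @ L1, so the layer
    y = W @ x becomes y ~= L2 @ (L1 @ x).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = np.diag(S[:t]) @ Vt[:t]  # first new layer: Sigma_t V_t^T
    L2 = U[:, :t]                 # second new layer: U_t
    return L1, L2
```

Because the factorization is computed directly from the trained weights, no additional fine-tuning is required after compression, as noted in the results below.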

## 4. Main Results

• State-of-the-art mAP on VOC07, 2010, and 2012
• Fast training and testing compared to R-CNN, SPPnet
• Fine-tuning conv layers in VGG16 improves mAP

• Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.
• Training through the RoI pooling layer is important for very deep nets.

## 5. Design Evaluation

• Multi-task training has a consistent positive effect.
• Single-scale detection performs almost as well as multi-scale detection with less computation.
• Sparse object proposals appear to improve detector quality.

Written on December 9, 2017