Fast R-CNN

Reference: Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

0. Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at

1. Introduction

  • We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

1.1 R-CNN and SPPnet

  • The Region-based Convolutional Network method (R-CNN) has notable drawbacks: (1) training is a multi-stage pipeline; (2) training is expensive in space and time; (3) object detection is slow.
  • Although SPPnet speeds up R-CNN by sharing computation, its training is still a multi-stage pipeline, and its fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling layer.

1.2 Contributions

  • The Fast R-CNN method has several advantages: (1) higher detection quality (mAP) than R-CNN and SPPnet; (2) training is single-stage, using a multi-task loss; (3) training can update all network layers; (4) no disk storage is required for feature caching.

2. Fast R-CNN Architecture and Training

2.1 The RoI Pooling Layer

  • The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H x W (e.g., 7 x 7), where H and W are layer hyper-parameters that are independent of any particular RoI.
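
The pooling step can be sketched for a single feature channel as follows. This is a minimal illustration, not the paper's Caffe implementation: the function name `roi_max_pool`, the (top, left, height, width) RoI format, and the floor/ceil rounding of bin edges are assumptions made for the example.

```python
import math

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (a single channel).
    roi: (top, left, height, width) in feature-map cells (assumed format).
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [r0, r1); floor/ceil keeps every bin non-empty.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# A 4 x 6 feature map whose cell (r, c) holds r * 6 + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 6), 2, 2))  # [[8, 11], [20, 23]]
```

Whatever the RoI size, the output is always out_h x out_w, which is what lets RoIs of different shapes feed the same fully connected layers.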

2.2 Initializing from Pre-trained Networks

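  • When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations: (1) the last max pooling layer is replaced by an RoI pooling layer whose H and W are set to be compatible with the network’s first fully connected layer (e.g., H = W = 7 for VGG16); (2) the network’s last fully connected layer and softmax are replaced with two sibling layers: a fully connected layer with softmax over K + 1 categories, and category-specific bounding-box regressors; (3) the network is modified to take two data inputs: a list of images and a list of RoIs in those images.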

2.3 Fine-tuning for Detection

  • In Fast R-CNN training, stochastic gradient descent (SGD) minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image (e.g., N = 2 and R = 128 gives 64 RoIs per image). RoIs from the same image share computation and memory in the forward and backward passes.

Multi-task Loss

  • We normalize the ground-truth regression targets to have zero mean and unit variance. All experiments use λ = 1, where λ balances the classification loss against the bounding-box regression loss.
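
For reference, the paper's multi-task loss sums a log loss over the true class u and a smooth L1 localization loss that is counted only for foreground RoIs (u ≥ 1), weighted by λ. A minimal sketch for one RoI follows; the function names and argument layout are assumptions made for the example, not code from the paper.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Loss for one RoI.

    p:   softmax probabilities over K + 1 classes (index 0 = background)
    u:   ground-truth class label
    t_u: predicted box offsets (x, y, w, h) for class u
    v:   ground-truth regression targets (x, y, w, h)
    lam: balancing weight between the two terms
    """
    cls_loss = -math.log(p[u])
    # The localization term is ignored for background RoIs (u == 0).
    loc_loss = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return cls_loss + lam * loc_loss
```

With u = 0 the result is the classification loss alone, whatever the box offsets are.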

Mini-batch Sampling

  • We take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1.
  • The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), which are the background examples and are labeled with u = 0.
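
The foreground/background split above can be sketched for a single image as below. This is an illustrative sketch: the helper names, the (x1, y1, x2, y2) box format with continuous coordinates, and the handling of images with too few candidates are assumptions. The default of 64 RoIs per image corresponds to R/N with the paper's N = 2 and R = 128.

```python
import random

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def sample_rois(proposals, gt_boxes, rois_per_image=64, fg_fraction=0.25, rng=random):
    """Split proposals by max IoU with ground truth, then sample both groups."""
    fg, bg = [], []
    for box in proposals:
        best = max((iou(box, g) for g in gt_boxes), default=0.0)
        if best >= 0.5:
            fg.append(box)   # foreground example, u >= 1
        elif best >= 0.1:
            bg.append(box)   # background example, u = 0
        # Proposals with max IoU below 0.1 are not sampled.
    # Assumption: if a group is too small, take what is available.
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)
```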

Back-propagation through RoI Pooling Layer

SGD Hyper-parameters

  • The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0.
  • All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001.
  • A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
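
To make the interaction of these hyper-parameters concrete, the sketch below applies one momentum-SGD step with a per-layer multiplier, so weights effectively train at 0.001 x 1 and biases at 0.001 x 2. The update form (weight decay folded into the gradient, velocity added to the parameter) is one common formulation and is an assumption here, as is the function name `sgd_step`.

```python
def sgd_step(params, grads, velocity, lr_mult,
             base_lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD update over flat lists of parameters.

    lr_mult is the per-layer multiplier: 1 for weights, 2 for biases.
    Returns the updated parameters and the updated velocity.
    """
    local_lr = base_lr * lr_mult
    new_params, new_velocity = [], []
    for w, g, vel in zip(params, grads, velocity):
        # Assumed form: L2 decay is added to the gradient before the step.
        vel = momentum * vel - local_lr * (g + weight_decay * w)
        new_velocity.append(vel)
        new_params.append(w + vel)
    return new_params, new_velocity

weights, w_vel = sgd_step([1.0], [2.0], [0.0], lr_mult=1)
biases, b_vel = sgd_step([1.0], [2.0], [0.0], lr_mult=2)
```

With identical gradients, the bias moves twice as far as the weight on the first step.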

2.4 Scale Invariance

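  • Brute-force learning: each image is processed at a pre-defined pixel size during both training and testing, so the network must learn scale-invariant detection directly from the training data.
  • Image pyramids: the network is given approximate scale invariance through an image pyramid, which is used at test time to approximately scale-normalize each object proposal.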
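
For the brute-force (single-scale) option, the pre-defined size is set by the image's shortest side. The sketch below computes the resize factor, assuming the defaults reported for the paper's single-scale experiments (shortest side s = 600 pixels, longest side capped at 1000 pixels); the function name is hypothetical.

```python
def single_scale_factor(height, width, target_short=600, max_long=1000):
    """Resize factor that brings the shortest side to target_short,
    reduced if necessary so the longest side does not exceed max_long."""
    scale = target_short / min(height, width)
    if scale * max(height, width) > max_long:
        scale = max_long / max(height, width)
    return scale

print(single_scale_factor(375, 500))  # 1.6
```

For elongated images the cap takes over, so the shortest side can end up smaller than 600 pixels.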

3. Fast R-CNN Detection

3.1 Truncated SVD for Faster Detection

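  • For detection, the number of RoIs to process is large, and nearly half of the forward pass time is spent computing the fully connected layers.
  • Large fully connected layers are easily accelerated by compressing them with truncated SVD: a u x v weight matrix W is approximated as W ≈ U Σ_t V^T using its top t singular values, which reduces the parameter count from uv to t(u + v).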
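
The saving from truncated SVD can be checked with simple arithmetic. A u x v weight matrix factorized with its top t singular values is replaced by two smaller layers, one holding Σ_t V^T and one holding U, so the parameter count drops from uv to t(u + v). The sketch below only counts parameters. The example sizes (a 4096 x 25088 layer with t = 1024) follow what the paper describes for VGG16's fc6 layer and should be treated as illustrative.

```python
def svd_param_counts(u, v, t):
    """Parameter counts of a u x v layer before and after truncated SVD.

    After compression the layer becomes two layers: t x v (Sigma_t V^T)
    followed by u x t (U), for t * (u + v) weights in total.
    """
    full = u * v
    compressed = t * (u + v)
    return full, compressed

full, compressed = svd_param_counts(4096, 25088, 1024)
print(full, compressed)  # 102760448 29884416
```

Compression pays off only when t is much smaller than min(u, v); here the weight count falls to roughly 29% of the original.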

4. Main Results

  • State-of-the-art mAP on VOC07, 2010, and 2012
  • Fast training and testing compared to R-CNN, SPPnet
  • Fine-tuning conv layers in VGG16 improves mAP

  • Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.
  • Training through the RoI pooling layer is important for very deep nets.

5. Design Evaluation

  • Multi-task training has a consistent positive effect on classification accuracy compared with training for classification alone.
  • Single-scale detection performs almost as well as multi-scale detection with less computation.
  • Sparse object proposals appear to improve detector quality.
Written on December 9, 2017