Fast R-CNN
Reference: Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
- We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
1.1 R-CNN and SPPnet
- The Region-based Convolutional Network method (R-CNN) has notable drawbacks: (1) training is a multi-stage pipeline; (2) training is expensive in space and time; (3) object detection is slow.
- Although SPPnet speeds up R-CNN by sharing convolutional computation, it still trains in a multi-stage pipeline, and its fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling layer.
- The Fast R-CNN method has several advantages: (1) higher detection quality (mAP) than R-CNN and SPPnet; (2) training is single-stage, using a multi-task loss; (3) training can update all network layers; (4) no disk storage is required for feature caching.
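The multi-task loss mentioned above combines a log loss over classes with a robust smooth L1 loss over bounding-box offsets. A minimal NumPy sketch of the smooth L1 term (less sensitive to outliers than an L2 loss):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for the bounding-box regression term:
    quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)
```

In training this is summed over the four box coordinates for foreground RoIs only.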
2. Fast R-CNN Architecture and Training
2.1 The RoI Pooling Layer
- The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H x W (e.g., 7 x 7), where H and W are layer hyper-parameters that are independent of any particular RoI.
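The operation is easy to sketch: divide each RoI into an H x W grid of sub-windows and max-pool within each one. A minimal NumPy version for a single RoI (function name and coordinate convention are illustrative, not the paper's code):

```python
import numpy as np

def roi_max_pool(feat, roi, H=7, W=7):
    """Max-pool the features inside one RoI to a fixed H x W grid.

    feat: (C, Hf, Wf) conv feature map
    roi:  (x1, y1, x2, y2) in feature-map coordinates (inclusive)
    """
    C = feat.shape[0]
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    out = np.empty((C, H, W))
    for i in range(H):
        for j in range(W):
            # sub-window bounds, rounded so the H x W bins tile the RoI
            ys = y1 + int(np.floor(i * roi_h / H))
            ye = y1 + int(np.ceil((i + 1) * roi_h / H))
            xs = x1 + int(np.floor(j * roi_w / W))
            xe = x1 + int(np.ceil((j + 1) * roi_w / W))
            out[:, i, j] = feat[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```

Because H and W are fixed, RoIs of any size produce a feature map the fully connected layers can consume.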
2.2 Initializing from Pre-trained Networks
- When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations: (1) the last max pooling layer is replaced by a RoI pooling layer; (2) the network's last fully connected layer and softmax are replaced with the two sibling layers described earlier; (3) the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
2.3 Fine-tuning for Detection
- In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs from each image. RoIs from the same image share computation and memory in the forward and backward passes.
- We normalize the ground-truth regression targets to have zero mean and unit variance. All experiments use a multi-task loss weight of λ = 1.
- We take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5; these comprise the examples labeled with a foreground object class, i.e. u ≥ 1.
- The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), which are the background examples and are labeled with u = 0.
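The sampling scheme in the bullets above can be sketched as follows (function name, the `R` and `fg_frac` defaults of 64 and 0.25, and the placeholder labels are illustrative; a real implementation would assign the matched ground-truth class to each foreground RoI):

```python
import numpy as np

def sample_rois(overlaps, R=64, fg_frac=0.25, rng=None):
    """Sample R RoIs from one image: ~25% foreground (IoU >= 0.5),
    the rest background (max IoU in [0.1, 0.5), labeled u = 0).

    overlaps: max IoU of each proposal with any ground-truth box.
    """
    if rng is None:
        rng = np.random.default_rng()
    fg_idx = np.where(overlaps >= 0.5)[0]
    bg_idx = np.where((overlaps >= 0.1) & (overlaps < 0.5))[0]
    n_fg = min(int(R * fg_frac), len(fg_idx))
    fg = rng.choice(fg_idx, n_fg, replace=False)
    bg = rng.choice(bg_idx, R - n_fg, replace=(len(bg_idx) < R - n_fg))
    idx = np.concatenate([fg, bg])
    labels = np.concatenate([np.ones(n_fg), np.zeros(R - n_fg)])
    return idx, labels
```

Proposals with IoU below 0.1 are ignored, which the paper relates to hard example mining.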
Back-propagation through RoI Pooling Layer
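- In the backward pass, each RoI output unit routes its gradient only to the input activation that was the argmax of its sub-window:

$$\frac{\partial L}{\partial x_i} \;=\; \sum_{r}\sum_{j}\big[\, i = i^{*}(r,j) \,\big]\,\frac{\partial L}{\partial y_{rj}}$$

where $y_{rj}$ is the $j$-th output of RoI $r$, $i^{*}(r,j)$ is the index of the max-selected input, and $[\cdot]$ is the indicator function; gradients accumulate over all RoIs that selected input $x_i$.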
- The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0.
- All layers use per-layer learning-rate multipliers of 1 for weights and 2 for biases, with a global learning rate of 0.001.
- A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
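Since the reference implementation uses Caffe, these hyper-parameters map onto a solver configuration roughly as follows (a sketch, using standard Caffe solver fields with the values from the bullets above; the per-layer multipliers would appear as `lr_mult: 1` / `lr_mult: 2` in the layer definitions):

```
base_lr: 0.001        # global learning rate
momentum: 0.9
weight_decay: 0.0005  # applied to weights and biases
```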
2.4 Scale Invariance
- Brute-force learning: each image is processed at a single pre-defined pixel size during training and testing
- Using image pyramids
3. Fast R-CNN Detection
3.1 Truncated SVD for Faster Detection
- For detection, the number of RoIs to process is large, and nearly half of the forward-pass time is spent computing the fully connected layers.
- Large fully connected layers are easily accelerated by compressing them with truncated SVD.
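The idea: factor a u x v weight matrix W ≈ U Σ_t V^T of rank t, replacing one fully connected layer with two smaller ones, so the parameter count drops from uv to t(u + v). A NumPy sketch (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def compress_fc(W, t):
    """Approximate FC weight matrix W (u x v) by rank-t truncated SVD,
    splitting it into two stacked layers with no nonlinearity between."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = np.diag(s[:t]) @ Vt[:t]   # t x v : first new FC layer
    second = U[:, :t]                 # u x t : second new FC layer
    return first, second
```

The two new layers compute `second @ (first @ x)`, and no additional fine-tuning is needed after compression.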
4. Main Results
- State-of-the-art mAP on VOC07, 2010, and 2012
- Fast training and testing compared to R-CNN, SPPnet
- Fine-tuning conv layers in VGG16 improves mAP
- Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.
- Training through the RoI pooling layer is important for very deep nets.
5. Design Evaluation
- Multi-task training has a consistently positive effect.
- Single-scale detection performs almost as well as multi-scale detection with less computation.
- Sparse object proposals appear to improve detector quality.