Fast R-CNN
Reference: Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015.
0. Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
1. Introduction
 We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
1.1 R-CNN and SPPnet
 The Region-based Convolutional Network method (R-CNN) has notable drawbacks: (1) training is a multi-stage pipeline; (2) training is expensive in space and time; (3) object detection is slow.
 Although SPPnet speeds up R-CNN by sharing computation, it still requires training in a multi-stage pipeline, and its fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling layer, which limits accuracy for very deep networks.
1.2 Contributions
 The Fast R-CNN method has several advantages: (1) higher detection quality (mAP) than R-CNN and SPPnet; (2) training is single-stage, using a multi-task loss; (3) training can update all network layers; (4) no disk storage is required for feature caching.
2. Fast R-CNN Architecture and Training
2.1 The RoI Pooling Layer
 The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H x W (e.g., 7 x 7), where H and W are layer hyperparameters that are independent of any particular RoI.
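The layer described above can be sketched in NumPy as follows; this is an illustrative sketch, not the paper's Caffe layer, and the exact sub-window partitioning is an assumption:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool the features inside one RoI into a fixed H x W grid.

    feature_map: (C, H_in, W_in) array of conv features.
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    """
    H, W = output_size
    x1, y1, x2, y2 = roi
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    out = np.empty((feature_map.shape[0], H, W))
    # Divide the RoI into an H x W grid of roughly equal sub-windows
    # and take the max over each sub-window, per channel.
    for i in range(H):
        ys = y1 + int(np.floor(i * roi_h / H))
        ye = y1 + int(np.ceil((i + 1) * roi_h / H))
        for j in range(W):
            xs = x1 + int(np.floor(j * roi_w / W))
            xe = x1 + int(np.ceil((j + 1) * roi_w / W))
            out[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```

Because H and W are fixed, RoIs of any size map to the same output shape, so the pooled features can feed fixed-size fully connected layers.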
2.2 Initializing from Pre-trained Networks
 When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations: (1) the last max pooling layer is replaced by a RoI pooling layer; (2) the network’s last fully connected layer and softmax are replaced with the two sibling layers described earlier; (3) the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
2.3 Fine-tuning for Detection
 In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. RoIs from the same image share computation and memory in the forward and backward passes.
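The hierarchical sampling scheme (e.g., N = 2 and R = 128, as in the paper) can be sketched as follows; the `dataset` structure is a hypothetical stand-in:

```python
import random

def sample_minibatch(dataset, N=2, R=128):
    """Hierarchical sampling: pick N images, then R/N RoIs from each.

    `dataset` is assumed to be a list of (image, rois) pairs; this is an
    illustrative sketch of the sampling scheme, not the paper's code.
    """
    rois_per_image = R // N
    images = random.sample(dataset, N)
    batch = []
    for image, rois in images:
        # All RoIs chosen from one image share that image's conv features.
        chosen = random.sample(rois, min(rois_per_image, len(rois)))
        batch.append((image, chosen))
    return batch
```

Sampling many RoIs per image is what makes training fast: one forward pass through the conv layers serves R/N RoIs.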
Multi-task Loss
 We normalize the ground-truth regression targets to have zero mean and unit variance. All experiments use λ = 1, the hyperparameter balancing the classification and localization losses.
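The paper's multi-task loss, L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v) with a smooth-L1 localization loss, can be sketched in NumPy (an illustrative version, not the paper's Caffe code):

```python
import numpy as np

def smooth_l1(x):
    """Robust smooth-L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Combined classification + localization loss for one RoI.

    p: predicted class probabilities; u: true class (0 = background);
    t_u, v: predicted and target box offsets (x, y, w, h).
    """
    l_cls = -np.log(p[u])  # log loss for the true class
    # Localization loss only for foreground RoIs (u >= 1).
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```

The smooth-L1 term is less sensitive to outliers than L2 and avoids the exploding gradients an unbounded regression loss can cause.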
Mini-batch Sampling
 We take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5, which comprise the examples labeled with a foreground object class, i.e. u ≥ 1.
 The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), which are the background examples and are labeled with u = 0.
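The labeling rule above can be sketched as follows (the function name and input structure are hypothetical):

```python
def label_rois(max_ious, classes, fg_thresh=0.5, bg_lo=0.1):
    """Assign training labels to proposals by their max IoU with ground truth.

    max_ious[i]: proposal i's highest IoU over all ground-truth boxes;
    classes[i]: class of that best-matching ground-truth box.
    """
    labels = []
    for iou, cls in zip(max_ious, classes):
        if iou >= fg_thresh:
            labels.append(cls)    # foreground: u >= 1
        elif bg_lo <= iou < fg_thresh:
            labels.append(0)      # background: u = 0
        else:
            labels.append(None)   # IoU < 0.1: excluded from the mini-batch
    return labels
```

The lower 0.1 threshold acts as a heuristic for hard example mining: near-zero-overlap proposals are too easy to be useful as background examples.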
Backpropagation through RoI Pooling Layer
 The backward pass routes each output unit's gradient to the input feature that was the argmax for its pooling sub-window, accumulating over all RoIs and sub-windows that selected that input.
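Assuming the forward pass records the argmax location of each pooling sub-window, the backward pass can be sketched as:

```python
import numpy as np

def roi_pool_backward(grad_out, argmax_idx, input_shape):
    """Route each output unit's gradient to the input cell that won the max.

    grad_out: (C, H, W) gradient w.r.t. the pooled output;
    argmax_idx: (C, H, W) flat indices (into each channel's H_in*W_in plane)
    recorded during the forward pass. A sketch of the argmax routing.
    """
    C = input_shape[0]
    grad_in = np.zeros(input_shape)
    flat = grad_in.reshape(C, -1)
    for c in range(C):
        # Accumulate: several output units may share one argmax cell.
        np.add.at(flat[c], argmax_idx[c].ravel(), grad_out[c].ravel())
    return grad_in
```

Because gradients flow only through the selected maxima, the conv layers below the RoI pooling layer can be fine-tuned end to end.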
SGD Hyperparameters
 The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0.
 All layers use a per-layer learning rate multiplier of 1 for weights and 2 for biases, with a global learning rate of 0.001.
 A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
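The head initializations can be sketched as follows; the layer sizes (4096 inputs, 21 classes and 84 box outputs for PASCAL VOC with VGG16) are assumptions, not stated in this summary:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fc(fan_in, fan_out, std):
    """Zero-mean Gaussian weights with the given std; zero biases."""
    W = rng.normal(0.0, std, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

# Classification head: std 0.01; bounding-box head: std 0.001 (per the paper).
W_cls, b_cls = init_fc(4096, 21, 0.01)
W_bbox, b_bbox = init_fc(4096, 84, 0.001)
```

The smaller std for the bounding-box head keeps the initial regression outputs close to zero, matching the normalized regression targets.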
2.4 Scale Invariance
 Brute-force learning: process each image at a single predefined scale during training and testing, forcing the network to learn scale-invariant features directly
 Image pyramids: provide approximate scale invariance by processing each image at multiple scales
3. Fast R-CNN Detection
3.1 Truncated SVD for Faster Detection
 For detection, the number of RoIs to process is large, and nearly half of the forward-pass time is spent computing the fully connected layers.
 Large fully connected layers are easily accelerated by compressing them with truncated SVD.
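The truncated-SVD compression can be sketched in NumPy: a u x v weight matrix W is factored as W ≈ U_t Σ_t V_t^T, so one fully connected layer becomes two smaller ones with t(u + v) parameters instead of uv:

```python
import numpy as np

def truncate_fc(W, t):
    """Compress a fully connected layer's u x v weight matrix W via SVD.

    Returns (W1, W2) such that W2 @ W1 approximates W:
    W1 = diag(s_t) @ V_t^T is a t x v layer (no nonlinearity between them),
    W2 = U_t is a u x t layer.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(s[:t]) @ Vt[:t]   # first layer: t x v
    W2 = U[:, :t]                  # second layer: u x t
    return W1, W2
```

Usage: replace y = W @ x with y ≈ W2 @ (W1 @ x); for small t this gives a large speedup, since the RoI-specific fully connected work dominates detection time.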
4. Main Results
 State-of-the-art mAP on VOC07, 2010, and 2012
 Fast training and testing compared to R-CNN and SPPnet
 Fine-tuning conv layers in VGG16 improves mAP
 Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression.
 Training through the RoI pooling layer is important for very deep nets.
5. Design Evaluation
 Multi-task training has a consistent positive effect.
 Single-scale detection performs almost as well as multi-scale detection, at lower computational cost.
 Sparse object proposals appear to improve detector quality.