Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks

Reference: Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

0. Abstract

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. Code is available at

1. Introduction

  • Since the region proposal step is too slow, we propose Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks. On top of these conv features, we construct RPNs by adding two additional conv layers: one that encodes each conv map position into a short (e.g., 256-d) feature vector, and a second that, at each conv map position, outputs an objectness score and regressed bounds for k region proposals relative to various scales and aspect ratios at that location (k = 9 is a typical value).
  • We propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed.
  • Code has been made publicly available at (in MATLAB) and (in Python).

2. Related Work

  • OverFeat
  • MultiBox
  • SPPnet
  • Fast R-CNN

3. Region Proposal Networks

  • A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score (“Objectness” measures membership to a set of object classes vs. background).
  • ReLUs are applied to the output of the n x n conv layer.
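
As a rough illustration of the head described above, the sketch below applies it at a single sliding-window position in plain Python. It assumes the n x n conv acts as an affine map on the flattened window, followed by ReLU and two sibling output layers producing 2k scores and 4k box offsets. The helper names (`init_params`, `rpn_head_at_position`) and the tiny dimensions are hypothetical and chosen only for the example; the Gaussian initialization with standard deviation 0.01 follows what the paper reports for new layers.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Affine map; a 1x1 conv is this map applied at every position."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def init_params(in_dim, mid_dim=256, k=9, seed=0):
    """Hypothetical helper: zero-mean Gaussian weights, zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.01) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "W_mid": mat(mid_dim, in_dim), "b_mid": [0.0] * mid_dim,
        "W_cls": mat(2 * k, mid_dim), "b_cls": [0.0] * (2 * k),
        "W_reg": mat(4 * k, mid_dim), "b_reg": [0.0] * (4 * k),
    }


def rpn_head_at_position(window, params):
    """window: flattened n*n*C patch of the shared conv feature map."""
    feat = relu(linear(params["W_mid"], params["b_mid"], window))
    cls_scores = linear(params["W_cls"], params["b_cls"], feat)   # 2k values
    box_offsets = linear(params["W_reg"], params["b_reg"], feat)  # 4k values
    return cls_scores, box_offsets


# Toy sizes (3x3 window, 4 input channels, 8-d feature) to keep it fast.
params = init_params(in_dim=3 * 3 * 4, mid_dim=8, k=9)
cls_scores, box_offsets = rpn_head_at_position([0.5] * 36, params)
print(len(cls_scores), len(box_offsets))  # 18 36
```

In a real network the same weights are applied at every position of the feature map, which is what makes the head fully convolutional.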

Translation-Invariant Anchors

  • The cls layer outputs 2k scores that estimate the probability of object / not-object for each proposal.
  • The k proposals are parameterized relative to k reference boxes, called anchors. Each anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio. We use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position.
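
The anchor layout above can be sketched as follows. The concrete values are taken from the paper rather than from these notes: box areas of 128^2, 256^2, and 512^2 pixels and aspect ratios of 1:2, 1:1, and 2:1. The function name `make_anchors` and the (x1, y1, x2, y2) box convention are assumptions made for the example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes centered at (cx, cy).

    Each box is (x1, y1, x2, y2). A ratio is height / width, and every
    anchor of a given scale keeps the same area, scale ** 2.
    """
    anchors = []
    for s in scales:
        area = float(s * s)
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
print(len(anchors))  # 9

# Translation invariance: moving the window center moves every anchor
# by the same offset and changes nothing else.
shifted = make_anchors(16.0, 0.0)
```

Because the same nine shapes are reused at every sliding position, a shifted object meets an identically shaped anchor at its new location, which is the property the section title refers to.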

A Loss Function for Learning Region Proposals

  • We assign a positive label to two kinds of anchors: (i) the anchor or anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
  • We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes.
  • Anchors that are neither positive nor negative do not contribute to the training objective.
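
The labeling rules above can be sketched in a few lines of plain Python. The encoding of labels (1 = positive, 0 = negative, -1 = ignored) and the function names are assumptions made for this example, and boxes are taken to be (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = [-1] * len(anchors)
    for i, row in enumerate(ious):
        best = max(row, default=0.0)
        if best < neg_thresh:      # below 0.3 for all ground-truth boxes
            labels[i] = 0
        if best > pos_thresh:      # rule (ii): above 0.7 with any box
            labels[i] = 1
    # Rule (i): the best-overlapping anchor(s) for each ground-truth box
    # are positive even when their IoU does not exceed 0.7.
    for j in range(len(gt_boxes)):
        col_best = max((row[j] for row in ious), default=0.0)
        if col_best > 0.0:
            for i, row in enumerate(ious):
                if row[j] == col_best:
                    labels[i] = 1
    return labels


anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (100, 100, 110, 110), (0, 0, 10, 8)]
print(label_anchors(anchors, [(0, 0, 10, 10)]))  # [1, -1, 0, 1]
```

Rule (i) is applied last so that every ground-truth box with any overlap gets at least one positive anchor, even if no anchor reaches the 0.7 threshold.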


  • We randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1 (to avoid a bias towards negative samples, since they dominate).
  • Parameter initialization and selection
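
A minimal sketch of the mini-batch sampling, assuming labels are encoded as 1 (positive), 0 (negative), and -1 (ignored). That encoding and the name `sample_minibatch` are hypothetical. Padding with negatives when an image has fewer than 128 positives follows the paper's description.

```python
import random


def sample_minibatch(labels, batch_size=256, rng=None):
    """Pick anchor indices for one image's RPN mini-batch.

    At most half of the batch is positive; if there are too few
    positives, the remainder is filled with negatives.
    """
    rng = rng or random.Random(0)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)


# 10 positives, 1000 negatives, 50 ignored anchors.
labels = [1] * 10 + [0] * 1000 + [-1] * 50
pos_idx, neg_idx = sample_minibatch(labels)
print(len(pos_idx), len(neg_idx))  # 10 246
```

Ignored anchors (label -1) are never sampled, which matches the statement that they do not contribute to the training objective.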

Sharing Convolutional Features for Region Proposal and Object Detection

  • For the detection network, we adopt Fast R-CNN. Since Fast R-CNN training depends on fixed object proposals, it is not clear a priori that training RPN and Fast R-CNN jointly in a single network would converge, so an alternating scheme is used instead.
  • 4-step training algorithm: (1) train RPN with an ImageNet-pre-trained model; (2) train a separate detection network with Fast R-CNN using the proposals generated by the step-1 RPN; (3) use the detector network to initialize RPN training, fixing the shared conv layers and fine-tuning only the layers unique to RPN; (4) keeping the shared conv layers fixed, fine-tune the fc layers of Fast R-CNN.
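
One way to keep the four steps straight is to write them down as data. The sketch below is only a bookkeeping aid: the parameter-group names (`rpn_conv`, `detector_fc`, and so on) are hypothetical labels, not identifiers from the released code. The initialization of step 2 from an ImageNet-pre-trained model follows the paper, and the use of step-3 proposals in step 4 is an assumption.

```python
# Each entry records what is trained, where the weights come from,
# and which parameter groups receive gradient updates.
SCHEDULE = [
    {"step": 1, "train": "RPN",
     "init_from": "ImageNet-pre-trained model",
     "updates": ["rpn_conv", "rpn_layers"]},
    {"step": 2, "train": "Fast R-CNN",
     "init_from": "ImageNet-pre-trained model",
     "proposals_from_step": 1,
     "updates": ["detector_conv", "detector_fc"]},
    {"step": 3, "train": "RPN",
     "init_from": "step-2 detector",
     "updates": ["rpn_layers"]},           # shared conv layers are fixed
    {"step": 4, "train": "Fast R-CNN",
     "init_from": "step-3 network",
     "proposals_from_step": 3,             # assumption, see lead-in
     "updates": ["detector_fc"]},          # shared conv layers are fixed
]


def conv_layers_frozen(step):
    """True when no conv parameter group is updated in the given step."""
    entry = next(e for e in SCHEDULE if e["step"] == step)
    return not any(name.endswith("_conv") for name in entry["updates"])


for entry in SCHEDULE:
    print(entry["step"], entry["train"], conv_layers_frozen(entry["step"]))
```

The conv layers become shared only from step 3 onward: steps 1 and 2 fine-tune two separate copies, and steps 3 and 4 freeze the step-2 copy so that both networks end up using the same features.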

Implementation Details

  • To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals.
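
Greedy NMS over scored proposals can be sketched as below. The IoU threshold of 0.7 is the value the paper reports for this step. The function names, the optional `top_n` cut-off, and the (x1, y1, x2, y2) box format are assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.7, top_n=None):
    """Return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already kept, higher-scoring
    box by more than iou_thresh.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if top_n is not None and len(keep) == top_n:
                break
    return keep


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of 0.9, so it is suppressed; the third box does not overlap either and is kept. Passing `top_n` keeps only the highest-ranked survivors, which mirrors using a fixed number of top proposals for detection.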

4. Experiments

Ablation Experiments

Detection Accuracy and Running Time of VGG-16

Analysis of Recall-to-IoU

One-Stage Detection vs. Two-Stage Proposal + Detection

Written on December 10, 2017