DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

Reference: Donahue, Jeff, et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." International Conference on Machine Learning (ICML), 2014.

0. Abstract

We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.

1. Introduction

  • With limited training data, however, fully-supervised deep architectures will generally dramatically overfit the training data.
  • In this paper we investigate semi-supervised multi-task learning of deep convolutional representations, where representations are learned on a set of related problems but applied to new tasks which have too few training examples to learn a full deep representation.
  • Our model can either be considered as a deep architecture for transfer learning based on a supervised pre-training phase, or simply as a new visual feature, DeCAF, defined by the convolutional network weights learned on a set of pre-defined object recognition tasks.

2. Related Work

  • Deep convolutional networks in computer vision
  • Learning from related tasks
  • Transfer learning across tasks using deep representations

3. Deep Convolutional Activation Features

In our approach, a deep convolutional model is first trained in a fully supervised setting using the state-of-the-art method of Krizhevsky et al. (2012). We then extract various features from this network and evaluate their efficacy on generic vision tasks.
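As a rough illustration of this recipe, the sketch below treats the activations of a late hidden layer as a fixed feature vector. It uses torchvision's pretrained AlexNet as a stand-in for the Krizhevsky et al. (2012) network rather than the released DeCAF weights, so the model, weight identifier, and layer indices are assumptions, not the authors' exact pipeline.

```python
# Sketch: read out fc6/fc7-style activations as a fixed "DeCAF-like" feature.
# torchvision's AlexNet is a stand-in for the original network and weights.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(weights="IMAGENET1K_V1").eval()  # torchvision >= 0.13

preprocess = T.Compose([
    T.Resize((256, 256)),    # warp, ignoring the aspect ratio (see Sec. 3.1)
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def decaf_features(image_path, layer=6):
    """Return post-ReLU activations of fc6 (layer=6) or fc7 (layer=7)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = torch.flatten(model.avgpool(model.features(x)), 1)
        cut = 3 if layer == 6 else 6   # classifier[:3] ends at fc6+ReLU, [:6] at fc7+ReLU
        return model.classifier[:cut](x).squeeze(0).numpy()
```

The 4096-dimensional vector returned here plays the role of DeCAF6/DeCAF7 in the experiments below: it is kept frozen and fed to simple classifiers.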

3.1 An Open-source Convolutional Model

  • We chose this model due to its performance on a difficult 1000-way classification task, hypothesizing that the activations of the neurons in its late hidden layers might serve as very strong features for a variety of object recognition tasks.
  • We closely followed the training protocol of Krizhevsky et al. (2012), with the exception of two small differences in the input data. First, we ignore the image's original aspect ratio and warp it to 256 x 256, rather than resizing and cropping to preserve the proportions. Second, we did not perform the data-augmentation trick of adding random multiples of the principal components of the RGB pixel values throughout the dataset, proposed as a way of capturing invariance to changes in illumination and color.
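
A minimal sketch of the warping step described above, assuming a PIL-based pipeline (the original preprocessing lived in the authors' own implementation):

```python
# Warp an image to 256 x 256 regardless of its original aspect ratio,
# as opposed to an aspect-preserving resize followed by a crop.
from PIL import Image

def warp_to_256(path):
    img = Image.open(path).convert("RGB")
    return img.resize((256, 256), Image.BILINEAR)  # proportions are not preserved
```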

3.2 Feature Generalization and Visualization

  • We visualize features in the following way: we run the t-SNE algorithm to find a 2-dimensional embedding of the high-dimensional feature space, and plot the embedded features as points colored according to their semantic category in a particular hierarchy (a minimal sketch follows this list).
  • Figure 1 is consistent with the common deep-learning intuition that the first layers learn “low-level” features, whereas the later layers learn semantic or “high-level” features. By contrast, traditional features such as GIST or LLC fail to capture these semantic differences.
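
A minimal sketch of this visualization, assuming the features have already been extracted into an (N, d) array; `features` and `labels` are placeholder names:

```python
# Embed high-dimensional activation features in 2-D with t-SNE and
# color each point by its semantic category.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_embedding(features, labels):
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
    plt.title("t-SNE of deep convolutional activation features")
    plt.show()
```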

3.3 Time Analysis

  • We observe that the convolution and fully-connected layers take most of the time to run, which is understandable as they involve large matrix-matrix multiplications.
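
A rough way to see this cost breakdown is to time one forward pass through each module of the AlexNet stand-in (an illustration, not the paper's measurement code; random weights are fine for timing):

```python
import time
import torch
import torchvision.models as models

model = models.alexnet().eval()          # untrained weights suffice for timing
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    layers = [(f"features.{n}", m) for n, m in model.features.named_children()]
    layers.append(("avgpool", model.avgpool))
    for name, module in layers:
        t0 = time.perf_counter()
        x = module(x)
        print(f"{name}: {(time.perf_counter() - t0) * 1e3:.2f} ms")
    x = torch.flatten(x, 1)
    for name, module in model.classifier.named_children():
        t0 = time.perf_counter()
        x = module(x)
        print(f"classifier.{name}: {(time.perf_counter() - t0) * 1e3:.2f} ms")
```

The large Conv2d and Linear modules dominate, which matches the observation above that convolution and fully-connected layers account for most of the runtime.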

4. Experiments

  • We chose not to evaluate features from earlier in the network, as the earlier convolutional layers are unlikely to contain a richer semantic representation than the later layers, which form higher-level hypotheses from the low- and mid-level local information in the convolutional activations.

4.1 Object recognition

  • Our one-shot learning results (e.g., 33.0% for SVM) suggest that with sufficiently strong representations like DeCAF, useful models of visual categories can often be learned from just a single positive example.
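
A hedged sketch of such a one-shot classifier, assuming the `decaf_features` extractor sketched in Section 3 and placeholder lists of image paths and labels (one labeled example per category in the training set):

```python
# Train a linear SVM on frozen deep activation features with a single
# labeled example per class, then predict on held-out images.
import numpy as np
from sklearn.svm import LinearSVC

def one_shot_svm(train_paths, train_labels, test_paths):
    X_train = np.stack([decaf_features(p) for p in train_paths])
    clf = LinearSVC(C=1.0).fit(X_train, train_labels)
    X_test = np.stack([decaf_features(p) for p in test_paths])
    return clf.predict(X_test)
```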

4.2 Domain adaptation

4.3 Subcategory recognition

4.4 Scene recognition

5. Discussion

Written on September 28, 2017