DeepXplore: Automated Whitebox Testing of Deep Learning Systems

Reference: Pei, Kexin, et al. "DeepXplore: Automated Whitebox Testing of Deep Learning Systems." arXiv preprint arXiv:1705.06640 (2017).

0. Abstract

Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system’s behavior for corner case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs.

We design, implement, and evaluate DeepXplore, the first whitebox framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques.

DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model’s accuracy by up to 3%.

1. Introduction

  • Unfortunately, DL systems, despite their impressive capabilities, often demonstrate unexpected or incorrect behaviors in corner cases for several reasons such as biased training data, over-fitting, and under-fitting of the models.
  • The key challenges in automated systematic testing of large-scale DL systems are twofold: (1) how to generate inputs that trigger different parts of a DL system’s logic and uncover different types of erroneous behaviors, and (2) how to identify erroneous behaviors of a DL system without manual labeling/checking.

Contributions

  • We introduce neuron coverage as the first whitebox testing metric for DL systems that can estimate the amount of DL logic explored by a set of test inputs.
  • We demonstrate that the problem of finding a large number of behavioral differences between similar DL systems while maximizing neuron coverage can be formulated as a joint optimization problem. We present a gradient-based algorithm for solving this problem efficiently.
  • We implement all of these techniques as part of DeepXplore, the first whitebox DL-testing framework, which exposed thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails) in 15 state-of-the-art DL models with a total of 132,057 neurons trained on five popular datasets containing around 162 GB of data.
  • We show that the tests generated by DeepXplore can also be used to retrain the corresponding DL systems to improve classification accuracy by up to 3%.

2. Background

2.1 DL Systems

We define a DL system to be any software system that includes at least one Deep Neural Network (DNN) component.

2.2 DNN Architecture

  • Each layer of the network transforms the information contained in its input to a higher-level representation of the data.

2.3 Limitations of Existing DNN Testing

  • Expensive labeling effort
  • Low test coverage
  • Problems with low-coverage DNN tests

3. Overview

  • We perform a gradient-guided local search starting from the seed inputs and find new inputs that maximize the desired goals.
  • Goals: (1) DeepXplore tries to maximize the chance of finding differential behavior by modifying the input; (2) it also tries to cover as many neurons as possible by activating (i.e., causing a neuron’s output to have a value greater than a threshold) neurons that are currently inactive in the hidden layers; (3) it further applies domain-specific constraints (e.g., ensuring that pixel values are integers between 0 and 255 for image inputs) to make sure that the modified inputs still represent real-world images. A sketch of the resulting joint objective is given below.
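
Putting the first two goals together, the paper solves a joint optimization problem by gradient ascent over the input. Roughly, in my paraphrase of the paper's formulation (not quoted verbatim), with F_k(x)[c] the probability that DNN k assigns to the seed's original class c, F_j the DNN being pushed to disagree, f_t(x) the output of a currently inactive target neuron, and λ1, λ2 the balancing hyperparameters:

$$
obj_{joint}(x) = \Big(\sum_{k \neq j} F_k(x)[c] - \lambda_1 \cdot F_j(x)[c]\Big) + \lambda_2 \cdot f_t(x)
$$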

4. Methodology

4.1 Definitions

  • Neuron coverage: We define the neuron coverage of a set of test inputs as the ratio of the number of unique neurons activated by all test inputs to the total number of neurons in the DNN. We consider a neuron to be activated if its output is higher than a threshold value (e.g., 0). A sketch of the metric is given after this list.
  • Gradient: the gradient of a DNN’s output with respect to its input, which indicates how each input feature should change to increase that output; DeepXplore uses such gradients to guide its search for new test inputs.
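
A minimal sketch of the neuron coverage metric as defined above, assuming per-layer neuron outputs are available as NumPy arrays; the bookkeeping and function names here are mine, not the paper's code:

```python
import numpy as np

def init_coverage(layer_sizes):
    """layer_sizes: dict mapping layer name -> number of neurons in that layer."""
    return {(layer, i): False for layer, n in layer_sizes.items() for i in range(n)}

def update_coverage(layer_activations, covered, threshold=0.0):
    """Mark the neurons whose output exceeds `threshold` for one test input.
    layer_activations: dict mapping layer name -> 1-D NumPy array of neuron outputs."""
    for layer, acts in layer_activations.items():
        for i in np.where(acts > threshold)[0]:
            covered[(layer, int(i))] = True

def neuron_coverage(covered):
    """Neuron coverage = neurons activated by at least one input / total neurons."""
    return sum(covered.values()) / len(covered)
```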

4.2 DeepXplore Algorithm
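
Below is a minimal sketch of the gradient-guided search described in the overview, not the paper's Algorithm 1 itself. The `objective_and_grad` helper is hypothetical (it could be backed by the Keras snippet in Section 5), and the step size, iteration count, and constraint handling are illustrative choices:

```python
import numpy as np

def gradient_ascent(x_seed, objective_and_grad, steps=100, step_size=1.0):
    """Gradient-guided local search from a seed input.
    objective_and_grad(x) -> (value, grad) is a hypothetical helper that evaluates
    the joint objective and its gradient with respect to the input x."""
    x = np.asarray(x_seed, dtype=np.float32).copy()
    for _ in range(steps):
        _, grad = objective_and_grad(x)
        x = x + step_size * grad             # move the input to increase the objective
        x = np.clip(np.rint(x), 0, 255)      # image-domain constraint: valid integer pixels
        # The real algorithm also stops early once the DNNs disagree on x and the
        # targeted inactive neuron has been activated, then picks a new target neuron.
    return x
```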

5. Implementation

We implement DeepXplore using TensorFlow 1.0.1 and Keras 2.0.3 DL frameworks.
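
A hedged sketch of how the needed input gradients can be obtained with the symbolic API of Keras 2.0.x on the TensorFlow 1.x backend; the function, layer, and hyperparameter names here are placeholders rather than the paper's code, and it assumes all DNNs under test were built on one shared Input tensor:

```python
from keras import backend as K

def make_objective_fn(input_tensor, models, class_idx, target_layer, neuron_idx,
                      lambda1=1.0, lambda2=0.1):
    """Return a function: input batch -> (joint objective value, gradient w.r.t. input).
    models[0] is pushed to disagree on class_idx; the other DNNs are kept agreeing."""
    obj1 = sum(m.output[0, class_idx] for m in models[1:]) \
           - lambda1 * models[0].output[0, class_idx]
    # Reward activating a chosen (currently inactive) neuron of models[0].
    neuron_out = models[0].get_layer(target_layer).output[0, ..., neuron_idx]
    obj = obj1 + lambda2 * K.mean(neuron_out)
    grad = K.gradients(obj, input_tensor)[0]
    return K.function([input_tensor], [obj, grad])
```

The returned function maps a batch containing one seed input to the objective value and its gradient, which is what the search loop sketched in Section 4.2 consumes.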

6. Experimental Setup

6.1 Test Datasets and DNNs

  • We adopt five popular public datasets with different types of data (MNIST, ImageNet, Driving, Contagio/VirusTotal, and Drebin) and evaluate DeepXplore on three DNNs for each dataset.

6.2 Domain-specific Constraints

  • Image constraints (MNIST, ImageNet, and Driving): (1) lighting effects to simulate different light intensities, (2) occlusion by a single small rectangle to simulate an attacker blocking part of a camera, and (3) occlusion by multiple tiny black rectangles to simulate dirt on the camera lens. A sketch of how such constraints transform the gradient is given after this list.
  • Other constraints (Drebin and Contagio/VirusTotal)
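
A rough sketch of how the image constraints can be imposed by transforming the gradient before each update step, assuming channels-last image batches of shape (batch, height, width, channels); the paper's exact constraint implementations may differ:

```python
import numpy as np

def constraint_light(grad):
    """Lighting: every pixel changes by the same amount, so the perturbation
    only brightens or darkens the image as a whole."""
    return np.full_like(grad, grad.mean())

def constraint_occlusion(grad, top, left, height, width):
    """Single-rectangle occlusion: only pixels inside one small rectangle may
    change, simulating an object blocking part of the camera."""
    masked = np.zeros_like(grad)
    masked[:, top:top + height, left:left + width, :] = \
        grad[:, top:top + height, left:left + width, :]
    return masked
```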

7. Results

7.1 Benefits of Neuron Coverage

  • It has recently been shown that each neuron in a DNN tends to independently extract a specific feature of the input instead of collaborating with other neurons for feature extraction. Essentially, each neuron tends to learn a different set of rules than others.
  • We perform two different experiments to justify that neuron coverage is a good metric for DNN testing comprehensiveness: (1) Neuron coverage vs. code coverage. (2) Effect of neuron coverage on the difference-inducing inputs found by DeepXplore.

7.2 Performance

  • Neuron coverage: DeepXplore, on average, covers 34.4% and 33.2% more neurons than random testing and adversarial testing, respectively.
  • Execution time and number of seed inputs: we measure the execution time of DeepXplore to generate difference-inducing inputs with 100% neuron coverage for all the tested DNNs. Because some neurons in the fully-connected layers of the DNNs on MNIST, ImageNet, and Driving are very hard to activate, we only consider neuron coverage of layers other than fully-connected layers. DeepXplore is very efficient both at finding difference-inducing inputs and at increasing neuron coverage.
  • Different choices of hyperparameters: we use the time taken by DeepXplore to find the first difference-inducing input as the metric for comparing different choices of hyperparameters. We choose this metric because we observe that finding the first difference-inducing input for a given seed tends to be significantly harder than increasing the number of difference-inducing inputs.
  • Testing very similar models with DeepXplore

7.3 Improving DNNs with DeepXplore

  • Augmenting training data to improve accuracy: We augment the original training data of a DNN with the error-inducing inputs generated by DeepXplore and retrain the DNN to fix the erroneous behaviors, thereby improving its accuracy. This differs from retraining with adversarial inputs: adversarial testing requires manual labeling, whereas DeepXplore can use majority voting among the tested DNNs to automatically generate labels for the generated test inputs (see the sketch after this list).
  • Detecting training data pollution attacks: We use DeepXplore to generate error-inducing inputs that are classified as the digit 9 by the unpolluted version and as the digit 1 by the polluted version of the LeNet-5 DNN. We then search the training set for the samples closest to the DeepXplore-generated inputs in terms of structural similarity and flag them as polluted data. Using this process, we correctly identify 95.6% of the polluted samples.
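
A minimal sketch of the majority-vote labeling mentioned above; tie-breaking and the integration with retraining are left out:

```python
from collections import Counter

def majority_vote_label(predicted_labels):
    """Label a generated test input with the class predicted by the majority of
    the DNNs under test; `predicted_labels` is the list of their predictions."""
    return Counter(predicted_labels).most_common(1)[0][0]

# Example: if two of the three MNIST DNNs still classify a generated image as 7,
# the image is added to the retraining set with label 7.
```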

8. Discussion

  • Causes of differences between DNNs: training data, the DNN architecture, hyperparameters, etc.
  • Overhead of training vs. testing DNNs
  • Limitations: (1) differential testing requires at least two different DNNs with the same functionality; moreover, if two DNNs differ only slightly (i.e., by a few neurons), DeepXplore will take longer to find difference-inducing inputs than if the DNNs were significantly different from each other; (2) differential testing can only detect an erroneous behavior if at least one DNN produces results that differ from those of the other DNNs.
  • Adversarial deep learning
  • Testing and verification of DNNs
  • Other applications of DNN gradients
  • Differential testing of traditional software
Written on November 12, 2017