Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Reference: Li, Lisha, et al. "Hyperband: A novel bandit-based approach to hyperparameter optimization." arXiv preprint arXiv:1603.06560 (2016).

0. Abstract

Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While current methods offer efficiencies by adaptively choosing new configurations to train, an alternative strategy is to adaptively allocate resources across the selected configurations. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinitely many armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce Hyperband for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with state-of-the-art methods on a suite of hyperparameter optimization problems. We observe that Hyperband provides 5x to 30x speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems.

1. Introduction

  • In an effort to develop more efficient search methods, the problem of hyperparameter optimization has recently been dominated by Bayesian optimization methods that focus on optimizing hyperparameter configuration selection. Existing empirical evidence suggests that these methods outperform random search.
  • However, these methods tackle a fundamentally challenging problem of simultaneously fitting and optimizing a high-dimensional, non-convex function with unknown smoothness, and possibly noisy evaluations. To overcome these difficulties, some Bayesian optimization methods resort to heuristics, which are not endowed with any theoretical consistency guarantees.
  • Moreover, these adaptive configuration selection methods are intrinsically sequential and thus difficult to parallelize.

  • We explore a different direction for hyperparameter optimization that focuses on speeding up configuration evaluation.
  • In light of a recent empirical study by Feurer et al. (random 2x was better than two Bayesian optimization methods), along with the amenability of random search to theoretical analysis, we focus on speeding up random search using our configuration evaluation approach.
  • A theoretical contribution of this work is the introduction of the pure-exploration, infinitely-many armed bandit problem in the non-stochastic setting, for which Hyperband is one solution.

2. Hyperband Algorithm

2.1 SuccessiveHalving

  • The idea behind the original SuccessiveHalving algorithm follows directly from its name: uniformly allocate a budget to a set of hyperparameter configurations, evaluate the performance of all configurations, throw out the worst half, and repeat until one configurations remains.
  • If more resources are required before configurations can differentiate themselves in terms of quality, then it would be reasonable to work with a small number of configurations; if the quality of a configuration is typically revealed using minimal resources, then n is the bottleneck and we should choose n to be large.

2.2 Hyperband

2.3 Example Application with Iterations as A Resource: LeNet

2.4 Different Types of Resources

  • Time
  • Dataset Subsampling
  • Feature Subsampling

2.5 Setting R

  • R: the maximum amount of resources that can be allocated to any given configuration
  • R is also the number of configurations evaluated in the bracket that performs the most exploration, i.e .

2.6 Setting η

  • We stress that results are not very sensitive to the choice of η. If our theoretical bounds are optimized (see Section 4) they suggest choosing η = e ≈ 2.718 but in practice we suggest taking η to be equal to 3 or 4 (if you don’t know how to choose η, use η = 3).

2.7 Overview of Theoretical Results

  • The relative size of the budget required for uniform allocation and SuccessiveHalving depends on the envelope functions bounding deviation from terminal losses as well as the distribution from which ’s are drawn.

3. Hyperparameter Optimization Experiments

3.1 Early Stopping Iterative Algorithms for Deep Learning

  • The first result returned by Hyperband after using a budget of 5R is often competitive with results returned by other searchers after using 50R.
  • Hyperband is less variable than other searchers across trials, which is highly desirable in practice.
  • For computationally expensive problems in high dimensional search spaces, it may make sense to just repeat the most exploratory brackets.

3.2 Dataset Subsampling

3.3 Feature Subsampling to Speed up Approximate Kernel Classification

3.4 Experimental Discussion

  • The differences in realized speedup can be explained by two factors: (1) the scaling properties of total evaluation time as a function of the allocated resource and (2) the difficulty of finding a good configuration.
  • Total evaluation time consists of the time to train the model on a given amount of resource as well as overhead costs. If training time is superlinear as a function of the resource, then Hyperband can offer higher speedups.
  • In practice, it is important to choose regularization hyperparameters that are independent of the resource when possible.

4. Theory

  • Non-stochastic infinite-armed bandit (NIAB) problem
  • No notes needed for this section

5.1 Hyperparameter Optimization

  • Bayesian optimization techniques model the conditional probability p(f|λ) of a configurations’ performance on a metric f given a set of hyperparameters λ. Sequential Model-based Algorithm Configuration (SMAC), Tree-structure Parzen Estimator (TPE), and Spearmint are the three most well known Bayesian searchers.
  • One of the first extensive surveys of these three methods by Eggensperger et al. (2013) introduces a benchmark library for hyperparameter optimization called HPOlib, which we use in our experiments.
  • As for adaptive configuration evaluation, early stopping poor configurations is not a new idea, however previous approaches either require strong assumptions or use heuristics to perform adaptive resource allocation.

5.2 Bandit Problems

  • One could argue that our work is the first to propose a practical pure exploration algorithm for infinitely many armed bandits and test it on a real application.

6. Extensions and Open Questions

6.1 Distributed Implementations

  • Hyperband is easy to parallelize since arms are independent and sampled randomly.

6.2 Adjusting for Different Convergence Rates

6.3 Incorporating Non-Random Sampling

Appendix

  • Experiments using Alex Krizhevsky’s CNN architecture: the 18% model provided on cuda-convnet for CIFAR-10
Written on October 25, 2017