Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Reference: Li, Lisha, et al. "Hyperband: A novel bandit-based approach to hyperparameter optimization." arXiv preprint arXiv:1603.06560 (2016).
0. Abstract
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While current methods offer efficiencies by adaptively choosing new configurations to train, an alternative strategy is to adaptively allocate resources across the selected configurations. We formulate hyperparameter optimization as a pure-exploration nonstochastic infinitely-many-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce Hyperband for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with state-of-the-art methods on a suite of hyperparameter optimization problems. We observe that Hyperband provides 5x to 30x speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems.
1. Introduction
 In an effort to develop more efficient search methods, the problem of hyperparameter optimization has recently been dominated by Bayesian optimization methods that focus on optimizing hyperparameter configuration selection. Existing empirical evidence suggests that these methods outperform random search.
 However, these methods tackle a fundamentally challenging problem of simultaneously fitting and optimizing a high-dimensional, nonconvex function with unknown smoothness, and possibly noisy evaluations. To overcome these difficulties, some Bayesian optimization methods resort to heuristics, which are not endowed with any theoretical consistency guarantees.

Moreover, these adaptive configuration selection methods are intrinsically sequential and thus difficult to parallelize.
 We explore a different direction for hyperparameter optimization that focuses on speeding up configuration evaluation.
 In light of a recent empirical study by Feurer et al. (random search with twice the budget, "random 2x," outperformed two Bayesian optimization methods), along with the amenability of random search to theoretical analysis, we focus on speeding up random search using our configuration evaluation approach.
 A theoretical contribution of this work is the introduction of the pure-exploration, infinitely-many-armed bandit problem in the nonstochastic setting, for which Hyperband is one solution.
2. Hyperband Algorithm
2.1 SuccessiveHalving
 The idea behind the original SuccessiveHalving algorithm follows directly from its name: uniformly allocate a budget to a set of hyperparameter configurations, evaluate the performance of all configurations, throw out the worst half, and repeat until one configuration remains.
 For a fixed budget B spread over n configurations, each configuration receives B/n resources on average. If more resources are required before configurations can differentiate themselves in terms of quality, it is reasonable to work with a small number of configurations (more resource per configuration); if the quality of a configuration is typically revealed using minimal resources, then n is the bottleneck and we should choose n to be large.
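The halving loop described above can be sketched as follows; `get_config` and `evaluate` are assumed placeholder callables (sample a random configuration, and return its validation loss after `r` units of resource), not the paper's API:

```python
def successive_halving(get_config, evaluate, n=27, r=1, eta=3):
    """Minimal sketch of SuccessiveHalving.

    get_config()        -> samples one random hyperparameter configuration
    evaluate(config, r) -> validation loss after training with r resources
    (both callables are illustrative assumptions)
    """
    configs = [get_config() for _ in range(n)]
    while len(configs) > 1:
        # Evaluate every surviving configuration with the current resource r.
        losses = [evaluate(c, r) for c in configs]
        # Keep only the best 1/eta fraction (eta=2 discards the worst half).
        ranked = sorted(zip(losses, configs), key=lambda t: t[0])
        configs = [c for _, c in ranked[: max(1, len(configs) // eta)]]
        r *= eta  # survivors receive eta times more resource next round
    return configs[0]
```

With n=27 and eta=3 this runs three elimination rounds (27 → 9 → 3 → 1), spending geometrically more resource on geometrically fewer configurations.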
2.2 Hyperband
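Hyperband interleaves several SuccessiveHalving brackets, trading off the number of configurations n against the resource allocated to each. A minimal sketch of the outer loop, again with `get_config` and `evaluate` as assumed placeholder callables:

```python
def hyperband(get_config, evaluate, R=81, eta=3):
    """Sketch of Hyperband: run SuccessiveHalving brackets that trade off
    the number of configurations n against the resource per configuration."""
    # s_max = floor(log_eta(R)), computed with integer arithmetic
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1
    B = (s_max + 1) * R  # budget allotted to each bracket
    best, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):  # most to least exploratory bracket
        n = -(-B * eta ** s // (R * (s + 1)))  # ceil(B * eta^s / (R (s+1)))
        r = R / eta ** s                       # initial resource per config
        configs = [get_config() for _ in range(n)]
        for i in range(s + 1):  # inner SuccessiveHalving loop
            r_i = r * eta ** i
            losses = [evaluate(c, r_i) for c in configs]
            ranked = sorted(zip(losses, configs), key=lambda t: t[0])
            keep = max(1, n // eta ** (i + 1))
            configs = [c for _, c in ranked[:keep]]
        loss = evaluate(configs[0], R)  # bracket winner at full resource
        if loss < best_loss:
            best, best_loss = configs[0], loss
    return best
```

The bracket s = s_max starts many configurations on a small resource (aggressive early stopping), while s = 0 runs a few configurations to the full resource R, so at least one bracket is well suited to the problem regardless of how quickly configurations differentiate.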
2.3 Example Application with Iterations as A Resource: LeNet
2.4 Different Types of Resources
 Time
 Dataset Subsampling
 Feature Subsampling
2.5 Setting R
 R: the maximum amount of resources that can be allocated to any given configuration
 R is also (approximately) the number of configurations evaluated in the bracket that performs the most exploration, i.e., that bracket starts with roughly η^(s_max) ≈ R configurations, where s_max = ⌊log_η(R)⌋
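As a sanity check on this relationship, a small helper (a sketch for illustration, not code from the paper) can tabulate each bracket's starting size n and starting resource r:

```python
def bracket_geometry(R, eta):
    """Return (s, n, r) per Hyperband bracket: bracket index s, initial
    number of configurations n, initial resource per configuration r."""
    s_max = 0
    while eta ** (s_max + 1) <= R:  # s_max = floor(log_eta(R))
        s_max += 1
    B = (s_max + 1) * R
    return [(s,
             -(-B * eta ** s // (R * (s + 1))),  # n = ceil(B * eta^s / (R (s+1)))
             R / eta ** s)                        # initial resource per config
            for s in range(s_max, -1, -1)]
```

For R = 81 and η = 3 this yields five brackets; the most exploratory one (s = 4) starts with n = 81 = R configurations at r = 1 unit of resource each, while the least exploratory (s = 0) runs 5 configurations at the full r = 81.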
2.6 Setting η
 We stress that results are not very sensitive to the choice of η. Optimizing our theoretical bounds (see Section 4) suggests choosing η = e ≈ 2.718, but in practice we suggest taking η to be 3 or 4 (if you don't know how to choose η, use η = 3).
2.7 Overview of Theoretical Results
 The relative size of the budget required for uniform allocation and SuccessiveHalving depends on the envelope functions bounding deviation from terminal losses, as well as the distribution from which the configurations' terminal losses are drawn.
3. Hyperparameter Optimization Experiments
3.1 Early Stopping Iterative Algorithms for Deep Learning
 The first result returned by Hyperband after using a budget of 5R is often competitive with results returned by other searchers after using 50R.
 Hyperband is less variable than other searchers across trials, which is highly desirable in practice.
 For computationally expensive problems in high dimensional search spaces, it may make sense to just repeat the most exploratory brackets.
3.2 Dataset Subsampling
3.3 Feature Subsampling to Speed up Approximate Kernel Classification
3.4 Experimental Discussion
 The differences in realized speedup can be explained by two factors: (1) the scaling properties of total evaluation time as a function of the allocated resource and (2) the difficulty of finding a good configuration.
 Total evaluation time consists of the time to train the model on a given amount of resource as well as overhead costs. If training time is superlinear as a function of the resource, then Hyperband can offer higher speedups.
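To illustrate the first factor, suppose (as an assumption for illustration) that a training run with resource r_i costs r_i**power time units; a small helper can compare a bracket's total cost against training every configuration to completion:

```python
def bracket_cost(n, r, eta, rounds, power=2):
    """Total cost of a SuccessiveHalving bracket when one training run with
    resource r_i costs r_i**power (power=2 models superlinear training time,
    an illustrative assumption rather than a measured scaling law)."""
    total = 0
    for i in range(rounds + 1):
        n_i = max(1, n // eta ** i)   # survivors in round i
        r_i = r * eta ** i            # resource per survivor in round i
        total += n_i * r_i ** power
    return total
```

With n=81, r=1, η=3, and 4 halving rounds, the bracket costs 9801 units under quadratic scaling, versus 81 · 81² = 531441 units to train all 81 configurations to completion, roughly a 54x saving; under linear scaling (power=1) the same comparison is 405 versus 6561, only about 16x, which is why superlinear training time lets Hyperband offer higher speedups.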
 In practice, it is important to choose regularization hyperparameters that are independent of the resource when possible.
4. Theory
 Nonstochastic infinite-armed bandit (NIAB) problem
 No notes needed for this section
5. Related Work
5.1 Hyperparameter Optimization
 Bayesian optimization techniques model the conditional probability p(f|λ) of a configuration's performance on a metric f given a set of hyperparameters λ. Sequential Model-based Algorithm Configuration (SMAC), the Tree-structured Parzen Estimator (TPE), and Spearmint are the three best-known Bayesian searchers.
 One of the first extensive surveys of these three methods by Eggensperger et al. (2013) introduces a benchmark library for hyperparameter optimization called HPOlib, which we use in our experiments.
 As for adaptive configuration evaluation, early stopping of poor configurations is not a new idea; however, previous approaches either require strong assumptions or use heuristics to perform adaptive resource allocation.
5.2 Bandit Problems
 One could argue that our work is the first to propose a practical pure exploration algorithm for infinitely many armed bandits and test it on a real application.
6. Extensions and Open Questions
6.1 Distributed Implementations
 Hyperband is easy to parallelize since arms are independent and sampled randomly.
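A hedged sketch of that observation: because each bracket samples its own configurations independently, whole brackets can be dispatched to workers with no shared state. Here `run_bracket` is an assumed callable that executes one bracket and returns its best (loss, config) pair:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_hyperband(bracket_ids, run_bracket, max_workers=4):
    """Run independent Hyperband brackets concurrently and return the best
    (loss, config) pair found across all of them. run_bracket is an assumed
    callable executing one bracket; bracket_ids is any iterable of inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_bracket, bracket_ids))
    return min(results)  # smallest loss wins
```

The same shape works with a ProcessPoolExecutor or a cluster scheduler, since the only coordination point is the final reduction over per-bracket winners.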
6.2 Adjusting for Different Convergence Rates
6.3 Incorporating Non-Random Sampling
Appendix
 Experiments using Alex Krizhevsky's CNN architecture: the 18% model provided with cuda-convnet for CIFAR-10