Combining Hyperband and Bayesian Optimization

Reference: Falkner, Stefan, Aaron Klein, and Frank Hutter. "Combining Hyperband and Bayesian Optimization."

0. Abstract

Proper hyperparameter optimization is computationally very costly for expensive machine learning methods, such as deep neural networks; the same holds true for neural architecture search. Recently, the bandit-based strategy Hyperband has shown superior performance to vanilla Bayesian optimization methods that are limited to the traditional problem formulation of expensive blackbox optimization. However, while Hyperband has strong anytime performance for finding configurations with acceptable results, it relies on random search and therefore does not find the best configurations quickly. We propose to combine Hyperband with Bayesian optimization by maintaining a probabilistic model that captures the density of good configurations in the input space and samples from this model instead of sampling uniformly at random. We empirically show that our new method combines Hyperband’s strong anytime performance with the strong eventual performance of Bayesian optimization.

1. Introduction

  • Bayesian optimization (BO) and random search automate the search for good hyperparameter settings by formulating it as a blackbox optimization problem. However, they are too slow when evaluating a single configuration is costly.
  • Hyperband (HB) and its building block, successive halving, exploit cheap low-fidelity evaluations: they evaluate N hyperparameter configurations on one unit of budget (e.g., a fixed number of epochs or a time interval), continue the best half on two units of budget, the best half thereof on four units, and so on (see the sketch below). However, HB mostly finds only reasonably good configurations: since Hyperband and successive halving are based on random search, they do not use previous samples to guide the search and are therefore slow to find the very best configurations.
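
The successive-halving subroutine described in the bullet above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the halving factor eta, the callables evaluate and sample_random_config, and the toy demo are assumptions made for these notes.

```python
def successive_halving(n_configs, min_budget, evaluate, sample_random_config, eta=2):
    """Sketch of successive halving: start many configurations on a small
    budget, repeatedly keep the best 1/eta fraction and give the survivors
    eta times more budget."""
    configs = [sample_random_config() for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every surviving configuration on the current budget.
        losses = [evaluate(config, budget) for config in configs]
        # Keep the best 1/eta fraction (eta=2 -> the best half).
        ranked = [c for _, c in sorted(zip(losses, configs), key=lambda p: p[0])]
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]


if __name__ == "__main__":
    import random
    # Toy demo: a "configuration" is a single float, its loss is the distance
    # to 0.3 plus noise that shrinks as the budget grows.
    best = successive_halving(
        n_configs=16,
        min_budget=1,
        evaluate=lambda c, b: abs(c - 0.3) + random.random() / b,
        sample_random_config=random.random,
    )
    print("best configuration found:", best)
```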

2. Background

Bayesian Optimization

Tree-of-Parzen-Estimators (TPE)
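
This subsection is left as a stub in these notes. As a reminder of the idea behind TPE: instead of modelling the loss directly, it splits the observed configurations at a quantile of the loss, fits a density to the good set and one to the rest, and proposes the candidate that maximizes the ratio of the two densities. Below is a minimal one-dimensional sketch using Gaussian KDEs; the quantile gamma, the candidate count, and the requirement of a few points per set are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.stats import gaussian_kde


def tpe_suggest(observed_x, observed_loss, gamma=0.15, n_candidates=64):
    """TPE-style proposal for a single continuous hyperparameter (sketch).

    Splits the observations at the gamma-quantile of the loss, fits a kernel
    density l(x) to the good points and g(x) to the rest, and returns the
    candidate drawn from l(x) that maximizes l(x) / g(x).
    Assumes enough distinct observations so both sets contain >= 2 points.
    """
    x = np.asarray(observed_x, dtype=float)
    loss = np.asarray(observed_loss, dtype=float)

    order = np.argsort(loss)
    n_good = max(2, int(np.ceil(gamma * len(x))))
    good, bad = x[order[:n_good]], x[order[n_good:]]

    l_kde = gaussian_kde(good)  # density of the best-performing configurations
    g_kde = gaussian_kde(bad)   # density of the remaining configurations

    # Draw candidates from the "good" density and score them by the ratio.
    candidates = l_kde.resample(n_candidates).ravel()
    scores = l_kde(candidates) / np.maximum(g_kde(candidates), 1e-12)
    return float(candidates[np.argmax(scores)])
```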

Hyperband (HB)

In practice, HB works very well and, for small to medium total budgets, typically outperforms random search and Bayesian optimization methods that operate on the full function evaluation budget. However, its convergence to the global optimum is limited by its reliance on randomly drawn configurations.
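
For concreteness, the bracket schedule that HB runs over successive halving (following Li et al.'s Hyperband formulation) can be computed as below; R is the maximum budget a single configuration may receive and eta is the halving factor. This is a sketch for these notes, not the authors' code.

```python
import math


def hyperband_schedule(R, eta=3):
    """Return (bracket, n_configs, min_budget) triples for one Hyperband run.

    Bracket s = s_max is the most aggressive (many configurations, tiny
    initial budget); bracket s = 0 degenerates to running a few
    configurations directly on the full budget R.
    """
    s_max = 0
    while eta ** (s_max + 1) <= R:  # s_max = floor(log_eta(R)), avoiding float issues
        s_max += 1
    schedule = []
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial number of configs
        r = R / eta ** s                                 # initial budget per config
        schedule.append((s, n, r))
    return schedule


if __name__ == "__main__":
    # Example: maximum budget R = 81 epochs and eta = 3 give five brackets.
    for s, n, r in hyperband_schedule(R=81, eta=3):
        print(f"bracket s={s}: start {n} configs at budget {r:g}")
```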

3. Model-based Hyperband

  • Strong anytime performance: for short/medium budgets, we aim to do as well as HB.
  • Strong final performance: for large budgets, we aim to converge as well as BO.
  • Computational efficiency: to facilitate parallelization, we aim for the overhead of constructing and using the model to be small even compared to the shortest possible runs.
  • Conceptual simplicity: compared to sophisticated BO methods, Hyperband is very simple, allowing quick re-implementation in different frameworks; we thus aim for a simple model.
  • Robustness: in contrast to existing BO approaches that go beyond blackbox optimization, we aim to avoid any parametric assumptions and to also perform well for high-dimensional and conditional spaces.

Our extension replaces the random sampling of configurations at the beginning of each HB iteration by a model-based search. Once the desired number of configurations for the iteration is reached, the standard successive halving procedure is carried out using these configurations, and we keep track of the performance across all budgets to use as a basis for our models in later iterations.
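
A minimal sketch of how such a model-based proposal could slot in is given below, assuming a store of (configuration, loss) observations per budget gathered by earlier iterations and a model_suggest callable in the spirit of the TPE sketch above. The minimum number of points before the model is trusted, the fraction of purely random proposals kept for exploration, and all names are illustrative assumptions, not the authors' exact design.

```python
import random


def propose_configuration(observations_per_budget, sample_random_config,
                          model_suggest, min_points=8, random_fraction=0.2):
    """Propose one configuration at the start of an HB iteration (sketch).

    observations_per_budget maps a budget to a list of (config, loss) pairs
    recorded by earlier successive-halving runs.  If some budget has enough
    observations, ask the model (e.g. a TPE-style density ratio) for a
    promising configuration; otherwise, or with a small probability to keep
    exploring, fall back to random sampling as in vanilla Hyperband.
    """
    if random.random() < random_fraction:
        return sample_random_config()

    # Prefer the model fitted on the largest budget that has enough data,
    # since those evaluations are the most faithful to the true objective.
    for budget in sorted(observations_per_budget, reverse=True):
        points = observations_per_budget[budget]
        if len(points) >= min_points:
            configs = [c for c, _ in points]
            losses = [l for _, l in points]
            return model_suggest(configs, losses)

    return sample_random_config()  # not enough data yet: behave like plain HB
```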

4. Experiments

Written on January 27, 2018