Algorithms for Hyper-Parameter Optimization

Reference: Bergstra, James S., et al. "Algorithms for hyper-parameter optimization." Advances in neural information processing systems. 2011.

0. Abstract

Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P(y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements.

[1] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML 2007, pages 473–480, 2007.

1. Introduction

  • Motivation: (1) new models have many hyper-parameters, which are difficult for humans to tune; (2) recent results demonstrate that better performance can be achieved simply through better hyper-parameter configurations.
  • Hyper-parameter optimization is the problem of optimizing a loss function over a graph-structured configuration space. In this work we restrict ourselves to tree-structured configuration spaces.
  • In this work we define a configuration space by a generative process for drawing valid samples. Optimization algorithms work by identifying hyper-parameter assignments that could have been drawn and that appear promising on the basis of the loss function's value at other points. (A toy example of such a generative process follows this list.)
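A toy sketch of how such a generative, tree-structured configuration space might be written as a sampling program; the hyper-parameter names and ranges here are illustrative inventions, not the ones used in the paper:

```python
import random

# Hypothetical generative process for a tree-structured configuration space:
# the number of hidden layers decides which layer-size hyper-parameters are
# relevant at all, so some variables only exist for some samples.
def sample_config(rng=random):
    config = {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "n_layers": rng.choice([1, 2, 3]),
    }
    # Layer-2 and layer-3 sizes are only drawn when those layers exist.
    for layer in range(1, config["n_layers"] + 1):
        config["layer%d_size" % layer] = rng.choice([128, 256, 512, 1024])
    return config

print(sample_config())
```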

2. Sequential Model-based Global Optimization

  • In an application where the true fitness function f : X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. The point that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. (A minimal sketch of this loop is given after this list.)
  • SMBO algorithms differ in what criterion they optimize to obtain a proposal given a model (or surrogate) of f, and in how they model f via the observation history H. The algorithms in this work optimize the criterion of Expected Improvement (EI).
  • Other criteria that have been suggested include Probability of Improvement, minimizing the Conditional Entropy of the Minimizer, and a bandit-based criterion.
  • Expected Improvement is the expectation under some model M of f : X → R that f(x) will exceed (negatively) some threshold y*: EI_{y*}(x) = ∫_{−∞}^{∞} max(y* − y, 0) p_M(y | x) dy.
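A minimal sketch of the generic SMBO loop described in the first bullet above; `sample_config`, `fit_surrogate`, and `expected_improvement` are hypothetical placeholders, not APIs from the paper or any library:

```python
# Generic SMBO loop: fit a cheap surrogate to the history H, maximize the
# EI criterion over candidate configurations, evaluate the costly f once,
# and repeat. All helper functions are hypothetical placeholders.
def smbo(f, sample_config, fit_surrogate, expected_improvement,
         n_iterations, n_warmup=5, n_candidates=100):
    # Warm-up with random evaluations so the first surrogate has data to fit.
    history = [(x, f(x)) for x in (sample_config() for _ in range(n_warmup))]
    for _ in range(n_iterations):
        model = fit_surrogate(history)             # M_t fitted to H
        y_star = min(y for _, y in history)        # threshold y* for EI
        candidates = [sample_config() for _ in range(n_candidates)]
        # Inner loop: optimize the cheap criterion instead of the costly f.
        x_next = max(candidates,
                     key=lambda x: expected_improvement(x, model, y_star))
        history.append((x_next, f(x_next)))        # one costly evaluation
    return min(history, key=lambda pair: pair[1])  # best (x, f(x)) found
```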

3. The Gaussian Process Approach (GP)

  • Gaussian Processes are priors over functions that are closed under sampling, which means that if the prior distribution of f is believed to be a GP with mean 0 and kernel k, the conditional distribution of f knowing a sample of its values is also a GP, whose mean and covariance function are analytically derivable.
  • We model f with a GP and set y* to the best value found after observing H: y* = min{f(x_i), 1 ≤ i ≤ n}. The model p_M in the EI expression above is then the posterior GP knowing H.

Optimizing EI in the GP

  • Somewhat involved; temporarily skipped. (A closed-form EI sketch under the GP posterior is included at the end of this section.)

  • Since not all hyper-parameters are relevant for every point (e.g., a DBN with only one hidden layer has no parameters associated with a second or third layer), we chose to group the hyper-parameters by common use in a tree-like fashion and place different, independent GPs over each group.
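For reference, when the surrogate's predictive distribution at a candidate x is Gaussian with mean μ(x) and standard deviation σ(x), as it is under a GP posterior, the EI integral above has a standard closed form. A minimal sketch for a minimization problem (improvement = y* − y), assuming `scipy` is available:

```python
import numpy as np
from scipy.stats import norm

def gp_expected_improvement(mu, sigma, y_star):
    """Closed-form EI of a Gaussian predictive distribution N(mu, sigma^2)
    against the threshold y_star, for a minimization problem."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-12)  # avoid /0
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```

Here mu and sigma would come from whichever GP library is used to model f; the candidate with the largest EI becomes the next trial.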

4. Tree-structured Parzen Estimator Approach (TPE)

  • Whereas the Gaussian-process based approach models p(y|x) directly, this strategy models p(x|y) and p(y). Here p(x|y) is defined using two densities: ℓ(x), built from the observed configurations whose loss was below a threshold y*, and g(x), built from the remaining observations, so that p(x|y) = ℓ(x) if y < y* and p(x|y) = g(x) if y ≥ y*.
  • The TPE algorithm depends on a y* that is larger than the best observed f(x) so that some points can be used to form ℓ(x). It chooses y* to be some quantile γ of the observed y values, so that p(y < y*) = γ, but no specific model for p(y) is necessary.
  • By maintaining sorted lists of observed variables in H, the runtime of each iteration of the TPE algorithm can scale linearly in |H| and linearly in the number of variables (dimensions) being optimized.

Optimizing EI in the TPE algorithm
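  • Under this parameterization, the paper shows that EI_{y*}(x) ∝ (γ + (1 − γ) · g(x)/ℓ(x))^{-1}, so EI is maximized at points x that are likely under ℓ(x) and unlikely under g(x); each iteration samples candidates from ℓ and returns the one with the largest ratio ℓ(x)/g(x).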

Details of the Parzen Estimator

  • Somewhat detailed; temporarily skipped. (A rough sketch of one TPE iteration is included below.)
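A rough, self-contained sketch of one TPE iteration for a single real-valued hyper-parameter. Gaussian KDEs (`scipy.stats.gaussian_kde`) stand in for the paper's adaptive-width Parzen estimators, and γ, the candidate count, and the toy objective are illustrative choices:

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(xs, ys, gamma=0.15, n_candidates=100):
    """Suggest the next value of one hyper-parameter given past trials (xs, ys)."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    # Split the history at (roughly) the gamma-quantile of observed losses, y*.
    n_good = max(2, int(np.ceil(gamma * len(ys))))
    order = np.argsort(ys)
    good, bad = xs[order[:n_good]], xs[order[n_good:]]
    l_density = gaussian_kde(good)   # l(x), from trials with y <  y*
    g_density = gaussian_kde(bad)    # g(x), from trials with y >= y*
    # Sample candidates from l and keep the one maximizing l(x)/g(x),
    # which is equivalent to maximizing EI under the TPE model.
    candidates = l_density.resample(n_candidates).ravel()
    scores = l_density(candidates) / g_density(candidates)
    return candidates[np.argmax(scores)]

if __name__ == "__main__":
    f = lambda x: (x - 0.3) ** 2                 # toy loss, minimum at x = 0.3
    xs = list(np.random.uniform(0.0, 1.0, 20))   # random warm-up trials
    ys = [f(x) for x in xs]
    for _ in range(30):
        x = float(tpe_suggest(xs, ys))
        xs.append(x)
        ys.append(f(x))
    print(min(zip(ys, xs)))                      # best (loss, x) found
```

The actual algorithm applies such estimators at every node of the tree-structured configuration space rather than to a single variable, which is how irrelevant hyper-parameters are excluded from the density estimates.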

5. Random Search for Hyper-Parameter Optimization in DBNs

  • For several datasets, the results of manual search can be reliably matched with 32 random trials.

6. Sequential Search for Hyper-Parameter Optimization in DBNs

  • Results: on the most difficult DBN learning problems from [1], the sequential algorithms found significantly better configurations than the best previously reported.

Parallelizing Sequential Search

  • Different methods for GP and TPE

7. Discussion

  • TPE outperformed the GP approach in these experiments.
Written on January 27, 2018