Algorithms for Hyper-Parameter Optimization
Reference: Bergstra, James S., et al. "Algorithms for hyper-parameter optimization." Advances in Neural Information Processing Systems. 2011.
0. Abstract
Several recent advances in the state of the art on image classification benchmarks have come from better configurations of existing techniques rather than from novel approaches to feature learning. Traditionally, hyperparameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials, and we show that algorithmic approaches can find better results. We present hyperparameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyperparameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks on several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P(y|x) in which many elements of the hyperparameter assignment x are known to be irrelevant given particular values of other elements.
[1] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML 2007, pages 473–480, 2007.
1. Introduction
- Motivation: (1) new models have many hyperparameters, which are difficult for humans to fine-tune; (2) recent results demonstrate that better performance can be achieved simply with better hyperparameters.
- Hyperparameter optimization is the problem of optimizing a loss function over a graph-structured configuration space. In this work we restrict ourselves to tree-structured configuration spaces.
- In this work we define a configuration space by a generative process for drawing valid samples. Optimization algorithms work by identifying hyperparameter assignments that could have been drawn and that appear promising on the basis of the loss function's value at other points.
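A tree-structured space defined by a generative sampler might look like the following minimal sketch; all hyperparameter names and value ranges here are illustrative, not taken from the paper:

```python
import random

def sample_config(rng=random):
    """Draw one valid configuration from a hypothetical tree-structured
    space: the number of hidden layers determines which per-layer
    hyperparameters exist at all."""
    config = {
        "n_layers": rng.choice([1, 2, 3]),
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform draw
    }
    for i in range(config["n_layers"]):
        # size of layer 2 (or 3) is only drawn when that layer exists
        config[f"layer_{i}_size"] = rng.choice([128, 256, 512, 1024])
    return config
```

Conditioning later draws on earlier ones (here, layer sizes on `n_layers`) is exactly what makes some hyperparameters irrelevant for some configurations.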
2. Sequential Model-Based Global Optimization (SMBO)
- In an application where the true fitness function f : X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop of an SMBO algorithm is the numerical optimization of this surrogate, or of some transformation of it. The point that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should next be evaluated.
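The generic SMBO loop can be sketched as follows; the function names (`fit_model`, `propose`) are placeholders of mine, not the paper's notation:

```python
def smbo(f, fit_model, propose, n_iter, history=None):
    """Generic SMBO loop: alternately fit a cheap surrogate to the
    observation history and evaluate the costly f at the point that
    optimizes the acquisition criterion (e.g., Expected Improvement)."""
    history = list(history or [])
    for _ in range(n_iter):
        model = fit_model(history)            # cheap surrogate of f
        x_star = propose(model, history)      # e.g., maximize EI
        history.append((x_star, f(x_star)))   # single costly evaluation
    return min(history, key=lambda xy: xy[1])  # best (x, f(x)) seen
```

The whole design rests on f being much more expensive than both `fit_model` and `propose`.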
- SMBO algorithms differ in the criterion they optimize to obtain a proposal x*, given a model (or surrogate) of f, and in how they model f via the observation history H. The algorithms in this work optimize the criterion of Expected Improvement (EI).
- Other criteria have been suggested, such as Probability of Improvement, minimizing the Conditional Entropy of the Minimizer, and bandit-based criteria.
- Expected improvement is the expectation under some model M of f : X → R that f(x) will exceed (negatively) some threshold y*:

  EI_{y*}(x) = ∫ max(y* − y, 0) p_M(y | x) dy
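When the surrogate's predictive distribution at x is Gaussian with mean mu and standard deviation sigma (as under a GP), this integral has a well-known closed form; a minimal sketch (function name is mine):

```python
import math

def expected_improvement(mu, sigma, y_star):
    """Closed-form EI_{y*}(x) = E[max(y* - y, 0)] for y ~ N(mu, sigma^2),
    i.e. expected reduction below the threshold y* (minimization)."""
    if sigma <= 0.0:
        # degenerate (noiseless) prediction: improvement is deterministic
        return max(y_star - mu, 0.0)
    z = (y_star - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # N(0,1) density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # N(0,1) CDF
    return (y_star - mu) * cdf + sigma * pdf
```

EI is zero only when improvement is impossible (sigma = 0 and mu ≥ y*), which is what gives the criterion its built-in exploration.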
3. The Gaussian Process Approach (GP)
- Gaussian Processes are priors over functions that are closed under sampling: if the prior distribution of f is believed to be a GP with mean 0 and kernel k, the conditional distribution of f given a sample of its values is also a GP, whose mean and covariance function are analytically derivable.
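The "analytically derivable" posterior has the standard textbook form (not spelled out in these notes): writing K for the n×n kernel matrix with entries k(x_i, x_j), k_* for the vector of k(x_*, x_i), and y for the vector of observed values,

```latex
\mu(x_*) = k_*^\top K^{-1} \mathbf{y}, \qquad
\sigma^2(x_*) = k(x_*, x_*) - k_*^\top K^{-1} k_*
```

These are the mu and sigma fed into the EI criterion at each candidate x_*.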
- We model f with a GP and set y* to the best value found after observing H: y* = min{f(x_i), 1 ≤ i ≤ n}. The model p_M(y|x) in the EI criterion is then the posterior GP knowing H.
Optimizing EI in the GP

Sort of complex now, temporarily skipped

- Since not all hyperparameters are relevant for every point (e.g., a DBN with only one hidden layer has no parameters associated with a second or third layer), we chose to group the hyperparameters by common use in a tree-like fashion and place different independent GPs over each group.
4. Tree-structured Parzen Estimator Approach (TPE)
- Whereas the Gaussian-process-based approach models p(y|x) directly, this strategy models p(x|y) and p(y). TPE defines p(x|y) using two densities: ℓ(x), fit to the observations whose loss lies below a threshold y*, and g(x), fit to the remaining observations.
- The TPE algorithm depends on a y* that is larger than the best observed f(x), so that some points can be used to form ℓ(x). The TPE algorithm chooses y* to be some quantile γ of the observed y values, so that p(y < y*) = γ, but no specific model for p(y) is necessary.
- By maintaining sorted lists of observed variables in H, the runtime of each iteration of the TPE algorithm can scale linearly in |H| and linearly in the number of variables (dimensions) being optimized.
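A toy one-dimensional sketch of the TPE proposal step, under simplifying assumptions of mine (fixed-bandwidth Gaussian Parzen windows, candidates drawn near the good points, ℓ(x)/g(x) as the EI proxy):

```python
import math, random

def parzen_density(points, bandwidth=0.5):
    """Toy 1-D Parzen estimator: a Gaussian kernel centered on each point."""
    def density(x):
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return total / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return density

def tpe_suggest(history, gamma=0.25, n_candidates=64, rng=random):
    """Split observed (x, y) pairs at the gamma-quantile of y, fit l(x)
    to the good points and g(x) to the rest, then return the sampled
    candidate maximizing l(x) / g(x)."""
    ordered = sorted(history, key=lambda xy: xy[1])   # sort by observed loss
    n_good = max(1, int(gamma * len(ordered)))
    good = [x for x, _ in ordered[:n_good]]           # losses below y*
    bad = [x for x, _ in ordered[n_good:]]            # losses at or above y*
    l, g = parzen_density(good), parzen_density(bad)
    candidates = [rng.gauss(rng.choice(good), 0.5) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))
```

Points likely under ℓ but unlikely under g are exactly those that look better than the γ-quantile of past results.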
Optimizing EI in the TPE algorithm
Details of the Parzen Estimator
- Sort of detailed now, temporarily skipped
5. Random Search for Hyper-Parameter Optimization in DBNs
- The result of manual search can be reliably matched with 32 random trials for several datasets.
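Random search itself is just the generative sampler plus a best-so-far loop; a minimal sketch (names are mine):

```python
import random

def random_search(f, sample_config, n_trials=32, rng=random):
    """Draw n_trials i.i.d. configurations from the prior over the search
    space, evaluate each, and keep the best (lowest-loss) one."""
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)   # independent draw from the config prior
        loss = f(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss
```

The trials are independent, so this baseline parallelizes trivially, unlike the sequential model-based methods below.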
6. Sequential Search for Hyper-Parameter Optimization in DBNs
- Results
Parallelizing Sequential Search
- Different methods for GP and TPE
7. Discussion
- TPE outperformed GP