Hyperparameter optimization

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize to different data patterns. These settings are called hyperparameters, and they have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.^[1] The objective function takes a tuple of hyperparameters and returns the associated loss.^[1] Cross-validation is often used to estimate this generalization performance.^[2]
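In symbols, the problem can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the cited sources: x is a tuple of hyperparameters, \mathcal{X} is the search space, and f is the objective function that returns the loss.

x^{*}=\arg\min_{x\in\mathcal{X}} f(x)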

Approaches

Grid search

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set^[3] or evaluation on a held-out validation set.^[4]

Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.

For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search, one selects a finite set of "reasonable" values for each, say

C\in \{10,100,1000\}

\gamma \in \{0.1,0.2,0.5,1.0\}

Grid search then trains an SVM with each pair (C, γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.
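A minimal sketch of this procedure, using the candidate values from the example. The function validation_score is a hypothetical stand-in for training an SVM and scoring it on held-out data; its formula is invented for illustration and is not a real SVM result.

```python
import itertools
import math

# Candidate values taken from the example in the text.
C_VALUES = [10, 100, 1000]
GAMMA_VALUES = [0.1, 0.2, 0.5, 1.0]


def validation_score(C, gamma):
    """Hypothetical stand-in for 'train an SVM, score it on held-out data'.

    The formula is invented; it is highest at C=100, gamma=0.2.
    """
    return -((math.log10(C) - 2.0) ** 2) - (math.log10(gamma / 0.2)) ** 2


def grid_search(score_fn, grid):
    """Evaluate score_fn on every point of the Cartesian product of the grid."""
    names = list(grid)
    best_params, best_score = None, -math.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(
    validation_score, {"C": C_VALUES, "gamma": GAMMA_VALUES}
)
print(best_params, best_score)
```

Each call to score_fn is independent of the others, so the loop body could be distributed across workers without changing the result.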

Grid search suffers from the curse of dimensionality, but is often embarrassingly parallel because typically the hyperparameter settings it evaluates are independent of each other.^[2]

Random search

Random search replaces the exhaustive enumeration of all combinations by selecting them randomly. This applies directly to the discrete setting described above, but it also generalizes to continuous and mixed spaces. It can outperform grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm.^[2] In this case, the optimization problem is said to have a low intrinsic dimensionality.^[5] Random search is also embarrassingly parallel, and it additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample.
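The idea can be sketched as follows. The hyperparameter names, the sampling distributions and the loss function are all assumptions made for illustration; the loss depends on only one of the two hyperparameters, to mimic a problem with low intrinsic dimensionality.

```python
import math
import random


def validation_loss(params):
    """Hypothetical loss: only 'learning_rate' matters, 'momentum' has no effect."""
    return (math.log10(params["learning_rate"]) + 2.0) ** 2


def sample_params(rng):
    """Draw one configuration; the chosen distributions encode prior knowledge."""
    return {
        # Log-uniform over [1e-5, 1]: every order of magnitude is equally likely.
        "learning_rate": 10 ** rng.uniform(-5.0, 0.0),
        # Uniform over [0, 1].
        "momentum": rng.uniform(0.0, 1.0),
    }


def random_search(loss_fn, sampler, n_trials, seed=0):
    """Sample n_trials configurations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, math.inf
    for _ in range(n_trials):
        params = sampler(rng)
        loss = loss_fn(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(validation_loss, sample_params, n_trials=50)
print(best_params, best_loss)
```

Unlike a grid, every trial here tries a new value of the hyperparameter that matters, which is the intuition behind the advantage described above.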

Bayesian optimization

Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating the model, Bayesian optimization aims to gather observations that reveal as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum). In practice, Bayesian optimization has been shown^[6]^[7]^[8]^[9] to obtain better results in fewer evaluations than grid search and random search, because it can reason about the quality of experiments before they are run.
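The loop can be sketched in plain Python under several stated assumptions: a single hyperparameter x in [0, 1], a zero-mean Gaussian process with an RBF kernel as the probabilistic model, and a lower confidence bound as the rule for picking the next point. These are common choices, not the specific methods of the cited papers, and the objective is a hypothetical toy function.

```python
import math


def objective(x):
    """Hypothetical validation loss over one hyperparameter x in [0, 1]."""
    return (x - 0.3) ** 2


def rbf(a, b, length_scale=0.2):
    """RBF kernel; the length scale is an assumed value."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)       # K^-1 y
    v = solve(K, k_star)       # K^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)


def bayesian_optimization(f, n_iter=10, kappa=2.0):
    """Model-based search over a fixed grid of 101 candidate points."""
    candidates = [i / 100 for i in range(101)]
    xs = [0.0, 1.0]                      # initial design
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        def lcb(x):
            # Low mean favours exploitation, high uncertainty favours exploration.
            mean, std = gp_posterior(xs, ys, x)
            return mean - kappa * std
        x_next = min((c for c in candidates if c not in xs), key=lcb)
        xs.append(x_next)                # evaluate the most promising point
        ys.append(f(x_next))             # and update the data behind the model
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


best_x, best_y, evaluated = bayesian_optimization(objective)
print(best_x, best_y)
```

The parameter kappa controls the trade-off: larger values favour uncertain regions, smaller values favour points the model already predicts to be good.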

Gradient-based optimization

For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. These techniques were first applied to neural networks.^[10] Since then, they have been extended to other models such as support vector machines^[11] or logistic regression.^[12]

A different way to obtain a gradient with respect to hyperparameters is to differentiate the steps of an iterative optimization algorithm using automatic differentiation.^[13]^[14]
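As a minimal sketch of a hyperparameter gradient, consider one-dimensional ridge regression without an intercept. This model is chosen here only because its fitted weight has a closed form, so the chain rule can be written out by hand; it is not the method of the cited papers. The data and all names are hypothetical.

```python
import math

# Hypothetical toy data: one feature, no intercept.
X_TRAIN, Y_TRAIN = [1.0, 2.0, 3.0], [1.5, 2.5, 3.5]
X_VAL, Y_VAL = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]

SXY = sum(x * y for x, y in zip(X_TRAIN, Y_TRAIN))
SXX = sum(x * x for x in X_TRAIN)


def fit_ridge(lam):
    """Closed-form ridge weight for regularization strength lam."""
    return SXY / (SXX + lam)


def validation_loss(w):
    """Sum of squared errors on the validation set."""
    return sum((y - w * x) ** 2 for x, y in zip(X_VAL, Y_VAL))


def hypergradient(lam):
    """d(validation loss)/d(lam) via the chain rule through the fitted weight."""
    w = fit_ridge(lam)
    dloss_dw = -2.0 * sum(x * (y - w * x) for x, y in zip(X_VAL, Y_VAL))
    dw_dlam = -SXY / (SXX + lam) ** 2
    return dloss_dw * dw_dlam


def tune_lambda(lam0=1.0, step=1.0, n_steps=100):
    """Gradient descent on theta = log(lam), which keeps lam positive."""
    theta = math.log(lam0)
    for _ in range(n_steps):
        lam = math.exp(theta)
        theta -= step * lam * hypergradient(lam)   # d/dtheta = lam * d/dlam
    return math.exp(theta)


lam_opt = tune_lambda()
print(lam_opt, validation_loss(fit_ridge(lam_opt)))
```

For this toy data the validation loss is zero when the fitted weight equals 1, which happens at lam = SXY - SXX = 3, so the loop should approach that value. When no closed form exists, the same derivative can be obtained by differentiating through the training iterations with automatic differentiation.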

Evolutionary optimization

Evolutionary optimization is a methodology for the global optimization of noisy black-box functions. In hyperparameter optimization, evolutionary optimization uses evolutionary algorithms to search the space of hyperparameters for a given algorithm.^[7] Evolutionary hyperparameter optimization follows a process inspired by the biological concept of evolution:

1. Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+).
2. Evaluate the hyperparameter tuples and compute their fitness (e.g., 10-fold cross-validation accuracy of the machine learning algorithm with those hyperparameters).
3. Rank the hyperparameter tuples by their relative fitness.
4. Replace the worst-performing hyperparameter tuples with new hyperparameter tuples generated through crossover and mutation.
5. Repeat steps 2-4 until satisfactory algorithm performance is reached or algorithm performance is no longer improving.
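The steps above can be sketched as follows. The hyperparameter names, their bounds, the fitness function, the small population size and the fixed number of generations (used instead of a convergence test) are all assumptions chosen to keep the example short.

```python
import random

# Hypothetical search space: exponent of the learning rate and a dropout rate.
BOUNDS = {"learning_rate_exp": (-5.0, 0.0), "dropout": (0.0, 0.9)}


def fitness(ind):
    """Hypothetical fitness (higher is better), standing in for CV accuracy."""
    return -((ind["learning_rate_exp"] + 2.0) ** 2) - (ind["dropout"] - 0.3) ** 2


def random_individual(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(a, b, rng):
    """Uniform crossover: each hyperparameter comes from one of the parents."""
    return {k: rng.choice((a[k], b[k])) for k in BOUNDS}


def mutate(ind, rng, scale=0.1):
    """Add Gaussian noise to each value and clip it to the bounds."""
    child = {}
    for k, (lo, hi) in BOUNDS.items():
        value = ind[k] + rng.gauss(0.0, scale * (hi - lo))
        child[k] = min(max(value, lo), hi)
    return child


def evolve(pop_size=20, n_generations=30, keep=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: initial population of random hyperparameter tuples.
    population = [random_individual(rng) for _ in range(pop_size)]
    n_keep = max(2, int(pop_size * keep))
    for _ in range(n_generations):          # Step 5: repeat steps 2-4.
        # Steps 2 and 3: evaluate fitness and rank.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:n_keep]
        # Step 4: replace the worst tuples with offspring of the survivors.
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the best tuples always survive unchanged, the best fitness in the population never decreases from one generation to the next.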

Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms,^[7] automated machine learning, deep neural network architecture search,^[15]^[16] as well as training of the weights in deep neural networks.^[17]

Others

Radial basis function (RBF)^[18] and spectral^[19] approaches have also been developed.

Open-source software

Grid search

scikit-learn is a Python package which includes grid search.
Talos includes grid search for Keras.

Random search

hyperopt, also via hyperas and hyperopt-sklearn, are Python packages which include random search.
scikit-learn is a Python package which includes random search.
Talos includes a customizable random search for Keras.

Bayesian

Auto-WEKA^[20] is a Bayesian hyperparameter optimization layer on top of WEKA.
Auto-sklearn^[21] is a Bayesian hyperparameter optimization layer on top of scikit-learn.
BOCS is a Matlab package which uses semidefinite programming for minimizing a black-box function over discrete inputs.^[22] A Python 3 implementation is also included.
HpBandSter is a Python package which combines Bayesian optimization with bandit-based methods.^[23]
mlrMBO, also with mlr, is an R package for model-based/Bayesian optimization of black-box functions.
scikit-optimize is a Python package for sequential model-based optimization with a scipy.optimize interface.^[24]
SMAC is a Python/Java library implementing Bayesian optimization.^[25]
tuneRanger is an R package for tuning random forests using model-based optimization.

Evolutionary

deap is a Python framework for general evolutionary computation which is flexible and integrates with parallelization packages like scoop and pyspark, and other Python frameworks like sklearn via sklearn-deap.
devol is a Python package that performs Deep Neural Network architecture search using genetic programming.
nevergrad^[26] is a Python package which includes population control methods and particle swarm optimization.^[27]

Other

Harmonica is a Python package for spectral hyperparameter optimization.^[19]
hyperopt, also via hyperas and hyperopt-sklearn, are Python packages which include Tree of Parzen Estimators based distributed hyperparameter optimization.
nevergrad^[26] is a Python package for gradient-free optimization using techniques such as differential evolution, sequential quadratic programming, fastGA, covariance matrix adaptation, population control methods, and particle swarm optimization.^[27]
pycma is a Python implementation of Covariance Matrix Adaptation Evolution Strategy.
rbfopt is a Python package that uses a radial basis function model.^[18]

Commercial services

BigML OptiML supports mixed search domains.
Google HyperTune supports mixed search domains.
Indie Solver supports multiobjective, multifidelity and constraint optimization.
Mind Foundry OPTaaS supports mixed search domains, multiobjective, constraints, parallel optimization and surrogate models.
SigOpt supports mixed search domains, multiobjective, multisolution, multifidelity, constraint (linear and black-box), and parallel optimization.

References

[1] arXiv:1502.02127.
[2] James Bergstra, Yoshua Bengio: Random Search for Hyper-Parameter Optimization. In: J. Machine Learning Research. Vol. 13, 2012, pp. 281–305 (mit.edu [PDF]).
[3] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin: A practical guide to support vector classification. Technical Report, National Taiwan University, 2010.
[4] Chicco D: Ten quick tips for machine learning in computational biology. In: BioData Mining. Vol. 10, no. 35, December 2017, p. 35, doi:10.1186/s13040-017-0155-3, PMID 29234465, PMC 5721660 (free full text).
[5] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, Nando de Freitas: Bayesian Optimization in a Billion Dimensions via Random Embeddings. In: Journal of Artificial Intelligence Research. Vol. 55, 2016, pp. 361–387, doi:10.1613/jair.4806 (jair.org).
[6] (citation text missing; reference key: hutter)
[7] (citation text missing; reference key: bergstra11)
[8] Jasper Snoek, Hugo Larochelle, Ryan Adams: Practical Bayesian Optimization of Machine Learning Algorithms. In: Advances in Neural Information Processing Systems. 2012, arxiv:1206.2944, bibcode:2012arXiv1206.2944S (nips.cc [PDF]).
[9] Chris Thornton, Frank Hutter, Holger Hoos, Kevin Leyton-Brown: Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Knowledge Discovery and Data Mining. 2013, arxiv:1208.3719, bibcode:2012arXiv1208.3719T (ubc.ca [PDF]).
[10] Jan Larsen, Lars Kai Hansen, Claus Svarer, M Ohlsson: Design and regularization of neural networks: the optimal use of a validation set. In: Proceedings of the 1996 IEEE Signal Processing Society Workshop. 1996.
[11] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, Sayan Mukherjee: Choosing multiple parameters for support vector machines. In: Machine Learning. Vol. 46, 2002, pp. 131–159, doi:10.1023/a:1012450327387 (chapelle.cc [PDF]).
[12] Chuong B, Chuan-Sheng Foo, Andrew Y Ng: Efficient multiple hyperparameter learning for log-linear models. In: Advances in Neural Information Processing Systems 20. 2008.
[13] Justin Domke: Generic Methods for Optimization-Based Modeling. In: AISTATS. Vol. 22, 2012 (jmlr.org [PDF]).
[14] arXiv:1502.03492.
[15] (arXiv citation text missing; reference key: miikkulainen1)
[16] (arXiv citation text missing; reference key: jaderberg1)
[17] (arXiv citation text missing; reference key: such1)
[18] arXiv:1705.08520.
[19] arXiv:1706.00764.
[20] Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K: Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. In: Journal of Machine Learning Research. Vol. 18, no. 25, 2017, pp. 1–5 (jmlr.org).
[21] Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F: Efficient and Robust Automated Machine Learning. In: Advances in Neural Information Processing Systems 28 (NIPS 2015). 2015, pp. 2962–2970 (nips.cc).
[22] arXiv:1806.08838.
[23] arXiv:1807.01774.
[24] skopt module. https://scikit-optimize.github.io/
[25] Hutter F, Hoos HH, Leyton-Brown K: Sequential Model-Based Optimization for General Algorithm Configuration. In: Proceedings of the Conference on Learning and Intelligent Optimization (LION 5). (ubc.ca [PDF]).
[26] Nevergrad: How to use to optimize NN hyperparameters.
[27] Nevergrad: An open source tool for derivative-free optimization.
