Learning to learn by gradient descent by gradient descent

Reference: Andrychowicz, Marcin, et al. "Learning to learn by gradient descent by gradient descent." Advances in Neural Information Processing Systems. 2016.
The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
In this work we take a different tack and instead propose to replace hand-designed update rules with a learned update rule, which we call the optimizer g, specified by its own set of parameters φ.
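The abstraction above can be sketched in a few lines: the parameters are updated as theta_{t+1} = theta_t + g(grad f(theta_t); phi), where g is the learned optimizer with its own parameters phi. In the paper g is an RNN; here, as a minimal illustration, g is a stand-in function (the names `optimize`, `g`, and `phi` are hypothetical, not from the paper):

```python
# Sketch of the learned-update-rule abstraction:
#   theta_{t+1} = theta_t + g(grad f(theta_t); phi)
# g is the "optimizer" with parameters phi; in the paper g is an LSTM.

def optimize(f_grad, theta, g, phi, steps):
    """Run the (possibly learned) update rule g for a number of steps."""
    for _ in range(steps):
        grad = f_grad(theta)
        update = g(grad, phi)          # g proposes the update from the gradient
        theta = [t + u for t, u in zip(theta, update)]
    return theta

# Trivial instance: g just scales the gradient, so phi plays the role of a
# learning rate and we recover plain gradient descent.
g = lambda grad, phi: [-phi * gi for gi in grad]
f_grad = lambda th: [2 * th[0]]        # gradient of f(x) = x^2
theta = optimize(f_grad, [3.0], g, phi=0.1, steps=50)
```

Plain gradient descent is the special case where g ignores history and multiplies by a constant; the point of the paper is to replace this fixed g with a trainable one.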
1.1 Transfer learning and generalization
The goal of this work is to develop a procedure for constructing a learning algorithm which performs well on a particular class of optimization problems.
1.2 A brief history and related work
2. Learning to learn with recurrent neural networks
2.1 Coordinatewise LSTM optimizer
- Optimizing at this scale with a fully connected RNN is not feasible, as it would require a huge hidden state and an enormous number of parameters. To avoid this difficulty, the paper uses an optimizer m that operates coordinatewise on the parameters of the objective function, similar to common update rules like RMSprop and ADAM: the same small network (with shared parameters) is applied to every coordinate, each with its own hidden state.
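The coordinatewise structure can be illustrated without an LSTM at all: RMSprop already has exactly this shape, with a scalar hidden state (a running average of squared gradients) per coordinate and the same update rule shared across all coordinates. A minimal sketch (function names are my own, not from the paper; the paper replaces the hand-designed rule `m_rmsprop` below with a learned two-layer LSTM):

```python
# Coordinatewise optimizer sketch: one small update rule m, with SHARED
# parameters, is applied independently to every coordinate; each coordinate
# carries its own hidden state. RMSprop is a hand-designed instance of this
# pattern; the paper learns m instead (as an LSTM).

def m_rmsprop(grad_i, state_i, lr=0.01, decay=0.9, eps=1e-8):
    """Update rule for ONE coordinate: returns (step, new hidden state)."""
    state_i = decay * state_i + (1 - decay) * grad_i ** 2
    return -lr * grad_i / (state_i ** 0.5 + eps), state_i

def coordinatewise_step(theta, grads, states):
    """Apply the same rule m to every coordinate, each with its own state."""
    new_theta, new_states = [], []
    for t, g, s in zip(theta, grads, states):
        step, s = m_rmsprop(g, s)
        new_theta.append(t + step)
        new_states.append(s)
    return new_theta, new_states

# Minimize f(x1, x2) = x1^2 + x2^2 with the coordinatewise rule.
theta, states = [3.0, -2.0], [0.0, 0.0]
for _ in range(500):
    grads = [2 * x for x in theta]
    theta, states = coordinatewise_step(theta, grads, states)
```

Because the parameters of m are shared across coordinates, the number of optimizer parameters is independent of the size of the optimizee, which is what makes the approach feasible at the scale of neural network training.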
- In practice, rescaling the inputs and outputs of the LSTM optimizer by suitable constants (shared across all timesteps and functions f) is sufficient to cope with the widely varying magnitudes of the gradients the optimizer sees.