Behl, Baydin and Torr
to problems beyond toy scales, due to the difficulty of assessing whether MAML is unsuitable for a complex task or whether the hyperparameters are insufficiently tuned.
In this paper, we provide a conceptually simple solution to this problem, by introducing an extension of the MAML algorithm that incorporates adaptive tuning of both the learning rate α and the meta-learning rate β. Our aim is to make it possible to use MAML with no or significantly less hyperparameter tuning, and thus to reduce the need for grid search. We also
aim to make the algorithm converge in fewer iterations. The solution we propose is based
on the hypergradient descent (HD) algorithm (Baydin et al., 2018), which automatically
updates a learning rate by performing gradient descent on the learning rate alongside original
optimization steps. The proposed algorithm does not need any extra gradient computations,
and just involves storing the gradients from the previous optimization step.
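Concretely, the HD rule nudges the learning rate in proportion to the dot product of the current and previous gradients. The sketch below illustrates this on a generic gradient function; the function name, the toy quadratic demo, and the hyperparameter values are our own illustrative choices, not prescriptions from the original algorithm:

```python
import numpy as np

def hypergradient_descent(grad_f, theta, alpha=0.01, beta=1e-4, steps=100):
    """Minimal sketch of hypergradient descent (Baydin et al., 2018).

    The learning rate alpha is itself updated by gradient descent, using
    only the stored gradient from the previous step (no extra gradient
    computations are required).
    """
    prev_grad = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        # Hypergradient: d(loss)/d(alpha) = -g . prev_grad, so gradient
        # descent on alpha adds beta * (g . prev_grad).
        alpha = alpha + beta * np.dot(g, prev_grad)
        theta = theta - alpha * g
        prev_grad = g
    return theta, alpha

# Toy demo: minimize f(x) = ||x||^2 (gradient 2x) from a fixed start.
theta_final, alpha_final = hypergradient_descent(
    lambda x: 2.0 * x, np.array([5.0, 5.0]))
```

On this quadratic, α grows while successive gradients stay aligned and then settles, so convergence is faster than with the initial fixed learning rate.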
2. Related Work
Our work is primarily related to the subfields of hyperparameter optimization and meta-learning. Hyperparameter optimization typically uses parallel runs to populate a selected grid of hyperparameter values (e.g., a range of learning rates), or more advanced techniques such as Bayesian optimization (Snoek et al., 2012) and model-based approaches (Bergstra et al., 2013; Hutter et al., 2013). An interesting line of research, which also inspired our approach in this paper, is to use gradient-based optimization for the tuning of hyperparameters (Bengio, 2000). Recent work in this area includes reversible learning (Maclaurin et al., 2015), which allows gradient-based optimization of hyperparameters through a training run consisting of multiple iterations, and hypergradient descent (Baydin et al., 2018), which achieves a similar optimization in an online, per-gradient-update fashion.
Meta-learning is often referred to as “learning to learn” (Thrun and Pratt, 2012), meaning that a learning procedure (most often gradient-based) is able to improve aspects of the learning process itself, such as the optimizer, hyperparameters like the learning rate, and initializations. In this sense, the “meta” concept of meta-learning has aspects in common with hyperparameter optimization. The MAML model (Finn et al., 2017), on which we base our method, relies on meta-optimization through gradient descent in a model-agnostic way. Another recent method, Meta-SGD (Li et al., 2017), performs online optimization of a per-parameter learning rate vector α, to which the authors refer as learning both the learning rate and the update direction (the per-parameter nature being able to modify the direction), and of the model parameters θ, using a single hyper-learning rate β. Our work differs from Meta-SGD in that we perform simultaneous online optimization of both MAML learning rates α and β, which are both scalars.
3. Model-Agnostic Meta-Learning (MAML)
The MAML algorithm, given model parameters θ, aims to adapt to a new task T_t with SGD:

$$\theta'_t = \theta - \alpha \nabla_{\theta}\, \mathcal{L}_{\mathcal{T}_{\mathrm{train}(t)}}(f_{\theta}) \,, \qquad (1)$$
where t is the task number and α is the learning rate. T_train(t) and T_test(t) denote the training and test sets within task t. The tasks are sampled from a task distribution p(T). The meta-objective is:
$$\min_{\theta}\; \mathcal{L}_{\mathcal{T}_{\mathrm{test}(t)}}\big(f_{\theta'_t}\big) = \mathcal{L}_{\mathcal{T}_{\mathrm{test}(t)}}\Big(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_{\mathrm{train}(t)}}(f_{\theta})}\Big) \qquad (2)$$
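To make Eqs. (1)-(2) concrete, the following sketch runs the inner adaptation step and the resulting meta-update on toy quadratic tasks, for which the meta-gradient of Eq. (2) is available in closed form. The quadratic losses, task construction, and all names here are our own illustrative assumptions, not the tasks or losses used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def adapted_params(theta, alpha, c_train):
    # Inner adaptation, Eq. (1): for the toy loss L_t(theta) = ||theta - c_t||^2,
    # the gradient is 2 (theta - c_t), so theta'_t = theta - 2 alpha (theta - c_t).
    return theta - alpha * 2.0 * (theta - c_train)

def maml_meta_step(theta, alpha, beta, tasks):
    # Outer update on the meta-objective of Eq. (2), averaged over tasks.
    # For this quadratic, d(theta'_t)/d(theta) = (1 - 2 alpha) I, so the
    # per-task meta-gradient is 2 (theta'_t - c_test) (1 - 2 alpha).
    meta_grad = np.zeros_like(theta)
    for c_train, c_test in tasks:
        adapted = adapted_params(theta, alpha, c_train)
        meta_grad += 2.0 * (adapted - c_test) * (1.0 - 2.0 * alpha)
    return theta - beta * meta_grad / len(tasks)

# Toy demo: eight tasks, each a (train optimum, test optimum) pair.
tasks = [(rng.normal(size=2), rng.normal(size=2)) for _ in range(8)]
theta = np.array([5.0, -3.0])
alpha, beta = 0.1, 0.05
before = np.mean([np.sum((adapted_params(theta, alpha, ctr) - cte) ** 2)
                  for ctr, cte in tasks])
for _ in range(200):
    theta = maml_meta_step(theta, alpha, beta, tasks)
after = np.mean([np.sum((adapted_params(theta, alpha, ctr) - cte) ** 2)
                 for ctr, cte in tasks])
```

After meta-training, the initialization θ yields a lower average post-adaptation test loss across tasks than the starting point did, which is exactly the objective of Eq. (2).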