Meta-Learning in Neural Networks: A Survey
Timothy Hospedales, Antreas Antoniou, Paul Micaelli, Amos Storkey
Abstract—The field of meta-learning, or learning-to-learn, has seen a dramatic rise in interest in recent years. Contrary to
conventional approaches to AI where tasks are solved from scratch using a fixed learning algorithm, meta-learning aims to improve the
learning algorithm itself, given the experience of multiple learning episodes. This paradigm provides an opportunity to tackle many
conventional challenges of deep learning, including data and computation bottlenecks, as well as generalization. This survey describes
the contemporary meta-learning landscape. We first discuss definitions of meta-learning and position it with respect to related fields,
such as transfer learning and hyperparameter optimization. We then propose a new taxonomy that provides a more comprehensive
breakdown of the space of meta-learning methods today. We survey promising applications and successes of meta-learning such as
few-shot learning and reinforcement learning. Finally, we discuss outstanding challenges and promising areas for future research.
Index Terms—Meta-Learning, Learning-to-Learn, Few-Shot Learning, Transfer Learning, Neural Architecture Search
1 INTRODUCTION
Contemporary machine learning models are typically
trained from scratch for a specific task using a fixed learn-
ing algorithm designed by hand. Deep learning-based ap-
proaches specifically have seen great successes in a variety
of fields [1]–[3]. However there are clear limitations [4]. For
example, successes have largely been in areas where vast
quantities of data can be collected or simulated, and where
huge compute resources are available. This excludes many
applications where data is intrinsically rare or expensive [5],
or compute resources are unavailable [6].
Meta-learning provides an alternative paradigm where
a machine learning model gains experience over multiple
learning episodes – often covering a distribution of related
tasks – and uses this experience to improve its future
learning performance. This ‘learning-to-learn’ [7] can lead
to a variety of benefits such as data and compute efficiency,
and it is better aligned with human and animal learning [8],
where learning strategies improve both on a lifetime and
evolutionary timescales [8]–[10].
Historically, the success of machine learning was driven
by the choice of hand-engineered features [11], [12]. Deep
learning realised the promise of joint feature and model
learning [13], providing a huge improvement in perfor-
mance for many tasks [1], [3]. Meta-learning in neural
networks can be seen as aiming to provide the next step
of integrating joint feature, model, and algorithm learning.
Neural network meta-learning has a long history [7],
[14], [15]. However, its potential as a driver to advance the
frontier of the contemporary deep learning industry has
led to an explosion of recent research. In particular meta-
learning has the potential to alleviate many of the main
criticisms of contemporary deep learning [4], for instance
by improving data efficiency, knowledge transfer and un-
supervised learning. Meta-learning has proven useful both
in multi-task scenarios where task-agnostic knowledge is
T. Hospedales is with Samsung AI Centre, Cambridge and University of Edin-
burgh. A. Antoniou, P. Micaelli and Storkey are with University of Edinburgh.
Email: {t.hospedales,a.antoniou,paul.micaelli,a.storkey}@ed.ac.uk.
extracted from a family of tasks and used to improve learn-
ing of new tasks from that family [7], [16]; and single-task
scenarios where a single problem is solved repeatedly and
improved over multiple episodes [17]–[19]. Successful appli-
cations have been demonstrated in areas spanning few-shot
image recognition [16], [20], unsupervised learning [21],
data efficient [22], [23] and self-directed [24] reinforcement
learning (RL), hyperparameter optimization [17], and neural
architecture search (NAS) [18], [25], [26].
Many perspectives on meta-learning can be found in
the literature, in part because different communities use the
term differently. Thrun [7] operationally defines learning-to-
learn as occurring when a learner’s performance at solving
tasks drawn from a given task family improves with respect
to the number of tasks seen (cf. conventional machine
learning, where performance improves as more data from a single
task is seen). This perspective [27]–[29] views meta-learning
as a tool to manage the ‘no free lunch’ theorem [30] and im-
prove generalization by searching for the algorithm (induc-
tive bias) that is best suited to a given problem, or problem
family. However, this definition can include transfer, multi-
task, feature-selection, and model-ensemble learning, which
are not typically considered as meta-learning today. Another
usage of meta-learning [31] deals with algorithm selection
based on dataset features, and becomes hard to distinguish
from automated machine learning (AutoML) [32], [33].
In this paper, we focus on contemporary neural-network
meta-learning. We take this to mean algorithm learning as
per [27], [28], but focus specifically on where this is achieved
by end-to-end learning of an explicitly defined objective func-
tion (such as cross-entropy loss). Additionally we consider
single-task meta-learning, and discuss a wider variety of
(meta) objectives such as robustness and compute efficiency.
This paper thus provides a unique, timely, and up-to-
date survey of the rapidly growing area of neural network
meta-learning. In contrast, previous surveys are rather out
of date and/or focus on algorithm selection for data mining
[27], [31], [34], [35], AutoML [32], [33], or particular appli-
cations of meta-learning such as few-shot learning [36] or
neural architecture search [37].
arXiv:2004.05439v2 [cs.LG] 7 Nov 2020
We address both meta-learning methods and applica-
tions. We first introduce meta-learning through a high-level
problem formalization that can be used to understand and
position work in this area. We then provide a new taxonomy
in terms of meta-representation, meta-objective and meta-
optimizer. This framework provides a design-space for de-
veloping new meta learning methods and customizing them
for different applications. We survey several popular and
emerging application areas including few-shot, reinforce-
ment learning, and architecture search; and position meta-
learning with respect to related topics such as transfer and
multi-task learning. We conclude by discussing outstanding
challenges and areas for future research.
2 BACKGROUND
Meta-learning is difficult to define, having been used in var-
ious inconsistent ways, even within contemporary neural-
network literature. In this section, we introduce our defini-
tion and key terminology, and then position meta-learning
with respect to related topics.
Meta-learning is most commonly understood as learn-
ing to learn, which refers to the process of improving a
learning algorithm over multiple learning episodes. In con-
trast, conventional ML improves model predictions over
multiple data instances. During base learning, an inner
(or lower/base) learning algorithm solves a task such as
image classification [13], defined by a dataset and objective.
During meta-learning, an outer (or upper/meta) algorithm
updates the inner learning algorithm such that the model
it learns improves an outer objective. For instance this
objective could be generalization performance or learning
speed of the inner algorithm. Learning episodes of the base
task, namely (base algorithm, trained model, performance)
tuples, can be seen as providing the instances needed by the
outer algorithm to learn the base learning algorithm.
As defined above, many conventional algorithms such
as random search of hyper-parameters by cross-validation
could fall within the definition of meta-learning. The
salient characteristic of contemporary neural-network meta-
learning is an explicitly defined meta-level objective, and end-
to-end optimization of the inner algorithm with respect to
this objective. Often, meta-learning is conducted on learning
episodes sampled from a task family, leading to a base
learning algorithm that performs well on new tasks sampled
from this family. However, in a limiting case all training
episodes can be sampled from a single task. In the following
section, we introduce these notions more formally.
2.1 Formalizing Meta-Learning
Conventional Machine Learning In conventional supervised machine learning, we are given a training dataset D = {(x_1, y_1), . . . , (x_N, y_N)}, such as (input image, output label) pairs. We can train a predictive model ŷ = f_θ(x), parameterized by θ, by solving:

θ* = arg min_θ L(D; θ, ω)    (1)

where L is a loss function that measures the error between
true labels and those predicted by f_θ(·). The conditioning on
ω denotes the dependence of this solution on assumptions
about ‘how to learn’, such as the choice of optimizer for θ
or function class for f . Generalization is then measured by
evaluating a number of test points with known labels.
The conventional assumption is that this optimization is
performed from scratch for every problem D; and that ω is
pre-specified. However, the specification of ω can drastically
affect performance measures like accuracy or data efficiency.
Meta-learning seeks to improve these measures by learning
the learning algorithm itself, rather than assuming it is pre-
specified and fixed. This is often achieved by revisiting the
first assumption above, and learning from a distribution of
tasks rather than from scratch.
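As a deliberately minimal illustration of Eq. 1, the sketch below fits θ for a single task by gradient descent, with ω (the model class, loss, and optimizer) fixed by hand. The linear model, learning rate, and data are illustrative assumptions, not from the survey.

```python
# Eq. 1 sketch: conventional supervised learning fits theta for one task,
# with omega (model class, optimizer, learning rate) pre-specified by hand.

def fit(dataset, lr=0.1, steps=200):
    """Gradient descent on the squared loss L(D; theta) for y_hat = theta * x."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in dataset) / len(dataset)
        theta -= lr * grad
    return theta

# A single task: y = 3x (noise-free for clarity).
D = [(x, 3.0 * x) for x in [-2.0, -1.0, 1.0, 2.0]]
theta_star = fit(D)  # theta converges toward 3.0
```

Every choice hard-coded above (the linear function class, the squared loss, the step size) is part of ω; meta-learning asks how such choices can themselves be learned.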
Meta-Learning: Task-Distribution View A common view
of meta-learning is to learn a general purpose learning algo-
rithm that can generalize across tasks, and ideally enable
each new task to be learned better than the last. We can
evaluate the performance of ω over a distribution of tasks
p(T ). Here we loosely define a task to be a dataset and loss
function T = {D, L}. Learning how to learn thus becomes
min_ω E_{T∼p(T)} L(D; ω)    (2)
where L(D; ω) measures the performance of a model
trained using ω on dataset D. ‘How to learn’, i.e. ω, is often
referred to as across-task knowledge or meta-knowledge.
To solve this problem in practice, we often assume access
to a set of source tasks sampled from p(T ). Formally, we
denote the set of M source tasks used in the meta-training stage as D_source = {(D^train_source, D^val_source)^(i)}_{i=1}^{M}, where each task has both training and validation data. Often, the source train and validation datasets are respectively called support and query sets. The meta-training step of ‘learning how to learn’ can be written as:

ω* = arg max_ω log p(ω | D_source)    (3)
Now we denote the set of Q target tasks used in the meta-testing stage as D_target = {(D^train_target, D^test_target)^(i)}_{i=1}^{Q}, where each task has both training and test data. In the meta-testing stage we use the learned meta-knowledge ω* to train the base model on each previously unseen target task i:

θ*(i) = arg max_θ log p(θ | ω*, D^train(i)_target)    (4)
In contrast to conventional learning in Eq. 1, learning on the training set of a target task i now benefits from meta-knowledge ω* about the algorithm to use. This could be an estimate of the initial parameters [16], or an entire learning model [38] or optimization strategy [39]. We can evaluate the accuracy of our meta-learner by the performance of θ*(i) on the test split of each target task, D^test(i)_target.
This setup leads to analogies of conventional underfit-
ting and overfitting: meta-underfitting and meta-overfitting. In
particular, meta-overfitting is an issue whereby the meta-
knowledge learned on the source tasks does not generalize
to the target tasks. It is relatively common, especially in
the case where only a small number of source tasks are
available. It can be seen as learning an inductive bias ω
that constrains the hypothesis space of θ too tightly around
solutions to the source tasks.
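The meta-training and meta-testing stages of Eqs. 3–4 can be made concrete with a toy sketch. Everything task-specific below — the offset-regression tasks, the shrinkage learner, and the grid search standing in for the maximization in Eq. 3 — is an illustrative assumption, not from the survey.

```python
# Toy meta-train / meta-test protocol (Eqs. 3-4). Tasks are 1-D offset
# regression y = x + c; the meta-knowledge omega is a prior mean for the
# offset, and the base learner shrinks its estimate toward omega.

def base_learn(support, omega, strength=1.0):
    """Meta-test inner step (Eq. 4): MAP-style estimate of the task offset."""
    residuals = [y - x for x, y in support]
    return (sum(residuals) + strength * omega) / (len(residuals) + strength)

def task_loss(theta, data):
    return sum((x + theta - y) ** 2 for x, y in data) / len(data)

# Source tasks: (support, query) pairs with offsets clustered around 5.0.
source = [([(0.0, c)], [(1.0, 1.0 + c), (2.0, 2.0 + c)])
          for c in (4.5, 5.0, 5.5)]

# Meta-training (Eq. 3): choose omega minimizing total query loss.
omega_star = min(
    (i * 0.1 for i in range(101)),
    key=lambda w: sum(task_loss(base_learn(s, w), q) for s, q in source),
)

# Meta-testing: one-shot adaptation to an unseen task with offset 5.2.
theta_new = base_learn([(0.0, 5.2)], omega_star)
```

Meta-overfitting in this picture corresponds to ω fitting the three source offsets so tightly that the shrinkage hurts, rather than helps, on target tasks whose offsets fall outside the source cluster.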
Meta-Learning: Bilevel Optimization View The previous
discussion outlines the common flow of meta-learning in a
multiple task scenario, but does not specify how to solve
the meta-training step in Eq. 3. This is commonly done
by casting the meta-training step as a bilevel optimization
problem. While this picture is arguably only accurate for
the optimizer-based methods (see section 3.1), it is helpful
to visualize the mechanics of meta-learning more generally.
Bilevel optimization [40] refers to a hierarchical optimiza-
tion problem, where one optimization contains another
optimization as a constraint [17], [41]. Using this notation,
meta-training can be formalised as follows:
ω* = arg min_ω Σ_{i=1}^{M} L^meta(θ*(i)(ω), ω, D^val(i)_source)    (5)

s.t. θ*(i)(ω) = arg min_θ L^task(θ, ω, D^train(i)_source)    (6)
where L^meta and L^task refer to the outer and inner objectives respectively, such as cross entropy in the case of few-shot classification. Note the leader-follower asymmetry between the outer and inner levels: the inner-level optimization in Eq. 6 is conditional on the learning strategy ω defined by the outer level, but it cannot change ω during its training. Here ω could indicate an initial condition in non-convex optimization [16], a hyper-parameter such as regularization strength [17], or even a parameterization of the loss function to optimize L^task [42]. Section 4.1 discusses the space of choices for ω in detail. The outer-level optimization learns ω such that it produces models θ*(i)(ω) that perform well on their validation sets after training. Section 4.2 discusses how to optimize ω in detail. In Section 4.3 we consider what L^meta can measure, such as validation performance, learning speed or model robustness.
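A minimal runnable sketch of this bilevel structure, in the style of MAML [16], is given below. The quadratic per-task losses and the coincidence of train and validation sets are simplifying assumptions; the point is that the outer gradient differentiates through the inner update.

```python
# Bilevel sketch of Eqs. 5-6, MAML-style. Toy tasks have inner loss
# L_task_i(theta) = (theta - c_i)^2; the inner loop takes one gradient
# step from the shared initialization omega, and the outer loop
# differentiates through that step to update omega.

ALPHA, BETA = 0.1, 0.05        # inner / outer step sizes (assumptions)
task_optima = [1.0, 2.0, 6.0]  # the c_i, one per source task

def inner_step(omega, c):
    """Eq. 6: one gradient step on L_task from the initialization omega."""
    return omega - ALPHA * 2 * (omega - c)

omega = 0.0
for _ in range(500):
    # Eq. 5: d L_meta / d omega, using d theta_i / d omega = 1 - 2*ALPHA,
    # obtained by differentiating through the inner update.
    meta_grad = sum(2 * (inner_step(omega, c) - c) * (1 - 2 * ALPHA)
                    for c in task_optima)
    omega -= BETA * meta_grad

# omega converges to the mean of the task optima (3.0): the initialization
# from which a single inner step best serves the whole task family.
```

With one quadratic inner step the Jacobian of the inner update is the constant 1 − 2α, so differentiating through the optimizer reduces to a scalar factor; in deep networks the same chain rule runs through every inner step, which is the source of the compute and memory challenges discussed in Section 6.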
Finally, we note that the above formalization of meta-
training uses the notion of a distribution over tasks. While
common in the meta-learning literature, it is not a necessary
condition for meta-learning. More formally, if we are given
a single train and test dataset (M = Q = 1), we can split
the training set to get validation data, such that D_source = (D^train_source, D^val_source) for meta-training, and for meta-testing we can use D_target = (D^train_source ∪ D^val_source, D^test_target). We still learn ω over several episodes, and different train-val splits are usually used during meta-training.
Meta-Learning: Feed-Forward Model View As we will
see, there are a number of meta-learning approaches that
synthesize models in a feed-forward manner, rather than via
an explicit iterative optimization as in Eqs. 5-6 above. While
they vary in their degree of complexity, it can be instructive
to understand this family of approaches by instantiating the
abstract objective in Eq. 2 to define a toy example for meta-
training linear regression [43].
min_ω E_{T∼p(T)} E_{(D^tr, D^val)∈T} Σ_{(x,y)∈D^val} [ (x^T g_ω(D^tr) − y)^2 ]    (7)
Here we meta-train by optimizing over a distribution of tasks. For each task a train and validation set is drawn. The train set D^tr is embedded [44] into a vector g_ω(D^tr) which defines the linear regression weights used to predict examples x from the validation set. Optimizing Eq. 7 ‘learns to learn’ by training the function g_ω to map a training set to a weight vector. Thus g_ω should provide a good solution for novel meta-test tasks T^te drawn from p(T). Methods in this family vary in the complexity of the predictive model g used and in how the support set is embedded [44] (e.g., by pooling, CNN or RNN). These models are also known as amortized [45], because the cost of learning a new task is reduced to a feed-forward operation through g_ω(·), with the iterative optimization already paid for during meta-training of ω.
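The following sketch instantiates Eq. 7 with the simplest possible embedding — a pooled statistic scaled by a learned scalar. Both the embedding and the toy tasks are assumptions for illustration only.

```python
# Amortized / feed-forward sketch of Eq. 7. g_omega pools the statistic
# x*y over the support set and scales it by a learned scalar omega; the
# result is used directly as the linear regression weight.

def g(omega, d_tr):
    """Feed-forward 'learner': map a support set to a regression weight."""
    return omega * sum(x * y for x, y in d_tr) / len(d_tr)

def val_loss(omega, d_tr, d_val):
    w = g(omega, d_tr)
    return sum((w * x - y) ** 2 for x, y in d_val) / len(d_val)

# Tasks y = a*x; for simplicity support and query share the same inputs.
xs = [1.0, 2.0]
tasks = [[(x, a * x) for x in xs] for a in (1.0, -2.0, 3.0)]

# Meta-train omega by gradient descent (finite differences for brevity).
omega, lr, eps = 0.0, 0.002, 1e-6
for _ in range(200):
    grad = sum((val_loss(omega + eps, d, d) - val_loss(omega - eps, d, d))
               / (2 * eps) for d in tasks)
    omega -= lr * grad

# Meta-test: a new task (slope 5.0) is 'learned' in one forward pass.
w_new = g(omega, [(x, 5.0 * x) for x in xs])
```

At meta-test time no optimization is run at all: the support set is mapped to regression weights in a single forward pass, which is what makes this family ‘amortized’.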
2.2 Historical Context of Meta-Learning
Meta-learning and learning-to-learn first appear in the lit-
erature in 1987 [14]. J. Schmidhuber introduced a family of
methods that can learn how to learn, using self-referential
learning. Self-referential learning involves training neural
networks that can receive as inputs their own weights and
predict updates for said weights. Schmidhuber proposed to
learn the model itself using evolutionary algorithms.
Meta-learning was subsequently extended to multiple
areas. Bengio et al. [46], [47] proposed to meta-learn biologically plausible learning rules. Schmidhuber et al. continued
to explore self-referential systems and meta-learning [48],
[49]. S. Thrun et al. took care to more clearly define the
term learning to learn in [7] and introduced initial theoretical
justifications and practical implementations. Proposals for
training meta-learning systems using gradient descent and
backpropagation were first made in 1991 [50] followed by
more extensions in 2001 [51], [52], with [27] giving an
overview of the literature at that time. Meta-learning was
used in the context of reinforcement learning in 1995 [53],
followed by various extensions [54], [55].
2.3 Related Fields
Here we position meta-learning against related areas whose
relation to meta-learning is often a source of confusion.
Transfer Learning (TL) TL [34], [56] uses past experi-
ence from a source task to improve learning (speed, data
efficiency, accuracy) on a target task. TL refers both to
this problem area and family of solutions, most commonly
parameter transfer plus optional fine tuning [57] (although
there are numerous other approaches [34]).
In contrast, meta-learning refers to a paradigm that can
be used to improve TL as well as other problems. In TL
the prior is extracted by vanilla learning on the source task
without the use of a meta-objective. In meta-learning, the
corresponding prior would be defined by an outer optimization that evaluates the benefit of the prior when learning a new task, as illustrated by MAML [16]. More generally,
meta-learning deals with a much wider range of meta-
representations than solely model parameters (Section 4.1).
Domain Adaptation (DA) and Domain Generalization
(DG) Domain-shift refers to the situation where source
and target problems share the same objective, but the input
distribution of the target task is shifted with respect to the
source task [34], [58], reducing model performance. DA is
a variant of transfer learning that attempts to alleviate this
issue by adapting the source-trained model using sparse or
unlabeled data from the target. DG refers to methods to train
a source model to be robust to such domain-shift without
further adaptation. Many knowledge transfer methods have
been studied [34], [58] to boost target domain performance.
However, as for TL, vanilla DA and DG don’t use a meta-
objective to optimize ‘how to learn’ across domains. Mean-
while, meta-learning methods can be used to perform both
DA [59] and DG [42] (see Sec. 5.8).
Continual learning (CL) Continual or lifelong learning
[60]–[62] refers to the ability to learn on a sequence of tasks
drawn from a potentially non-stationary distribution, and
in particular seeks to do so while accelerating the learning of new
tasks and without forgetting old tasks. Similarly to meta-
learning, a task distribution is considered, and the goal is
partly to accelerate learning of a target task. However most
continual learning methodologies are not meta-learning
methodologies since this meta objective is not solved for
explicitly. Nevertheless, meta-learning provides a potential
framework to advance continual learning, and a few recent
studies have begun to do so by developing meta-objectives
that encode continual learning performance [63]–[65].
Multi-Task Learning (MTL) aims to jointly learn sev-
eral related tasks, to benefit from regularization due to
parameter sharing and the diversity of the resulting shared
representation [66]–[68], as well as compute/memory sav-
ings. Like TL, DA, and CL, conventional MTL is a single-
level optimization without a meta-objective. Furthermore,
the goal of MTL is to solve a fixed number of known tasks,
whereas the point of meta-learning is often to solve unseen
future tasks. Nonetheless, meta-learning can be brought in
to benefit MTL, e.g. by learning the relatedness between
tasks [69], or how to prioritise among multiple tasks [70].
Hyperparameter Optimization (HO) is within the remit
of meta-learning, in that hyperparameters like learning rate
or regularization strength describe ‘how to learn’. Here we
include HO tasks that define a meta objective that is trained
end-to-end with neural networks, such as gradient-based
hyperparameter learning [69], [71] and neural architecture
search [18]. But we exclude other approaches like random
search [72] and Bayesian Hyperparameter Optimization
[73], which are rarely considered to be meta-learning.
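As a toy example of gradient-based hyperparameter learning in the spirit of [17], [71]: 1-D ridge regression, where the inner problem has a closed form and the hypergradient of the validation loss with respect to the regularization strength is approximated by finite differences. The data and step sizes below are assumptions for illustration.

```python
# Gradient-based hyperparameter optimization sketch: learn the ridge
# regularization strength lam end-to-end against a validation objective.

train = [(1.0, 2.0)]             # one noisy training point
val = [(1.0, 1.0), (2.0, 2.0)]   # clean validation data (true slope 1)

def inner_solve(lam):
    """Closed-form inner loop: argmin_theta sum (theta*x - y)^2 + lam*theta^2."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, y in train)
    return sxy / (sxx + lam)

def val_loss(lam):
    theta = inner_solve(lam)
    return sum((theta * x - y) ** 2 for x, y in val)

lam, lr, eps = 0.0, 0.05, 1e-6
for _ in range(500):
    # Hypergradient of the outer (validation) objective w.r.t. lam.
    hypergrad = (val_loss(lam + eps) - val_loss(lam - eps)) / (2 * eps)
    lam = max(0.0, lam - lr * hypergrad)

# lam converges to 1.0: the shrinkage that corrects the noisy training point.
```

Here the inner solution is available in closed form, so the hypergradient is cheap; the methods cited above instead differentiate through (or approximate) an iterative inner optimization.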
Hierarchical Bayesian Models (HBM) involve Bayesian
learning of parameters θ under a prior p(θ|ω). The prior
is written as a conditional density on some other variable
ω which has its own prior p(ω). Hierarchical Bayesian
models feature strongly as models for grouped data D = {D_i | i = 1, 2, . . . , M}, where each group i has its own θ_i. The full model is [∏_{i=1}^{M} p(D_i | θ_i) p(θ_i | ω)] p(ω). The levels of hierarchy can be increased further; in particular ω can itself be parameterized, and hence p(ω) can be learnt. Learning is usually full-pipeline, but using some form of Bayesian marginalisation to compute the posterior over ω: P(ω | D) ∝ p(ω) ∏_{i=1}^{M} ∫ dθ_i p(D_i | θ_i) p(θ_i | ω). The ease of
doing the marginalisation depends on the model: in some
(e.g. Latent Dirichlet Allocation [74]) the marginalisation is
exact due to the choice of conjugate exponential models,
in others (see e.g. [75]), a stochastic variational approach is
used to calculate an approximate posterior, from which a
lower bound to the marginal likelihood is computed.
Bayesian hierarchical models provide a valuable view-
point for meta-learning, by providing a modeling rather
than an algorithmic framework for understanding the meta-
learning process. In practice, prior work in HBMs has typi-
cally focused on learning simple tractable models θ while
most meta-learning work considers complex inner-loop
learning processes, involving many iterations. Nonetheless,
some meta-learning methods like MAML [16] can be under-
stood through the lens of HBMs [76].
AutoML: AutoML [31]–[33] is a rather broad umbrella
for approaches aiming to automate parts of the machine
learning process that are typically manual, such as data
preparation, algorithm selection, hyper-parameter tuning,
and architecture search. AutoML often makes use of numer-
ous heuristics outside the scope of meta-learning as defined
here, and focuses on tasks such as data cleaning that are
less central to meta-learning. However, AutoML sometimes
makes use of end-to-end optimization of a meta-objective,
so meta-learning can be seen as a specialization of AutoML.
3 TAXONOMY
3.1 Previous Taxonomies
Previous [77], [78] categorizations of meta-learning meth-
ods have tended to produce a three-way taxonomy across
optimization-based methods, model-based (or black box)
methods, and metric-based (or non-parametric) methods.
Optimization Optimization-based methods include those where the inner-level task (Eq. 6) is literally solved as an optimization problem, focusing on extracting the meta-knowledge ω required to improve optimization performance. A famous example is MAML [16], which aims to learn the initialization ω = θ_0, such that a small number of inner steps produces a classifier that performs well on validation data. The outer optimization is itself performed by gradient descent, differentiating through the updates of the base model. More
elaborate alternatives also learn step sizes [79], [80] or
train recurrent networks to predict steps from gradients
[19], [39], [81]. Meta-optimization by gradient over long
inner optimizations leads to several compute and memory
challenges which are discussed in Section 6. A unified view
of gradient-based meta learning expressing many existing
methods as special cases of a generalized inner loop meta-
learning framework has been proposed [82].
Black Box / Model-based In model-based (or black-box)
methods the inner learning step (Eq. 6, Eq. 4) is wrapped up
in the feed-forward pass of a single model, as illustrated
in Eq. 7. The model embeds the current dataset D into
activation state, with predictions for test data being made
based on this state. Typical architectures include recurrent
networks [39], [51], convolutional networks [38] or hyper-
networks [83], [84] that embed training instances and labels
of a given task to define a predictor for test samples. In this
case all the inner-level learning is contained in the activation
states of the model and is entirely feed-forward. Outer-
level learning is performed with ω containing the CNN,
RNN or hypernetwork parameters. The outer and inner-
level optimizations are tightly coupled as ω and D directly
specify θ. Memory-augmented neural networks [85] use an
explicit storage buffer and can be seen as a model-based