Meta-learning, or learning to learn, is a current research hotspot. Researchers at the University of Edinburgh recently released a survey, "Meta-Learning in Neural Networks: A Survey", that lays out the field in detail, covering definitions, methods, applications, and challenges, and has become an indispensable reference.
Meta-Learning in Neural Networks: A Survey
Timothy Hospedales, Antreas Antoniou, Paul Micaelli, Amos Storkey
Abstract—The field of meta-learning, or learning-to-learn, has seen a dramatic rise in interest in recent years. Contrary to
conventional approaches to AI where a given task is solved from scratch using a fixed learning algorithm, meta-learning aims to
improve the learning algorithm itself, given the experience of multiple learning episodes. This paradigm provides an opportunity to
tackle many of the conventional challenges of deep learning, including data and computation bottlenecks, as well as the fundamental
issue of generalization. In this survey we describe the contemporary meta-learning landscape. We first discuss definitions of
meta-learning and position it with respect to related fields, such as transfer learning, multi-task learning, and hyperparameter
optimization. We then propose a new taxonomy that provides a more comprehensive breakdown of the space of meta-learning
methods today. We survey promising applications and successes of meta-learning including few-shot learning, reinforcement learning
and architecture search. Finally, we discuss outstanding challenges and promising areas for future research.
Index Terms—Meta-Learning, Learning-to-Learn, Few-Shot Learning, Transfer Learning, Neural Architecture Search
1 INTRODUCTION
Contemporary machine learning models are typically
trained from scratch for a specific task using a fixed learn-
ing algorithm designed by hand. Deep learning-based ap-
proaches have seen great successes in a variety of fields
[1]–[3]. However there are clear limitations [4]. For example,
successes have largely been in areas where vast quantities of
data can be collected or simulated, and where huge compute
resources are available. This excludes many applications
where data is intrinsically rare or expensive [5], or compute
resources are unavailable [6], [7].
Meta-learning provides an alternative paradigm where
a machine learning model gains experience over multiple
learning episodes – often covering a distribution of related
tasks – and uses this experience to improve its future learn-
ing performance. This ‘learning-to-learn’ [8] can lead to a va-
riety of benefits such as data and compute efficiency, and it
is better aligned with human and animal learning [9], where
learning strategies improve both on a lifetime and evo-
lutionary timescale [9]–[11]. Machine learning historically
built models upon hand-engineered features, and feature
choice was often the determining factor in ultimate model
performance [12]–[14]. Deep learning realised the promise
of joint feature and model learning [15], [16], providing a
huge improvement in performance for many tasks [1], [3].
Meta-learning in neural networks can be seen as aiming to
provide the next step of integrating joint feature, model,
and algorithm learning. Neural network meta-learning has a
long history [8], [17], [18]. However, its potential as a driver
to advance the frontier of the contemporary deep learning
industry has led to an explosion of recent research. In
particular meta-learning has the potential to alleviate many
of the main criticisms of contemporary deep learning [4], for
instance by providing better data efficiency, exploitation of
prior knowledge transfer, and enabling unsupervised and
self-directed learning. Successful applications have been
T. Hospedales is with Samsung AI Centre, Cambridge and University of Edinburgh. A. Antoniou, P. Micaelli and A. Storkey are with University of Edinburgh.
Email: {t.hospedales,a.antoniou,paul.micaelli,a.storkey}@ed.ac.uk.
demonstrated in areas spanning few-shot image recognition
[19], [20], unsupervised learning [21], data efficient [22], [23]
and self-directed [24] reinforcement learning (RL), hyper-
parameter optimization [25], and neural architecture search
(NAS) [26]–[28].
Many different perspectives on meta-learning can be
found in the literature. Especially as different communities
use the term somewhat differently, it can be difficult to de-
fine. A perspective related to ours [29] views meta-learning
as a tool to manage the ‘no free lunch’ theorem [30] and
improve generalization by searching for the algorithm (in-
ductive bias) that is best suited to a given problem, or family
of problems. However, taken broadly, this definition can
include transfer, multi-task, feature-selection, and model-
ensemble learning, which are not typically considered as
meta-learning today. Another perspective on meta-learning
[31] broadly covers algorithm selection and configuration
techniques based on dataset features, and becomes hard
to distinguish from automated machine learning (AutoML)
[32]. In this paper, we focus on contemporary neural-network
meta-learning. We take this to mean algorithm or inductive
bias search as per [29], but focus on where this is achieved
by end-to-end learning of an explicitly defined objective function
(such as cross-entropy loss, accuracy or speed).
This paper thus provides a unique, timely, and up-to-
date survey of the rapidly growing area of neural network
meta-learning. In contrast, previous surveys are rather out
of date in this fast moving field, and/or focus on algorithm
selection for data mining [29], [31], [33]–[37], AutoML [32],
or particular applications of meta-learning such as few-shot
learning [38] or neural architecture search [39].
We address both meta-learning methods and applica-
tions. In particular, we first provide a high-level prob-
lem formalization which can be used to understand and
position recent work. We then provide a new taxonomy
of methodologies, in terms of meta-representation, meta-
objective and meta-optimizer. We survey several of the
popular and emerging application areas including few-
shot, reinforcement learning, and architecture search; and
position meta-learning with respect to related topics such
(arXiv:2004.05439v1 [cs.LG] 11 Apr 2020)
as transfer learning, multi-task learning and AutoML. We
conclude by discussing outstanding challenges and areas
for future research.
2 BACKGROUND
Meta-learning is difficult to define, having been used in
various inconsistent ways, even within the contemporary
neural-network literature. In this section, we introduce our
definition and key terminology, which aims to be useful
for understanding a large body of literature. We then po-
sition meta-learning with respect to related topics such as
transfer and multi-task learning, hierarchical models, hyper-
parameter optimization, lifelong/continual learning, and
AutoML.
Meta-learning is most commonly understood as learning
to learn, which refers to the process of improving a learn-
ing algorithm over multiple learning episodes. In contrast,
conventional ML considers the process of improving model
predictions over multiple data instances. During base learn-
ing, an inner (or lower, base) learning algorithm solves a
task such as image classification [15], defined by a dataset
and objective. During meta-learning, an outer (or upper,
meta) algorithm updates the inner learning algorithm, such
that the model learned by the inner algorithm improves
an outer objective. For instance this objective could be
generalization performance or learning speed of the inner
algorithm. Learning episodes of the base task, namely (base
algorithm, trained model, performance) tuples, can be seen
as providing the instances needed by the outer algorithm in
order to learn the base learning algorithm.
As defined above, many conventional machine learning
practices such as random hyper-parameter search by cross-
validation could fall within the definition of meta-learning.
The salient characteristic of contemporary neural-network
meta-learning is an explicitly defined meta-level objective,
and end-to-end optimization of the inner algorithm with
respect to this objective. Often, meta-learning is conducted
on learning episodes sampled from a task family, leading
to a base learning algorithm that is tuned to perform well
on new tasks sampled from this family. This can be a
particularly powerful technique to improve data efficiency
when learning new tasks. However, in a limiting case all
training episodes can be sampled from a single task. In the
following section, we introduce these notions more formally.
2.1 Formalizing Meta-Learning
Conventional Machine Learning In conventional supervised machine learning, we are given a training dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, such as (input image, output label) pairs. We can train a predictive model $\hat{y} = f_{\theta}(x)$ parameterized by $\theta$, by solving:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\mathcal{D}; \theta, \omega) \tag{1}$$
where $\mathcal{L}$ is a loss function that measures the match between true labels and those predicted by $f_{\theta}(\cdot)$. We include condition $\omega$ to make explicit the dependence of this solution on factors such as the choice of optimizer for $\theta$ or function class for $f$. Generalization is then measured by evaluating a number of test points with known labels.
The conventional assumption is that this optimization
is performed from scratch for every problem D; and fur-
thermore that ω is pre-specified. However, the specification
ω of ‘how to learn’ θ can dramatically affect generalization,
data-efficiency, computation cost, and so on. Meta-learning
addresses improving performance by learning the learning
algorithm itself, rather than assuming it is pre-specified and
fixed. This is often (but not always) achieved by revisiting
the first assumption above, and learning from a distribution
of tasks rather than from scratch.
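The distinction between $\theta$ and $\omega$ in Eq. 1 can be made concrete with a small sketch (our own illustrative toy, not code from the survey): fitting a scalar linear model by gradient descent, where every ingredient of 'how to learn' is fixed by hand.

```python
# Minimal sketch of Eq. 1: fit y = theta * x by gradient descent on a
# squared loss L(D; theta, omega). Everything in omega -- step size,
# iteration count, initialisation -- is hand-chosen here, which is exactly
# the pre-specification that meta-learning proposes to replace with learning.

D = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]           # (x, y) pairs, true slope 2
omega = {"lr": 0.05, "steps": 200, "theta0": 0.0}  # fixed 'how to learn'

def fit(D, omega):
    theta = omega["theta0"]
    for _ in range(omega["steps"]):
        # gradient of mean squared error w.r.t. theta
        grad = sum(2 * (theta * x - y) * x for x, y in D) / len(D)
        theta -= omega["lr"] * grad
    return theta

theta_star = fit(D, omega)   # converges near the true slope 2.0
```

Meta-learning, in the formulations that follow, asks how the entries of `omega` could themselves be optimized over many learning episodes rather than fixed in advance.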
Meta-Learning: Task-Distribution View Meta-learning
aims to improve performance by learning ‘how to learn’ [8].
In particular, the vision is often to learn a general purpose
learning algorithm, that can generalize across tasks and
ideally enable each new task to be learned better than the
last. As such ω specifies ‘how to learn’ and is often evaluated
in terms of performance over a distribution of tasks p(T ).
Here we loosely define a task to be a dataset and loss function $\mathcal{T} = \{\mathcal{D}, \mathcal{L}\}$. Learning how to learn thus becomes:

$$\min_{\omega} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \, \mathcal{L}(\mathcal{D}; \omega) \tag{2}$$

where $\mathcal{L}(\mathcal{D}; \omega)$ measures the performance of a model trained using $\omega$ on dataset $\mathcal{D}$. The knowledge $\omega$ of 'how to learn' is often referred to as across-task knowledge or meta-knowledge.
To solve this problem in practice, we usually assume access to a set of source tasks sampled from $p(\mathcal{T})$, with which we learn $\omega$. Formally, we denote the set of $M$ source tasks used in the meta-training stage as $\mathscr{D}_{source} = \{(\mathcal{D}^{train}_{source}, \mathcal{D}^{val}_{source})^{(i)}\}_{i=1}^{M}$, where each task has both training and validation data. Often, the source train and validation datasets are respectively called support and query sets. Denoting the meta-knowledge as $\omega$, the meta-training step of 'learning how to learn' is then:

$$\omega^{*} = \arg\max_{\omega} \log p(\omega \mid \mathscr{D}_{source}) \tag{3}$$
Now we denote the set of $Q$ target tasks used in the meta-testing stage as $\mathscr{D}_{target} = \{(\mathcal{D}^{train}_{target}, \mathcal{D}^{test}_{target})^{(i)}\}_{i=1}^{Q}$, where each task has both training and test data. In the meta-testing stage we use the learned meta-knowledge to train the base model on each previously unseen target task $i$:

$$\theta^{*\,(i)} = \arg\max_{\theta} \log p(\theta \mid \omega^{*}, \mathcal{D}^{train\,(i)}_{target}) \tag{4}$$
In contrast to conventional learning in Eq. 1, learning on the training set of a target task $i$ now benefits from meta-knowledge $\omega$ about the algorithm to use. This could take the form of an estimate of the initial parameters [19], in which case $\omega$ and $\theta$ are the same sized objects referring to the same quantities. However, $\omega$ can more generally encode other objects such as an entire learning model [40] or optimization strategy [41]. Finally, we can evaluate the accuracy of our meta-learner by the performance of $\theta^{*\,(i)}$ on the test split of each target task, $\mathcal{D}^{test\,(i)}_{target}$.
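The meta-train / meta-test pipeline of Eqs. 3-4 can be made concrete on a deliberately tiny task family (a hypothetical construction of ours, not an example from the paper): each task is "estimate the mean of a scalar distribution", the inner learner shrinks the task's train-set mean toward a learned prior $\omega$, and meta-training fits $\omega$ on the source tasks' validation splits.

```python
# Toy instantiation of Eqs. 3-4: omega is a learned prior mean shared
# across tasks; grid search stands in for the meta-optimizers surveyed later.
import random

random.seed(0)

def sample_task():
    """One task from p(T): estimate the mean of N(mu, 1) for a random mu."""
    mu = random.gauss(5.0, 1.0)
    draw = lambda n: [random.gauss(mu, 1.0) for _ in range(n)]
    return draw(3), draw(10)          # (small train split, larger val split)

def inner_learn(train, omega, lam=0.5):
    """Base learning (cf. Eq. 4): shrink the train-set mean toward omega."""
    return lam * omega + (1 - lam) * sum(train) / len(train)

# Meta-training (cf. Eq. 3): choose the omega that minimises validation
# loss across M = 20 source tasks.
source_tasks = [sample_task() for _ in range(20)]

def meta_loss(omega):
    return sum((inner_learn(tr, omega) - y) ** 2
               for tr, val in source_tasks for y in val)

omega_star = min([w / 10 for w in range(101)], key=meta_loss)

# Meta-testing: the learned prior helps on an unseen target task even
# though that task provides only three training points.
target_train, target_test = sample_task()
theta = inner_learn(target_train, omega_star)
```

Because the task means are drawn around 5.0, the learned prior `omega_star` lands near 5.0; with only three train points per task, shrinking toward it reduces target-task error relative to using the train mean alone.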
This setup leads to analogies of conventional underfit-
ting and overfitting: meta-underfitting and meta-overfitting. In
particular, meta-overfitting is an issue whereby the meta-
knowledge learned on the source tasks does not generalize
to the target tasks. It is relatively common, especially in the
case where only a small number of source tasks are avail-
able. In terms of meta-learning as inductive-bias learning
[29], meta-overfitting corresponds to learning an inductive
bias ω that constrains the hypothesis space of θ too tightly
around solutions to the source tasks.
Meta-Learning: Bilevel Optimization View The previous
discussion outlines the common flow of meta-learning in a
multiple task scenario, but does not specify how to solve
the meta-training step in Eq. 3. This is commonly done
by casting the meta-training step as a bilevel optimization
problem. While this picture is arguably only accurate for
the optimizer-based methods (see section 3.1), it is helpful
to visualize the mechanics of meta-learning more generally.
Bilevel optimization [42] refers to a hierarchical optimiza-
tion problem, where one optimization contains another
optimization as a constraint [25], [43]. Using this notation,
meta-training can be formalised as follows:
$$\omega^{*} = \arg\min_{\omega} \sum_{i=1}^{M} \mathcal{L}^{meta}\big(\theta^{*\,(i)}(\omega), \omega, \mathcal{D}^{val\,(i)}_{source}\big) \tag{5}$$

$$\text{s.t.}\quad \theta^{*\,(i)}(\omega) = \arg\min_{\theta} \mathcal{L}^{task}\big(\theta, \omega, \mathcal{D}^{train\,(i)}_{source}\big) \tag{6}$$
where $\mathcal{L}^{meta}$ and $\mathcal{L}^{task}$ refer to the outer and inner objectives respectively, such as cross entropy in the case of few-shot classification. A key characteristic of the bilevel paradigm is the leader-follower asymmetry between the outer and inner levels respectively: the inner level optimization Eq. 6 is conditional on the learning strategy $\omega$ defined by the outer level, but it cannot change $\omega$ during its training. Here $\omega$ could indicate an initial condition in non-convex optimization [19], a hyper-parameter such as regularization strength [25], or even a parameterization of the loss function to optimize, $\mathcal{L}^{task}$ [44]. Section 4.1 discusses the space of choices for $\omega$ in detail. The outer level optimization trains the learning strategy $\omega$ such that it produces models $\theta^{*\,(i)}(\omega)$ that perform well on their validation sets after training. Section 4.2 discusses how to optimize $\omega$ in detail. Note that while $\mathcal{L}^{meta}$ can measure simple validation performance, we shall see that it can also measure more subtle quantities such as learning speed and model robustness, as discussed in Section 4.3.
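The bilevel programme of Eqs. 5-6 can be simulated numerically on a toy example of our own construction (not code from the paper): $\omega$ is a shared initial parameter $\theta_0$, the inner level takes one gradient step per task, and the outer level descends on the post-adaptation validation loss. We use a finite-difference outer gradient as a crude stand-in for differentiating through the inner updates as MAML does.

```python
# Numeric sketch of Eqs. 5-6. Each task i fits y = theta * x, with tasks
# differing in their true slope a_i; omega is the shared initialisation
# theta_0 that the inner loop starts from.

tasks = [  # (train set, val set) per task, true slopes 1.0, 2.0, 3.0
    ([(1.0, a * 1.0), (2.0, a * 2.0)], [(3.0, a * 3.0)])
    for a in (1.0, 2.0, 3.0)
]
ALPHA = 0.05   # inner-loop step size

def inner(omega, train):
    """Eq. 6: one gradient step on the task loss, starting from omega."""
    grad = sum(2 * (omega * x - y) * x for x, y in train) / len(train)
    return omega - ALPHA * grad

def meta_loss(omega):
    """Eq. 5: validation loss of the adapted parameters, summed over tasks."""
    return sum((inner(omega, tr) * x - y) ** 2
               for tr, val in tasks for x, y in val)

omega = 0.0
for _ in range(500):                       # outer loop: descend on Eq. 5
    eps = 1e-4                             # finite-difference outer gradient
    g = (meta_loss(omega + eps) - meta_loss(omega - eps)) / (2 * eps)
    omega -= 0.01 * g
# omega converges to 2.0, the centre of the slope distribution: the best
# single initialisation from which one inner step reaches every task.
```

The leader-follower asymmetry is visible in the code: `inner` reads `omega` but never writes it; only the outer loop updates `omega`, using the validation loss of the already-adapted parameters.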
Finally, we note that the above formalization of meta-
training uses the notion of a distribution over tasks, and
using M samples from that distribution. While this is pow-
erful, and widely used in the meta-learning literature, it is
not a necessary condition for meta-learning. More formally, if we are given a single train and test dataset, we can split the training set to get validation data such that $\mathscr{D}_{source} = (\mathcal{D}^{train}_{source}, \mathcal{D}^{val}_{source})$ for meta-training, and for meta-testing we can use $\mathscr{D}_{target} = (\mathcal{D}^{train}_{source} \cup \mathcal{D}^{val}_{source}, \mathcal{D}^{test}_{target})$. We still learn $\omega$ over several episodes, and one could consider that $M = Q = 1$, although different train-val splits are usually used during meta-training.
Meta-Learning: Feed-Forward Model View As we will
see, there are a number of meta-learning approaches that
synthesize models in a feed-forward manner, rather than via
an explicit iterative optimization as in Eqs. 5-6 above. While
they vary in their degree of complexity, it can be instructive
to understand this family of approaches by instantiating the
abstract objective in Eq. 2 to define a toy example for meta-training linear regression [45]:

$$\min_{\omega} \; \mathbb{E}_{\substack{\mathcal{T} \sim p(\mathcal{T}) \\ (\mathcal{D}^{tr}, \mathcal{D}^{val}) \in \mathcal{T}}} \; \sum_{(x, y) \in \mathcal{D}^{val}} \left[ \big( x^{T} g_{\omega}(\mathcal{D}^{tr}) - y \big)^{2} \right] \tag{7}$$
Here we can see that we meta-train by optimising over a distribution of tasks. For each task a train and validation (aka support and query) set is drawn. The train set $\mathcal{D}^{tr}$ is embedded into a vector $g_{\omega}$ which defines the linear regression weights to predict examples $x$ drawn from the validation set. Optimising the above objective thus 'learns how to learn' by training the function $g_{\omega}$ to instantiate a learning algorithm that maps a training set to a weight vector. Thus if a novel meta-test task $\mathcal{T}^{te}$ is drawn from $p(\mathcal{T})$ we might also expect $g_{\omega}$ to provide a good solution. Different methods in this family vary in the complexity of the predictive model used (parameters $g$ that they instantiate), and how the support set is embedded (e.g., by simple pooling, CNN or RNN).
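A scalar instance of Eq. 7 (our own toy construction, not from the paper) shows how feed-forward methods replace inner optimization entirely: here the hypothetical embedding $g_{\omega}(\mathcal{D}^{tr}) = \omega \cdot \mathrm{mean}(x \cdot y)$ maps the train set straight to a regression weight, so 'learning to learn' reduces to fitting the single scalar $\omega$ across tasks.

```python
# Feed-forward view of Eq. 7: no gradient steps at the inner level; the
# 'learner' g is a single forward computation over the support set.

def g(omega, train):
    """Maps a train set directly to a linear-regression weight."""
    return omega * sum(x * y for x, y in train) / len(train)

# Task family: y = a * x, with train inputs fixed at x in {1, 2}.
def make_task(a):
    train = [(1.0, a * 1.0), (2.0, a * 2.0)]
    val = [(3.0, a * 3.0), (4.0, a * 4.0)]
    return train, val

tasks = [make_task(a) for a in (0.5, 1.0, 2.0)]

def meta_loss(omega):           # Eq. 7, summed over a few source tasks
    return sum((g(omega, tr) * x - y) ** 2
               for tr, val in tasks for x, y in val)

# On this family mean(x*y) = a * (1 + 4) / 2 = 2.5a, so the exact optimum
# is omega = 0.4; a coarse grid search over omega recovers it.
omega_star = min([w / 1000 for w in range(1001)], key=meta_loss)
```

In real feed-forward meta-learners, `g` is a deep set-embedding network rather than a hand-designed pooling statistic, but the structure is the same: the support set goes in, model parameters (or predictions) come out, in one pass.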
2.2 Historical Context of Meta-Learning
Meta-learning first appears in the literature in 1987 in two
separate and independent pieces of work, by J. Schmid-
huber and G. Hinton [17], [46]. Schmidhuber [17] set the
theoretical framework for a new family of methods that
can learn how to learn, using self-referential learning. Self-
referential learning involves training neural networks that
can receive as inputs their own weights and predict updates
for said weights. Schmidhuber further proposed that the
model itself can be learned using evolutionary algorithms.
Hinton et al. [46] proposed the usage of two weights per
neural network connection instead of one. The first weight is
the standard slow-weight which acquires knowledge slowly
(called slow-knowledge) over optimizer updates, whereas the
second weight or fast-weight acquires knowledge quickly
(called fast-knowledge) during inference. The fast weight’s
responsibility is to be able to deblur or recover slow weights
learned in the past, that have since been forgotten due to
optimizer updates. Both of these papers introduce funda-
mental concepts that later on branch out and give rise to
contemporary meta-learning.
After the introduction of meta-learning, one can see a
rapid increase in the usage of the idea in multiple dif-
ferent areas. Bengio et al. [47], [48] proposed systems that
attempt to meta-learn biologically plausible learning rules.
Schmidhuber et al. continued to explore self-referential systems and meta-learning in subsequent work [49], [50]. S. Thrun et al. coined the term learning to learn in [8] as an alternative to meta-learning and proceeded to explore and
dissect available literature in meta-learning in search for
a general meta-learning definition. Proposals for training
meta-learning systems using gradient descent and back-
propagation were first made in 2001 [51], [52]. Additional
overviews of the meta-learning literature shortly followed
[29]. Meta-learning was first used in reinforcement learning
in work by Schweighofer et al. [53] after which came the first
usage of meta-learning in zero-shot learning by Larochelle
et al. [54]. Finally in 2012 Thrun et al. [8] re-introduced meta-
learning in the modern era of deep neural networks, which
marked the beginning of modern meta-learning of the type
discussed in this survey.
Meta-Learning is also closely related to methods for
hierarchical and multi-level models in statistics for grouped
data. In such hierarchical models, grouped data elements
are modelled with a within-group model and the differences between each group are modelled with a between-group model. Examples of such hierarchical models in the machine
learning literature include topic models such as Latent
Dirichlet Allocation [55] and its variants. In topic models,
a model for a new document is learnt from the document’s
data; the learning of that model is guided by the set of topics
already learnt from the whole corpus. Hierarchical models
are discussed further in Section 2.3.
2.3 Related Fields
Here we position meta-learning against related areas, which
is often the source of confusion in the literature.
Transfer Learning (TL) TL [34] uses past experience of
a source task to improve learning (speed, data efficiency,
accuracy) on a target task – by transferring a parameter
prior, initial condition, or feature extractor [56] from the
solution of a previous task. TL refers both to this endeavour and to a problem area. In the contemporary neural network context it often refers to a particular methodology of parameter
transfer plus optional fine tuning (although there are nu-
merous other approaches to this problem [34]).
While TL can refer to a problem area, meta-learning
refers to a methodology which can be used to improve
TL as well as other problems. TL as a methodology is
differentiated from meta-learning as the prior is extracted by
vanilla learning on the source task without the use of a
meta-objective. In meta-learning, the corresponding prior
would be defined by an outer optimization that evaluates
how well the prior performs when helping to learn a new
task, as illustrated, e.g., by MAML [19]. More generally,
meta-learning deals with a much wider range of meta-
representations than solely model parameters (Section 4.1).
Domain Adaptation (DA) and Domain Generalization (DG) Domain-shift refers to the situation where source and
target tasks have the same classes but the input distribution
of the target task is shifted with respect to the source
task [34], [57], leading to reduced model performance upon
transfer. DA is a variant of transfer learning that attempts
to alleviate this issue by adapting the source-trained model
using sparse or unlabeled data from the target. DG refers
to methods to train a source model to be robust to such
domain-shift without further adaptation. Many methods
have been studied [34], [57], [58] to transfer knowledge and
boost performance in the target domain. However, as for
TL, vanilla DA and DG are differentiated in that there is no
meta-objective that optimizes ‘how to learn’ across domains.
Meanwhile, meta-learning methods can be used to perform
both DA and DG, which we cover in section 5.9.
Continual learning (CL) Continual and lifelong learning
[59], [60] refer to the ability to learn on a sequence of tasks
drawn from a potentially non-stationary distribution, and
in particular seek to do so while accelerating learning new
tasks and without forgetting old tasks. It is related to meta-learning insofar as it also works with a task distribution, and the goal is partly to accelerate learning of a target task. However most
continual learning methodologies are not meta-learning
methodologies since this meta objective is not solved for
explicitly. Nevertheless, meta-learning provides a potential
framework to advance continual learning, and a few recent
studies have begun to do so by developing meta-objectives
that encode continual learning performance [61]–[63].
Multi-Task Learning (MTL) aims to jointly learn several
related tasks, and benefits from the regularization effect of parameter sharing and from the diversity of the resulting
shared representation [64]–[66]. Like TL, DA, and CL, con-
ventional MTL is a single-level optimization without a meta-
objective. Furthermore, the goal of MTL is to solve a fixed
number of known tasks, whereas the point of meta-learning
is often to solve unseen future tasks. Nonetheless, meta-
learning can be brought in to benefit MTL, e.g. by learning
the relatedness between tasks [67], or how to prioritise
among multiple tasks [68].
Hyperparameter Optimization (HO) is within the remit
of meta-learning, in that hyperparameters such as learn-
ing rate or regularization strength can be included in the
definition of ‘how to learn’. Here we focus on HO tasks
defining a meta objective that is trained end-to-end with
neural networks. This includes some work in HO, such
as gradient-based hyperparameter learning [67] and neural
architecture search [26]. But we exclude other approaches
like random search [69] and Bayesian Hyperparameter Op-
timization [70], which are rarely considered to be meta-
learning.
Hierarchical Bayesian Models (HBM) involve Bayesian learning of parameters $\theta$ under a prior $p(\theta|\omega)$. The prior is written as a conditional density on some other variable $\omega$ which has its own prior $p(\omega)$. Hierarchical Bayesian models feature strongly as models for grouped data $\mathcal{D} = \{\mathcal{D}_i \mid i = 1, 2, \ldots, M\}$, where each group $i$ has its own $\theta_i$. The full model is $\left[\prod_{i=1}^{M} p(\mathcal{D}_i|\theta_i)\, p(\theta_i|\omega)\right] p(\omega)$. The levels of hierarchy can be increased further; in particular $\omega$ can itself be parameterized, and hence $p(\omega)$ can be learnt.

Learning is usually full-pipeline, but using some form of Bayesian marginalisation to compute the posterior over $\omega$: $P(\omega|\mathcal{D}) \propto p(\omega) \prod_{i=1}^{M} \int d\theta_i\, p(\mathcal{D}_i|\theta_i)\, p(\theta_i|\omega)$. The ease of doing the marginalisation depends on the model: in some (e.g. Latent Dirichlet Allocation [55]) the marginalisation is exact due to the choice of conjugate exponential models, in others (see e.g. [71]), a stochastic variational approach is used to calculate an approximate posterior, from which a lower bound to the marginal likelihood is computed.
Bayesian hierarchical models provide a valuable view-
point for meta-learning, in that they provide a modeling
rather than an algorithmic framework for understanding the
meta-learning process. In practice, prior work in Bayesian
hierarchical models has typically focused on learning sim-
ple tractable models θ; most meta-learning work however
considers complex inner-loop learning processes, involving
many iterations. Nonetheless, some meta-learning methods
like MAML [19] can be understood through the lens of
HBMs [72].
AutoML: AutoML [31], [32] is a rather broad umbrella
for approaches aiming to automate parts of the machine
learning process that are typically manual, such as data
preparation and cleaning, feature selection, algorithm se-
lection, hyper-parameter tuning, architecture search, and
so on. AutoML often makes use of numerous heuristics
outside the scope of meta-learning as defined here, and
addresses tasks such as data cleaning that are less central
to meta-learning. However, AutoML sometimes makes use
of meta-learning as we define it here in terms of end-to-end
optimization of a meta-objective, so meta-learning can be
seen as a specialization of AutoML.
3 TAXONOMY
3.1 Previous Taxonomies
Previous [73], [74] categorizations of meta-learning meth-
ods have tended to produce a three-way taxonomy across
optimization-based methods, model-based (or black box)
methods, and metric-based (or non-parametric) methods.
Optimization Optimization-based methods include those
where the inner-level task (Eq. 6) is literally solved as
an optimization problem, and focuses on extracting meta-
knowledge ω required to improve optimization perfor-
mance. The most famous of these is perhaps MAML [19], where the meta-knowledge $\omega$ is the initialization of the model parameters in the inner optimization, namely $\theta_0$. The goal is to learn $\theta_0$ such that a small number of inner steps on a small number of train instances produces a classifier that performs well on validation data. This is also performed
by gradient descent, differentiating through the updates to
the base model. More elaborate alternatives also learn step
sizes [75], [76] or train recurrent networks to predict steps
from gradients [41], [77], [78]. Meta-optimization by gradi-
ent leads to the challenge of efficiently evaluating expen-
sive second-order derivatives and differentiating through a
graph of potentially thousands of inner optimization steps
(see Section 6). For this reason it is often applied to few-shot
learning where few inner-loop steps may be sufficient.
Black Box / Model-based In model-based (or black-box)
methods the inner learning step (Eq. 6, Eq. 4) is wrapped up
in the feed-forward pass of a single model, as illustrated
in Eq. 7. The model embeds the current dataset D into
activation state, with predictions for test data being made
based on this state. Typical architectures include recurrent
networks [41], [51], convolutional networks [40] or hyper-
networks [79], [80] that embed training instances and labels
of a given task to define a predictor that inputs testing
example and predicts its label. In this case all the inner-
level learning is contained in the activation states of the
model and is entirely feed-forward. Outer-level learning
is performed with ω containing the CNN, RNN or hyper-
network parameters. The outer and inner-level optimiza-
tions are tightly coupled as ω directly specifies θ. Memory-
augmented neural networks [81] use an explicit storage
buffer and can also be used as a model-based algorithm [82],
[83]. It has been observed that model-based approaches are
usually less able to generalize to out-of-distribution tasks
than optimization-based methods [84]. Furthermore, while
they are often very good at data efficient few-shot learning,
they have been criticised for being asymptotically weaker
[84] as it isn’t clear that black-box models can successfully
embed a large training set into a rich base model.
Metric-Learning Metric-learning or non-parametric algo-
rithms are thus far largely restricted to the popular but spe-
cific few-shot application of meta-learning (Section 5.1.1).
The idea is to perform non-parametric ‘learning’ at the inner
(task) level by simply comparing validation points with
training points and predicting the label of matching training
points. In chronological order, this has been achieved with
methods such as siamese networks [85], matching networks
[86], prototypical networks [20], relation networks [87], and
graph neural networks [88]. Here the outer-level learning
corresponds to metric learning (finding a feature extractor
ω that encodes the data to a representation suitable for
comparison). As before ω is learned on source tasks, and
used for target tasks.
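The non-parametric inner loop these methods share can be sketched in a few lines, in the spirit of prototypical networks [20]. This is our own minimal illustration: the embedding $f_{\omega}$ is taken to be the identity on 2-D points, whereas in practice $\omega$ parameterizes a deep feature extractor meta-learned on source tasks.

```python
# Nearest-centroid 'inner loop' of metric-based meta-learning: average the
# support embeddings per class into prototypes, then label each query point
# by its nearest prototype. No parameters are fitted at the task level.

def prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    sums, counts = {}, {}
    for x, label in support:
        s = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            s[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def classify(query, protos):
    """Predict the class of the nearest prototype (squared Euclidean)."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(protos, key=lambda c: dist(query, protos[c]))

support = [([0.0, 0.1], "cat"), ([0.1, 0.0], "cat"),
           ([1.0, 0.9], "dog"), ([0.9, 1.0], "dog")]
protos = prototypes(support)
# classify([0.2, 0.2], protos) -> "cat"; classify([0.8, 0.8], protos) -> "dog"
```

The outer loop, omitted here, is where all the learning happens: it trains the embedding so that same-class points cluster and prototypes separate, across many sampled source tasks.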
Discussion The common breakdown reviewed above
does not expose all facets of interest and is insufficient to
understand the connections between the wide variety of
meta-learning frameworks available today. In the following
subsections we therefore present a new cross-cutting break-
down of meta-learning methods.
3.2 Proposed Taxonomy
We introduce a new breakdown along three independent
axes. For each axis we provide a taxonomy that reflects the
current meta-learning landscape.
Meta-Representation (“What?”) The first axis is the
choice of representation of meta-knowledge ω. This could
span an estimate of model parameters [19] used for opti-
mizer initialization, to readable code in the case of program
induction [89]. Note that the base model representation θ
is usually application-specific, for example a convolutional
neural network (CNN) [1] in the case of computer vision.
Meta-Optimizer (“How?”) The second axis is the choice of optimizer to use for the outer level during meta-training (see Eq. 5)¹. The outer-level optimizer for $\omega$ can take a variety of forms from gradient-descent [19], to reinforcement learning [89] and evolutionary search [23].
Meta-Objective (“Why?”) The third axis is the goal of meta-learning which is determined by choice of meta-objective $\mathcal{L}^{meta}$ (Eq. 5), task distribution $p(\mathcal{T})$, and data-flow between the two levels. Together these can customize meta-learning for different purposes such as sample efficient few-shot learning [19], [40], fast many-shot optimization [89], [91], or robustness to domain-shift [44], [92], label noise [93], and adversarial attack [94].
4 SURVEY: METHODOLOGIES
In this section we break down existing literature according
to our proposed new methodological taxonomy.
1. In contrast, the inner level optimizer for θ (Eq. 6) may be specified
by the application at hand (e.g., gradient-descent supervised learning
of cross-entropy loss in the case of image recognition [1], or policy-
gradient reinforcement learning in the case of continuous control [90]).