Lifelong Mixture of Variational Autoencoders
Fei Ye and Adrian G. Bors, Senior Member, IEEE
Department of Computer Science, University of York, York YO10 5GH, UK
E-mail: fy689@york.ac.uk, adrian.bors@york.ac.uk
Abstract—In this paper, we propose an end-to-end lifelong
learning mixture of experts. Each expert is implemented by
a Variational Autoencoder (VAE). The experts in the mixture
system are jointly trained by maximizing a mixture of indi-
vidual component evidence lower bounds (MELBO) on the log-
likelihood of the given training samples. The mixing coefficients in the mixture model control the contribution of each expert to the global representation. These are sampled from a Dirichlet distribution whose parameters are determined through non-parametric estimation during lifelong learning. The model can learn new tasks quickly when these are similar to those previously learnt. The proposed Lifelong Mixture of VAEs (L-MVAE) expands its architecture with new components when learning a completely new task. After training, our model can automatically determine the relevant expert to be used when fed with new data samples. This mechanism reduces both the memory requirements and the computational cost, as only one expert is used during inference. The L-MVAE inference model is able to perform interpolations in the joint latent space across the data domains associated with different tasks and is shown to be efficient for disentangled representation learning. The code is available at: https://github.com/dtuzi123/LifelongMixtureVAEs
Index Terms—Lifelong learning, Mixture of Variational Au-
toencoders, Multi-task learning, Mixture of Evidence Lower
Bounds, Disentangled representations.
I. INTRODUCTION
Deep learning models suffer from catastrophic forgetting [1]
when trained on multiple databases in a sequential manner.
A deep learning model quickly forgets the characteristics of
the previously learned experiences while adjusting to learning
new information. The ability of artificial learning systems to continuously acquire, preserve and transfer skills and knowledge throughout their lifespan is called lifelong learning
[1]. Existing approaches would either use dynamic architec-
tures, adopt regularization during training, or employ gen-
erative replay mechanisms. Dynamic architecture approaches
[2], [3], [4], [5], [6] would increase the network capacity by
adding new layers and processing units in order to adapt the
network’s architecture to acquiring new information. However,
such approaches would require a specific architecture design
while their parameters would increase progressively with the
number of tasks. Regularization approaches [7], [8], [9], [10],
[11] aim to impose a penalty when updating the network's
parameters in order to preserve the knowledge associated
with previously learned tasks. In practice, these approaches
suffer from performance degradation when learning a series
of tasks where the datasets are entirely different from the
previously learned ones. Memory-based methods use a buffer
in order to upload previously learned data samples [12], [13],
or utilize powerful generative networks such as a Variational
Autoencoders (VAEs) [14], [15], [16], [17] or Generative
Adversarial Networks (GANs) [18], [19] as memory-based replay networks that reproduce and generate data consistent with what has been seen and learned before. These approaches require additional memory storage for recording the parameters used for data generation, while their performance on the previously learned tasks is heavily dependent on the generator's ability to realistically replicate data.
Promising results have been achieved on prediction tasks
[6], [7], [20], [21], [22], [23]. However, these methods do not
capture the underlying structure behind the data, which pre-
vents them from being applied in a wide range of applications.
There are very few attempts addressing representation learning
under the lifelong setting [15], [16]. The performance of these
methods degrades significantly when engaging in the lifelong
training with datasets containing complex images or on a long
sequence of tasks. The reason is that these approaches require retraining their generators on artificially generated data. Meanwhile, the performance loss on each dataset accumulates during the lifelong learning of a sequence of several tasks.
To address this problem, we propose a probabilistic mixture
of experts model, where each expert infers a probabilistic
representation of a given task. A Dirichlet sampling process
defines the likelihood of a certain expert to be activated when
presented with a new task.
This paper has the following contributions:
• A novel mixture learning model, called Lifelong Mixture
of VAEs (L-MVAE). Instead of capturing different char-
acteristics of a database as in other mixture models [24],
[25], [26], [27], the proposed mixture model automatically embeds the knowledge associated with each database into a distinct latent space, modelled by one of the mixture's experts, during the lifelong learning.
• A training algorithm based on the maximization of the
mixture of individual component evidence lower bounds
(MELBO).
• A mixing-coefficient sampling process is introduced in
order to activate or drop out experts in L-MVAE. Besides
defining an adaptive architecture, this procedure acceler-
ates the learning process of new tasks while overcoming
the forgetting of the previously learned tasks.
The remainder of the paper contains a detailed overview of
the existing state of the art in Section II, while the proposed
L-MVAE model is discussed in Section III. In Section IV we
discuss the theory behind the proposed L-MVAE model and in
Section V we explain how the proposed methodology can be
used in unsupervised, supervised and semi-supervised learning
applications. The expansion mechanism for the model’s archi-
tecture is presented in Section VI. The experimental results
are analyzed in Section VII while the conclusions are drawn
in Section VIII.
II. RELATED RESEARCH STUDIES
A variational autoencoder (VAE) [28] is made up of two
networks, an encoder and a decoder. Given a data set, the
encoder extracts a latent vector z, and the decoder aims
to reconstruct the given data from the latent vectors. A
number of research works have been developed for capturing
meaningful and disentangled data representations by using the
VAE framework [29], [30], [31], [32], [33]. These approaches
show promising results on achieving disentanglement between
latent variables as well as interpretable visual results, where
specific properties of the scene can be manipulated through
changing the relevant latent variables. However, these models
work well only on data samples drawn from a single domain,
corresponding to a specific database used for training. When
they are re-trained on a different database, their parameters
are updated and then they fail to perform on the tasks learned
previously. This happens because they do not have appropriate
objective functions to deal with catastrophic forgetting [9],
[34], [35].
Recently, there have been some attempts to learn cross-
domain representations under the lifelong learning by intro-
ducing an environment-dependent mask that specifies a subset
of generative factors [16], by proposing a Teacher-Student
lifelong learning framework [15], [36] or a hybrid model [37]
of Generative Adversarial Nets (GANs) [38] and VAE. The
models proposed in [15], [16], [37] are based on Generative
Replay Mechanisms (GRM) aiming to overcome forgetting.
However, these methods suffer from poor performance when
considering complex data.
Aljundi et al. [39] proposed a lifelong learning system
named the Expert Gate model, where new experts are added
to a network of experts. The most relevant expert from the
given set is chosen during the testing stage, according to
the reconstruction error of the data. However, this may not
necessarily correspond to the best log-likelihood estimate for
the data. Moreover, the Expert Gate model was used only for
supervised classification tasks.
Regularization based approaches alleviate catastrophic for-
getting by adding an auxiliary term that penalizes changes in
the weights when the model is trained on a new task [6], [7],
[8], [9], [10], [11], [35], [40], [41], [42], or by storing past samples
to regulate the optimization [20], [43]. However, regularization
based approaches have huge computation requirements when
the number of tasks increases [44].
In another direction of research, mixtures of VAEs have
been employed for continuous learning [24], [25], [26], [27].
These models are able to capture underlying complex struc-
tures behind data and therefore perform well on many down-
stream tasks including clustering and semi-supervised classifi-
cation. However, these mixture models would only capture
characteristics of a single database which had been split
into batches of data, and tend to forget previously learned
data characteristics when attempting to learn a sequence of
distinct tasks. In contrast to the above mentioned methods,
our model is able to capture underlying generative latent
variable representations across multiple data domains during
the lifelong learning.
III. THE LIFELONG MIXTURE OF VAES
A. Problem formulation
In this paper we consider a model made up of a mixture
of networks [45] which is able to deal with three different
learning scenarios: supervised, semi-supervised and unsuper-
vised, under the lifelong learning setting. Let us consider a
sequence of tasks and denote $D^{(k)} = \{x_i^{(k)}, y_i^{(k)}\}_{i=1}^{N_k}$ as the dataset characterizing the $k$-th task, where $x_i^{(k)} \in \mathcal{X}^{(k)}$ is the source domain and $y_i^{(k)} \in \mathcal{Y}^{(k)}$ is the target domain, usually defined by class labels, while each domain $\{D^{(i)} \mid i = 1, \ldots, K\}$ is associated with a given task. We aim to
learn a model which not only generates or reconstructs data
but which can also generate meaningful representations useful
for various tasks during a lifelong learning process.
B. Mixture objective function
Traditional mixture models [46], [47] normally capture different characteristics of a dataset by learning several latent variable vectors, with distinct sets of variables associated with each mixture component. In this paper, we implement each expert using a generative latent variable model $p_\theta(x, z) = p_\theta(x|z)p(z)$, where $z \in \mathbb{R}^d$ is the latent variable and $\theta$ represents the decoder's parameters, as in VAEs [28]. The learning goal of the generative model is to maximize the log-likelihood of the data distribution, which is a difficult problem due to the intractability of the marginal distribution $p(x) = \int p_\theta(x|z)p(z)\,dz$, which requires marginalizing over all latent variables. Instead, we optimize the evidence lower bound (ELBO) on the data log-likelihood [28]:

$$\log p(x) \geq \mathbb{E}_{z \sim q_\varepsilon(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\varepsilon(z|x) \,\|\, p(z)] = \mathcal{L}_{VAE,\theta,\varepsilon}(x), \tag{1}$$
where $q_\varepsilon(z|x)$ is the variational distribution and $\varepsilon$ represents the parameters of the encoder. We use Gaussian distributions for both the prior $p(z)$ and the variational distribution $q_\varepsilon(z|x)$. The latent variable $z$ is sampled using the reparametrization trick [28], $z_i = u_i + \delta \otimes \sigma_i$, where $u_i$ and $\sigma_i$ are inferred by the encoder and $\delta$ is sampled from $\mathcal{N}(0, I)$. $p_\theta(x|z)$ is implemented by a decoder with trainable parameters $\theta$, which receives the latent variables $z$ and produces the data reconstructions $x'$.
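To make Eq. (1) concrete, the following is a minimal sketch of how a single VAE expert could compute its per-sample ELBO with the reparametrization trick. The layer sizes, the Bernoulli (logit) reconstruction likelihood and the module name `VAEExpert` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEExpert(nn.Module):
    """One mixture expert: a VAE with Gaussian prior and posterior.
    Layer sizes and the Bernoulli decoder are illustrative assumptions."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # u in the paper
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log(sigma^2)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        """Per-sample ELBO of Eq. (1); x is assumed to lie in [0, 1]."""
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparametrization trick: z = u + delta * sigma, delta ~ N(0, I)
        delta = torch.randn_like(mu)
        z = mu + delta * torch.exp(0.5 * logvar)
        logits = self.dec(z)
        # E_q[log p(x|z)] under a Bernoulli decoder
        log_px_z = -F.binary_cross_entropy_with_logits(
            logits, x, reduction='none').sum(dim=-1)
        # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
        return log_px_z - kl
```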
When considering $K$ experts in the mixture model, we introduce the loss function as the Mixture of individual ELBOs (MELBO) $\mathcal{L}^i_{VAE}(x)$, defined through (1):

$$\mathcal{L}_{L\text{-}MVAE}(x) = \frac{\sum_{i=1}^{K} w_i\, \mathcal{L}^i_{VAE}(x)}{\sum_{i=1}^{K} w_i}, \tag{2}$$

where $w_i$ is the mixing coefficient, which controls the significance of the $i$-th expert. We model all mixing coefficients by using a Dirichlet distribution $\{w_1, \ldots, w_K\} \sim \mathrm{Dir}(a)$ with parameters $a = \{a_1, \ldots, a_K\}$. In the following we describe the mechanism for selecting appropriate L-MVAE components during the training.
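Given a list of such experts and sampled mixing weights, the MELBO objective in Eq. (2) reduces to a weighted average of the individual ELBOs, normalized by the sum of the weights. A minimal sketch, reusing the `VAEExpert.elbo` interface assumed above:

```python
import torch

def melbo_loss(experts, x, w):
    """Negative Mixture of ELBOs, Eq. (2), to be minimized.
    experts: list of K VAEExpert modules; w: tensor of K mixing weights."""
    elbos = torch.stack([expert.elbo(x).mean() for expert in experts])
    return -(w * elbos).sum() / w.sum()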
C. The selection of L-MVAE mixture's components during training
Certain research studies [24], [25] have considered equal
contributions for the components of deep learning mixture
systems. However, in this paper we consider that each mixture component is specialized for a specific task. The selection of a specific mixture component is performed through the mixing weights $w_i$, $i = 1, \ldots, K$. We assume that the weighting probability for each mixture component is drawn from a multinomial distribution (a Bernoulli distribution in the binary case), defined by a Dirichlet prior.
Assignment vector. In the following, we introduce an assignment vector $c$, with each of its entries $c_i \in \{0, 1\}$, $i = 1, \ldots, K$, representing the probability of including or not the $i$-th expert in the mixture. $c_i$ is sampled from a Bernoulli distribution. Before starting the training, we set all entries to $c_i = 0$, $i = 1, \ldots, K$. The assignment probability for each mixing component is calculated considering the sample log-likelihood of each expert after learning each task, as:

$$p(c_j) = 1 - \frac{\exp(-\mathcal{L}^j_{VAE}(x_b)) + u\,c'_j}{\sum_{i=1}^{K}\left[\exp(-\mathcal{L}^i_{VAE}(x_b)) + u\,c'_i\right]}, \tag{3}$$
where $x_b$ is sampled from the data batch drawn from the database corresponding to the current task. $c'_j$ denotes the assignment variable for the $j$-th expert and holds the value resulting from learning the previous task, before Eq. (3) is evaluated. The term $u\,c'_j$ is used to push $p(c_j)$ outside the range of admissible values when $c'_j = 1$, and therefore we consider $u$ to be a large value. Then we find the maximum probability for a mixing component:

$$p(c_{j^*}) = \max\big(p(c_1), \ldots, p(c_K)\big), \tag{4}$$

where $j^*$ represents the index of the VAE component selected according to the parameters learnt during the previous tasks. We then reset the other assignment variables, except for $j^*$:

$$p(c_i) = \begin{cases} 1, & c'_i = 1 \\ 0, & c'_i = 0 \end{cases}, \quad i = 1, 2, \ldots, K,\; i \neq j^*. \tag{5}$$
Since $c'_i$ is the assignment corresponding to the learning process of the previous task, before evaluating Eq. (3) we use Eq. (5) to recover the dropout status of all experts for the current task, except for the $j^*$-th expert, which is dropped out from future training because it is going to be used for recording and reproducing the information associated with the task currently being learnt. When learning the first task, all of the mixture's components are trained; when learning the second task, only $K - 1$ components are trained, while one component is no longer updated because it is considered a depository of the information associated with the first task. This component will consequently be used to generate information consistent with the probabilistic representation of the first task. This process continues so that for the last task at least one VAE is still available for training. The number of mixing components $K$ considered initially should therefore be larger than, or at least equal to, the number of tasks assumed to be learned during the lifelong learning process. In Section VI we describe a mechanism for expanding the mixture. A sketch of this assignment step is given below.
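A minimal sketch of Eqs. (3)-(5), assuming the `VAEExpert` interface from Section III-B; the value of the constant `u` is an assumption, since the paper only requires it to be large:

```python
import torch

def update_assignments(experts, x_b, c_prev, u=1e6):
    """Expert selection for the current task, Eqs. (3)-(5).
    c_prev: float tensor of K entries; c_prev[i] = 1 if expert i is frozen."""
    with torch.no_grad():
        elbos = torch.stack([expert.elbo(x_b).mean() for expert in experts])
    # Eq. (3): frozen experts get the large penalty u * c'_i, pushing their
    # assignment probability out of range so they cannot be selected again.
    scores = torch.exp(-elbos) + u * c_prev
    p_c = 1.0 - scores / scores.sum()
    # Eq. (4): the best-fitting still-trainable expert wins the assignment.
    j_star = int(torch.argmax(p_c))
    # Eq. (5): the remaining experts keep their previous dropout status,
    # while expert j* is dropped from all future training.
    c_new = c_prev.clone()
    c_new[j_star] = 1.0
    return c_new, j_star
```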
The sampling of mixing weights. Suppose that L-MVAE has finished learning the $t$-th task. We collect several batches of samples $\{x_1, \ldots, x_N\}$ from the $(t+1)$-th task, where each $x_i$ represents the $i$-th batch of samples; these are used to evaluate the assignment vector $c$ by using Eq. (3). We calculate the average probability $p(c_j) = \sum_{i=1}^{N} p(c^i_j)/N$, where each $p(c^i_j)$ represents the assignment probability evaluated on batch $x_i$. Then we find $p(c_{j^*})$ by using Eq. (4) and recover the previous assignments, except for $c_{j^*}$, by using Eq. (5). The Dirichlet parameters are calculated in order to fix the mixture components containing the information corresponding to the previously learnt tasks, while keeping the other mixture components available for training on future tasks. For the mixing components we consider:

$$a_i = \begin{cases} e, & c_i = 1 \\ \dfrac{1 - e K'}{K - K'}, & c_i = 0 \end{cases}, \quad i = 1, \ldots, K, \tag{6}$$

where $e$ is a very small positive value and $K'$ represents the number of tasks learnt so far out of a total of $K$ given tasks during the lifelong learning. A small value of the Dirichlet parameter implies that the corresponding mixture component is no longer trained. The mixing weights $w_1, \ldots, w_K$ are sampled from a Dirichlet distribution with parameters $a_1, \ldots, a_K$. We then train the mixture model with $w_1, \ldots, w_K$ by using Eq. (2) when learning the $(t+1)$-th task.
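The Dirichlet parameter assignment in Eq. (6) and the subsequent weight sampling could be sketched as follows; `e = 1e-4` is an assumed instance of the "very small positive value":

```python
import torch
import torch.distributions as dist

def sample_mixing_weights(c, e=1e-4):
    """Dirichlet parameters of Eq. (6) and mixing-weight sampling.
    c: float tensor of K entries; c[i] = 1 for the K' frozen experts."""
    K = c.numel()
    K_prime = int(c.sum())                  # tasks learnt so far (K' < K)
    a = torch.empty(K)
    a[c == 1] = e                           # frozen experts: tiny parameter
    a[c == 0] = (1.0 - e * K_prime) / (K - K_prime)   # free experts
    return dist.Dirichlet(a).sample()       # weights w_1, ..., w_K for Eq. (2)
```

Frozen experts thus receive near-zero mixing weights, so they contribute negligible gradient to Eq. (2) and their parameters are effectively preserved.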
Testing phase. Suppose that after the lifelong learning process we have trained $K$ components. In the testing phase, we perform the selection of a single component to be used for the given data samples. We firstly calculate the selection probabilities $\{v_1, \ldots, v_K\}$ from the log-likelihood of the data sample under each component:

$$v_j = \frac{\exp\!\left(-\frac{1}{\mathcal{L}^j_{VAE}(x)}\right)}{\sum_{i=1}^{K} \exp\!\left(-\frac{1}{\mathcal{L}^i_{VAE}(x)}\right)}, \quad j = 1, \ldots, K. \tag{7}$$

Then we select a component by sampling the mixing weight vector $w$ from the categorical distribution $\mathrm{Cat}(v_1, \ldots, v_K)$.
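Test-time selection according to Eq. (7) amounts to a softmax over $-1/\mathcal{L}^i_{VAE}(x)$ followed by categorical sampling; a sketch reusing the `experts` list assumed earlier:

```python
import torch
import torch.distributions as dist

def select_expert(experts, x):
    """Pick a single expert for input x at test time, Eq. (7)."""
    with torch.no_grad():
        elbos = torch.stack([expert.elbo(x).mean() for expert in experts])
    # Since the ELBOs are negative, -1/L approaches 0 from above as the
    # bound improves, so the softmax favours the best-fitting expert.
    v = torch.softmax(-1.0 / elbos, dim=0)          # Eq. (7)
    j = int(dist.Categorical(v).sample())
    return j, experts[j]
```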
The structure of the proposed L-MVAE model is shown
in Fig. 1. In the next section we evaluate the convergence
properties of the L-MVAE model during the lifelong learning.
IV. THEORETICAL ANALYSIS OF L-MVAE
In this section, we evaluate the convergence properties of
the proposed L-MVAE model during the lifelong learning. We
evaluate the evolution of the objective function L
L−MV AE
(x)