Lifelong Mixture of Variational Autoencoders
Fei Ye and Adrian G. Bors, Senior Member, IEEE
Department of Computer Science, University of York, York YO10 5GH, UK
E-mail: fy689@york.ac.uk, adrian.bors@york.ac.uk
Abstract—In this paper, we propose an end-to-end lifelong
learning mixture of experts. Each expert is implemented by
a Variational Autoencoder (VAE). The experts in the mixture
system are jointly trained by maximizing a mixture of indi-
vidual component evidence lower bounds (MELBO) on the log-
likelihood of the given training samples. The mixing coefficients in the mixture model control the contribution of each expert to the global representation. These are sampled from a Dirichlet distribution whose parameters are determined through non-parametric estimation during lifelong learning. The model can learn new tasks quickly when these are similar to those previously learnt. The proposed Lifelong Mixture of VAEs (L-MVAE) expands its architecture with new components when learning a completely new task. After training, our model can automatically determine the relevant expert to be used when fed with new data samples. This mechanism reduces both the memory requirements and the computational cost, as only one expert is used during inference. The L-MVAE inference model is able to perform interpolations in the joint latent space across the data domains associated with different tasks and is shown to be efficient for disentangled representation learning. The code is available at: https://github.com/dtuzi123/LifelongMixtureVAEs
Index Terms—Lifelong learning, Mixture of Variational Au-
toencoders, Multi-task learning, Mixture of Evidence Lower
Bounds, Disentangled representations.
I. INTRODUCTION
Deep learning models suffer from catastrophic forgetting [1]
when trained on multiple databases in a sequential manner.
A deep learning model quickly forgets the characteristics of
the previously learned experiences while adjusting to learning
new information. The ability of artificial learning systems to continuously acquire, preserve and transfer skills and knowledge throughout their lifespan is called lifelong learning
[1]. Existing approaches would either use dynamic architec-
tures, adopt regularization during training, or employ gen-
erative replay mechanisms. Dynamic architecture approaches
[2], [3], [4], [5], [6] would increase the network capacity by
adding new layers and processing units in order to adapt the
network’s architecture to acquiring new information. However,
such approaches would require a specific architecture design
while their parameters would increase progressively with the
number of tasks. Regularization approaches [7], [8], [9], [10],
[11] aim to impose a penalty when updating the network's
parameters in order to preserve the knowledge associated
with previously learned tasks. In practice, these approaches
suffer from performance degradation when learning a series
of tasks where the datasets are entirely different from the
previously learned ones. Memory-based methods use a buffer
in order to upload previously learned data samples [12], [13],
or utilize powerful generative networks such as a Variational
Autoencoders (VAEs) [14], [15], [16], [17] or Generative
Adversarial Networks (GANs) [18], [19] as memory-based replay networks that reproduce and generate data consistent with what has been seen and learned before. These approaches require additional memory storage for recording the parameters used for data generation, while their performance on the previously learned tasks is heavily dependent on the generator's ability to realistically replicate data.
Promising results have been achieved on prediction tasks
[6], [7], [20], [21], [22], [23]. However, these methods do not
capture the underlying structure behind the data, which pre-
vents them from being applied in a wide range of applications.
There are very few attempts addressing representation learning
under the lifelong setting [15], [16]. The performance of these
methods degrades significantly when engaging in the lifelong
training with datasets containing complex images or on a long
sequence of tasks. The reason is that these approaches require retraining their generators on artificially generated data. Meanwhile, the performance loss on each dataset accumulates during the lifelong learning of a sequence of several tasks.
To address this problem, we propose a probabilistic mixture
of experts model, where each expert infers a probabilistic
representation of a given task. A Dirichlet sampling process
defines the likelihood of a certain expert to be activated when
presented with a new task.
This paper has the following contributions:
• A novel mixture learning model, called Lifelong Mixture
of VAEs (L-MVAE). Instead of capturing different char-
acteristics of a database as in other mixture models [24],
[25], [26], [27], the proposed mixture model automatically embeds the knowledge associated with each database into a distinct latent space, modelled by one of the mixture's experts, during the lifelong learning.
• A training algorithm based on the maximization of the
mixture of individual component evidence lower bounds
(MELBO).
• A mixing-coefficient sampling process is introduced in
order to activate or drop out experts in L-MVAE. Besides
defining an adaptive architecture, this procedure acceler-
ates the learning process of new tasks while overcoming
the forgetting of the previously learned tasks.
The remainder of the paper contains a detailed overview of
the existing state of the art in Section II, while the proposed
L-MVAE model is discussed in Section III. In Section IV we
discuss the theory behind the proposed L-MVAE model and in
Section V we explain how the proposed methodology can be
used in unsupervised, supervised and semi-supervised learning
applications. The expansion mechanism for the model’s archi-
tecture is presented in Section VI. The experimental results
are analyzed in Section VII while the conclusions are drawn
in Section VIII.
II. RELATED RESEARCH STUDIES
A variational autoencoder (VAE) [28] is made up of two
networks, an encoder and a decoder. Given a data set, the
encoder extracts a latent vector z, and the decoder aims
to reconstruct the given data from the latent vectors. A
number of research works have been developed for capturing
meaningful and disentangled data representations by using the
VAE framework [29], [30], [31], [32], [33]. These approaches
show promising results on achieving disentanglement between
latent variables as well as interpretable visual results, where
specific properties of the scene can be manipulated through
changing the relevant latent variables. However, these models
work well only on data samples drawn from a single domain,
corresponding to a specific database used for training. When
they are re-trained on a different database, their parameters
are updated and then they fail to perform on the tasks learned
previously. This happens because they do not have appropriate
objective functions to deal with catastrophic forgetting [9],
[34], [35].
Recently, there have been some attempts to learn cross-
domain representations under the lifelong learning by intro-
ducing an environment-dependent mask that specifies a subset
of generative factors [16], by proposing a Teacher-Student
lifelong learning framework [15], [36] or a hybrid model [37]
of Generative Adversarial Nets (GANs) [38] and VAE. The
models proposed in [15], [16], [37] are based on Generative
Replay Mechanisms (GRM) aiming to overcome forgetting.
However, these methods suffer from poor performance when
considering complex data.
Aljundi et al. [39] proposed a lifelong learning system
named the Expert Gate model, where new experts are added
to a network of experts. The most relevant expert from the
given set is chosen during the testing stage, according to
the reconstruction error of the data. However, this may not
necessarily correspond to the best log-likelihood estimate for
the data. Moreover, the Expert Gate model was used only for
supervised classification tasks.
Regularization based approaches alleviate catastrophic for-
getting by adding an auxiliary term that penalizes changes in
the weights when the model is trained on a new task [6], [7],
[8], [9], [10], [11], [35], [40], [41], [42], or by storing past samples
to regulate the optimization [20], [43]. However, regularization
based approaches have huge computation requirements when
the number of tasks increases [44].
In another direction of research, mixtures of VAEs have
been employed for continuous learning [24], [25], [26], [27].
These models are able to capture underlying complex struc-
tures behind data and therefore perform well on many down-
stream tasks including clustering and semi-supervised classifi-
cation. However, these mixture models would only capture
characteristics of a single database which had been split
into batches of data, and tend to forget previously learned
data characteristics when attempting to learn a sequence of
distinct tasks. In contrast to the above mentioned methods,
our model is able to capture underlying generative latent
variable representations across multiple data domains during
the lifelong learning.
III. THE LIFELONG MIXTURE OF VAES
A. Problem formulation
In this paper we consider a model made up of a mixture
of networks [45] which is able to deal with three different
learning scenarios: supervised, semi-supervised and unsuper-
vised, under the lifelong learning setting. Let us consider a
sequence of tasks and denote $D^{(k)} = \{x_i^{(k)}, y_i^{(k)}\}_{i=1}^{N_k}$ as the dataset characterizing the $k$-th task, where $x_i^{(k)} \in \mathcal{X}^{(k)}$ is the source domain and $y_i^{(k)} \in \mathcal{Y}^{(k)}$ is the target domain, usually defined by class labels, while each domain $\{D^{(i)} \mid i = 1, \ldots, K\}$ is associated with a given task. We aim to
learn a model which not only generates or reconstructs data
but which can also generate meaningful representations useful
for various tasks during a lifelong learning process.
B. Mixture objective function
Traditional mixture models [46], [47] normally capture different characteristics of a dataset by learning several latent variable vectors, with distinct sets of variables associated with each mixture component. In this paper, we implement each expert using a generative latent variable model $p_\theta(x, z) = p_\theta(x|z)p(z)$, where $z \in \mathbb{R}^d$ is the latent variable and $\theta$ represents the decoder's parameters, as in VAEs [28]. The learning goal of the generative model is to maximize the log-likelihood of the data distribution, which is a difficult problem due to the intractability of the marginal distribution $p(x) = \int p_\theta(x|z)p(z)\,dz$, which requires marginalizing over all latent variables. Instead, we optimize the evidence lower bound (ELBO) on the data log-likelihood [28]:

$$\log p(x) \geq \mathbb{E}_{z \sim q_\varepsilon(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\varepsilon(z|x) \,\|\, p(z)] = \mathcal{L}_{VAE,\theta,\varepsilon}(x), \tag{1}$$
where $q_\varepsilon(z|x)$ is the variational distribution and $\varepsilon$ represents the parameters of the encoder. We use Gaussian distributions for both the prior $p(z)$ and the variational distribution $q_\varepsilon(z|x)$. The latent variable $z$ is sampled using the reparametrization trick [28], $z_i = u_i + \delta \otimes \sigma_i$, where $u_i$ and $\sigma_i$ are inferred by the encoder and $\delta$ is sampled from $\mathcal{N}(0, I)$. $p_\theta(x|z)$ is implemented by a decoder with trainable parameters $\theta$, which receives the latent variables $z$ and produces the data reconstructions $x'$.
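To make Eq. (1) concrete, the following is a minimal sketch of how a single VAE expert could compute its per-sample ELBO with the reparametrization trick. The layer sizes, the Bernoulli (logit) reconstruction likelihood and the module name `VAEExpert` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEExpert(nn.Module):
    """One mixture expert: a VAE with Gaussian prior and posterior.
    Layer sizes and the Bernoulli decoder are illustrative assumptions."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # u in the paper
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log(sigma^2)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        """Per-sample ELBO of Eq. (1); x is assumed to lie in [0, 1]."""
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparametrization trick: z = u + delta * sigma, delta ~ N(0, I)
        delta = torch.randn_like(mu)
        z = mu + delta * torch.exp(0.5 * logvar)
        logits = self.dec(z)
        # E_q[log p(x|z)] under a Bernoulli decoder
        log_px_z = -F.binary_cross_entropy_with_logits(
            logits, x, reduction='none').sum(dim=-1)
        # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
        return log_px_z - kl
```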
When considering $K$ experts in the mixture model, we introduce the loss function as the Mixture of individual ELBOs (MELBO) $\mathcal{L}^i_{VAE}(x)$, defined through (1):

$$\mathcal{L}_{L\text{-}MVAE}(x) = \frac{\sum_{i=1}^{K} w_i\, \mathcal{L}^i_{VAE}(x)}{\sum_{i=1}^{K} w_i}, \tag{2}$$

where $w_i$ is the mixing coefficient, which controls the significance of the $i$-th expert. We model all mixing coefficients by using a Dirichlet distribution $\{w_1, \ldots, w_K\} \sim \mathrm{Dir}(a)$ with parameters $a = \{a_1, \ldots, a_K\}$. In the following we describe the mechanism for selecting appropriate L-MVAE components during the training.
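Given a list of such experts and sampled mixing weights, the MELBO objective in Eq. (2) reduces to a weighted average of the individual ELBOs, normalized by the sum of the weights. A minimal sketch, reusing the `VAEExpert.elbo` interface assumed above:

```python
import torch

def melbo_loss(experts, x, w):
    """Negative Mixture of ELBOs, Eq. (2), to be minimized.
    experts: list of K VAEExpert modules; w: tensor of K mixing weights."""
    elbos = torch.stack([expert.elbo(x).mean() for expert in experts])
    return -(w * elbos).sum() / w.sum()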
C. The selection of L-MVAE mixture's components during training
Certain research studies [24], [25] have considered equal
contributions for the components of deep learning mixture
systems. However, in this paper we consider that each mixture component is specialized for a specific task. The selection of a specific mixture component is performed through the mixing weights $w_i$, $i = 1, \ldots, K$. We assume that the weighting probability for each mixture component is drawn from a multinomial distribution (a Bernoulli distribution in the binary case), defined by a Dirichlet prior.
Assignment vector. In the following, we introduce an assignment vector $c$, with each of its entries $c_i \in \{0, 1\}$, $i = 1, \ldots, K$, representing the probability of including or not the $i$-th expert in the mixture. $c_i$ is sampled from a Bernoulli distribution. Before starting the training, we set all entries to $c_i = 0$, $i = 1, \ldots, K$. The assignment probability for each mixing component is calculated considering the sample log-likelihood of each expert after learning each task, as:

$$p(c_j) = 1 - \frac{\exp(-\mathcal{L}^j_{VAE}(x_b)) + u\,c'_j}{\sum_{i=1}^{K}\left[\exp(-\mathcal{L}^i_{VAE}(x_b)) + u\,c'_i\right]}, \tag{3}$$
where $x_b$ is sampled from the data batch drawn from the database corresponding to the current task. $c'_j$ denotes the assignment variable for the $j$-th expert and holds the value resulting from learning the previous task, before Eq. (3) is evaluated. The term $u\,c'_j$ is used to push $p(c_j)$ outside the range of admissible values when $c'_j = 1$, and therefore we consider $u$ to be a large value. Then we find the maximum probability for a mixing component:

$$p(c_{j^*}) = \max\big(p(c_1), \ldots, p(c_K)\big), \tag{4}$$

where $j^*$ represents the index of the VAE component selected according to the parameters learnt during the previous tasks. We then reset the other assignment variables, except for $j^*$:

$$p(c_i) = \begin{cases} 1, & c'_i = 1 \\ 0, & c'_i = 0 \end{cases}, \quad i = 1, 2, \ldots, K,\; i \neq j^*. \tag{5}$$
Since $c'_i$ is the assignment corresponding to the learning process of the previous task, before evaluating Eq. (3) we use Eq. (5) to recover the dropout status of all experts for the current task, except for the $j^*$-th expert, which is dropped out from future training because it is going to be used for recording and reproducing the information associated with the task currently being learnt. When learning the first task, all of the mixture's components are trained; when learning the second task, only $K - 1$ components are trained, while one component is no longer updated because it is considered a depository of the information associated with the first task. This component will consequently be used to generate information consistent with the probabilistic representation of the first task. This process continues so that for the last task at least one VAE is still available for training. The number of mixing components $K$ considered initially should therefore be larger than, or at least equal to, the number of tasks assumed to be learned during the lifelong learning process. In Section VI we describe a mechanism for expanding the mixture. A sketch of this assignment step is given below.
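A minimal sketch of Eqs. (3)-(5), assuming the `VAEExpert` interface from Section III-B; the value of the constant `u` is an assumption, since the paper only requires it to be large:

```python
import torch

def update_assignments(experts, x_b, c_prev, u=1e6):
    """Expert selection for the current task, Eqs. (3)-(5).
    c_prev: float tensor of K entries; c_prev[i] = 1 if expert i is frozen."""
    with torch.no_grad():
        elbos = torch.stack([expert.elbo(x_b).mean() for expert in experts])
    # Eq. (3): frozen experts get the large penalty u * c'_i, pushing their
    # assignment probability out of range so they cannot be selected again.
    scores = torch.exp(-elbos) + u * c_prev
    p_c = 1.0 - scores / scores.sum()
    # Eq. (4): the best-fitting still-trainable expert wins the assignment.
    j_star = int(torch.argmax(p_c))
    # Eq. (5): the remaining experts keep their previous dropout status,
    # while expert j* is dropped from all future training.
    c_new = c_prev.clone()
    c_new[j_star] = 1.0
    return c_new, j_star
```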
The sampling of mixing weights. Suppose that L-MVAE has finished learning the $t$-th task. We collect several batches of samples $\{x_1, \ldots, x_N\}$ from the $(t+1)$-th task, where each $x_i$ represents the $i$-th batch of samples; these are used to evaluate the assignment vector $c$ by using Eq. (3). We calculate the average probability $p(c_j) = \sum_{i=1}^{N} p(c^i_j)/N$, where each $p(c^i_j)$ represents the assignment probability evaluated on batch $x_i$. Then we find $p(c_{j^*})$ by using Eq. (4) and recover the previous assignments, except for $c_{j^*}$, by using Eq. (5). The Dirichlet parameters are calculated in order to fix the mixture components containing the information corresponding to the previously learnt tasks, while keeping the other mixture components available for training on future tasks. For the mixing components we consider:

$$a_i = \begin{cases} e, & c_i = 1 \\ \dfrac{1 - e K'}{K - K'}, & c_i = 0 \end{cases}, \quad i = 1, \ldots, K, \tag{6}$$

where $e$ is a very small positive value and $K'$ represents the number of tasks learnt so far out of a total of $K$ given tasks during the lifelong learning. A small value of the Dirichlet parameter implies that the corresponding mixture component is no longer trained. The mixing weights $w_1, \ldots, w_K$ are sampled from a Dirichlet distribution with parameters $a_1, \ldots, a_K$. We then train the mixture model with $w_1, \ldots, w_K$ by using Eq. (2) when learning the $(t+1)$-th task.
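The Dirichlet parameter assignment in Eq. (6) and the subsequent weight sampling could be sketched as follows; `e = 1e-4` is an assumed instance of the "very small positive value":

```python
import torch
import torch.distributions as dist

def sample_mixing_weights(c, e=1e-4):
    """Dirichlet parameters of Eq. (6) and mixing-weight sampling.
    c: float tensor of K entries; c[i] = 1 for the K' frozen experts."""
    K = c.numel()
    K_prime = int(c.sum())                  # tasks learnt so far (K' < K)
    a = torch.empty(K)
    a[c == 1] = e                           # frozen experts: tiny parameter
    a[c == 0] = (1.0 - e * K_prime) / (K - K_prime)   # free experts
    return dist.Dirichlet(a).sample()       # weights w_1, ..., w_K for Eq. (2)
```

Frozen experts thus receive near-zero mixing weights, so they contribute negligible gradient to Eq. (2) and their parameters are effectively preserved.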
Testing phase. Suppose that after the lifelong learning process we have trained $K$ components. In the testing phase, we perform the selection of a single component to be used for the given data samples. We firstly calculate the selection probabilities $\{v_1, \ldots, v_K\}$ from the log-likelihood of the data sample under each component:

$$v_j = \frac{\exp\!\left(-\frac{1}{\mathcal{L}^j_{VAE}(x)}\right)}{\sum_{i=1}^{K} \exp\!\left(-\frac{1}{\mathcal{L}^i_{VAE}(x)}\right)}, \quad j = 1, \ldots, K. \tag{7}$$

Then we select a component by sampling the mixing weight vector $w$ from the categorical distribution $\mathrm{Cat}(v_1, \ldots, v_K)$.
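Test-time selection according to Eq. (7) amounts to a softmax over $-1/\mathcal{L}^i_{VAE}(x)$ followed by categorical sampling; a sketch reusing the `experts` list assumed earlier:

```python
import torch
import torch.distributions as dist

def select_expert(experts, x):
    """Pick a single expert for input x at test time, Eq. (7)."""
    with torch.no_grad():
        elbos = torch.stack([expert.elbo(x).mean() for expert in experts])
    # Since the ELBOs are negative, -1/L approaches 0 from above as the
    # bound improves, so the softmax favours the best-fitting expert.
    v = torch.softmax(-1.0 / elbos, dim=0)          # Eq. (7)
    j = int(dist.Categorical(v).sample())
    return j, experts[j]
```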
The structure of the proposed L-MVAE model is shown
in Fig. 1. In the next section we evaluate the convergence
properties of the L-MVAE model during the lifelong learning.
IV. THEORETICAL ANALYSIS OF L-MVAE
In this section, we evaluate the convergence properties of
the proposed L-MVAE model during the lifelong learning. We
evaluate the evolution of the objective function L
L−MV AE
(x)