Lifelong Mixture of Variational Autoencoders
Fei Ye and Adrian G. Bors, Senior Member, IEEE
Department of Computer Science, University of York, York YO10 5GH, UK
E-mail: fy689@york.ac.uk, adrian.bors@york.ac.uk
Abstract—In this paper, we propose an end-to-end lifelong learning mixture of experts. Each expert is implemented by a Variational Autoencoder (VAE). The experts in the mixture system are jointly trained by maximizing a mixture of individual component evidence lower bounds (MELBO) on the log-likelihood of the given training samples. The mixing coefficients in the mixture model control the contribution of each expert to the global representation. These are sampled from a Dirichlet distribution whose parameters are determined through non-parametric estimation during lifelong learning. The model learns new tasks quickly when they are similar to those previously learnt. The proposed Lifelong Mixture of VAEs (L-MVAE) expands its architecture with new components when learning a completely new task. After training, our model can automatically determine the relevant expert to be used when fed with new data samples. This mechanism benefits both memory efficiency and computational cost, since only one expert is used during inference. The L-MVAE inference model is able to perform interpolations in the joint latent space across the data domains associated with different tasks and is shown to be efficient for disentangled representation learning. The code is available at: https://github.com/dtuzi123/LifelongMixtureVAEs
Index Terms—Lifelong learning, Mixture of Variational Autoencoders, Multi-task learning, Mixture of Evidence Lower Bounds, Disentangled representations.
I. INTRODUCTION
Deep learning models suffer from catastrophic forgetting [1]
when trained on multiple databases in a sequential manner.
A deep learning model quickly forgets the characteristics of
the previously learned experiences while adjusting to learning
new information. The ability of artificial learning systems to continuously acquire, preserve and transfer skills and knowledge throughout their lifespan is called lifelong learning [1]. Existing approaches either use dynamic architectures, adopt regularization during training, or employ generative replay mechanisms. Dynamic architecture approaches [2], [3], [4], [5], [6] increase the network capacity by adding new layers and processing units in order to adapt the network's architecture to the acquisition of new information. However, such approaches require a specific architecture design, and their number of parameters grows progressively with the number of tasks. Regularization approaches [7], [8], [9], [10], [11] impose a penalty when updating the network's parameters in order to preserve the knowledge associated with previously learned tasks. In practice, these approaches suffer from performance degradation when learning a series of tasks whose datasets are entirely different from the previously learned ones. Memory-based methods use a buffer to store previously learned data samples [12], [13], or utilize powerful generative networks, such as Variational Autoencoders (VAEs) [14], [15], [16], [17] or Generative Adversarial Networks (GANs) [18], [19], as memory-based replay networks that reproduce and generate data consistent with what has been seen and learned before. These approaches need additional memory storage for recording the parameters used to generate the data, while their performance on the previously learned tasks heavily depends on the generator's ability to realistically replicate data.
Promising results have been achieved on prediction tasks [6], [7], [20], [21], [22], [23]. However, these methods do not capture the underlying structure of the data, which prevents them from being applied in a wide range of applications. There are very few attempts addressing representation learning under the lifelong setting [15], [16]. The performance of these methods degrades significantly during lifelong training on datasets containing complex images or on a long sequence of tasks. The reason is that these approaches require retraining their generators on artificially generated data. Meanwhile, the performance loss on each dataset accumulates during the lifelong learning of a sequence of several tasks.
To address this problem, we propose a probabilistic mixture
of experts model, where each expert infers a probabilistic
representation of a given task. A Dirichlet sampling process defines the likelihood of a certain expert being activated when presented with a new task.
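As a schematic illustration, using notation introduced here only for this sketch (the formal model and notation are defined in Section III), the joint objective weighs the experts' individual evidence lower bounds by the Dirichlet-sampled mixing coefficients:
\[
\mathcal{L}_{\mathrm{MELBO}}(\mathbf{x}) \;=\; \sum_{i=1}^{K} \pi_i \, \mathcal{L}^{i}_{\mathrm{ELBO}}(\mathbf{x}), \qquad \boldsymbol{\pi} \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K),
\]
where \(K\) is the number of VAE experts and \(\mathcal{L}^{i}_{\mathrm{ELBO}}\) is the evidence lower bound of the \(i\)-th expert; a Dirichlet draw concentrated on a single component effectively activates one expert and drops out the others.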
This paper has the following contributions:
• A novel mixture learning model, called the Lifelong Mixture of VAEs (L-MVAE). Instead of capturing different characteristics of a database, as in other mixture models [24], [25], [26], [27], the proposed mixture model automatically embeds the knowledge associated with each database into a distinct latent space, modelled by one of the mixture's experts, during lifelong learning.
• A training algorithm based on the maximization of the mixture of individual component evidence lower bounds (MELBO); a schematic training step illustrating this objective is sketched after this list.
• A mixing-coefficient sampling process is introduced in order to activate or drop out experts in L-MVAE. Besides defining an adaptive architecture, this procedure accelerates the learning process of new tasks while overcoming the forgetting of the previously learned tasks.
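To make this concrete, the following PyTorch sketch, written here only for illustration (it is not taken from the released code), shows one such training step: each expert's evidence lower bound is computed, and the per-expert ELBOs are weighted by mixing coefficients sampled from a Dirichlet distribution, so that a near-one-hot concentration vector activates a single expert and effectively drops out the others. All names (SmallVAE, elbo, alpha) and the numerical values are hypothetical.

# Illustrative sketch: one MELBO training step for a mixture of K VAE experts
# with Dirichlet-sampled mixing coefficients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        recon = -F.binary_cross_entropy_with_logits(self.dec(z), x,
                                                    reduction='none').sum(-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return recon - kl                                       # per-sample ELBO

K, x_dim = 3, 784
experts = nn.ModuleList([SmallVAE(x_dim) for _ in range(K)])
optim = torch.optim.Adam(experts.parameters(), lr=1e-3)

# Mixing coefficients pi drawn from a Dirichlet; a near-one-hot concentration
# vector activates one expert and drops out the others.
alpha = torch.tensor([10.0, 0.1, 0.1])                          # hypothetical values
pi = torch.distributions.Dirichlet(alpha).sample()

x = torch.rand(64, x_dim)                                       # stand-in batch
melbo = sum(pi[i] * experts[i].elbo(x).mean() for i in range(K))
(-melbo).backward()                                             # maximise MELBO
optim.step()
optim.zero_grad()

In the actual L-MVAE, the Dirichlet concentration parameters are obtained through non-parametric estimation during lifelong learning rather than being fixed by hand, and the mixture is expanded with new experts when a completely new task is encountered (Section VI).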
The remainder of the paper contains a detailed overview of the existing state of the art in Section II, while the proposed L-MVAE model is discussed in Section III. In Section IV we discuss the theory behind the proposed L-MVAE model, and in Section V we explain how the proposed methodology can be used in unsupervised, supervised and semi-supervised learning applications. The expansion mechanism for the model's architecture is presented in Section VI. The experimental results