(Krishnan et al., 2015; Fraccaro et al., 2016; Buesing et al., 2018) which relies on jointly training a
filtering posterior and a local smoothing posterior. We demonstrate that on a simple task, this new
inference network and associated lower bound lead to improved likelihood compared to methods
classically used to train deep state-space models.
Following the intuition given by the sequential TD-VAE, we develop the full TD-VAE model, which
learns from temporally extended data by making jumpy predictions into the future. We show it can
be used to train consistent jumpy simulators of complex 3D environments. Finally, we illustrate how
training a filtering posterior leads to the computation of a neural belief state with a good representation of the uncertainty about the state of the environment.
2 MODEL DESIDERATA
2.1 CONSTRUCTION OF A LATENT STATE-SPACE
Autoregressive models. One of the simplest ways to model sequential data $(x_1, \ldots, x_T)$ is to use the chain rule to decompose the joint sequence likelihood as a product of conditional probabilities, i.e. $\log p(x_1, \ldots, x_T) = \sum_t \log p(x_t \mid x_1, \ldots, x_{t-1})$. This formula can be used to train an
autoregressive model of data, by combining an RNN which aggregates information from the past (recursively computing an internal state $h_t = f(h_{t-1}, x_t)$) with a conditional generative model which can score the data $x_t$ given the context $h_{t-1}$. This idea is used in handwriting synthesis (Graves,
2013), density estimation (Uria et al., 2016), image synthesis (van den Oord et al., 2016b), audio
synthesis (van den Oord et al., 2017), video synthesis (Kalchbrenner et al., 2016), generative recall
tasks (Gemici et al., 2017), and environment modeling (Oh et al., 2015; Chiappa et al., 2017).
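For concreteness, the following is a minimal sketch of this recipe, written in PyTorch with a GRU as the aggregator and a conditional Gaussian output model; the framework, the Gaussian observation model, and all layer sizes are illustrative assumptions, not the architecture of any of the cited works.

```python
# Minimal autoregressive sequence model: an RNN folds each observation into a
# context h_t = f(h_{t-1}, x_t), and a conditional Gaussian scores x_t given
# the previous context h_{t-1}. All architectural choices are illustrative.
import torch
import torch.nn as nn


class AutoregressiveModel(nn.Module):
    def __init__(self, x_dim=8, h_dim=64):
        super().__init__()
        self.rnn = nn.GRUCell(x_dim, h_dim)        # h_t = f(h_{t-1}, x_t)
        self.out = nn.Linear(h_dim, 2 * x_dim)     # mean and log-scale of p(x_t | h_{t-1})
        self.h0 = nn.Parameter(torch.zeros(h_dim))

    def log_prob(self, x):
        """x: (T, B, x_dim). Returns sum_t log p(x_t | x_1, ..., x_{t-1}) per batch element."""
        T, B, _ = x.shape
        h = self.h0.repeat(B, 1)
        total = torch.zeros(B)
        for t in range(T):
            mean, log_scale = self.out(h).chunk(2, dim=-1)
            total = total + torch.distributions.Normal(mean, log_scale.exp()).log_prob(x[t]).sum(-1)
            h = self.rnn(x[t], h)                  # fold x_t into the context
        return total


# Training maximizes the exact chain-rule log-likelihood, e.g. (x_batch is a
# hypothetical (T, B, x_dim) tensor of observations):
# loss = -AutoregressiveModel().log_prob(x_batch).mean()
```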
While these models are conceptually simple and easy to train, one potential weakness is that they only make predictions in the original observation space and do not learn a compressed representation of the data. As a result, these models tend to be computationally heavy (for video prediction, they constantly decode and re-encode single video frames). Furthermore, the model can be unstable at test time: it is trained as a next-step model (with the RNN encoding real data), but at test time it feeds its own predictions back into the RNN. Various methods have been used to alleviate this issue (Bengio et al., 2015; Lamb et al., 2016; Goyal et al., 2017; Amos et al., 2018).
State-space models. An alternative to autoregressive models is given by models that operate at a higher level of abstraction and use latent variables to model stochastic transitions between states (grounded by observation-level predictions). This makes it possible to sample state-to-state transitions only, without needing to render the observations, which can be faster and conceptually more appealing. These models
generally consist of decoder or prior networks, which detail the generative process of states and
observations, and encoder or posterior networks, which estimate the distribution of latents given the
observed data. There is a large amount of recent work on this type of model, differing in the
precise wiring of model components (Bayer & Osendorfer, 2014; Chung et al., 2015; Krishnan et al.,
2015; Archer et al., 2015; Fraccaro et al., 2016; Liu et al., 2017; Serban et al., 2017; Buesing et al.,
2018; Lee et al., 2018; Ha & Schmidhuber, 2018).
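As a minimal sketch of the decoder/prior side of such a model (in PyTorch; the Gaussian transition and decoder networks and all sizes are illustrative assumptions, not the architecture of any cited work), a latent trajectory can be rolled forward state-to-state without ever rendering observations:

```python
# Minimal latent state-space generative model: a stochastic transition
# p(z_t | z_{t-1}) and an observation decoder p(x_t | z_t). A trajectory can be
# sampled entirely in latent space; observations are decoded only when needed.
# Distributions, networks, and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class StateSpacePrior(nn.Module):
    def __init__(self, z_dim=16, x_dim=8, h_dim=64):
        super().__init__()
        self.trans = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                   nn.Linear(h_dim, 2 * z_dim))  # p(z_t | z_{t-1})
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, 2 * x_dim))    # p(x_t | z_t)

    @staticmethod
    def _gaussian(params):
        mean, log_scale = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_scale.exp())

    def rollout(self, z0, steps):
        """Sample z_1, ..., z_steps from the prior without rendering observations."""
        z, traj = z0, []
        for _ in range(steps):
            z = self._gaussian(self.trans(z)).sample()
            traj.append(z)
        return torch.stack(traj)

    def decode(self, z):
        """Render an observation distribution p(x | z) for a chosen state."""
        return self._gaussian(self.dec(z))
```

Only `rollout` is needed to simulate forward in state space; `decode` is called only when an actual observation is required.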
Let $z = (z_1, \ldots, z_T)$ be a state sequence and $x = (x_1, \ldots, x_T)$ an observation sequence. We assume a general form of state-space model, where the joint state and observation likelihood can be written as $p(x, z) = \prod_t p(z_t \mid z_{t-1}) \, p(x_t \mid z_t)$.¹ These models are commonly trained with a VAE-inspired bound, by computing a posterior $q(z \mid x)$ over the states given the observations. Often, the posterior is decomposed autoregressively: $q(z \mid x) = \prod_t q(z_t \mid z_{t-1}, \phi_t(x))$, where $\phi_t$ is a function of $(x_1, \ldots, x_t)$ for filtering posteriors or of the entire sequence $x$ for smoothing posteriors. This leads to the following lower bound:
$$\log p(x) \;\geq\; \mathbb{E}_{z \sim q(z \mid x)} \Big[ \sum_t \log p(x_t \mid z_t) + \log p(z_t \mid z_{t-1}) - \log q(z_t \mid z_{t-1}, \phi_t(x)) \Big]. \tag{1}$$
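The sketch below gives a one-sample Monte Carlo estimate of this bound with a filtering posterior, taking $\phi_t(x)$ to be the state of an RNN run over $x_1, \ldots, x_t$ (one common choice); the PyTorch framework, the specific networks, the Gaussian distributions, and all sizes are illustrative assumptions rather than any model evaluated in the paper.

```python
# One-sample Monte Carlo estimate of the bound in Equation (1), with a
# filtering posterior q(z_t | z_{t-1}, phi_t(x)) where phi_t(x) is the state of
# an RNN run over x_1, ..., x_t. Networks, distributions, and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn


def gaussian(params):
    mean, log_scale = params.chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_scale.exp())


class FilteringELBO(nn.Module):
    def __init__(self, x_dim=8, z_dim=16, h_dim=64):
        super().__init__()
        self.phi = nn.GRUCell(x_dim, h_dim)              # phi_t(x) summarizes x_1, ..., x_t
        self.prior = nn.Linear(z_dim, 2 * z_dim)         # p(z_t | z_{t-1})
        self.post = nn.Linear(z_dim + h_dim, 2 * z_dim)  # q(z_t | z_{t-1}, phi_t(x))
        self.dec = nn.Linear(z_dim, 2 * x_dim)           # p(x_t | z_t)
        self.z0 = nn.Parameter(torch.zeros(z_dim))       # so p(z_1 | z_0) plays the role of p(z_1)
        self.h0 = nn.Parameter(torch.zeros(h_dim))

    def elbo(self, x):
        """x: (T, B, x_dim). Returns the Equation (1) bound per batch element."""
        T, B, _ = x.shape
        h, z_prev = self.h0.repeat(B, 1), self.z0.repeat(B, 1)
        bound = torch.zeros(B)
        for t in range(T):
            h = self.phi(x[t], h)                        # filtering statistic phi_t(x)
            q_t = gaussian(self.post(torch.cat([z_prev, h], -1)))
            z_t = q_t.rsample()                          # reparameterized sample
            bound = bound + (gaussian(self.dec(z_t)).log_prob(x[t]).sum(-1)        # log p(x_t | z_t)
                             + gaussian(self.prior(z_prev)).log_prob(z_t).sum(-1)  # log p(z_t | z_{t-1})
                             - q_t.log_prob(z_t).sum(-1))                          # - log q(z_t | z_{t-1}, phi_t(x))
            z_prev = z_t
        return bound


# Training maximizes the bound, e.g. (x_batch is a hypothetical (T, B, x_dim) tensor):
# loss = -FilteringELBO().elbo(x_batch).mean()
```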
¹ For notational simplicity, $p(z_1 \mid z_0) = p(z_1)$. Also note the conditional distributions could be very complex, using additional latent variables, flow models, or implicit models (for instance, if a deterministic RNN with stochastic inputs is used in the decoder).