Published as a conference paper at ICLR 2019
LEARNING DEEP REPRESENTATIONS BY MUTUAL INFORMATION ESTIMATION AND MAXIMIZATION
R Devon Hjelm
MSR Montreal, MILA, UdeM, IVADO
devon.hjelm@microsoft.com
Alex Fedorov
MRN, UNM
Samuel Lavoie-Marchildon
MILA, UdeM
Karan Grewal
U Toronto
Phil Bachman
MSR Montreal
Adam Trischler
MSR Montreal
Yoshua Bengio
MILA, UdeM, IVADO, CIFAR
ABSTRACT
This work investigates unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks with some standard architectures. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation learning objectives for specific end-goals.
1 INTRODUCTION
One core objective of deep learning is to discover useful representations, and the simple idea explored
here is to train a representation-learning function, i.e. an encoder, to maximize the mutual information
(MI) between its inputs and outputs. MI is notoriously difficult to compute, particularly in continuous
and high-dimensional settings. Fortunately, recent advances enable effective computation of MI
between high dimensional input/output pairs of deep neural networks (Belghazi et al., 2018). We
leverage MI estimation for representation learning and show that, depending on the downstream
task, maximizing MI between the complete input and the encoder output (i.e., global MI) is often
insufficient for learning useful representations. Rather, structure matters: maximizing the average
MI between the representation and local regions of the input (e.g. patches rather than the complete
image) can greatly improve the representation’s quality for, e.g., classification tasks, while global MI
plays a stronger role in the ability to reconstruct the full input given the representation.
Usefulness of a representation is not just a matter of information content: representational characteristics like independence also play an important role (Gretton et al., 2012; Hyvärinen & Oja, 2000; Hinton, 2002; Schmidhuber, 1992; Bengio et al., 2013; Thomas et al., 2017). We combine MI
maximization with prior matching in a manner similar to adversarial autoencoders (AAE, Makhzani
et al., 2015) to constrain representations according to desired statistical properties. This approach is
closely related to the infomax optimization principle (Linsker, 1988; Bell & Sejnowski, 1995), so we
call our method Deep InfoMax (DIM). Our main contributions are the following:
• We formalize Deep InfoMax (DIM), which simultaneously estimates and maximizes the mutual information between input data and learned high-level representations.
• Our mutual information maximization procedure can prioritize global or local information, which we show can be used to tune the suitability of learned representations for classification or reconstruction-style tasks.
• We use adversarial learning (à la Makhzani et al., 2015) to constrain the representation to have desired statistical characteristics specific to a prior.
• We introduce two new measures of representation quality, one based on Mutual Information Neural Estimation (MINE, Belghazi et al., 2018) and a neural dependency measure (NDM) based on the work by Brakel & Bengio (2017), and we use these to bolster our comparison of DIM to different unsupervised methods.
2 RELATED WORK
There are many popular methods for learning representations. Classic methods, such as independent
component analysis (ICA, Bell & Sejnowski, 1995) and self-organizing maps (Kohonen, 1998),
generally lack the representational capacity of deep neural networks. More recent approaches
include deep volume-preserving maps (Dinh et al., 2014; 2016), deep clustering (Xie et al., 2016;
Chang et al., 2017), noise as targets (NAT, Bojanowski & Joulin, 2017), and self-supervised or
co-learning (Doersch & Zisserman, 2017; Dosovitskiy et al., 2016; Sajjadi et al., 2016).
Generative models are also commonly used for building representations (Vincent et al., 2010; Kingma
et al., 2014; Salimans et al., 2016; Rezende et al., 2016; Donahue et al., 2016), and mutual information
(MI) plays an important role in the quality of the representations they learn. In generative models that
rely on reconstruction (e.g., denoising, variational, and adversarial autoencoders, Vincent et al., 2008;
Rifai et al., 2012; Kingma & Welling, 2013; Makhzani et al., 2015), the reconstruction error can be
related to the MI as follows:
$$I_e(X, Y) = H_e(X) - H_e(X|Y) \ge H_e(X) - R_{e,d}(X|Y), \qquad (1)$$

where X and Y denote the input and output of an encoder which is applied to inputs sampled from some source distribution. R_{e,d}(X|Y) denotes the expected reconstruction error of X given the codes Y. H_e(X) and H_e(X|Y) denote the marginal and conditional entropy of X in the distribution formed
by applying the encoder to inputs sampled from the source distribution. Thus, in typical settings,
models with reconstruction-type objectives provide some guarantees on the amount of information
encoded in their intermediate representations. Similar guarantees exist for bi-directional adversarial
models (Dumoulin et al., 2016; Donahue et al., 2016), which adversarially train an encoder / decoder
to match their respective joint distributions or to minimize the reconstruction error (Chen et al., 2016).
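The inequality in Eq. 1 follows from a standard argument; a minimal sketch, assuming the reconstruction error is measured as the expected negative log-likelihood of a decoder q_d:

$$H_e(X|Y) = \mathbb{E}\big[-\log p_e(x \mid y)\big] \;\le\; \mathbb{E}\big[-\log q_d(x \mid y)\big] =: R_{e,d}(X|Y),$$

since the gap between the two sides is the expected KL divergence between p_e(· | y) and q_d(· | y), which is non-negative; substituting this into I_e(X, Y) = H_e(X) − H_e(X|Y) gives the bound.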
Mutual-information estimation
Methods based on mutual information have a long history in
unsupervised feature learning. The infomax principle (Linsker, 1988; Bell & Sejnowski, 1995),
as prescribed for neural networks, advocates maximizing MI between the input and output. This
is the basis of numerous ICA algorithms, which can be nonlinear (Hyvärinen & Pajunen, 1999;
Almeida, 2003) but are often hard to adapt for use with deep networks. Mutual Information Neural
Estimation (MINE, Belghazi et al., 2018) learns an estimate of the MI of continuous variables, is
strongly consistent, and can be used to learn better implicit bi-directional generative models. Deep
InfoMax (DIM) follows MINE in this regard, though we find that the generator is unnecessary.
We also find it unnecessary to use the exact KL-based formulation of MI. For example, a simple
alternative based on the Jensen-Shannon divergence (JSD) is more stable and provides better results.
We will show that DIM can work with various MI estimators. Most significantly, DIM can leverage
local structure in the input to improve the suitability of representations for classification.
Leveraging known structure in the input when designing objectives based on MI maximization is
nothing new (Becker, 1992; 1996; Wiskott & Sejnowski, 2002), and some very recent works also
follow this intuition. It has been shown in the case of discrete MI that data augmentations and other
transformations can be used to avoid degenerate solutions (Hu et al., 2017). Unsupervised clustering
and segmentation is attainable by maximizing the MI between images associated by transforms or
spatial proximity (Ji et al., 2018). Our work investigates the suitability of representations learned
across two different MI objectives that focus on local or global structure, a flexibility we believe is
necessary for training representations intended for different applications.
Proposed independently of DIM, Contrastive Predictive Coding (CPC, Oord et al., 2018) is a MI-
based approach that, like DIM, maximizes MI between global and local representation pairs. CPC
shares some motivations and computations with DIM, but there are important ways in which CPC and
DIM differ. CPC processes local features sequentially to build partial “summary features”, which are
used to make predictions about specific local features in the “future” of each summary feature. This
equates to ordered autoregression over the local features, and requires training separate estimators
for each temporal offset at which one would like to predict the future. In contrast, the basic version
of DIM uses a single summary feature that is a function of all local features, and this “global” feature
predicts all local features simultaneously in a single step using a single estimator. Note that, when
using occlusions during training (see Section 4.3 for details), DIM performs both “self” predictions
and orderless autoregression.
3 DEEP INFOMAX
Figure 1: The base encoder model in the context of image data. An image (in this case) is encoded using a convnet until reaching a feature map of M × M feature vectors corresponding to M × M input patches. These vectors are summarized into a single feature vector, Y. Our goal is to train this network such that useful information about the input is easily extracted from the high-level features.

Figure 2: Deep InfoMax (DIM) with a global MI(X; Y) objective. Here, we pass both the high-level feature vector, Y, and the lower-level M × M feature map (see Figure 1) through a discriminator to get the score. Fake samples are drawn by combining the same feature vector with a M × M feature map from another image.
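As a concrete illustration of the encoder structure described in Figure 1, here is a minimal PyTorch-style sketch; the layer sizes, input resolution, and module names are our own assumptions rather than the architectures used in the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, feat_dim=64, global_dim=64):
        super().__init__()
        # C_psi: convolutional feature extractor producing an M x M map of local feature vectors
        self.C = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # f_psi: summarizes the local feature map into a single global feature vector Y
        self.f = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim * 8 * 8, 512), nn.ReLU(),
            nn.Linear(512, global_dim),
        )

    def forward(self, x):                 # x: (B, 3, 32, 32), e.g. a CIFAR-sized input
        local_map = self.C(x)             # (B, feat_dim, M, M) with M = 8 here
        global_feat = self.f(local_map)   # (B, global_dim): the representation Y
        return local_map, global_feat
```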
Here we outline the general setting of training an encoder to maximize mutual information between
its input and output. Let X and Y be the domain and range of a continuous and (almost everywhere) differentiable parametric function, E_ψ : X → Y with parameters ψ (e.g., a neural network). These parameters define a family of encoders, E_Φ = {E_ψ}_{ψ∈Ψ} over Ψ. Assume that we are given a set of training examples on an input space, X: X := {x^(i) ∈ X}_{i=1}^N, with empirical probability distribution P. We define U_{ψ,P} to be the marginal distribution induced by pushing samples from P through E_ψ. I.e., U_{ψ,P} is the distribution over encodings y ∈ Y produced by sampling observations x ∼ X and then sampling y ∼ E_ψ(x).
An example encoder for image data is given in Figure 1, which will be used in the following sections,
but this approach can easily be adapted for temporal data. Similar to the infomax optimization
principle (Linsker, 1988), we assert our encoder should be trained according to the following criteria:
• Mutual information maximization: Find the set of parameters, ψ, such that the mutual information, I(X; E_ψ(X)), is maximized. Depending on the end-goal, this maximization can be done over the complete input, X, or some structured or “local” subset.
• Statistical constraints: Depending on the end-goal for the representation, the marginal U_{ψ,P} should match a prior distribution, V. Roughly speaking, this can be used to encourage the output of the encoder to have desired characteristics (e.g., independence).
We call the formulation of these two objectives, covered below, Deep InfoMax (DIM).
3.1 MUTUAL INFORMATION ESTIMATION AND MAXIMIZATION
Our basic mutual information maximization framework is presented in Figure 2. The approach
follows Mutual Information Neural Estimation (MINE, Belghazi et al., 2018), which estimates mutual
information by training a classifier to distinguish between samples coming from the joint, J, and the product of marginals, M, of random variables X and Y. MINE uses a lower-bound to the MI based on the Donsker-Varadhan representation (DV, Donsker & Varadhan, 1983) of the KL-divergence,

$$I(X; Y) := D_{KL}(\mathbb{J} \,\|\, \mathbb{M}) \ge \widehat{I}_{\omega}^{(DV)}(X; Y) := \mathbb{E}_{\mathbb{J}}[T_{\omega}(x, y)] - \log \mathbb{E}_{\mathbb{M}}[e^{T_{\omega}(x, y)}], \qquad (2)$$

where T_ω : X × Y → R is a discriminator function modeled by a neural network with parameters ω.

Figure 3: Maximizing mutual information between local features and global features. First we encode the image to a feature map that reflects some structural aspect of the data, e.g. spatial locality, and we further summarize this feature map into a global feature vector (see Figure 1). We then concatenate this feature vector with the lower-level feature map at every location. A score is produced for each local-global pair through an additional function (see the Appendix A.2 for details).
At a high level, we optimize E_ψ by simultaneously estimating and maximizing I(X, E_ψ(X)),

$$(\hat{\omega}, \hat{\psi})_G = \underset{\omega, \psi}{\arg\max}\; \widehat{I}_{\omega}(X; E_{\psi}(X)), \qquad (3)$$
where the subscript G denotes “global” for reasons that will be clear later. However, there are some important differences that distinguish our approach from MINE. First, because the encoder and mutual information estimator are optimizing the same objective and require similar computations, we share layers between these functions, so that E_ψ = f_ψ ◦ C_ψ and T_{ψ,ω} = D_ω ◦ g ◦ (C_ψ, E_ψ),¹ where g is a function that combines the encoder output with the lower layer.

¹ Here we slightly abuse the notation and use ψ for both parts of E_ψ.
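For concreteness, the DV bound of Eq. 2 can be estimated from a batch of paired samples roughly as follows (a minimal sketch under our own assumptions, not code from the paper; T is any scalar-scoring network over (x, y) pairs):

```python
import torch

def dv_mi_lower_bound(T, x, y):
    # Donsker-Varadhan bound of Eq. 2: E_J[T(x, y)] - log E_M[exp(T(x, y))].
    # Joint samples keep the batch pairing; marginal samples shuffle y.
    joint_scores = T(x, y).squeeze()
    y_shuffled = y[torch.randperm(y.size(0))]
    marg_scores = T(x, y_shuffled).squeeze()
    # log of the mean of exp(.), computed stably with logsumexp
    log_mean_exp = torch.logsumexp(marg_scores, dim=0) - torch.log(
        torch.tensor(float(marg_scores.numel())))
    return joint_scores.mean() - log_mean_exp
```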
Second, as we are primarily interested in maximizing MI, and not concerned with its precise value,
we can rely on non-KL divergences which may offer favourable trade-offs. For example, one could
define a Jensen-Shannon MI estimator (following the formulation of Nowozin et al., 2016),
$$\widehat{I}_{\omega,\psi}^{(JSD)}(X; E_{\psi}(X)) := \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}\big(-T_{\psi,\omega}(x, E_{\psi}(x))\big)\big] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\big[\mathrm{sp}\big(T_{\psi,\omega}(x', E_{\psi}(x))\big)\big], \qquad (4)$$
where x is an input sample, x′ is an input sampled from P̃ = P, and sp(z) = log(1 + e^z) is the softplus
function. A similar estimator appeared in Brakel & Bengio (2017) in the context of minimizing the
total correlation, and it amounts to the familiar binary cross-entropy. This is well-understood in terms
of neural network optimization and we find works better in practice (e.g., is more stable) than the
DV-based objective (e.g., see App. A.3). Intuitively, the Jensen-Shannon-based estimator should
behave similarly to the DV-based estimator in Eq. 2, since both act like classifiers whose objectives
maximize the expected
log
-ratio of the joint over the product of marginals. We show in App. A.1 the
relationship between the JSD estimator and the formal definition of mutual information.
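A minimal sketch of the JSD-style objective in Eq. 4, assuming a scoring network T over (input, representation) pairs and using a shuffled batch to obtain samples from the product of marginals:

```python
import torch
import torch.nn.functional as F

def jsd_mi_objective(T, x, y):
    # Eq. 4: E_P[-softplus(-T(x, E(x)))] - E_{P x P~}[softplus(T(x', E(x)))],
    # where y = E(x) and x' is an independent input (here: a shuffled batch).
    pos = T(x, y)
    x_prime = x[torch.randperm(x.size(0))]
    neg = T(x_prime, y)
    return (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
```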
Noise-Contrastive Estimation (NCE, Gutmann & Hyvärinen, 2010; 2012) was first used as a bound
on MI in Oord et al. (and called “infoNCE”, 2018), and this loss can also be used with DIM by
maximizing:
$$\widehat{I}_{\omega,\psi}^{(infoNCE)}(X; E_{\psi}(X)) := \mathbb{E}_{\mathbb{P}}\Big[T_{\psi,\omega}(x, E_{\psi}(x)) - \mathbb{E}_{\tilde{\mathbb{P}}}\Big[\log \sum_{x'} e^{T_{\psi,\omega}(x', E_{\psi}(x))}\Big]\Big]. \qquad (5)$$
For DIM, a key difference between the DV, JSD, and infoNCE formulations is whether an expectation over P/P̃ appears inside or outside of a log. In fact, the JSD-based objective mirrors the original NCE formulation in Gutmann & Hyvärinen (2010), which phrased unnormalized density estimation as binary classification between the data distribution and a noise distribution. DIM sets the noise distribution to the product of marginals over X/Y, and the data distribution to the true joint. The
infoNCE formulation in Eq. 5 follows a softmax-based version of NCE (Jozefowicz et al., 2016),
similar to ones used in the language modeling community (Mnih & Kavukcuoglu, 2013; Mikolov et al.,
1
Here we slightly abuse the notation and use ψ for both parts of E
ψ
.
4
Published as a conference paper at ICLR 2019
2013), and which has strong connections to the binary cross-entropy in the context of noise-contrastive
learning (Ma & Collins, 2018). In practice, implementations of these estimators appear quite similar
and can reuse most of the same code. We investigate JSD and infoNCE in our experiments, and
find that using infoNCE often outperforms JSD on downstream tasks, though this effect diminishes
with more challenging data. However, as we show in the App. (A.3), infoNCE and DV require a
large number of negative samples (samples from P̃) to be competitive. We generate negative samples using all combinations of global and local features at all locations of the relevant feature map, across all images in a batch. For a batch of size B, that gives O(B × M²) negative samples per positive
example, which quickly becomes cumbersome with increasing batch size. We found that DIM with
the JSD loss is insensitive to the number of negative samples, and in fact outperforms infoNCE as the
number of negative samples becomes smaller.
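To make the negative-sampling scheme concrete, here is a rough sketch (our own assumptions, with a simple dot product standing in for the score function T_{ψ,ω}, and global and local features assumed to share the same dimensionality) of an infoNCE-style loss over all global/local combinations in a batch:

```python
import torch
import torch.nn.functional as F

def infonce_local_global(local_map, global_feat):
    # local_map: (B, C, M, M) local features C_psi(x); global_feat: (B, C) E_psi(x).
    # Every global/local combination across the batch serves as a candidate,
    # giving B*M*M candidates (one positive, the rest negatives) per pair.
    B, C, M, _ = local_map.shape
    locals_flat = local_map.permute(0, 2, 3, 1).reshape(B * M * M, C)
    scores = global_feat @ locals_flat.t()                    # (B, B*M*M)
    # index of the k-th local feature of image i in the flattened candidate list
    labels = (torch.arange(B).unsqueeze(1) * M * M
              + torch.arange(M * M).unsqueeze(0))             # (B, M*M)
    loss = 0.0
    for k in range(M * M):                                    # average over locations
        loss = loss + F.cross_entropy(scores, labels[:, k])
    return loss / (M * M)
```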
3.2 LOCAL MUTUAL INFORMATION MAXIMIZATION
The objective in Eq. 3 can be used to maximize MI between input and output, but ultimately this
may be undesirable depending on the task. For example, trivial pixel-level noise is useless for image
classification, so a representation may not benefit from encoding this information (e.g., in zero-shot
learning, transfer learning, etc.). In order to obtain a representation more suitable for classification,
we can instead maximize the average MI between the high-level representation and local patches of
the image. Because the same representation is encouraged to have high MI with all the patches, this
favours encoding aspects of the data that are shared across patches.
Suppose the feature vector is of limited capacity (number of units and range) and assume the encoder
does not support infinite output configurations. For maximizing the MI between the whole input and
the representation, the encoder can pick and choose what type of information in the input is passed
through the encoder, such as noise specific to local patches or pixels. However, if the encoder passes
information specific to only some parts of the input, this does not increase the MI with any of the
other patches that do not contain said noise. This encourages the encoder to prefer information that is
shared across the input, and this hypothesis is supported in our experiments below.
Our local DIM framework is presented in Figure 3. First we encode the input to a feature map, C_ψ(x) := {C_ψ^(i)}_{i=1}^{M×M}, that reflects useful structure in the data (e.g., spatial locality), indexed in this case by i. Next, we summarize this local feature map into a global feature, E_ψ(x) = f_ψ ◦ C_ψ(x).
We then define our MI estimator on global/local pairs, maximizing the average estimated MI:
$$(\hat{\omega}, \hat{\psi})_L = \underset{\omega, \psi}{\arg\max}\; \frac{1}{M^2} \sum_{i=1}^{M^2} \widehat{I}_{\omega,\psi}\big(C_{\psi}^{(i)}(X); E_{\psi}(X)\big). \qquad (6)$$
We found success optimizing this “local” objective with multiple easy-to-implement architectures,
and further implementation details are provided in the App. (A.2).
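One easy-to-implement instantiation of Eq. 6, sketched below under our own assumptions, broadcasts the global feature to every location, scores each local-global pair, and applies the JSD-style objective of Eq. 4 averaged over locations (as in Figure 3); score_fn is a stand-in for T_{ψ,ω}:

```python
import torch
import torch.nn.functional as F

def local_dim_jsd(score_fn, local_map, global_feat):
    # local_map: (B, C, M, M) = C_psi(x); global_feat: (B, D) = E_psi(x).
    # score_fn: e.g. 1x1 convolutions over the concatenation of the broadcast
    # global vector and the local map, returning one score per location.
    B, C, M, _ = local_map.shape
    g = global_feat.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, M, M)
    pos = score_fn(torch.cat([g, local_map], dim=1))           # (B, 1, M, M)
    shuffled = local_map[torch.randperm(B)]                    # locals from other images
    neg = score_fn(torch.cat([g, shuffled], dim=1))
    # Eq. 4 per location, averaged over the M*M locations as in Eq. 6
    return (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
```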
3.3 MATCHING REPRESENTATIONS TO A PRIOR DISTRIBUTION
Absolute magnitude of information is only one desirable property of a representation; depending on
the application, good representations can be compact (Gretton et al., 2012), independent (Hyvärinen & Oja, 2000; Hinton, 2002; Dinh et al., 2014; Brakel & Bengio, 2017), disentangled (Schmidhuber, 1992; Rifai et al., 2012; Bengio et al., 2013; Chen et al., 2018; Gonzalez-Garcia et al., 2018), or independently controllable (Thomas et al., 2017). DIM imposes statistical constraints onto learned representations by implicitly training the encoder so that the push-forward distribution, U_{ψ,P}, matches a prior, V. This is done (see Figure 7 in the App. A.2) by training a discriminator, D_φ : Y → R, to estimate the divergence, D(V‖U_{ψ,P}), then training the encoder to minimize this estimate:

$$(\hat{\omega}, \hat{\psi})_P = \underset{\psi}{\arg\min}\; \underset{\phi}{\arg\max}\; \widehat{D}_{\phi}(\mathbb{V} \,\|\, \mathbb{U}_{\psi,\mathbb{P}}) = \mathbb{E}_{\mathbb{V}}[\log D_{\phi}(y)] + \mathbb{E}_{\mathbb{P}}[\log(1 - D_{\phi}(E_{\psi}(x)))]. \qquad (7)$$
This approach is similar to what is done in adversarial autoencoders (AAE, Makhzani et al., 2015),
but without a generator. It is also similar to noise as targets (Bojanowski & Joulin, 2017), but trains
the encoder to match the noise implicitly rather than using a priori noise samples as targets.
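A rough sketch of the prior-matching objective in Eq. 7 (our own assumptions; the non-saturating encoder loss below is a common GAN training choice rather than a literal minimization of the estimate):

```python
import torch
import torch.nn.functional as F

def prior_matching_losses(D_phi, codes, prior_samples):
    # D_phi: discriminator over codes y, returning a logit ("real" = drawn from the prior V).
    # codes: E_psi(x) for a batch of inputs (samples from U_{psi,P}).
    # prior_samples: y ~ V, e.g. uniform or Gaussian noise with the same shape as codes.
    real = torch.ones(prior_samples.size(0), 1)
    fake = torch.zeros(codes.size(0), 1)
    # Discriminator step: estimate the divergence of Eq. 7
    d_loss = (F.binary_cross_entropy_with_logits(D_phi(prior_samples), real)
              + F.binary_cross_entropy_with_logits(D_phi(codes.detach()), fake))
    # Encoder step: push U_{psi,P} toward V by fooling the discriminator
    e_loss = F.binary_cross_entropy_with_logits(
        D_phi(codes), torch.ones(codes.size(0), 1))
    return d_loss, e_loss
```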