Representation Learning: A Review and New Perspectives
Yoshua Bengio†, Aaron Courville, and Pascal Vincent†
Department of computer science and operations research, U. Montreal
† also, Canadian Institute for Advanced Research (CIFAR)
Abstract—The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
Index Terms—Deep learning, representation learning, feature learning,
unsupervised learning, Boltzmann Machine, autoencoder, neural nets
1 INTRODUCTION
The performance of machine learning methods is heavily
dependent on the choice of data representation (or features)
on which they are applied. For that reason, much of the actual
effort in deploying machine learning algorithms goes into the
design of preprocessing pipelines and data transformations that
result in a representation of the data that can support effective
machine learning. Such feature engineering is important but
labor-intensive and highlights the weakness of current learning
algorithms: their inability to extract and organize the discriminative information from the data. Feature engineering is a way
to take advantage of human ingenuity and prior knowledge to
compensate for that weakness. In order to expand the scope
and ease of applicability of machine learning, it would be
highly desirable to make learning algorithms less dependent
on feature engineering, so that novel applications could be
constructed faster, and more importantly, to make progress
towards Artificial Intelligence (AI). An AI must fundamentally
understand the world around us, and we argue that this can
only be achieved if it can learn to identify and disentangle the
underlying explanatory factors hidden in the observed milieu
of low-level sensory data.
This paper is about representation learning, i.e., learning
representations of the data that make it easier to extract useful
information when building classifiers or other predictors. In
the case of probabilistic models, a good representation is often
one that captures the posterior distribution of the underlying
explanatory factors for the observed input. A good representation is also one that is useful as input to a supervised predictor.
Among the various ways of learning representations, this paper
focuses on deep learning methods: those that are formed by
the composition of multiple non-linear transformations, with
the goal of yielding more abstract – and ultimately more useful
– representations. Here we survey this rapidly developing area
with special emphasis on recent progress. We consider some
of the fundamental questions that have been driving research
in this area. Specifically, what makes one representation better
than another? Given an example, how should we compute its
representation, i.e. perform feature extraction? Also, what are
appropriate objectives for learning good representations?
2 WHY SHOULD WE CARE ABOUT LEARNING
REPRESENTATIONS?
Representation learning has become a field in itself in the
machine learning community, with regular workshops at the
leading conferences such as NIPS and ICML, and a new
conference dedicated to it, ICLR¹, sometimes under the header
of Deep Learning or Feature Learning. Although depth is an
important part of the story, many other priors are interesting
and can be conveniently captured when the problem is cast as
one of learning a representation, as discussed in the next section. The rapid increase in scientific activity on representation
learning has been accompanied and nourished by a remarkable
string of empirical successes both in academia and in industry.
Below, we briefly highlight some of these high points.
Speech Recognition and Signal Processing
Speech was one of the early applications of neural networks,
in particular convolutional (or time-delay) neural networks².
The recent revival of interest in neural networks, deep learning,
and representation learning has had a strong impact in the
area of speech recognition, with breakthrough results (Dahl
et al., 2010; Deng et al., 2010; Seide et al., 2011a; Mohamed
et al., 2012; Dahl et al., 2012; Hinton et al., 2012) obtained
by several academics as well as researchers at industrial labs
bringing these algorithms to a larger scale and into products.
For example, Microsoft has released in 2012 a new version
of their MAVIS (Microsoft Audio Video Indexing Service)
1. International Conference on Learning Representations
2. See Bengio (1993) for a review of early work in this area.
speech system based on deep learning (Seide et al., 2011a).
These authors managed to reduce the word error rate on
four major benchmarks by about 30% (e.g. from 27.4% to
18.5% on RT03S) compared to state-of-the-art models based
on Gaussian mixtures for the acoustic modeling and trained on
the same amount of data (309 hours of speech). The relative
improvement in error rate obtained by Dahl et al. (2012) on a
smaller large-vocabulary speech recognition benchmark (Bing
mobile business search dataset, with 40 hours of speech) is
between 16% and 23%.
Representation-learning algorithms have also been applied
to music, substantially beating the state-of-the-art in polyphonic transcription (Boulanger-Lewandowski et al., 2012),
with relative error improvement between 5% and 30% on a
standard benchmark of 4 datasets. Deep learning also helped
to win MIREX (Music Information Retrieval) competitions,
e.g. in 2011 on audio tagging (Hamel et al., 2011).
Object Recognition
The beginnings of deep learning in 2006 have focused on
the MNIST digit image classification problem (Hinton et al.,
2006; Bengio et al., 2007), breaking the supremacy of SVMs
(1.4% error) on this dataset³. The latest records are still held
by deep networks: Ciresan et al. (2012) currently claims the
title of state-of-the-art for the unconstrained version of the task
(e.g., using a convolutional architecture), with 0.27% error,
and Rifai et al. (2011c) is state-of-the-art for the knowledge-
free version of MNIST, with 0.81% error.
In the last few years, deep learning has moved from
digits to object recognition in natural images, and the latest
breakthrough has been achieved on the ImageNet dataset⁴, bringing down the state-of-the-art error rate from 26.1% to
15.3% (Krizhevsky et al., 2012).
Natural Language Processing
Besides speech recognition, there are many other Natural
Language Processing (NLP) applications of representation
learning. Distributed representations for symbolic data were
introduced by Hinton (1986), and first developed in the
context of statistical language modeling by Bengio et al.
(2003) in so-called neural net language models (Bengio,
2008). They are all based on learning a distributed repre-
sentation for each word, called a word embedding. Adding a
convolutional architecture, Collobert et al. (2011) developed
the SENNA system⁵ that shares representations across the
tasks of language modeling, part-of-speech tagging, chunking,
named entity recognition, semantic role labeling and syntactic
parsing. SENNA approaches or surpasses the state-of-the-art
on these tasks but is simpler and much faster than traditional
predictors. Learning word embeddings can be combined with
learning image representations in a way that allows one to associate
text and images. This approach has been used successfully to
build Google’s image search, exploiting huge quantities of data
to map images and queries in the same space (Weston et al.,
3. for the knowledge-free version of the task, where no image-specific prior
is used, such as image deformations or convolutions
4. The 1000-class ImageNet benchmark, whose results are detailed here:
http://www.image-net.org/challenges/LSVRC/2012/results.html
5. downloadable from http://ml.nec-labs.com/senna/
2010) and it has recently been extended to deeper multi-modal
representations (Srivastava and Salakhutdinov, 2012).
The neural net language model was also improved by
adding recurrence to the hidden layers (Mikolov et al., 2011),
allowing it to beat the state-of-the-art (smoothed n-gram
models) not only in terms of perplexity (exponential of the
average negative log-likelihood of predicting the right next
word, going down from 140 to 102) but also in terms of
word error rate in speech recognition (since the language
model is an important component of a speech recognition
system), decreasing it from 17.2% (KN5 baseline) or 16.9%
(discriminative language model) to 14.4% on the Wall Street
Journal benchmark task. Similar models have been applied
in statistical machine translation (Schwenk et al., 2012; Le
et al., 2013), improving perplexity and BLEU scores. Recursive auto-encoders (which generalize recurrent networks)
have also been used to beat the state-of-the-art in full sentence
paraphrase detection (Socher et al., 2011a) almost doubling the
F1 score for paraphrase detection. Representation learning can
also be used to perform word sense disambiguation (Bordes
et al., 2012), bringing up the accuracy from 67.8% to 70.2%
on the subset of Senseval-3 where the system could be applied
(with subject-verb-object sentences). Finally, it has also been
successfully used to surpass the state-of-the-art in sentiment
analysis (Glorot et al., 2011b; Socher et al., 2011b).
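Since perplexity (the exponential of the average negative log-likelihood of the correct next word) appears repeatedly as an evaluation measure above, a minimal sketch of how it is computed may be helpful. The probability values below are made up for illustration and do not come from any benchmark:

```python
import math

def perplexity(probs_of_correct_words):
    """Perplexity is the exponential of the average negative
    log-likelihood the model assigns to the correct next words."""
    n = len(probs_of_correct_words)
    avg_nll = -sum(math.log(p) for p in probs_of_correct_words) / n
    return math.exp(avg_nll)

# Hypothetical per-word probabilities from a language model; lower perplexity is better.
print(perplexity([0.2, 0.05, 0.1, 0.4]))  # ~= 7.1
```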
Multi-Task and Transfer Learning, Domain Adaptation
Transfer learning is the ability of a learning algorithm to
exploit commonalities between different learning tasks in order
to share statistical strength, and transfer knowledge across
tasks. As discussed below, we hypothesize that representation
learning algorithms have an advantage for such tasks because
they learn representations that capture underlying factors, a
subset of which may be relevant for each particular task, as
illustrated in Figure 1. This hypothesis seems confirmed by a
number of empirical results showing the strengths of representation learning algorithms in transfer learning scenarios.
[Figure 1: a raw input x feeds a shared hidden layer of factors, subsets of which feed the outputs y1, y2, y3 of Tasks A, B, C.]
Fig. 1. Illustration of representation-learning discovering explanatory factors (middle hidden layer, in red), some explaining the input (semi-supervised setting), and some explaining the target for each task. Because these subsets overlap, sharing of statistical strength helps generalization.
Most impressive are the two transfer learning challenges
held in 2011 and won by representation learning algorithms.
First, the Transfer Learning Challenge, presented at an ICML
2011 workshop of the same name, was won using unsupervised layer-wise pre-training (Bengio, 2011; Mesnil et al., 2011). A second Transfer Learning Challenge was held the same year and won by Goodfellow et al. (2011). Results were presented at NIPS 2011's Challenges in Learning Hierarchical Models Workshop. In the related domain adaptation
setup, the target remains the same but the input distribution
changes (Glorot et al., 2011b; Chen et al., 2012). In the
multi-task learning setup, representation learning has also been
found advantageous (Krizhevsky et al., 2012; Collobert et al., 2011), because of shared factors across tasks.
3 WHAT MAKES A REPRESENTATION GOOD?
3.1 Priors for Representation Learning in AI
In Bengio and LeCun (2007), one of us introduced the
notion of AI-tasks, which are challenging for current machine
learning algorithms, and involve complex but highly structured
dependencies. One reason why explicitly dealing with representations is interesting is because they can be convenient to
express many general priors about the world around us, i.e.,
priors that are not task-specific but would be likely to be useful
for a learning machine to solve AI-tasks. Examples of such
general-purpose priors are the following:
• Smoothness: assumes the function to be learned f is s.t.
x ≈ y generally implies f(x) ≈ f(y). This most basic prior
is present in most machine learning, but is insufficient to get
around the curse of dimensionality, see Section 3.2.
• Multiple explanatory factors: the data generating distribution is generated by different underlying factors, and for the
most part what one learns about one factor generalizes in many
configurations of the other factors. The objective to recover
or at least disentangle these underlying factors of variation is
discussed in Section 3.5. This assumption is behind the idea of
distributed representations, discussed in Section 3.3 below.
• A hierarchical organization of explanatory factors: the
concepts that are useful for describing the world around us
can be defined in terms of other concepts, in a hierarchy, with
more abstract concepts higher in the hierarchy, defined in
terms of less abstract ones. This assumption is exploited with
deep representations, elaborated in Section 3.4 below.
• Semi-supervised learning: with inputs X and target Y to
predict, a subset of the factors explaining X’s distribution
explain much of Y , given X. Hence representations that are
useful for P (X) tend to be useful when learning P (Y |X),
allowing sharing of statistical strength between the unsupervised and supervised learning tasks, see Section 4.
• Shared factors across tasks: with many Y ’s of interest or
many learning tasks in general, tasks (e.g., the corresponding
P (Y |X, task)) are explained by factors that are shared with
other tasks, allowing sharing of statistical strengths across
tasks, as discussed in the previous section (Multi-Task and
Transfer Learning, Domain Adaptation).
• Manifolds: probability mass concentrates near regions that
have a much smaller dimensionality than the original space
where the data lives. This is explicitly exploited in some
of the auto-encoder algorithms and other manifold-inspired
algorithms described respectively in Sections 7.2 and 8.
• Natural clustering: different values of categorical variables
such as object classes are associated with separate manifolds.
More precisely, the local variations on the manifold tend to
preserve the value of a category, and a linear interpolation
between examples of different classes in general involves
going through a low density region, i.e., P (X|Y = i) for
different i tend to be well separated and not overlap much. For
example, this is exploited in the Manifold Tangent Classifier
discussed in Section 8.3. This hypothesis is consistent with the
idea that humans have named categories and classes because
of such statistical structure (discovered by their brain and
propagated by their culture), and machine learning tasks often
involve predicting such categorical variables.
• Temporal and spatial coherence: consecutive (from a sequence) or spatially nearby observations tend to be associated
with the same value of relevant categorical concepts, or result
in a small move on the surface of the high-density manifold.
More generally, different factors change at different temporal
and spatial scales, and many categorical concepts of interest
change slowly. When attempting to capture such categorical
variables, this prior can be enforced by making the associated
representations slowly changing, i.e., penalizing changes in
values over time or space. This prior was introduced in Becker
and Hinton (1992) and is discussed in Section 11.3.
• Sparsity: for any given observation x, only a small fraction
of the possible factors are relevant. In terms of representation,
this could be represented by features that are often zero (as
initially proposed by Olshausen and Field (1996)), or by the
fact that most of the extracted features are insensitive to small
variations of x. This can be achieved with certain forms of
priors on latent variables (peaked at 0), or by using a non-
linearity whose value is often flat at 0 (i.e., 0 and with a
0 derivative), or simply by penalizing the magnitude of the
Jacobian matrix (of derivatives) of the function mapping input
to representation. This is discussed in Sections 6.1.1 and 7.2; a small illustrative sketch of such penalties follows this list.
• Simplicity of Factor Dependencies: in good high-level
representations, the factors are related to each other through
simple, typically linear dependencies. This can be seen in
many laws of physics, and is assumed when plugging a linear
predictor on top of a learned representation.
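As a small illustration of the sparsity prior above (referenced from the Sparsity item), the toy numpy sketch below computes an L1 penalty on the features and a penalty on the squared Frobenius norm of the encoder's Jacobian for a one-layer sigmoid encoder. It is a sketch written for this review, under the assumption of a single sigmoid layer, not the exact training objective of any particular model:

```python
import numpy as np

def encode(x, W, b):
    """One-layer sigmoid encoder: h = sigmoid(W x + b)."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def sparsity_penalties(x, W, b, lam_l1=0.1, lam_jac=0.1):
    """Two ways of encouraging sparse / locally insensitive features:
    an L1 penalty pushing feature values toward 0, and a penalty on the
    squared Frobenius norm of the encoder Jacobian (flatness around x)."""
    h = encode(x, W, b)
    l1 = lam_l1 * np.abs(h).sum()
    # For a sigmoid layer, dh_j/dx_i = h_j (1 - h_j) W_ji, so
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    jac = lam_jac * np.sum((h * (1 - h)) ** 2 * (W ** 2).sum(axis=1))
    return l1, jac

rng = np.random.default_rng(0)
x = rng.normal(size=20)
W = 0.1 * rng.normal(size=(50, 20))
b = np.zeros(50)
print(sparsity_penalties(x, W, b))
```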
We can view many of the above priors as ways to help the
learner discover and disentangle some of the underlying (and
a priori unknown) factors of variation that the data may reveal.
This idea is pursued further in Sections 3.5 and 11.4.
3.2 Smoothness and the Curse of Dimensionality
For AI-tasks, such as vision and NLP, it seems hopeless to
rely only on simple parametric models (such as linear models)
because they cannot capture enough of the complexity of interest unless provided with the appropriate feature space. Conversely, machine learning researchers have sought flexibility in
local⁶ non-parametric learners such as kernel machines with
a fixed generic local-response kernel (such as the Gaussian
kernel). Unfortunately, as argued at length by Bengio and
Monperrus (2005); Bengio et al. (2006a); Bengio and LeCun
(2007); Bengio (2009); Bengio et al. (2010), most of these
algorithms only exploit the principle of local generalization,
i.e., the assumption that the target function (to be learned)
is smooth enough, so they rely on examples to explicitly
map out the wrinkles of the target function. Generalization
6. local in the sense that the value of the learned function at x depends mostly on training examples x^(t)'s close to x
is mostly achieved by a form of local interpolation between
neighboring training examples. Although smoothness can be
a useful assumption, it is insufficient to deal with the curse
of dimensionality, because the number of such wrinkles (ups
and downs of the target function) may grow exponentially
with the number of relevant interacting factors, when the data
are represented in raw input space. We advocate learning
algorithms that are flexible and non-parametric⁷ but do not
rely exclusively on the smoothness assumption. Instead, we
propose to incorporate generic priors such as those enumerated
above into representation-learning algorithms. Smoothness-
based learners (such as kernel machines) and linear models
can still be useful on top of such learned representations. In
fact, the combination of learning a representation and kernel
machine is equivalent to learning the kernel, i.e., the feature
space. Kernel machines are useful, but they depend on a prior
definition of a suitable similarity metric, or a feature space
in which naive similarity metrics suffice. We would like to
use the data, along with very generic priors, to discover those
features, or equivalently, a similarity function.
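To make the kernel-learning equivalence above concrete, the following toy sketch (the feature map and all names are illustrative assumptions, not taken from the paper) shows that any learned representation φ induces a kernel through the inner product in feature space, so learning φ and then applying a linear kernel machine amounts to learning the kernel:

```python
import numpy as np

def phi(x, W, b):
    """A learned feature map (here a random one-layer ReLU, standing in
    for whatever a representation-learning algorithm would produce)."""
    return np.maximum(0.0, W @ x + b)

def induced_kernel(x, y, W, b):
    """Using the learned representation under a linear kernel machine is
    the same as using this learned kernel on the raw inputs."""
    return phi(x, W, b) @ phi(y, W, b)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(100, 10)), rng.normal(size=100)
x, y = rng.normal(size=10), rng.normal(size=10)
print(induced_kernel(x, y, W, b))
```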
3.3 Distributed representations
Good representations are expressive, meaning that a
reasonably-sized learned representation can capture a huge
number of possible input configurations. A simple counting
argument helps us to assess the expressiveness of a model pro-
ducing a representation: how many parameters does it require
compared to the number of input regions (or configurations) it
can distinguish? Learners of one-hot representations, such as
traditional clustering algorithms, Gaussian mixtures, nearest-
neighbor algorithms, decision trees, or Gaussian SVMs all require O(N) parameters (and/or O(N) examples) to distinguish
O(N) input regions. One could naively believe that one cannot
do better. However, RBMs, sparse coding, auto-encoders or
multi-layer neural networks can all represent up to O(2^k) input regions using only O(N) parameters (with k the number of non-zero elements in a sparse representation, and k = N in non-sparse RBMs and other dense representations). These are all distributed⁸ or sparse⁹ representations. The generalization
of clustering to distributed representations is multi-clustering,
where either several clusterings take place in parallel or the
same clustering is applied on different parts of the input,
such as in the very popular hierarchical feature extraction for
object recognition based on a histogram of cluster categories
detected in different patches of an image (Lazebnik et al.,
2006; Coates and Ng, 2011a). The exponential gain from
distributed or sparse representations is discussed further in
section 3.2 (and Figure 3.2) of Bengio (2009). It comes
about because each parameter (e.g. the parameters of one of
the units in a sparse code, or one of the units in a Restricted
7. We understand non-parametric as including all learning algorithms
whose capacity can be increased appropriately as the amount of data and its
complexity demands it, e.g. including mixture models and neural networks
where the number of parameters is a data-selected hyper-parameter.
8. Distributed representations: where k out of N representation elements
or feature values can be independently varied, e.g., they are not mutually
exclusive. Each concept is represented by having k features being turned on
or active, while each feature is involved in representing many concepts.
9. Sparse representations: distributed representations where only a few of
the elements can be varied at a time, i.e., k < N .
Boltzmann Machine) can be re-used in many examples that are
not simply near neighbors of each other, whereas with local
generalization, different regions in input space are basically
associated with their own private set of parameters, e.g., as
in decision trees, nearest-neighbors, Gaussian SVMs, etc. In
a distributed representation, an exponentially large number of
possible subsets of features or hidden units can be activated
in response to a given input. In a single-layer model, each
feature is typically associated with a preferred input direction,
corresponding to a hyperplane in input space, and the code
or representation associated with that input is precisely the
pattern of activation (which features respond to the input,
and how much). This is in contrast with a non-distributed
representation such as the one learned by most clustering
algorithms, e.g., k-means, in which the representation of a
given input vector is a one-hot code identifying which one of
a small number of cluster centroids best represents the input¹⁰.
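The counting argument at the start of this section can be made concrete with a rough numerical sketch (illustrative only, not a formal statement): a clustering-style one-hot code with N centroids can distinguish at most N input regions, whereas a code of N binary features, e.g. the sign pattern across N hyperplanes, can in principle distinguish up to 2^N regions with a comparable number of parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))   # toy inputs
N = 12                            # number of units / "parameters" in each model

# One-hot code (clustering-style): nearest of N centroids -> at most N distinct codes.
centroids = rng.normal(size=(N, 10))
one_hot = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
print("regions distinguished by the one-hot code:", len(set(one_hot.tolist())))

# Distributed code: sign pattern across N hyperplanes -> up to 2**N distinct codes.
W, b = rng.normal(size=(N, 10)), rng.normal(size=N)
codes = {tuple(row) for row in (X @ W.T + b > 0).tolist()}
print("regions distinguished by the distributed code:", len(codes))
```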
3.4 Depth and abstraction
Depth is a key aspect to representation learning strategies we
consider in this paper. As we will discuss, deep architectures
are often challenging to train effectively and this has been
the subject of much recent research and progress. However,
despite these challenges, they carry two significant advantages
that motivate our long-term interest in discovering successful
training strategies for deep architectures. These advantages
are: (1) deep architectures promote the re-use of features, and
(2) deep architectures can potentially lead to progressively
more abstract features at higher layers of representations
(more removed from the data).
Feature re-use. The notion of re-use, which explains the
power of distributed representations, is also at the heart of the
theoretical advantages behind deep learning, i.e., constructing
multiple levels of representation or learning a hierarchy of
features. The depth of a circuit is the length of the longest
path from an input node of the circuit to an output node of
the circuit. The crucial property of a deep circuit is that its
number of paths, i.e., ways to re-use different parts, can grow
exponentially with its depth. Formally, one can change the
depth of a given circuit by changing the definition of what
each node can compute, but only by a constant factor. The
typical computations we allow in each node include: weighted
sum, product, artificial neuron model (such as a monotone non-
linearity on top of an affine transformation), computation of a
kernel, or logic gates. Theoretical results clearly show families
of functions where a deep representation can be exponentially
more efficient than one that is insufficiently deep (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006a; Bengio and LeCun, 2007; Bengio and Delalleau, 2011). If
the same family of functions can be represented with fewer
10. As discussed in (Bengio, 2009), things are only slightly better when
allowing continuous-valued membership values, e.g., in ordinary mixture
models (with separate parameters for each mixture component), but the
difference in representational power is still exponential (Montufar and Morton,
2012). The situation may also seem better with a decision tree, where each
given input is associated with a one-hot code over the tree leaves, which
deterministically selects associated ancestors (the path from root to node).
Unfortunately, the number of different regions represented (equal to the
number of leaves of the tree) still only grows linearly with the number of
parameters used to specify it (Bengio and Delalleau, 2011).
parameters (or more precisely with a smaller VC-dimension),
learning theory would suggest that it can be learned with
fewer examples, yielding improvements in both computational
efficiency (fewer nodes to visit) and statistical efficiency (fewer parameters to learn, and re-use of these parameters over many
different kinds of inputs).
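A quick way to see the exponential growth in the number of paths, and hence of opportunities for re-use, is to count input-to-output paths in a fully-connected layered circuit: each path picks one unit per hidden layer, so the count is the product of the hidden-layer widths and grows exponentially with depth at fixed width. The short calculation below is purely illustrative:

```python
from math import prod

def num_paths(hidden_widths):
    """Paths from one input node to one output node in a fully-connected
    layered circuit: each path chooses one unit per hidden layer."""
    return prod(hidden_widths)

for depth in range(1, 6):
    print(depth, "hidden layers of width 10:", num_paths([10] * depth), "paths")
# 10, 100, 1000, ... : the count grows exponentially with depth at fixed width.
```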
Abstraction and invariance. Deep architectures can lead
to abstract representations because more abstract concepts can
often be constructed in terms of less abstract ones. In some
cases, such as in the convolutional neural network (LeCun
et al., 1998b), we build this abstraction in explicitly via a
pooling mechanism (see section 11.2). More abstract concepts
are generally invariant to most local changes of the input. That
makes the representations that capture these concepts generally
highly non-linear functions of the raw input. This is obviously
true of categorical concepts, where more abstract representations detect categories that cover more varied phenomena (e.g.
larger manifolds with more wrinkles) and thus they potentially
have greater predictive power. Abstraction can also appear in
high-level continuous-valued attributes that are only sensitive
to some very specific types of changes in the input. Learning
these sorts of invariant features has been a long-standing goal
in pattern recognition.
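As a small illustration of how a pooling mechanism builds in invariance (a toy 1-D numpy sketch, not the convolutional network of LeCun et al.), max-pooling over local windows can leave the pooled representation unchanged under a small translation of the detected feature:

```python
import numpy as np

def max_pool_1d(x, width):
    """Non-overlapping max-pooling over windows of the given width."""
    return x[: len(x) // width * width].reshape(-1, width).max(axis=1)

x = np.zeros(12)
x[5] = 1.0                        # a detected feature at position 5
x_shifted = np.roll(x, 1)         # the same feature shifted to position 6
print(max_pool_1d(x, 4))          # [0. 1. 0.]
print(max_pool_1d(x_shifted, 4))  # [0. 1. 0.]  -- unchanged: invariant to the small shift
```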
3.5 Disentangling Factors of Variation
Beyond being distributed and invariant, we would like our rep-
resentations to disentangle the factors of variation. Different
explanatory factors of the data tend to change independently
of each other in the input distribution, and only a few at a time
tend to change when one considers a sequence of consecutive
real-world inputs.
Complex data arise from the rich interaction of many
sources. These factors interact in a complex web that can
complicate AI-related tasks such as object classification. For
example, an image is composed of the interaction between one
or more light sources, the object shapes and the material properties of the various surfaces present in the image. Shadows
from objects in the scene can fall on each other in complex
patterns, creating the illusion of object boundaries where there
are none and dramatically affect the perceived object shape.
How can we cope with these complex interactions? How can
we disentangle the objects and their shadows? Ultimately,
we believe the approach we adopt for overcoming these
challenges must leverage the data itself, using vast quantities
of unlabeled examples, to learn representations that separate
the various explanatory sources. Doing so should give rise to
a representation significantly more robust to the complex and
richly structured variations extant in natural data sources for
AI-related tasks.
It is important to distinguish between the related but distinct
goals of learning invariant features and learning to disentangle
explanatory factors. The central difference is the preservation
of information. Invariant features, by definition, have reduced
sensitivity in the direction of invariance. This is the goal of
building features that are insensitive to variation in the data
that are uninformative to the task at hand. Unfortunately, it
is often difficult to determine a priori which set of features
and variations will ultimately be relevant to the task at hand.
Further, as is often the case in the context of deep learning
methods, the feature set being trained may be destined to
be used in multiple tasks that may have distinct subsets of
relevant features. Considerations such as these lead us to the
conclusion that the most robust approach to feature learning
is to disentangle as many factors as possible, discarding as
little information about the data as is practical. If some form
of dimensionality reduction is desirable, then we hypothesize
that the local directions of variation least represented in the
training data should be first to be pruned out (as in PCA,
for example, which does it globally instead of around each
example).
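The PCA-style pruning just mentioned, which keeps the globally dominant directions of variation and discards those least represented in the training data, can be sketched in a few lines of numpy (an illustrative sketch, not a prescription):

```python
import numpy as np

def pca_prune(X, k):
    """Keep the k directions of largest variance and discard the directions
    of variation least represented in the training data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # reduced representation

rng = np.random.default_rng(0)
# Data that mostly varies along one direction, plus a little isotropic noise.
X = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 10)) + 0.01 * rng.normal(size=(200, 10))
print(pca_prune(X, 2).shape)                 # (200, 2)
```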
3.6 Good criteria for learning representations?
One of the challenges of representation learning that distinguishes it from other machine learning tasks such as classification is the difficulty in establishing a clear objective, or target for training. In the case of classification, the objective is (at least conceptually) obvious: we want to minimize the
number of misclassifications on the training dataset. In the
case of representation learning, our objective is far-removed
from the ultimate objective, which is typically learning a
classifier or some other predictor. Our problem is reminiscent
of the credit assignment problem encountered in reinforcement
learning. We have proposed that a good representation is one
that disentangles the underlying factors of variation, but how
do we translate that into appropriate training criteria? Is it even
necessary to do anything but maximize likelihood under a good
model or can we introduce priors such as those enumerated
above (possibly data-dependent ones) that help the representation better do this disentangling? This question remains clearly
open but is discussed in more detail in Sections 3.5 and 11.4.
4 BUILDING DEEP REPRESENTATIONS
In 2006, a breakthrough in feature learning and deep learning
was initiated by Geoff Hinton and quickly followed up in
the same year (Hinton et al., 2006; Bengio et al., 2007;
Ranzato et al., 2007), and soon after by Lee et al. (2008)
and many more later. It has been extensively reviewed and
discussed in Bengio (2009). A central idea, referred to as
greedy layerwise unsupervised pre-training, was to learn a
hierarchy of features one level at a time, using unsupervised
feature learning to learn a new transformation at each level
to be composed with the previously learned transformations;
essentially, each iteration of unsupervised feature learning adds
one layer of weights to a deep neural network. Finally, the set
of layers could be combined to initialize a deep supervised predictor, such as a neural network classifier, or a deep generative
model, such as a Deep Boltzmann Machine (Salakhutdinov
and Hinton, 2009).
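A minimal sketch of greedy layerwise unsupervised pre-training using single-hidden-layer auto-encoders is given below. It is a toy numpy version written for this review under simplifying assumptions (tied weights, a linear decoder, plain gradient descent); the systems cited above used RBMs or regularized auto-encoders and more careful optimization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=200, seed=0):
    """Train one auto-encoder layer (tied weights, linear decoder) by
    squared-error reconstruction; return the encoder parameters (W, b)."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = 0.1 * rng.normal(size=(n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)          # encode
        R = H @ W.T + c                 # decode with tied weights
        err = R - X                     # reconstruction error
        dH = (err @ W) * H * (1 - H)    # back-propagate through the encoder
        gW = X.T @ dH + err.T @ H       # gradient w.r.t. the tied weight matrix
        W -= lr * gW / len(X)
        b -= lr * dH.sum(axis=0) / len(X)
        c -= lr * err.sum(axis=0) / len(X)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Stack layers greedily: each new layer is trained on the codes
    produced by the previously trained layers."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)          # feed codes to the next level
    return params                        # can initialize a deep supervised net

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 20))
stack = greedy_pretrain(X, [16, 8])
print([W.shape for W, _ in stack])       # [(20, 16), (16, 8)]
```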
This paper is mostly about feature learning algorithms
that can be used to form deep architectures. In particular, it
was empirically observed that layerwise stacking of feature
extraction often yielded better representations, e.g., in terms
of classification error (Larochelle et al., 2009; Erhan et al.,
2010b), quality of the samples generated by a probabilistic
model (Salakhutdinov and Hinton, 2009) or in terms of the
invariance properties of the learned features (Goodfellow