3.2 Smoothness and the Curse of Dimensionality
For AI tasks such as vision and NLP, it seems hopeless to
rely only on simple parametric models (such as linear
models) because they cannot capture enough of the
complexity of interest unless provided with the appropriate
feature space. Conversely, machine learning researchers
have sought flexibility in local^6 nonparametric learners such as
kernel machines with a fixed generic local-response kernel
(such as the Gaussian kernel). Unfortunately, as argued at
length by Bengio and Monperrus [17], Bengio et al. [21],
Bengio and LeCun [16], Bengio [11], and Bengio et al. [25],
most of these algorithms only exploit the principle of local
generalization, i.e., the assumption that the target function (to
be learned) is smooth enough, so they rely on examples to
explicitly map out the wrinkles of the target function. General-
ization is mostly achieved by a form of local interpolation
between neighboring training examples. Although smooth-
ness can be a useful assumption, it is insufficient to deal
with the curse of dimensionality because the number of such
wrinkles (ups and downs of the target function) may grow
exponentially with the number of relevant interacting
factors when the data are represented in raw input space.
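To make local generalization concrete, the following Python sketch (an illustration of ours, not from the original text; all names are hypothetical) implements a Gaussian-kernel smoother whose prediction at a query point is a local interpolation of the targets of nearby training examples:

import numpy as np

def gaussian_kernel(x, x_train, bandwidth=0.1):
    # Local-response kernel: weights are non-negligible only for
    # training examples close to the query point x.
    sq_dists = np.sum((x_train - x) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_smoother_predict(x, x_train, y_train, bandwidth=0.1):
    # Nadaraya-Watson smoother: the prediction is a local interpolation
    # between the targets of neighboring training examples.
    w = gaussian_kernel(x, x_train, bandwidth)
    return np.dot(w, y_train) / (np.sum(w) + 1e-12)

# A rapidly varying 1D target: roughly 20 "wrinkles" on [0, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(50, 1))
y_train = np.sin(40.0 * np.pi * x_train[:, 0])
print(kernel_smoother_predict(np.array([0.5]), x_train, y_train))

With 50 examples and about 20 wrinkles, the smoother can only interpolate accurately where the data happen to be dense; each additional interacting input dimension multiplies the number of regions that must be covered by examples.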
We advocate learning algorithms that are flexible and
nonparametric,^7 but do not rely exclusively on the smoothness
assumption. Instead, we propose to incorporate generic
priors such as those enumerated above into representation-
learning algorithms. Smoothness-based learners (such as
kernel machines) and linear models can still be useful on top
of such learned representations. In fact, combining a learned
representation with a kernel machine is equivalent to learning
the kernel, i.e., the feature space. Kernel machines
are useful, but they depend on a prior definition of a suitable
similarity metric or a feature space in which naive similarity
metrics suffice. We would like to use the data, along with
very generic priors, to discover those features or, equiva-
lently, a similarity function.
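As an illustrative sketch of this equivalence (the function names and parameter shapes below are our assumptions, not notation from the text), applying a fixed generic kernel on top of a learned feature map f amounts to using a learned kernel k_f(x, y) = k(f(x), f(y)) on the raw inputs:

import numpy as np

def learned_feature_map(x, W, b):
    # Stand-in for a learned representation, e.g., one layer of a neural
    # network whose parameters (W, b) came from representation learning.
    return np.maximum(0.0, x @ W + b)

def rbf_kernel(u, v, bandwidth=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * bandwidth ** 2))

def learned_kernel(x, y, W, b, bandwidth=1.0):
    # A generic kernel in the learned feature space: equivalent to having
    # learned the kernel (i.e., the feature space) on the raw inputs.
    return rbf_kernel(learned_feature_map(x, W, b),
                      learned_feature_map(y, W, b), bandwidth)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((10, 4)), rng.standard_normal(4)
x, y = rng.standard_normal(10), rng.standard_normal(10)
print(learned_kernel(x, y, W, b))

A naive similarity metric (here a plain RBF kernel) can then suffice, because the learned features, rather than the raw inputs, define what counts as "near."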
3.3 Distributed Representations
Good representations are expressive, meaning that a reason-
ably sized learned representation can capture a huge
number of possible input configurations. A simple counting
argument helps us to assess the expressiveness of a model
producing a representation: How many parameters does it
require compared to the number of input regions (or
configurations) it can distinguish? Learners of one-hot
representations, such as traditional clustering algorithms,
Gaussian mixtures, nearest-neighbor algorithms, decision
trees, or Gaussian SVMs, all require O(N) parameters (and/
or O(N) examples) to distinguish O(N) input regions. One
could naively believe that one cannot do better. However,
restricted Boltzmann machines (RBMs), sparse coding,
autoencoders, or multilayer neural networks can all
represent up to O(2^k) input regions using only O(N)
parameters (with k the number of nonzero elements in a
sparse representation, and k = N in nonsparse RBMs and
other dense representations). These are all distributed^8 or
sparse^9 representations. The generalization of clustering to
distributed representations is multiclustering, where either
several clusterings take place in parallel or the same
clustering is applied on different parts of the input, such
as in the very popular hierarchical feature extraction for
object recognition based on a histogram of cluster categories
detected in different patches of an image [120], [48]. The
exponential gain from distributed or sparse representations
is discussed further in [11, Section 3.2, Fig. 3.2]. It comes
about because each parameter (e.g., the parameters of one of
the units in a sparse code or one of the units in an RBM) can
be reused in many examples that are not simply near
neighbors of each other, whereas with local generalization,
different regions in input space are basically associated with
their own private set of parameters, for example, as in
decision trees, nearest neighbors, Gaussian SVMs, and so
on. In a distributed representation, an exponentially large
number of possible subsets of features or hidden units can be
activated in response to a given input. In a single-layer
model, each feature is typically associated with a preferred
input direction, corresponding to a hyperplane in input
space, and the code or representation associated with that
input is precisely the pattern of activation (which features
respond to the input, and how much). This is in contrast
with a nondistributed representation such as the one
learned by most clustering algorithms, for example,
k-means, in which the representation of a given input
vector is a one-hot code identifying which one of a small
number of cluster centroids best represents the input.^10
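The counting argument can be illustrated with a small numerical sketch (the numbers and names here are hypothetical choices of ours): with the same O(N) parameter budget, a one-hot learner such as k-means distinguishes at most N input regions, whereas N hyperplane features (a single-layer distributed representation) assign each region a distinct activation pattern, and the number of such regions grows much faster than N.

import numpy as np

rng = np.random.default_rng(0)
N = 12                                  # same O(N) "unit" budget for both models
X = rng.standard_normal((5000, 2))      # many 2D inputs

# One-hot representation (k-means-like): N centroids -> at most N regions.
centroids = rng.standard_normal((N, 2))
one_hot = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
print("one-hot regions:", len(np.unique(one_hot)))              # <= N

# Distributed representation: N hyperplane features -> an N-bit sign pattern.
# In 2D the number of realizable regions is O(N^2); in high dimensions it
# approaches 2^N, while the parameter count stays O(N).
W, b = rng.standard_normal((2, N)), rng.standard_normal(N)
codes = (X @ W + b > 0).astype(int)
print("distributed regions:", len(np.unique(codes, axis=0)))

Both representations use O(N) parameters, yet only the distributed code distinguishes far more than N regions, which is the exponential gain referred to above.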
3.4 Depth and Abstraction
Depth is a key aspect of the representation learning strategies
we consider in this paper. As we will discuss, deep
architectures are often challenging to train effectively, and
this has been the subject of much recent research and
progress. However, despite these challenges, they carry two
significant advantages that motivate our long-term interest in
discovering successful training strategies for deep architec-
tures. These advantages are: 1) deep architectures promote
the reuse of features, and 2) deep architectures can potentially
lead to progressively more abstract features at higher layers of
representations (more removed from the data).
Feature reuse. The notion of reuse, which explains the
power of distributed representations, is also at the heart of
the theoretical advantages behind deep learning, i.e., constructing
multiple levels of representation or learning a hierarchy of features.
6. Local in the sense that the value of the learned function at x depends
mostly on training examples x^(t)'s close to x.
7. We understand nonparametric as including all learning algorithms
whose capacity can be increased appropriately as the amount of data
and its complexity demands it, for example, including mixture models
and neural networks where the number of parameters is a data-
selected hyperparameter.
8. Distributed representations, where k out of N representation
elements or feature values can be independently varied; i.e.,
they are not mutually exclusive. Each concept is represented by having k
features being turned on or active, while each feature is involved in
representing many concepts.
9. Sparse representations: distributed representations where only a few
of the elements can be varied at a time, i.e., k<N.
10. As discussed in [11], things are only slightly better when allowing
continuous-valued membership values, for example, in ordinary mixture
models (with separate parameters for each mixture component), but the
difference in representational power is still exponential [149]. The situation
may also seem better with a decision tree, where each given input is
associated with a one-hot code over the tree leaves, which deterministically
selects associated ancestors (the path from root to node). Unfortunately, the
number of different regions represented (equal to the number of leaves of
the tree) still only grows linearly with the number of parameters used to
specify it [15].