Li Deng and Dong Yu
Microsoft Research
One Microsoft Way
Redmond, WA 98052
NOW PUBLISHERS, 2014
DEEP LEARNING:
METHODS AND APPLICATIONS
Table of Contents
Chapter 1 Introduction
1.1 Definitions and Background
1.2 Organization of This Book
Chapter 2 Some Historical Context of Deep Learning
Chapter 3 Three Classes of Deep Learning Networks
3.1 A Three-Way Categorization
3.2 Deep Networks for Unsupervised or Generative Learning
3.3 Deep Networks for Supervised Learning
3.4 Hybrid Deep Networks
Chapter 4 Deep Autoencoders --- Unsupervised Learning
4.1 Introduction
4.2 Use of Deep Autoencoders to Extract Speech Features
4.3 Stacked Denoising Autoencoders
4.4 Transforming Autoencoders
Chapter 5 Pre-Trained Deep Neural Networks --- A Hybrid
5.1 Restricted Boltzmann Machines
5.2 Unsupervised Layer-wise Pretraining
5.3 Interfacing DNNs with HMMs
Chapter 6 Deep Stacking Networks and Variants --- Supervised Learning
6.1 Introduction
6.2 A Basic Architecture of the Deep Stacking Network
6.3 A Method for Learning the DSN Weights
6.4 The Tensor Deep Stacking Network
6.5 The Kernelized Deep Stacking Network
Chapter 7 Selected Applications in Speech and Audio Processing
7.1 Acoustic Modeling for Speech Recognition
7.1.1 Back to primitive spectral features of speech
7.1.2 The DNN-HMM architecture vs. use of DNN-derived features
7.1.3 Noise robustness by deep learning
7.1.4 Output representations in the DNN
7.1.5 Adaptation of the DNN-based speech recognizers
7.1.6 Better architectures and nonlinear units
7.1.7 Better optimization and regularization
7.2 Speech Synthesis
7.3 Audio and Music Processing
Chapter 8 Selected Applications in Language Modeling and Natural Language Processing
8.1 Language Modeling
8.2 Natural Language Processing
Chapter 9 Selected Applications in Information Retrieval
9.1 A Brief Introduction to Information Retrieval
9.2 Semantic Hashing with Deep Autoencoders for Document Indexing and Retrieval
9.3 Deep-Structured Semantic Modeling for Document Retrieval
9.4 Use of Deep Stacking Networks for Information Retrieval
Chapter 10 Selected Applications in Object Recognition and Computer Vision
10.1 Unsupervised or Generative Feature Learning
10.2 Supervised Feature Learning and Classification
Chapter 11 Selected Applications in Multi-modal and Multi-task Learning
11.1 Multi-Modalities: Text and Image
11.2 Multi-Modalities: Speech and Image
11.3 Multi-Task Learning within the Speech, NLP or Image Domain
Chapter 12 Epilogues
BIBLIOGRAPHY
Abstract
This book aims to provide an overview of general deep learning methodology and its
applications to a variety of signal and information processing tasks. The application areas are
chosen according to three criteria: 1) the expertise or knowledge of the authors; 2) areas
that have already been transformed by the successful use of deep learning technology, such as
speech recognition and computer vision; and 3) areas that have the potential to be impacted
significantly by deep learning and that have attracted concentrated research efforts, including
natural language and text processing, information retrieval, and multimodal information
processing empowered by multi-task deep learning.
In Chapter 1, we provide the background of deep learning, as intrinsically connected to the use of
multiple layers of nonlinear transformations to derive features from sensory signals such as
speech and visual images. In the most recent literature, deep learning is also embodied as
representation learning, which involves a hierarchy of features or concepts in which higher-level
representations are defined from lower-level ones and the same lower-level representations
help to define higher-level ones. In Chapter 2, a brief historical account of deep
learning is presented. In particular, the chronological development of speech recognition is
used to illustrate the recent impact of deep learning, which has become a dominant technology in
the speech recognition industry within only a few years of the start of a collaboration between
academic and industrial researchers in applying deep learning to speech recognition. In Chapter 3,
a three-way classification scheme for a large body of work in deep learning is developed. We
classify a growing number of deep learning techniques into unsupervised, supervised, and hybrid
categories, and present qualitative descriptions and a literature survey for each category. In
Chapters 4 to 6, we discuss in detail three popular deep networks and related learning
methods, one in each category. Chapter 4 is devoted to deep autoencoders as a prominent example
of unsupervised deep learning techniques. Chapter 5 gives a major example in the hybrid deep
network category: the discriminative feed-forward neural network for supervised learning
with many layers, initialized using layer-by-layer generative, unsupervised pre-training. In Chapter
6, deep stacking networks and several of their variants are discussed in detail; they exemplify the
discriminative or supervised deep learning techniques in the three-way categorization scheme.
In Chapters 7-11, we select a set of typical and successful applications of deep learning in diverse
areas of signal and information processing and of applied artificial intelligence. In Chapter 7, we
review the applications of deep learning to speech and audio processing, with emphasis on speech
recognition organized according to several prominent themes. In Chapter 8, we present recent
results of applying deep learning to language modeling and natural language processing. Chapter
9 is devoted to selected applications of deep learning to information retrieval including Web search.
In Chapter 10, we cover selected applications of deep learning to image object recognition in
computer vision. Selected applications of deep learning to multi-modal processing and multi-task
learning are reviewed in Chapter 11. Finally, an epilogue is given in Chapter 12 to summarize
what we presented in earlier chapters and to discuss future challenges and directions.
CHAPTER 1
INTRODUCTION
1.1 Definitions and Background
Since 2006, deep structured learning, more commonly called deep learning or hierarchical
learning, has emerged as a new area of machine learning research (Hinton et al., 2006; Bengio,
2009). During the past several years, the techniques developed from deep learning research have
already been impacting a wide range of signal and information processing work, within both its
traditional scope and its new, widened scope, which includes key aspects of machine learning and
artificial intelligence; see the overview articles (Bengio, 2009; Arel et al., 2010; Yu and Deng,
2011; Deng, 2011, 2013; Hinton et al., 2012; Bengio et al., 2013a) and the media coverage of this
progress (Markoff, 2012; Anthes, 2013). A series of workshops, tutorials, and special issues or
conference special sessions in recent years have been devoted exclusively to deep learning and its
applications to various signal and information processing areas. These include:
2008 NIPS Deep Learning Workshop;
2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications;
2009 ICML Workshop on Learning Feature Hierarchies;
2011 ICML Workshop on Learning Architectures, Representations, and Optimization for
Speech and Visual Information Processing;
2012 ICASSP Tutorial on Deep Learning for Signal and Information Processing;
2012 ICML Workshop on Representation Learning;
2012 Special Section on Deep Learning for Speech and Language Processing in IEEE
Transactions on Audio, Speech, and Language Processing (T-ASLP, January);
2010, 2011, and 2012 NIPS Workshops on Deep Learning and Unsupervised Feature
Learning;
2013 NIPS Workshops on Deep Learning and on Output Representation Learning;