TensorFlow Implementation of Baidu's DeepSpeech Architecture: the DeepSpeech Paper


Deep Speech: Scaling up end-to-end speech recognition

Awni Hannun∗, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng

Baidu Research – Silicon Valley AI Lab

Abstract

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a “phoneme.” Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5’00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

1 Introduction

Top speech recognition systems rely on sophisticated pipelines composed of multiple algorithms and hand-engineered processing stages. In this paper, we describe an end-to-end speech system, called “Deep Speech”, where deep learning supersedes these processing stages. Combined with a language model, this approach achieves higher performance than traditional methods on hard speech recognition tasks while also being much simpler. These results are made possible by training a large recurrent neural network (RNN) using multiple GPUs and thousands of hours of data. Because this system learns directly from data, we do not require specialized components for speaker adaptation or noise filtering. In fact, in settings where robustness to speaker variation and noise is critical, our system excels: Deep Speech outperforms previously published methods on the Switchboard Hub5’00 corpus, achieving 16.0% error, and performs better than commercial systems in noisy speech recognition tests.

Traditional speech systems use many heavily engineered processing stages, including specialized input features, acoustic models, and Hidden Markov Models (HMMs). To improve these pipelines, domain experts must invest a great deal of effort tuning their features and models. The introduction of deep learning algorithms [27, 30, 15, 18, 9] has improved speech system performance, usually by improving acoustic models. While this improvement has been significant, deep learning still plays only a limited role in traditional speech pipelines. As a result, to improve performance on a task such as recognizing speech in a noisy environment, one must laboriously engineer the rest of the system for robustness. In contrast, our system applies deep learning end-to-end using recurrent neural networks. We take advantage of the capacity provided by deep learning systems to learn from large datasets to improve our overall performance. Our model is trained end-to-end to produce transcriptions and thus, with sufficient data and computing power, can learn robustness to noise or speaker variation on its own.

∗ Contact author: awnihannun@baidu.com

arXiv:1412.5567v2 [cs.CL] 19 Dec 2014

Tapping the benefits of end-to-end deep learning, however, poses several challenges: (i) we must find innovative ways to build large, labeled training sets and (ii) we must be able to train networks that are large enough to effectively utilize all of this data. One challenge for handling labeled data in speech systems is finding the alignment of text transcripts with input speech. This problem has been addressed by Graves, Fernández, Gomez and Schmidhuber [13], thus enabling neural networks to easily consume unaligned, transcribed audio during training. Meanwhile, rapid training of large neural networks has been tackled by Coates et al. [7], demonstrating the speed advantages of multi-GPU computation. We aim to leverage these insights to fulfill the vision of a generic learning system, based on large speech datasets and scalable RNN training, that can surpass more complicated traditional methods. This vision is inspired partly by the work of Lee et al. [27] who applied early unsupervised feature learning techniques to replace hand-built speech features.

We have chosen our RNN model specifically to map well to GPUs and we use a novel model partition scheme to improve parallelization. Additionally, we propose a process for assembling large quantities of labeled speech data exhibiting the distortions that our system should learn to handle. Using a combination of collected and synthesized data, our system learns robustness to realistic noise and speaker variation (including Lombard Effect [20]). Taken together, these ideas suffice to build an end-to-end speech system that is at once simpler than traditional pipelines yet also performs better on difficult speech tasks. Deep Speech achieves an error rate of 16.0% on the full Switchboard Hub5’00 test set, the best published result. Further, on a new noisy speech recognition dataset of our own construction, our system achieves a word error rate of 19.1% where the best commercial systems achieve 30.5% error.

In the remainder of this paper, we will introduce the key ideas behind our speech recognition system. We begin by describing the basic recurrent neural network model and training framework that we use in Section 2, followed by a discussion of GPU optimizations (Section 3), and our data capture and synthesis strategy (Section 4). We conclude with our experimental results demonstrating the state-of-the-art performance of Deep Speech (Section 5), followed by a discussion of related work and our conclusions.

2 RNN Training Setup

The core of our system is a recurrent neural network (RNN) trained to ingest speech spectrograms and generate English text transcriptions. Let a single utterance $x$ and label $y$ be sampled from a training set $\mathcal{X} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$. Each utterance, $x^{(i)}$, is a time-series of length $T^{(i)}$ where every time-slice is a vector of audio features, $x^{(i)}_t$, $t = 1, \ldots, T^{(i)}$. We use spectrograms as our features, so $x^{(i)}_{t,p}$ denotes the power of the $p$’th frequency bin in the audio frame at time $t$. The goal of our RNN is to convert an input sequence $x$ into a sequence of character probabilities for the transcription $y$, with $\hat{y}_t = P(c_t | x)$, where $c_t \in \{\text{a}, \text{b}, \text{c}, \ldots, \text{z}, \text{space}, \text{apostrophe}, \text{blank}\}$.

Our RNN model is composed of 5 layers of hidden units. For an input $x$, the hidden units at layer $l$ are denoted $h^{(l)}$ with the convention that $h^{(0)}$ is the input. The first three layers are not recurrent. For the first layer, at each time $t$, the output depends on the spectrogram frame $x_t$ along with a context of $C$ frames on each side (we typically use $C \in \{5, 7, 9\}$ for our experiments). The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time $t$, the first 3 layers are computed by:

$$h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})$$

where $g(z) = \min\{\max\{0, z\}, 20\}$ is the clipped rectified-linear (ReLu) activation function and $W^{(l)}$, $b^{(l)}$ are the weight matrix and bias parameters for layer $l$. The ReLu units are clipped in order to keep the activations in the recurrent layer from exploding; in practice the units rarely saturate at the upper bound. The fourth layer is a bi-directional recurrent layer [38]. This layer includes two sets of hidden units: a set with forward recurrence, $h^{(f)}$, and a set with backward recurrence $h^{(b)}$:

$$h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})$$

$$h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})$$

Note that $h^{(f)}$ must be computed sequentially from $t = 1$ to $t = T^{(i)}$ for the $i$’th utterance, while the units $h^{(b)}$ must be computed sequentially in reverse from $t = T^{(i)}$ to $t = 1$.
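The two recurrences can be traced in a few lines of plain Python. This is only a sketch under stated assumptions: the helper names (`clipped_relu`, `matvec`, `bidirectional_layer`) are hypothetical, the hidden states before the first frame and after the last frame are assumed to be zero (the text does not say how they are initialised), and list-based arithmetic stands in for the BLAS calls a real implementation would use.

```python
def clipped_relu(z, cap=20.0):
    """g(z) = min(max(0, z), cap): the clipped rectified-linear unit."""
    return min(max(0.0, z), cap)


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def bidirectional_layer(h3, W4, Wf, Wb, b4):
    """Compute h4[t] = hf[t] + hb[t] for a list h3 of T layer-3 vectors.

    Assumption: the recurrent state is zero outside the utterance.
    """
    T, n = len(h3), len(b4)
    inp = [matvec(W4, h) for h in h3]  # input term W4 h3_t, shared by both directions

    hf, prev = [None] * T, [0.0] * n
    for t in range(T):  # forward recurrence: first frame to last
        rec = matvec(Wf, prev)
        hf[t] = [clipped_relu(a + r + b) for a, r, b in zip(inp[t], rec, b4)]
        prev = hf[t]

    hb, nxt = [None] * T, [0.0] * n
    for t in reversed(range(T)):  # backward recurrence: last frame to first
        rec = matvec(Wb, nxt)
        hb[t] = [clipped_relu(a + r + b) for a, r, b in zip(inp[t], rec, b4)]
        nxt = hb[t]

    return [[f + b for f, b in zip(hf[t], hb[t])] for t in range(T)]
```

The input term is computed once and reused by both directions, which mirrors the shared $W^{(4)}$ and $b^{(4)}$ in the two equations; only the loops over $t$ are inherently sequential.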

The fifth (non-recurrent) layer takes both the forward and backward units as inputs $h^{(5)}_t = g(W^{(5)} h^{(4)}_t + b^{(5)})$ where $h^{(4)}_t = h^{(f)}_t + h^{(b)}_t$. The output layer is a standard softmax function that yields the predicted character probabilities for each time slice $t$ and character $k$ in the alphabet:

$$h^{(6)}_{t,k} = \hat{y}_{t,k} \equiv P(c_t = k | x) = \frac{\exp(W^{(6)}_k h^{(5)}_t + b^{(6)}_k)}{\sum_j \exp(W^{(6)}_j h^{(5)}_t + b^{(6)}_j)}$$

Here $W^{(6)}_k$ and $b^{(6)}_k$ denote the $k$’th column of the weight matrix and $k$’th bias, respectively.
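As a small illustration of the output layer, the sketch below applies a softmax to one frame of pre-activations over the 29-symbol alphabet (26 letters, space, apostrophe, blank). The name `ALPHABET`, the `"<blank>"` placeholder string and the max-subtraction step are assumptions added here; the subtraction is a common numerical-stability measure that leaves the result unchanged.

```python
import math

# 26 letters + space + apostrophe + blank; "<blank>" is a placeholder name.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", "'", "<blank>"]


def softmax(logits):
    """Turn one frame of pre-activations into character probabilities."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One frame where every character is equally likely.
frame_probs = softmax([0.0] * len(ALPHABET))
```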

Once we have computed a prediction for $P(c_t | x)$, we compute the CTC loss [13] $\mathcal{L}(\hat{y}, y)$ to measure the error in prediction. During training, we can evaluate the gradient $\nabla_{\hat{y}} \mathcal{L}(\hat{y}, y)$ with respect to the network outputs given the ground-truth character sequence $y$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use Nesterov’s Accelerated gradient method for training [41].
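For readers unfamiliar with CTC, the loss is the negative log of the total probability of all frame-level alignments that collapse to the target transcription. The sketch below is the standard forward (alpha) recursion written in plain Python; it is not taken from the paper's implementation. The function name is hypothetical, and it works in probability space rather than log space, so it would underflow on long utterances and raises `ValueError` if no alignment has non-zero probability.

```python
import math


def ctc_neg_log_likelihood(probs, label, blank):
    """CTC loss for one utterance.

    probs[t][k] is P(c_t = k | x); label is the target as a list of
    symbol indices without blanks; blank is the index of the blank symbol.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for s in label:
        ext += [s, blank]
    S, T = len(ext), len(probs)

    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]  # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)
```

With two frames, two symbols at probability 0.5 each, and a one-character target, three alignments are valid (the character twice, character then blank, blank then character), so the likelihood is 0.75.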

Figure 1: Structure of our RNN model and notation.

The complete RNN model is illustrated in Figure 1. Note that its structure is considerably simpler than related models from the literature [14]: we have limited ourselves to a single recurrent layer (which is the hardest to parallelize) and we do not use Long-Short-Term-Memory (LSTM) circuits. One disadvantage of LSTM cells is that they require computing and storing multiple gating neuron responses at each step. Since the forward and backward recurrences are sequential, this small additional cost can become a computational bottleneck. By using a homogeneous model we have made the computation of the recurrent activations as efficient as possible: computing the ReLu outputs involves only a few highly optimized BLAS operations on the GPU and a single point-wise nonlinearity.

For Nesterov’s accelerated gradient, we use momentum of 0.99 and anneal the learning rate by a constant factor, chosen to yield the fastest convergence, after each epoch through the data.
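A minimal sketch of this training schedule follows, using one common formulation of Nesterov momentum in which the gradient is evaluated at a look-ahead point. The source gives only the momentum value (0.99); the function names, the `anneal` divisor and the exact update form are assumptions made for illustration.

```python
def nesterov_step(w, v, grad_fn, lr, momentum=0.99):
    """One Nesterov update on parameter list w with velocity list v."""
    # Evaluate the gradient at the look-ahead point w + momentum * v.
    lookahead = [wi + momentum * vi for wi, vi in zip(w, v)]
    g = grad_fn(lookahead)
    v_new = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def train(w, grad_fn, epochs, steps_per_epoch, lr, anneal):
    """Run Nesterov updates, dividing the learning rate by `anneal`
    (a hypothetical constant factor) after each epoch."""
    v = [0.0] * len(w)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            w, v = nesterov_step(w, v, grad_fn, lr)
        lr /= anneal
    return w
```

For example, with the quadratic $f(w) = w^2$ (gradient $2w$), starting at $w = 1$ with zero velocity and a learning rate of 0.1, the first step moves $w$ to 0.8.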

2.1 Regularization

While we have gone to significant lengths to expand our datasets (cf. Section 4), the recurrent networks we use are still adept at fitting the training data. In order to reduce variance further, we use several techniques.

During training we apply a dropout [19] rate between 5% and 10%. We apply dropout in the feed-forward layers but not to the recurrent hidden activations.

A commonly employed technique in computer vision during network evaluation is to randomly jitter inputs by translations or reflections, feed each jittered version through the network, and vote or average the results [23]. Such jittering is not common in ASR; however, we found it beneficial to translate the raw audio files by 5ms (half the filter bank step size) to the left and right, then forward propagate the recomputed features and average the output probabilities. At test time we also use an ensemble of several RNNs, averaging their outputs in the same way.
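The jitter-and-average step might look roughly like the sketch below. Several details are assumptions not stated in the text: a 16 kHz sample rate, zero-padding at the edges, and averaging the unshifted signal together with the two shifted copies. `predict` is a hypothetical callable that recomputes features from raw samples and returns one probability vector per frame.

```python
def shift_samples(samples, offset):
    """Translate audio by `offset` samples (positive = to the right),
    zero-padding so the length is unchanged."""
    n = len(samples)
    if offset >= 0:
        return ([0.0] * offset + samples)[:n]
    return (samples[-offset:] + [0.0] * n)[:n]


def average_probabilities(prob_seqs):
    """Average several [frame][symbol] probability tables element-wise."""
    count = len(prob_seqs)
    return [[sum(col) / count for col in zip(*frames)]
            for frames in zip(*prob_seqs)]


def jittered_prediction(samples, predict, sample_rate=16000, shift_ms=5):
    """Average the network outputs over left-shifted, unshifted and
    right-shifted copies of the raw audio."""
    k = sample_rate * shift_ms // 1000  # 5 ms is 80 samples at 16 kHz
    variants = [shift_samples(samples, o) for o in (-k, 0, k)]
    return average_probabilities([predict(v) for v in variants])
```

The same `average_probabilities` helper would also cover the ensemble case, with one table per RNN instead of one per shifted copy.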

2.2 Language Model

When trained from large quantities of labeled speech data, the RNN model can learn to produce readable character-level transcriptions. Indeed for many of the transcriptions, the most likely character sequence predicted by the RNN is exactly correct without external language constraints. The errors made by the RNN in this case tend to be phonetically plausible renderings of English words; Table 1 shows some examples. Many of the errors occur on words that rarely or never appear in our training set. In practice, this is hard to avoid: training from enough speech data to hear all of the words or language constructions we might need to know is impractical. Therefore, we integrate our system with an N-gram language model since these models are easily trained from huge unlabeled text corpora. For comparison, while our speech datasets typically include up to 3 million utterances, the N-gram language model used for the experiments in Section 5.2 is trained from a corpus of 220 million phrases, supporting a vocabulary of 495,000 words.

RNN output | Decoded Transcription
what is the weather like in bostin right now | what is the weather like in boston right now
prime miniter nerenr modi | prime minister narendra modi
arther n tickets for the game | are there any tickets for the game

Table 1: Examples of transcriptions directly from the RNN (left) with errors that are fixed by addition of a language model (right).

Given the output $P(c | x)$ of our RNN we perform a search to find the sequence of characters $c_1, c_2, \ldots$ that is most probable according to both the RNN output and the language model (where the language model interprets the string of characters as words). Specifically, we aim to find a sequence $c$ that maximizes the combined objective:

$$Q(c) = \log(P(c | x)) + \alpha \log(P_{\mathrm{lm}}(c)) + \beta \, \mathrm{word\_count}(c)$$

where $\alpha$ and $\beta$ are tunable parameters (set by cross-validation) that control the trade-off between the RNN, the language model constraint and the length of the sentence. The term $P_{\mathrm{lm}}$ denotes the probability of the sequence $c$ according to the N-gram model. We maximize this objective using a highly optimized beam search algorithm, with a typical beam size in the range 1000-8000, similar to the approach described by Hannun et al. [16].
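To make the role of $\alpha$ and $\beta$ concrete, the sketch below scores a short list of candidate transcriptions with $Q(c)$ and keeps the best one. It is a rescoring illustration only, not the beam search itself; the function names are hypothetical and the log-probabilities in the example are invented for illustration, not results from the paper.

```python
def combined_score(log_p_rnn, log_p_lm, transcript, alpha, beta):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c)."""
    return log_p_rnn + alpha * log_p_lm + beta * len(transcript.split())


def pick_best(candidates, alpha, beta):
    """candidates: list of (transcript, log_p_rnn, log_p_lm) tuples."""
    best = max(candidates,
               key=lambda c: combined_score(c[1], c[2], c[0], alpha, beta))
    return best[0]


# Invented scores: the RNN slightly prefers the misspelling, while the
# language model strongly prefers the real word.
candidates = [
    ("what is the weather like in bostin right now", -4.0, -30.0),
    ("what is the weather like in boston right now", -5.0, -12.0),
]
with_lm = pick_best(candidates, alpha=1.0, beta=1.0)
without_lm = pick_best(candidates, alpha=0.0, beta=1.0)
```

With $\alpha = 0$ the language model is ignored and the RNN's preferred spelling wins; with $\alpha = 1$ the language model term outweighs the small difference in RNN score.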

3 Optimizations

As noted above, we have made several design decisions to make our networks amenable to high-speed execution (and thus fast training). For example, we have opted for homogeneous rectified-linear networks that are simple to implement and depend on just a few highly-optimized BLAS calls. When fully unrolled, our networks include almost 5 billion connections for a typical utterance

We use the KenLM toolkit [17] to train the N-gram language models in our experiments.
