model on a smaller dataset conditioned on the representation learned by the first network. Decoupling
the networks enables them to be trained on independent data, which reduces the need to obtain high-quality multispeaker training data. We train the speaker embedding network on a speaker verification
task to determine if two different utterances were spoken by the same speaker. In contrast to the
subsequent TTS model, this network is trained on untranscribed speech containing reverberation and
background noise from a large number of speakers.
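As a rough illustration of the verification task, a trained encoder maps each utterance to a fixed-dimensional embedding, and two utterances are judged to come from the same speaker when their embeddings are sufficiently similar. The sketch below uses random placeholder embeddings and an illustrative threshold rather than the actual encoder or operating point used in this work.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of verification scoring on utterance embeddings.
# The 256-dim embeddings are random placeholders standing in for encoder
# outputs, and the 0.75 threshold is an illustrative assumption.
emb_a = F.normalize(torch.randn(256), dim=0)  # embedding of utterance A
emb_b = F.normalize(torch.randn(256), dim=0)  # embedding of utterance B

score = torch.dot(emb_a, emb_b)               # cosine similarity of unit vectors
same_speaker = score > 0.75                   # accept/reject against a tuned threshold
print(f"similarity={score.item():.3f}, same speaker: {bool(same_speaker)}")
```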
We demonstrate that the speaker encoder and synthesis networks can be trained on unbalanced and
disjoint sets of speakers and still generalize well. We train the synthesis network on 1.2K speakers
and show that training the encoder on a much larger set of 18K speakers improves adaptation quality,
and further enables synthesis of completely novel speakers by sampling from the embedding prior.
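Sampling from the embedding prior can be sketched as follows, assuming the speaker embeddings are L2-normalized (as is typical for d-vectors): a fictitious speaker is drawn by sampling a Gaussian vector and projecting it onto the unit hypersphere. The dimensionality below is an illustrative assumption, not the configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: sample a fictitious speaker embedding from an isotropic
# Gaussian and project it onto the unit hypersphere (assumes L2-normalized
# speaker embeddings; the dimensionality is illustrative).
z = rng.standard_normal(256)
fictitious_embedding = z / np.linalg.norm(z)

# This vector would then condition the synthesizer in place of an embedding
# computed from real reference audio.
print(fictitious_embedding[:4], np.linalg.norm(fictitious_embedding))
```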
There has been significant interest in end-to-end training of TTS models, which are trained directly from text-audio pairs, without depending on hand-crafted intermediate representations [17, 23].
Tacotron 2 [15] used WaveNet [19] as a vocoder to invert spectrograms generated by an encoder-decoder architecture with attention [3], obtaining naturalness approaching that of human speech by combining Tacotron’s [23] prosody with WaveNet’s audio quality. It only supported a single speaker.
Gibiansky et al. [8] introduced a multispeaker variation of Tacotron which learned a low-dimensional speaker embedding for each training speaker. Deep Voice 3 [13] proposed a fully convolutional encoder-decoder architecture which scaled up to support over 2,400 speakers from LibriSpeech [12].
These systems learn a fixed set of speaker embeddings and therefore only support synthesis of voices
seen during training. In contrast, VoiceLoop [18] proposed a novel architecture based on a fixed-size memory buffer which can generate speech from voices unseen during training. Obtaining good
results required tens of minutes of enrollment speech and transcripts for a new speaker.
Recent extensions have enabled few-shot speaker adaptation where only a few seconds of speech
per speaker (without transcripts) can be used to generate new speech in that speaker’s voice. [2] extends Deep Voice 3, comparing a speaker adaptation method similar to [18], in which the model parameters (including the speaker embedding) are fine-tuned on a small amount of adaptation data, to a speaker encoding method which uses a neural network to predict a speaker embedding directly from a
spectrogram. The latter approach is significantly more data efficient, obtaining higher naturalness
using small amounts of adaptation data, in as few as one or two utterances. It is also significantly
more computationally efficient since it does not require hundreds of backpropagation iterations.
Nachmani et al. [10] similarly extended VoiceLoop to utilize a target speaker encoding network to
predict a speaker embedding. This network is trained jointly with the synthesis network using a
contrastive triplet loss to ensure that embeddings predicted from utterances by the same speaker are
closer than embeddings computed from different speakers. In addition, a cycle-consistency loss is
used to ensure that the synthesized speech encodes to a similar embedding as the adaptation utterance.
A similar spectrogram encoder network, trained without a triplet loss, was shown to work for
transferring target prosody to synthesized speech [16]. In this paper we demonstrate that training a
similar encoder to discriminate between speakers leads to reliable transfer of speaker characteristics.
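To make the contrastive objective described above concrete, the sketch below implements a generic triplet margin loss over anchor, same-speaker, and different-speaker utterance embeddings; the exact formulation and margin used in [10] may differ, and the embeddings here are random placeholders.

```python
import torch
import torch.nn.functional as F

# Generic triplet loss on speaker embeddings (a sketch, not the exact loss of [10]):
# pull same-speaker embeddings together, push different-speaker embeddings apart.
def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)  # distance to same-speaker utterance
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)  # distance to other-speaker utterance
    return F.relu(d_pos - d_neg + margin).mean()

# Placeholder batch of 64-dim embeddings standing in for encoder outputs.
anchor, positive, negative = (torch.randn(8, 64) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```

A cycle-consistency term of the kind described in [10] would additionally re-encode the synthesized speech and penalize the distance between that embedding and the embedding of the adaptation utterance.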
Our work is most similar to the speaker encoding models in [2, 10], except that we utilize a network independently trained for a speaker verification task on a large dataset of untranscribed audio from tens of thousands of speakers, using a state-of-the-art generalized end-to-end loss [22]. [10] incorporated a similar speaker-discriminative representation into their model; however, all components were trained jointly. In contrast, we explore transfer learning from a pre-trained speaker verification model.
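As a rough sketch, the softmax variant of the generalized end-to-end (GE2E) loss [22] scores each utterance embedding against every speaker centroid with a scaled cosine similarity, computes the centroid of the utterance's own speaker excluding that utterance, and applies a cross-entropy loss that pushes the embedding toward the correct centroid. The implementation below is a simplified sketch with placeholder inputs; see [22] for the full formulation and training details.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings, w, b):
    # Softmax variant of the GE2E loss (a simplified sketch of [22]).
    # embeddings: (N speakers, M utterances, D), assumed L2-normalized.
    # w, b: learned scalars scaling the cosine similarities (w kept positive).
    n_spk, n_utt, dim = embeddings.shape
    centroids = embeddings.mean(dim=1)                                       # (N, D)
    # Centroid of each speaker excluding the utterance being scored, so an
    # utterance is not trivially matched against itself.
    excl = (embeddings.sum(dim=1, keepdim=True) - embeddings) / (n_utt - 1)  # (N, M, D)

    flat = embeddings.reshape(n_spk * n_utt, dim)
    # Similarity of every utterance to every speaker centroid.
    sim = F.cosine_similarity(flat.unsqueeze(1), centroids.unsqueeze(0), dim=-1)
    sim = sim.reshape(n_spk, n_utt, n_spk)
    # Replace the "own speaker" entries with similarities to the exclusive centroids.
    own = F.cosine_similarity(embeddings, excl, dim=-1)                      # (N, M)
    mask = torch.eye(n_spk, dtype=torch.bool).unsqueeze(1).expand(n_spk, n_utt, n_spk)
    sim = torch.where(mask, own.unsqueeze(-1).expand_as(sim), sim)
    sim = w.clamp(min=1e-6) * sim + b

    target = torch.arange(n_spk).repeat_interleave(n_utt)                    # true speaker per utterance
    return F.cross_entropy(sim.reshape(n_spk * n_utt, n_spk), target)

# Toy batch: 4 speakers x 5 utterances, 64-dim placeholder embeddings.
emb = F.normalize(torch.randn(4, 5, 64), dim=-1)
w, b = torch.tensor(10.0), torch.tensor(-5.0)
print(ge2e_softmax_loss(emb, w, b))
```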
Doddipatla et al. [7] used a similar transfer learning configuration where a speaker embedding
computed from a pre-trained speaker classifier was used to condition a TTS system. In this paper we
utilize an end-to-end synthesis network which does not rely on intermediate linguistic features, and a
substantially different speaker embedding network which is not limited to a closed set of speakers.
Furthermore, we analyze how quality varies with the number of speakers in the training set, and find
that zero-shot transfer requires training on thousands of speakers, many more than were used in [7].
2 Multispeaker speech synthesis model
Our system is composed of three independently trained neural networks, illustrated in Figure 1: (1) a
recurrent speaker encoder, based on [22], which computes a fixed-dimensional vector from a speech