![alt text](assets/banner.jpg)
# Deepvoice3_pytorch
[![PyPI](https://img.shields.io/pypi/v/deepvoice3_pytorch.svg)](https://pypi.python.org/pypi/deepvoice3_pytorch)
[![Build Status](https://travis-ci.org/r9y9/deepvoice3_pytorch.svg?branch=master)](https://travis-ci.org/r9y9/deepvoice3_pytorch)
[![Build status](https://ci.appveyor.com/api/projects/status/8eurjakfaofbr24k?svg=true)](https://ci.appveyor.com/project/r9y9/deepvoice3-pytorch)
PyTorch implementation of convolutional network-based text-to-speech synthesis models:
1. [arXiv:1710.07654](https://arxiv.org/abs/1710.07654): Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.
2. [arXiv:1710.08969](https://arxiv.org/abs/1710.08969): Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.
Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.
## Online TTS demo
Notebooks that can be executed on https://colab.research.google.com are available:
- [DeepVoice3: Multi-speaker text-to-speech demo](https://colab.research.google.com/github/r9y9/Colaboratory/blob/master/DeepVoice3_multi_speaker_TTS_en_demo.ipynb)
- [DeepVoice3: Single-speaker text-to-speech demo](https://colab.research.google.com/github/r9y9/Colaboratory/blob/master/DeepVoice3_single_speaker_TTS_en_demo.ipynb)
## Highlights
- Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
- Multi-speaker and single-speaker versions of DeepVoice3
- Audio samples and pre-trained models
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
- Language-dependent frontend text processor for English and Japanese
### Samples
- [Ja Step000380000 Predicted](https://soundcloud.com/user-623907374/ja-step000380000-predicted)
- [Ja Step000370000 Predicted](https://soundcloud.com/user-623907374/ja-step000370000-predicted)
- [Ko_single Step000410000 Predicted](https://soundcloud.com/user-623907374/ko-step000410000-predicted)
- [Ko_single Step000400000 Predicted](https://soundcloud.com/user-623907374/ko-step000400000-predicted)
- [Ko_multi Step001680000 Predicted](https://soundcloud.com/user-623907374/step001680000-predicted)
- [Ko_multi Step001700000 Predicted](https://soundcloud.com/user-623907374/step001700000-predicted)
## Pretrained models
**NOTE**: pretrained models are not compatible with the master branch. To be updated soon.
| URL | Model | Data | Hyper parameters | Git commit | Steps |
|-----|------------|----------|--------------------------------------------------|----------------------|--------|
| [link](https://www.dropbox.com/s/5ucl9remrwy5oeg/20180505_deepvoice3_checkpoint_step000640000.pth?dl=0) | DeepVoice3 | LJSpeech | [link](https://www.dropbox.com/s/0ck82unm0bo0rxd/20180505_deepvoice3_ljspeech.json?dl=0) | [abf0a21](https://github.com/r9y9/deepvoice3_pytorch/tree/abf0a21f83aeb451b918f867bc23378f1e2e608b)| 640k |
| [link](https://www.dropbox.com/s/1y8bt6bnggbzzlp/20171129_nyanko_checkpoint_step000585000.pth?dl=0) | Nyanko | LJSpeech | `builder=nyanko,preset=nyanko_ljspeech` | [ba59dc7](https://github.com/r9y9/deepvoice3_pytorch/tree/ba59dc75374ca3189281f6028201c15066830116) | 585k |
| [link](https://www.dropbox.com/s/uzmtzgcedyu531k/20171222_deepvoice3_vctk108_checkpoint_step000300000.pth?dl=0) | Multi-speaker DeepVoice3 | VCTK | `builder=deepvoice3_multispeaker,preset=deepvoice3_vctk` | [0421749](https://github.com/r9y9/deepvoice3_pytorch/tree/0421749af908905d181f089f06956fddd0982d47) | 300k + 300k |
To use pre-trained models, it's highly recommended that you check out the **specific git commit** noted above, i.e.:
```
git checkout ${commit_hash}
```
Then follow the "Synthesize from a checkpoint" section in the README of that specific git commit. Please note that the latest development version of the repository may not work with these checkpoints.
For example, you could try:
```
# pretrained model (20180505_deepvoice3_checkpoint_step000640000.pth)
# hparams (20180505_deepvoice3_ljspeech.json)
git checkout 4357976
python synthesis.py --preset=20180505_deepvoice3_ljspeech.json \
20180505_deepvoice3_checkpoint_step000640000.pth \
sentences.txt \
output_dir
```
## Notes on hyper parameters
- Default hyper parameters, used during preprocessing/training/synthesis stages, are tuned for English TTS using the LJSpeech dataset. You will have to change some of the parameters if you want to try other datasets. See `hparams.py` for details.
- `builder` specifies which model you want to use. `deepvoice3`, `deepvoice3_multispeaker` [1] and `nyanko` [2] are supported.
- Hyper parameters described in the DeepVoice3 paper for the single-speaker model didn't work for the LJSpeech dataset, so I changed a few things: added dilated convolutions, more channels, more layers and a guided attention loss, etc. See the code for details. The changes also apply to the multi-speaker model.
- Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
- With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably if we use multiple attention layers. With guided attention, I can confirm that five attention layers become monotonic, though I couldn't get speech quality improvements.
- Binary divergence (described in https://arxiv.org/abs/1710.08969) seems to stabilize training, particularly for deep (> 10 layers) networks.
- Adam with step learning-rate decay works. However, for deeper networks, I find Adam plus the Noam learning-rate scheduler more stable.
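For reference, here is a minimal NumPy sketch of the guided attention penalty and the Noam learning-rate schedule mentioned above. It illustrates the ideas rather than reproducing the repository's exact code (see `train.py` and `lrschedule.py` for the actual implementations; the function names below are illustrative):
```
import numpy as np

def guided_attention_matrix(N, T, g=0.2):
    # Penalty matrix W from arXiv:1710.08969: W is near 0 close to the
    # diagonal, so roughly monotonic alignments incur little loss. The
    # guided attention loss is then mean(A * W) for an attention matrix A
    # of shape (N, T), where N = text length and T = number of frames.
    n = np.arange(N).reshape(-1, 1) / N
    t = np.arange(T).reshape(1, -1) / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g * g))

def noam_learning_rate(init_lr, step, warmup_steps=4000):
    # Noam scheduler: linear warmup followed by inverse-square-root decay.
    step = max(step, 1)
    return init_lr * warmup_steps ** 0.5 * min(
        step * warmup_steps ** -1.5, step ** -0.5)
```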
## Requirements
- Python 3
- CUDA >= 8.0
- PyTorch >= v0.4.0
- TensorFlow >= v1.3
- [nnmnkwii](https://github.com/r9y9/nnmnkwii) >= v0.0.11
- [MeCab](http://taku910.github.io/mecab/) (Japanese only)
## Installation
Please install packages listed above first, and then
```
git clone https://github.com/r9y9/deepvoice3_pytorch && cd deepvoice3_pytorch
pip install -e ".[bin]"
```
## Getting started
### Preset parameters
There are many hyper parameters to be tuned, depending on the model and data you are working on. For typical datasets and models, parameters known to work well (**presets**) are provided in the repository. See the `presets` directory for details. Notice that
1. `preprocess.py`
2. `train.py`
3. `synthesis.py`
accept the optional `--preset=<json>` parameter, which specifies where to load preset parameters from. If you are going to use preset parameters, you must use the same `--preset=<json>` throughout preprocessing, training and evaluation, e.g.,
```
python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0 ./data/ljspeech
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
```
instead of
```
python preprocess.py ljspeech ~/data/LJSpeech-1.0 ./data/ljspeech
# warning! training may use hyper parameters different from those used at the preprocessing stage
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
```
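A preset file is just a JSON object whose keys override the corresponding defaults defined in `hparams.py`. A hypothetical minimal fragment for illustration (the keys shown are examples; see the files under the `presets` directory for real, complete presets):
```
{
  "builder": "deepvoice3",
  "n_speakers": 1,
  "batch_size": 16
}
```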
### 0. Download dataset
- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- NIKL (ko) (**a Korean cellphone number is needed for access**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464
### 1. Preprocessing
Usage:
```
python preprocess.py ${dataset_name} ${dataset_path} ${out_dir} --preset=<json>
```
Supported `${dataset_name}`s are:
- `ljspeech` (en, single speaker)
- `vctk` (en, multi-speaker)
- `jsut` (jp, single speaker)
- `nikl_m` (ko, multi-speaker)
- `nikl_s` (ko, single speaker)
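For example, assuming the LJSpeech dataset has been downloaded to `~/data/LJSpeech-1.0`, a typical invocation would be:
```
python preprocess.py ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech \
    --preset=presets/deepvoice3_ljspeech.json
```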