This is an example pipeline for text-to-speech using Tacotron2.
Here is a [colab example](https://colab.research.google.com/drive/1MPcn1_G5lKozxZ7v8b9yucOD5X5cLK4j?usp=sharing)
that shows how the text-to-speech pipeline is used during inference with the built-in pretrained models.
## Install required packages
Install the required packages:
```bash
pip install librosa tqdm inflect joblib
```
To use TensorBoard, also install:
```bash
pip install tensorboard pillow
```
## Training Tacotron2 with characters as input
The training of Tacotron2 can be invoked with the following command.
```bash
python train.py \
--learning-rate 1e-3 \
--epochs 1501 \
--anneal-steps 500 1000 1500 \
--anneal-factor 0.1 \
--batch-size 96 \
--weight-decay 1e-6 \
--grad-clip 1.0 \
--text-preprocessor english_characters \
--logging-dir ./logs \
--checkpoint-path ./ckpt.pth \
--dataset-path ./
```
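The `--anneal-steps` and `--anneal-factor` flags in the command above describe a step-wise learning-rate schedule. The sketch below shows one plausible reading of those flags, assuming the learning rate is multiplied by the anneal factor once for every anneal step the current epoch has reached. The function name `annealed_learning_rate` is hypothetical and is not taken from `train.py`.

```python
def annealed_learning_rate(epoch, base_lr, anneal_steps, anneal_factor):
    """Return the learning rate for `epoch` under step-wise annealing.

    Assumption: the rate is multiplied by `anneal_factor` once for every
    entry in `anneal_steps` that `epoch` has reached.
    """
    steps_passed = sum(1 for step in anneal_steps if epoch >= step)
    return base_lr * (anneal_factor ** steps_passed)


# Values copied from the training command: 1e-3, steps 500/1000/1500, factor 0.1.
for epoch in (0, 499, 500, 1000, 1500):
    lr = annealed_learning_rate(epoch, 1e-3, [500, 1000, 1500], 0.1)
    print(f"epoch {epoch:4d}: lr = {lr:.1e}")
```

Under this assumption the rate stays at its initial value until epoch 500 and then drops by a factor of ten at each listed epoch.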
The training script uses all available GPUs. Set the
environment variable `CUDA_VISIBLE_DEVICES` if you don't want all GPUs to be used.
The newest checkpoint will be saved to `./ckpt.pth` and the checkpoint with the best validation
loss will be saved to `./best_ckpt.pth`.
The training log will be saved to `./logs/train.log` and the tensorboard results will also
be in `./logs`.
If `./ckpt.pth` already exists, the script automatically loads the file and tries to continue
training from the checkpoint.
This command takes around 36 hours to train on 8 NVIDIA Tesla V100 GPUs.
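The checkpoint behaviour described above (always write the newest state, keep a separate copy for the best validation loss, and resume when a checkpoint file is already present) can be sketched as follows. This is an illustration only: JSON files in a temporary directory stand in for the real model checkpoints, the validation losses are made up, and the helper names `resume_epoch` and `save_checkpoints` are hypothetical rather than functions from `train.py`.

```python
import json
import os
import tempfile


def resume_epoch(latest_path):
    """Return the epoch to start from: 0 for a fresh run, else saved epoch + 1."""
    if os.path.exists(latest_path):
        with open(latest_path) as f:
            return json.load(f)["epoch"] + 1
    return 0


def save_checkpoints(state, val_loss, best_loss, latest_path, best_path):
    """Always write the newest state; also write it as 'best' if val_loss improved."""
    with open(latest_path, "w") as f:
        json.dump(state, f)
    if val_loss < best_loss:
        with open(best_path, "w") as f:
            json.dump(state, f)
        return val_loss
    return best_loss


workdir = tempfile.mkdtemp()
latest_path = os.path.join(workdir, "ckpt.json")
best_path = os.path.join(workdir, "best_ckpt.json")

best_loss = float("inf")
fake_val_losses = [0.9, 0.5, 0.7]  # made-up numbers, one per epoch
for epoch in range(resume_epoch(latest_path), len(fake_val_losses)):
    state = {"epoch": epoch, "val_loss": fake_val_losses[epoch]}
    best_loss = save_checkpoints(
        state, fake_val_losses[epoch], best_loss, latest_path, best_path
    )

print("next epoch on restart:", resume_epoch(latest_path))
```

Keeping the two files separate means an interrupted run can continue from the newest state while the best-scoring state remains available for inference.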
To train the Tacotron2 model to work with the [pretrained wavernn](https://pytorch.org/audio/main/models.html#id10)
with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command instead.
```bash
python train.py \
--learning-rate 1e-3 \
--epochs 1501 \
--anneal-steps 500 1000 1500 \
--anneal-factor 0.1 \
--sample-rate 22050 \
--n-fft 2048 \
--hop-length 275 \
--win-length 1100 \
--mel-fmin 40 \
--mel-fmax 11025 \
--batch-size 96 \
--weight-decay 1e-6 \
--grad-clip 1.0 \
--text-preprocessor english_characters \
--logging-dir ./wavernn_logs \
--checkpoint-path ./ckpt_wavernn.pth \
--dataset-path ./
```
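The additional spectrogram flags in the command above can be related to each other with a little arithmetic. The sketch below only converts the values from that command into time and frequency units; the helper name `frame_timing` is hypothetical, and nothing here is taken from `train.py`.

```python
# Values copied from the training command above.
SAMPLE_RATE = 22050  # samples per second
N_FFT = 2048         # FFT size
HOP_LENGTH = 275     # samples between successive frames
WIN_LENGTH = 1100    # samples per analysis window
MEL_FMAX = 11025     # upper edge of the mel filter bank, in Hz


def frame_timing(sample_rate, hop_length, win_length):
    """Return (hop in ms, window in ms, frames per second)."""
    hop_ms = 1000.0 * hop_length / sample_rate
    win_ms = 1000.0 * win_length / sample_rate
    return hop_ms, win_ms, sample_rate / hop_length


hop_ms, win_ms, fps = frame_timing(SAMPLE_RATE, HOP_LENGTH, WIN_LENGTH)
print(f"hop: {hop_ms:.2f} ms, window: {win_ms:.2f} ms, {fps:.1f} frames/s")
print("mel-fmax equals the Nyquist frequency:", MEL_FMAX == SAMPLE_RATE / 2)
print("window fits inside the FFT:", WIN_LENGTH <= N_FFT)
```

With these settings the window is four hops long, and `--mel-fmax` sits exactly at half the sample rate.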
## Training Tacotron2 with phonemes as input
#### Dependencies
This example uses [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) as
the phonemizer (the function that turns text into phonemes).
Install it with the following command (the code is tested with version 0.0.15).
```bash
pip install deep-phonemizer==0.0.15
```
Then download the model weights from [their website](https://github.com/as-ideas/DeepPhonemizer).
The link to the checkpoint that is tested with this example is
[https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt).
#### Running training script
The training of Tacotron2 with English phonemes as input can be invoked with the following command.
```bash
python train.py \
--workers 12 \
--learning-rate 1e-3 \
--epochs 1501 \
--anneal-steps 500 1000 1500 \
--anneal-factor 0.1 \
--batch-size 96 \
--weight-decay 1e-6 \
--grad-clip 1.0 \
--text-preprocessor english_phonemes \
--phonemizer DeepPhonemizer \
--phonemizer-checkpoint ./en_us_cmudict_forward.pt \
--cmudict-root ./ \
--logging-dir ./english_phonemes_logs \
--checkpoint-path ./english_phonemes_ckpt.pth \
--dataset-path ./
```
Similar to the previous examples, this command will save the log in the directory `./english_phonemes_logs`
and the checkpoint will be saved to `./english_phonemes_ckpt.pth`.
To train the Tacotron2 model with English phonemes so that it works with the
[pretrained wavernn](https://pytorch.org/audio/main/models.html#id10)
with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command.
```bash
python train.py \
--workers 12 \
--learning-rate 1e-3 \
--epochs 1501 \
--anneal-steps 500 1000 1500 \
--anneal-factor 0.1 \
--sample-rate 22050 \
--n-fft 2048 \
--hop-length 275 \
--win-length 1100 \
--mel-fmin 40 \
--mel-fmax 11025 \
--batch-size 96 \
--weight-decay 1e-6 \
--grad-clip 1.0 \
--text-preprocessor english_phonemes \
--phonemizer DeepPhonemizer \
--phonemizer-checkpoint ./en_us_cmudict_forward.pt \
--cmudict-root ./ \
--logging-dir ./english_phonemes_wavernn_logs \
--checkpoint-path ./english_phonemes_wavernn_ckpt.pth \
--dataset-path ./
```
## Text-to-speech pipeline
Here we present an example of how to use Tacotron2 to generate audio from text.
The text-to-speech pipeline goes as follows:
1. Text preprocessing: encode the text into a list of symbols (the symbols can represent characters, phonemes, etc.).
2. Spectrogram generation: feed the list of symbols to a Tacotron2 model, which
outputs the mel spectrogram.
3. Time-domain conversion: convert the mel spectrogram into audio with a vocoder.
This script currently supports three vocoders:
[WaveRNN](https://pytorch.org/audio/stable/models/wavernn.html),
[Griffin-Lim](https://pytorch.org/audio/stable/transforms.html#griffinlim), and
[Nvidia's WaveGlow](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/).
The spectrogram parameters, including `n-fft`, `mel-fmin`, and `mel-fmax`, should be set to the values
used during the training of Tacotron2.
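To make the text preprocessing step more concrete, here is a minimal sketch of a character-level encoder that turns text into a list of integer symbol IDs. The symbol table and the function name `encode_characters` are hypothetical and chosen for illustration; the table used by the `english_characters` preprocessor in this script may differ.

```python
import string

# Hypothetical symbol table: padding, punctuation, space, then lowercase letters.
SYMBOLS = "_-!'(),.:;? " + string.ascii_lowercase
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def encode_characters(text):
    """Lowercase `text` and map each known character to its integer ID.

    Characters that are not in the symbol table are dropped.
    """
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


ids = encode_characters("Hello world!")
print(ids)

# Decoding the IDs recovers the normalized (lowercased) text.
print("".join(SYMBOLS[i] for i in ids))
```

The resulting list of IDs is the kind of input that the spectrogram generation step would receive. A phoneme-based preprocessor follows the same idea with a table of phoneme symbols instead of characters.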
#### Pretrained WaveRNN as the Vocoder
The following command will generate a waveform to `./outputs.wav`
with the text "Hello world!" using WaveRNN as the vocoder.
```bash
python inference.py --checkpoint-path ${model_path} \
--vocoder wavernn \
--n-fft 2048 \
--mel-fmin 40 \
--mel-fmax 11025 \
--input-text "Hello world!" \
--text-preprocessor english_characters \
--output-path "./outputs.wav"
```
To synthesize a different text, change the value of `--input-text`. To use phonemes
as the input to Tacotron2, pass `--text-preprocessor english_phonemes`.
The following is an example.
(Remember to install [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer)
and download its pretrained weights.)
```bash
python inference.py --checkpoint-path ${model_path} \
--vocoder wavernn \
--n-fft 2048 \
--mel-fmin 40 \
--mel-fmax 11025 \
--input-text "Hello world!" \
--text-preprocessor english_phonemes \
--phonemizer DeepPhonemizer \
--phonemizer-checkpoint ./en_us_cmudict_forward.pt \
--cmudict-root ./ \
--output-path "./outputs.wav"
```
To use torchaudio pretrained models, please see the following example command.
For Tacotron2, we use the checkpoint named `"tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech"`, and
for WaveRNN, we use the checkpoint named `"wavernn_10k_epochs_8bits_ljspeech"`.
See https://pytorch.org/audio/stable/models.html for more checkpoint options for Tacotron2 and WaveRNN.
```bash
python inference.py \
--checkpoint-path tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech \
--wavernn-checkpoint-path wavernn_10k_epochs_8bits_ljspeech \
--vocoder wavernn \
--n-fft 2048 \
--mel-fmin 40 \
--mel-fmax 11025 \
--input-text "Hello world!" \
--text-preprocessor english_phonemes \
--phonemizer DeepPhonemizer \
--phonemizer-checkpoint ./en_us_cmudict_forward.pt \
--cmudict-root ./ \
--output-path "./outputs.wav"
```
#### Griffin-Lim's algorithm as the Vocoder
The following command will generate a waveform to `./outputs.wav`
with the text "Hello world!" using Griffin-Lim's algorithm as the vocoder.
```bash
python inference.py --checkpoint-path ${model_path} \
--vocoder griffin_lim \
--n-fft 1024 \
--mel-fmin 0 \
--mel-fmax 8000 \
--input-text "Hello world!" \
--text-preprocessor english_char