PyTorch simple audio I/O project source code (voice conversion resource) - CSDN Wenku

878 files in total

py: 532

rst: 48

cpp: 43

pytorch

Points required: 1 · 89 views · Uploaded 2024-03-05 19:08:16 · 4.69 MB ZIP


PyTorch simple audio I/O project source code (878 files)

mat.ark 112B

vec_flt.ark 81B

vec_int.ark 81B

kenlm_char.arpa 1KB

kenlm.arpa 649B

nasa_13013.avi 623KB

RATRACE_wave_f_nm_np1_fr_goo_37.avi 258KB

cuda_install.bat 7KB

install_runtime.bat 2KB

install_activate.bat 2KB

activate.bat 1KB

driver_update.bat 918B

vc_env_helper.bat 887B

make.bat 782B

build.bat 398B

bld.bat 107B

install_conda.bat 107B

refs.bib 28KB

stub.c 2KB

setup.cfg 77B

CITATION 681B

.clang-format 3KB

.clang-tidy 1KB

LoadHIP.cmake 7KB

TorchAudioHelper.cmake 4KB

CODEOWNERS 387B

encode_process.cpp 30KB

conversion.cpp 22KB

post_process.cpp 20KB

stream_reader.cpp 18KB

pybind.cpp 18KB

utils.cpp 16KB

tensor_converter.cpp 16KB

stream_processor.cpp 13KB

ctc_prefix_decoder.cpp 12KB

stream_writer.cpp 12KB

ray_tracing.cpp 12KB

lfilter.cpp 10KB

effects_chain.cpp 9KB

rir.cpp 8KB

filter_graph.cpp 8KB

compute.cpp 7KB

compute.cpp 5KB

ffmpeg.cpp 5KB

wall_collision.cpp 4KB

chunked_buffer.cpp 4KB

io.cpp 4KB

python_binding.cpp 4KB

effects.cpp 4KB

types.cpp 3KB

transcribe_list.cpp 3KB

encoder.cpp 2KB

compute_betas.cpp 2KB

compute_alphas.cpp 2KB

overdrive.cpp 2KB

autograd.cpp 2KB

pybind.cpp 1KB

transcribe.cpp 1KB

packet_writer.cpp 1KB

hw_context.cpp 945B

compute.cpp 934B

unchunked_buffer.cpp 774B

compute.cpp 746B

main.cpp 657B

packet_buffer.cpp 627B

utils.cpp 455B

pybind.cpp 329B

compute_alphas.cpp 255B

compute_betas.cpp 254B

custom.css 2KB

ctc_prefix_decoder_kernel_v2.cu 30KB

compute.cu 12KB

compute.cu 5KB

iir_cuda.cu 3KB

compute_betas.cu 2KB

compute_alphas.cu 2KB

warpsort_topk.cuh 16KB

gpu_kernels.cuh 10KB

bitonic_sort.cuh 10KB

pow2_utils.cuh 5KB

ctc_fast_divmod.cuh 4KB

device_log_prob.cuh 3KB

gpu_kernel_utils.cuh 2KB

math.cuh 787B

half.cuh 701B

Doxyfile 118KB

.flake8 318B

.gitattributes 96B

.gitignore 2KB

.gitignore 35B

.gitmodules 0B

cpu_kernels.h 16KB

stream_reader.h 14KB

stream_writer.h 13KB

gpu_transducer.h 12KB

ffmpeg.h 7KB

wall.h 7KB

workspace.h 6KB

cpu_transducer.h 5KB

ctc_prefix_decoder_host.h 5KB


This is an example pipeline for text-to-speech using Tacotron2.

Here is a [colab example](https://colab.research.google.com/drive/1MPcn1_G5lKozxZ7v8b9yucOD5X5cLK4j?usp=sharing) that shows how the text-to-speech pipeline is used during inference with the built-in pretrained models.

## Install required packages

Required packages

```bash
pip install librosa tqdm inflect joblib
```

To use tensorboard

```bash
pip install tensorboard pillow
```

## Training Tacotron2 with character as input

The training of Tacotron2 can be invoked with the following command.

```bash
python train.py \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_characters \
    --logging-dir ./logs \
    --checkpoint-path ./ckpt.pth \
    --dataset-path ./
```

The training script uses all available GPUs. Set the environment variable `CUDA_VISIBLE_DEVICES` if you don't want all GPUs to be used.
The newest checkpoint is saved to `./ckpt.pth`, and the checkpoint with the best validation loss is saved to `./best_ckpt.pth`.
The training log is saved to `./logs/train.log`, and the tensorboard results are also written to `./logs`.

If `./ckpt.pth` already exists, the script automatically loads the file and tries to continue training from that checkpoint.

This command takes around 36 hours to train on 8 NVIDIA Tesla V100 GPUs.

To train the Tacotron2 model to work with the [pretrained wavernn](https://pytorch.org/audio/main/models.html#id10) with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command instead.
```bash
python train.py \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --sample-rate 22050 \
    --n-fft 2048 \
    --hop-length 275 \
    --win-length 1100 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_characters \
    --logging-dir ./wavernn_logs \
    --checkpoint-path ./ckpt_wavernn.pth \
    --dataset-path ./
```

## Training Tacotron2 with phoneme as input

#### Dependencies

This example uses [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) as the phonemizer (the function that turns text into phonemes). Install it with the following command (the code is tested with version 0.0.15).

```bash
pip install deep-phonemizer==0.0.15
```

Then download the model weights from [their website](https://github.com/as-ideas/DeepPhonemizer).

The link to the checkpoint that is tested with this example is
[https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt).

#### Running training script

The training of Tacotron2 with English phonemes as input can be invoked with the following command.

```bash
python train.py \
    --workers 12 \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_phonemes \
    --phonemizer DeepPhonemizer \
    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
    --cmudict-root ./ \
    --logging-dir ./english_phonemes_logs \
    --checkpoint-path ./english_phonemes_ckpt.pth \
    --dataset-path ./
```

Similar to the previous examples, this command saves the log in the directory `./english_phonemes_logs` and the checkpoint to `./english_phonemes_ckpt.pth`.
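The `--anneal-steps 500 1000 1500` and `--anneal-factor 0.1` flags in the training commands above describe a stepwise learning-rate schedule. The following is a rough sketch of such a schedule, assuming the rate is multiplied by the anneal factor once for every listed epoch that has been reached. The helper `annealed_lr` is hypothetical and is not taken from `train.py`, whose exact rule may differ.

```python
def annealed_lr(base_lr, epoch, anneal_steps, anneal_factor):
    """Return the learning rate for `epoch` under a stepwise decay.

    Assumption (not confirmed by train.py): the rate is scaled by
    `anneal_factor` once per milestone in `anneal_steps` that the
    current epoch has reached.
    """
    milestones_passed = sum(1 for step in anneal_steps if epoch >= step)
    return base_lr * anneal_factor ** milestones_passed


if __name__ == "__main__":
    # Values mirror the flags used in the training commands.
    for epoch in (0, 499, 500, 1000, 1500):
        lr = annealed_lr(1e-3, epoch, (500, 1000, 1500), 0.1)
        print(f"epoch {epoch:4d}: lr = {lr:.1e}")
```

Under this assumption the rate stays at the base value until epoch 500 and drops by a factor of ten at each of the three listed epochs.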
To train a Tacotron2 model with English phonemes that works with the [pretrained wavernn](https://pytorch.org/audio/main/models.html#id10) with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command.

```bash
python train.py \
    --workers 12 \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --sample-rate 22050 \
    --n-fft 2048 \
    --hop-length 275 \
    --win-length 1100 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_phonemes \
    --phonemizer DeepPhonemizer \
    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
    --cmudict-root ./ \
    --logging-dir ./english_phonemes_wavernn_logs \
    --checkpoint-path ./english_phonemes_wavernn_ckpt.pth \
    --dataset-path ./
```

## Text-to-speech pipeline

Here we present an example of how to use Tacotron2 to generate audio from text.
The text-to-speech pipeline goes as follows:

1. text preprocessing: encode the text into a list of symbols (the symbols can represent characters, phonemes, etc.)
2. spectrogram generation: after retrieving the list of symbols, we feed this list to a Tacotron2 model, which outputs the mel spectrogram.
3. time-domain conversion: once the mel spectrogram is generated, we convert it into audio with a vocoder.

This script currently supports three vocoders: [WaveRNN](https://pytorch.org/audio/stable/models/wavernn.html), [Griffin-Lim](https://pytorch.org/audio/stable/transforms.html#griffinlim), and [Nvidia's WaveGlow](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/).

The spectrogram parameters, including `n-fft`, `mel-fmin`, and `mel-fmax`, should be set to the values used during the training of Tacotron2.

#### Pretrained WaveRNN as the Vocoder

The following command will generate a waveform to `./outputs.wav` with the text "Hello world!" using WaveRNN as the vocoder.
```bash
python inference.py --checkpoint-path ${model_path} \
    --vocoder wavernn \
    --n-fft 2048 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --input-text "Hello world!" \
    --text-preprocessor english_characters \
    --output-path "./outputs.wav"
```

If you want to generate a waveform from a different text with phonemes as the input to Tacotron2, use `--text-preprocessor english_phonemes`. The following is an example. (Remember to install [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) and download its pretrained weights.)

```bash
python inference.py --checkpoint-path ${model_path} \
    --vocoder wavernn \
    --n-fft 2048 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --input-text "Hello world!" \
    --text-preprocessor english_phonemes \
    --phonemizer DeepPhonemizer \
    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
    --cmudict-root ./ \
    --output-path "./outputs.wav"
```

To use torchaudio pretrained models, please see the following example command.
For Tacotron2, we use the checkpoint named `"tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech"`, and for WaveRNN, we use the checkpoint named `"wavernn_10k_epochs_8bits_ljspeech"`.
See https://pytorch.org/audio/stable/models.html for more checkpoint options for Tacotron2 and WaveRNN.

```bash
python inference.py \
    --checkpoint-path tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech \
    --wavernn-checkpoint-path wavernn_10k_epochs_8bits_ljspeech \
    --vocoder wavernn \
    --n-fft 2048 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --input-text "Hello world!" \
    --text-preprocessor english_phonemes \
    --phonemizer DeepPhonemizer \
    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
    --cmudict-root ./ \
    --output-path "./outputs.wav"
```

#### Griffin-Lim's algorithm as the Vocoder

The following command will generate a waveform to `./outputs.wav` with the text "Hello world!" using Griffin-Lim's algorithm as the vocoder.
```bash
python inference.py --checkpoint-path ${model_path} \
    --vocoder griffin_lim \
    --n-fft 1024 \
    --mel-fmin 0 \
    --mel-fmax 8000 \
    --input-text "Hello world!" \
    --text-preprocessor english_char
```
