# Voice100
Voice100 includes neural TTS/ASR models. Inference with Voice100
is inexpensive because its models are tiny and depend only on CNNs,
without recurrence.
## Objectives
- Don't depend on non-commercially licensed datasets.
- Be small enough to run on ordinary PCs, Raspberry Pi and smartphones.
## Sample synthesis
- [Sample synthesis 1](docs/sample-en-1.wav)
beginnings are apt to be determinative and when reinforced by continuous applications of similar influence
- [Sample synthesis 2](docs/sample-en-2.wav)
which had restored the courage of noirtier for ever since he had conversed with the priest his violent
despair had yielded to a calm resignation which surprised all who knew his excessive affection
- [Sample synthesis 1](docs/sample-ja-1.wav)
また、東寺のように五大明王と呼ばれる主要な明王の中央に配されることも多い。
- [Sample synthesis 2](docs/sample-ja-2.wav)
ニューイングランド風は牛乳をベースとした白いクリームスープでありボストンクラムチャウダーとも呼ばれる
## Architecture
### TTS
The TTS model is divided into two sub-models, an align model and an audio model.
The align model predicts text alignments given a text. An aligned text
is generated from the text and the text alignments. The audio model predicts
[WORLD](https://github.com/mmorise/World)
features (F0, spectral envelope, coded aperiodicity) given
the aligned text.
![TTS](./docs/tts.png)
#### TTS align model
```
| Name | Type | Params
-----------------------------------------
0 | embedding | Embedding | 3.7 K
1 | layers | Sequential | 614 K
-----------------------------------------
618 K Trainable params
0 Non-trainable params
618 K Total params
1.237 Total estimated model params size (MB)
```
#### TTS audio model
```
| Name | Type | Params
-------------------------------------------
0 | embedding | Embedding | 14.8 K
1 | decoder | VoiceDecoder | 11.0 M
2 | norm | WORLDNorm | 518
3 | criteria | WORLDLoss | 0
-------------------------------------------
11.1 M Trainable params
518 Non-trainable params
11.1 M Total params
22.120 Total estimated model params size (MB)
```
#### Align model pre-processing
The input of the align model is the sequence of tokens of the input text.
The input text is lowercased, tokenized into characters and encoded
by the text encoder. The text encoder has 28 characters in its
vocabulary: the lowercase letters, a space and an apostrophe.
All characters that are not found in the vocabulary are removed.
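
As an illustration of the steps above, here is a minimal sketch of such a text encoder. The names `VOCAB`, `encode_text` and `decode_ids` and the order of the token IDs are assumptions made for this example; they are not taken from the Voice100 source.

```python
import string
from typing import List

# Hypothetical vocabulary layout: only the size (28) and the contents
# (lowercase letters, space, apostrophe) come from the description above.
VOCAB = " '" + string.ascii_lowercase
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def encode_text(text: str) -> List[int]:
    """Lowercase the text, drop out-of-vocabulary characters, map to IDs."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


def decode_ids(ids: List[int]) -> str:
    """Map token IDs back to characters (useful for checking the encoder)."""
    return "".join(VOCAB[i] for i in ids)


# Punctuation outside the vocabulary is removed, the apostrophe is kept.
cleaned = decode_ids(encode_text("Hello, World! Don't"))
```

Here `cleaned` holds the text as the model would see it, with the comma and exclamation mark removed.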
#### Align model post-processing
The output of the align model is a sequence of pairs of timings whose
length is the same as the number of input tokens. Each pair has two values:
the number of frames before the token and the number of frames for the token.
One frame is 20 ms. An aligned text is generated from the input text and
the pairs of timings. The length of the aligned text is the total number of
frames of the audio.
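
The expansion described above could be sketched as follows. This is only an illustration: the gap symbol `_`, the function name `build_aligned_text`, and the choice to repeat a token for each of its frames are assumptions, not the representation Voice100 necessarily uses.

```python
from typing import List, Tuple

FRAME_MS = 20  # frame length stated above
GAP = "_"      # hypothetical placeholder for frames that precede a token


def build_aligned_text(text: str, timings: List[Tuple[int, int]]) -> str:
    """Expand each token to one character per frame.

    Each timing pair is (frames before the token, frames for the token).
    """
    if len(text) != len(timings):
        raise ValueError("need exactly one timing pair per input token")
    pieces = []
    for token, (frames_before, frames_for) in zip(text, timings):
        pieces.append(GAP * frames_before)  # frames before the token
        pieces.append(token * frames_for)   # frames covered by the token
    return "".join(pieces)


aligned = build_aligned_text("hi", [(2, 3), (1, 2)])
duration_ms = len(aligned) * FRAME_MS  # one character per 20 ms frame
```

With these timings the aligned text has 2 + 3 + 1 + 2 = 8 frames, which corresponds to 160 ms of audio.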
#### Audio model pre-processing
The input of the audio model is the encoded aligned text. It is
encoded in the same way as in the align model pre-processing, except that
the vocabulary has one additional token that marks the spacing between
tokens of the original text.
#### Audio model post-processing
The output of the audio model is a sequence of F0, F0 existence,
log spectral envelope and coded aperiodicity values.
An F0 existence is a boolean value that is true when F0 is available and
false otherwise. F0 is forced to 0 when the F0 existence is false.
One frame is 10 ms, so the output is twice as long as
the input.
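
A minimal sketch of the F0 masking and the frame-count relation described above is shown below. It uses plain Python lists for clarity; the function names `mask_f0` and `output_frames` are hypothetical, and the 20 ms input frame length is taken from the align model section.

```python
from typing import List

INPUT_FRAME_MS = 20   # aligned-text frame length (align model section)
OUTPUT_FRAME_MS = 10  # audio feature frame length stated above


def mask_f0(f0: List[float], f0_exists: List[bool]) -> List[float]:
    """Force F0 to 0 wherever the F0 existence flag is false."""
    if len(f0) != len(f0_exists):
        raise ValueError("f0 and f0_exists must have the same length")
    return [value if exists else 0.0 for value, exists in zip(f0, f0_exists)]


def output_frames(input_frames: int) -> int:
    """Number of output frames that cover the same duration as the input."""
    return input_frames * INPUT_FRAME_MS // OUTPUT_FRAME_MS


masked = mask_f0([120.0, 95.5, 130.2], [True, False, True])
frames = output_frames(8)  # 8 frames of 20 ms cover 16 frames of 10 ms
```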
### ASR
The ASR model is a 9-layer MobileNet-like inverted residual network that is
trained with
[CTC loss](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html).
![ASR](./docs/asr.png)
```
| Name | Type | Params
----------------------------------------------------------------
0 | encoder | ConvVoiceEncoder | 11.6 M
1 | decoder | LinearCharDecoder | 14.9 K
2 | loss_fn | CTCLoss | 0
3 | batch_augment | BatchSpectrogramAugumentation | 0
----------------------------------------------------------------
11.6 M Trainable params
0 Non-trainable params
11.6 M Total params
23.243 Total estimated model params size (MB)
```
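
This README does not describe how the ASR output is decoded. As an illustration only, a common way to read the per-frame predictions of a CTC-trained model is greedy decoding: take the most likely token for each frame, merge consecutive repeats, then drop blanks. The sketch below assumes the per-frame token IDs are already available and that the blank ID is 0; both are assumptions, not details confirmed by Voice100.

```python
from typing import List, Optional

BLANK_ID = 0  # assumption: matches the default blank index of torch.nn.CTCLoss


def ctc_greedy_decode(frame_ids: List[int], blank_id: int = BLANK_ID) -> List[int]:
    """Collapse consecutive repeated IDs, then remove blanks."""
    decoded: List[int] = []
    previous: Optional[int] = None
    for token_id in frame_ids:
        # Emit a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded


# A blank between two runs of the same ID keeps them as separate tokens.
tokens = ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0])
```

Here the two runs of ID 3 are kept as two tokens because a blank separates them, while the repeated 5 collapses to one.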
### Align model
The align model is a 2-layer bidirectional LSTM that is trained to predict
aligned texts from MFCC audio features. It is used to
prepare aligned texts for the dataset used to train the TTS models.
```
| Name | Type | Params
----------------------------------------------------------------
0 | conv | Conv1d | 24.7 K
1 | lstm | LSTM | 659 K
2 | dense | Linear | 7.5 K
3 | loss_fn | CTCLoss | 0
4 | batch_augment | BatchSpectrogramAugumentation | 0
----------------------------------------------------------------
691 K Trainable params
0 Non-trainable params
691 K Total params
1.383 Total estimated model params size (MB)
```
## Training
### Align model with LJ Speech Corpus
Train the align model with the
[LJ Speech Corpus](https://keithito.com/LJ-Speech-Dataset/).
```sh
MODEL=align_en_lstm_base_ctc
DATASET=ljspeech
LANGUAGE=en
cd data
curl -O https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xfj LJSpeech-1.1.tar.bz2
cd ..
voice100-train-align \
--gpus 1 \
--precision 16 \
--batch_size 256 \
--max_epochs 100 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--default_root_dir=model/${MODEL}
```
### Align text with align model
This generates the aligned text as `data/align-${DATASET}.txt`.
```sh
CHECKPOINT=align_en_lstm_base_ctc.ckpt
voice100-align-text \
--batch_size 4 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--checkpoint model/${CHECKPOINT}
```
### Train TTS align model
```sh
MODEL=ttsalign_en_conv_base
voice100-train-ttsalign \
--gpus 1 \
--batch_size 256 \
--precision 16 \
--max_epochs 100 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--default_root_dir=model/${MODEL}
```
### Compute audio statistics
This generates the statistics as `data/stat-${DATASET}.pt`.
```sh
voice100-calc-stat \
--dataset ${DATASET} \
--language ${LANGUAGE}
```
### Train TTS audio model
```sh
MODEL=ttsaudio_en_conv_base
voice100-train-ttsaudio \
--gpus 1 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--batch_size 32 \
--precision 16 \
--max_epochs 150 \
--default_root_dir ./model/${MODEL}
```
### Train ASR model
```sh
DATASET=librispeech
LANGUAGE=en
MODEL=stt_en_conv_base_ctc
voice100-train-asr \
--gpus 1 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--batch_size 32 \
--precision 16 \
--max_epochs 100 \
--default_root_dir ./model/${MODEL}
```
## Exporting to ONNX
```sh
voice100-export-onnx \
--model ttsaudio \
--checkpoint model/${MODEL}/lightning_logs/version_0/checkpoints/last.ckpt \
--output model/onnx/${MODEL}.onnx
```
## Inference
Use [Voice100 runtime](https://github.com/kaiidams/voice100-runtime) and exported ONNX files.
## Pretrained models
- [English align](https://github.com/kaiidams/voice100/releases/download/v0.7/align_en_lstm_base_ctc-20210628.ckpt)
- [Japanese align](https://github.com/kaiidams/voice100/releases/download/v0.7/align_ja_lstm_base_ctc-20211116.ckpt)
- [English TTS align](https://github.com/kaiidams/voice100/releases/download/v0.7/ttsalign_en_conv_base-20210808.ckpt)
- [Japanese TTS align](https://github.com/kaiidams/voice100/releases/download/v0.7/ttsalign_ja_conv_base-20211118.ckpt)
- [English TTS audio](https://github.com/kaiidams/voice100/releases/download/v1.0.1/ttsaudio_en_conv_base-20220107.ckpt)
- [Japanese TTS audio](https://github.