# datasets-CMU_Wilderness
CMU Wilderness Multilingual Speech Dataset
A dataset covering over 700 different languages, providing audio, aligned
text, and word pronunciations. On average, each language provides around
20 hours of sentence-length transcriptions. The data is mined from
readings of the New Testament at http://www.bible.is/
List of languages, with relative alignment-accuracy scores:
http://festvox.org/cmu_wilderness/

Map of geopositioned languages:
http://festvox.org/cmu_wilderness/map.html
# Language List
The file LangList.txt lists all processed languages, with features given
as space-separated fields:

1. LANGID: six-letter language id from bible.is
2. TLC: three-letter language code (ISO 639-3)
3. WIKI: Wikipedia link to the language description
4. START: start URL at bible.is
5. LAT: geolocated latitude
6. LONG: geolocated longitude
7. #utt0: number of utterances found in Pass 0 (cross-lingual alignment)
8. MCD0: Mel Cepstral Distortion score for Pass 0 (smaller is better)
9. #utt1: number of utterances found in Pass 1 (in-language alignment)
10. MCD1: Mel Cepstral Distortion score for Pass 1 (smaller is better)
11. Dur: HH:MM:SS duration of the aligned data (from Pass 1)
12. MCDB: Mel Cepstral Distortion score for the base CG synthesizer
13. MCDR: Mel Cepstral Distortion score for the Random Forest CG synthesizer
14+. NAME: text name of the language (may span multiple fields)
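The fields above can be pulled apart with a few lines of Python. This is
only a sketch, not part of the distribution, and the sample record at the
bottom is made up for illustration rather than taken from LangList.txt:

```python
# Minimal sketch: parse one space-separated LangList.txt record into a dict.
def parse_langlist_line(line):
    """Split a LangList.txt record; fields 14 onward form the language name."""
    f = line.split()
    return {
        "LANGID": f[0],            # six-letter bible.is id
        "TLC": f[1],               # ISO 639-3 code
        "WIKI": f[2],
        "START": f[3],
        "LAT": float(f[4]),
        "LONG": float(f[5]),
        "utt0": int(f[6]),
        "MCD0": float(f[7]),
        "utt1": int(f[8]),
        "MCD1": float(f[9]),
        "Dur": f[10],              # HH:MM:SS
        "MCDB": float(f[11]),
        "MCDR": float(f[12]),
        "NAME": " ".join(f[13:]),  # name may span several fields
    }

# Illustrative (fabricated) record, not a real LangList.txt entry:
sample = ("ABCDEF abc https://en.wikipedia.org/wiki/Example "
          "http://listen.bible.is/ABCDEF/Matt/1/D 1.0 2.0 "
          "900 6.5 850 5.9 20:00:00 5.5 5.1 Example Language")
record = parse_langlist_line(sample)
```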
# Prerequisites
Ubuntu (and related) prerequisites:

```sh
sudo apt-get install git build-essential libncurses5-dev sox wget
sudo apt-get install csh ffmpeg html2text
```
Note that the ffmpeg package is sometimes called avconv (you need
to update bin/do_found accordingly if you only have avconv and not
ffmpeg).
# Clone the repository
```sh
git clone https://github.com/festvox/datasets-CMU_Wilderness
cd datasets-CMU_Wilderness
```
# Make Dependencies
Builds the FestVox voice-building tools in build/ and sets up the
environment variable settings in festvox_env_settings:

```sh
./bin/do_found make_dependencies
```
# Create Alignments For A Language
Because we cannot redistribute the audio from bible.is, you must
download that data directly, then build the alignments using the
indices we distribute.
Alignments (short waveforms plus transcripts) may be recreated for
a language from the packed versions in the indices/ directory. You
need to know the six letter code for the language (see LangList for
mappings). In this example we use NANTTV (Hokkien) to illustrate the
commands, but you should substitute the code for your desired language.
```sh
nohup ./bin/do_found fast_make_align indices/NANTTV.tar.gz &
```
This will unpack the indices into the NANTTV directory, download the
data from bible.is (unless it is already in downloads/NANTTV/download/),
then reconstruct the aligned data in NANTTV/aligned/wav/ and
NANTTV/aligned/etc/. This process takes around 30 minutes, depending on
your internet connection and the speed of your machine.
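Once fast_make_align finishes, a quick way to check the result is to count
the recreated utterances and total their duration. This is a sketch under
the assumption that the files in aligned/wav/ are standard RIFF PCM wavs
readable by Python's built-in wave module:

```python
import glob
import wave

def aligned_duration(lang_dir):
    """Count utterance wavs under lang_dir/aligned/wav/ and sum their length."""
    total_s, n = 0.0, 0
    for path in sorted(glob.glob(f"{lang_dir}/aligned/wav/*.wav")):
        with wave.open(path, "rb") as w:
            # frames / sample-rate gives the duration of one utterance
            total_s += w.getnframes() / float(w.getframerate())
        n += 1
    return n, total_s
```

For example, `aligned_duration("NANTTV")` should report roughly the number
of utterances and total hours listed for the language in LangList.txt.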
# Create Text To Speech Model
Given the alignments in aligned/, you can build a speech synthesizer
for Festival (and Flite) as follows:

```sh
cd NANTTV
nohup ../bin/do_found make_tts &
../bin/do_found get_voices
```
This will build a Random Forest Clustergen synthesis model for Festival
and Flite in NANTTV/voices/. It will take at least 48 hours on a 12-core
machine.
# Create Speech To Text Model
You can use the waveforms in NANTTV/aligned/wav/ and the transcriptions in
NANTTV/aligned/etc/transcription.txt. The file NANTTV/aligned/etc/txt.done.data
also has an alignment score (lower is better) for each utterance. If you want
a pronunciation lexicon and a transcription without punctuation, execute:
```sh
cd NANTTV
nohup ../bin/do_found make_asr &
```
This does not (yet) build a model, but produces a punctuation-free
transcription file in NANTTV/aligned/etc/transcription_nopunct.txt and a
pronunciation lexicon in NANTTV/aligned/etc/pronunciation_lex.
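For a rough idea of what punctuation stripping involves, here is a
Unicode-category-based sketch. This is not the actual do_found logic
(which may differ, e.g. in how it treats hyphens or symbols in
romanized scripts); it is only an approximation:

```python
import unicodedata

def strip_punct(text):
    # Drop characters whose Unicode category starts with P (punctuation)
    # or S (symbols), then re-normalize whitespace. Category-based
    # filtering also works for non-Latin scripts.
    kept = "".join(c for c in text
                   if unicodedata.category(c)[0] not in ("P", "S"))
    return " ".join(kept.split())

print(strip_punct("Hello, world!"))  # -> Hello world
```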
# Creating New Alignments
You can run the full alignment process yourself if you wish. Our
alignments can certainly be improved on, with better acoustic models,
pronunciations, etc. If you are interested in re-aligning, you can do so
with the command:

```sh
nohup ./bin/do_found full_make_align http://listen.bible.is/NANTTV/Matt/1/D &
```
This may take around 7 days on a 12-core machine. It needs about 150GB of
disk space (which can be reduced to about 20GB by running ../bin/do_found
tidy_up at the end). The alignments themselves are usually around 2GB.
Aligning all 700 languages would take around 13 years on a single machine.
# Creating phone level alignments for all utterances
Given the sentence-level alignments generated by fast_make_align or
full_make_align, you can generate phone-level alignments for every
utterance with:

```sh
cd NANTTV
nohup ../bin/do_found make_phone_alignments &
```
This may take several hours to run the HMM aligner. The generated
phone alignments are in v_ph_aligns/lab/\*.lab in Xlabel ASCII format.
Full Festival utterances, with all alignments and links to syllables,
words, etc., are available in v_ph_aligns/festival/utts/\*.utt.
See http://www.festvox.org/bsv/x1902.html for an example of dumping
features and model building.
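Reading the resulting .lab files is straightforward. The sketch below
assumes the common Xlabel layout: an optional header terminated by a line
containing only `#`, then one `end_time field label` line per segment,
with each segment starting where the previous one ended:

```python
def read_xlabel(path):
    """Parse an Xlabel ASCII .lab file into (start, end, label) tuples."""
    segs, prev_end, in_header = [], 0.0, True
    with open(path) as f:
        for line in f:
            line = line.strip()
            if in_header:
                # Header (if any) ends at a line holding only "#".
                if line == "#":
                    in_header = False
                continue
            if not line:
                continue
            parts = line.split()
            end = float(parts[0])    # first field: segment end time (seconds)
            label = parts[-1]        # last field: the phone label
            segs.append((prev_end, end, label))
            prev_end = end
    return segs
```

From these tuples you can compute phone durations or slice the utterance
waveform into per-phone regions for model building.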
# Citations
For more details, see:

Alan W Black, "CMU Wilderness Multilingual Speech Dataset", ICASSP 2019,
Brighton, UK.
# Acknowledgments
This dataset was prepared by Alan W Black (awb@cs.cmu.edu) with substantial
help from a large number of CMU students. We would also like to thank
various members of the CMU community, especially Florian Metze, for access
to CPU resources to help calculate the alignments. This work was in part
funded by the DARPA LORELEI program.