Real-TimeVoiceCloning dataset annotation data (Real-Time-Voice-Cloning model training resources)

5 files in total, all txt

Uploaded 2023-10-02 14:50:12 | 232KB ZIP

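Real-time voice cloning is a technique that reproduces a person's voice and generates new speech in that voice in real time. It has a wide range of applications in artificial intelligence and speech processing, such as speech synthesis, personalized virtual assistants, game audio, and educational content production. This resource packages the annotation data of the dataset used to train Real-Time-Voice-Cloning models. Note that the archive contains only the five text files listed in the archive directory further down (the LibriSpeech metadata), not the audio samples themselves; the annotations are what let researchers and developers organize the audio when training and testing their models.

Components of the dataset:

1. **CHAPTERS.TXT**: information about the individual audio chapters in the dataset. According to the bundled README, it records the per-chapter audio durations. This is useful for understanding how the dataset is organized, especially for structured analysis or training.
2. **SPEAKERS.TXT**: lists the speakers in the dataset. This appears to be the file the bundled README calls READERS.TXT, which it describes as holding each speaker's gender and total amount of audio in the corpus. This information matters when characterizing individual speakers and building multi-speaker voice models.
3. **BOOKS.TXT**: because the dataset is based on audiobooks, this file lists the books whose text is used. According to the bundled README, it holds each book's title and its Project Gutenberg ID. This helps researchers gauge the diversity and coverage of the data.
4. **README.TXT**: the standard documentation file. Here it describes the purpose of the corpus, its subsets, and its directory layout. Reading it is essential for understanding and using the dataset correctly.
5. **LICENSE.TXT**: defines the license terms for the dataset, that is, how users may use, distribute, and modify the data. Complying with the license is a precondition for using the dataset legally.

When developing a real-time voice cloning system, this dataset can help build an acoustic model that captures the distinctive characteristics of different speakers and reproduces a voice in real time. Typically this is done with deep learning methods, such as autoencoders or neural synthesis models like Tacotron and WaveNet, trained end to end to learn the complex patterns of speech. To make the generated speech more natural and keep it real-time, a spectrogram-to-waveform conversion step is also involved, for example the Griffin-Lim algorithm or WaveNet.

In research, the annotation data can be used to evaluate model performance, for example by comparing the similarity of original and cloned audio, or by checking whether the model adapts correctly to new speakers. Iterating on and optimizing the model improves cloning quality and brings the output closer to a real human voice. In practical applications, compliance with privacy and ethics requirements is equally important: voice data can carry personally identifying information and must be handled with care.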


LibriSpeech.zip (5 files)

LICENSE.TXT 193B

CHAPTERS.TXT 655KB

BOOKS.TXT 113KB

SPEAKERS.TXT 122KB

README.TXT 8KB

1. General information
======================

LibriSpeech is a corpus of read speech, based on LibriVox's public domain audio books. Its purpose is to enable the training and testing of automatic speech recognition (ASR) systems.

2. Structure
============

The corpus is split into several parts to enable users to selectively download subsets of it, according to their needs. The subsets with "clean" in their name are supposedly "cleaner" (at least on average) than the rest of the audio, and US English accented. That classification was obtained using very crude automated means, and should not be considered completely reliable. The subsets are disjoint, i.e. the audio of each speaker is assigned to exactly one subset.

The parts of the corpus are as follows:

* dev-clean, test-clean - development and test set containing "clean" speech.
* train-clean-100 - training set, of approximately 100 hours of "clean" speech
* train-clean-360 - training set, of approximately 360 hours of "clean" speech
* dev-other, test-other - development and test set, with speech which was automatically selected to be more "challenging" to recognize
* train-other-500 - training set of approximately 500 hours containing speech that was not classified as "clean", for some (possibly wrong) reason
* intro - subset containing only the LibriVox's intro disclaimers for some of the readers.
* mp3 - the original MP3-encoded audio on which the corpus is based
* texts - the original Project Gutenberg texts on which the reference transcripts for the utterances in the corpus are based.
* raw_metadata - SQLite databases which record various pieces of information about the source text/audio materials used, and the alignment process. (mostly for completeness - probably not very interesting or useful)

2.1 Organization of the training and test subsets
-------------------------------------------------

When extracted, each of the {dev,test,train} sets re-creates LibriSpeech's root directory, containing some metadata, and a dedicated subdirectory for the subset itself. The audio for each individual speaker is stored under a dedicated subdirectory in the subset's directory, and each audio chapter read by this speaker is stored in a separate subsubdirectory. The following ASCII diagram depicts the directory structure:

<corpus root>
    |
    .- README.TXT
    |
    .- READERS.TXT
    |
    .- CHAPTERS.TXT
    |
    .- BOOKS.TXT
    |
    .- train-clean-100/
                   |
                   .- 19/
                       |
                       .- 198/
                       |    |
                       |    .- 19-198.trans.txt
                       |    |
                       |    .- 19-198-0001.flac
                       |    |
                       |    .- 19-198-0002.flac
                       |    |
                       |    ...
                       |
                       .- 227/
                            | ...

, where 19 is the ID of the reader, and 198 and 227 are the IDs of the chapters read by this speaker. The *.trans.txt files contain the transcripts for each of the utterances, derived from the respective chapter, and the FLAC files contain the audio itself.

The main metainfo about the speech is listed in the READERS and the CHAPTERS:

- READERS.TXT contains information about speaker's gender and total amount of audio in the corpus.
- CHAPTERS.TXT has information about the per-chapter audio durations.

The file BOOKS.TXT contains the title for each book whose text is used in the corpus, and its Project Gutenberg ID.

2.2 Organization of the "intro-disclaimers" subset
--------------------------------------------------

This part of the data contains simply the LibriVox's intro disclaimers that were successfully extracted, using a slight modification of the alignment algorithms used to derive the training and test sets. The standard LibriVox disclaimer is:

"This is a LibriVox recording. All LibriVox recordings are in the public domain. For more information, or to volunteer, please visit: librivox DOT org"

As is the case for the training and test sets, there is one subdirectory for each reader, and a subsubdirectory for each of the chapters read by this speaker for which the announcement was successfully extracted.

2.3 Organization of the "original-mp3" subset
---------------------------------------------

This part contains the original MP3-compressed recordings as downloaded from the Internet Archive. It is intended to serve as a secure reference "snapshot" for the original audio chapters, but also to preserve (most of) the information both about audio selected for the corpus, and audio that was discarded. I decided to try to make the corpus relatively balanced in terms of per-speaker durations, so part of the audio available for some of the speakers was discarded. Also for the speakers in the test sets, only up to 10 minutes of audio is used, to introduce more speaker diversity during evaluation time. There should be enough information in the "mp3" subset to enable the re-cutting of an extended "LibriSpeech+" corpus, containing around 150 extra hours of speech, if needed.

The directory hierarchy follows the already familiar pattern. In each speaker directory there is a file named "utterance_map" which lists, for each of the utterances in the corpus, the original "raw" aligned utterance. In the "header" of that file there are also two lines that show whether the sentence-aware segmentation was used in the LibriSpeech corpus (i.e. whether the reader is assigned to a test set) and the maximum allowed duration for the set to which this speaker was assigned.

Then in the chapter directory, besides the original audio chapter .mp3 file, there are two sets of ".seg.txt" and ".trans.txt" files. The former contains the time range (in seconds) for each of the original (that I called "raw" above) utterances. The latter contains the respective transcriptions. There are two sets for the two possible segmentations of each chapter. The ".sents" segmentation is "sentence-aware", that is, we only split on silence intervals coinciding with (automatically obtained) sentence boundaries in the text. The other segmentation was derived by allowing splitting on every silence interval longer than 300 ms, which leads to better utilization of the aligned audio.

2.4 Organization of the "text" subset
-------------------------------------

This part just contains one subdirectory, with name equal to the ID of the text in Project Gutenberg's database, for each book. The books are also separated in directories by their encoding, which can be either ASCII or UTF-8. The sole purpose of this subset is to be a permanent snapshot of the original text used for LibriSpeech's construction.

2.5 Organization of the "raw-metadata" part
-------------------------------------------

Contains just a few SQLite databases. Some of the more important bits of information from these tables are described in the README file within the "raw_metadata" subdirectory.

Acknowledgments
===============

First and foremost, I would like to thank the thousands of Project Gutenberg and LibriVox volunteers, without whose contributions the LibriSpeech corpus would not have existed. The successful completion of this project would have been much more difficult, and the quality of the finished corpus much worse, if it wasn't for the generous support and the much helpful advice provided by Daniel Povey - thanks, Dan! I would also like to express my gratitude to Tony Robinson, for the very interesting and useful discussions on the long audio alignment problem that we had some time ago. Thanks also to Guoguo Chen and Sanjeev Khudanpur, with whom we are collaborating on a (yet-to-be-published) paper on the corpus, and who helped to improve the LibriSpeech's example scripts in Kaldi.
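The directory layout from section 2.1 of the README (subset / reader ID / chapter ID / utterance files) can be walked with a few lines of Python. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, and it assumes that each line of a `*.trans.txt` file starts with the utterance ID (for example `19-198-0001`) followed by a space and the transcript text, which the README does not spell out.

```python
from pathlib import Path


def utterance_path(root, subset, utt_id):
    """Map an utterance ID such as '19-198-0001' to its FLAC path.

    The ID is read as <reader>-<chapter>-<index>, following the
    README's diagram (reader 19, chapter 198).
    """
    reader, chapter, _ = utt_id.split("-")
    return Path(root) / subset / reader / chapter / f"{utt_id}.flac"


def parse_transcript_line(line):
    """Split one transcript line into (utterance ID, text).

    Assumption: the ID is the first space-delimited token on the line.
    """
    utt_id, _, text = line.strip().partition(" ")
    return utt_id, text


def iter_utterances(root, subset):
    """Yield (flac_path, text) for every transcript entry in a subset."""
    subset_dir = Path(root) / subset
    # One transcript file per <reader>/<chapter> directory.
    for trans in sorted(subset_dir.glob("*/*/*.trans.txt")):
        with trans.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    utt_id, text = parse_transcript_line(line)
                    yield trans.parent / f"{utt_id}.flac", text
```

For example, on a POSIX system `utterance_path("LibriSpeech", "train-clean-100", "19-198-0001")` gives `LibriSpeech/train-clean-100/19/198/19-198-0001.flac`. Since the archive described on this page contains only the metadata files, `iter_utterances` is only useful once one of the audio subsets has been downloaded and extracted under the same root.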
