Unsupervised Language Filtering using the Latent Dirichlet Allocation
Wei Zhang¹, Robert A. J. Clark², Yongyuan Wang¹
¹Ocean University of China, Qingdao, China, 266100
²The CSTR, University of Edinburgh, United Kingdom, EH8 9AB
weizhang@ouc.edu.cn, robert@cstr.ed.ac.uk
Abstract
To automatically build from scratch the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem.
We propose an unsupervised language identification approach based on Latent Dirichlet Allocation, where we take the raw n-gram counts as features without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. We show that such a model is highly capable of identifying the primary language in a corpus and filtering out the other languages present.
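As an informal illustration of the approach sketched above (not the authors' implementation), the following Python fragment treats each sentence as a document, uses raw character n-gram counts as features, and fits an LDA model whose latent topics are interpreted as languages; the lda package trains by collapsed Gibbs sampling, matching the training scheme described here. The n-gram order, topic count and toy sentences are illustrative assumptions.

    # Minimal sketch, assuming the `lda` package (collapsed Gibbs sampling)
    # and scikit-learn; n-gram order, topic count and sentences are placeholders.
    import numpy as np
    import lda                                   # pip install lda
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "this is a sentence in the primary language",
        "ceci est une phrase dans une autre langue",   # toy mixed-language data
    ]

    # Raw character n-gram counts: no smoothing, pruning or interpolation.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
    X = vectorizer.fit_transform(sentences).astype(np.int64)

    # Two latent "topics", interpreted as languages, trained by collapsed Gibbs sampling.
    model = lda.LDA(n_topics=2, n_iter=1000, random_state=0)
    model.fit(X)

    # Each sentence is assigned its dominant topic; sentences whose dominant
    # topic is the corpus's primary language are kept, the rest filtered out.
    dominant = model.doc_topic_.argmax(axis=1)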
Index Terms: Language Filtering, Language Purification, Language Identification
1. Introduction
This paper concerns Language Identification in the context of ‘purifying’ a text corpus: determining which sentences are in the primary language of the corpus and contain no foreign words or phrases. This is a requirement for building the language processing front-end of a speech synthesis system entirely automatically in a new language where linguistic resources other than the text are unavailable.
Language identification is usually viewed as a form of text categorization. Several kinds of classification approaches have been used to identify the language of documents: Markov Models combined with Bayesian classification [1], Discrete Hidden Markov Models [2], Kullback-Leibler divergence (relative entropy) [3], minimum cross-entropy [4], decision trees [5], neural networks [6], support vector machines [7], multiple linear regression [8], centroid-based classification [9] and improvements to the previous method [10]. Other work includes conditional random fields [11] and minimum description length with dynamic programming [12].
These methods are all supervised and require clean, editorially managed corpora for training. They are appropriate only for a limited number of languages, and require relatively large documents. [13] demonstrate that “the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents”.
There have been some attempts to solve the problem of annotating training corpora. [14], in their multi-lingual speech synthesis work, used multilayer identification at the phoneme, word and sentence levels, together with a combination of morphological and syntactic analysis. This kind of domain-specific, sophisticated design is difficult to extend to the general situation. [15], following the Web as Corpus methodology [16], collect and annotate very large-scale multilingual corpora, then train their LangID.py tool using domain adaptation to provide an off-the-shelf tool for general language identification. Its accuracy is still affected if the style of the documents to be identified is inconsistent with the training corpus. To address this issue, [17] studied language identification of eBay and Twitter postings; they utilize the initial and final words of postings and the corresponding site information for the initial annotation, and then bootstrap a supervised learning approach to achieve positive results. [18], for Twitter and Facebook posting language identification, also uses a bootstrap method where a supervised model trained on a Wikipedia corpus is tuned by fusing it with the tweets' location feature.
These approaches demonstrate the need for high-level annotation accompanying the documents to be identified. [17] and [18] derive that annotation from observed contextual hints associated with the postings. Essentially, these are still supervised methods, and problems will still arise when the text to be identified includes languages that are not in the training corpus.
For our requirement to purify a text where we have little linguistic knowledge of the language or languages present, this poses a problem and raises the key question: can this annotation for identifying language be generated automatically, and can unsupervised methods be used to identify language, or at least classify a text into the different languages present?
In our own work we are attempting to fully automatically build the front-end language-processing component of a speech synthesis system. This component is required to take text in a given language, of which we have little or no linguistic knowledge, and produce a linguistic representation of the sounds and structure required to speak the text. We can achieve this using methods such as vector space models [19], but to do so we require pure data in a single language as input. As we usually deal with low-resourced minority languages, the data we use is often found data, and we have neither the expertise nor the time to manually clean up data sets. A typical scenario is that we wish to create a monolingual corpus from Wikipedia and similar web sites; the data crawled from these web sources is generally a mix of several languages, either due to code-switching within the text of one language itself or due to the text having been partially translated from another language.
The existing supervised or bootstrapped approaches are unsuited to this problem and we require a completely unsupervised
language identification method. [20] demonstrate an approach
using similarity measures, but performance is greatly reduced
when compared to supervised methods. [21] present a promis-