Unsupervised Language Filtering using the Latent Dirichlet Allocation
Wei Zhang¹, Robert A. J. Clark², Yongyuan Wang¹
¹Ocean University of China, Qingdao, China, 266100
²The CSTR, University of Edinburgh, United Kingdom, EH8 9AB
weizhang@ouc.edu.cn, robert@cstr.ed.ac.uk
Abstract
To automatically build from scratch the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem.
We propose an unsupervised language identification approach based on Latent Dirichlet Allocation, where we take the raw n-gram counts as features without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. We show that such a model is highly capable of identifying the primary language in a corpus and filtering out the other languages present.
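As an informal illustration of the approach sketched above (not the authors' implementation), the following Python fragment treats each sentence as a document, uses raw character n-gram counts as features, and fits an LDA model whose latent topics are interpreted as languages; the lda package trains by collapsed Gibbs sampling, matching the training scheme described here. The n-gram order, topic count and toy sentences are illustrative assumptions.

    # Minimal sketch, assuming the `lda` package (collapsed Gibbs sampling)
    # and scikit-learn; n-gram order, topic count and sentences are placeholders.
    import numpy as np
    import lda                                   # pip install lda
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "this is a sentence in the primary language",
        "ceci est une phrase dans une autre langue",   # toy mixed-language data
    ]

    # Raw character n-gram counts: no smoothing, pruning or interpolation.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
    X = vectorizer.fit_transform(sentences).astype(np.int64)

    # Two latent "topics", interpreted as languages, trained by collapsed Gibbs sampling.
    model = lda.LDA(n_topics=2, n_iter=1000, random_state=0)
    model.fit(X)

    # Each sentence is assigned its dominant topic; sentences whose dominant
    # topic is the corpus's primary language are kept, the rest filtered out.
    dominant = model.doc_topic_.argmax(axis=1)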
Index Terms: Language Filtering, Language Purification, Language Identification
1. Introduction
This paper concerns Language Identification in the context of ‘purifying’ a text corpus: determining which sentences are in the primary language of the corpus and contain no foreign words or phrases. This is a requirement for building the language processing front-end of a speech synthesis system entirely automatically in a new language where linguistic resources other than the text are unavailable.
Language identification is usually viewed as a form of text categorization. Several kinds of classification approaches have been used to identify the language of documents: Markov Models combined with Bayesian classification [1], Discrete Hidden Markov Models [2], Kullback-Leibler divergence (relative entropy) [3], minimum cross-entropy [4], decision trees [5], neural networks [6], support vector machines [7], multiple linear regression [8], centroid-based classification [9] and improvements to the previous method [10]. Other work includes conditional random fields [11] and minimum description length with dynamic programming [12].
These methods are all supervised and require clean, editorially managed corpora for training. They are appropriate only for a limited number of languages, and require relatively large documents. [13] demonstrate that “the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents”.
There have been some attempts to solve the problem of annotating training corpora. [14], in their multi-lingual speech synthesis work, used multilayer identification at the phoneme, word and sentence levels, together with a combination of morphological and syntactic analysis. This kind of domain-specific, sophisticated design is difficult to extend to the general situation. [15], following the Web as Corpus methodology [16], collect and annotate very large-scale multilingual corpora, then train their LangID.py tool using domain adaptation to provide an off-the-shelf tool for general language identification. Its accuracy is still affected if the style of the documents to be identified is inconsistent with the training corpus. To address this issue, [17] studied language identification of eBay and Twitter postings; they utilize the initial and final words of postings and the corresponding site information for the initial annotation, and then bootstrap a supervised learning approach to achieve positive results. [18], for Twitter and Facebook posting language identification, also uses a bootstrap method where a supervised model trained on a Wikipedia corpus is tuned by fusing it with the tweets' location feature.
These approaches demonstrate the need for high-level annotation accompanying the documents to be identified. [17] and [18] derive that annotation from observed contextual hints associated with the postings. Essentially, these are still supervised methods, and problems will still arise when the text to be identified includes languages that are not in the training corpus.
For our requirement to purify a text where we have little linguistic knowledge of the language or languages present, this poses a problem and raises the key question: can this annotation for identifying language be generated automatically, and can unsupervised methods be used to identify language, or at least classify a text into the different languages present?
In our own work we are attempting to fully automatically build the front-end language-processing component of a speech synthesis system. This component is required to take text in a given language, of which we have little or no linguistic knowledge, and produce a linguistic representation of the sounds and structure required to speak the text. We can achieve this using methods such as vector space models [19], but to do so we require pure data in a single language as input. As we usually deal with low-resourced minority languages, the data we use is often found data, and we have neither the expertise nor the time to manually clean up data sets. A typical scenario is that we wish to create a monolingual corpus from Wikipedia and similar web sites; the data crawled from these web sources is generally a mix of several languages, either due to code-switching within the text of one language itself or due to the text having been partially translated from another language.
The existing supervised or bootstrapped approaches are unsuited to this problem and we require a completely unsupervised
language identification method. [20] demonstrate an approach
using similarity measures, but performance is greatly reduced
when compared to supervised methods. [21] present a promis-