Unsupervised language identification based on Latent Dirichlet Allocation

W. Zhang et al. / Computer Speech and Language 39 (2016) 47-66

Note that (1) and (3) concern the representation of our data, namely what feature space can express the documents for training and for identifying the language present. The word-level features used by, for example, Grefenstette (1995) and Rehurek and Kolkus (2009) have drawbacks in this respect. The more appropriate alternative is to create a language n-gram model based on characters (Cavnar and Trenkle, 1994) or encoded bytes (Dunning, 1994). With n-grams where n < 3, linguistic phenomena with granularity smaller than words can usually be characterized, a relatively small training corpus can be used, and training converges faster (Souter et al., 1994). According to Chen and Goodman (1996), Zhai and Lafferty (2001), and Zhai and Lafferty (2004) (among many others), there are empirical results showing that smoothing or interpolation gives better results in Natural Language Processing. Smoothing a language model serves to better estimate the real probabilities and avoid bias, because there is never sufficient data in a training set to estimate the probabilities of the languages accurately. However, due to the need for smoothing, pruning or interpolation, it can be time-consuming to generate n-gram features. In this paper we will see that even the raw n-gram counts can be used as features for effective language identification, as long as an appropriate learning algorithm is employed to provide effective smoothing (via the Dirichlet parameters α and β in our paper). Conversely, (2) means that the local linkage information of words (or n-grams) may lead Chinese-Whispers-based algorithms to fail to filter out mixed-language short sentences, since the words from other languages will add incorrect linkages between language clusters.
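For concreteness, raw character n-gram counting of the kind used as features later in the paper (Section 4.2) can be sketched as follows. The '^' and '$' boundary-marker symbols are our own notational choice, not the paper's:

```python
from collections import Counter

def char_ngram_counts(sentence, n_max=5):
    """Raw character n-gram counts, n = 1..n_max, for one sentence.

    Spaces are kept as tokens; '^' and '$' mark sentence begin/end
    (the marker symbols here are an illustrative choice).
    """
    padded = "^" + sentence + "$"
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

# Unigrams and bigrams of a short sentence, including markers and the space:
counts = char_ngram_counts("to be", n_max=2)
```

These raw counts are used directly, with no smoothing step of their own.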
We show that an appropriately structured probabilistic model, which utilizes global statistical information, is able to avoid these local incorrect interventions. This paper presents an unsupervised language identification method using raw n-gram counts as features and a reformulated Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003). This approach is tested on the ECI/MCI benchmark in the context of being a language identification tool, and on other data as a filtering tool. Additionally, we compared four kinds of measure and also the Hierarchical Dirichlet Process (HDP) on several configurations of the ECI/MCI benchmark, to determine the number of languages present.

The outline of the rest of this paper is as follows: in Section 3, we conduct a detailed analysis of the problem we are addressing and propose the Language Identification model based on Latent Dirichlet Allocation (LDA-LI). In Section 4, we introduce the feature space, training and inference algorithms of LDA-LI. In Section 4 we also briefly evaluate four kinds of measure and the HDP, comparing their differences in finding the suitable topic number (number of clusters). In Section 5, we present experimental results, including analyses using the ECI/MCI benchmark and a Wikipedia-based Swahili corpus. Finally, Section 6 draws conclusions and proposes future directions.

3. Towards an unsupervised approach

3.1. The Latent Dirichlet Allocation model

To be able to identify language in an unsupervised fashion we adopt and adapt a model from the field of topic modelling. Here, a topic model is a statistical model for discovering the abstract topics that occur in a collection of documents. Papadimitriou et al. (1998) give a probabilistic analysis of Latent Semantic Indexing, which is seen as an early topic model. Almost at the same time, Hofmann (1999) proposed probabilistic Latent Semantic Indexing (pLSI) with a suitable EM learning algorithm.
The most common topic model currently in use is the generalization of pLSI into Latent Dirichlet Allocation, which allows documents to contain a mixture of topics, developed by Blei et al. (2003). In LDA, each document is modelled as a mixture of K latent topics, where each topic k is a multinomial distribution φ_k over a W-word vocabulary. For any document j, its topic mixture θ_j is a probability distribution drawn from a Dirichlet prior with parameter α. For each ith word x_ij in j, a topic z_ij = k is drawn from θ_j, and x_ij is drawn from φ_k. The generative process for LDA is thus given by

θ_j ~ Dir(α),  φ_k ~ Dir(β),  z_ij = k | θ_j ~ Mult(θ_j),  x_ij | φ_{z_ij} ~ Mult(φ_{z_ij}),

where Dir(·) denotes the Dirichlet distribution and Mult(·) the multinomial distribution.

[Fig. 1. Bayesian network of LDA.]

LDA is a kind of Bayesian hierarchical model; the graphical model for it is illustrated in Fig. 1, where the observed variables, that is, the words x_ij and the hyper-parameters α and β, are shaded. For a more detailed description of LDA, see Blei et al. (2003). The LDA model has three merits. The first is exchangeability: according to Blei et al. (2003), topics are conditionally independent and identically distributed within a fixed document, which means the topics are infinitely exchangeable within the document. The second is that the Dirichlet distribution is the conjugate prior of the multinomial distribution. The exchangeability and the Dirichlet-multinomial conjugacy make the learning algorithm relatively simple, as pseudo-counts can be used directly by Expectation Maximization (Blei et al., 2003) or Gibbs sampling. In Section 4 our presented implementation uses Collapsed Gibbs Sampling (Griffiths and Steyvers, 2004). The third merit of LDA is that it inherently provides some degree of automatic smoothing. Asuncion et al.
(2009) point out that LDA is a flexible latent-variable framework for modelling sparse data in extremely high-dimensional spaces. Even with the default hyper-parameter settings of those learning algorithms, LDA can smooth the sparse count data and infer on unseen data (Blei et al., 2003). Thus in Section 4.2 we simply use the raw n-gram counts as features. These counts run from unigrams to 5-grams, so they include both lower-order and higher-order n-grams.

3.2. LDA for language identification

The basic idea behind traditional LDA is that documents are represented as random mixtures over latent topics, where each topic is characterised by a distribution over words (Blei et al., 2003). To adapt this to language identification we consider documents represented as random mixtures over latent languages, where each language is characterised as a distribution over letter n-gram counts. In this way, the document-language and language-n-gram hierarchies can similarly be modelled by LDA for language identification; we call this approach LDA-LI for short. Fig. 2 gives the pseudo-code of the generative LDA-LI model.

During the inference phase of LDA-LI, to classify a document as a given language, either the most probable language can be chosen, or a threshold can be set and multiple languages can be assigned to individual documents. Additionally, the formulation of the model places no restrictions on the length of the document, and it is able to classify very short documents, i.e., individual short sentences. As our speech synthesis work currently builds systems in a single language at a time, this paper explores the result of assuming that we are only interested in sentences that are purely of one language, and investigates whether we can identify these sentences appropriately. However, there is plenty of scope for using this model in scenarios that do not make this assumption.
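A toy numerical sketch of this generative story (sample a language mixture per document, then a language per token, then an n-gram from that language's distribution); all sizes and hyper-parameter values below are arbitrary illustrative choices:

```python
import numpy as np

def generate_corpus(D=5, K=2, W=20, N=30, alpha=0.1, beta=0.1, seed=0):
    """Sample a toy corpus from the LDA-LI generative story:
    phi_k ~ Dir(beta)   -- per-language n-gram distribution
    theta_j ~ Dir(alpha) -- per-document language mixture
    z ~ Mult(theta_j), n-gram x ~ Mult(phi_z)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * W, size=K)        # K language-n-gram rows
    docs = []
    for _ in range(D):
        theta = rng.dirichlet([alpha] * K)         # document's language mixture
        z = rng.choice(K, size=N, p=theta)         # latent language per token
        docs.append([rng.choice(W, p=phi[k]) for k in z])
    return docs, phi

docs, phi = generate_corpus()
```

Inference then runs this story backwards: given only the n-gram tokens, recover the language mixtures and language-n-gram distributions.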
// Language plate
for all languages k ∈ [1, K] do
    sample components φ_k ~ Dir(β)
end
// Document plate
for all documents j ∈ [1, D] do
    sample mixture proportion θ_j ~ Dir(α)    // the document length is N_j
    for all n-grams i ∈ [1, N_j] in document j do
        sample language z_ij ~ Mult(θ_j)
        sample n-gram x_ij ~ Mult(φ_{z_ij})
    end
end

Fig. 2. Generative model of LDA-LI.

4. Algorithm and model selection

4.1. Gibbs sampling and inference

The learning algorithm in this paper is based on Collapsed Gibbs Sampling (CGS) (Griffiths and Steyvers, 2004), a Markov-chain Monte Carlo method. The model parameter Φ = {φ_k ~ Dir(β)}, the set of topic distributions, can be integrated out using Dirichlet-multinomial conjugacy. The posterior distribution p(z|w) can then be estimated using the Collapsed Gibbs Sampling algorithm, which, in each iteration, updates each topic assignment z_ij ∈ z by sampling from the full conditional posterior distribution

p(z_ij = k | z^¬ij, w) ∝ ((C^word_kw,¬ij + β) / (Σ_{w'=1}^W C^word_kw',¬ij + Wβ)) · (C^doc_jk,¬ij + α),   (1)

where k ∈ [1, K] is a topic, w ∈ [1, W] is a word in the vocabulary, x_ij denotes the ith word in document j and z_ij the topic assigned to x_ij; w^¬ij denotes the words in the corpus with x_ij excluded, and z^¬ij the corresponding topic assignments. In addition, C^word_kw denotes the number of times that word w is assigned to topic k, not including the current instance x_ij and z_ij, and C^doc_jk the number of times that topic k has occurred in document j, not including x_ij and z_ij. Whenever z_ij is assigned a sample drawn from (1), the matrices C^word and C^doc are updated. After enough sampling iterations to burn in the Markov chain, Θ = {θ_j}_{j=1}^D and Φ = {φ_k}_{k=1}^K can be estimated by

θ_jk = (C^doc_jk + α) / (Σ_{k'=1}^K C^doc_jk' + Kα),   (2)

φ_kw = (C^word_kw + β) / (Σ_{w'=1}^W C^word_kw' + Wβ).   (3)

From Eqs. (2) and (3), we see that CGS learning and inference amount to a kind of pseudo-count over the original corpus. The implementation of CGS used in this paper is based upon the implementation of Wang et al. (2009) using a Map-Reduce parallel framework, with efficiency improvements by Liu et al.
(2011) using a Message Passing Interface.¹

Sometimes we are required to actually identify the individual languages present (for example, when evaluating the model with a test set and calculating precision and recall). For each language (topic) cluster we examine the sentence assigned to that language with the highest probability, manually determine which language it is, and then label all the sentences assigned to this language cluster as being of that language. Merging language clusters is also performed manually if there are more language clusters than actual languages known to be present. This strategy is used throughout our experiments, which means use of LDA-LI reduces the need for annotation from hundreds or thousands of sentences to a few representative ones for the given languages. This classifies each sentence as being of an actual language, which can then be evaluated as correct or not in experiments where ground truth is known.

¹ https://code.google.com/p/plda/

4.2. Feature space and sample representation

A corpus is first converted into samples by considering each individual sentence a document. These documents are then converted into character-based n-gram counts (tokens for spaces, and beginning- and end-of-sentence markers, are included for each document). Cavnar and Trenkle (1994) show that for supervised learning n < 3 is a sufficient n-gram length. We, however, found improved performance with our unsupervised method when we included n-grams with n in the range 1-5, which we believe allows us to capture more information across both short and long contexts. We did attempt to build models using n > 5, but they proved computationally impractical to train. Due to the smoothing ability of LDA with large sparse data as discussed in Section 3.
1, we are able to use the raw n-gram counts. In practice the smoothing and pruning are actually realised by the hyper-parameters α and β, which are configured with the default small values (<1) suggested by Liu et al. (2011).

4.3. Model selection and topic number

An important issue with LDA topic modelling is how to determine that an adequate number of individual topics is being modelled. In most cases (Blei et al., 2003; Griffiths and Steyvers, 2004; Newman et al., 2009; Wallach et al., 2009; Grün and Hornik, 2011), perplexity is used to evaluate the resulting model on held-out data. In our experiments (see Section 5), we found that perplexity always falls as the number of languages is increased, as shown in Fig. 9 of Blei et al. (2003) and Fig. 10 of Newman et al. (2009); this continues beyond the point where the number of languages in the model is larger than the actual number of different languages in the data. In fact, the perplexity 2^{H(p,q)} is just another form of the cross-entropy over the test set

H(p, q) = -Σ_v p_v log q_v = H(p) + KL(p‖q),   (4)

where p_v is the probability of each word v, estimated in the test set by p_v = n_v / Σ_{v'} n_{v'}, and q_v is the probability of each word v computed by the LDA model from the estimates of Eqs. (2) and (3). H(p) denotes the entropy of p and KL(p‖q) is the Kullback-Leibler divergence (the KL divergence, or relative entropy) of q from p. From this analysis, it can be seen that there is no explicit penalty term on the language number in Eq. (4), which means perplexity is not biased towards minimising the number of languages. Furthermore, even if a model achieves p_v = q_v on every word, so that KL(p‖q) = 0, it is by no means the best model, because H(p, q) = H(p) is not the minimum over the real underlying probability distribution p; it is just a biased estimate of H(p) on a limited test set.
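The relationship in Eq. (4) between perplexity, cross-entropy and KL divergence can be sketched directly (base-2 logs; a generic illustration, not the paper's evaluation code):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_v p_v log2 q_v = H(p) + KL(p || q), cf. Eq. (4)."""
    return -sum(pv * math.log2(qv) for pv, qv in zip(p, q) if pv > 0)

def perplexity(p, q):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2.0 ** cross_entropy(p, q)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
# When q == p, KL(p||q) = 0 and perplexity reduces to 2^H(p);
# any mismatched q can only increase the cross-entropy.
```

Note that nothing in this quantity penalises the number of topics used to produce q, which is why perplexity keeps falling as topics are added.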
This implies that perplexity may not be the best measure for finding the smallest topic number that does not significantly degrade performance.

Another way to find the correct number of languages is to use the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006) instead of LDA, to hierarchically cluster the documents into languages and thus automatically find the number of languages. However, here we find that HDP tends to choose far more languages than the measure of Arun et al. (2010) applied to LDA (see Table 2 of Section 5.3). The HDP behaviour is more like a language hashing process than one that can find the minimal language number appropriate for a given dataset. This means HDP is not suited to finding the minimum topic number, since the split-merge process is recursively run on sub-topics.

In Cao et al. (2009), the standard cosine distance (similarity) is used to measure the correlation between topics

corre(φ_i, φ_j) = Σ_{v=1}^W φ_iv φ_jv / (√(Σ_{v=1}^W φ_iv²) √(Σ_{v=1}^W φ_jv²)),   (5)

where i, j ∈ [1, K], v ∈ [1, W]. When corre(φ_i, φ_j) is smaller, the topics are more independent. The average cosine distance between every pair of topics is used to measure the stability of a topic structure

ave_dis(structure) = Σ_{i=1}^{K-1} Σ_{j=i+1}^K corre(φ_i, φ_j) / (K(K-1)/2).   (6)

A smaller ave_dis shows that the structure is more stable, so ave_dis is minimised. However, in some situations more topics always reduce ave_dis even when they are unnecessary, as we saw with perplexity; this happens in our experiments, see Section 5.3 and Fig. 8(a) for detail. Alternatively, Zavitsanos et al. (2008) use KL divergence instead of cosine similarity. Here, we revise this notion by replacing the term from Eq. (5) in Eq. (6) with the symmetric KL divergence, since it is more reasonable that Eq. (6) should average the correlation between pairs of topics.
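The ave_dis quantity of Eqs. (5)-(6) can be sketched as follows, taking the topic-word matrix row-wise (a generic sketch, not Cao et al.'s code):

```python
import numpy as np

def ave_dis(phi):
    """Average pairwise cosine similarity between the topic rows of phi,
    cf. Eqs. (5)-(6); Cao et al. (2009) minimise this quantity."""
    K = phi.shape[0]
    norms = np.linalg.norm(phi, axis=1)
    total = 0.0
    for i in range(K - 1):
        for j in range(i + 1, K):
            total += phi[i] @ phi[j] / (norms[i] * norms[j])
    return total / (K * (K - 1) / 2)

# Orthogonal (fully independent) topics give ave_dis = 0;
# identical topics give ave_dis = 1.
```

Adding more topics tends to make rows sparser and less overlapping, which is why this quantity can keep shrinking past the true topic number.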
Now we can use this to find the language number with the maximal average KL-divergence, since the distributions of n-gram vectors of different languages should have maximal average divergence when the correct number of languages is selected; thus either decreasing or increasing the number of languages will reduce the difference (in divergence). However, we find this measure is not sufficiently sensitive to the language number, and its value hardly changes with the language number in some situations, as confirmed by Fig. 8(b). The disadvantage of both Cao et al. (2009) and Zavitsanos et al. (2008) is that they only consider the information in the stochastic language-letter-n-gram matrix and ignore the sentence-language matrix.

Arun et al. (2010) view LDA as a matrix factorization mechanism, where a given corpus C is split into two matrix factors given by

C_{D×W} = Θ_{D×K} · Φ_{K×W},  Θ_{D×K} = [θ_1, θ_2, …, θ_D]ᵀ,  Φ_{K×W} = [φ_1, φ_2, …, φ_K]ᵀ,

where D is the number of documents present in the corpus and W is the size of the vocabulary as mentioned in 4.1. The quality of the split depends on K, the right number of topics chosen. This measure is computed in terms of the symmetric KL-divergence of salient distributions that are derived from these matrix factors. Θ_{D×K} is the document-topic (sentence-language) matrix, while Φ_{K×W} is the topic-word (language-n-gram) matrix. Note that here {θ_j}_{j=1}^D and {φ_k}_{k=1}^K are different from Section 4.1: they are just the numerators of Eqs. (2) and (3). So it is clear that

Σ_{v=1}^W φ(k, v) = Σ_{d=1}^D θ(d, k),  k ∈ [1, K],   (7)

where φ(k, v) is the kth-row, vth-column element of matrix Φ, and the same goes for θ(d, k). Eq. (7) is the number of words assigned to each topic looked at in two different ways: one as a row sum over words, and the other as a column sum over documents. However, when both these matrices are row-normalized (as done by LDA), this equality no longer holds. This is the reason only the numerators of Eqs.
(2) and (3) are used. They propose the symmetric KL divergence of C_Φ and C_Θ

symKL = KL(C_Φ ‖ C_Θ) + KL(C_Θ ‖ C_Φ),   (8)

where C_Φ is the distribution of singular values of the topic-word matrix Φ, and C_Θ is the distribution obtained by normalizing the vector L·Θ (L is a 1×D vector of the lengths of each document in the corpus and Θ is the document-topic matrix). Both the distributions C_Φ and C_Θ are kept in sorted order, so that the corresponding topic components are expected to match. With the measure in Eq. (8), they find that the divergence between the C_Φ distribution and the C_Θ distribution initially decreases, then starts to increase once the right number of topics is reached. In Section 5.4, we will see the effectiveness of symKL, also compared with the other measures mentioned above.

5. Experimental evaluation

It is difficult to fairly evaluate the LDA-LI model directly against other supervised language identification models, partly due to the bias of the tasks that they are designed to perform and partly due to the supervised-unsupervised difference. Therefore results comparing systems should not be treated judgementally, to say one system is better than the other; rather, they should be used to understand how the models behave differently.

Experiments 1 and 2 evaluated LDA-LI in this way as a general language identification model. Here we performed experiments either using LDA-LI as an unsupervised learner (Experiment 1) or as an unsupervised clusterer (Experiment 2), and compared our model to other approaches using the ECI/MCI,² a benchmark corpus for language identification studies (Armstrong-Warwick et al., 1994). We then perform a number of experiments to further evaluate LDA-LI in the way we intend to use it. In Experiment 3, we compared the measures related to the topic number mentioned in Section 4.3.
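As a concrete sketch of one of the measures compared in Experiment 3, the symKL quantity of Eq. (8) in Section 4.3 might be computed as follows under our reading of Arun et al. (2010); this is not the authors' code, and the unnormalised inputs are assumed to be the numerators of Eqs. (2) and (3):

```python
import numpy as np

def sym_kl(theta_counts, phi_counts, doc_lengths, eps=1e-12):
    """symKL = KL(C_phi || C_theta) + KL(C_theta || C_phi), cf. Eq. (8).

    C_phi: sorted, normalised singular values of the (unnormalised)
    topic-word count matrix; C_theta: sorted normalisation of L @ Theta.
    """
    c_phi = np.linalg.svd(phi_counts, compute_uv=False)
    c_phi = np.sort(c_phi / c_phi.sum())[::-1]          # descending order
    c_theta = doc_lengths @ theta_counts                # weight docs by length
    c_theta = np.sort(c_theta / c_theta.sum())[::-1]
    kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))
    return kl(c_phi, c_theta) + kl(c_theta, c_phi)

theta = np.array([[3., 1.], [1., 3.]])        # toy doc-topic counts (D x K)
phi = np.array([[2., 1., 1.], [1., 1., 2.]])  # toy topic-word counts (K x W)
lengths = np.array([4., 4.])                  # document lengths (1 x D)
score = sym_kl(theta, phi, lengths)
```

Sorting both distributions in descending order is what lets the k-th singular value be compared against the k-th largest topic mass.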
We also filtered either the ECI/MCI dataset (Experiment 4) or a corpus of Swahili from Wikipedia (Experiment 5) as tests of found data where a language was mixed with unknown other languages. Finally, in Experiment 6, we compared the performance of LDA-LI to that of a more traditional LDA word-feature model, to show that letter n-gram counts are a much better representation than words for this task.

As mentioned in Section 4.1, to actually calculate precision and recall, for each topic modelled by LDA-LI we examined the first sentence assigned to that topic (that with the highest probability, as computed and output by the LDA approach via Eq. (2) in 4.1), manually identified its language, and then labelled all the other sentences assigned to that topic as being of that same language. We then, where necessary, manually merged topics assigned the same language. This strategy was used throughout our experiments.

5.1. Experiment 1: LDA-LI as pseudo-supervised learning

Experiments 1 and 2 used nine languages³ with the same configurations as Takci and Gungor (2012). First, in Experiment 1, we tested the hypothesis that our system was equal to supervised systems trained with the same limited data (actually, for LDA-LI we consider this pseudo-training, as no use was made of the language classes). We performed 10-fold Cross Validation (CV) training and testing with the same configuration used by Takci and Gungor (2012) and calculated precision, recall and F-score. We compared our LDA-LI model to three available existing methods of language identification: the langID tool⁴ of Lui and Baldwin (2012), the guess_language tool⁵ (the current version of TextCat (Cavnar and Trenkle, 1994)), and the ICF of Takci and Gungor (2012). We first prepared the ECI/MCI data for 10-fold Cross Validation with the same configuration as Takci and Gungor (2012). With this we then trained and tested langID, guess_language and our LDA-LI models.
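A generic sketch of the kind of 10-fold index splitting used here (not the exact split configuration of Takci and Gungor (2012)):

```python
import random

def ten_fold_splits(n_items, seed=0):
    """Yield (train_idx, test_idx) pairs for 10-fold cross validation.

    Indices are shuffled once, dealt into 10 folds round-robin, and each
    fold in turn serves as the held-out test set.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        yield train, test
```

Each sentence appears in exactly one test fold, so the per-fold scores can simply be averaged.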
Note that for the LDA-LI model this is pseudo-training, as we did not use the language classes present. The average precision, recall and F-scores were then compared (here the precision, recall and F-score were first averaged across the 10 CV folds, then averaged according to the language-ratio configuration of Takci and Gungor (2012)). We also include results reported by Takci and Gungor (2012) for their ICF model, which used the same configuration. Fig. 3(a), (b) and Table 1 show that our LDA-LI system, though unsupervised, generally compared favourably in performance with the supervised systems. Note that we present the LDA-LI model with 16 topic classes, where the symmetric KL-divergence reaches its minimum. This is discussed further in Experiment 3.

[Fig. 3. Precisions and recalls on 10-fold CV: (a) average precisions; (b) average recalls.]

Table 1
Performance over 10-fold CV. Note: the ICF method is from Takci and Gungor (2012), guess_language is from Cavnar and Trenkle (1994), and langID is from Lui and Baldwin (2012).

Method          Sentence length of test   Precision   Recall   F-score
LDA-LI          Max 1297, Min 10          95.71%      95.03%   95.35%
langID          Average 81.65             95.71%      96.00%   95.73%
guess_language  characters                99.27%      95.00%   97.05%
ICF             100 characters            97.10%      97.50%   97.30%

5.2. Experiment 2: LDA-LI as clustering

One criticism of Experiment 1 is that it is not fair on the supervised systems, as they are designed to be used trained on larger data sets. We address this with Experiment 2, in which we tested the hypothesis that our system, as an unsupervised algorithm, would perform as well as supervised systems in the more real-world situation where the supervised systems are fully trained on additional data. We used our LDA-LI system to cluster the ECI/MCI data in an unsupervised fashion and compared to the other systems fully trained on their wider variety of training data. We evaluated on the full ECI/MCI subset used above; default versions of langID and guess_language were taken pre-trained on their very large corpora, while LDA-LI ran solely on the ECI/MCI as a standard clustering algorithm without additional language knowledge. The trade-off here was that our LDA-LI system was solely pseudo-trained on in-domain data, whereas the other systems were trained on substantially more (including in-domain) data. We justify this comparison as it is how each of these systems would be used in practice. Here the LDA-LI model was used with 16 topic classes, and performed using 10-fold CV as described in Experiment 1.

Fig. 4(a) and (b) shows that our LDA-LI system generally outperformed the other systems. In a sense this was not surprising in terms of the data used, as LDA-LI clusters all the data, and hence was in-domain. The more general training regimes of the other systems were to their disadvantage here. Domain and style differences are the likely explanation for langID outperforming guess_language, as the former is more up to date in terms of the content of its training corpora and closer to the data being classified. The general conclusion here is that training the supervised systems only on the in-domain data does not harm them; this may not, however, be the case for smaller data sets.

² http://www.elsnet.org/eci.html
³ Dutch-291K, English-108K, French-108K, German-171K, Italian-99K, Portuguese-107K, Spanish-107K, Swedish-91K, and Turkish-109K.
⁴ https://github.com/saffsd/langid
⁵ https://bitbucket.org/spirit/guess_language/overview

5.3. Experiment 3: comparison of measures on topic number

In Experiment 3, we investigated measures for finding the correct number of languages present in a corpus.
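Whatever the measure, the selection loop itself is the same: train a model per candidate K, score it, and keep the best. A minimal sketch with a hypothetical `measure` interface (a callable mapping a candidate K to a score such as ave_dis or symKL computed from a model trained with K topics):

```python
def select_topic_number(measure, candidate_ks):
    """Pick the candidate topic number that minimises a model-selection
    measure. `measure` is a hypothetical interface: K -> score, where a
    lower score is better (as for ave_dis and symKL)."""
    scores = {k: measure(k) for k in candidate_ks}
    return min(scores, key=scores.get), scores

# Toy stand-in measure with its minimum at K = 9:
best, scores = select_topic_number(lambda k: (k - 9) ** 2,
                                   [3, 6, 9, 12, 15, 18])
```

For a measure biased downwards in K (like perplexity), this loop would simply return the largest candidate, which is exactly the failure the following experiment demonstrates.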
We implemented the perplexity and HDP measures in conjunction with the nine languages used in Experiments 1 and 2. We also used an additional nine languages, giving 18 languages in total.⁶ For each N in {3, 6, 9, 12, 15, 18} we randomly selected N languages and split the data from these languages to perform a 10-fold Cross Validation for the given N.

⁶ Dutch, English, French, German, Italian, Spanish, Swedish, Turkish, Portuguese, Albanian, Bulgarian, Czech, Estonian, Latin, Lithuanian, Modern Greek, Norwegian and Russian.

[Fig. 4. Precision and recall scores for langID.py and guess_language: (a) comparison of precision scores; (b) comparison of recall scores.]

[Fig. 5. Perplexity related to topic number, with 9 languages present.]

We then averaged the cosine distance (Cao et al., 2009), word KL-divergence (Zavitsanos et al., 2008) and symmetric KL divergence (Arun et al., 2010). To test for consistency in our choice of languages selected, we repeated this set of experiments 5 times, randomly choosing different subsets of languages each time for each N. Results were found to be consistent irrespective of the languages chosen.

Experiments 1 and 2 showed that we can use LDA-LI as a language identification tool with the ability to generalize well, if we can find the appropriate number of topics present. As a general language identification tool there are two requirements on the language number. First, the language number must be large enough to account for all the languages present.
Second, it is better for the language number to be as close as possible to the actual number of languages present, in order to avoid unnecessary examination and merging of language clusters that are actually of the same language.

To evaluate the ability to find the correct number of languages present, we calculated the perplexity, HDP and symKL (all averaged across the 10-fold CV) on the nine languages used in Experiments 1 and 2, and compared with the precision, recall and F-scores. We used the 18 languages to compare the cosine distance, word KL-divergence and symmetric KL divergence (abbreviated as cosim, wordKL and symKL, respectively). The five measures are described in detail in Section 4.3.

In Fig. 5 we can see that perplexity always fell with an increase in language number, even when the number of languages was greater than 50 and the precision and recall were degraded (see Fig. 6; each time the topic number increases a new LDA is trained, so the F-score might drop); therefore we discard perplexity. Table 2 shows
