***************************
LATENT DIRICHLET ALLOCATION
***************************
Java port (LDA-J):
Gregor Heinrich
gregor[at]arbylon.net
(C) Copyright 2005, Gregor Heinrich (gregor [at] arbylon [dot] net)
Original design (LDA-C) and theory:
David M. Blei
blei[at]cs.cmu.edu
(C) Copyright 2004, David M. Blei (blei [at] cs [dot] cmu [dot] edu)
This file is part of LDA-J, which is a Java port of LDA-C, retaining
its general structure and I/O formats.
LDA-J is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
LDA-J is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA
------------------------------------------------------------------------
From LDA-C's readme.txt:
This is a C implementation of latent Dirichlet allocation (LDA), a
model of discrete data which is fully described in Blei et al. (2003)
(http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf).
LDA is a hierarchical model of documents. Let \alpha be a scalar and
\beta_{1:K} be K distributions of words (called topics). As
implemented here, a K topic LDA model assumes the following generative
process of an N word document:
1. \theta | \alpha ~ Dirichlet(\alpha / K, ..., \alpha / K)
2. for each word n in {1, ..., N}:
a. Z_n | \theta ~ Mult(\theta)
b. W_n | Z_n, \beta ~ Mult(\beta_{Z_n})
This code implements variational inference of \theta and z_{1:N} for a
document, and estimation of the topics \beta_{1:K} and \alpha.
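For illustration, the two-step generative process above can be sketched in Java (LDA-J's language). Everything here is a toy: the topic matrix, vocabulary size, document length, and all class/method names are invented for this sketch and are not part of LDA-J's actual API. Theta is drawn from the symmetric Dirichlet by normalizing Gamma(\alpha/K) samples (Marsaglia-Tsang sampler, with the usual boost for shape < 1):

```java
import java.util.Random;

public class GenerativeSketch {

    // Gamma(shape, 1) sampler (Marsaglia-Tsang); boost trick for shape < 1
    static double sampleGamma(Random rng, double shape) {
        if (shape < 1.0) {
            return sampleGamma(rng, shape + 1.0)
                    * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x, v;
            do {
                x = rng.nextGaussian();
                v = 1.0 + c * x;
            } while (v <= 0.0);
            v = v * v * v;
            double u = rng.nextDouble();
            if (u < 1.0 - 0.0331 * x * x * x * x
                    || Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
                return d * v;
            }
        }
    }

    // step 1: theta ~ Dirichlet(alpha/K, ..., alpha/K), via normalized Gammas
    static double[] sampleTheta(Random rng, double alpha, int k) {
        double[] theta = new double[k];
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            theta[i] = sampleGamma(rng, alpha / k);
            sum += theta[i];
        }
        for (int i = 0; i < k; i++) theta[i] /= sum;
        return theta;
    }

    // one draw from Mult(p): walk the cumulative distribution
    static int sampleDiscrete(Random rng, double[] p) {
        double u = rng.nextDouble(), cum = 0.0;
        for (int i = 0; i < p.length; i++) {
            cum += p[i];
            if (u < cum) return i;
        }
        return p.length - 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        // two toy topics over a three-word vocabulary (made up for illustration)
        double[][] beta = { { 0.7, 0.2, 0.1 }, { 0.1, 0.2, 0.7 } };
        double[] theta = sampleTheta(rng, 1.0, beta.length); // step 1
        int n = 8;                                           // toy document length
        int[] words = new int[n];
        for (int i = 0; i < n; i++) {
            int z = sampleDiscrete(rng, theta);              // step 2a
            words[i] = sampleDiscrete(rng, beta[z]);         // step 2b
        }
        System.out.println(java.util.Arrays.toString(words));
    }
}
```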
**** COMPILING ****
Type "make" in a shell.
**** TOPIC ESTIMATION ****
Estimate the model by executing:
lda est [initial alpha] [k] [settings] [data] [random/seeded/*] [directory]
The term [random/seeded/*] describes how the topics will be
initialized. "random" initializes each topic randomly; "seeded"
initializes each topic to a distribution smoothed from a randomly
chosen document; or, you can specify a model name to load a
pre-existing model as the initial model (this is useful to continue EM
from where it left off). To change the number of initial documents
used, edit lda-estimate.c.
The model (\alpha and \beta_{1:K}) and variational posterior Dirichlet
parameters will be saved in the specified directory every ten
iterations. Additionally, there will be a log file for the likelihood
bound and convergence score at each iteration. The algorithm runs
until that score is less than em convergence (from the settings file)
or em max iter iterations are reached. (To change the lag between
saved models, edit lda-estimate.c.)
The saved models are in two files:
<iteration>.other contains alpha.
<iteration>.beta contains the topic distributions. Each line is a topic.
The variational posterior Dirichlets are in:
<iteration>.gamma
The settings file and data format are described below.
1. Settings file
See settings.txt for a sample.
This is of the following form:
var max iter [integer e.g., 10]
var convergence [float e.g., 1e-8]
em max iter [integer e.g., 100]
em convergence [float e.g., 1e-5]
alpha [fixed/estimate]
where the settings are
[var max iter]
The maximum number of iterations of coordinate ascent
variational inference for a single document.
[var convergence]
The convergence criterion for variational inference. Stop if
(score_old - score) / abs(score_old) is less than this value (or
after the maximum number of iterations). Note that the score is
the lower bound on the likelihood for a particular document.
[em max iter]
The maximum number of iterations of variational EM.
[em convergence]
The convergence criterion for variational EM. Stop if (score_old
- score) / abs(score_old) is less than this value (or after the
maximum number of iterations). Note that score is the lower
bound on the likelihood for the whole corpus.
[alpha]
If set to [fixed], then alpha does not change from iteration to
iteration. If set to [estimate], then alpha is estimated along
with the topic distributions.
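To make the file layout concrete, here is a minimal Java sketch of a parser for the settings format above: each line is a multi-word key followed by a single value token. The class and field names are illustrative, not part of LDA-J's actual API:

```java
import java.util.*;

public class SettingsSketch {
    int varMaxIter;
    double varConvergence;
    int emMaxIter;
    double emConvergence;
    String alpha; // "fixed" or "estimate"

    static SettingsSketch parse(List<String> lines) {
        SettingsSketch s = new SettingsSketch();
        for (String line : lines) {
            // the value is the last whitespace-separated token; the key is the rest
            String[] tok = line.trim().split("\\s+");
            String key = String.join(" ", Arrays.copyOf(tok, tok.length - 1));
            String val = tok[tok.length - 1];
            switch (key) {
                case "var max iter":    s.varMaxIter = Integer.parseInt(val); break;
                case "var convergence": s.varConvergence = Double.parseDouble(val); break;
                case "em max iter":     s.emMaxIter = Integer.parseInt(val); break;
                case "em convergence":  s.emConvergence = Double.parseDouble(val); break;
                case "alpha":           s.alpha = val; break;
            }
        }
        return s;
    }
}
```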
2. Data format
Under LDA, the words of each document are assumed exchangeable. Thus,
each document is succinctly represented as a sparse vector of word
counts. The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ... [term_M]:[count]
where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document.
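A minimal Java sketch of parsing one such line follows; the method name and the terms/counts return shape are illustrative, not LDA-J's actual document representation:

```java
public class CorpusLineSketch {

    // Parses "[M] term:count term:count ..." into {terms, counts}.
    static int[][] parseLine(String line) {
        String[] tok = line.trim().split("\\s+");
        int m = Integer.parseInt(tok[0]); // number of unique terms in the document
        int[] terms = new int[m];
        int[] counts = new int[m];
        for (int i = 0; i < m; i++) {
            String[] pair = tok[i + 1].split(":");
            terms[i] = Integer.parseInt(pair[0]);
            counts[i] = Integer.parseInt(pair[1]);
        }
        return new int[][] { terms, counts };
    }
}
```

For example, the line `3 0:2 5:1 9:4` describes a document in which term 0 appears twice, term 5 once, and term 9 four times.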
**** INFERENCE ****
To perform inference on a different set of data (in the same format as
for estimation), execute:
lda infer [settings] [model] [data] [name]
Variational inference is performed on the data using the model in
[model].* (see above). Two files will be created: [name].gamma contains
the variational Dirichlet parameters for each document, and
[name].likelihood contains the likelihood bound for each document.
**** Project status, feedback, questions and problems ****
LDA-J is in a pre-alpha state, i.e., without extensive testing or guaranteed
stability. For feedback and questions (especially regarding the Java port),
please email Gregor Heinrich at gregor[at]arbylon.net.
(Responses may be delayed, as LDA-J is currently rather a "Sunday
project".)