Prashanth1608/LARA-LDA (forked from nageshbhattu/LARA-LDA)
*************************** LATENT DIRICHLET ALLOCATION ***************************

Java port (LDA-J): Gregor Heinrich gregor[at]arbylon.net
(C) Copyright 2005, Gregor Heinrich (gregor [at] arbylon [dot] net)

Original design (LDA-C) and theory: David M. Blei blei[at]cs.cmu.edu
(C) Copyright 2004, David M. Blei (blei [at] cs [dot] cmu [dot] edu)

This file is part of LDA-J, which is a Java port of LDA-C, retaining its
general structure and I/O formats.

LDA-J is free software; you can redistribute it and/or modify it under the
terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

LDA-J is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA

------------------------------------------------------------------------

From LDA-C's readme.txt:

This is a C implementation of latent Dirichlet allocation (LDA), a model of
discrete data which is fully described in Blei et al. (2003)
(http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf).

LDA is a hierarchical model of documents. Let \alpha be a scalar and
\beta_{1:K} be K distributions over words (called topics). As implemented
here, a K-topic LDA model assumes the following generative process for an
N-word document:

1. \theta | \alpha ~ Dirichlet(\alpha / K, ..., \alpha / K)
2. for each word n = {1, ..., N}:
   a. z_n | \theta ~ Mult(\theta)
   b. w_n | z_n, \beta ~ Mult(\beta_{z_n})

This code implements variational inference of \theta and z_{1:N} for a
document, and estimation of the topics \beta_{1:K} and \alpha.

**** COMPILING ****

Type "make" in a shell.
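The generative process above can be sketched in Java as follows. This is an illustrative toy (class and method names are hypothetical, not part of LDA-J): theta is drawn by normalizing Gamma(alpha/K) samples, then each word's topic z_n and word w_n are drawn from the corresponding multinomials.

```java
import java.util.Random;

// Hypothetical sketch of the K-topic LDA generative process; names are
// illustrative and not taken from the LDA-J source.
public class LdaGenerativeSketch {

    // Draw from Gamma(shape, 1) via Marsaglia-Tsang, with the standard
    // boost U^(1/shape) for shape < 1 (needed since alpha/K is often < 1).
    static double sampleGamma(double shape, Random rng) {
        if (shape < 1.0) {
            double u = rng.nextDouble();
            return sampleGamma(shape + 1.0, rng) * Math.pow(u, 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = rng.nextGaussian();
            double v = Math.pow(1.0 + c * x, 3);
            if (v <= 0) continue;
            double u = rng.nextDouble();
            if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }

    // theta ~ Dirichlet(alpha/K, ..., alpha/K): normalized Gamma draws.
    static double[] sampleTheta(double alpha, int K, Random rng) {
        double[] theta = new double[K];
        double sum = 0.0;
        for (int k = 0; k < K; k++) {
            theta[k] = sampleGamma(alpha / K, rng);
            sum += theta[k];
        }
        for (int k = 0; k < K; k++) theta[k] /= sum;
        return theta;
    }

    // Draw one index from a discrete distribution (Mult with n = 1).
    static int sampleDiscrete(double[] p, Random rng) {
        double r = rng.nextDouble(), cum = 0.0;
        for (int i = 0; i < p.length; i++) {
            cum += p[i];
            if (r < cum) return i;
        }
        return p.length - 1;
    }

    // Generate an N-word document: z_n ~ Mult(theta), w_n ~ Mult(beta[z_n]).
    static int[] generateDocument(double alpha, double[][] beta, int N, Random rng) {
        double[] theta = sampleTheta(alpha, beta.length, rng);
        int[] words = new int[N];
        for (int n = 0; n < N; n++) {
            int z = sampleDiscrete(theta, rng);
            words[n] = sampleDiscrete(beta[z], rng);
        }
        return words;
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        // Two toy topics over a 4-term vocabulary (each row sums to 1).
        double[][] beta = {
            {0.70, 0.20, 0.05, 0.05},
            {0.05, 0.05, 0.20, 0.70}
        };
        int[] doc = generateDocument(1.0, beta, 20, rng);
        for (int w : doc) System.out.print(w + " ");
        System.out.println();
    }
}
```

Estimation (below) inverts this process: it recovers \beta_{1:K} and \alpha from observed word counts alone.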
**** TOPIC ESTIMATION ****

Estimate the model by executing:

    lda est [initial alpha] [k] [settings] [data] [random/seeded/*] [directory]

The term [random/seeded/*] describes how the topics will be initialized.
"random" initializes each topic randomly; "seeded" initializes each topic to
a distribution smoothed from a randomly chosen document; or, you can specify
a model name to load a pre-existing model as the initial model (this is
useful to continue EM from where it left off). To change the number of
initial documents used, edit lda-estimate.c.

The model (\alpha and \beta_{1:K}) and variational posterior Dirichlet
parameters will be saved in the specified directory every ten iterations.
Additionally, there will be a log file recording the likelihood bound and
convergence score at each iteration. The algorithm runs until that score is
less than "em convergence" (from the settings file) or "em max iter"
iterations are reached. (To change the lag between saved models, edit
lda-estimate.c.)

The saved models are in two files:

    <iteration>.other contains alpha.
    <iteration>.beta contains the topic distributions. Each line is a topic.

The variational posterior Dirichlets are in:

    <iteration>.gamma

The settings file and data format are described below.

1. Settings file

See settings.txt for a sample. It is of the following form:

    var max iter [integer e.g., 10]
    var convergence [float e.g., 1e-8]
    em max iter [integer e.g., 100]
    em convergence [float e.g., 1e-5]
    alpha [fixed/estimate]

where the settings are:

[var max iter] The maximum number of iterations of coordinate ascent
variational inference for a single document.

[var convergence] The convergence criterion for variational inference. Stop
if (score_old - score) / abs(score_old) is less than this value (or after
the maximum number of iterations). Note that the score is the lower bound on
the likelihood for a particular document.

[em max iter] The maximum number of iterations of variational EM.
[em convergence] The convergence criterion for variational EM. Stop if
(score_old - score) / abs(score_old) is less than this value (or after the
maximum number of iterations). Note that the score is the lower bound on the
likelihood for the whole corpus.

[alpha] If set to [fixed], then alpha does not change from iteration to
iteration. If set to [estimate], then alpha is estimated along with the
topic distributions.

2. Data format

Under LDA, the words of each document are assumed exchangeable. Thus, each
document is succinctly represented as a sparse vector of word counts. The
data is a file where each line is of the form:

    [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count]
associated with each term is how many times that term appeared in the
document.

**** INFERENCE ****

To perform inference on a different set of data (in the same format as for
estimation), execute:

    lda infer [settings] [model] [data] [name]

Variational inference is performed on the data using the model in [model].*
(see above). Two files will be created: [name].gamma holds the variational
Dirichlet parameters for each document; [name].likelihood holds the bound on
the likelihood for each document.

**** Project status, feedback, questions and problems ****

lda-j is in a pre-alpha state, i.e., without extensive testing or guaranteed
stability. For feedback and questions (especially regarding the Java port),
please email Gregor Heinrich gregor[at]arbylon.net. (It might happen that I
cannot respond immediately, as lda-j is currently rather a "Sunday
project".)
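The sparse data format described above is simple enough to parse in a few lines. This is a hypothetical sketch (not the actual LDA-J reader): it splits a line on whitespace, reads the leading count [M], and collects the term:count pairs into a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of parsing one line of the "[M] term:count ..."
// data format; names are illustrative, not from the LDA-J source.
public class SparseDocLineSketch {

    // Parse one document line into a map from term id to count.
    static Map<Integer, Integer> parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        int m = Integer.parseInt(tokens[0]); // number of unique terms
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int i = 1; i <= m; i++) {
            String[] pair = tokens[i].split(":");
            counts.put(Integer.parseInt(pair[0]), Integer.parseInt(pair[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // A document with 3 unique terms: term 0 twice, term 5 once, term 9 four times.
        Map<Integer, Integer> doc = parseLine("3 0:2 5:1 9:4");
        System.out.println(doc); // prints {0=2, 5=1, 9=4}
    }
}
```

Note that term ids are vocabulary indices; word order is not recorded, which is exactly the exchangeability assumption stated above.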