
PLDAQuickStart


Introduction

plda must be run in a Linux environment with the g++ compiler installed.

Installation

  • Download plda-2.0.tar.gz
  • tar xvfz plda-2.0.tar.gz
  • cd plda-2.0
  • make lda infer
  • You will see two binaries, lda and infer, generated in the folder

Prepare datafile

  • Data is stored using a sparse representation, with one document per line. Each line lists the words of the document together with their counts, in the form <word1> <word1_count> <word2> <word2_count> <word3> <word3_count> ... Each word is an arbitrary string, but it must not contain spaces, newlines, or other special characters.
  • Example: Suppose there are two documents. The first one is "a is a character"; the second one is "b is a character after a". Then the datafile would look like (a conversion sketch follows the example):
a 2 is 1 character 1
a 2 is 1 b 1 character 1 after 1
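
Below is a minimal sketch, not part of plda, of how you might convert raw documents (one per line) into this sparse format. The function name and the input documents are only illustrative.

```python
# Hypothetical helper (not part of plda): convert raw documents into the
# sparse "<word> <count> ..." format described above.
from collections import Counter

def to_plda_line(document):
    """Return one plda data line: each word followed by its count."""
    counts = Counter(document.split())
    return " ".join(f"{word} {count}" for word, count in counts.items())

docs = ["a is a character", "b is a character after a"]
for doc in docs:
    print(to_plda_line(doc))
# Prints lines equivalent to the example datafile (word order may differ).
```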

Do a simple training and testing

  • Download test_data.txt as a training sample and put it under the plda/testdata/ directory.
  • Train
    • ./lda --num_topics 2 --alpha 0.1 --beta 0.01 --training_data_file testdata/test_data.txt --model_file /tmp/lda_model.txt --burn_in_iterations 100 --total_iterations 150
  • After training completes, you will see a file /tmp/lda_model.txt generated. This file stores the training result: each line holds the topic distribution of a word, with the word string first, followed by its occurrence count within each topic. You can use view_model.py to convert the model to readable text; a parsing sketch follows this list.
  • Infer unseen documents:
    • ./infer --alpha 0.1 --beta 0.01 --inference_data_file testdata/test_data.txt --inference_result_file /tmp/inference_result.txt --model_file /tmp/lda_model.txt --total_iterations 15 --burn_in_iterations 10
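
As a rough illustration of the model-file layout described above (word string followed by per-topic counts), the following sketch loads /tmp/lda_model.txt and normalizes each word's counts into a topic distribution. It is an assumption-based approximation of what view_model.py does, not a replacement for it.

```python
# Illustrative sketch: read the model file, where each line is a word
# followed by its count in each topic, and normalize the counts into
# per-word topic distributions.
def load_model(path):
    model = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            word, counts = fields[0], [float(x) for x in fields[1:]]
            total = sum(counts)
            model[word] = [c / total if total > 0 else 0.0 for c in counts]
    return model

model = load_model("/tmp/lda_model.txt")
for word, dist in model.items():
    print(word, dist)
```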

Command-line flags

  • Training flags
    • alpha: Suggested to be 50/number_of_topics
    • beta: Suggested to be 0.01
    • num_topics: The total number of topics.
    • total_iterations: The total number of Gibbs sampling iterations.
    • burn_in_iterations: After --burn_in_iterations iterations, the model will be almost converged; the models of the last (total_iterations - burn_in_iterations) iterations are then averaged as the final model. This only takes effect in the single-processor version. For example, if you set total_iterations to 200 and find that the model has almost converged after 170 iterations, you could set burn_in_iterations to 170 so that the final model is the average of the last 30 iterations.
    • model_file: The output file of the trained model.
    • training_data_file: The training data.
  • Inferring flags:
    • alpha and beta should be the same as in training.
    • total_iterations: The total number of Gibbs sampling iterations for an unseen document to determine its word topics. This number need not be as large as for training; usually tens of iterations are enough.
    • burn_in_iterations: For an unseen document, the document_topic_distribution of the last (total_iterations - burn_in_iterations) iterations is averaged as the final document_topic_distribution (see the toy averaging sketch below).
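
To make the burn-in averaging concrete, the toy sketch below averages a few made-up per-iteration document_topic_distribution samples; the numbers are invented purely for illustration.

```python
# Toy illustration of burn-in averaging: the final distribution is the
# average of the per-iteration document_topic_distribution collected
# after burn_in_iterations.
total_iterations = 15
burn_in_iterations = 10

# One hypothetical document_topic_distribution per post-burn-in iteration
# (total_iterations - burn_in_iterations = 5 samples).
samples = [
    [0.80, 0.20],
    [0.70, 0.30],
    [0.75, 0.25],
    [0.85, 0.15],
    [0.80, 0.20],
]

num_topics = len(samples[0])
final = [sum(s[k] for s in samples) / len(samples) for k in range(num_topics)]
print(final)  # [0.78, 0.22]
```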