data
.gitignore
README.rst
blei_lda.py
build_lda.py
wikitopics_create.py
wikitopics_create_hdp.py
wikitopics_plot.py
wordcloud.py


Chapter 4

Support code for Chapter 4: Topic Modeling

AP Data

To download the AP data, use the download_ap.sh script inside the data directory:

cd data
./download_ap.sh

Word cloud creation

Word cloud creation requires pytagcloud, which in turn requires pygame. Because this is not an essential part of the chapter, the code still runs if you have not installed it: no cloud image is generated and a warning is printed instead.
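The optional-dependency behaviour described above can be sketched as follows. This is a minimal illustration, not the code in wordcloud.py: the helper name `optional_import` is hypothetical, and the warning text is an assumption.

```python
import importlib
import warnings


def optional_import(module_name):
    """Return the imported module, or None (with a warning) if it is missing.

    Hypothetical helper for illustration; the chapter's own script may
    guard the import differently.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Missing optional dependency: warn and let the caller skip the step.
        warnings.warn(f"{module_name} is not installed; skipping the step that needs it")
        return None


def maybe_make_cloud():
    """Report whether the word cloud step could run in this environment."""
    pytagcloud = optional_import("pytagcloud")
    if pytagcloud is None:
        return False  # the rest of the analysis continues without the image
    return True
```

Deferring the import to the point of use means that a missing pytagcloud (or pygame) only disables the image, not the topic modeling itself.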

Wikipedia processing

You will need a lot of disk space. The download of the Wikipedia text is 11GB, and preprocessing it into the intermediate format that gensim uses takes another 24GB, for a total of roughly 35GB.
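Before starting the download, it may be worth checking that the disk has room for both the download and the intermediate files. The sketch below uses only the standard library; the function name `has_enough_space` is hypothetical, and the two constants are the approximate sizes quoted above, which will drift as Wikipedia grows.

```python
import shutil

# Approximate sizes quoted in this README (assumptions that change over time).
DOWNLOAD_GB = 11
INTERMEDIATE_GB = 24


def has_enough_space(path=".", required_gb=DOWNLOAD_GB + INTERMEDIATE_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    if not has_enough_space("."):
        print("Not enough free disk space for the Wikipedia download and preprocessing")
```

Run it from inside the data/ directory so that the check applies to the filesystem that will actually hold the files.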

Run the following two commands inside the data/ directory:

./download_wp.sh
./preprocess-wikidata.sh

As the filenames indicate, the first script downloads the data and the second one preprocesses it. Preprocessing can take several hours, but it is feasible to run it on a modern laptop. Once the second step has finished, you may delete the input file (data/enwiki-latest-pages-articles.xml.bz2) to save disk space.

To generate the model, run the wikitopics_create.py script; the wikitopics_plot.py script then plots the most heavily discussed topic as well as the least heavily discussed one. The work is split into two steps because the first one can take a very long time. It saves its results so that you can explore them later at leisure.
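The idea of splitting a slow step from a fast exploratory one can be sketched as a compute-once cache. This is a generic pattern under stated assumptions, not the scripts' actual code: `load_or_compute` and the cache path are hypothetical, and the real scripts may store their results in a different format.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the cached result at `cache_path`, computing and saving it if absent.

    `compute` is any zero-argument callable standing in for the slow step
    (for example, fitting a topic model).
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

With this pattern the slow step runs once, and any later plotting or exploration reloads the saved result instead of recomputing it.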

You should not expect that your results will exactly match the results in the book, for two reasons:

1. The LDA algorithm is a probabilistic algorithm and can give different results every time it is run.
2. Wikipedia keeps changing. Thus, even your input data will be different.
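The first point can be illustrated with a toy example. The function below is not LDA; it is a hypothetical stand-in for the random initialisation that a probabilistic algorithm performs, showing that results repeat only when the random seed is fixed.

```python
import random


def sample_topic_assignments(n_words, n_topics, seed=None):
    """Assign each word a random topic index, as a toy stand-in for random initialisation.

    With seed=None the generator is seeded from system entropy, so each run
    will usually differ; with a fixed seed the output is reproducible.
    """
    rng = random.Random(seed)
    return [rng.randrange(n_topics) for _ in range(n_words)]


if __name__ == "__main__":
    print(sample_topic_assignments(10, 4))          # typically differs between runs
    print(sample_topic_assignments(10, 4, seed=0))  # identical on every run
```

Even with a fixed seed, the second point still applies: a newer Wikipedia dump is different input, so the topics will not match the book exactly.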

Scripts

blei_lda.py: Computes LDA using the AP corpus.
wikitopics_create.py: Creates the topic model for Wikipedia using LDA (you must download the Wikipedia database first).
wikitopics_create_hdp.py: Creates the topic model for Wikipedia using HDP (you must download the Wikipedia database first).
