Text Analysis Research

Getting Started

Clone the repo into your home directory:

```
cd ~
git clone https://github.com/robert-giaquinto/text-analysis.git
```

Set up the Python environment.

We currently use the Python version provided by the "python2/2.7.12_anaconda4.1.1" module on MSI, because it ships with some required packages that can be difficult to install. To make it the default version of Python on MSI, add these two lines to your ~/.bashrc file:
```
module unload python
module load python2/2.7.12_anaconda4.1.1
```
Virtual environments.

MSI allows us to install Python modules into virtual environments. These are useful because everyone can use the same Python modules without running into dependency conflicts caused by other projects. I prefer keeping the virtualenv in my home folder, but you can also put it in the project folder (just make sure not to push it to GitHub). To create the virtualenv, go to your home folder and run:
```
virtualenv venv
```
To activate the virtualenv run:
```
source ~/venv/bin/activate
```
To load this virtualenv automatically every time you log in (recommended if you aren't working on other MSI projects), add the previous line of code to your ~/.bashrc file.
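
If you are unsure whether the virtualenv is active in your current shell, a quick check from Python can help. The snippet below is only a sketch and not part of this repository; the function name `in_virtualenv` is made up for illustration. It relies on the fact that the classic virtualenv tool sets `sys.real_prefix`, while newer tools make `sys.base_prefix` differ from `sys.prefix`.

```python
from __future__ import print_function
import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Classic virtualenv sets sys.real_prefix on the interpreter it creates.
    if hasattr(sys, "real_prefix"):
        return True
    # The venv module and newer virtualenv releases keep the original
    # installation in sys.base_prefix, which then differs from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtualenv:", in_virtualenv())
```

If the interpreter path does not point into ~/venv, the activate line probably has not been run in this shell.
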
Installing the necessary packages.

A requirements.txt file listing all packages used for this project is included in the repository. To install them, first make sure your virtual environment is activated, then run:
```
pip install -r ~/text-analysis/requirements.txt
```
If there are other packages you want to use, install them and update the requirements.txt file with this command:
```
pip freeze > ~/text-analysis/requirements.txt
```
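
Because `pip freeze` rewrites the whole file, it can be worth checking what changed before committing. The helper below is a hedged sketch, not something shipped in this repository: `read_requirements` is a hypothetical name, and it only understands the plain `name==version` lines that `pip freeze` normally writes (option lines such as `-e ...` are skipped).

```python
from __future__ import print_function


def read_requirements(path):
    """Return a dict mapping package name -> pinned version (None if unpinned)."""
    pins = {}
    with open(path) as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if not line or line.startswith("-"):  # skip blanks and pip options
                continue
            name, sep, version = line.partition("==")
            pins[name.strip().lower()] = version.strip() if sep else None
    return pins
```

Reading the file before and after running `pip freeze` and comparing the two dictionaries shows which packages were added or changed version.
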
Installing NLTK data.

There is a script in the misc folder for downloading all the data for NLTK. To run it:
```
python ~/text-analysis/misc/download_nltk_data.py
```
If you use other NLTK resources, add them to the list of downloads in that script.
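
For orientation, a download script of this kind might look roughly like the sketch below. This is an assumption about its general shape, not the contents of misc/download_nltk_data.py: the list `NLTK_PACKAGES` and the function `download_nltk_data` are hypothetical, and the real script may download a different set of resources. The only NLTK call used is `nltk.download`, which returns True on success.

```python
from __future__ import print_function

# Hypothetical list; the real script in misc/ may download a different set.
NLTK_PACKAGES = ["punkt", "stopwords", "wordnet"]


def download_nltk_data(packages, downloader=None):
    """Download each named NLTK resource and return the names that failed."""
    if downloader is None:
        import nltk  # imported lazily so this module loads without NLTK
        downloader = nltk.download
    failed = []
    for name in packages:
        # nltk.download returns True on success and False otherwise.
        if not downloader(name):
            failed.append(name)
    return failed
```

Calling `download_nltk_data(NLTK_PACKAGES)` with the virtualenv active fetches the resources; adding a new resource is then a one-line change to the list.
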

Interacting with Journals

See example_cleaning_journals.py for how to iterate over all the journals and call each journal's clean_journal() method.
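
The general pattern is a simple loop over journal objects. The sketch below only illustrates that pattern under stated assumptions: the `Journal` class here is a hypothetical stand-in (the real classes live in journals.py and journals_manager.py and may have different constructors), and its placeholder cleaning step is not the repository's actual cleaning logic. Only the method name `clean_journal()` comes from this README.

```python
from __future__ import print_function


class Journal(object):
    """Hypothetical stand-in for the journal class defined in journals.py."""

    def __init__(self, text):
        self.text = text
        self.cleaned = None

    def clean_journal(self):
        # Placeholder cleaning: collapse whitespace and lowercase.
        # The real clean_journal() in this repository may do something else.
        self.cleaned = " ".join(self.text.split()).lower()
        return self.cleaned


def clean_all(journals):
    """Call clean_journal() on every journal, yielding results one at a time."""
    for journal in journals:
        yield journal.clean_journal()


if __name__ == "__main__":
    sample = [Journal("  First  ENTRY "), Journal("Second\nentry")]
    for cleaned in clean_all(sample):
        print(cleaned)
```

Using a generator keeps only one cleaned journal in memory at a time, which matters when iterating over a large collection.
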

An example of running this in parallel is in the works.

Modules

parse_journal - for parsing out the text from the journal.json files.
