Clone repo:
git clone [email protected]:w4nderlust/design4emergency.git
Enter repo directory:
cd design4emergency
Create virtualenv:
virtualenv -p python3 venv
Activate the virtualenv:
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Download the spaCy Italian model (needed for lemmatization):
python -m spacy download it_core_news_sm
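Optionally, verify that the model loads correctly (this check is not part of the original setup; spacy.load raises an error if the model is missing):
python -c "import spacy; spacy.load('it_core_news_sm')"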
Install SentITA:
git clone https://github.com/w4nderlust/SentITA.git
Download the pretrained model from Google Drive:
gdown https://drive.google.com/uc?id=1IN-RZL-gpgzuosr-BknKtA6reDJ48XNI
Move test_sentita_lstm-cnn_wikiner_v1.h5 into the SentITA/sentita folder:
mv test_sentita_lstm-cnn_wikiner_v1.h5 SentITA/sentita
Install SentITA:
pip install ./SentITA
Remove the SentITA folder:
rm -r SentITA
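Optionally, check that the package is importable (a quick sanity check, assuming the installed module is named sentita like the folder above):
python -c "import sentita"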
Download data.tsv and place it into the data folder.
Command:
python text_analysis.py data/data.tsv column_name column_name ...
Example:
python text_analysis.py data/data.tsv "Cosa ti fa più paura?" "Cosa ti fa stare bene?"
For more parameters, check:
usage: text_analysis.py [-h] [-g GROUPS [GROUPS ...]] [-l LANGUAGE] [-lm]
[-nr NGRAM_RANGE] [-w NUM_WORDS] [-t NUM_TOPICS]
[-m MANUAL_MAPPINGS] [-gw] [-wc WORD_CLOUD_FILENAME]
[-fw FREQUENT_WORDS_FILENAME]
[-fwp FREQUENT_WORDS_PLOT_FILENAME]
[-ttw TOP_TFIDF_WORDS_FILENAME]
[-ttwp TOP_TFIDF_WORDS_PLOT_FILENAME] [-pt]
[-tf TOPICS_FILENAME] [-ptf PREDICTED_TOPICS_FILENAME]
[-lv LDAVIS_FILENAME_PREFIX] [-s]
[-ps PREDICTED_SENTIMENT_FILENAME] [-o OUTPUT_PATH]
data_path columns [columns ...]
This script analyzes text in columns of a TSV file
positional arguments:
data_path path to the data TSV
columns columns to extract from TSV
optional arguments:
-h, --help show this help message and exit
-g GROUPS [GROUPS ...], --groups GROUPS [GROUPS ...]
columns from the TSV to use for grouping
-l LANGUAGE, --language LANGUAGE
language of the text in the data (for data cleaning
purposes)
-lm, --lemmatize performs lemmatization of all texts
-nr NGRAM_RANGE, --ngram_range NGRAM_RANGE
minimum and maximum value for ngrams, specify as
"min,max"
-w NUM_WORDS, --num_words NUM_WORDS
number of most frequent words to show
-t NUM_TOPICS, --num_topics NUM_TOPICS
number of topics for topic modeling
-m MANUAL_MAPPINGS, --manual_mappings MANUAL_MAPPINGS
path to JSON file containing manual mappings
-gw, --generate_word_cloud
generates word cloud plots
-wc WORD_CLOUD_FILENAME, --word_cloud_filename WORD_CLOUD_FILENAME
path to save the word cloud to
-fw FREQUENT_WORDS_FILENAME, --frequent_words_filename FREQUENT_WORDS_FILENAME
path to save frequent words to
-fwp FREQUENT_WORDS_PLOT_FILENAME, --frequent_words_plot_filename FREQUENT_WORDS_PLOT_FILENAME
path to save the frequent word plot to
-ttw TOP_TFIDF_WORDS_FILENAME, --top_tfidf_words_filename TOP_TFIDF_WORDS_FILENAME
path to save top tfidf words to
-ttwp TOP_TFIDF_WORDS_PLOT_FILENAME, --top_tfidf_words_plot_filename TOP_TFIDF_WORDS_PLOT_FILENAME
path to save the top tfidf word plot to
-pt, --predict_topics
learns topics and predicts them for each text (pretty
slow)
-tf TOPICS_FILENAME, --topics_filename TOPICS_FILENAME
path to save the topics to
-ptf PREDICTED_TOPICS_FILENAME, --predicted_topics_filename PREDICTED_TOPICS_FILENAME
path to save predicted LDA topics for each datapoint
to
-lv LDAVIS_FILENAME_PREFIX, --ldavis_filename_prefix LDAVIS_FILENAME_PREFIX
path (prefix) to save LDA vis plot files to
-s, --predict_sentiment
performs sentiment analysis (it is pretty slow)
-ps PREDICTED_SENTIMENT_FILENAME, --predicted_sentiment_filename PREDICTED_SENTIMENT_FILENAME
path to save predicted sentiment for each datapoint to
-o OUTPUT_PATH, --output_path OUTPUT_PATH
path that will contain all directories, one for each
column
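As an illustration, a more complete run that lemmatizes, groups by an extra column, and also produces word clouds, topics and sentiment could be invoked as follows (the grouping column "Regione" and the output directory outputs are placeholders, not columns or paths defined by the repository):
python text_analysis.py data/data.tsv "Cosa ti fa più paura?" -lm -g "Regione" -gw -pt -s -o outputs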
Outputs:
The script outputs one directory for each specified column, containing the following files:
- frequent_words_filename (frequent_words.json): a JSON file containing all non-stopword words in the text of the specified column with their respective frequency
- frequent_words_plot_filename (frequent_words.png): a PNG file containing a plot of the top k most frequent non-stopword words in the text of the column
- top_tfidf_words_filename (top_tfidf_words.json): a JSON file containing all non-stopword words in the text of the specified column with their respective tfidf
- top_tfidf_words_plot_filename (top_tfidf_words.png): a PNG file containing a plot of the top k tfidf non-stopword words in the text of the column
- if generate_word_cloud is specified:
  - word_cloud_filename (wordcloud.png): a PNG file of a word cloud of the text in the specified column
- if predict_topics is specified:
  - topics_filename (topics.json): a JSON file containing, for each topic and each word, the probability of that word given that topic
  - predicted_topics_filename (predicted_topics.csv): a CSV file containing one column for each topic and, for each row in the input TSV file, the probability that the text in that column belongs to each topic
  - ldavis_filename_prefix (ldavis_):
    - ldavis_filename_prefix_N: N is the number of topics; a Python pickle file containing the data needed to generate an HTML LDA visualization
    - ldavis_filename_prefix_N.html: an HTML visualization of the LDA topic distribution
- if predict_sentiment is specified:
  - predicted_sentiment_filename (predicted_sentiment.csv): a CSV file containing a positive and a negative column with, for each row in the original TSV file, independent probability values for positive and negative sentiment (two low probabilities mean neutral, two high probabilities mean ambivalent). This is returned only if --predict_sentiment is provided, as calculating the sentiment can be pretty slow
- if groups is specified:
  - word_cloud_filename, frequent_words_filename, frequent_words_plot_filename, top_tfidf_words_filename and top_tfidf_words_plot_filename will be repeated for each value of each group, with a [group]_[value]_ prefix
  - two additional JSON files per group, collecting word frequencies and tfidf across all values: [group]_frequent_words.json and [group]_tfidf_words.json
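To give a sense of how these prediction files can be consumed, here is a minimal pandas sketch (not part of the repository) that assigns each row its dominant topic and buckets the sentiment as described above. The 0.5 threshold and the column names positive and negative are assumptions, as is the fact that predicted_topics.csv contains only topic-probability columns.

import pandas as pd

# Minimal sketch: read the default output files described above.
topics = pd.read_csv("predicted_topics.csv")        # assumed: only topic-probability columns
sentiment = pd.read_csv("predicted_sentiment.csv")  # assumed: "positive" and "negative" columns

# Dominant topic per row: the column with the highest probability.
dominant_topic = topics.idxmax(axis=1)

# Bucket the two independent probabilities as explained above
# (0.5 is an arbitrary threshold, not one used by the script).
def bucket(row, threshold=0.5):
    pos, neg = row["positive"] > threshold, row["negative"] > threshold
    if pos and neg:
        return "ambivalent"
    if pos:
        return "positive"
    if neg:
        return "negative"
    return "neutral"

print(dominant_topic.value_counts())
print(sentiment.apply(bucket, axis=1).value_counts())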
Command:
python manual_classifier.py data/data.tsv column_name column_name ... --manual_classes manual_classes.json
Example:
python manual_classifier.py "/home/piero/data/DfE_Dataset_Corriere_1600_Danilo - Form Responses 1.tsv" "Cosa ti ha convinto del fatto che rimanere in casa è necessario?" --manual_classes manual_classes.json
For more parameters, check:
usage: manual_classifier.py [-h] -mc MANUAL_CLASSES [-l LANGUAGE] [-lm]
[-mm MANUAL_MAPPINGS]
[-pc PREDICTED_CLASSES_FILENAME]
[-o OUTPUT_PATH]
data_path columns [columns ...]
This script creates a classifier from a bag of words using embeddings and uses
it to predict classes
positional arguments:
data_path path to the data TSV
columns columns to extract from TSV
optional arguments:
-h, --help show this help message and exit
-mc MANUAL_CLASSES, --manual_classes MANUAL_CLASSES
path to JSON file containing manual classes
-l LANGUAGE, --language LANGUAGE
language of the text in the data (for data cleaning
purposes)
-lm, --lemmatize performs lemmatization of all texts
-mm MANUAL_MAPPINGS, --manual_mappings MANUAL_MAPPINGS
path to JSON file containing manual mappings
-pc PREDICTED_CLASSES_FILENAME, --predicted_classes_filename PREDICTED_CLASSES_FILENAME
path to save predicted classes for each datapoint to
-o OUTPUT_PATH, --output_path OUTPUT_PATH
path that will contain all directories, one for each
column
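The exact schema expected by --manual_classes is defined by the script and is not documented here; as a rough, unverified sketch, a manual_classes.json mapping each class name to a bag of seed words might look like the following (class names and words are invented examples):

{
  "famiglia": ["famiglia", "figli", "genitori", "casa"],
  "salute": ["salute", "virus", "contagio", "ospedale"],
  "lavoro": ["lavoro", "stipendio", "azienda", "ufficio"]
}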