- The scraper is based on Pushshift.
- pandas and nltk were used to filter the data.
- In total, 100174 documents were collected.
Date range: 08-05-2019 to 20-04-2021 (DD-MM-YYYY)
Preprocessing steps:
- Lowercase the text
- Remove non-ASCII (Unicode) characters
- Remove stop words
- Remove mentions
- Remove URLs
- Remove hashtags
- Remove ticks and the character that follows
- Remove punctuation
- Remove numbers
- Collapse repeated spaces into one
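
The steps above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the regular expressions, the order of the steps, and the tiny `STOP_WORDS` set are assumptions (the project used nltk, whose stop-word list is much larger).

```python
import re

# Hypothetical, deliberately tiny stop-word set; the project used nltk's list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "i", "my"}


def preprocess(text: str) -> str:
    """Apply the cleaning steps listed above to one document."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    text = re.sub(r"['’`]\w?", "", text)                 # ticks and the next character
    text = text.encode("ascii", "ignore").decode()       # drop non-ASCII characters
    text = re.sub(r"[^\w\s]|_", " ", text)               # punctuation
    text = re.sub(r"\d+", " ", text)                     # numbers
    # split() discards repeated whitespace; stop words are filtered last here
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("Check https://example.com @bob #love I don't have 2 cats!!!")
print(cleaned)
```

Ticks are handled before the non-ASCII step so that typographic apostrophes are treated the same way as plain ones.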
Subreddits:
- relationships
- love
- family
- Marriage
- Parenting
- askwomenadvice
- DecidingToBeBetter
- depression
- SuicideWatch
- TwoXChromosomes
Plots are in the Task 2/figures folder.
I chose pretrained fastText embeddings trained on Common Crawl and Wikipedia.
The original embeddings were pruned with this lib.
The pruned embeddings and all CSVs are available on my Google Drive.
- SVC results without stemming/lemmatization ~ 0.72.
- LogisticRegression results without stemming/lemmatization ~ 0.73.
- LDA (discriminant analysis) results without stemming/lemmatization ~ 0.69.
- LDA captured the following topics: Marriage, family, SuicideWatch, Parenting, relationships.
metric - adjusted_rand_score
- KMeans on embeddings shows poor results.
- KMeans using TF-IDF ~ 0.66
- MiniBatchKMeans using TF-IDF ~ 0.50
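
As a rough illustration of how the TF-IDF clustering runs above can be scored, the sketch below clusters a toy corpus with scikit-learn's `KMeans` and compares the clusters with the known labels using `adjusted_rand_score`. The documents, labels, and parameter values are made up for the example and are not the project's data or settings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy stand-in corpus (hypothetical); labels play the role of subreddit names.
docs = [
    "husband wife wedding anniversary husband",
    "wife husband marriage wedding vows",
    "marriage husband wife counseling",
    "toddler baby sleep tantrum toddler",
    "baby toddler daycare sleep",
    "tantrum toddler baby bedtime",
]
labels = ["Marriage"] * 3 + ["Parenting"] * 3

# Turn each document into a TF-IDF vector.
X = TfidfVectorizer().fit_transform(docs)

# One cluster per subreddit; n_init restarts reduce the risk of a poor local optimum.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
pred = km.fit_predict(X)

# ARI is invariant to how cluster ids are numbered; 1.0 means a perfect match.
ari = adjusted_rand_score(labels, pred)
print(f"ARI: {ari:.2f}")
```

`MiniBatchKMeans` from the same module takes the same `n_clusters` argument and can be swapped in for larger collections.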
metric - R^2
- LinearRegression results ~ 0.53
- RandomForestRegressor results ~ 0.58
- GradientBoostingRegressor results ~ 0.51
The baseline solution showed weak results. To improve the quality of the model, the following steps can be tried:
- use additional information about the post, such as the author and their popularity;
- use TF-IDF as an estimate of the importance of a word in a document;
- check the document collection for anomalies;
- tune the hyperparameters of the regression models.
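
The hyperparameter step could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV` with R^2 as the scoring metric. The synthetic data from `make_regression` and the small parameter grid are assumptions made for illustration; the real features, target, and search ranges would come from the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and target (hypothetical).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Illustrative grid only; useful ranges depend on the actual data.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",  # same metric as the baseline results
    cv=3,          # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 2))
```

The cross-validated score in `best_score_` is a fairer estimate than a single train/test split when comparing settings.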