Scraper Task 1

The scraper is based on Pushshift.
pandas and NLTK were used to filter the data.
As a result, 100174 documents were collected.

Date range: 08-05-2019 - 20-04-2021

Data preprocessing

Preprocessing steps:

Lowercase the text
Remove unicode characters
Remove stop words
Remove mentions
Remove URLs
Remove hashtags
Remove ticks (apostrophes) and the character that follows
Remove punctuation
Remove numbers
Collapse repeated whitespace
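
The steps above can be sketched as a single function. This is a minimal illustration using only the standard library, not the repository's actual code: the function name `preprocess`, the regular expressions, and the tiny `STOP_WORDS` set are assumptions (the project used NLTK, whose stop-word list is much larger). Stop-word removal is moved to the end here so that it runs on clean tokens; the original order may differ.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); the project used NLTK.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}


def preprocess(text, stop_words=STOP_WORDS):
    """Apply the listed cleaning steps to one document (hypothetical helper)."""
    text = text.lower()                                  # lowercase
    text = text.encode("ascii", "ignore").decode()       # drop non-ASCII characters
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)                    # mentions
    text = re.sub(r"#\w+", " ", text)                    # hashtags
    text = re.sub(r"'\w", "", text)                      # tick and the next character
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", " ", text)                     # numbers
    # Splitting on whitespace and re-joining also collapses repeated spaces.
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("Check THIS out @user https://example.com #love it's 100% great!!"))
```

URLs, mentions, and hashtags are removed before punctuation so that their `://`, `@`, and `#` markers are still present when the patterns run.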

Subreddits:

relationships
love
family
Marriage
Parenting
askwomenadvice
DecidingToBeBetter
depression
SuicideWatch
TwoXChromosomes

EDA Task 2

Plots are in the Task 2/figures folder.

Features

I chose pre-trained fastText embeddings, trained on Common Crawl and Wikipedia.

The original embeddings were pruned with this lib.
The pruned embeddings and all CSVs are available on my Google Drive.

Tested Models Task 3

Classification

SVC results without stemming/lemmatization ~ 0.72.
LogisticRegression results without stemming/lemmatization ~ 0.73.
LDA (discriminant analysis) results without stemming/lemmatization ~ 0.69.

Topic modeling

LDA recovered the following topics: Marriage, family, SuicideWatch, Parenting, relationships.

Clustering

Metric: adjusted_rand_score

KMeans on embeddings shows poor results.
KMeans on TF-IDF ~ 0.66
MiniBatchKMeans on TF-IDF ~ 0.50
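
To make the clustering scores easier to interpret, here is a small pure-Python sketch of what the adjusted Rand index measures: pairwise agreement between the true labels (here, subreddits) and the cluster assignments, corrected for chance. The function name `adjusted_rand_index` is hypothetical; in practice scikit-learn's `adjusted_rand_score` would be used, and this sketch is only meant to show the formula.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair agreement between two labelings (illustrative)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # trivial case: there are no pairs to disagree on

    # Pairs of samples placed together in both labelings.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs placed together within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Cluster ids are arbitrary, so a relabeled perfect clustering still scores 1.0.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A score of 1.0 means the clusters match the subreddit labels exactly, a score near 0 means the agreement is no better than random assignment, and negative values are possible.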

Regression baseline model Task 4

Metric: R^2

LinearRegression results ~ 0.53
RandomForestRegressor results ~ 0.58
GradientBoostingRegressor results ~ 0.51
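
For reference, R^2 compares the model's squared error with that of a constant predictor that always outputs the mean of the targets. The sketch below is a plain-Python illustration with a hypothetical function name `r_squared` and made-up numbers; it is not the evaluation code used for the results above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # mean-baseline error
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up targets and predictions, for illustration only.
    print(round(r_squared([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]), 3))
```

Under this definition, a score of about 0.58 means the model removes a little over half of the squared error left by always predicting the mean.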

Conclusion:

The baseline solution showed weak results. To improve the quality of the model, the following steps can be tried:

use additional information about the publication, such as the author and their popularity;
use TF-IDF as an estimate of the importance of a word in a document;
check the document collection for anomalies;
tune the hyperparameters of the regression models.
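
As an illustration of the TF-IDF suggestion, the sketch below computes the textbook weighting (term frequency times the log of the inverse document frequency) for tokenized documents. The function name `tf_idf` and the toy documents are assumptions. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and normalize each row by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    toy_docs = [["family", "love"], ["family", "marriage"]]
    for row in tf_idf(toy_docs):
        print(row)
```

A term that appears in every document, such as "family" in the toy data, gets a weight of zero, while terms specific to one document get positive weights.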

kirill98731/reddit_topics
