kirill98731/reddit_topics

Scraper Task 1


  • The scraper is based on Pushshift.
  • pandas and NLTK were used to filter the data.
  • The resulting collection contains 100174 documents.

Date range: 08-05-2019 - 20-04-2021

Data preprocessing

Preprocessing steps:

  • Lowercase the text
  • Remove unicode characters
  • Remove stop words
  • Remove mentions
  • Remove URLs
  • Remove hashtags
  • Remove apostrophes (ticks) and the character that follows
  • Remove punctuation
  • Remove numbers
  • Collapse extra whitespace
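
The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical (the NLTK English stop-word list would be the natural replacement), "ticks" is read as apostrophes, and stop words are removed last here so that the same split also collapses whitespace.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); swap in NLTK's English list in practice.
STOP_WORDS = {"a", "an", "the", "is", "and", "to", "of", "in", "i", "my"}

# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> str:
    text = text.lower()                                     # lowercase
    text = text.encode("ascii", "ignore").decode("ascii")   # drop non-ASCII characters
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # URLs
    text = re.sub(r"[@#]\w+", " ", text)                    # mentions and hashtags
    text = re.sub(r"'\w+", "", text)                        # apostrophe plus trailing letters
    text = text.translate(PUNCT_TABLE)                      # remaining punctuation
    text = re.sub(r"\d+", " ", text)                        # numbers
    # split() discards runs of whitespace, so joining collapses extra spaces.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("Check THIS out: https://example.com @bob #sad I'm 25 years   old!")
print(cleaned)
```

URLs are stripped before punctuation on purpose: once the slashes and dots are gone, a URL can no longer be recognized as one.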

Subreddits:

  • relationships
  • love
  • family
  • Marriage
  • Parenting
  • askwomenadvice
  • DecidingToBeBetter
  • depression
  • SuicideWatch
  • TwoXChromosomes

EDA Task 2

Plots are in the Task 2/figures folder.

Features

I chose pretrained fastText word embeddings, trained on Common Crawl and Wikipedia.

The original embeddings were pruned with this lib.
The pruned embeddings and all CSVs are available on my Google Drive.
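
The README does not say how word embeddings become per-document features. One common approach, shown here purely as an assumption, is to average the vectors of the words in a document. The helper `document_vector` and the 3-dimensional `toy_vectors` are hypothetical stand-ins for the pruned fastText vectors.

```python
from typing import Dict, List


def document_vector(tokens: List[str],
                    vectors: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the vectors component by component.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy vectors standing in for real embeddings (assumption).
toy_vectors = {"love": [1.0, 0.0, 2.0], "family": [3.0, 2.0, 0.0]}
doc = document_vector(["love", "family", "unknownword"], toy_vectors, dim=3)
print(doc)
```

Out-of-vocabulary tokens are skipped rather than mapped to zeros, so they do not pull the average toward the origin.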

Tested Models Task 3

Classification

  • SVC results without stemming/lemmatization ~ 0.72.
  • LogisticRegression results without stemming/lemmatization ~ 0.73.
  • LDA (discriminant analysis) results without stemming/lemmatization ~ 0.69.

Topic modeling

  • LDA captured the topics Marriage, family, SuicideWatch, Parenting, and relationship.

Clustering

metric - adjusted_rand_score

  • KMeans using embeddings shows poor results.
  • KMeans using TF-IDF ~ 0.66
  • MiniBatchKMeans using TF-IDF ~ 0.50
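
To make the metric concrete, here is a small pure-Python sketch of the adjusted Rand index, the quantity that scikit-learn's `adjusted_rand_score` reports. It is an illustration only, and the function name `adjusted_rand_index` is hypothetical. The score compares predicted cluster assignments with the true subreddit labels, is 1.0 for a perfect match up to renaming of clusters, and is close to 0 for random assignments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # true class sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes

    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)     # value expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (index - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

Because the score only looks at which pairs of documents share a cluster, the cluster IDs produced by KMeans do not need to line up with the subreddit label IDs.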

Regression baseline model Task 4

metric - R^2

  • LinearRegression results ~ 0.53
  • RandomForestRegressor results ~ 0.58
  • GradientBoostingRegressor results ~ 0.51
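
For reference, R^2 compares the model's squared error with that of always predicting the mean target. The sketch below is a plain-Python illustration with a hypothetical function name `r_squared`. It assumes the targets are not all identical, since the denominator would otherwise be zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - ss_res / ss_tot


perfect = r_squared([1, 2, 3, 4], [1, 2, 3, 4])
mean_only = r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```

A score of about 0.5 therefore means the model removes roughly half of the squared error of a constant mean prediction.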

Conclusion:

The baseline solution showed weak results. The following steps may improve the quality of the model:

  • use additional information about the publication, such as the author and their popularity;
  • use TF-IDF as an estimate of the importance of a word in a document;
  • check the document collection for anomalies;
  • tune the hyperparameters of the regression models.
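
As an illustration of the TF-IDF suggestion, the sketch below computes the textbook weight tf(t, d) * log(N / df(t)) for tokenized documents. The function name `tf_idf` and the toy documents are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF by default, so their numbers differ from this plain variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


toy_docs = [["love", "family"], ["love", "marriage"]]
weights = tf_idf(toy_docs)
print(weights[0])
```

A term that appears in every document, like "love" here, gets weight 0, while terms specific to one document get a positive weight.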

Links to sources
