GitHub - Netjimmy/Store-Shutdown-Prediction-by-Yelp-Dataset: NYU Data Science DS-GA 1003: Machine Learning Course Project

Abstract

We build a business failure model to predict whether a business will survive another half-year. The data comes from "The Yelp Dataset Challenge 2017". Using data collected from Yelp users, our model achieves promising results (Kappa coefficient = 0.156) on the business failure classification problem. In addition, PCA can further improve performance by projecting our features into a lower-dimensional space. For more details, please refer to the poster and report.

Data Understanding

We use data from "The Yelp Dataset Challenge 2017", an open data competition held by Yelp. The dataset contains 140k businesses and 4.1 million reviews across 11 cities in four countries. The total data size is around 4.8 GB. The original dataset has three major JSON files of interest: business, reviews, and users.

Feature Selection and Generation

Text-based Features

For unstructured data such as reviews, we tokenized the text into words and calculated positive and negative scores. We also selected the 100 most frequent nouns and adjectives as input features.
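
The scoring step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny word lists, the fraction-of-tokens scoring rule, and the function names are all assumptions, and the part-of-speech filter for nouns and adjectives is left out because it would need an external tagger.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons for illustration only; a real run would load
# a full sentiment lexicon instead.
POSITIVE_WORDS = {"good", "great", "friendly", "delicious"}
NEGATIVE_WORDS = {"bad", "slow", "rude", "dirty"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_scores(text):
    """Return (positive, negative) scores as the fraction of tokens
    found in each lexicon. The scoring rule is an assumption."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, 0.0
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return pos / len(tokens), neg / len(tokens)


def top_k_words(reviews, k=100):
    """Return the k most frequent tokens across all reviews.
    No part-of-speech filtering is applied in this sketch."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return [word for word, _ in counts.most_common(k)]
```

Each word returned by `top_k_words` would then become one input column, for example the count of that word in a business's reviews.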

Features of Temporal Information

We ran a linear regression on each business's star-rating time series and used the fitted slope as the feature star_trend.
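
As a rough sketch of this idea, the slope of an ordinary least-squares line through (time, stars) points can be computed directly. The time unit (for example, months since the first review) and the handling of degenerate inputs are assumptions here, not details taken from the project.

```python
def star_trend(times, stars):
    """Least-squares slope of star ratings against time.

    times: numeric time stamps (the unit is an assumption, e.g. months)
    stars: star ratings observed at those times
    Returns 0.0 when the slope is undefined (fewer than two points,
    or all points at the same time); that fallback is an assumption.
    """
    n = len(times)
    if n < 2 or n != len(stars):
        return 0.0
    mean_t = sum(times) / n
    mean_s = sum(stars) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, stars))
    return cov / var_t
```

A negative value indicates ratings that decline over time, and a positive value indicates improving ratings.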

Features of Spatial Information

We counted the stores in the same category within 1 mile of each business as competitors and generated the feature competitor_total.
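
One way to sketch this count is with the haversine great-circle distance. The dictionary keys (`id`, `lat`, `lon`, `categories`) are hypothetical, and treating "same category" as "shares at least one category label" is an assumption; the project may define it differently.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def competitor_total(target, businesses, radius_miles=1.0):
    """Count other businesses that share a category with `target`
    and lie within `radius_miles` of it."""
    target_cats = set(target["categories"])
    count = 0
    for other in businesses:
        if other["id"] == target["id"]:
            continue  # a business is not its own competitor
        if not target_cats & set(other["categories"]):
            continue  # no category in common
        dist = haversine_miles(target["lat"], target["lon"], other["lat"], other["lon"])
        if dist <= radius_miles:
            count += 1
    return count
```

This brute-force loop compares every pair of businesses, so on the full dataset a spatial index or a grid-based pre-grouping would likely be needed.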

Model Comparison

Because the data is imbalanced, we evaluate models with the Kappa coefficient. See Wikipedia for its definition.

Random Forest gives the best Kappa coefficient (0.156), with an AUC of 0.564, while Logistic Regression has the best AUC. The two metrics disagree because the Kappa score emphasizes agreement between the predicted labels and the true labels, whereas AUC measures how well the predicted scores rank the two classes across all thresholds.
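
Assuming the metric meant here is Cohen's kappa, it compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a small sketch of that formula; the function name and the choice to return 0.0 when p_e equals 1 are assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true labels and predicted labels."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of exact label matches.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum((true_counts[l] / n) * (pred_counts[l] / n) for l in labels)
    if p_e == 1.0:
        return 0.0  # kappa is undefined here; returning 0.0 is a convention chosen for this sketch
    return (p_o - p_e) / (1 - p_e)
```

On imbalanced data, a classifier that always predicts the majority class can reach high accuracy but gets a kappa of 0, because all of its agreement is explained by chance.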

Folders and files

data_id
dicts
CrossValidator.py
README.md
build_dicts.py
config.py
feature.ipynb
feature.py
main_spark.py
model.py
stores_shutdown_prediction_poster.pdf
stores_shutdown_prediction_report.pdf
utils.py
