We build a business failure model to predict whether a business will survive another half-year. The data comes from the Yelp Dataset Challenge 2017. Using data collected from Yelp users, our model achieves promising results (Kappa coefficient = 0.156) on the business failure classification problem. In addition, PCA can further improve performance by projecting the features into a lower-dimensional space. For more details, please refer to the poster and report.
We use data from the Yelp Dataset Challenge 2017, an open data competition held by Yelp. The dataset contains 140k businesses and 4.1 million reviews across 11 cities in four countries. The total data size is around 4.8 GB. The original dataset has three major JSON files of interest: business, reviews, and users.
For unstructured data such as reviews, we tokenized the text into words and calculated positive and negative scores. We also selected the 100 most frequent nouns and adjectives as input features.
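As a rough illustration of the review features, the sketch below tokenizes a review, scores it against two small word lists, and counts the most frequent words. The word lists, the function names, and the scoring rule (share of tokens found in each list) are assumptions made for this example and are not taken from the project. The sketch also skips the part-of-speech filtering that would be needed to keep only nouns and adjectives.

```python
import re
from collections import Counter

# Hypothetical toy lexicons; the project does not name its word lists.
POSITIVE_WORDS = {"good", "great", "friendly", "delicious"}
NEGATIVE_WORDS = {"bad", "slow", "rude", "cold"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_scores(text):
    """Return (positive, negative) scores as fractions of all tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, 0.0
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return pos / len(tokens), neg / len(tokens)


def top_words(reviews, k=100):
    """Return the k most frequent tokens across all reviews.

    No part-of-speech filtering is applied here, so this counts every
    word rather than only nouns and adjectives.
    """
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return [word for word, _ in counts.most_common(k)]
```

Each word returned by `top_words` could then become one input column, for example the number of times it appears in a business's reviews.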
We ran a linear regression on each business's star-rating time series and used the fitted slope as the feature star_trend.
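A minimal sketch of how a trend feature like star_trend could be computed, assuming it is the ordinary least-squares slope of star ratings against time. The function signature and the choice of time unit are assumptions for this example, not details from the project.

```python
def star_trend(times, stars):
    """Least-squares slope of star ratings regressed on time.

    `times` can be any numeric time axis (for example, months since the
    first review); the slope is expressed in stars per time unit.
    """
    n = len(times)
    if n < 2:
        return 0.0  # not enough points to fit a line
    mean_t = sum(times) / n
    mean_s = sum(stars) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        return 0.0  # all reviews share one timestamp
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, stars))
    return sxy / sxx
```

A negative value means ratings are falling over time, and a positive value means they are rising.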
We count the stores in the same category within 1 mile as competitors, which gives the feature competitor_total.
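The competitor count could be sketched as follows, using the haversine great-circle distance. The dictionary keys (`business_id`, `latitude`, `longitude`, `categories`), the rule that two stores compete when they share at least one category, and the brute-force loop are assumptions for this example. On the full dataset a spatial index would be a more practical choice.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def competitor_total(target, businesses, radius_miles=1.0):
    """Count other businesses that share a category and lie within the radius."""
    count = 0
    for other in businesses:
        if other["business_id"] == target["business_id"]:
            continue  # a business is not its own competitor
        if not set(target["categories"]) & set(other["categories"]):
            continue  # assumption: competitors share at least one category
        distance = haversine_miles(
            target["latitude"], target["longitude"],
            other["latitude"], other["longitude"],
        )
        if distance <= radius_miles:
            count += 1
    return count
```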
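Because the data is imbalanced, we evaluate models with the Kappa coefficient, which measures agreement between predictions and true labels beyond what chance alone would produce. See Wikipedia for the definition.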
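For reference, here is a small sketch of the metric, assuming the Kappa coefficient here means Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the label frequencies. The function name is an assumption for this example.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true labels and predicted labels."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction equals truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: computed from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1:
        return 0.0  # both sides use a single identical label
    return (p_o - p_e) / (1 - p_e)
```

This shows why kappa suits imbalanced data: if 9 of 10 businesses survive, a model that always predicts survival reaches 90% accuracy but a kappa of 0.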
Random Forest gives the best Kappa coefficient (0.156), with an AUC of 0.564, while Logistic Regression has the best AUC. The reason for this inconsistency is that the Kappa score emphasizes agreement between the predicted labels and the true values, whereas AUC measures how well the model ranks positives above negatives across all thresholds.