In this project, I built and tested six supervised machine learning algorithms to predict the success of Kickstarter projects: logistic regression, k-nearest neighbors, a classification tree, random forest (RF), gradient boosting (GBT), and an artificial neural network (ANN).
Kickstarter is a platform where creators share their project visions with the communities that will come together to fund them.
For Kickstarter's management, predicting success means planning ahead. My model predicts which projects are likely to succeed, which can guide staff picks toward projects worthy of the spotlight and, in turn, increase the visibility and popularity of the platform.
I followed these steps to build and test the models:
Data Exploration:
- Explored the data and found that US projects accounted for 71% of the data, so I grouped all other countries as "Non-US".
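The grouping step can be sketched as follows. This is a minimal illustration on made-up rows, not the original code; the column name country and the helper group_countries are assumptions introduced here.

```python
import pandas as pd


def group_countries(df: pd.DataFrame, column: str = "country") -> pd.DataFrame:
    """Return a copy where every value other than "US" becomes "Non-US"."""
    out = df.copy()
    # Series.where keeps values where the condition holds and replaces the rest.
    out[column] = out[column].where(out[column] == "US", "Non-US")
    return out


# Made-up sample rows; the column name "country" is an assumption.
sample = pd.DataFrame({"country": ["US", "GB", "US", "CA", "US"]})
us_share = (sample["country"] == "US").mean()  # share of US rows before grouping
grouped = group_countries(sample)
print(grouped["country"].value_counts())
```

Collapsing the minority countries into one level keeps the later dummy encoding small, at the cost of losing any differences between non-US countries.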
Data Cleansing:
- Dropped name_len and blurb_len, keeping the cleaned versions.
- Handled the strong correlation between pledged and usd_pledged by dropping the former.
- Created a new column, goal_usd, by multiplying goal by static_usd_rate.
- Addressed missing values in category, and excluded observations with a state other than 'successful' or 'failed'.
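A rough sketch of these cleansing steps on made-up rows is shown below. The helper clean_projects is hypothetical, the column names and the state labels "successful" and "failed" are assumed to match the dataset, and dropping rows with a missing category is only one possible way to address the missing values.

```python
import pandas as pd


def clean_projects(df: pd.DataFrame,
                   keep_states=("successful", "failed")) -> pd.DataFrame:
    """Apply the cleansing steps described above to a copy of df."""
    out = df.copy()
    # Express every goal in US dollars.
    out["goal_usd"] = out["goal"] * out["static_usd_rate"]
    # Drop the uncleaned length columns and the redundant pledged column.
    out = out.drop(columns=["name_len", "blurb_len", "pledged"], errors="ignore")
    # Assumption: rows with a missing category are removed rather than imputed.
    out = out.dropna(subset=["category"])
    # Keep only the two outcomes the classifier should learn.
    out = out[out["state"].isin(keep_states)]
    return out.reset_index(drop=True)


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "goal": [1000.0, 500.0, 200.0, 50.0],
    "static_usd_rate": [1.0, 1.5, 1.0, 2.0],
    "pledged": [1200.0, 10.0, 0.0, 80.0],
    "usd_pledged": [1200.0, 15.0, 0.0, 160.0],
    "category": ["Web", None, "Hardware", "Apps"],
    "state": ["successful", "failed", "canceled", "failed"],
})
cleaned = clean_projects(raw)
print(cleaned)
```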
Feature Engineering:
- Excluded irrelevant features such as id and name, hourly details, original date columns, and weekday columns.
- The goal of this project is to classify a new project as successful or not based on the information available when the project owner submits it, so the model should only use predictors available at that time. Hence, I removed 12 columns not available at project submission, including pledged, usd_pledged, disable_communication, state_changed_at, staff_pick and spotlight.
- After separating the target state, I created dummies from 17 features, resulting in 39 predictors, and eliminated 3 with a correlation of 0.80 or higher. I tried multiple train-test splits; a 2:1 split gave the best accuracy across all models.
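The encoding, correlation filter and 2:1 split could look roughly like the sketch below. It runs on a tiny made-up frame; drop_correlated, the column goal_doubled and the random seed are assumptions for illustration, and the order of the steps is one plausible reading of the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def drop_correlated(X: pd.DataFrame, threshold: float = 0.80) -> pd.DataFrame:
    """Drop the later column of every pair with |correlation| >= threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return X.drop(columns=to_drop)


# Made-up data: goal_doubled is a hypothetical, perfectly correlated column.
goal = np.arange(1, 13) * 100.0
data = pd.DataFrame({
    "goal_usd": goal,
    "goal_doubled": goal * 2,
    "country": ["US", "Non-US"] * 6,
    "state": ["successful", "failed", "failed"] * 4,
})

y = (data["state"] == "successful").astype(int)      # separate the target
X = pd.get_dummies(data.drop(columns="state"),
                   columns=["country"], dtype=float)  # dummy encoding
X = drop_correlated(X, threshold=0.80)

# A 2:1 split corresponds to holding out one third of the rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)
print(list(X.columns), len(X_train), len(X_test))
```

In this toy frame the filter removes goal_doubled and one of the two complementary country dummies, which mirrors why a few of the encoded predictors end up being redundant.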
Model Training: After splitting the dataset, I trained six classification models and chose accuracy as the primary performance metric, since the aim is to predict both successes and failures correctly. I also trained each of the six models on LASSO-selected features and on PCA-selected components, but since those gave lower accuracy for RF and GBT, I kept the initial feature list for the final model.
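A comparison of the six model families by accuracy might be set up along these lines with scikit-learn. This is a sketch on synthetic data, not the original experiment: the hyperparameters, the scaling pipelines and the random seeds are all assumptions, and the accuracies it prints say nothing about the Kickstarter results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared predictors and the binary target.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y
)

# Assumed settings; distance- and gradient-based models get scaled inputs.
models = {
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "K-nearest neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Classification tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "Neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)),
}

# Fit each model on the training part and score it on the held-out part.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

Keeping every model behind the same fit/predict interface makes it straightforward to rerun the loop on a reduced feature set and compare the two accuracy tables.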
Top Performer: The Gradient Boosting (GBT) algorithm emerged as the top performer, with the highest accuracy at 75.30%.
GBT generates a large number of trees and grows them sequentially, each tree learning from the one before it, so it places greater emphasis on observations with large errors, which makes it well-suited for this context.
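To illustrate the sequential idea, here is a deliberately simplified boosting loop for binary classification. It is a teaching sketch on synthetic data, not scikit-learn's GradientBoostingClassifier and not the model used in this project; the function names, tree depth, learning rate and number of trees are assumptions. Each small regression tree is fitted to the current residuals, so observations the ensemble still gets badly wrong contribute the largest targets to the next tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def boost(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit trees one after another, each on the residuals left so far."""
    # Start from the log-odds of the positive class.
    base = np.log(y.mean() / (1 - y.mean()))
    scores = np.full(len(y), base)
    trees, losses = [], []
    for _ in range(n_trees):
        # Residual = negative gradient of the log loss; large where the
        # current ensemble is most wrong.
        residuals = y - sigmoid(scores)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Take a small step in the direction the new tree suggests.
        scores += learning_rate * tree.predict(X)
        trees.append(tree)
        losses.append(log_loss(y, sigmoid(scores)))
    return base, trees, losses


X, y = make_classification(n_samples=300, n_features=8, random_state=0)
base, trees, losses = boost(X, y)
print(f"training log loss after 1 tree: {losses[0]:.4f}, "
      f"after {len(trees)} trees: {losses[-1]:.4f}")
```

The training loss shrinks as trees are added because each one corrects what the previous ones left unexplained; a production implementation adds refinements such as per-leaf step sizes and subsampling.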
I applied the GBT model to predict the state of projects in kickstarter_test_dataset.xlsx and achieved an accuracy of 74.34%, close to its earlier 75.30%, which suggests the model generalizes to unseen projects.