
Built and tested 6 supervised machine learning algorithms to develop a predictive classification model that classifies 13,000+ projects as success or failure.


knayyar0416/predictive-model-classification


Classifying Kickstarter projects as success or failure

In this project, I built and tested 6 supervised machine learning algorithms, including logistic regression, k-nearest neighbors (KNN), classification tree, random forest (RF), gradient boosting (GBT), and artificial neural network (ANN), to predict the success of Kickstarter projects.

🌐 About Kickstarter

Kickstarter is a platform where creators share their project visions with the communities that will come together to fund them.

💼 Business Value

For Kickstarter's management, predicting success means planning ahead. My model predicts the success of projects and can guide staff picks, helping select the projects worthy of the spotlight, which in turn increases the visibility and popularity of the platform.

🔄 Process Overview

I followed these steps to build and test the models:

  1. 📊 Data Exploration:

    • Explored the data and found that US projects accounted for 71% of the data, so I grouped the other countries as 'Non-US'.
  2. 🧹 Data Cleansing:

    • Dropped `name_len` and `blurb_len`, keeping the cleaned versions.
    • Handled a strong correlation between `pledged` and `usd_pledged` by dropping the former.
    • Created a new column `goal_usd` by multiplying `goal` and `static_usd_rate`.
    • Addressed missing values in `category`, and excluded observations with `state` other than 'successful' or 'failure'.
  3. 🛠️ Feature Engineering:

    • Excluded irrelevant features such as `id` and `name`, hourly details, original date columns, and weekday columns.
    • The goal of this project is to classify a new project as successful or not, based on the information available at the moment the project owner submits it, so the model should only use predictors available at that time. Hence, I removed 12 columns not available at project submission, including `pledged`, `usd_pledged`, `disable_communication`, `state_changed_at`, `staff_pick`, and `spotlight`.
    • After separating the target `state` (I tried multiple train-test splits; a 2:1 split gave the best accuracy across all models), I created dummies from 17 features, resulting in 39 predictors, and eliminated 3 with a correlation of 0.80 or higher.
  4. 🤖 Model Training: After splitting the dataset, I trained six classification models and chose accuracy as the primary performance metric, since it rewards correctly predicting both true successes and true failures. I also trained each of the 6 models on LASSO-selected features and on PCA-selected components, but since those gave lower accuracy for RF and GBT, I kept the initial list of features for the final models.

  5. 🚀 Top Performer: The gradient boosting (GBT) algorithm emerged as the top performer, with the highest accuracy at 75.30%.

    💡 GBT grows a large number of trees sequentially, each new tree learning from the errors of the one before it. This places greater emphasis on observations with large errors, making the method well-suited for this context.
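The exploration, cleansing, and feature-engineering steps above can be sketched with pandas. The column names (`country`, `goal`, `static_usd_rate`, `goal_usd`, `state`) come from the README; the toy data and the exact calls are my assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kickstarter data: column names follow the
# README; the values themselves are made up for illustration.
df = pd.DataFrame({
    "country": ["US", "US", "GB", "DE", "US", "CA"],
    "goal": [1000.0, 5000.0, 2000.0, 1500.0, 800.0, 3000.0],
    "static_usd_rate": [1.0, 1.0, 1.27, 1.08, 1.0, 0.74],
    "state": ["successful", "failed", "successful",
              "failed", "successful", "failed"],
})

# US projects account for ~71% of the data, so collapse all other countries.
df["country"] = df["country"].where(df["country"] == "US", "Non-US")

# goal_usd = goal * static_usd_rate, converting each goal to US dollars.
df["goal_usd"] = df["goal"] * df["static_usd_rate"]

# Separate the target, then one-hot encode the remaining categoricals.
y = (df["state"] == "successful").astype(int)
X = pd.get_dummies(df.drop(columns=["state", "goal", "static_usd_rate"]),
                   drop_first=True, dtype=float)

# Drop one predictor from any pair correlated at 0.80 or higher.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] >= 0.80).any()])
```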
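The model-training step can be sketched with scikit-learn. Synthetic data stands in for the prepared predictors (the README ends with 36 after the correlation filter), and default hyperparameters are my assumption, so the accuracies will not match the reported 75.30%:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 13,000+ prepared Kickstarter projects.
X, y = make_classification(n_samples=1500, n_features=36, random_state=42)

# 2:1 train-test split, the ratio that worked best per the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "classification tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
    "artificial neural network": MLPClassifier(max_iter=500, random_state=42),
}

# Fit all six and compare them on the same held-out third, using accuracy.
accuracies = {}
for name, model in models.items():
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence MLP convergence warnings
        model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
```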

🎉 Conclusion

I applied the GBT model to predict the state of projects in kickstarter_test_dataset.xlsx and achieved an accuracy of 74.34%, confirming its effectiveness as the best model.
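A minimal sketch of this final step: refit the gradient boosting model and score it on a holdout set. Synthetic data stands in both for the training data and for kickstarter_test_dataset.xlsx (which would really be loaded with `pd.read_excel` and preprocessed identically to the training set), so the number will not match the 74.34% above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-ins: the first slice plays the original training data,
# the second plays the unseen test workbook.
X, y = make_classification(n_samples=1500, n_features=36, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_holdout, y_holdout = X[1000:], y[1000:]

# Fit GBT on the training slice, then score it on the holdout slice.
gbt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
holdout_acc = accuracy_score(y_holdout, gbt.predict(X_holdout))
```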

🔗 Supporting files
