A data mining project that predicts the success of future movies. This is a student project at the University of Mannheim (HWS17, Master of Science, Business Informatics).
Before a new movie is produced, every stakeholder wants to know how successful it will be financially. Predicting that success usually relies on costly methods such as market research and analyses. Data mining, which is well suited to analyzing large datasets, can also be applied to the problem of predicting a movie's success.
The goal of this project is to learn a model that predicts how successful a not yet released movie will be. This is done with common data mining techniques in the Python programming language, using the machine learning models provided by the scikit-learn library. The main objective is to answer the question "Based on revenue, will the movie be popular or will it be a flop?" as precisely as possible for any combination of information available about a new movie.
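The popular-or-flop question can be framed as a binary label derived from revenue. The following is only a minimal sketch, not the labeling rule used in this project: the column names `budget` and `revenue`, the helper `label_success`, the `ratio` threshold and the example rows are all assumptions made for illustration.

```python
import pandas as pd


def label_success(movies: pd.DataFrame, ratio: float = 1.0) -> pd.Series:
    """Label a movie 'popular' if revenue >= ratio * budget, else 'flop'.

    The threshold rule is a hypothetical choice, not the project's definition.
    """
    is_popular = movies["revenue"] >= ratio * movies["budget"]
    return is_popular.map({True: "popular", False: "flop"})


# Made-up rows, only to show the shape of the data.
movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "budget": [100, 200, 50],
        "revenue": [300, 150, 50],
    }
)
movies["label"] = label_success(movies)
print(movies[["title", "label"]])
```

Raising `ratio` makes the "popular" class stricter, which changes the class balance the classifier has to deal with.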
The dataset on which the classification model is trained is provided by Kaggle. It is named The Movies Dataset and contains metadata of approximately 45,000 movies in its raw format. It is provided and updated by Rounak Banik. The complete dataset consists of several files in CSV format containing data about movie casts, metadata and external scores. The main file used during preprocessing is named movies-metadata.csv. Note: not all datasets are provided in this repository due to large file sizes. Additionally, the .zip archive in the directory data/raw/ must be extracted before the preprocessing scripts can run.
To generate a new dataset for later use, execute the script preprocess-data.py.
Afterwards, a new CSV file named train-set.csv is available in data/processed/. This file can be used for further data mining and classification.
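Once the file exists, it can be loaded for inspection. This is a minimal sketch, assuming pandas is installed and the code is run from the repository root; the helper `load_train_set` is hypothetical and not part of the repository, and no assumptions are made about the columns of the file.

```python
from pathlib import Path

import pandas as pd

# Path taken from the description above, relative to the repository root.
TRAIN_SET = Path("data/processed/train-set.csv")


def load_train_set(path: Path = TRAIN_SET) -> pd.DataFrame:
    """Read the processed training set into a DataFrame."""
    return pd.read_csv(path)


# Only load the file if preprocessing has already produced it.
if TRAIN_SET.exists():
    train = load_train_set()
    print(train.shape)
    print(train.head())
```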
To stay flexible with respect to the selected features and parameters, a classifier template was introduced. It makes it possible to search for the set of features that results in the best performance. Hyperparameter tuning is done via scikit-learn's grid search. The scripts can be found under src/model.
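The idea of searching over features and hyperparameters together can be sketched with scikit-learn's `GridSearchCV`. This is not the template from src/model: the synthetic data, the `SelectKBest` feature selection step, the decision tree classifier and the parameter values are all assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed movie features (not the real data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Feature selection and classifier are tuned together in one pipeline.
pipeline = Pipeline(
    [
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", DecisionTreeClassifier(random_state=0)),
    ]
)

# Hypothetical grid: number of kept features and maximum tree depth.
param_grid = {
    "select__k": [2, 4, 8],
    "clf__max_depth": [3, 5, None],
}

# 5-fold cross-validated grid search, scored by F1 on the positive class.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the selection step inside the pipeline means the features are chosen within each cross-validation fold, which avoids leaking information from the validation data into the feature choice.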