This is a repository for Udacity Data Analyst Project 1 (Investigate a Dataset). The dataset used in the project is also included in this repository.
The libraries used in this project include:
- Pandas – For storing and manipulating structured data. Pandas is built on NumPy (upgrade Pandas to version 0.25.1)
- NumPy – For multi-dimensional arrays, matrix data structures, and mathematical operations
- Matplotlib – For all visualizations (including maps and graphs)
I analyzed a dataset containing information on about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The analysis focuses on the following questions:
- the effect of vote count, popularity and budget on the revenue generated
- the effect of the runtime on the revenue generated
- the effect of budget and vote count on the revenue generated
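One way to approach questions like these is to compute the correlation of each property with revenue. The snippet below is only a minimal sketch on made-up numbers: the column names (`budget`, `popularity`, `vote_count`, `runtime`, `revenue`) are assumed, and the values are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are assumed and values are invented.
movies = pd.DataFrame({
    "budget":     [1, 5, 20, 50, 100, 150],
    "popularity": [0.3, 0.5, 1.2, 2.0, 3.5, 4.0],
    "vote_count": [10, 40, 200, 800, 1500, 3000],
    "runtime":    [95, 120, 88, 130, 101, 110],
    "revenue":    [2, 8, 45, 120, 300, 500],
})

# Pearson correlation of every property with revenue, strongest first.
corr_with_revenue = (
    movies.corr()["revenue"]
    .drop("revenue")            # a column's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_revenue)
```

A correlation close to 1 or -1 points to a strong linear relationship with revenue, while a value near 0 points to a weak one. Correlation alone does not show that one property causes the other.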
The main steps for this project can be summarized as follows:
- Data Wrangling
- Data Assessment
- Data Cleaning
- Exploratory Analysis
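The assessment and cleaning steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's actual code: the data is a small hypothetical frame, the column names are assumed, and treating a zero budget or revenue as a missing value is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical problems: a duplicated row,
# a missing runtime, and zeros standing in for unknown amounts.
raw = pd.DataFrame({
    "original_title": ["A", "B", "B", "C", "D"],
    "budget":  [10, 0, 0, 30, 40],
    "revenue": [25, 5, 5, 0, 90],
    "runtime": [100.0, 90.0, 90.0, np.nan, 120.0],
})

# Assessment: size of the data, duplicated rows, missing values per column.
print(raw.shape)
print("duplicated rows:", raw.duplicated().sum())
print(raw.isnull().sum())

# Cleaning: drop duplicates, then mark zero budget/revenue as missing
# (an assumption for this sketch) and drop incomplete rows.
clean = raw.drop_duplicates().copy()
for column in ["budget", "revenue"]:
    clean[column] = clean[column].replace(0, np.nan)
clean = clean.dropna().reset_index(drop=True)
print(clean)
```

Dropping incomplete rows is the simplest option. Whether it is appropriate depends on how many rows are lost, so it is worth comparing `raw.shape` and `clean.shape` before moving on to the exploratory analysis.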
Based on the data and the analysis carried out:
- properties such as vote count, popularity and budget have a strong effect on the revenue generated
- the effect of the runtime on the revenue generated is not as strong
- budget and vote count have the strongest effect
The budget of a movie that generates low revenue is about 5 million, while that of a high-revenue movie is over 52 million. This indicates that the budget of a movie is correlated with its revenue, but there are limitations to this result, such as the year the movie was released (release_year) and the director of the movie.
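A comparison of this kind can be sketched by splitting movies into low and high revenue groups and averaging the budget in each. The README does not state how the two groups were defined, so the median split below is an assumption, as are the column names and the made-up values.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
movies = pd.DataFrame({
    "budget":  [2, 4, 6, 8, 40, 50, 60, 70],
    "revenue": [1, 3, 5, 9, 100, 150, 200, 400],
})

# Assumed rule: movies at or above the median revenue count as "high".
median_revenue = movies["revenue"].median()
movies["revenue_level"] = (
    movies["revenue"].ge(median_revenue).map({False: "low", True: "high"})
)

# Average budget within each revenue group.
mean_budget = movies.groupby("revenue_level")["budget"].mean()
print(mean_budget)
```

Repeating the same grouping within each `release_year` would be one way to check whether the difference holds once the release year is taken into account.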