This project is a submission for the General Assembly Data Science Immersive course (DSI-4).
Project workflow, organised in document order:
- Part 1: Identify / Pitch
- Part 2: Acquire, Parse
- Part 3: Mine, Refine
- Part 4: Build
- Part 5: Predict
- Part 6: Present
Part 1: Identify
IDENTIFY: Understand the problem
- Identify business/product objectives.
- Identify and hypothesize goals and criteria for success.
- Create a set of questions to help you identify the correct data set.
Pitch us on potential ideas for a data-driven project. Think of topics you’re passionate about, knowledge you’re familiar with, or problems relevant to industries you’d like to work with. What questions do you want to answer?
Part 2: Acquire + Parse
ACQUIRE: Obtain the data
Ideal data vs. available data: we often start by identifying the ideal data we would want for a project.
- Data for predictions: Foursquare API
- Data for modelling: XML file of labelled data from META-SHARE
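The snippet below is a minimal acquisition sketch, not the project's actual pipeline: it assumes the Foursquare v2 `venues/search` endpoint with placeholder credentials and coordinates, and a hypothetical XML schema (`record`, `text`, `label` tags) for the labelled META-SHARE file.

```python
import requests
import xml.etree.ElementTree as ET

# Data for predictions: venue data from the Foursquare API.
# Placeholder credentials and coordinates; parameters follow the v2 venues/search API.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180801",           # API version date
    "ll": "1.3521,103.8198",   # latitude,longitude of the area of interest
    "query": "restaurant",
    "limit": 50,
}
resp = requests.get("https://api.foursquare.com/v2/venues/search", params=params)
venues = resp.json()["response"]["venues"]

# Data for modelling: labelled examples in an XML file (e.g. downloaded from META-SHARE).
# The tag names below are hypothetical and depend on the file's actual schema.
tree = ET.parse("labelled_data.xml")
records = [
    {"text": node.findtext("text"), "label": node.findtext("label")}
    for node in tree.getroot().iter("record")
]
```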
Some typical questions at this stage may include:
- Identifying the right data set(s)
- Is there enough data?
- Does it appropriately align with the question/problem statement?
- Can the dataset be trusted? How was it collected?
- Is this dataset aggregated? Can we use the aggregation or do we need to get it pre-aggregation?
- Assessing resources, requirements, assumptions, and constraints
PARSE: Understand the data
Common tasks at this step include (a short sketch follows the list):
- Reading any documentation provided with the data (e.g. data dictionary above)
- Performing exploratory surface analysis via filtering, sorting, and simple visualizations
- Describing data structure and the information being collected
- Exploring variables and data types via selection
- Assessing preliminary outliers, trends
- Verifying the quality of the data (feedback loop back to Part 1)
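A short surface-analysis sketch, assuming the acquired data has been loaded into a pandas DataFrame; the file path and column name are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file path; load whatever the acquisition step produced.
df = pd.read_csv("data/acquired_data.csv")

# Surface-level checks: structure, types, and simple summaries.
print(df.shape)           # how much data do we have?
print(df.dtypes)          # which variables are numeric vs categorical?
print(df.describe())      # preliminary ranges, trends, and potential outliers
print(df.isnull().sum())  # data-quality check that can feed back into acquisition

# A quick visual scan of one numeric column.
df["some_numeric_column"].hist(bins=30)
plt.show()
```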
Part 3: Mine + Refine
MINE: Prepare, structure, and clean the data
Often, our data will need to be cleaned prior to performing our analysis. Common tasks at this step include (a cleaning sketch follows the list):
- Sampling the data and determining the sampling methodology
- Iterating on and exploring outliers and null values via selection
- Reviewing qualitative vs quantitative data
- Formatting and cleaning data in Python (e.g. dates, number signs)
- Defining how to appropriately address missing values (cleaning)
- Categorizing, manipulating, slicing, formatting, and integrating data
- Formatting and combining different data points, separating columns, etc.
- Determining the most appropriate aggregations and cleaning methods
- Creating necessary derived columns from the data (new data)
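A cleaning sketch under assumed column names (`date`, `price`, `category`, `label`, `review_text`), showing how a few of the tasks above might look in pandas; the actual columns depend on the acquired data.

```python
import pandas as pd

# Placeholder column names; each line mirrors a task from the list above.
df["date"] = pd.to_datetime(df["date"], errors="coerce")    # standardise date formats
df["price"] = (df["price"].astype(str)
                          .str.replace("$", "", regex=False)
                          .astype(float))                    # strip currency/number signs
df["category"] = df["category"].str.strip().str.lower()     # normalise a categorical column

# Address missing values deliberately rather than dropping everything.
df = df.dropna(subset=["label"])                             # rows are unusable without a label
df["price"] = df["price"].fillna(df["price"].median())

# Derived column created from existing data.
df["review_length"] = df["review_text"].str.len()
```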
REFINE: Exploratory Data Analysis & Iteration
Exploratory data analysis and descriptive statistics allow us to (see the sketch after this list):
- Identify trends and outliers
- Decide how to deal with outliers: excluding, filtering, or communicating them
- Apply descriptive and inferential statistics
- Determine initial visualization techniques
- Document and capture knowledge
- Choose visualization techniques for different data types
- Transform data
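A minimal EDA sketch, again with placeholder column names, covering a grouped summary, a simple IQR outlier flag, and a box plot chosen to match a numeric-by-category comparison.

```python
import matplotlib.pyplot as plt

# Grouped descriptive statistics (placeholder column names).
summary = df.groupby("category")["price"].agg(["mean", "median", "std", "count"])
print(summary)

# Flag potential outliers with a simple IQR rule before deciding whether to
# exclude, filter, or simply document them.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers")

# A visualisation chosen to match the data types: numeric value split by category.
df.boxplot(column="price", by="category")
plt.show()
```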
Part 4: Build
BUILD: Create a data model
Some of the steps we will take to build a model include (a minimal sketch follows the list):
- Selecting the appropriate model
- Building a model
- Training and testing our model
- Evaluating and refining our model
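A minimal modelling sketch with scikit-learn, assuming a text feature column and a label column; the TF-IDF + logistic regression choice is illustrative, not necessarily the model used in the project.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Placeholder feature/target columns; the model choice is illustrative only.
X = df["review_text"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Select and build a model: TF-IDF features feeding a logistic regression classifier.
vectoriser = TfidfVectorizer(max_features=5000)
X_train_vec = vectoriser.fit_transform(X_train)
X_test_vec = vectoriser.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Evaluate on held-out data, then refine features/hyperparameters as needed.
print(classification_report(y_test, model.predict(X_test_vec)))
```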