Howdy! We are Jose Alfaro, Steve Broll, and Kevin Chou, the winners of the undergraduate division of the 2018 TAMIDS Data Science Competition! This repository contains all of the code we used to build our models and create our visualizations, along with the final report we submitted. Also included are links to the subsetted data we used and to the visuals we created to supplement our report and presentation.
The competition focuses on a large public data set: more than 110 million Chicago taxi rides over the period from January 1, 2013 through July 31, 2017. The data includes the time of day, the length of each trip in terms of both time and distance, and the taxi fare, as well as information about the pickup and drop-off locations.
The data set also contains anonymized unique identifiers for each taxi. This makes it possible to examine how trip revenue per taxi and the number of trips per taxi have changed over time.
The data does not include any direct information about rides with Uber or Lyft in Chicago. However, since Uber and Lyft have operated in Chicago since 2011 and 2013, respectively, the taxi data does allow us to study changes to Chicago taxi trips in response to competition from these ride-sharing services.
The competition centers on building visualizations and predictive models that explain how the Chicago taxi business has changed over time. In particular, contestants are asked to consider how hourly, daily, and weekly revenue and trips for a typical Chicago taxi have changed over both location and time. By “a typical Chicago taxi” we mean the median taxi. For example, we understand “weekly revenue for a typical Chicago taxi” to mean the median weekly revenue across Chicago taxis. In other words, we calculate the weekly revenue for each Chicago taxi and then take the median of those values within each week as the target value.
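The median-per-week target described above can be sketched in a few lines of pandas. This is only an illustration, not necessarily the code in this repository: the column names `taxi_id`, `trip_start`, and `fare` are hypothetical stand-ins for the real fields, the tiny data frame is made up, and weeks are assumed to be Monday-to-Sunday calendar weeks.

```python
import pandas as pd


def median_weekly_revenue(trips: pd.DataFrame) -> pd.Series:
    """Median, across taxis, of each taxi's total fare per calendar week.

    Assumes hypothetical columns: taxi_id, trip_start (datetime), fare.
    """
    # Label every trip with the calendar week in which it started.
    week = trips["trip_start"].dt.to_period("W").rename("week")
    # Step 1: total revenue for each taxi within each week.
    per_taxi = trips.groupby([week, "taxi_id"])["fare"].sum()
    # Step 2: the "typical" taxi is the median of those per-taxi totals.
    return per_taxi.groupby(level="week").median()


# Made-up toy data: three taxis in the first week, two in the second.
trips = pd.DataFrame(
    {
        "taxi_id": ["a", "a", "b", "c", "a", "b"],
        "trip_start": pd.to_datetime(
            [
                "2013-01-07",
                "2013-01-08",
                "2013-01-09",
                "2013-01-10",
                "2013-01-14",
                "2013-01-15",
            ]
        ),
        "fare": [10.0, 20.0, 50.0, 40.0, 15.0, 25.0],
    }
)

weekly_median = median_weekly_revenue(trips)
print(weekly_median)
```

The same pattern should carry over to hourly or daily targets by changing the period frequency, and to trip counts by replacing the sum of fares with a count of rows.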
The data from 2013 through 2016 shall be used for training predictive models, while the data from 2017 shall be used for testing the efficacy of the predictive models built using the training data.
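One way to express this split is by the calendar year of each trip. The sketch below assumes a hypothetical datetime column named `trip_start` and uses made-up rows; it is meant only to illustrate the rule, not to reproduce the code in this repository.

```python
import pandas as pd


def split_by_year(trips: pd.DataFrame, time_col: str = "trip_start"):
    """Return (train, test): trips from 2013-2016 and trips from 2017.

    time_col is a hypothetical name for the trip start timestamp column.
    """
    year = trips[time_col].dt.year
    train = trips[year.between(2013, 2016)]  # inclusive on both ends
    test = trips[year == 2017]
    return train, test


# Made-up example rows spanning the competition period.
sample = pd.DataFrame(
    {
        "trip_start": pd.to_datetime(
            ["2013-01-01", "2015-06-15", "2016-12-31", "2017-01-01", "2017-07-31"]
        ),
        "fare": [8.5, 12.0, 30.25, 9.75, 14.0],
    }
)

train, test = split_by_year(sample)
print(len(train), len(test))
```

Splitting on time rather than at random keeps every test trip strictly later than every training trip, which matches how the models would be used to forecast.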