forked from gkamradt/ryd.io
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtechnical_summary.html
executable file
·289 lines (253 loc) · 18 KB
/
technical_summary.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Ryd.io - Technical Summary</title>
<!-- Bootstrap Core CSS -->
<link href="../static/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="../static/dist/css/clean-blog.min.css" rel="stylesheet">
<link rel="shortcut icon" type="image/ico" href="../static/favicon.ico" />
<!-- Custom Fonts -->
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='http://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<style>
a {
font-weight: bold;
text-decoration: underline;
}
</style>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-64693107-1', 'auto');
ga('send', 'pageview');
</script>
<body>
<!-- Navigation -->
<nav class="navbar navbar-custom navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-main-collapse">
<i class="fa fa-bars"></i>
</button>
<a class="navbar-brand page-scroll" href="/">
<i class="fa fa-play-circle"></i> <span class="light">Start</span> Ryd
</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse navbar-right navbar-main-collapse">
<ul class="nav navbar-nav">
<!-- Hidden li included to remove active class from about link when scrolled up past about section -->
<li class="hidden">
<a href="#page-top"></a>
</li>
<li>
<a class="page-scroll" href="/cluster_map">Ryd Cluster Map</a>
</li>
<li>
<a class="page-scroll" href="/map_predict">Predictive Map</a>
</li>
<li>
<a class="page-scroll" href="/#about">About</a>
</li>
<li>
<a class="page-scroll" href="/#view_on_github">View On GitHub</a>
</li>
<li>
<a class="page-scroll" href="/#contact">Contact</a>
</li>
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<!-- Set your background image for this header on the line below. -->
<header class="intro-header" style="background-image: url('../static/dist/img/postPic.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>Ryd.io - A Technical Summary</h1>
<h2 class="subheading">Random queries to predictions on a map</h2>
<span class="meta">Posted by <a href="index.html#contact">Greg Kamradt</a> on July 1, 2015</span>
</div>
</div>
</div>
</div>
</header>
<!-- Post Content -->
<article>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<h2 class="section-heading_h2">Summary</h2>
<p>Here we get more into the details of what is going on in the background and how documents were made. For those who want the fine details go check out the <a href='https://github.com/gkamradt/ryd.io'>repo.</a></p>
<h3 style="color: black;">MVP - Model - Part 1 - Query One Location</h3>
<span style="padding-left: 40px;">
First I downloaded the NYC Data set to my local computer. The 30gb file was partitioned into 12 parts. Here's the link to the <a href='http://www.andresmh.com/nyctaxitrips/'>actual data.</a> Next I moved on to the EDA portion. It didn't take long for me to quickly realize that I was going to have to find another way to parse the data. My first pass over the 12 parts took Galvanize's mac-mini 40 minutes to process. Regardless, I had my data for one location. It was time to make sure that we had proper signal in the data. Basically...would I have a project to work with? The thing about being a data scientist is that you can't trust your data. Is it incomplete? corrupt? User bias?<br><br> Who knows! <br><br>Here are the exact graphs I saw when I knew I had a project to work with:
<img src='/static/dist/img/broadway_spring_per_day.png'><br>
<img src='/static/dist/img/broadway_spring_per_dayofweek.png'><br>
<img src='/static/dist/img/broadway_spring_per_hour.png'><br>
<p>
Check out that yearly, weekly, and hourly signal...thats awesome. Once I saw these graphs I knew I'd have something to work with.
</p>
<p>
Once I had the yearly data for one location I could then train a ARIMA model via sklearn. But because I only had one years worth of data with strong weekly seasonality I could use a SARIMA model to forecast days out. For SARIMA models you have to make sure that your data is stationary (white noise) before you fit your model. In order to check this you can use the auto correlation and partial autocorrelation of the residuals to your model.
</p>
<img src='/static/dist/img/actpcf.png'><br><br>
As you can see from the ACF we drop off pretty quickly which says that we have removed most of the seasonality from our data...a good thing!
<p>
The next thing was the geo-prediction. I sat and thought about a technique for a while and I ended up using a <a href='https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation'>multivariate gaussian kernel density estimation.</a> A fancy name for something simple. Basically...based off of where rides have been dropped off in the past...can you predict where they will be in the future? The answer is yes with a bit of randomness thrown in. Here is a view of the output of my KDE.
</p>
<img src='/static/dist/img/2dkde_.jpg' style="width: 440px;">
<p>
Check out how most of the rides landed on one of the streets. This tells me that this street is more popular or simply easier to get dropped off at. For how cool this graph looks it was really simple to implement into the model.
</p>
</span>
<h3 style="color: black;">MVP - Model - Part 2 - Clustering</h3>
<span style="padding-left: 40px;">
<p>
So I saw the weekly and hourly distribution from the first (and only) location that we clustered. I was stoked to see that you could piece out the early hours of the night, happy hour time, and presumably when people were going to bed. I fortunately figured out what the hell I was doing with my data set and switched my queries over to BigQuery. This took my query time for one location from 40min to ~16seconds. This pretty much changed what my vision with the project. I could run a ton of queries and not have to worry about time (as much).
</p>
<p>
The next step was to draw a grid over all of New York, but in order to decrease the amount of queries I'd be performing I sub-sectioned my view to just Manhattan. I figured there would be more of a story to tell with this section. Using np.arange and itertools I was able to get a 9000 point grid over all of Manhattan. However, I knew that some of these points were over water...but I'd have to take care of that later for now. MVP right?
</p>
<p>
I set off the 9000 queries to BigQuery and waited for about an hour for a massive data set. In the past I had gotten a full query of every single ride that landed at a certain location and then did the day by day computation on my local machine. I figured it was easy to right up an SQL command to make BigQuery do this for me....so I did just that. Make them do all the work. So finally...what I got back was a matrix that was 9000 rows by 31 columns. I had BigQuery return to me the amount of rides that were dropped off in total for every Sun/Mon/Tue/Wed/Thurs/Fri/Sat/Sun/1am/2am/3am/4am/5am
/6am/7am/8am/9am/10am/11am/12pm/1pm/2pm/3pm/4pm/5pm/6pm
/7pm/8pm/9pm/10pm/11pm/12am. Literally those were the rows. Here is what the output looked like.
</p>
<img src='/static/dist/img/bigqueryoutput.png'><br>
<p>
So every row you're looking at represents a lat/long that sits on Manhattan somewhere. The columns next to that lat and long describe how many rides arrived there for each time period (the time labels that we talked about above). Once I got that data I put in a threshold to only return certain rows that met a minimum number of rides requirement. After all I had to get rid of those that were sitting on the water. After that I tried clustering the different rows to see if we could get some meaningful labels but it was honestly a bunch of garbage. I thought to myself that the data much be varying to wildly. I then normalized the data, optimized KMeans with the silhouette statistic and the results were exactly what I was looking for. Once I saw this map I officially had a major part two of my project.
</p>
<img src='/static/dist/img/manhattanclustered_raw.png'><br>
<p>
It is way too cool to see that we have clustered different points in Manhattan purely based off of dropoff activity. The next step was to find out what these clusters actually meant. How were they different?
</p>
<p>
What I found out was that the model was able to pull out areas that had different hourly distributions but similar hourly distributions. What the hell does that mean? Check out the full <a href='/cluster_map'>infographic</a> to see exactly what the six different clusters represent. Once I had my SARIMA model up and my clusters defined I put down the code (well the python) and headed over the viz/web app side of things
</p>
</span>
<h3 style="color: black;">MVP - Visualization - Part 1 - Clustering Map</h3>
<p>
This part wasn't too bad. I have a little bit of a design background so photoshop/illustrator was at my disposal. I threw my clusters up on to CartoDB to poach their tiles and to easily check out my data. Once I had them up on Carto I took screen shots of the map and aligned those screen shots on photoshop. What I got was a really cool vibe of a map that ended up defining the visual style for the rest of my presentation. The bar charts were initially made with matplotlib and then finished off in photoshop.
</p>
<h3 style="color: black;">MVP - Visualization - Part 2 - WebApp/Map</h3>
<p>
All pages of the site were made with bootstrap.<br> Simple as that.<br>
The prediction map has a bootstrap base with a leaflet map. The orange markers are custom and made with photoshop. I used flash and AJAX to transfer data. I had ambitions of using d3 but after some initial road blocks I decided to stick with my MVP and keep things simple. I knew very little javascript before I started this project so it was great to dive into a new language and try to tackle it. Overall I'm happy with the results. For the next steps I'd like to get a little more complicated and add sliders to the graph and see the Manhattan clusters dynamically move.
</p>
</p>
<h2 class="section-heading_h2">Languages/Libraries Used</h2>
<ul>
<li>BigQuery for python</li>
<li>pandas</li>
<li>time</li>
<li>NumPy</li>
<li>itertools</li>
<li>os.istdir</li>
<li>glob</li>
<li>os</li>
<li>rv_discrete</li>
<li>SciPy.stats</li>
<li>re</li>
<li>requests</li>
<li>json</li>
<li>threading</li>
<li>Queue</li>
<li>rauth</li>
<li>BeautifulSoup</li>
<li>Jinja</li>
<li>matplotlib.pyplot</li>
</ul>
<h2 class="section-heading_h2">Resources Used</h2>
<ul>
<li>
Big Query
</li>
<li>
Python
</li>
<li>
Elasticsearch
</li>
</ul>
<h2 class="section-heading_h2">Special Thanks</h2>
<ul>
<li>
Tyler Treat - For the wonderful library that connected python and BigQuery. I had saved so much time with his code that I even sent him an email thanking him for his contribution. He even was nice enough to field a couple of my questions later on down the road.
</li>
<li>
Galvanize Staff - Specifically Jeff, Tammy, and Katie. Without these guys Ryd.io (nor myself) wouldn't be what it is today. Thanks to them for the continual inspiration and motivation.
</li>
<li>
Tal Levy - The technical advisor - A huge help throughout the process. He was that friend that I wasn't afraid to ask the dumbest of questions to.
</li>
</ul>
<h2 class="section-heading_h2">My Favorite Line Of Code</h2>
<blockquote>lat_long = self.kernel.resample(1)<br>return lat_long</blockquote>
English: 1. Using the previous dropoff_points in the past return one prediction of a new one. 2. Return that prediction
<br><br><br>
<a href="#">
<img class="img-responsive" src="../static/dist/img/seat.jpg" alt="">
</a>
<span class="caption text-muted">The spot that 98% of the project was completed</span>
</div>
</div>
</div>
</article>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<ul class="list-inline text-center">
<li>
<a href="https://twitter.com/gregkamradt">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-twitter fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="https://github.com/gkamradt">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">Copyright © Ryd.io 2015</p>
</div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="js/clean-blog.min.js"></script>
</body>
</html>