<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="GENERATOR" content="hevea 1.07" />
<title>
Visualizing data
</title>
</head>
<body>
<a href="cfbook015.html"><img src="previous_motif.gif" alt="Previous" /></a>
<a href="index.html"><img src="contents_motif.gif" alt="Up" /></a>
<a href="cfbook017.html"><img src="next_motif.gif" alt="Next" /></a>
<hr />
<h1><font color="black"><a name="htoc185">Chapter 15</a> Visualizing data</font></h1>
<font color="black">So far we have been learning the Python language and then
learning how to use Python, the network, and databases
to manipulate data.<br />
<br />
In this chapter, we take a look at
three
complete applications that bring all of these things together
to manage and visualize data. You might use these applications
as sample code to help get you started in solving a
real-world problem.<br />
<br />
Each of the applications is a ZIP file that you can download
and extract onto your computer and execute.</font><br />
<br />
<a name="toc167"></a>
<h2><font color="black"><a name="htoc186">15.1</a> Building a Google map from geocoded data</font></h2>
<a name="@default811"></a>
<a name="@default812"></a>
<font color="black">In this project, we are using the Google geocoding API
to clean up some user-entered geographic locations of
university names and then placing the data on a Google
map. <br />
</font><div align="center"><font color="black"><img src="cfbook021.png" /></font></div><font color="black">
<br />
To get started, download the application from:<br />
<br />
<tt>www.py4inf.com/code/geodata.zip</tt><br />
<br />
The first problem to solve is that the free Google geocoding
API is rate-limited to a certain number of requests per day, so if you have
a lot of data you might need to stop and restart the lookup
process several times. For that reason, we break the problem into two
phases. <br />
<br />
<a name="@default813"></a>
In the first phase we read our input "survey" data in the file
<b>where.data</b> one line at a time, retrieve the
geocoded information from Google, and store it
in the database <b>geodata.sqlite</b>.
Before we use the geocoding API for each user-entered location,
we simply check to see if we already have the data for that
particular line of input. The database functions as a
local "cache" of our geocoding data to make sure we never ask
Google for the same data twice.<br />
<br />
You can re-start the process at any time by removing the file
<b>geodata.sqlite</b>.<br />
<br />
Run the <b>geoload.py</b> program. It reads the input
lines in <b>where.data</b> and, for each line, checks whether that
location is already in the database; if we don't have the data for
the location, it calls the geocoding API to retrieve the data and stores it in
the database.<br />
<br />
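To give a feel for how this caching works, here is a minimal sketch of the
check-the-cache-then-fetch pattern, written in Python 3 style (the book's own
programs use Python 2, so their urllib calls look slightly different) and using
an illustrative table layout rather than the actual <b>geoload.py</b> schema:
</font><pre><font size="4" color="blue">
import sqlite3
import urllib.request, urllib.parse

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)')

address = 'Monash University'

# First look in the local cache
cur.execute('SELECT geodata FROM Locations WHERE address = ?', (address,))
row = cur.fetchone()
if row is not None:
    print('Found in database', address)
else:
    # Not cached yet - ask the geocoding API once and remember the answer
    serviceurl = 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&'
    url = serviceurl + urllib.parse.urlencode({'address': address})
    data = urllib.request.urlopen(url).read().decode()
    cur.execute('INSERT INTO Locations (address, geodata) VALUES (?, ?)',
                (address, data))
    conn.commit()
    print('Retrieved', len(data), 'characters for', address)
</font></pre><font color="black">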
Here is a sample run after there is already some data in the
database:
</font><pre><font size="4" color="blue">
Found in database Northeastern University
Found in database University of Hong Kong, ...
Found in database Technion
Found in database Viswakarma Institute, Pune, India
Found in database UMD
Found in database Tufts University
Resolving Monash University
Retrieving http://maps.googleapis.com/maps/api/
geocode/json?sensor=false&address=Monash+University
Retrieved 2063 characters { "results" : [
{u'status': u'OK', u'results': ... }
Resolving Kokshetau Institute of Economics and Management
Retrieving http://maps.googleapis.com/maps/api/
geocode/json?sensor=false&address=Kokshetau+Inst ...
Retrieved 1749 characters { "results" : [
{u'status': u'OK', u'results': ... }
...
</font></pre><font color="black">The first few locations are already in the database and so they
are skipped. The program scans until it finds locations that have not yet been
retrieved and starts retrieving them.<br />
<br />
The <b>geoload.py</b> program can be stopped at any time, and there is a counter
that you can use to limit the number of calls to the geocoding
API for each run. Given that <b>where.data</b> only has a few hundred
data items, you should not run into the daily rate limit, but if you
had more data it might take several runs over several days to
get your database to include all of the geocoded data for your input.<br />
<br />
Once you have some data loaded into <b>geodata.sqlite</b>, you can
visualize the data using the <b>geodump.py</b> program. This
program reads the database and writes the file <b>where.js</b>
with the location, latitude, and longitude in the form of
executable JavaScript code. <br />
<br />
A run of the <b>geodump.py</b> program is as follows:
</font><pre><font size="4" color="blue">
Northeastern University, ... Boston, MA 02115, USA 42.3396998 -71.08975
Bradley University, 1501 ... Peoria, IL 61625, USA 40.6963857 -89.6160811
...
Technion, Viazman 87, Kesalsaba, 32000, Israel 32.7775 35.0216667
Monash University Clayton ... VIC 3800, Australia -37.9152113 145.134682
Kokshetau, Kazakhstan 53.2833333 69.3833333
...
12 records written to where.js
Open where.html to view the data in a browser
</font></pre><font color="black">The file <b>where.html</b> consists of HTML and JavaScript to visualize
a Google map. It reads the most recent data in <b>where.js</b> to get
the data to be visualized. Here is the format of the <b>where.js</b> file:
</font><pre><font size="4" color="blue">
myData = [
[42.3396998,-71.08975, 'Northeastern Uni ... Boston, MA 02115'],
[40.6963857,-89.6160811, 'Bradley University, ... Peoria, IL 61625, USA'],
[32.7775,35.0216667, 'Technion, Viazman 87, Kesalsaba, 32000, Israel'],
...
];
</font></pre><font color="black">This is a JavaScript variable that contains a list of lists.
The syntax for JavaScript list constants is very similar to
Python's, so it should look familiar to you.<br />
<br />
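Here is a sketch of how a program like <b>geodump.py</b> could produce
<b>where.js</b> from the cached rows, assuming the same illustrative Locations
table as in the earlier sketch and the key names used by the Google geocoding
response (the real program may differ in its details):
</font><pre><font size="4" color="blue">
import sqlite3, json

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()
cur.execute('SELECT address, geodata FROM Locations')

fhand = open('where.js', 'w')
fhand.write('myData = [\n')
count = 0
for address, geodata in cur:
    try:
        js = json.loads(geodata)
        loc = js['results'][0]['geometry']['location']
        where = js['results'][0]['formatted_address'].replace("'", '')
    except:
        continue   # skip rows whose geodata is missing or unparseable
    fhand.write("[{},{}, '{}'],\n".format(loc['lat'], loc['lng'], where))
    count = count + 1
fhand.write('];\n')
fhand.close()
print(count, 'records written to where.js')
print('Open where.html to view the data in a browser')
</font></pre><font color="black">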
Simply open <b>where.html</b> in a browser to see the locations. You
can hover over each map pin to find the location that the
geocoding API returned for the user-entered input. If you
cannot see any data when you open the <b>where.html</b> file, you might
want to check the JavaScript or developer console for your browser.</font><br />
<br />
<a name="toc168"></a>
<h2><font color="black"><a name="htoc187">15.2</a> Visualizing networks and interconnections</font></h2>
<a name="@default814"></a>
<a name="@default815"></a>
<a name="@default816"></a>
<font color="black">In this application, we will perform some of the functions of a search
engine. We will first spider a small subset of the web and then run
a simplified version of the Google page rank algorithm to
determine which pages are most highly connected, and then visualize
the page rank and connectivity of our small corner of the web.
We will use the D3 JavaScript visualization library
<tt>http://d3js.org/</tt> to produce the visualization output.<br />
<br />
You can download and extract this application from:<br />
<br />
<tt>www.py4inf.com/code/pagerank.zip</tt></font><br />
<div align="center"><font color="black"><img src="cfbook022.png" /></font></div><font color="black">
<br />
The first program (<b>spider.py</b>) crawls a web
site and pulls a series of pages into the
database (<b>spider.sqlite</b>), recording the links between pages.
You can restart the process at any time by removing the
<b>spider.sqlite</b> file and re-running <b>spider.py</b>.
</font><pre><font size="4" color="blue">
Enter web url or enter: http://www.dr-chuck.com/
['http://www.dr-chuck.com']
How many pages:2
1 http://www.dr-chuck.com/ 12
2 http://www.dr-chuck.com/csev-blog/ 57
How many pages:
</font></pre><font color="black">In this sample run, we told it to crawl a website and retrieve two
pages. If you restart the program and tell it to crawl more
pages, it will not re-crawl any pages already in the database. Upon
restart it goes to a random non-crawled page and starts there. So
each successive run of <b>spider.py</b> is additive.
</font><pre><font size="4" color="blue">
Enter web url or enter: http://www.dr-chuck.com/
['http://www.dr-chuck.com']
How many pages:3
3 http://www.dr-chuck.com/csev-blog 57
4 http://www.dr-chuck.com/dr-chuck/resume/speaking.htm 1
5 http://www.dr-chuck.com/dr-chuck/resume/index.htm 13
How many pages:
</font></pre><font color="black">You can have multiple starting points in the same database -
within the program these are called "webs". The spider
chooses randomly amongst all non-visited links across all
the webs as the next page to spider.<br />
<br />
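To show the general shape of what the spider does on each step, here is a
hedged sketch of one crawl step: pick an unretrieved page at random, fetch it,
and record its outgoing links. The table and column names are illustrative
assumptions, not the real <b>spider.sqlite</b> schema:
</font><pre><font size="4" color="blue">
import sqlite3, urllib.request
from bs4 import BeautifulSoup

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Pick a random page that has not been retrieved yet (html still NULL)
cur.execute('SELECT id, url FROM Pages WHERE html IS NULL ORDER BY RANDOM() LIMIT 1')
row = cur.fetchone()
if row is None:
    print('No unretrieved pages - add a starting URL first')
else:
    page_id, url = row
    html = urllib.request.urlopen(url).read().decode(errors='replace')
    cur.execute('UPDATE Pages SET html = ? WHERE id = ?', (html, page_id))

    # Record every outgoing link so the page rank step can follow connections
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup('a'):
        href = tag.get('href', None)
        if href is None or not href.startswith('http'):
            continue
        # Assumes the url column has a UNIQUE constraint
        cur.execute('INSERT OR IGNORE INTO Pages (url) VALUES (?)', (href,))
        cur.execute('SELECT id FROM Pages WHERE url = ?', (href,))
        to_id = cur.fetchone()[0]
        cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)',
                    (page_id, to_id))
    conn.commit()
</font></pre><font color="black">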
If you want to dump the contents of the <b>spider.sqlite</b> file, you can
run <b>spdump.py</b> as follows:
</font><pre><font size="4" color="blue">
(5, None, 1.0, 3, u'http://www.dr-chuck.com/csev-blog')
(3, None, 1.0, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')
(1, None, 1.0, 2, u'http://www.dr-chuck.com/csev-blog/')
(1, None, 1.0, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm')
4 rows.
</font></pre><font color="black">This shows the number of incoming links, the old page rank, the new page
rank, the id of the page, and the url of the page. The <b>spdump.py</b> program
only shows pages that have at least one incoming link to them.<br />
<br />
Once you have a few pages in the database, you can run page rank on the
pages using the <b>sprank.py</b> program. You simply tell it how many page
rank iterations to run.
</font><pre><font size="4" color="blue">
How many iterations:2
1 0.546848992536
2 0.226714939664
[(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)]
</font></pre><font color="black">You can dump the database again to see that page rank has been updated:
</font><pre><font size="4" color="blue">
(5, 1.0, 0.985, 3, u'http://www.dr-chuck.com/csev-blog')
(3, 1.0, 2.135, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')
(1, 1.0, 0.659, 2, u'http://www.dr-chuck.com/csev-blog/')
(1, 1.0, 0.659, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm')
4 rows.
</font></pre><font color="black">You can run <b>sprank.py</b> as many times as you like and it will simply refine
the page rank each time you run it. You can even run <b>sprank.py</b> a few times
and then go spider a few more pages with <b>spider.py</b> and then run <b>sprank.py</b>
to re-converge the page rank values. A search engine usually runs both the crawling and
ranking programs all the time.<br />
<br />
If you want to restart the page rank calculations without re-spidering the
web pages, you can use <b>spreset.py</b> and then restart <b>sprank.py</b>.
</font><pre><font size="4" color="blue">
How many iterations:50
1 0.546848992536
2 0.226714939664
3 0.0659516187242
4 0.0244199333
5 0.0102096489546
6 0.00610244329379
...
42 0.000109076928206
43 9.91987599002e-05
44 9.02151706798e-05
45 8.20451504471e-05
46 7.46150183837e-05
47 6.7857770908e-05
48 6.17124694224e-05
49 5.61236959327e-05
50 5.10410499467e-05
[(512, 0.0296), (1, 12.79), (2, 28.93), (3, 6.808), (4, 13.46)]
</font></pre><font color="black">For each iteration of the page rank algorithm it prints the average
change in page rank per page. The network initially is quite
unbalanced and so the individual page rank values change wildly between
iterations.
But in a few short iterations, the page rank converges. You
should run <b>sprank.py</b> long enough that the page rank values converge.<br />
<br />
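The idea behind the page rank update can be seen in a tiny, self-contained
sketch that runs the same style of iteration on a hand-made link structure
instead of <b>spider.sqlite</b>. This is a toy illustration, not the actual
<b>sprank.py</b> code:
</font><pre><font size="4" color="blue">
# A hand-made "web": page id -> list of page ids it links to
links = {1: [2, 3], 2: [1], 3: [1, 2], 4: [1]}
ranks = {page: 1.0 for page in links}     # every page starts with rank 1.0

for iteration in range(10):
    new_ranks = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        # Each page gives an equal share of its rank to every page it links to
        share = ranks[page] / len(outlinks)
        for target in outlinks:
            new_ranks[target] = new_ranks[target] + share
    # Average change in rank per page, like the number sprank.py prints
    diff = sum(abs(new_ranks[p] - ranks[p]) for p in ranks) / len(ranks)
    ranks = new_ranks
    print(iteration + 1, diff)

print(sorted(ranks.items()))
</font></pre><font color="black">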
If you want to visualize the current top pages in terms of page rank,
run <b>spjson.py</b> to read the database and write the data for the
most highly linked pages in JSON format to be viewed in a
web browser.
</font><pre><font size="4" color="blue">
Creating JSON output on spider.json...
How many nodes? 30
Open force.html in a browser to view the visualization
</font></pre><font color="black">You can view this data by opening the file <b>force.html</b> in your web browser.
This shows an automatic layout of the nodes and links. You can click and
drag any node and you can also double click on a node to find the URL
that is represented by the node.<br />
<br />
If you re-run the other utilities, re-run <b>spjson.py</b> and then
press refresh in the browser to get the new data from <b>spider.json</b>.</font><br />
<br />
<a name="toc169"></a>
<h2><font color="black"><a name="htoc188">15.3</a> Visualizing mail data</font></h2>
<font color="black">Up to this point in the book, you have become quite familiar with our
<b>mbox-short.txt</b> and <b>mbox.txt</b> data files. Now it is time to take
our analysis of e-mail data to the next level. <br />
<br />
In the real world, sometimes you have to pull mail data down from servers,
which can take quite some time, and the data may be inconsistent,
error-filled, and in need of a lot of cleanup or adjustment. In this section, we
work with the most complex application so far, pulling down nearly a
gigabyte of data and visualizing it.<br />
</font><div align="center"><font color="black"><img src="cfbook023.png" /></font></div><font color="black">
<br />
You can download this application from:<br />
<br />
<tt>www.py4inf.com/code/gmane.zip</tt><br />
<br />
We will be using data from a free e-mail list archiving service called
<tt>www.gmane.org</tt>. This service is very popular with open-source
projects because it provides a nice searchable archive of their
e-mail activity. They also have a very liberal policy regarding accessing
their data through their API. They have no rate limits, but ask that you
don't overload their service and take only the data you need. You can read
gmane's terms and conditions at this page:<br />
<br />
<tt>http://gmane.org/export.php</tt><br />
<br />
<em>It is very important that you make use of the gmane.org data
responsibly by adding delays to your access of their services and spreading
long-running jobs over a longer period of time. Do not abuse this free service
and ruin it for the rest of us.</em><br />
<br />
When the Sakai e-mail data was spidered using this software, it produced nearly
a gigabyte of data and took a number of runs over several days.
The file <b>README.txt</b> in the above ZIP may have instructions as to how
you can download a pre-spidered copy of the <b>content.sqlite</b> file for
a majority of the Sakai e-mail corpus so you don't have to spider for
five days just to run the programs. If you download the pre-spidered
content, you should still run the spidering process to catch up with
more recent messages.<br />
<br />
The first step is to spider the gmane repository. The base URL
is hard-coded in <b>gmane.py</b> and points to the Sakai
developer list. You can spider another repository by changing that
base URL. Make sure to delete the <b>content.sqlite</b> file if you
switch the base URL. <br />
<br />
The <b>gmane.py</b> file operates as a responsible caching spider in
that it runs slowly and retrieves one mail message per second so
as to avoid getting throttled by gmane. It stores all of
its data in a database and can be interrupted and re-started
as often as needed. It may take many hours to pull all the data
down, so you may need to restart several times.<br />
<br />
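The overall shape of that loop is roughly the following sketch, which fetches a
handful of messages using the URL pattern shown in the sample run below and
pauses a second between requests. The table layout here is an illustrative
assumption (the real <b>gmane.py</b> stores more fields and resumes by scanning
for the first missing message number rather than the highest one stored):
</font><pre><font size="4" color="blue">
import sqlite3, time, urllib.request

baseurl = 'http://download.gmane.org/gmane.comp.cms.sakai.devel/'

conn = sqlite3.connect('content.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, text TEXT)')

# Resume after the highest message number we already have
cur.execute('SELECT max(id) FROM Messages')
row = cur.fetchone()
start = (row[0] or 0) + 1

for msgnum in range(start, start + 10):
    url = baseurl + str(msgnum) + '/' + str(msgnum + 1)
    text = urllib.request.urlopen(url).read().decode(errors='replace')
    print(url, len(text))
    if not text.startswith('From '):
        print('Does not start with From')
        break
    cur.execute('INSERT INTO Messages (id, text) VALUES (?, ?)', (msgnum, text))
    conn.commit()
    time.sleep(1)      # retrieve at most one message per second
</font></pre><font color="black">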
Here is a run of <b>gmane.py</b> retrieving the last five messages of the
Sakai developer list:
</font><pre><font size="4" color="blue">
How many messages:10
http://download.gmane.org/gmane.comp.cms.sakai.devel/51410/51411 9460
[email protected] 2013-04-05 re: [building ...
http://download.gmane.org/gmane.comp.cms.sakai.devel/51411/51412 3379
[email protected] 2013-04-06 re: [building ...
http://download.gmane.org/gmane.comp.cms.sakai.devel/51412/51413 9903
[email protected] 2013-04-05 [building sakai] melete 2.9 oracle ...
http://download.gmane.org/gmane.comp.cms.sakai.devel/51413/51414 349265
[email protected] 2013-04-07 [building sakai] ...
http://download.gmane.org/gmane.comp.cms.sakai.devel/51414/51415 3481
[email protected] 2013-04-07 re: ...
http://download.gmane.org/gmane.comp.cms.sakai.devel/51415/51416 0
Does not start with From
</font></pre><font color="black">The program scans <b>content.sqlite</b> from one up to the first message number not
already spidered and starts spidering at that message. It continues spidering
until it has spidered the desired number of messages or it reaches a page
that does not appear to be a properly formatted message.<br />
<br />
Sometimes <tt>gmane.org</tt> is missing a message. Perhaps administrators can delete messages
or perhaps they get lost. If your spider stops, and it seems it has hit
a missing message, go into the SQLite Manager and add a row with the missing id, leaving
all the other fields blank, and restart <b>gmane.py</b>. This will unstick the
spidering process and allow it to continue. These empty messages will be ignored in the next
phase of the process.<br />
<br />
One nice thing is that once you have spidered all of the messages and have them in
<b>content.sqlite</b>, you can run <b>gmane.py</b> again to get new messages as
they get sent to the list. <br />
<br />
The <b>content.sqlite</b> data is pretty raw, with an inefficient data model,
and not compressed.
This is intentional as it allows you to look at <b>content.sqlite</b>
in the SQLite Manager to debug problems with the spidering process.
It would be a bad idea to run any queries against this database as they
would be quite slow.<br />
<br />
The second process is to run the program <b>gmodel.py</b>. This program reads the rough/raw
data from <b>content.sqlite</b> and produces a cleaned-up and well-modeled version of the
data in the file <b>index.sqlite</b>. The file <b>index.sqlite</b> will be much smaller (often 10X
smaller) than <b>content.sqlite</b> because it also compresses the header and body text.<br />
<br />
Each time <b>gmodel.py</b> runs, it deletes and re-builds <b>index.sqlite</b>, allowing
you to adjust its parameters and edit the mapping tables in <b>content.sqlite</b> to tweak the
data cleaning process. This is a sample run of <b>gmodel.py</b>. It prints a line each time
250 mail messages are processed so you can see some progress, since this program may
run for a while processing nearly a gigabyte of mail data.
</font><pre><font size="4" color="blue">
Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00 [email protected]
251 2005-12-22T10:03:20-08:00 [email protected]
501 2006-01-12T11:17:34-05:00 [email protected]
751 2006-01-24T11:13:28-08:00 [email protected]
...
</font></pre>
<font color="black">The <b>gmodel.py</b> program handles a number of data cleaning tasks.<br />
<br />
Domain names are truncated to two levels for .com, .org, .edu, and .net.
Other domain names are truncated to three levels. So si.umich.edu becomes
umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Also e-mail addresses are
forced to lower case, and some of the @gmane.org addresses like the following
</font><pre><font size="4" color="blue">
</font></pre><font color="black">are converted to the real address whenever there is a matching real e-mail
address elsewhere in the message corpus.<br />
<br />
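The truncation rule is simple enough to capture in a few lines. Here is a
sketch of a helper in that spirit, using hypothetical addresses built from the
examples above; this is not the actual <b>gmodel.py</b> code:
</font><pre><font size="4" color="blue">
def truncate_domain(addr):
    # Lower-case the address and keep only the domain part
    addr = addr.lower()
    pieces = addr.split('@')[1].split('.')
    if pieces[-1] in ('com', 'org', 'edu', 'net'):
        return '.'.join(pieces[-2:])    # two levels, e.g. umich.edu
    return '.'.join(pieces[-3:])        # three levels, e.g. cam.ac.uk

print(truncate_domain('someone@si.umich.edu'))      # prints umich.edu
print(truncate_domain('someone@caret.cam.ac.uk'))   # prints cam.ac.uk
</font></pre><font color="black">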
If you look in the <b>content.sqlite</b> database there are two tables that allow
you to map both domain names and individual e-mail addresses that change over
the lifetime of the e-mail list. For example, Steve Githens used the following
e-mail addresses as he changed jobs over the life of the Sakai developer list:
</font><pre><font size="4" color="blue">
</font></pre><font color="black">We can add two entries to the Mapping table in <b>content.sqlite</b> so
<b>gmodel.py</b> will map all three to one address:
</font><pre><font size="4" color="blue">
</font></pre><font color="black">You can also make similar entries in the DNSMapping table if there are multiple
DNS names you want mapped to a single DNS name. The following mapping was added to the Sakai data:
</font><pre><font size="4" color="blue">
iupui.edu -> indiana.edu
</font></pre><font color="black">So all the accounts from the various Indiana University campuses are tracked together.<br />
<br />
You can re-run <b>gmodel.py</b> over and over as you look at the data, and add mappings
to make the data cleaner and cleaner. When you are done, you will have a nicely
indexed version of the e-mail in <b>index.sqlite</b>. This is the file to use for data
analysis, and with it the analysis will be really quick.<br />
<br />
The first, simplest data analysis is to determine "who sent the most mail?" and "which
organization sent the most mail"? This is done using <b>gbasic.py</b>:
</font><pre><font size="4" color="blue">
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
[email protected] 2657
[email protected] 1742
[email protected] 1591
[email protected] 1304
[email protected] 1184
Top 5 Email list organizations
gmail.com 7339
umich.edu 6243
uct.ac.za 2451
indiana.edu 2258
unicon.net 2055
</font></pre><font color="black">Note how much more quickly <b>gbasic.py</b> runs compared to <b>gmane.py</b>
or even <b>gmodel.py</b>. They are all working on the same data, but
<b>gbasic.py</b> is using the compressed and normalized data in
<b>index.sqlite</b>. If you have a lot of data to manage, a multi-step
process like the one in this application may take a little longer to develop,
but will save you a lot of time when you really start to explore
and visualize your data.<br />
<br />
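As an illustration of why the cleaned-up <b>index.sqlite</b> is so pleasant to
work with, a top-senders query in the spirit of <b>gbasic.py</b> takes only a
few lines. The table and column names below are assumptions for the sketch;
the real schema is more normalized, so the actual query differs:
</font><pre><font size="4" color="blue">
import sqlite3

conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()

# Count messages per sender and show the five busiest participants
cur.execute('''SELECT sender, COUNT(*) AS cnt FROM Messages
               GROUP BY sender ORDER BY cnt DESC LIMIT 5''')
for sender, cnt in cur:
    print(sender, cnt)
</font></pre><font color="black">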
You can produce a simple visualization of the word frequency in the subject lines
using the program <b>gword.py</b>:
</font><pre><font size="4" color="blue">
Range of counts: 33229 129
Output written to gword.jsonp
</font></pre>
<font color="black">This produces the file <b>gword.jsonp</b> which you can visualize using
<b>gword.htm</b> to produce a word cloud similar to the one at the beginning
of this section.<br />
<br />
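The word counting behind <b>gword.py</b> follows the same pattern we have used
throughout the book. Here is a sketch under the same assumed table and column
names as the previous sketch; the real program writes its biggest counts to
<b>gword.jsonp</b> rather than printing them:
</font><pre><font size="4" color="blue">
import sqlite3

conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()

counts = dict()
cur.execute('SELECT subject FROM Messages')
for (subject,) in cur:
    if subject is None:
        continue
    for word in subject.lower().split():
        counts[word] = counts.get(word, 0) + 1

# Show the most common subject-line words
top = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
for word, count in top[:10]:
    print(word, count)
</font></pre><font color="black">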
A second visualization is produced by <b>gline.py</b>. It computes e-mail
participation by organizations over time.
</font><pre><font size="4" color="blue">
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 10 Oranizations
['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu',
'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com',
'stanford.edu', 'ox.ac.uk']
Output written to gline.jsonp
</font></pre><font color="black">Its output is written to <b>gline.jsonp</b> which is visualized using <b>gline.htm</b>.<br />
</font><div align="center"><font color="black"><img src="cfbook024.png" /></font></div><font color="black">
<br />
This is a relatively complex and sophisticated application that
does some real data retrieval, cleaning, and visualization.</font><br />
<br />
<hr />
<a href="cfbook015.html"><img src="previous_motif.gif" alt="Previous" /></a>
<a href="index.html"><img src="contents_motif.gif" alt="Up" /></a>
<a href="cfbook017.html"><img src="next_motif.gif" alt="Next" /></a>
</body>
</html>