Commit 9485f13: changes to day3

sirrice committed Jan 11, 2012
1 parent c4627eb
Showing 3 changed files with 31 additions and 20 deletions.
Makefile: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
all: labs

labs: day0/README.md day1/README.md day2/README.md day3/regression.py day3/hypothesis_testing.py
	cp -r ./day0 ./day1 ./day2 ./day3 ./day4 /tmp/dataiap_html/
	python resources/markdown/markdown_headers.py day0/README.md /tmp/dataiap_html/day0/index.html
	python resources/markdown/markdown_headers.py day1/README.md /tmp/dataiap_html/day1/index.html
	python resources/markdown/markdown_headers.py day2/README.md /tmp/dataiap_html/day2/index.html
	python resources/markdown/markdown_headers.py day3/README.md /tmp/dataiap_html/day3/index.html
	python resources/markdown/markdown_headers.py day4/README.md /tmp/dataiap_html/day4/index.html
	python resources/hacco/hacco.py day3/regression.py -d /tmp/dataiap_html/day3/ #/tmp/dataiap_html
	python resources/hacco/hacco.py day3/hypothesis_testing.py -d /tmp/dataiap_html/day3/ #/tmp/dataiap_html
	cp -r ./day0 ./day1 ./day2 ./day3 ./day4 /tmp/dataiap_html/
	echo "\n\nnow do: \n\tgit checkout gh-pages\n\tcp -r /tmp/dataiap_html/* .\n"
day3/README.md: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# Day 3

Today's lab is broken into two parts.

1. [Hypothesis testing and T-tests](./hypothesis_testing.html)
1. [Regressions](./regression.html)
day3/regression.py: 23 additions & 19 deletions
@@ -46,14 +46,17 @@
# process](https://github.com/dataiap/dataiap/blob/master/datasets/county_health_rankings/README)
# if you're interested.
#
# There's still a bit of work to do to load the data. Some of the
# YPLL values are marked "Unreliable" in a column of ypll.csv, and we
# don't want to train our regression on these. Similarly, some of the
# columns of additional measures are empty, and we want to discard
# these. Finally, there is a row per state that summarizes the
# state's statistics, and we want to ignore that row since we are
# doing a county-by-county analysis. Here's a function, `read_csv`, that
# will read the desired columns from one of the csv files.
# We need to perform some data cleaning and filtering when loading the
# data. There is a column called "Unreliable" that is marked when we
# shouldn't trust the YPLL data, and we want to ignore those rows.
# Also, some of the rows won't contain data for some of the additional
# measures. For example, Yakutat, Alaska doesn't have a value for %
# child illiteracy. We want to skip those rows as well. Finally, there
# is a row per state that summarizes the state's statistics. It has an
# empty value for the "county" column, and we want to ignore those
# rows since we are doing a county-by-county analysis. Here's a
# function, `read_csv`, that will read the desired columns from one of
# the csv files.

import csv

@@ -72,17 +75,18 @@ def read_csv(file_name, cols, check_reliable):
        pass
    return rows
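# The body of `read_csv` is collapsed in this diff view, so here is a
# rough sketch only of the cleaning steps described above. It assumes a
# csv layout with "State", "County", and "Unreliable" headers; those
# names are guesses for illustration, not necessarily the course's
# actual code.

import csv

def read_csv_sketch(file_name, cols, check_reliable):
    rows = {}
    for row in csv.DictReader(open(file_name)):
        if not (row.get('County') or '').strip():
            continue  # per-state summary row: skip it
        if check_reliable and (row.get('Unreliable') or '').strip():
            continue  # YPLL value flagged unreliable: skip it
        values = [(row.get(col) or '').strip() for col in cols]
        if any(v == '' for v in values):
            continue  # missing one of the requested measures: skip it
        try:
            rows[(row['State'], row['County'])] = [float(v) for v in values]
        except ValueError:
            pass  # non-numeric value: skip the row
    return rows

# A hypothetical call (file and column names are illustrative):
#
#   ypll = read_csv_sketch('ypll.csv', ['YPLL Rate'], True)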

# The function returns a dictionary mapping each state/county to the
# values of the columns listed in the array `cols`. It handles all of
# the dirty data: data marked unreliable, state-only data, and missing
# columns.
#
# All of this data cleaning across different .csv files will result in
# some county YPLL data being dropped as unreliable, and some county
# additional measures data being dropped for having missing columns.
# We need to do what database folks call a **join** between the two
# county datasets so that only the counties remaining in both datasets
# will be considered. This is handled by the function `get_arrs`:
# The function takes as input the csv filename, an array of column
# names to extract, and whether or not it should check for and discard
# unreliable data. It returns a dictionary mapping each state/county
# to the values of the columns specified in `cols`. It handles all of
# the dirty data: data marked unreliable, state-only data, and missing
# columns.
#
# When we call `read_csv` multiple times with different csv files, a
# row that is dropped from one csv file may be kept in another. We
# need to do what database folks call a **join** between the `dict`
# objects returned from `read_csv` so that only the counties remaining
# in both datasets will be considered. This is handled by the function
# `get_arrs`:

import numpy

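# The rest of the file is collapsed in this diff view. As a rough
# sketch of the join described above (the signature is an assumption,
# not the actual `get_arrs` code): keep only the counties present in
# both dicts, then build numpy arrays over those shared keys in a
# consistent order, using the `numpy` import above.

def get_arrs_sketch(ypll_dict, measures_dict):
    shared = sorted(set(ypll_dict) & set(measures_dict))
    ypll_arr = numpy.array([ypll_dict[k] for k in shared])
    measures_arr = numpy.array([measures_dict[k] for k in shared])
    return ypll_arr, measures_arr

# Hypothetical usage, reusing the `read_csv_sketch` above (file and
# column names are illustrative):
#
#   ypll = read_csv_sketch('ypll.csv', ['YPLL Rate'], True)
#   measures = read_csv_sketch('additional_measures.csv',
#                              ['% child illiteracy'], False)
#   ypll_arr, measures_arr = get_arrs_sketch(ypll, measures)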
