Commit 9485f13: changes to day3

sirrice committed Jan 11, 2012
1 parent c4627eb
Showing 3 changed files with 31 additions and 20 deletions.
Makefile: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
all: labs

labs: day0/README.md day1/README.md day2/README.md day3/regression.py day3/hypothesis_testing.py
	cp -r ./day0 ./day1 ./day2 ./day3 ./day4 /tmp/dataiap_html/
	python resources/markdown/markdown_headers.py day0/README.md /tmp/dataiap_html/day0/index.html
	python resources/markdown/markdown_headers.py day1/README.md /tmp/dataiap_html/day1/index.html
	python resources/markdown/markdown_headers.py day2/README.md /tmp/dataiap_html/day2/index.html
	python resources/markdown/markdown_headers.py day3/README.md /tmp/dataiap_html/day3/index.html
	python resources/markdown/markdown_headers.py day4/README.md /tmp/dataiap_html/day4/index.html
	python resources/hacco/hacco.py day3/regression.py -d /tmp/dataiap_html/day3/ #/tmp/dataiap_html
	python resources/hacco/hacco.py day3/hypothesis_testing.py -d /tmp/dataiap_html/day3/ #/tmp/dataiap_html
	cp -r ./day0 ./day1 ./day2 ./day3 ./day4 /tmp/dataiap_html/
	echo "\n\nnow do: \n\tgit checkout gh-pages\n\tcp -r /tmp/dataiap_html/* .\n"
day3/README.md: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# Day 3

Today's lab is broken into two parts.

1. [Hypothesis testing and T-tests](./hypothesis_testing.html)
1. [Regressions](./regression.html)
day3/regression.py: 23 additions & 19 deletions
@@ -46,14 +46,17 @@
# process](https://github.com/dataiap/dataiap/blob/master/datasets/county_health_rankings/README)
# if you're interested.
#
# There's still a bit of work to do to load the data. Some of the
# YPLL values are marked "Unreliable" in a column of ypll.csv, and we
# don't want to train our regression on these. Similarly, some of the
# columns of additional measures are empty, and we want to discard
# these. Finally, there is a row per state that summarizes the
# state's statistics, and we want to ignore that row since we are
# doing a county-by-county analysis. Here's a function, `read_csv`, that
# will read the desired columns from one of the csv files.
# We need to perform some data cleaning and filtering when loading the
# data. There is a column called "Unreliable" that is marked when we
# shouldn't trust the YPLL data, and we want to ignore those rows.
# Also, some of the rows won't contain data for some of the additional
# measures. For example, Yakutat, Alaska doesn't have a value for %
# child illiteracy. We want to skip those rows as well. Finally, there
# is a row per state that summarizes the state's statistics. It has an
# empty value for the "county" column, and we want to ignore those
# rows since we are doing a county-by-county analysis. Here's a
# function, `read_csv`, that will read the desired columns from one of
# the csv files.

import csv

@@ -72,17 +75,18 @@ def read_csv(file_name, cols, check_reliable):
        pass
    return rows
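# The body of `read_csv` is collapsed in this diff view, so here is a
# rough sketch only of the cleaning steps described above. It assumes a
# csv layout with "State", "County", and "Unreliable" headers; those
# names are guesses for illustration, not necessarily the course's
# actual code.

import csv

def read_csv_sketch(file_name, cols, check_reliable):
    rows = {}
    for row in csv.DictReader(open(file_name)):
        if not (row.get('County') or '').strip():
            continue  # per-state summary row: skip it
        if check_reliable and (row.get('Unreliable') or '').strip():
            continue  # YPLL value flagged unreliable: skip it
        values = [(row.get(col) or '').strip() for col in cols]
        if any(v == '' for v in values):
            continue  # missing one of the requested measures: skip it
        try:
            rows[(row['State'], row['County'])] = [float(v) for v in values]
        except ValueError:
            pass  # non-numeric value: skip the row
    return rows

# A hypothetical call (file and column names are illustrative):
#
#   ypll = read_csv_sketch('ypll.csv', ['YPLL Rate'], True)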

# The function returns a dictionary mapping each state/county to the
# values of the columns listed in the array `cols`. It handles all of
# the dirty data: data marked unreliable, state-only data, and missing
# columns.
#
# All of this data cleaning across different .csv files will result in
# some county YPLL data being dropped as unreliable, and some county
# additional measures data being dropped for having missing columns.
# We need to do what database folks call a **join** between the two
# county datasets so that only the counties remaining in both datasets
# will be considered. This is handled by the function `get_arrs`:
# The function takes as input the csv filename, an array of column
# names to extract, and whether or not it should check for and discard
# unreliable data. It returns a dictionary mapping each state/county
# to the values of the columns specified in `cols`. It handles all of
# the dirty data: data marked unreliable, state-only data, and missing
# columns.
#
# When we call `read_csv` multiple times with different csv files, a
# row that is dropped from one csv file may be kept in another. We
# need to do what database folks call a **join** between the `dict`
# objects returned from `read_csv` so that only the counties remaining
# in both datasets will be considered. This is handled by the function
# `get_arrs`:

import numpy

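# The rest of the file is collapsed in this diff view. As a rough
# sketch of the join described above (the signature is an assumption,
# not the actual `get_arrs` code): keep only the counties present in
# both dicts, then build numpy arrays over those shared keys in a
# consistent order, using the `numpy` import above.

def get_arrs_sketch(ypll_dict, measures_dict):
    shared = sorted(set(ypll_dict) & set(measures_dict))
    ypll_arr = numpy.array([ypll_dict[k] for k in shared])
    measures_arr = numpy.array([measures_dict[k] for k in shared])
    return ypll_arr, measures_arr

# Hypothetical usage, reusing the `read_csv_sketch` above (file and
# column names are illustrative):
#
#   ypll = read_csv_sketch('ypll.csv', ['YPLL Rate'], True)
#   measures = read_csv_sketch('additional_measures.csv',
#                              ['% child illiteracy'], False)
#   ypll_arr, measures_arr = get_arrs_sketch(ypll, measures)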
