D4D-W205-Final

Final for W205 - Working With Data For Democracy for Election Transparency

Members of this final project: Nikki Haas I-Ching Wang Divya Sriram

Before analyzing data, the raw data must be converted properly.

Clone this repo
Choose a state to analyze
Go to the following website and download zip folder of data for your state (ex: Ohio or OH) https://www2.census.gov/census_2010/04-Summary_File_1/
Unzip the file
Add the unzipped file as is to the cloned repo
Run the 'raw_to_table.py' python script with the following command: for i in p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p12a p12b p12d p12e p12f p12g p12h p12i p13 p13a p13b p13d p13e p13f p13g p13h p13i p14 p35 p36 p37a p37b p37d p37e p37f p37g p37h p37i p44 p45 p46 p47 p48 p49; do python raw_to_table.py -f oh-raw/ -d $i -s oh -t; done Hint: if you clone the github repo, all you need to do is add in a raw-data folder that consists of all the files that were unzipped from the US Census zip file that was downloaded. All default paths should then work.

The help menu for this script is: python raw_to_table.py -h

Create properly formatted tab-delimited tables and geo table from raw US Census Gov data.

-f path to folder that contains the raw data. required -d name of the table you want to create (ex. p1, p12a, etc). required -o output file name (default is state_table_2010.txt. ie, tn_p1_2010.txt) -c headers folder (default is 'header_files_with_text_clean', located in folder you are running in) -m map file (default is 'sf1_table_map.csv', located in folder you are running in) -s state (ex. tn, ca, nc, etc) required -t text header names (default is number header names. trigger text headers by putting '-t')

The output will be: (a) tables without headers (ex: oh_p9_2010_no_header.txt) - tables without headers were created in attempt to use hive (b) tables with headers (ex: oh_p9_2010.txt) (c) oh_geo_cleaned_2010_no_header.txt
Put all newly created files (^above) into a folder for that state (ex: put all files into a folder called "oh_data_cleaned")

Now that all the raw data has been made into clean datasets (tables), we can run some analysis on our data.

Now run the 'gerry_analysis.py' python script to create a bar chart showing the 'Minority Deviation from Mean in Contentious Districts' for your state.

$ python gerry_analysis.py -s oh -i oh_data_cleaned/

Help menu for this script is: python gerry_analysis.py -h

Create a plot for the contentious districts difference in population against the mean.
```
 -i  Input folder.  This should be the folder that contains files generated from raw_to_table.py
 -s  2-letter State identifier
```
The output will be a png image file for the race profiles.

##Joining the Data Together

Once the geo tables, the SF1 tables of importance, and the County/CD maps are loaded into HDFS, we can now start doing the fun stuff. The SF1 tables contain multiple levels of geographic representation, including the whole state, blocks, Census tracts, and counties. For our w205 analysis, we used the County data. D4D will ultimately use the block data when the shapefile team completes their cleaning project. For our purposes, it was not feasible to groom both SF1 data and Shapefiles. Shapefiles require knowledge of CAD and GPS data that are beyond the scope of our project. We can subset each p-table by county via joining to the geo file and the county/distric map:

SELECT
p.*,
g.*,
c.*
FROM tn_p1 p
JOIN tn_geo_cleaned g
ON concat(p.filedid,'-',p.stusab,'-',p.chariter,'-',p.cifsn,'-',p.logrecno) = concat(g.filedid,'-',g.stusab,'-',g.chariter,'-',g.cifsn,'-',g.logrecno)
JOIN tn_dist_county_map c
on UPPER(g.location) = UPPER(c.county)
WHERE
UPPER(g.location) LIKE '%COUNTY'
and UPPER(g.location) NOT LIKE '%(PART)%';

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.idea		.idea
.ipynb_checkpoints		.ipynb_checkpoints
ching_playground_notebook		ching_playground_notebook
header_files_with_text_clean		header_files_with_text_clean
nc_data_cleaned		nc_data_cleaned
oh_data_cleaned		oh_data_cleaned
old_scripts		old_scripts
tn_data_cleaned		tn_data_cleaned
DivChingNik_Final_Presentation_w205.pdf		DivChingNik_Final_Presentation_w205.pdf
NC_exploratory_data_analysis.ipynb		NC_exploratory_data_analysis.ipynb
OH_exploratory_data_analysis.ipynb		OH_exploratory_data_analysis.ipynb
README.md		README.md
TN_exploratory_data_analysis.ipynb		TN_exploratory_data_analysis.ipynb
gerry_analysis.py		gerry_analysis.py
hive_analysis.sql		hive_analysis.sql
hive_base.sql		hive_base.sql
load_data_oh.sh		load_data_oh.sh
load_data_tn.sh		load_data_tn.sh
nc_dist_county_map.csv		nc_dist_county_map.csv
nc_geo_cleaned.csv		nc_geo_cleaned.csv
oh_dist_county_map.csv		oh_dist_county_map.csv
oh_geo_cleaned.csv		oh_geo_cleaned.csv
raw_to_table.py		raw_to_table.py
sf1_table_map.csv		sf1_table_map.csv
tn_african_deviation.png		tn_african_deviation.png
tn_asian_deviation.png		tn_asian_deviation.png
tn_caucasian_deviation.png		tn_caucasian_deviation.png
tn_dist_county_map.csv		tn_dist_county_map.csv
tn_gerrymandering_race_profile.png		tn_gerrymandering_race_profile.png
tn_latino_deviation.png		tn_latino_deviation.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

D4D-W205-Final

$ python gerry_analysis.py -s oh -i oh_data_cleaned/

About

Releases

Packages

Languages

chingnaw/D4D-W205-Final

Folders and files

Latest commit

History

Repository files navigation

D4D-W205-Final

$ python gerry_analysis.py -s oh -i oh_data_cleaned/

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages