The Relationships among Gun Violence, Educational Attainment, Poverty Status, and Unemployment Rate in the US
Siyu Duan - [email protected]
Baisheng Qiu - [email protected]
We have two Python files:
zip_code_crawler.py only demonstrates how we obtained zip codes through API requests; its output can be viewed in the file "zip_code_crawler1.csv".
gv_violence_analysis.py shows how we did the analysis and visualized our data.
All the datasets we used can be downloaded here - https://drive.google.com/drive/folders/1fT9W7jdMuOJouXkO0_3aHLns31UShLUb?usp=sharing
Gun Violence Data - https://github.com/jamesqo/gun-violence-data
Education Attainment - https://data.census.gov/cedsci/table?t=Educational%20Attainment&g=0100000US.860000&y=2017&tid=ACSST5Y2017.S1501
Unemployment - https://data.census.gov/cedsci/table?t=Employment%20and%20Labor%20Force%20Status&g=0100000US.050000&y=2017&tid=ACSST1Y2017.S2301
Population & Sex & Age - https://data.census.gov/cedsci/table?t=Populations%20and%20People&g=0100000US.860000&y=2017&tid=ACSST5Y2017.S0101
Hypothesis 1: There is a positive correlation between the number of gun violence cases and the unemployment rate.
Hypothesis 2: There is a positive correlation between the number of gun violence cases and poverty status.
Hypothesis 3: There is a negative correlation between the number of gun violence cases and the educational level.
Because population is distributed unevenly across cities, we decided to do our analysis at the zip-code level.
From the above graph, we can observe that gun violence cases have increased every year since 2014. The number of cases reached about 50k in 2014, 55k in 2016, and 60k in 2017, so gun violence increased by 20% in 3 years.
The three states with the most gun violence incidents between 2013 and 2018 are California, Florida, and Illinois. In almost every year from 2013 to 2018, California and Illinois were in the top 3. Although California and Illinois have many incidents, that does not mean they are the most dangerous states, since raw counts are not adjusted for population.
Incidents with a high number of people killed usually also have a high number of people injured. These cases are identified as mass shooting incidents.
Most of the participants were unharmed; 21% of participants were arrested, 25.3% were injured, and 13% were killed.
From the three graphs above, we can observe a peak around July in all three years. This is interesting because July 4th is celebrated as Independence Day in the United States. We pulled out the records from July 4th to see whether the peak in that month was affected by Independence Day. We also expected a peak in November 2016, because election day was on November 8th, 2016.
Looking at the incident counts for 2015, 2016, and 2017, July 4th was always among the top 3 days by number of cases in each year. On November 8th, 2016, the number of incidents wasn't as high as expected, so that guess turned out to be wrong.
This is the workflow for our project.
To get the zip code for each incident address, we requested geolocation data from the Google API. We got 61400 zip codes for clustering in the analysis.
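As a rough illustration of the zip code lookup, the sketch below pulls the postal code out of a geocoding response. It assumes the JSON layout of the Google Geocoding API (a "results" list whose entries carry "address_components" tagged with "types"); the function name extract_zip_code and the sample response are hypothetical, and the actual request logic lives in zip_code_crawler.py.

```python
def extract_zip_code(geocode_response):
    """Return the first postal code found in a geocoding response, or None."""
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            # The postal code component is the one tagged "postal_code".
            if "postal_code" in component.get("types", []):
                return component.get("long_name")
    return None


# Hypothetical, trimmed-down response used only for illustration.
sample_response = {
    "status": "OK",
    "results": [
        {
            "address_components": [
                {"long_name": "Chicago", "short_name": "Chicago",
                 "types": ["locality", "political"]},
                {"long_name": "60601", "short_name": "60601",
                 "types": ["postal_code"]},
            ]
        }
    ],
}

print(extract_zip_code(sample_response))
print(extract_zip_code({"status": "ZERO_RESULTS", "results": []}))
```

Returning None for addresses without a postal code keeps the missing cases explicit, so they can be dropped in the cleaning step.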
We chose the gun violence (GV) rate for correlation analysis and the GV level for predictive analysis. The GV level is derived from the GV rate, and the splitting method is described later. For educational level, we examined two indexes with different definitions to figure out which one helps improve the performance of our models.
For zip codes:
Removed null and None values in the zip code column, and null values and empty fields in the address column.
Removed irregular zip codes such as 27410 (it had too many cases because that zip code represents an empty address).
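The two zip code cleaning steps can be sketched as below. This is a minimal standard-library version, not the project's own code; the field names "zip_code" and "address", the helper names, and the sample rows are assumptions made for illustration.

```python
# Zip codes returned for empty addresses (27410 is the example from this project).
PLACEHOLDER_ZIP_CODES = {"27410"}


def is_blank(value):
    """True for None, empty or whitespace strings, and the text 'None' or 'nan'."""
    return value is None or str(value).strip().lower() in {"", "none", "nan"}


def clean_zip_rows(rows):
    """Keep only rows with a usable address and a non-placeholder zip code."""
    cleaned = []
    for row in rows:
        zip_code = row.get("zip_code")
        if is_blank(zip_code) or is_blank(row.get("address")):
            continue  # missing zip code or missing address
        if str(zip_code).strip() in PLACEHOLDER_ZIP_CODES:
            continue  # placeholder zip code that stands for an empty address
        cleaned.append(row)
    return cleaned


# Hypothetical rows used only for illustration.
rows = [
    {"address": "100 Main St", "zip_code": "60601"},
    {"address": "", "zip_code": "27410"},
    {"address": "5 Oak Ave", "zip_code": None},
    {"address": "7 Elm St", "zip_code": "None"},
    {"address": "9 Pine Rd", "zip_code": "27410"},
]

print([row["zip_code"] for row in clean_zip_rows(rows)])
```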
For variables:
According to their definitions, we merged population, unemployment rate, poverty rate, percentage of people with a high school degree or higher, and percentage of people with a Bachelor's degree or higher with the Gun Violence dataset. Then we checked these variables for missing values and found two types: NaN (Inf) and "-".
Before cleaning, there are 9847 rows.
To view the missing values, we counted the missing values of the two types mentioned above.
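Counting the two kinds of missing values could look like the sketch below. The function name count_missing and the sample column are hypothetical; the idea is only that NaN or Inf shows up as a float while "-" shows up as text.

```python
import math


def count_missing(values):
    """Count NaN/Inf floats and "-" placeholders in one column of values."""
    counts = {"nan_or_inf": 0, "dash": 0}
    for value in values:
        if isinstance(value, str):
            if value.strip() == "-":
                counts["dash"] += 1
        elif isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
            counts["nan_or_inf"] += 1
    return counts


# Hypothetical column mixing valid numbers with both missing-value types.
poverty_rate = [12.4, float("nan"), "-", 8.1, float("inf"), "15.0"]
print(count_missing(poverty_rate))
```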
And then we viewed the skewness of all variables:
The distribution of poverty rate is right-skewed.
The distribution of unemployment rate is right-skewed.
The distribution of education level (High School or Higher) is left-skewed, as is that of education level (Bachelor's or Higher).
We found that all the variables are skewed, especially GV rate, so we used a log transformation to handle the skewness of GV rate.
Since they are all numerical variables and highly skewed, with only a very small number of missing values, we decided to remove the rows with missing values.
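To show why a log transformation helps with right skewness, here is a small sketch. It uses the moment-based skewness coefficient g1 = m3 / m2^1.5 and made-up, strictly positive values; whether the project used the natural log or a variant such as log(1 + x) is not stated, so the natural log here is an assumption.

```python
import math


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up rates that double at each step (not project data).
raw_rates = [1, 2, 4, 8, 16, 32, 64, 128, 256]
log_rates = [math.log(v) for v in raw_rates]

print(round(skewness(raw_rates), 3))
print(round(skewness(log_rates), 3))
```

The raw values have a long right tail, while their logs are evenly spaced, so the skewness drops to roughly zero after the transformation.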
After data cleaning, there are 9819 samples:
We did descriptive statistics:
Due to the high skewness, the GV rate was divided into Levels 0 - 7 according to [ <, median-3std, median-2std, median-std, median, median+std, median+2std, median+3std, > ]; the seven cut points give eight levels. After splitting, each level contains the following number of samples:
We constructed a correlation matrix and a pairwise matrix to display the correlations among the variables.
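The level assignment can be sketched with the standard library as follows. The function names are hypothetical, and two details are assumptions because the text does not specify them: the population standard deviation is used, and a value equal to a cut point is placed in the higher level.

```python
import bisect
import statistics


def level_edges(median, std):
    """Seven cut points: median - 3*std, ..., median, ..., median + 3*std."""
    return [median + k * std for k in range(-3, 4)]


def gv_level(value, median, std):
    """Level 0 is below the first cut point, Level 7 is at or above the last."""
    return bisect.bisect_right(level_edges(median, std), value)


def assign_levels(values):
    """Assign a level to every value using the median and std of the values."""
    median = statistics.median(values)
    std = statistics.pstdev(values)  # population std is an assumption
    return [gv_level(v, median, std) for v in values]


# Made-up values, not project data.
print(assign_levels([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```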
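A correlation matrix of this kind can be sketched as below, using the Pearson coefficient. The column names and numbers are made up for illustration and are not the project's data or results.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict with the pairwise Pearson correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns used only for illustration.
columns = {
    "unemployment_rate": [3.0, 5.0, 7.0, 9.0],
    "poverty_rate": [6.0, 10.0, 14.0, 18.0],
    "bachelor_or_higher": [40.0, 30.0, 20.0, 10.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```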
As we can observe, zip code areas with a higher unemployment rate have a higher rate of gun violence. The correlation coefficient between unemployment rate and GV level is 0.292, so we can accept hypothesis 1, even though the relationship is not strong.
As we can observe, zip code areas with a higher poverty rate have a higher rate of gun violence. The correlation coefficient between poverty rate and GV level is 0.442, so we can accept hypothesis 2.
As we can observe, zip code areas with a higher education level have a lower rate of gun violence. The correlation coefficient between education level (Bachelor's or higher) and GV level is -0.384, so we can accept hypothesis 3.
We also used the variables to build a linear regression model. The summary of the multiple linear regression shows that the variables are significantly related (p < 0.05), and the model has a relatively high goodness of fit (R^2).
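For readers who want to see what the regression fit and R^2 amount to, here is a minimal sketch that solves the normal equations with the standard library. It is not the project's code (which likely relied on a statistics package), it does not compute p-values, and the function names and toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return (coefficients with the intercept first, R^2) for y ~ rows."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return beta, 1.0 - ss_res / ss_tot


# Toy data generated from y = 1 + 2*x1 - 3*x2 (not project data).
toy_rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
toy_y = [1, 3, -2, 0, 2]

beta, r_squared = fit_ols(toy_rows, toy_y)
print([round(b, 3) for b in beta], round(r_squared, 3))
```

R^2 is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, so a value near 1 means the predictors explain most of the variation in the response.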
Although all the hypotheses we made were accepted, the correlations for all three variables are not that strong. Thus, we are looking for other variables that may have a higher correlation with the gun violence rate.
We found that the correlation between the unemployment rate and the poverty rate has a much higher coefficient, 0.6213, and the correlation coefficient between poverty and education level is 0.5668. The next step is to deal with multicollinearity.
For predictive model construction, we will continue to train other models and tune parameters to optimize our models. We may also consider using ElasticNet regression, following Abhiram's advice.
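One common way to gauge multicollinearity is the variance inflation factor (VIF). The sketch below uses the two-predictor special case, VIF = 1 / (1 - r^2), applied to the coefficients reported above. This is only a rough pairwise check and an assumption about how one might proceed: a full VIF regresses each predictor on all the others, and the function name pairwise_vif is hypothetical.

```python
def pairwise_vif(r):
    """VIF of a predictor when its only companion predictor correlates with it at r."""
    return 1.0 / (1.0 - r ** 2)


# Correlation coefficients reported in the text.
print(round(pairwise_vif(0.6213), 3))  # unemployment rate vs. poverty
print(round(pairwise_vif(0.5668), 3))  # poverty vs. education level
```

A VIF of 1 means no inflation; rules of thumb often treat values above 5 or 10 as a sign of problematic multicollinearity.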