
siyuduan6/2021_Spring_finals

 
 


2021_Spring_finals

The Relationships of Gun Violence, Educational Attainment, Poverty Status And Unemployment Rate in US

Introduction

Gun violence represents a major threat to the health and safety of all Americans. A simple analysis of the gun violence data shows that the number of gun violence cases has increased steadily over more than four years. Whether to ban guns has long been a controversial topic in the United States. Some people argue that guns should not be banned because the blame for a shooting lies with the person who uses the gun, not with the gun itself. We therefore wanted to examine whether there is any correlation between the gun violence rate and education level, poverty status, and unemployment.

Team Members

Siyu Duan
Baisheng Qiu

Datasets used for Analysis

We have two Python files:

zip_code_crawler.py only demonstrates how we obtained zip codes through API requests; its output can be viewed in the file "zip_code_crawler1.csv".

gv_violence_analysis.py shows how we did the analysis and visualized our data.

All the datasets we used can be downloaded here: https://drive.google.com/drive/folders/1fT9W7jdMuOJouXkO0_3aHLns31UShLUb?usp=sharing

Sources:

Gun violence Data - https://github.com/jamesqo/gun-violence-data

Education Attainment - https://data.census.gov/cedsci/table?t=Educational%20Attainment&g=0100000US.860000&y=2017&tid=ACSST5Y2017.S1501

Unemployment - https://data.census.gov/cedsci/table?t=Employment%20and%20Labor%20Force%20Status&g=0100000US.050000&y=2017&tid=ACSST1Y2017.S2301

Poverty - https://data.census.gov/cedsci/table?t=Income%20and%20Poverty&g=0100000US.860000&y=2017&tid=ACSST5Y2017.S1701

Population & Sex & Age - https://data.census.gov/cedsci/table?t=Populations%20and%20People&g=0100000US.860000&y=2017&tid=ACSST5Y2017.S0101

Hypothesis

Hypothesis 1: There is a positive correlation between the number of gun violence cases and the unemployment rate.

Hypothesis 2: There is a positive correlation between the number of gun violence cases and poverty status.

Hypothesis 3: There is a negative correlation between the number of gun violence cases and the educational level.

Because population is distributed unevenly across cities, we decided to do our analysis at the zip-code level.

Background:

From the graph above, we can observe that gun violence cases have increased every year since 2014. The number of cases reached 50k in 2014, 55k in 2016, and 60k in 2017. Gun violence increased by 20% in 3 years.

The three states with the most gun violence incidents between 2013 and 2018 are California, Florida, and Illinois. In almost every year of that period, California and Illinois were among the top three. Although California and Illinois have many incidents, that does not mean they are the most dangerous states, because raw counts do not account for population size.

Incidents with a high number of people killed usually also have a high number of people injured. These cases are identified as mass shooting incidents. Most of the participants were unharmed; 21% of participants were arrested, 25.3% were injured, and 13% were killed.

July 4th

The three graphs above show a peak around July in all three years. This is interesting because July 4th is celebrated as Independence Day in the United States. We pulled out the records for July 4th to see whether the peak in that month was affected by Independence Day. We also expected a peak in November 2016, because Election Day was on November 8th, 2016. Looking at the incident counts for 2015, 2016, and 2017, July 4th was always among the top three days of the year by number of cases. On November 8th, 2016, however, the number of incidents was not as high as expected, so that guess did not hold.

Workflow

This is the workflow for our project

Zip Code Crawling

To get the zip code corresponding to each incident address, we requested geolocation data from the Google API. This gave us 61400 zip codes for clustering in the analysis.
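The crawler itself is not reproduced in this README, so the following is only a minimal sketch of the parsing step. It assumes the response follows the JSON shape of the Google Geocoding API, where each result carries `address_components` tagged with `types`. The function name `extract_zip_code` and the sample response are hypothetical, and no network request is made.

```python
from typing import Optional


def extract_zip_code(geocode_response: dict) -> Optional[str]:
    """Return the first postal code found in a Geocoding-style JSON response."""
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            # A postal code component is tagged with the "postal_code" type.
            if "postal_code" in component.get("types", []):
                return component.get("long_name")
    return None  # no postal code in the response


# Hypothetical, hand-written response used only for illustration.
sample_response = {
    "status": "OK",
    "results": [
        {
            "address_components": [
                {"long_name": "Champaign", "short_name": "Champaign",
                 "types": ["locality", "political"]},
                {"long_name": "61820", "short_name": "61820",
                 "types": ["postal_code"]},
            ]
        }
    ],
}

print(extract_zip_code(sample_response))
```

Returning `None` when no postal code is present makes it straightforward to filter out unresolved addresses in a later cleaning step.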

Variable Definitions

We chose the GV rate for correlation analysis and the GV level for predictive analysis. The GV level is derived from the GV rate; the splitting method is described later. For educational level, we examined two indexes with different definitions and tried to figure out which one improves the performance of our models.

Data Cleaning and Descriptions

Missing Value

For zip codes:

Removed the null and None values in the zipcode column, and the null values and empty fields in the address column.

Removed irregular zip codes such as 27410, which appears in too many cases because it stands in for an empty address.
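The zip-code cleaning rules above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the field names `address` and `zip_code`, the helper names, and the sample records are hypothetical, and the five-digit check is an extra assumption not mentioned in the text.

```python
# Zip codes known to stand in for an empty address (27410 is the one named above).
PLACEHOLDER_ZIP_CODES = {"27410"}


def clean_zip(value):
    """Return a normalized zip code string, or None if it should be discarded."""
    if value is None:
        return None
    text = str(value).strip()
    if text in {"", "None"}:
        return None
    # Assumption: a valid zip code is exactly five digits.
    if len(text) != 5 or not text.isdigit():
        return None
    if text in PLACEHOLDER_ZIP_CODES:
        return None
    return text


def clean_records(records):
    """Keep records that have a non-empty address and a usable zip code."""
    cleaned = []
    for record in records:
        address = (record.get("address") or "").strip()
        zip_code = clean_zip(record.get("zip_code"))
        if address and zip_code is not None:
            cleaned.append({"address": address, "zip_code": zip_code})
    return cleaned


# Hypothetical records used only for illustration.
records = [
    {"address": "100 Main St", "zip_code": "60601"},
    {"address": "", "zip_code": "60602"},
    {"address": "5 Oak Ave", "zip_code": None},
    {"address": "9 Elm Rd", "zip_code": "None"},
    {"address": "1 Pine Ln", "zip_code": "27410"},
]

print(clean_records(records))
```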

For variables:

Following these definitions, we merged population, unemployment rate, poverty rate, the percentage of people with a high school degree or higher, and the percentage of people with a Bachelor's degree or higher into the Gun Violence dataset. We then checked these variables for missing values and found two types: NaN (Inf) and "-".

Before cleaning, there are 9847 rows.

To inspect the missing values, we counted both types mentioned above.
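A minimal sketch of counting the two kinds of missing values, NaN/Inf and the "-" placeholder, per column. The column names and sample rows are hypothetical, and the project's own counting code may differ.

```python
import math


def is_missing(value) -> bool:
    """True for the two missing-value types: NaN/Inf floats and the "-" string."""
    if isinstance(value, str):
        return value.strip() == "-"
    if isinstance(value, float):
        return math.isnan(value) or math.isinf(value)
    return False


def count_missing(rows, columns):
    """Count missing values per column across a list of row dictionaries."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if is_missing(row[column]):
                counts[column] += 1
    return counts


# Hypothetical rows used only for illustration.
rows = [
    {"zip": "60601", "poverty_rate": 12.5, "unemployment_rate": 6.1},
    {"zip": "60602", "poverty_rate": "-", "unemployment_rate": 4.0},
    {"zip": "60603", "poverty_rate": float("nan"), "unemployment_rate": float("inf")},
]

print(count_missing(rows, ["poverty_rate", "unemployment_rate"]))
```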

We then examined the skewness of all variables:

The distribution of the poverty rate is right-skewed.

The distribution of the unemployment rate is right-skewed.

The distributions of education level (High School or Higher) and education level (Bachelor's or Higher) are both left-skewed.

We found that all the variables are skewed, especially the GV rate, so we applied a log transformation to handle the skewness of the GV rate.
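To illustrate why a log transformation helps with a right-skewed variable, here is a small sketch that computes the moment-based skewness before and after the transform. It assumes a natural log and strictly positive rates (a rate of zero would need something like `math.log1p`); the text does not say which variant was used, and the sample values are made up.

```python
import math


def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed rates: a few very large values dominate.
gv_rates = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]

# Assumption: natural log on strictly positive values.
log_gv_rates = [math.log(v) for v in gv_rates]

print("skewness before:", skewness(gv_rates))
print("skewness after: ", skewness(log_gv_rates))
```

A positive skewness indicates a long right tail; after the transform, the value for this sample moves close to zero.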

Since all of these are numerical variables that are highly skewed and have only a very small number of missing values, we decided to remove the rows with missing values.

After data cleaning, there are 9819 samples:

We did descriptive statistics:

GV Level

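Due to the high skewness, the GV rate was divided into eight levels (Level 0 to Level 7) using seven cut points: median - 3 std, median - 2 std, median - std, median, median + std, median + 2 std, and median + 3 std. Values below the lowest cut point fall into Level 0, and values above the highest cut point fall into Level 7. After splitting, each level contains the following number of samples: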
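The level assignment can be sketched with the standard-library `bisect` module. Several details here are assumptions, since the text does not specify them: the population standard deviation is used, the cut points are computed on whatever values are passed in (raw or log-transformed), and a value equal to a cut point goes to the higher level. The function names and sample rates are hypothetical.

```python
from bisect import bisect_right
from statistics import median, pstdev


def gv_level_edges(rates):
    """Seven cut points: median + k * std for k = -3 .. 3."""
    med = median(rates)
    std = pstdev(rates)  # assumption: population standard deviation
    return [med + k * std for k in (-3, -2, -1, 0, 1, 2, 3)]


def gv_level(rate, edges):
    """Map a rate to Level 0..7 by counting how many cut points it reaches."""
    return bisect_right(edges, rate)


# Hypothetical rates used only for illustration.
rates = [1.0, 2.0, 3.0, 4.0, 5.0]
edges = gv_level_edges(rates)
levels = [gv_level(rate, edges) for rate in rates]

print(edges)
print(levels)
```

Seven cut points produce eight bins, which matches the Level 0 to Level 7 range described above.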

Correlation

We constructed a correlation matrix and a pairwise plot matrix to display the correlations between the variables.
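For reference, a correlation coefficient between two variables can be computed as in the sketch below. It implements the Pearson coefficient, which is an assumption: the text does not state which correlation measure was used. The sample values are made up and do not reproduce the project's results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values used only for illustration.
poverty_rate = [5.0, 10.0, 15.0, 20.0, 25.0]
gv_level_values = [2, 3, 3, 4, 5]

print(pearson_r(poverty_rate, gv_level_values))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.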

Hypothesis Test

Hypothesis 1

As we can observe, zip code areas with a higher unemployment rate tend to have a higher rate of gun violence. The correlation coefficient between the unemployment rate and the GV level is 0.292, so we can accept hypothesis 1, even though the relationship is not strong.

Hypothesis 2

As we can observe, zip code areas with a higher poverty rate tend to have a higher rate of gun violence. The correlation coefficient between the poverty rate and the GV level is 0.442, so we can accept hypothesis 2.

Hypothesis 3

As we can observe, zip code areas with a higher education level tend to have a lower rate of gun violence. The correlation coefficient between education level (Bachelor’s or higher) and the GV level is -0.384, so we can accept hypothesis 3.

Linear Regression

We also used the variables to build a linear regression model. The summary of the multiple linear regression shows that the variables are significantly related (p < 0.05) and that the model has a relatively high goodness of fit (R^2).
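As a rough illustration of what a multiple linear regression fit and its R^2 involve, the sketch below solves the normal equations with plain Python. It is not the project's model: the function names and data are hypothetical, and it does not compute the p-values reported in the summary, which would normally come from a statistics library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfectly collinear features?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, targets):
    """Return [intercept, coef_1, ..., coef_k] from the normal equations."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(features, targets, coefs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    preds = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row)) for row in features]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical data generated exactly from y = 1 + 2 * x1 - 3 * x2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [0.0, 2.0]]
targets = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in features]

coefs = fit_ols(features, targets)
print(coefs, r_squared(features, targets, coefs))
```

The singular-matrix check also hints at why strongly correlated predictors are a concern: the closer the features are to collinear, the less stable the solved coefficients become.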

Further Analysis

Although all the hypotheses we made were accepted, the correlations for all three variables are not that strong. We are therefore looking for other variables that may have a higher correlation with the gun violence rate.

We found that the correlation between the unemployment rate and poverty has a much higher coefficient of 0.6213, and the correlation coefficient between poverty and education level is 0.5668. The next step is to deal with multicollinearity.

For predictive model construction, we will continue to train other models and adjust parameters to optimize our models. We may also consider using ElasticNet regression, following Abhiram's advice.
