-
Notifications
You must be signed in to change notification settings - Fork 3
/
missing_value.Rmd
53 lines (34 loc) · 2.84 KB
/
missing_value.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
# Missing Value
```{r}
library(tidyverse)
load("data/tidy_survey_data.rda")
tidy_df <- tidy_survey %>%
select(UniqueID, qgender, qage, qrace, qeducation, qincome, qzipcmb, qSurveyZone, borough, travel_code, qlicense, gFREIGHT1_qFREIGHT1_mA,gFREIGHT1_qFREIGHT2_mA,gFREIGHT1_qFREIGHT3_mA,gFREIGHT1_qFREIGHT4_mA,gFREIGHTnew, qsafety1, qsafety2, qsafety3, qsafety4, qsafety5, qsafety6, Description,gIMPROVE1_qIMPROVE1_mA,gIMPROVE1_qIMPROVE2_mA,gIMPROVE1_qIMPROVE3_mA,gIMPROVE1_qIMPROVE4_mA, gIMPROVE1_qIMPROVE5_mA, qmarried, qsmartphone, qrent, qnyc, qchildren, qemployment, qzipwork1, qzipwork2, qzipwork3, qzipwork4, qzipwork5, qzipwork6, qtimework1, qtimework2, qtimework3, qtimework4, qtimework5, qtimework6, qtimehome1, qtimehome2, qtimehome3, qtimehome4, qtimehome5, qtimehome6, qindustry, `Work Place`, allwt) %>% filter(!is.na(UniqueID))
```
The first step to explore the data is to analyze its missing pattern. We use `visna` graph from `extracat` package to visualize the missing pattern.
```{r}
library(extracat)
visna(tidy_df,sort = "b")
```
It can be observed that the variable `qzipwork` which describes the Zipcode of the locations where the respondents work has the most missing value because it's an open question and a considerable amount of respondents do not report the location of their workplaces.
Also the `qsafety` variable group contains a large amount of missing values which will probably cause bias to our conclusions in analysis. `qsafety` mainly focuses on how the respondents feel about the safety of certain transportation. The 4 variables with high missing proportion are those concerning the safety of bicycle riding and driving. Naturally those who don't drive or ride bikes would discard the questions. As these questions are not related to the topic of interests, so we will remove these variables and focus on the rest.
The other variables have few or no missing value on which we can perform further exploratory analysis safely.
Given the condition of our data, we will mainly concentrate on these varibales:
1. dependent varibles:
- travel_code
- description
2. socioeconomic variables:
- gender
- age
- income
3. spatial varibles:
- qzipcmb (zipcode)
- borough
4. certain group of variables related to our topics:
- qnyc (years of living in New York City)
- qchildren (how many children do the respondents have?)
- qrent (do the respondents own their house or rent their residence?)
- gFREIGHT varible groups (how often do the respondents receive deliveries at home? )
5. weight:
- allwt (the weight for each record, provided by the survey)
In the process of exploration and analysis, we will also recode variables according to the situation and needs in analysis. The variables we focus on generally have very few missing values. So instead of replacing them with other values, we will just exclude them when conducting analysis and plotting.