-
Notifications
You must be signed in to change notification settings - Fork 13
/
10-Exploring-employee-salary-data-set.Rmd
660 lines (527 loc) · 32.4 KB
/
10-Exploring-employee-salary-data-set.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
# Exploring of employee salary data set
This original data set is taken from the Payroll Data from online "https://data.ct.gov/Government/State-Employee-Payroll-Data-Calendar-Year-2015-thr/virr-yb6n/data", Which include Calendar Year 2015 through the most recent pay period for Connecticut state employees. We take the subset only for 2017 fiscal year, which include the information of more than 1.77 million observations and 38 variables. There are 102101 employees listed there. For convenience of analysis, we randomly took 8500 employees from the original data. For our sampled data set ("state_empoyee_salary_data_2017.csv"), we are interested in the relationship between income with other variables like age, sex, ethnic, union and so on. We will focus on the full time employees who got 26 pay checks in 2017 fiscal year. First we will clean our sampled data set.
## Read in and clean data
```{r eval=FALSE}
raw0 <- read.csv("datasets/State_Employee_Payroll_Data_Calendar_Year_2015_through_Present.csv")
str(raw0)
employID <- unique(raw0$EmplId.Empl.Rcd)
sample <- sample(employID, 8500)
raw <- raw0[raw0$EmplId.Empl.Rcd %in% sample, ] %>% droplevels()
str(raw)
write.csv(raw, file = "datasets/state_employee_salary_data_2017.csv", row.names = FALSE)
```
From the sampled data set, We will pick full time employees as our goal and focus on the interested variables.
```{r message=FALSE}
library(dplyr)
raw <- read.csv("datasets/state_employee_salary_data_2017.csv")
raw1 <- filter(raw, Full.Part == "F") # pick full time employee
raw2 <- raw1[, -c(1:2, 5:9, 11:12, 19:23, 31:33, 35:36)] # keep only the interested variables
```
We remove the variables we are not interested in. Now we will check the situation of employees.
(ref:salary-EmployeeCount) Situation of check occurance of employees
```{r salary-EmployeeCount, fig.cap='(ref:salary-EmployeeCount)', fig.align='center'}
EmplId.Empl.Rcd <- as.character(raw2$EmplId.Empl.Rcd)
counts <- data.frame(table(EmplId.Empl.Rcd))
PersonOccurCount <- table(counts[, 2])
plot(PersonOccurCount, col = rainbow(30),
xlab = "Occurance of employee", ylab = "Count of employee")
```
The most employees occurrence is 26. We will focus on the employees who have 26 paychecks for whole year in 2017.
```{r}
subEmpl26 <- counts[which(counts$Freq != 26), ]
len <- length(subEmpl26$EmplId.Empl.Rcd)
for(i in 1:len){
j <- which(raw2$EmplId.Empl.Rcd == as.character(subEmpl26$EmplId.Empl.Rcd)[i])
if(i == 1) id = j
else if(i > 1) id = c(id, j)
}
raw26 <- raw2[-id, ] # remove employees with check count not equal 26
```
We removed employees with check count unequal to 26.
(ref:salary-CheckCount) The counts of checks for check date in 2017
```{r salary-CheckCount, fig.cap='(ref:salary-CheckCount)', fig.align='center'}
Check.Dt <- gsub(" 12:00:00 AM", "", raw26$Check.Dt)
counts <- table(Check.Dt)
barplot(counts, col = rainbow(26), axis.lty = 1, xlab = "Check Date", ylab = "Count")
```
The counts of checks for bi week almost the same, which means the employees kept same all the time in 2017 except 03/08/2017. We will remove the employee in this check date.
```{r}
empl38 <- raw26[which(raw26$Check.Dt == "03/08/2017 12:00:00 AM"), ]
raw26 <- raw26[-which(raw26$EmplId.Empl.Rcd == empl38$EmplId.Empl.Rcd), ]
```
We check the situation whether different genders are indexed for same employee.
```{r}
df <- raw26 %>%
group_by(EmplId.Empl.Rcd, Sex) %>%
summarise(Total.Gross = sum(Tot.Gross ))
dupliSex <- df[duplicated(df$EmplId.Empl.Rcd), ]
dupliSex
```
There is no employee having different sex recorded. We will check whether sex "U" employee is still in our data
```{r}
raw26[which(raw26$Sex == "U"), ]
```
sex "U" employees are already removed. Then we will check about the duplication situation of ethnic group.
```{r}
df <- raw26 %>%
group_by(EmplId.Empl.Rcd, Ethnic.Grp) %>%
summarise(Total.Gross = sum(Tot.Gross ))
dupliEthnic.Grp <- df[duplicated(df$EmplId.Empl.Rcd), ]
dupliEthnic.Grp
```
There is no different ethnic recorder for the same employee.
Now we notice that some employees even have gross income or biweekly rate less then zero. We consider to remove those employees.
```{r}
salary <- raw26
subGross0 <- salary[which(salary$Tot.Gross <= 0 | salary$Bi.Weekly.Comp.Rate <= 0), ]
len <- length(unique(subGross0$EmplId.Empl.Rcd))
for(i in 1:len){
j <- which(salary$EmplId.Empl.Rcd == unique(subGross0$EmplId.Empl.Rcd)[i])
if(i == 1) id = j
else if(i > 1) id = c(id, j)
}
salary <- salary[-id, ] #remove employees with negative or zero Tot.Gross and Bi.Weekly.Comp.Rate
```
We removed employees with negative or zero Tot.Gross and biweekly rate too. Now let's check the distribution of total gross.
(ref:salary-TotGross) Scatterplot of total gross income
```{r salary-TotGross, fig.cap='(ref:salary-TotGross)', fig.align='center'}
plot(salary$Tot.Gross, ylab = "Total gross income")
```
Most total gross point scatter under \$20000, some of them are even above \$60000.
We will not keep those employees with Job.Indicator as "S" too.
```{r}
subIndicator <- salary[which(salary$Job.Indicator == "S"), ]
len <- length(unique(subIndicator$EmplId.Empl.Rcd))
for(i in 1:len){
j <- which(salary$EmplId.Empl.Rcd == unique(subIndicator$EmplId.Empl.Rcd)[i])
if(i == 1) id = j
else if(i > 1) id = c(id, j)
}
salary <- salary[-id, ] #remove employees with Job indicator as "S"
```
Employees with Job indicator as "S" are moved. We also will not consider those employees with duplicated union descriptions.
```{r}
df <- salary %>%
group_by(EmplId.Empl.Rcd, Union.Descr) %>%
summarise(Total.Gross = sum(Tot.Gross ))
dupliUnion <- df[duplicated(df$EmplId.Empl.Rcd), ]
len <- length(dupliUnion$EmplId.Empl.Rcd)
for(i in 1:len){
j <- which(salary$EmplId.Empl.Rcd == dupliUnion$EmplId.Empl.Rcd[i])
if(i == 1) id = j
else if(i > 1) id = c(id, j)
}
salary <- salary[-id, ] # remove employees with different Union descreption
```
Finally, we cleaned our data according to our interest and we will group our left observations according to employee ID "EmplId.Empl.Rcd", and just select variables like sex, ethnic group, union description, age, total gross and biweekly rate. We take round age for each employee.
```{r}
salary <- salary %>% group_by(EmplId.Empl.Rcd, Sex, Ethnic.Grp, Union.Descr) %>%
summarise(Age = round(mean(Age), 0), Tot.Gross = mean(Tot.Gross),
Bi.Weekly.Rate = mean(Bi.Weekly.Comp.Rate)) %>%
group_by(EmplId.Empl.Rcd) %>% filter(n() == 1) %>% droplevels()
str(data.frame(salary))
```
There is one level marked as "" in ethnic group, which means unknown for us.
```{r}
levels(salary$Ethnic.Grp)[levels(salary$Ethnic.Grp) == ""] <- "unknown"
```
We replace level of empty space in Ethnic.Grp as "unknown".
```{r}
str(data.frame(salary))
summary(salary)
```
Now 3441 employees are included in our data set. These employees include both male and female, aged from 22 to 81, group into 8 different ethnics. They belong to 54 different Unions. The minimum total gross is \$568.7 and maximum is \$17530.3. Biweekly rate range from \$25.78 to \$12884.62.
As for this salary data, we will focus on the total gross income of these state employees. We are interested in the points such as:
How about the distribution of the total gross income?
Are male and female paid equally?
Is there any difference in total gross for different age and age group?
Do some ethnic group get better pay than the others?
What's the most important factor effect the gross income for these state employees?
From previous plots of total gross, we need to transform total gross first.
```{r}
salary$Gross.Log <- log(salary$Tot.Gross)
```
```{r}
#write.csv(salary, file = "datasets/salary.csv", row.names = FALSE)
#salary <- read.csv("datasets/salary.csv")
```
## Information about the variables
### Income and age
(ref:salary-HistQQ) Histogram and QQ plot for total gross, log of gross and age
```{r salary-HistQQ, fig.width=10, fig.height=8, fig.cap='(ref:salary-HistQQ)', fig.align='center'}
variable <- c("Tot.Gross", "Gross.Log", "Age") # pick up the numeric columns according to the names
par(mfrow = c(3, 2)) # layout in 3 rows and 3 columns
for (i in 1:length(variable)){
sub <- unlist(salary[variable[i]])
submean <- mean(sub)
hist(sub, main = paste("Hist. of", variable[i], sep = " "), xlab = variable[i])
abline(v = submean, col = "blue", lwd = 1)
qqnorm(sub, main = paste("Q-Q Plot of", variable[i], sep = " "))
qqline(sub)
if (i == 1) {s.t <- shapiro.test(sub)
} else {s.t <- rbind(s.t, shapiro.test(sub))
}
}
s.t <- s.t[, 1:2] # take first two columns of shapiro.test result
s.t <- cbind(variable, s.t) # add variable name for the result
s.t
```
Tot gross is not statistically normally distributed. It's extremely left skewed distributed.
The distribution of Gross.Log, transformation of Tot.Gross is closer to normality than Tot.Gross. The highest frequency of Gross.Log is round 8.
Highest frequency employees are aged between 45 and 50 years old. The distribution is close to bell shape.
Shapiro tests deny the normality of all these three numeric variables.
### Ethnic group
```{r}
library(ggplot2)
counts <- data.frame(sort(table(salary$Ethnic.Grp), decreasing = TRUE))
perc <- paste(round(counts$Freq/sum(counts$Freq), 2)*100, "%", sep = "")
ggplot(counts, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = 'identity', fill = rainbow(8)) +
geom_text(aes(x = c(8:1), y = Freq + 100 , label = perc), size = 3.5) +
labs(x = 'Ethnic Group', y = 'Freqency', title = 'Ethnic group by count') +
theme_bw() + # classic dark-on-light
coord_flip() + # flip the plot
theme(plot.title = element_text(hjust = 0.5))
```
61% of state employees are white, 14% are black. AMIND and PACIF are very limited.
### Sex
```{r}
library(plotrix)
counts <- sort(table(salary$Sex), decreasing = TRUE)
#labels with count number and percentage
p <- paste(round(counts/sum(counts), 2)*100, "%", sep = "")
lbls <- paste(names(counts), "\n", counts, p, sep = " ")
pie3D(counts, labels = lbls, explode = 0.05,
main = "Pie chart of sex", labelcex = 1.0, labelpos=c(1.8, 5.0))
```
There are around 4% more female than male in the state employees.
### Union descreption
```{r}
top10Union <- data.frame(sort(table(salary$Union.Descr), decreasing = TRUE)[1:10])
names(top10Union)[1] <- "Union.Descr"
topUnion <- salary %>%
group_by(Union.Descr) %>%
summarise(Avg.Gross.Log = round(mean(Gross.Log), 3)) %>%
arrange(desc(Avg.Gross.Log)) %>%
head(10)
df <- data.frame(top10Union, topUnion)
names(df)[1]<- "Top union by count"
names(df)[3]<- "Top union by log of gross"
library(knitr)
kable(df, booktabs=TRUE, caption = "Top union by count and average log value of gross")
```
The top unions by count are Service Maintain, Correctional officers, and Social & Human Services. But according to average gross, the first four are all come from university or college. Except Managerial, the top unions by count are totally different from top unions by average gross.
## Analysis of gross income
### Age
(ref:salary-GrossAge) Average gross and log of gross distribution by age
```{r salary-GrossAge, message=FALSE, fig.width=10, fig.height=5, fig.cap='(ref:salary-GrossAge)', fig.align='center'}
# p1 get bar plot
p1 <- salary %>%
group_by(Age) %>%
summarise(Avg.Gross = mean(Tot.Gross)) %>%
ggplot(aes(x = Age, y = Avg.Gross)) +
geom_bar(stat = 'identity', col = "lightblue") +
scale_x_continuous(breaks = seq(20, 90, 10)) +
labs(y = "Average gross income")
# p2 get density2D plot
p2 <- ggplot(salary, aes(x = Age, y = Gross.Log)) +
stat_density2d(aes(fill = ..density..^0.25), geom = "tile", contour = FALSE, n = 200) +
scale_fill_continuous(low = "white", high = "dodgerblue3") +
scale_x_continuous(breaks = seq(10, 100, 10)) +
scale_y_continuous(breaks = seq(2, 12, 1)) +
theme(legend.position = "none", panel.background = element_blank())
# display p1 and p2 in same row
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)
```
In total, the older the employees, the higher their gross income.
Lots of employees between age 50 and 60 have the similar log gross income around 8.
Now we separate age into groups with age gap of 10.
```{r}
salary$Age. <- cut(salary$Age, breaks = c(10, 20, 30, 40, 50, 60, 70, 80, 90), right = FALSE)
```
(ref:salary-LogGrossAge) Distribution of log of gross by age group
```{r salary-LogGrossAge, fig.cap='(ref:salary-LogGrossAge)', fig.align='center'}
stat <- salary %>% group_by(Age.) %>% summarise(Avg.Gross.Log = mean(Gross.Log))
ggplot() +
geom_boxplot(data = salary, aes(x = Age., y = Gross.Log, fill = Age.)) +
geom_point(data = stat, aes(x = Age., y = Avg.Gross.Log), color = "brown") +
geom_text(data = stat, aes(label = round(Avg.Gross.Log, 2), x = Age., y = Avg.Gross.Log - 0.1)) +
theme(plot.title = element_text(hjust = 0.5))
```
With average 10 years of age increasing, log value of gross increase step by step until after age 80.
```{r}
model <- aov(salary$Gross.Log~salary$Age.)
summary(model)
```
With p value of much less than 0.05, we reject the null hypothesis that there is no significant difference among these means of gross for different age groups.
```{r message=FALSE}
library(agricolae)
# LSD.test need provide degrees of freedom and mean square for errors. We can got it from aov.
res <- LSD.test(salary$Gross.Log, salary$Age.,
DFerror = model$df.residual, MSerror=anova(model)[["Mean Sq"]][2])
res$groups
```
From the information of pairwise comparison, the total different alpha beta index indicates that these 8 age groups are significantly different from each other in mean gross. The elder the age, the higher the gross income, except for age group between 80 and 90, its mean gross is the lowest comparing with all other age group.
(ref:salary-UnionAge) Percentage of union descriptions by age group
```{r salary-UnionAge, message=FALSE, fig.width=8, fig.height=8, fig.cap='(ref:salary-UnionAge)', fig.align='center'}
library(scales)
library(plotly)
p <- salary %>%
group_by(Union.Descr, Age) %>%
summarise(Avg.Gross = mean(Gross.Log)) %>%
arrange(desc(Avg.Gross)) %>%
ggplot(aes(x = Age, y = Avg.Gross, fill = Union.Descr)) +
geom_bar(stat = "identity", position = "fill") +
scale_y_continuous(labels = percent_format()) +
scale_x_continuous(breaks = seq(10, 100, 10)) +
labs(y = "Percentage") +
theme(panel.background = element_rect(fill = "black"),
panel.grid.major = element_blank(), panel.grid.minor=element_blank())
ggplotly(p)
```
This is the distribution of average gross percentage of Union Descriptions by age group. Use mouse we can find out the information of percentage occupation of union description for certain age. For example, for age 22, Administrative Clerical, Exempt/Elected/Appointed, Service/Maintenance and UCHC Univ Hlth Professional these four union share the total Log of gross almost equally, while for age 81, the only union is Legislative Management.
### Sex
(ref:salary-GrossSex) Log of gross comparison by sex group
```{r salary-GrossSex, fig.width=8, fig.height=4, fig.cap='(ref:salary-GrossSex)', fig.align='center'}
par(mfrow = c(1, 2))
# density plot for each sex
df1 <- salary[which(salary$Sex == "F"), ]
df2 <- salary[which(salary$Sex == "M"), ]
plot(density(df1$Gross.Log), col = "red", xlab = "Log of Gross",
main = "Density plot for each sex")
lines(density(df2$Gross.Log), col = "blue")
legend("topright", c("M","F"), lty = c(1, 1), col = c("blue", "red"))
# LSD comparison for sex grouop
model <- aov(salary$Gross.Log~salary$Sex)
res <- LSD.test(salary$Gross.Log, salary$Sex,
DFerror = model$df.residual, MSerror=anova(model)[["Mean Sq"]][2])
gross <- round(res$groups[, "salary$Gross.Log"], 2)
plot(res, xlab = "Sex Group", ylab = "Range between max and min",
main = "Sex groups and variation range")
text(x = c(seq(from = 1, to = 10, by = 1.2)), y = 8, gross)
```
Density plot shows blue male line is later than red female line, which means male has bigger log of gross than female in total. The density peak of female is earlier than male also tell the same story.
The total different alpha beta index indicates that these 2 sex groups are significantly different from each other in mean log of gross, male get higher gross income than female.
(ref:salary-EthnicSex) Log of gross comparison by ethnic and sex group
```{r salary-EthnicSex, message=FALSE, fig.width=10, fig.height=4, fig.cap='(ref:salary-EthnicSex)', fig.align='center'}
# p1 produce the point and line plot
p1 <- salary %>%
group_by(Ethnic.Grp, Sex) %>%
summarise(Avg.Gross = mean(Gross.Log)) %>%
ggplot(aes(x = Ethnic.Grp, y = Avg.Gross, colour = Sex, group = Sex)) +
geom_line(size = 1) +
geom_point(size = 4, shape = 19) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major = element_blank(),
legend.position = "bottom") +
labs(x = "Ethnic Group", y = "Average Gross Log")
# p2 produce the ensity ridge plot
library(ggridges)
p2 <- salary %>% group_by(Ethnic.Grp, Sex) %>%
ggplot(aes(x = Gross.Log, y = Ethnic.Grp, fill = Sex)) +
geom_density_ridges(alpha = 0.55) +
theme(legend.position = "bottom") +
xlim(6.0, 10.0)
# Lay out p1 and p2 in same row
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)
```
Both plots tell us that for other ethnic group except Asian, male have higher average gross than female. On average, Asian female earn similar as male. Amind male have more bigger log of gross than Amind female. From the point and line plot we can see the same situation for Pacif ethnic. Because of the tiny amount of PACIF employees, there is no ridge showing in the plot.
(ref:salary-Diverging) Diverging bars of stanardized log of gross income
```{r salary-Diverging, fig.width=8, fig.height=8, fig.cap='(ref:salary-Diverging)', fig.align='center'}
# Standardize Gross.Log first. Separate into two groups "Below average" and "Above average"
Sd.Gross <- round((salary$Gross.Log - mean(salary$Gross.Log))/sd(salary$Gross.Log), 2)
Gross.Type <- ifelse(Sd.Gross < 0, "Below average", "Above average")
Sex <- salary$Sex
Category <- salary$Union.Descr
df <- data.frame(Category, Sd.Gross, Gross.Type, Sex)
ggplot(df, aes(x = Category, y = Sd.Gross)) +
geom_bar(stat = 'identity', aes(fill = Gross.Type), width = .5) +
facet_wrap(~ Sex) +
labs(x = "Union Descreption", y = "Standardized log of gross") +
coord_flip() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "top")
```
From the figure we can see that for both male and female, the pattern of above and below average for almost all of the union description are similar. The only difference is the difference in total standardized log value of gross for male and female. For example, for union UCHC Univ Hlth Professionals on top, administrative Clerical on the bottom and other two unions like service/maintenance and Correctional Officers, all these four unions have more standardized log value of gross below average, on the same time, female UCHC Univ Hlth Professionals, administrative Clerical and male service/maintenance and Correctional Officers get this value less comparing with the other gender in the same union. But for union Uconn Faculty and Engineer Scien Tech, or Health professional and Judicial Professional, more standardized log value of gross above average, but female Uconn Faculty and Engineer Scien Tech and male Health professional and Judicial Professional get less than the other gender in the same union.
### Ethnic Group
(ref:salary-Ethinic) Bar and density ridge plot of log of gross by ethnic
```{r salary-Ethinic, fig.width=10, fig.height=4, message=FALSE, fig.cap='(ref:salary-Ethinic)', fig.align='center'}
# Bar plot with error bar for different ethnic group
p1 <- salary %>%
group_by(Ethnic.Grp) %>%
summarise(mean = mean(Gross.Log),
sd = sd(Gross.Log),
se = sd(Gross.Log)/sqrt(n())) %>%
ggplot(aes(x = Ethnic.Grp, y = mean, color = Ethnic.Grp, fill = Ethnic.Grp)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin =mean - se, ymax =mean + se), color = "black", width = 0.3) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Density ridges for different ethnic group
library(ggridges)
p2 <- salary %>% group_by(Ethnic.Grp) %>%
ggplot(aes(x = Gross.Log, y = Ethnic.Grp, color = Ethnic.Grp, fill = Ethnic.Grp)) +
geom_density_ridges(alpha = 0.5) +
xlim(6.0, 10.0) +
coord_flip() +
theme(legend.position = "none")
# display two plots in same row
grid.arrange(p1, p2, nrow = 1)
```
From the bar figure we can see that Asian and White have relatively high gross incomes, Pacif and unknown get lower gross incomes. There is no big difference of variance for all other ethnic groups but Pacif, next Amind. Density ridge plot shows White, Asian and Nespec have biggest density above log of gross of 8.5, while unknown, Amind and Hispa have more density below 8. Ethnic Black's density keep even at both sides of log of gross of 8.
(ref:salary-EthnicRank) Ethnic rank plot by count and log of gross
```{r salary-EthnicRank, message=FALSE, fig.cap='(ref:salary-EthnicRank)', fig.align='center'}
# get ethnic order according frequency of count
CountEthnic <- data.frame(sort(table(salary$Ethnic.Grp), decreasing = TRUE))
# get ethnic order according log of gross income
GrossEthnic <- salary %>%
group_by(Ethnic.Grp) %>%
summarise(Avg.Gross = mean(Gross.Log)) %>%
arrange(desc(Avg.Gross)) %>%
pull(Ethnic.Grp)
# build data frame for rank of these two ethnic order
EthnicGroup <- unlist(list(CountEthnic[, 1], GrossEthnic))
Rank <- rep(1:8, 2)
EthnicType <- rep(c("CountEthnic","GrossEthnic"), each = 8)
dat <- data.frame(Rank, EthnicType, EthnicGroup)
# point plot two set of ethnics according rank and connect same ethnic by line
ggplot(dat, aes(y = Rank, x = EthnicType, label = EthnicGroup, group = EthnicGroup) )+
geom_line(size = 1) +
geom_point(aes(color = EthnicGroup), size = 5) +
scale_y_continuous(breaks = seq(0,10,1)) +
annotate("text", x = 0.7, y = Rank[1:8], label = EthnicGroup[1:8], hjust = 0, cex = 3.5) +
annotate("text", x = 2.1, y = Rank[9:16], label = EthnicGroup[9:16], hjust = 0, cex = 3.5)
```
According to count, most of the state employees are white and Black. Hispa and Nspec occupy the third and fourth in the list. By average gross, the first Ethnic group is Asian, then Nspec and White. Black is in 4th place by gross. Pacif keeps the last for both count and average gross rank.
```{r warning=FALSE, fig.width=6, fig.height=6}
model <- aov(Gross.Log ~ Ethnic.Grp, data = salary)
tukey <- TukeyHSD(model)
par(mar = c(4, 10, 2, 1))
psig = as.numeric(apply(tukey$Ethnic.Grp[, 2:3], 1, prod) >= 0) + 1
plot(tukey, col = psig, las = 1, cex.axis = 0.7, yaxt = "n")
for (j in 1:length(psig)){
axis(2, at = j,labels = rownames(tukey$Ethnic.Grp)[length(psig) - j + 1],
las = 1, cex.axis = .8, col.axis = psig[length(psig) - j + 1])
}
```
There is no significant difference between the mean grosses of black pairs but the red pairs in plot: unknown with Asian, Black, Nspec and White; Asian with Black, Hispa and White; Black with Hispa and White; Hispa with Nespa and White.
```{r}
ggplot(salary, aes(x = Age, y = Gross.Log, colour = Sex, shape = Sex)) +
geom_point() +
ggtitle("Scatterplot of log of gross vs Age by Sex and Ethnic") +
facet_wrap(~ Ethnic.Grp) + # make subplots by Ethnic Group
geom_smooth(aes(colour = Sex), method = 'lm',formula = y ~ x) + # add the regression line
theme(plot.title = element_text(hjust = 0.5), legend.position = c(0.85, 0.15))
```
In our data set, we have much more employees in white, but really little in Pacif. For both male and female, with increasing in age, log of gross increase slightly for all other ethnics but Nspec. For most ethnics, male and female have similar pattern for the relation between age and log of gross. For ethnic unknown and Amind, male's log of gross increase faster than female, but for Asian, male's log of gross increase slower than female.
## Analysis of gross income type
### Gross Type
According to the average log of gross for each employee, we separate the employees into three group: High.Gross with average log of gross greater than 8.5, Middle.Gross with average log of gross greater than 7.8, and Low.Gross with average log of gross equal or less than 7.8.
```{r}
#regroup gross by average log of gross for each EmplId.Empl.Rcd as Gross.Type
High.Gross <- as.vector(salary$EmplId.Empl.Rcd[which(salary$Gross.Log > 8.5)])
Middle.Gross <- as.vector(salary$EmplId.Empl.Rcd[which(salary$Gross.Log > 7.8)])
salary <- salary %>%
mutate(Gross.Type =
ifelse(EmplId.Empl.Rcd %in% High.Gross, "High",
ifelse(EmplId.Empl.Rcd %in% Middle.Gross, "Middle",
"Low")))
```
For each gross type, we try to get information about total number of observation, percentage, employee number, average log of gross, average of total gross and standardized total gross.
```{r}
library(janitor)
df1 <- salary %>% group_by(Gross.Type) %>%
tabyl(Gross.Type)
df2 <- salary %>% group_by(Gross.Type) %>%
summarise(Avg.Log.Gross = mean(Gross.Log), Avg.Tot.Gross = mean(Tot.Gross))
df <- merge(df1, df2, by="Gross.Type") %>% arrange(desc(Avg.Log.Gross))
df <- round(df[, 2:5], 2) # round all numeric variables to decimal 2
Gross.Type = c("High, Gross.Log>8.5", "Middle, Gross.Log>7.8","Low, Gross.Log<=7.8")
df <- cbind(Gross.Type, df)
kable(df, booktabs=TRUE, caption = "Table of gross type")
```
```{r}
df$Gross.Type <- factor(df$Gross.Type, levels = df$Gross.Type)
ggplot(df, aes(x = Gross.Type,
label = paste(Avg.Tot.Gross, paste(percent*100, "%", sep = ""), sep = ", "))) +
geom_bar(aes(y = n), stat = "identity", fill = rainbow(3)) +
geom_line(aes(y = Avg.Log.Gross*250, group = 1, colour = "Avg.Tot.Gross*250"), size = 2) +
geom_point(aes(y = Avg.Log.Gross*250, group = 1, colour = "Avg.Tot.Gross*250"), size = 4) +
labs(x = "Gross Type", y = "Employee Number",
title = "Average gross and employee distribution for each gross type") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(x= c(1, 2, 3), y = c(2000, 1900, 1800), size = 4, color = "black")
```
High gross group occupy 6% of the total employees. Around 68% employees earned middle gross, and around 26% employees earned low. The gross decrease straightly. The real average total gross change from 6307 dollars/Bi-weekly to 3380, then to 2066 for high, middle and low gross group. The average real gross income for high gross group is more than 3 times of that for low gross group.
### Ethnic and Sex and age
```{r}
salary %>% mutate(EthnicSex = paste(Ethnic.Grp, Sex, sep = "-")) %>%
tabyl(EthnicSex, Gross.Type, show_missing_levels = FALSE) %>%
adorn_totals("row") %>%
adorn_totals("col") %>%
adorn_percentages("all") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns %>%
adorn_title
```
In high gross type, white male are occupy much bigger portion than any other male or female ethnic group. For other ethnic group with high gross income, female portion is equal or higher than male.
Situation keeps similar as high gross type for middle gross type, white and Asian male occupy more portion than white and Asian female, while for other ethnics, female is more than male.
For low gross type, no matter the ethnic the employees belong, female is more than male.
In total, only for white, its male is more than female, for all other ethnics, female is more than male.
For all other ethnics, over half of the employees belong to middle gross type, but unknown group have more people in low gross type.
```{r}
library(ggmosaic)
h_mosaic <- ggplot(data = salary) +
geom_mosaic(aes(x = product(Sex, Ethnic.Grp), fill = Age.), na.rm=T, divider=mosaic("h")) +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, size=6),
legend.position="top",legend.text=element_text(size=8),
panel.background = element_rect(fill="black"), panel.grid.major = element_blank()) +
labs(x = "", y = "", title= "Mosaic Plot for Gross Group by Ethnic and Sex Group") +
facet_grid(Gross.Type ~ .)
ggplotly(h_mosaic, width = 800, height = 500) %>%
layout(legend = list(orientation = "h", y = 1))
```
Here is the mosaic plot of each sub group we discussed in last chunk. We add age group as another factor. More elder employees above 50 get higher gross income, especially the male elders aged between 70 and 80.
## LM and ANOVA analysis
According to our previous analysis we try to fit a model for log of gross with Age., Ethnic.Grp, Sex, and Union description. Here we try use the average log of gross for each employee.
```{r}
salaryglm <- salary[, -c(1, 6:7, 9:10)] # remove first col. of EmplId.Empl.Rcd
```
```{r warning=FALSE }
model0 <- lm(Gross.Log ~ ., data = salaryglm)
ss <- coef(summary(model0)) # take only coeficient of summary
ss.sig <- ss[ss[, "Pr(>|t|)"] < 0.05, ][1:20, ] # show first 20 coeficient with p value less than 0.05
ss.sig
# show Adjusted R-squared value
statistic <- paste("Adjusted R-squared",round(summary(model0)$adj.r.squared, 4), sep = ": ")
statistic
model1 <- aov(Gross.Log ~ ., data = salaryglm)
summary(model1)
```
LM model indicate that Average Gross is related with all factors including Age, Ethnic group, Sex, and Union description. ANOVA test illustrates Average Gross is significantly affected by all four factors, especially by age and sex.
## Conclusion
- Total Gross is not statistically normally distributed. It skewed to left. After taking log transformation, it is more close to normality.
- The elder the age, the higher the gross income, except for age group between 80 and 90.
- On average, males earn more than females in the same age, ethnic, union.
- The means of gross are significantly different for different ethnic groups. On average, Asian and white are the top two groups. PACIF and unknown are the last.
- The real average total gross change from 6307 dollars bi-weekly to 3380, and to 2066 for high , middle and low gross group. High, middle and low gross group occupy 6%, 68%, and 26% of the total employees. White male occupy much bigger portion in high gross group, especially the elder white male.
- Age group, Union description, Sex, and Ethnic group together will explained 50% of log variance of gross.
```{exercise}
Create a folder, download the data file "state_empoyee_salary_data_2017.csv" into it, and create an Rstudio project in the folder. Read in the data file, peak the structure of your data set and clean it step by step:
1) Select only columns of "EmplId.Empl.Rcd", "Bi.Weekly.Comp.Rate", "Age", "Ethnic", "Sex", "Full.Part", "City", get summary information for all these variables. Make sure EmplId.Empl.Rcd, sex, ethnic to be factor, age and biweekly rate to be numeric.
2) Pick up full time employees in city "Hartford".
3) Select only those employees with 26 pay checks for 2017 ficial year(Hint:Each employee has unique ID of "EmplId.Empl.Rcd").
4) Remove sex group other than "F" and "M" if there is any. Hint:Use str() after droplevels().
5) Replace level of empty space in Ethnic.Grp as “unknown”.
6) Remove emplolyees with Bi.Weekly.Comp.Rate equal or less than zero if there is any.
7) Group your data set by EmplId.Empl.Rcd, Ethnic.Grp, Sex. For each employee, take round of mean age and mean of Bi.Weekly.Comp.Rate as Age and Bi.Weekly.Rate for your data set.
8) Show structure and summary of your data set.
9) Show the distribution of "Bi-Weekly Comp Rate" and "Age" using histogram and QQ plot. Test the normality. Transform by taking log value if it's necessary.
10) Using pie plot to show the distribution of Sex.
11) List the count number and percent of employees in each ethnic group.
12) Using scatter plot to view the relationship between average Bi-Weekly Comp Rate and Age, add regression line and give your interpretation.
13) Using box plot to view the relationship between average Bi-Weekly Comp Rate by Ethnic group.
14) Using density plot to view the distribution of Bi-Weekly Comp Rate for different sex group, package ggplot2 is refered.
15) Do ANOVA test to check whether male are the same as female in Bi-Weekly Comp Rate in city "Hartford".
16) Explore the interaction of ethnic and sex to Bi-Weekly Comp Rate using density ridges plot and interpret your plot.