-
Notifications
You must be signed in to change notification settings - Fork 3
/
FindingLifePartner.Rmd
368 lines (271 loc) · 15.5 KB
/
FindingLifePartner.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
---
title: "Finding Life Partner"
author: "Ruixiong Shi, Danmo Wang, Liu Yi, Rong Wang, Yue Shen Gu"
date: "January 27, 2016"
output:
html_document:
theme: united
toc: yes
---
# 1 Introduction
<p>In this script we are interested to see some general features of the singles (including those who are separated, widowed and divorced) </p>
<p>Then we will find out that as a graduated master student, assuming I am free to choose the place to work, where would be the best states to meet my life partner?
Our ultimate goal is end up with a search engine that can show you the best state based on your preference of life partner, by assuming he/she would make rational rather than emotional decision.
To answer questions in the first part, we focus on their gender ratio, their age distribution, employment status, income and education level. </p>
### 1.1 Libraries we need
```{r message=FALSE}
library(rCharts)
library(htmltab)
library(rMaps)
library(rCharts)
library(ggplot2)
library(dplyr)
```
### 1.2 Load the data
<p>To save time, read the data select variables of interest and store it as RData for future use</p>
```{r eval=FALSE}
colstokeep<-c("PWGTP", "ST", "SCHL", "AGEP", "SEX", "ESR", "MSP", "WAGP",'CIT','COW','WKHP','RAC1P','FSCHP')
pop1<-fread('ss13pusa.csv', select = colstokeep)
pop2<-fread('ss13pusb.csv',select = colstokeep)
pop4<-rbind(pop1,pop2)
rm(pop1, pop2)
save(pop4, file="pop4.RData")
}else{
load("pop4.RData")
}
```
```{r echo=FALSE}
#setwd("/Users/ruixiongshi/Documents/cycle1-1")
load("./data/pop4.RData") #if you want to run the code on your end, change "." to your working directory
actualPopulation<-
htmltab("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population",1)
actualPopulation<-actualPopulation[1:52,3:4]
names(actualPopulation)<-c('StateName','ActualPop')
actualPopulation$StateName<-substring(actualPopulation$StateName,3)
actualPopulation$StateName[48]<-substring(actualPopulation$StateName[48],3)
actualPopulation$ActualPop<-as.numeric(gsub(',','',actualPopulation$ActualPop))
```
### 1.3 Data manipulation
<p>Removing all the data with NA value in some columns and removing people who are married or under 15. Then group them accordingly</p>
```{r}
#excluding married people, and missing values
sindata<-tbl_df(pop4)
sindata<-sindata %>%
na.omit() %>%
filter(MSP %in% c('Widowed','Divorced','Separated','Never married')) %>% #code 1&2 are married
group_by(SEX)
```
# 2 General information of single population
### 2.1 How many are they?
<p>First of all we want to know the percentage of the single people all over the state. </p>
```{r}
#Compute the percentage of weighted single people over weighted population
sum(sindata$PWGTP)/sum(pop4$PWGTP)
```
<p>Great! about 25% of population are single then we would like are single men more than women? </p>
### 2.2 The women/men ratio
```{r echo=FALSE}
#Compute the weighted count for single people and convert it to percentage
weisingledata<-sindata %>% count(SEX, wt=PWGTP)
weisingledata$n <- weisingledata$n/sum(sindata$PWGTP)
```
```{r}
#Draw the pie chart of single people by gender
genderdata<-aggregate(sindata$PWGTP,list(sindata$SEX),FUN=sum)
names(genderdata)<-c('SEX','Count') # returns female with 40110995 and male with 39512549
```
```{r echo=FALSE}
genderdata2<-data.frame(Gender=c(rep('Male',39512549),rep('Female',40110995)))
genderplot<-ggplot(genderdata2,aes(x=factor(1), fill=as.factor(Gender)))+geom_bar(width = 1)+ggtitle("Comparing single people in US by gender") + scale_fill_discrete(name = "Gender")+coord_polar(theta = 'y')+theme_bw()+ylab(" ")+xlab(" ")
genderplot
```
</p>We can see that women slightly outnumber men. Next, we would like to see their age distribution. </p>
### 2.3 How old are they?
```{r}
genderageplot<-ggplot(sindata,aes(x=SEX, y=AGEP, fill=as.factor(SEX)))+geom_boxplot()+ggtitle("Comparing age of single people in US by gender") + scale_fill_discrete(name = "Gender")+xlab("Gender")+ylab("Age") + scale_y_continuous(limits=c(0,100))+theme_bw()
```
```{r echo=FALSE}
genderageplot
```
<p>we can see women performs generally better then men in finding a job. The unemployment rate drops as the age increase. If you are looking for a partner under 30 years old and you do not want him/her without a job, then you are in the dangerous zone</p>
<p>It is clear that median age of single women are 5 years older than single men. If you are a lady who want to find a boyfriend under 30, then half of the single gentlemen is waiting for you. If you are a gentleman who look for ladies under 25 then 25% of total single ladies is waiting for you. </p>
<b><i>The next question is, are they good husband/wife for you? </i>.</b>
### 2.4 How about their employment status, income and education level?
#### 2.4.1 Employment status according to age</b>
```{r echo = FALSE}
empl<-pop4[!is.na(pop4$ESR),]
empl2<-aggregate(empl$PWGTP,by=list(empl$AGEP,empl$SEX),FUN=sum)
names(empl2)<-c('Age','Sex','TotalCount')
empl2$id<-paste0(empl2$Age,empl2$Sex)
```
<p>Though some people in the survey are marked as "with a job but not at work", in this part we choose only the person marked as unemployed</p>
```{r}
unempl<-empl[empl$ESR==3,] #choose the people only marked as unemployed
unempl2<-aggregate(unempl$PWGTP,by=list(unempl$AGEP,unempl$SEX),FUN=sum)
names(unempl2)<-c('Age','Sex','UnemplCount')
unempl2$id<-paste0(unempl2$Age,unempl2$Sex)
```
```{r echo=FALSE}
#Visualize them here
unempl3<-merge(empl2,unempl2[,c('id','UnemplCount')],by='id',all.x=T)
unempl3$UnemplRatio<-round(unempl3$UnemplCount/unempl3$TotalCount,2)
#plot graph
empplot<-ggplot(data=unempl3, aes(x=Age, y=UnemplRatio, group=Sex, colour=Sex)) +
geom_line() + geom_point() + scale_y_continuous(limits = c(0, 0.15))+ xlim(16,75)+ theme_classic()+theme(axis.text.x=element_text(angle=90,hjust=1))+xlab("Age")+ylab("Unemployment ratio")+ggtitle("Comparing unemployment rate for singles")
empplot
```
<p>we can see women perform generally better then men in finding a job. The unemployment rate drops as the age increase. If you are looking for a partner under 30 years old and you do not want him/her without a job, then you are in the dangerous zone</p>
#### 2.4.2 Income level
```{r echo=FALSE}
#filter the data
popudata <- pop4 %>%
na.omit()%>%
filter(MSP %in% c('Widowed','Divorced','Separated','Never married'))
```
<p>We break their income into 6 levels and calculate the weighted count</p>
```{r}
#break the WAGP (lower=0, upper=100000, by=20000)
popudata$WAGP2[popudata$WAGP %in% c(0:20000)] <- "0-20k"
popudata$WAGP2[popudata$WAGP %in% c(20000:40000)] <- "20-40k"
popudata$WAGP2[popudata$WAGP %in% c(40000:60000)] <- "40-60k"
popudata$WAGP2[popudata$WAGP %in% c(60000:80000)] <- "60-80k"
popudata$WAGP2[popudata$WAGP %in% c(80000:100000)] <- "80-100k"
popudata$WAGP2[popudata$WAGP %in% c(100000:1000000)] <- "over 100k"
#sum the weights
popudata<-aggregate(popudata$PWGTP,by=list(popudata$WAGP2,popudata$SEX),FUN=sum)
names(popudata)<-c('WAGP','SEX','PWGTP')
```
```{r echo=FALSE}
#plot chart for Count VS Salary
salaryplot <- ggplot(popudata,aes(x=WAGP, y=PWGTP,fill=factor(SEX)))+geom_bar(stat="identity",position="dodge")
salaryplot <- salaryplot +ylab("Count")+xlab("Annual Salary")+ggtitle("Salary for Single") + theme_bw() + scale_fill_discrete(name = "Sex")
salaryplot
```
<p>From the above plot, the low paid group (under 20k annual), for both female and male, dominate all other groups. </p>
#### 2.4.3 Education level
```{r echo=FALSE}
per_edu<-
pop4%>%
na.omit() %>%
filter(MSP %in%c('Widowed','Divorced','Separated','Never married')) #select person who is single
```
<p>Change the SCHL codes into different college degrees </p>
```{r eval=FALSE}
per_edu$SCHL[per_edu$SCHL < 20] <- "No degree"
per_edu$SCHL[per_edu$SCHL == 20] <- "Associated degree"
per_edu$SCHL[per_edu$SCHL == 21] <- "Bachelor's degree"
per_edu$SCHL[per_edu$SCHL == 22] <- "Master's degree"
per_edu$SCHL[per_edu$SCHL == 23] <- "Professional degree beyond a bachelor's degree"
per_edu$SCHL[per_edu$SCHL == 24] <- "Doctorate degree"
```
```{r echo=FALSE}
per_edu$SCHL2<-as.character(per_edu$SCHL)
per_edu$SCHL2[per_edu$SCHL%in%c('Less than 3 years old','No schooling completed',
'Nursery school, preschool, Kindergarten',
'Grade 1-12 and no diploma','Regular high school diploma',
'GED or alternative credential','Some college, no degree')] <- "No degree"
per_edu$SCHL2<-factor(per_edu$SCHL2)
```
```{r echo=FALSE}
per_edu<- aggregate(per_edu$PWGTP,by=list(per_edu$SCHL2,per_edu$SEX),FUN=sum)
names(per_edu)<-c('SCHL','SEX','PWGTP')
#arrange(per_edu,SCHL)
eduplot <- ggplot(per_edu,aes(x=SCHL, y=PWGTP,fill=factor(SEX)))+geom_bar(stat="identity",position="dodge")
eduplot <- eduplot +ylab("count")+xlab("Levels of ducation")+ggtitle("Education of Single") + scale_x_discrete(limits=c("No degree","Associate's degree","Bachelor's degree","Master's degree", "Professional degree beyond a bachelor's degree", "Doctorate degree")) + theme_bw()+theme(axis.text.x=element_text(angle=20,hjust=1))+scale_fill_discrete(name = "Sex")
eduplot
```
<p>More than half of single people do not have a college degree. </p>
<p>According to the "should I do a PhD?" post, we can see that as a master student, our annual salary after graduation is about 50k annual. If you are looking for someone who has similar qualification and income like you by randomly picking from single population, then you are wasting your time. </p>
# 3 In which states can I find him/her?
<p>As we have seen, finding a good partner is an intense competition actually, if you want to stand out, you must have some tricks. </p>
<b>Here is the trick: we are going to locate one or two states which has the largest number of good partners and highest proportion of them among total state population. </b>
<p>Now imagine I am a young lady from statistics department, assuming I am free to find a job in every state and my boyfriend criterions are:</p>
<i>"I want him to have an annual income over 100K, I do not care his age, education or whether he has married before. Basically I just want a sugar daddy. Which state should I find him?" </i>
### 3.1 Some library changes for this part
```{r message=FALSE}
detach(package:dplyr,unload=T)
library(plyr)# these two packages are clashed in some functions
```
### 3.2 Targeting the best state
```{r eval=FALSE}
state<-aggregate(pop3$PWGTP, by=list(pop3$abbr,pop3$name), FUN=sum)
names(state)<-c('State','StateName','TotalCountWithWeight')
# Plotting sugar daddy
sugarDaddy<-pop3[pop3$single=='Single'&pop3$SEX==1&!is.na(pop3$WAGP)&pop3$WAGP>=100000,]
sugarDaddy2<-aggregate(sugarDaddy$PWGTP,by=list(sugarDaddy$abbr), FUN=sum)
names(sugarDaddy2)<-c('State','CountWithWeight')
sugarDaddy3<-merge(sugarDaddy2,state,by='State',all.x=T)
sugarDaddy3$Perc<-round(sugarDaddy3$Count/sugarDaddy3$TotalCount*100,1)
sugarDaddy4<-merge(sugarDaddy3,actualPopulation,by='StateName',all.x=T)
sugarDaddy4$ExpectedCount2015<-round(sugarDaddy4$Perc*sugarDaddy4$ActualPop/100,0)
ichoropleth(Perc ~ State,legend=T,pal='YlOrRd',data=sugarDaddy4)
```
<b>Expected percentage of single males earning more than 100K annually</b>
<p align="left"><img src="doc/image/sugarDaddyPerc.png" ></p>
```{r eval=FALSE}
ichoropleth(ExpectedCount2015 ~ State,legend=T,pal='YlOrRd',data=sugarDaddy4)
```
<b>Expected total number of single males earning more than 100K annually</b>
<p align="left"><img src="doc/image/sugarDaddyNum.png" ></p>
<p>It seems like New York, California and Massachuset are the best states for hunting them. Great! </p>
# 4 Which industries are the best?
<p>In order to find my sugar daddy, I decide to live in either New York state or California to meet my sugar daddy then I need to know where can I meet him. The best idea will be we work together. So the next step I will figure out where they work. Luckily, data scientist can work in various field. </p>
```{r echo=FALSE}
state<-aggregate(pop4$PWGTP, by=list(pop4$abbr,pop4$name), FUN=sum)
names(state)<-c('State','StateName','TotalCountWithWeight')
```
```{r message=FALSE}
# look at industry of sugar daddys in CA and NY
sugardaddyNYCA<-pop4[pop4$single=='Single'&pop4$SEX=='Male'&!is.na(pop4$WAGP)&pop4$WAGP>=10000&pop4$abbr%in%c('CA','NY'),] #filter the people we need
sugardaddyNYCA2<-aggregate(sugardaddyNYCA$PWGTP,by=list(sugardaddyNYCA$abbr,sugardaddyNYCA$NAICS),FUN=sum)
names(sugardaddyNYCA2)<-c('State','Industry','CountWithWeight')
```
```{r echo=FALSE}
sugardaddyNYCA3<-merge(sugardaddyNYCA2,state,by='State',all.x=T)
sugardaddyNYCA3$Perc<-round(sugardaddyNYCA3$Count/sugardaddyNYCA3$TotalCount*100,2)
sugardaddyNYCA4<-merge(sugardaddyNYCA3,actualPopulation,by='StateName',all.x=T)
sugardaddyNYCA4$ExpectedCount2015<-round(sugardaddyNYCA4$Perc*sugardaddyNYCA4$ActualPop/100,0)
```
<p>Generate a table of the industries they work for NY</p>
```{r message=FALSE}
sugardaddyNY<-sugardaddyNYCA4[sugardaddyNYCA4$State=='NY',]
sugardaddyNY<-sugardaddyNY[rev(order(sugardaddyNY$CountWithWeight)),]
head(sugardaddyNY[,c('State','Industry','CountWithWeight','Perc')])
```
<p>Do the same thing for CA</p>
```{r message=FALSE}
sugardaddyCA<-sugardaddyNYCA4[sugardaddyNYCA4$State=='CA',]
sugardaddyCA<-sugardaddyCA[rev(order(sugardaddyCA$CountWithWeight)),]
head(sugardaddyCA[,c('State','Industry','CountWithWeight','Perc')])
```
<p>From the table above, the top 3 industries are retail, professional services and manufacturing industry. To hunt my sugar daddy, I will have higher chance if I can work in these industries in either New York or California</p>
# 5 Example for hunting dream ladies
<p>Imagine I am a young man whose girlfriend criterion are:</p>
<i>"she will be younger than 30 years old with at least a university degree and she must have a job"</i>
<p>Then I have the following plots:</p>
```{r eval=FALSE}
pgirl<-pop3[pop3$single=='Single'&pop3$SEX==2&pop3$AGEP<30&!is.na(pop3$SCHL)&pop3$SCHL>=21&!is.na(pop3$ESR)&pop3$ESR%in%c(1,2,4,5),]
pgirl2<-aggregate(pgirl$PWGTP,by=list(pgirl$abbr), FUN=sum)
names(pgirl2)<-c('State','CountWithWeight')
pgirl3<-merge(pgirl2,state,by='State',all.x=T)
pgirl3$Perc<-round(pgirl3$Count/pgirl3$TotalCount*100,1)
pgirl4<-merge(pgirl3,actualPopulation,by='StateName',all.x=T)
pgirl4$ExpectedCount2015<-round(pgirl4$Perc*pgirl4$ActualPop/100,0)
ichoropleth(Perc ~ State,data=pgirl4,pal = 'PuRd')
```
<b>Expected percentage of single females younger between 16-29 years old, graduated from college, and have a job</b>
<p align="left"><img src="doc/image/perfectGirlPerc.png" ></p>
```{r eval=FALSE}
ichoropleth(ExpectedCount2015 ~ State,data=pgirl4,pal = 'PuRd')
```
<b>Expected total number of single females younger between 16-29 years old, graduated from college, and have a job</b>
<p align="left"><img src="doc/image/perfectGirlNum.png" ></p>
<p>New York, California and Massachuset are great choice for me and I have more options such as Pennsylvania, Illinois and Nevada</p>
# 6 Conclusions
<p>As a master student, if you are single then you should be aware that to find a life partner who has similar qualification and income level as you is really hard after you leave school. So try to find one at school!
If you are a rich gentleman who do not care your wife's income, then obviously you have more choice. You do not need to rush. Girls will do their best to find you! </p>
# 7 Next Step
<p>I would like to build a search engine that automate the process of filtering and plotting and it will show the best states to find your life partner based on your preference</p>
<p>Below is a demo, hope you like it! </p>
<a href="https://yueshengu.shinyapps.io/findingLifePartner/" target="_blank" style="font-size:40px">Soulmate Exploration Engine</a>
```