-
Notifications
You must be signed in to change notification settings - Fork 0
/
analysis.Rmd
1017 lines (785 loc) · 39.6 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Suicide Prevention and Analysis"
author: "Václav Hlaváč"
output:
pdf_document: default
word_document: default
---
```{r setup, include=FALSE}
library(gridExtra)
library(factoextra)
library(caret)
library(glmnet)
library(MASS)
library(tidyverse)
library(xgboost)
library(Matrix)
theme_set(theme_gray())
```
**Suicide Understanding and Prevention**
This notebook contains the analysis of a suicide dataset, which was the final
assignment of the Statistical data analysis course.
## 1. Exploratory data analysis
```{r include=FALSE}
# Load the dataset
load(file="data/suicide.RData")
```
I begin by inspecting whether the dataset contains any NA values that would need to be taken care of.
```{r}
data %>% is.na() %>% colSums()
```
It seems that the dataset has already been conveniently cleaned beforehand.
Did all of the included countries contribute the same amount of data? Are there
any countries that reported their statistics far less often than other countries?
```{r}
years.count <- data %>% group_by(continent, country) %>%
summarise(count = n_distinct(year)) %>%
arrange(desc(count))
years.count %>%
ggplot(aes(x = count, fill = continent)) +
scale_x_continuous(n.breaks = 31) +
geom_bar()
```
It looks that a significant number of countries provided data about the number suicides within their population most of the years during 1985-2015. However there are some countries that supplied only less than 10 year worth of data.
### Suicide rate
Throughout the majority of this notebook, I will be working with suicide rates rather than with concrete numbers
of total suicides. The reason for this is that it allows us to perform better comparisons of suicide prevalence
between different countries and continents, as the populations of different areas differ, and the total number of
suicides in a population is correlated with the size of the population.
```{r}
country.summary <- data %>% group_by(year) %>%
summarise(population = sum(population),
suicides_no = sum(suicides_no))
cor(x = country.summary$population, y = country.summary$suicides_no)
```
If we wanted to be even more rigorous with our analysis, we could standardize our suicide rates by age distribution,
as is done by the World Health Organization. This allows for better comparison of statistics between countries with
different population age profiles.
### Global trend inspection
The first thing that we can inspect is the progression of suicide rates worldwide during the time period 1985-2015.
```{r}
global.suicide.rate <- data %>% group_by(year) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000
)
global.suicide.rate %>%
ggplot(aes(x = year, y = suicides_per_100k)) +
geom_point(color="orange", size=2.4) +
geom_line(color="orange", size=1.1) +
labs(
title = "Worldwide Suicide Rates",
x = "Year", y = "Suicides per 100k people"
) +
scale_x_continuous(breaks = seq(1985, 2015, 2))
```
This figure shows that there has been a great rise in suicides throughout the years 1989 and 1995.
However, with a few exceptions, the suicide rate seems to be on decline ever since then.
```{r}
differences <- data.frame(
difference = tail(global.suicide.rate$suicides_per_100k, -1) - head(global.suicide.rate$suicides_per_100k, -1),
year = tail(global.suicide.rate$year, -1))
differences <- differences %>%
mutate(changeNotPositive = difference <= 0)
differences %>%
ggplot(aes(x = year, y = difference, color = changeNotPositive)) +
geom_segment(aes(x = year, xend = year, y = 0, yend = difference)) +
geom_point(size = 2) +
scale_x_continuous(breaks = seq(1986, 2015, 2)) +
scale_y_continuous(breaks = seq(-1, 1.5, 0.5)) +
theme(legend.position = "none") +
labs(title = "Change in suicide rate in time",
subtitle = "Shows the change in suicide rate from the previous year",
x = "Year", y = "Difference")
```
The figure above shows us the differences in suicide rates between two subsequent years.
Rather than inspecting the data worldwide, we can take a look at the progression of suicide rates within each of the
continents and check wheter there appears to be any differences between them.
```{r fig.height=7}
continent_overall_plot <- data %>% group_by(continent) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = continent, y = suicides_per_100k, fill = continent)) +
geom_bar(stat = "identity") +
labs(title = "Overall differences in suicide rates between continents",
x = "Continent", y = "Suicides per 100k people") +
theme(legend.position = "none")
continent_by_year_plot <- data %>% group_by(year, continent) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = year, y = suicides_per_100k, color=continent)) +
geom_line() +
geom_point(size=0.8) +
scale_x_continuous(breaks = seq(1985, 2015, 5)) +
facet_wrap(~ continent, ncol = 1, scales = "free_y") +
labs(title = "Suicide Rate By Continent In Time",
x = "Year", y = "Suicides per 100k people") +
theme(legend.position = "none")
grid.arrange(continent_overall_plot, continent_by_year_plot, ncol = 2)
```
It turns out that there certainly is a difference in suicide rates between the continents. It is worth noting that the data for Oceania and the Americas suggest that the suicide rates have been on rise in the most recent years, which means that these continents do not comply with the worldwide trend.
The reason that the worldwide trend suggests that suicide rates are decreasing is the decrease of suicide rates within
European countries. The suicide rates in Europe are large enough to dominate the suicide rates within the other continents.
Because of that, they also dictate the worldwide trend. Europe also has the highest overall suicide rate during the inspected
time period.
I cannot recognize any trend within the Asian countries.
The data from African countries may be incomplete. There is a very pronounced and sudden decline in suicide rates after
the year 1995. I do not think that this data represents the real situation within the African countries, and suspect that
the quality of data collection and suicide reports might be the culprit here.
Let's inspect the European countries in more detail, since Europe has the highest suicide rate in the world.
```{r fig.height = 6}
data_europe <- data %>% filter(continent == "Europe") %>%
group_by(country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
arrange(desc(suicides_per_100k))
data_europe %>%
ggplot(aes(x = reorder(country, suicides_per_100k), y = suicides_per_100k, fill = suicides_per_100k)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Overall Suicide Rates in European countries",
x = "Country", y = "Suicides per 100k people")
```
The country with the largest suicide rate is Lithuania, followed by the Russian Federation and Belarus. It seems that a large number of the countries with high suicide rate seem to be East European.
```{r fig.height = 6}
data_europe %>%
ggplot(aes(x = reorder(country, suicides_no), y = suicides_no, fill = suicides_no)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Number of suicides in European countries",
x = "Country", y = "Number of suicides")
```
```{r}
data %>% filter(country == "Russian Federation") %>%
select(suicides_no) %>%
sum()
data %>% filter(country != "Russian Federation") %>%
select(suicides_no) %>%
sum()
```
Inspecting the concrete number of suicides in the European countries reveals that the Russian Federation is the country with the most suicides, and that the number of suicides is much greater than in the other countries. This can be explained by Russia being the European country with the largest population and also having the second highest overall suicide rate in Europe.
### Differences between sexes
A significant factor in suicide risk can be the sex of the person.
```{r}
data %>% group_by(year, sex) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
ggplot(aes(x = year, y = suicides_per_100k, color=sex)) +
geom_line(size=1.1) +
geom_point(size=2.4) +
labs(title = "Global Differences in Suicide Rates Between Sexes",
x = "Year", y = "Suicides per 100k people")
```
It is clear that the male part of the population is at a higher risk of suicide
than its female counterpart. Another somewhat interesting observation is that
the change in suicide rate of women in time is not as pronounced as is the
change of suicide rate of men.
```{r}
data %>% group_by(continent, sex) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = continent, y = suicides_per_100k, fill = sex)) +
geom_bar(stat = "identity", position = "stack") +
scale_y_continuous(breaks = seq(0, 40, 5)) +
coord_flip() +
theme(legend.position = "bottom") +
labs(title = "Overall differences in suicide rates between the continents",
subtitle = "along with differences between sexes within individual continents",
x = "Continent", y = "Suicides per 100k people")
```
We can also inspect the differences between the sexes on within individual
continents. This again reveals the same pattern for all of the continents,
that is that men are at higher risk of suicide than women.
```{r fig.height=7, fig.width=7}
data %>% group_by(continent, sex, year) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
) %>%
ggplot(aes(x = year, y = suicides_per_100k, color = sex)) +
geom_line() +
geom_point() +
facet_wrap(~ continent, ncol = 1, scales = "free_y", strip.position = "right") +
scale_x_continuous(breaks = seq(1985, 2015, 2)) +
theme(legend.position = "bottom")
```
By inspecting the progression of suicide rates of sexes within different continents, we can state that the suicide rates of women are consistently lower than those of men across all of the inspected continents.
### Age
Another factor that might play a role in the suicide risk is the age of the person.
```{r}
data %>% group_by(year, age) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
ggplot(aes(x = year, y = suicides_per_100k, color = age)) +
geom_line() +
geom_point(size = 0.9) +
scale_x_continuous(breaks = seq(1985, 2015, 2)) +
scale_color_discrete() +
labs(title = "Relationship between suicide rate and age",
x = "Year", y = "Suicides per 100k people")
```
It seems that if we inspect the suicide data globally, there in fact is a difference is suicide rates between different
age groups. More over, the risk of suicide seems to grow larger with the age of the person.
```{r fig.height=10, fig.width=8}
data %>% group_by(continent, year, age) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
ggplot(aes(x = year, y = suicides_per_100k, color = age)) +
geom_line() +
geom_point(size = 0.9) +
scale_x_continuous(breaks = seq(1985, 2015, 2)) +
scale_color_discrete() +
facet_wrap(~continent, ncol = 1) +
labs(title = "Relationship between suicide rate and age within individual continents",
x = "Year", y = "Suicides per 100k people")
```
However, if we inspect the dependence on suicide rate on age within individual continents, we can see that the previous
observation does not hold true everywhere in the world.
The differences between age groups are most pronounced in the Asian and European countries. There is also some difference
between the age groups in the Americas, though they do not appear as pronounced as in the two previously mentioned continents.
On the other hand, if we take a look at the data from Oceania, we can see that there does not appear to be any striking dependence of the suicide rate on the age group.
### Correlation Between GDP and Suicide Rates
Using our dataset, we can also try to find out whether the gross domestic product
of a country affects its suicide rate.
```{r}
data %>% group_by(continent, country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
mean_gdp_per_capita = mean(gdp_per_capita),
.groups = "keep") %>%
ggplot(aes(x = mean_gdp_per_capita, y = suicides_per_100k, color = continent)) +
geom_point() +
labs(title = "Relationship between suicide rate and GDP per capita",
subtitle = "Both variables are represented as means from the time period 1985-2015",
x = "GDP per capita [$]", y = "Suicides per 100k people")
```
The first attempt to investigate the relationship between these two variables can be plotting the
overall suicide rate of a country against its mean GDP per capita during the available time period.
However, this figure does not seem to reveal any obvious relationship between these variables.
We can also try to calculate the correlations between these two variables over the same time interval
for each of the countries individually. The correlations were computed using Spearman's rho, since
the relationship between the two variables is not expected to be linear. The correlations have been
computed only for the countries where 10 or more years of data has been available.
```{r}
correlation.gdp.suicide <- data.frame()
for (c in levels(data$country)) {
country.data <- data %>% filter(country == c) %>%
group_by(country, year, gdp_per_capita) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep")
if (nrow(country.data) < 10) {
next
}
corr <- cor.test(x = country.data$gdp_per_capita, y = country.data$suicides_per_100k, method = "spearman",
exact = FALSE)
correlation.gdp.suicide <- rbind(
correlation.gdp.suicide,
list(
country = c,
correlation = corr$estimate,
p.value = corr$p.value,
significant = corr$p.value < 0.05
)
)
}
correlation.gdp.suicide <- arrange(correlation.gdp.suicide, correlation.gdp.suicide$correlation)
```
```{r}
head(correlation.gdp.suicide)
tail(correlation.gdp.suicide)
```
We can see that this provided us with a much more better insight into the data, and that there truly is a relationship between the suicide rate in a country and the gross domestic product of that country.
This relationship seems to be quite strong for a considerable amount of countries.
Some of the countries achieve a large positive correlation and some achieve a large negative correlation.
However, they are not distributed equally, as negative correlations are present more often than the positive ones.
All the correlations (with the exception of Cyprus) whose absolute value is larger than $0.4$ are statistically significant with confidence level $\alpha = 0.05$.
```{r}
correlation.gdp.suicide %>%
filter(p.value < 0.05) %>%
mutate(isNotPositive = correlation <= 0) %>%
ggplot(aes(x = correlation, fill = isNotPositive)) +
geom_histogram(binwidth = 0.1, center = 0.1, color = "#e9ecef") +
scale_x_continuous(breaks = seq(-1, 1, 0.1)) +
scale_y_continuous(breaks = seq(0, 12, 2)) +
labs(title = "Histogram of statistically significant correlations of suicide rate and GDP") +
theme(legend.position = "none")
```
Let's inspect this relationship further by plotting the suicide rate against GDP
for six of the countries that achieved the largest negative correlations.
```{r fig.width=8}
top.corr.countries <- head(correlation.gdp.suicide)$country
top.corr.correlations <- head(correlation.gdp.suicide)$correlation
my_labeller <- function(countries) {
correlation <- map(
countries,
function(x) correlation.gdp.suicide[correlation.gdp.suicide$country == x, ]$correlation
) %>% unlist
correlation <- round(correlation, digits = 2)
result <- paste(as.character(countries), ", rho = ", correlation, sep = "")
}
data %>% filter(country %in% top.corr.countries) %>%
group_by(country, year, gdp_per_capita) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
mutate(country_order = factor(country, levels = top.corr.countries)) %>%
ggplot(aes(x = gdp_per_capita, y = suicides_per_100k, color = country_order)) +
geom_point() +
scale_colour_viridis_d(end = 0.8) +
facet_wrap(~ country_order, scales = "free",
labeller = as_labeller(my_labeller)) +
labs(title = "Countries with the largest negative correlation of gross domestic product and suicide rate",
subtitle = "Correlation is reported using Spearman's rho",
x = "GDP [$]", y = "Suicides per 100k people") +
theme(legend.position = "none")
```
We can see that plotting the data allows us to easily notice the relationship between these two variables.
It is also worth noting that all of these countries achieving the largest negative correlations are European.
We can do the same for countries that achieved the largest positive correlation.
```{r fig.width=8}
top.corr.countries <- head(arrange(correlation.gdp.suicide, desc(correlation.gdp.suicide$correlation)))
top.corr.countries <- top.corr.countries$country
my_labeller <- function(countries) {
correlation <- map(
countries,
function(x) correlation.gdp.suicide[correlation.gdp.suicide$country == x, ]$correlation
) %>% unlist
correlation <- round(correlation, digits = 2)
result <- paste(as.character(countries), ", rho = ", correlation, sep = "")
}
data %>% filter(country %in% top.corr.countries) %>%
group_by(country, year, gdp_per_capita) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep") %>%
mutate(country_order = factor(country, levels = top.corr.countries)) %>%
ggplot(aes(x = gdp_per_capita, y = suicides_per_100k, color = country_order)) +
geom_point() +
scale_colour_viridis_d(begin = 0.8, end = 0) +
facet_wrap(~ country_order, scales = "free",
labeller = as_labeller(my_labeller)) +
labs(title = "Countries with the largest correlation of gross domestic product and suicide rate",
subtitle = "Correlation is reported using Spearman's rho",
x = "GDP [$]", y = "Suicides per 100k people") +
theme(legend.position = "none")
```
Once again, the interaction between suicide rate and GDP is easily recognizable, although in some cases
(e.g. Montenegro) it does not seem to be as well structured as it has been for the negative correlations.
Another thing I would like to investigate is whether there is a connection between the overall suicide rate
of a country and its dependence on the GDP of the country.
```{r}
data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(continent, country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita),
.groups = "keep") %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
# filter(p.value < 0.05) %>%
ggplot(aes(x = correlation, y = suicides_per_100k, color = continent, shape = significant)) +
geom_point() +
labs(title = "Relationship between suicide rate and its correlation with GDP",
subtitle = "Correlation is reported using Spearman's Rho",
x = "Correlation", y = "Suicides per 100k people")
```
The plot includes the correlations even if they were not statistically significant.
It seems that the countries, where the suicide rate is highly correlated with the gross domestic
product, tend to have higher suicide rates than the countries where the correlation with GDP
is not very strong. This can be observed on both ends of the correlation's value range, however it is
far more pronounced for countries that achieve a negative correlation.
We can also observe that the suicide rates of countries that achieve a large negative correlation tend
to be larger than of countries that achieve a large positive correlation.
Once again, it is worth noting that the majority of countries that achieve a negative correlation is European.
Let's see what correlations did individual European countries achieve.
```{r fig.height=4, fig.width=8}
countries.continent <- distinct(data[c("country", "continent")])
correlation.gdp.suicide %>%
inner_join(countries.continent, by = "country") %>%
filter(continent == "Europe") %>%
ggplot(aes(x = reorder(country, correlation), y = correlation, fill = correlation)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
scale_y_continuous(limits = c(-1, 1), breaks = seq(-1, 1, 0.1)) +
labs(title = "Correlations of European countries",
x = "Country", y = "Correlation") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.position = "none")
```
We can see that most of the European countries truly achieve large negative correlations of
suicide rates and GDP. However, there are a few exceptions, where these two variables
are correlated positively.
### Clustering
The plot of an average suicide rate of a country against its correlation with
the country's GDP per capita was really interesting to me, and therefore I would
like to try to separate the individual countries using K-means clustering.
First, let us select the data that will be used for clustering.
```{r}
clustering.data <- data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita)) %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
mutate(suicides_per_100k = scale(suicides_per_100k),
correlation = scale(correlation)) %>%
dplyr::select("correlation", "suicides_per_100k")
```
We can utilize the elbow method in order to select the number of clusters.
```{r}
set.seed(42)
k.max <- 15
wss <- sapply(1:k.max,
function(k) { kmeans(clustering.data, k, nstart=50,iter.max = 15 )$tot.withinss })
elbow <- data.frame(k = c(1:k.max), wss = wss)
elbow %>%
ggplot(aes(x = k, y = wss, color = "orange")) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1, k.max, 1)) +
labs(title = "The change of the total within cluster sum of squares with number of clusters",
x = "Number of clusters", y = "Total within cluster sum of squares") +
theme(legend.position = "none")
```
We can see that the decline slows down after creating 4 clusters, and therefore
we will set $k = 4$.
```{r}
clustering.data <- data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita)) %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
mutate(suicides_per_100k = scale(suicides_per_100k),
correlation = scale(correlation)) %>%
dplyr::select("correlation", "suicides_per_100k")
set.seed(1234)
clustering <- kmeans(clustering.data, 4)
fviz_cluster(clustering, data = clustering.data, labelsize = 0) +
labs(title = "Clustering countries based on their dependence on GDP",
subtitle = "Both features have been normalized",
x = "GDP and Suicide rate correlation",
y = "Suicides per 100k people")
```
It looks like that the first cluster is comprised of countries where the suicide rate
has a strong negative correlation with the GDP and the suicide rate is above average.
The second cluster is also made out of countries with strong correlation, however
it is positive instead of negative. The suicide rate is also higher than average for
most of the countries within this cluster.
The remaining two clusters are composed of countries whose suicide rates are mostly
below the average.
### Dimension Reduction
The $\texttt{generation}$ feature is redundant, since we can assign a generation to a group of population by combining the information
from the features $\texttt{year}$ and $\texttt{age}$.
Furthermore, the features $\texttt{gdp_for_year}$ and $\texttt{gdp_per_capita}$ are mutually redundant. That is because we can always
use one of these features in combination with the sum of total population in a country in a certain year to get the value of the other
feature, e.g. $\texttt{gdp_per_capita} = \frac{\texttt{gdp_for_year}}{\texttt{total_population}}$.
```{r}
data %>% group_by(country, year, gdp_for_year, gdp_per_capita) %>%
summarise(population = sum(population), .groups = "keep") %>%
mutate(x = as.integer(round(gdp_for_year / population)),
y = abs(x - gdp_per_capita)) %>%
pull(y) %>%
mean
```
### Exploratory Data Analysis Conclusion
The exploratory data analysis helped us uncover many interesting insights. First of all, we have seen that
the world experienced a sudden rise in suicide rates during 1989-1995, however these rates these have been on
decline since then. We also inspected the difference in suicide rates between the individual continents.
This revealed that Europe has the highest suicide rate, followed by Asia. Inspection of the suicide rates
among European countries has shown us that Lithuania has the highest suicide rate, however the number of suicides has been the largest in Russia.
We have also seen that men are at a higher risk of committing suicide than women,
and that this is true across all of the continents and the investigated time period.
It also became apparent that the age plays a role in the risk of committing suicide.
There seems to be a difference in suicide rates between the age groups, although
this difference has not been seen in every continent.
An interesting relationship between the suicide rate within a country and the country's gross domestic
product has been revealed as well. This relationship appears to be quite substantial
in many cases, and is almost always present within the European countries. There also appears
to be a connection between the correlation of suicide rate and GDP and the suicide rate as such.
## 2. Hypotheses Testing
I would like to test several hypotheses based on the analysis that we have produced earlier.
The confidence level for all of the hypothesis tests is set to be $\alpha = 0.05$.
### Suicide rates of men and women
The first hypothesis that we will test is that that men tend to commit
suicide more often than women. Therefore, we will test if the suicide rate of men
was higher than that of women during the time interval 1985-2015.
Let $\mu_m$ and $\mu_w$ be the means of suicide rates of men and women, respectively.
The null hypothesis is then
$$ \mu_m \leq \mu_w $$
and the alternative hypothesis is
$$ \mu_m > \mu_w $$
where $\mu_m$ and $\mu_w$ are the means of suicide rates of men and women during
the time period 1985-2015, respectively. The confidence level is set to be $\alpha = 0.05$.
In order to test this hypothesis, we will make use of the Welch's Two Sample t-test,
which is a modification of Student's t-test supporting unequal variances of the samples.
```{r}
sex.hypothesis.data <- data %>% group_by(year, sex) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
)
rates.male <- sex.hypothesis.data %>% filter(sex == "Male") %>% pull(suicides_per_100k)
rates.female <- sex.hypothesis.data %>% filter(sex == "Female") %>% pull(suicides_per_100k)
t.test(
x = rates.male,
y = rates.female,
alternative = "greater"
)
```
We can see that the resulting p-value of the Welch's Two Sample t-test
is smaller than our confidence level, and because of that we can reject the null hypothesis
in favor of the alternative hypothesis. We have confirmed our initial suspicion,
that is that men have higher suicide rates than women.
### Difference in suicide rates between age groups
Another interesting insight gained from the exploratory analysis was that there
is an apparent difference in suicide rates between individual age groups.
Usually, we could use ANOVA to test if at least one of the groups differs significantly
from the rest of the groups. Unfortunately, our data violates certain assumptions of this method,
namely the normality assumption, which requires every group to be normally distributed.
We can easily show the violation of this assumption by inspecting one of the age groups,
for example the group containing people older than 75 years. We can check whether it is
normally distributed by taking a look at the histogram of differences of suicide rates
throughout the years and their mean.
```{r}
age.old.rates <- data %>%
filter(age == "75+") %>%
group_by(year) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000)
(data.frame(residual = age.old.rates$suicides_per_100k - mean(age.old.rates$suicides_per_100k))) %>%
ggplot(aes(x = residual)) +
scale_fill_viridis_c() +
geom_histogram(binwidth = 2,
center = 1,
fill = viridis::viridis(1, begin = 0.7),
color = "#e9ecef") +
labs(title = "Histogram of differences of yearly suicide rates and their mean",
subtitle = "for people that are 75+ years old",
x = "Residual", y = "Count")
```
This histogram indicates that the distribution of suicide rates in fact does not originate
from the normal distribution.
Due to this assumption being violated, we are forced to use a different method,
the Kruskal-Wallis test, which can be used even if the normality assumption does not hold.
The Kruska-Wallis test is a non-parametric test, and as such it does not assume anything about
the distribution of the data. It allows us to test a null hypothesis that is somewhat similar to that of ANOVA.
Let $g_1, \dots, g_6$ be the data of different age groups from our data.
The null hypothesis is that the data in groups $g_1, \dots, g_6$ comes from the same population.
The alternative hypothesis is that at least one of the groups comes from a different population.
```{r}
age.hypothesis.data <- data %>%
group_by(year, age) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
.groups = "keep"
)
kruskal.test(suicides_per_100k ~ age, data = age.hypothesis.data)
```
The p-value of the test is less than our confidence level, and so we reject the null hypothesis in
favour of the alternative hypothesis. The conclusion is that there is indeed a difference between
the individual age groups.
### Higher suicide rates in countries with negative correlation
During the analysis we have seen that countries that achieve large negative correlation have
a larger suicide rate than countries that achieve a large positive correlation. It might be
interesting to test this hypothesis.
Let us create two subsets of the countries based on their correlations. The first group will contain
countries that achieved correlations smaller than $-0.6$, and the other group will contain countries
whose suicide rate and GDP correlation was larger than $0.6$.
Let $\mu_l$ and $\mu_h$ be the mean suicide rates of countries of the first and second group, respectively.
The null hypothesis is then
$$ \mu_l \leq \mu_h $$
and the alternative hypothesis is
$$ \mu_l > \mu_h $$
Once again, we will make use of the Welch's Two Sample t-test.
```{r}
# Extract the suicide rates of the countries with significant correlations
signif.corr <- data %>%
filter(country %in% correlation.gdp.suicide$country) %>%
group_by(continent, country) %>%
summarise(
population = sum(population),
suicides_no = sum(suicides_no),
suicides_per_100k = (suicides_no / population) * 100000,
gdp_per_capita = mean(gdp_per_capita),
.groups = "keep") %>%
inner_join(correlation.gdp.suicide, by = "country") %>%
filter(p.value < 0.05)
t.test(x = signif.corr %>% filter(correlation < -0.6) %>% pull(suicides_per_100k),
y = signif.corr %>% filter(correlation > 0.6) %>% pull(suicides_per_100k),
alternative = "greater")
```
The resulting p-value is less than our predefined confidence level, therefore we reject the null hypothesis
in favor of the alternative hypothesis.
Another hypothesis that I wanted to test is that countries with large absolute values of the suicide rate
and GDP correlation have higher suicide rates than countries that have a small absolute correlation.
However, I did not want to base this hypothesis on the data contained in this dataset, since the small correlations did not turn out to be statistically significant.
## 3. Suicide Rate Prediction Model
Having done the analysis of the dataset, it might also be interesting to construct
a predictive model that would predict the suicide rate for a certain group of population.
For this purpose I will make use of LASSO, a regression method that performs
an automatic feature selection. We will try to fit two different models and then
evaluate them both using cross-validation and a test dataset.
### Encode categorical features
First off, we will have to encode our ordinal and nominal variables, so that
LASSO can make proper use of them.
```{r}
# Compute the suicide rate
ml.data <- data %>%
mutate(suicide_rate = (suicides_no / population) * 100000)
selected.columns <- c("suicide_rate", "population", "age", "sex", "gdp_per_capita",
"gdp_for_year", "country", "continent", "year")
# Hot one encoding the nominal variables
dummies <- dummyVars( ~ sex + continent + country, data = ml.data)
ml.data.encoded.nominal <- predict(dummies, newdata = ml.data)
# Tidy up the column names so that no space characters are present
colnames(ml.data.encoded.nominal) <- make.names(colnames(ml.data.encoded.nominal), unique = TRUE)
# Append the encoded ordinal variables and put everything back together
ml.data.encoded <- as_tibble(cbind(ml.data[, c("population", "gdp_per_capita", "gdp_for_year", "year")],
age = as.numeric(ml.data$age),
generation = as.numeric(ml.data$generation),
ml.data.encoded.nominal))
```
### Simple Lasso
The first model that we can try to fit using LASSO is a simple linear model.
```{r}
set.seed(123)
formula = as.formula(" ~ .")
# Partition the dataset into the training and testing subsets
train.idx <- createDataPartition(ml.data$suicide_rate, p = 0.7, list = FALSE)
y.train <- ml.data[train.idx, ]$suicide_rate
y.test <- ml.data[-train.idx, ]$suicide_rate
# Encode the data into required model
X.train <- data.frame(ml.data.encoded[train.idx, ])
X.train <- sparse.model.matrix(formula, data = X.train)
X.test <- data.frame(ml.data.encoded[-train.idx, ])
X.test <- sparse.model.matrix(formula, data = X.test)
# Run cross-validated LASSO
lasso.base.model <- cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
lasso.base.model
```
```{r}
# Performance on the test dataset
print("Test dataset performance")
prediction <- predict(lasso.base.model, s = lasso.base.model$lambda.1se, newx = X.test)
postResample(pred = prediction, y.test)
```
The most regularized fit has been achieved when lambda equals `r lasso.base.model$lambda.1se`. The $R^2$ of this model
is fairly frequently around 0.45, which would mean that it explains 45% of the variance in our original data. I believe
we can do better than this.
### Lasso with polynomial features
We have seen that the previous model did not fit the data very well. The model can be made more complex in order
to provide it with more flexibility, which would hopefully allow it to perform better. We will extend the previous
model with polynomial features.
```{r}
set.seed(123)
# Create the model formula
# This formula includes all of the features as well as their squares and all of the
# interactions between different features up to second degree
formula <- as.formula(paste(" ~ .^2 + ",
paste("poly(",
colnames(ml.data.encoded),
", degree = 2, raw = TRUE)[, 2]",
collapse = " + ")))
# Partition the dataset into the training and testing subsets
train.idx <- createDataPartition(ml.data$suicide_rate, p = 0.7, list = FALSE)
y.train <- ml.data[train.idx, ]$suicide_rate
y.test <- ml.data[-train.idx, ]$suicide_rate
X.train <- data.frame(ml.data.encoded[train.idx, ])
X.train <- sparse.model.matrix(formula, data = X.train)
X.test <- data.frame(ml.data.encoded[-train.idx, ])
X.test <- sparse.model.matrix(formula, data = X.test)
# Run the cross-validated LASSO
lasso.poly.model <- cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
lasso.poly.model
cat("\nMost regularized fit attained for lambda = ")
cat(lasso.poly.model$lambda.1se)
```
We can see that the most regularized fit has been attained with lambda equal to
`r lasso.poly.model$lambda.1se`.
```{r}
print("Test dataset performance")
prediction <- predict(lasso.poly.model, s = lasso.poly.model$lambda.1se, newx = X.test)
postResample(pred = prediction, y.test)
```
Based on the performance of this model on the test dataset, we can say that it is indeed better at making
predictions than our previous model was.
Let's take a look at what features where assigned the largest coefficients.
```{r}
coeffs.poly <- coef(lasso.poly.model, s = lasso.poly.model$lambda.1se)
head(coeffs.poly[order(abs(coeffs.poly), decreasing = TRUE), ], n = 20)
```
It seems that the model decided to base its prediction mainly on the interactions between the
categorical features sex and country. I am surprised that large coefficients were not assigned to
features containing the $\texttt{gdp_per_capita}$ features, as we have seen that it is often correlated
with the suicide rate.
## 4. Conclusion