analysis.Rmd

---
title: "Suicide Prevention and Analysis"
author: "Václav Hlaváč"
output:
  pdf_document: default
  word_document: default
---


```{r setup, include=FALSE}
library(gridExtra)
library(factoextra)
library(caret)
library(glmnet)
library(MASS)
library(tidyverse)
library(xgboost)
library(Matrix)
theme_set(theme_gray())
```


**Suicide Understanding and Prevention**

This notebook contains the analysis of a suicide dataset, which was the final
assignment of the Statistical data analysis course.

## 1. Exploratory data analysis

```{r include=FALSE}
# Load the dataset
load(file="data/suicide.RData")
```

I begin by inspecting whether the dataset contains any NA values that would need to be taken care of.
```{r}
data %>% is.na() %>% colSums()
```

It seems that the dataset has already been conveniently cleaned beforehand.

Did all of the included countries contribute the same amount of data? Are there
any countries that reported their statistics far less often than other countries?

```{r}
years.count <- data %>% group_by(continent, country) %>%
  summarise(count = n_distinct(year)) %>%
  arrange(desc(count))

years.count %>%
  ggplot(aes(x = count, fill = continent)) +
  scale_x_continuous(n.breaks = 31) +
  geom_bar()
```
It looks that a significant number of countries provided data about the number suicides within their population most of the years during 1985-2015. However there are some countries that supplied only less than 10 year worth of data.

### Suicide rate

Throughout the majority of this notebook, I will be working with suicide rates rather than with concrete numbers
of total suicides. The reason for this is that it allows us to perform better comparisons of suicide prevalence
between different countries and continents, as the populations of different areas differ, and the total number of
suicides in a population is correlated with the size of the population.

```{r}
country.summary <- data %>% group_by(year) %>%
  summarise(population = sum(population),
            suicides_no = sum(suicides_no))

cor(x = country.summary$population, y = country.summary$suicides_no)
```

If we wanted to be even more rigorous with our analysis, we could standardize our suicide rates by age distribution,
as is done by the World Health Organization. This allows for better comparison of statistics between countries with
different population age profiles.


### Global trend inspection

The first thing that we can inspect is the progression of suicide rates worldwide during the time period 1985-2015.

```{r}

global.suicide.rate <- data %>% group_by(year) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000
  ) 

global.suicide.rate %>%
  ggplot(aes(x = year, y = suicides_per_100k)) +
  geom_point(color="orange", size=2.4) +
  geom_line(color="orange", size=1.1) +
  labs(
    title = "Worldwide Suicide Rates",
    x = "Year", y = "Suicides per 100k people"
  ) +
  scale_x_continuous(breaks = seq(1985, 2015, 2))
  
```

This figure shows that there has been a great rise in suicides throughout the years 1989 and 1995.
However, with a few exceptions, the suicide rate seems to be on decline ever since then.

```{r}
differences <- data.frame(
  difference = tail(global.suicide.rate$suicides_per_100k, -1) - head(global.suicide.rate$suicides_per_100k, -1),
  year = tail(global.suicide.rate$year, -1))

differences <- differences %>%
  mutate(changeNotPositive = difference <= 0)

differences %>%
  ggplot(aes(x = year, y = difference, color = changeNotPositive)) +
  geom_segment(aes(x = year, xend = year, y = 0, yend = difference)) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1986, 2015, 2)) +
  scale_y_continuous(breaks = seq(-1, 1.5, 0.5)) +
  theme(legend.position = "none") +
  labs(title = "Change in suicide rate in time",
       subtitle = "Shows the change in suicide rate from the previous year",
       x = "Year", y = "Difference")
```
The figure above shows us the differences in suicide rates between two subsequent years.

Rather than inspecting the data worldwide, we can take a look at the progression of suicide rates within each of the
continents and check wheter there appears to be any differences between them.

```{r fig.height=7}

continent_overall_plot <- data %>% group_by(continent) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep"
  ) %>%
  ggplot(aes(x = continent, y = suicides_per_100k, fill = continent)) +
  geom_bar(stat = "identity") +
  labs(title = "Overall differences in suicide rates between continents",
       x = "Continent", y = "Suicides per 100k people") +
  theme(legend.position = "none")

continent_by_year_plot <- data %>% group_by(year, continent) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep"
  ) %>%
  ggplot(aes(x = year, y = suicides_per_100k, color=continent)) +
  geom_line() +
  geom_point(size=0.8) +
  scale_x_continuous(breaks = seq(1985, 2015, 5)) +
  facet_wrap(~ continent, ncol = 1, scales = "free_y") +
  labs(title = "Suicide Rate By Continent In Time",
       x = "Year", y = "Suicides per 100k people") +
  theme(legend.position = "none")

grid.arrange(continent_overall_plot, continent_by_year_plot, ncol = 2)
```
It turns out that there certainly is a difference in suicide rates between the continents. It is worth noting that the data for Oceania and the Americas suggest that the suicide rates have been on rise in the most recent years, which means that these continents do not comply with the worldwide trend.

The reason that the worldwide trend suggests that suicide rates are decreasing is the decrease of suicide rates within
European countries. The suicide rates in Europe are large enough to dominate the suicide rates within the other continents.
Because of that, they also dictate the worldwide trend. Europe also has the highest overall suicide rate during the inspected
time period.

I cannot recognize any trend within the Asian countries.

The data from African countries may be incomplete. There is a very pronounced and sudden decline in suicide rates after 
the year 1995. I do not think that this data represents the real situation within the African countries, and suspect that
the quality of data collection and suicide reports might be the culprit here.

Let's inspect the European countries in more detail, since Europe has the highest suicide rate in the world.

```{r fig.height = 6}

data_europe <- data %>% filter(continent == "Europe") %>%
  group_by(country) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep"
  ) %>%
  arrange(desc(suicides_per_100k))

data_europe %>%
  ggplot(aes(x = reorder(country, suicides_per_100k), y = suicides_per_100k, fill = suicides_per_100k)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_c() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Overall Suicide Rates in European countries",
       x = "Country", y = "Suicides per 100k people")
  
```

The country with the largest suicide rate is Lithuania, followed by the Russian Federation and Belarus. It seems that a large number of the countries with high suicide rate seem to be East European.

```{r fig.height = 6}

data_europe %>%
  ggplot(aes(x = reorder(country, suicides_no), y = suicides_no, fill = suicides_no)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_c() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Number of suicides in European countries",
       x = "Country", y = "Number of suicides")
  
```
```{r}
data %>% filter(country == "Russian Federation") %>%
  select(suicides_no) %>%
  sum()

data %>% filter(country != "Russian Federation") %>%
  select(suicides_no) %>%
  sum()
```

Inspecting the concrete number of suicides in the European countries reveals that the Russian Federation is the country with the most suicides, and that the number of suicides is much greater than in the other countries. This can be explained by Russia being the European country with the largest population and also having the second highest overall suicide rate in Europe.

### Differences between sexes

A significant factor in suicide risk can be the sex of the person.

```{r}
data %>% group_by(year, sex) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep") %>%
  ggplot(aes(x = year, y = suicides_per_100k, color=sex)) +
  geom_line(size=1.1) + 
  geom_point(size=2.4) +
  labs(title = "Global Differences in Suicide Rates Between Sexes",
       x = "Year", y = "Suicides per 100k people")
```

It is clear that the male part of the population is at a higher risk of suicide
than its female counterpart. Another somewhat interesting observation is that
the change in suicide rate of women in time is not as pronounced as is the
change of suicide rate of men.

```{r}
data %>% group_by(continent, sex) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep"
  ) %>%
  ggplot(aes(x = continent, y = suicides_per_100k, fill = sex)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_y_continuous(breaks = seq(0, 40, 5)) +
  coord_flip() +
  theme(legend.position = "bottom") +
  labs(title = "Overall differences in suicide rates between the continents",
       subtitle = "along with differences between sexes within individual continents",
       x = "Continent", y = "Suicides per 100k people")
```
We can also inspect the differences between the sexes on within individual
continents. This again reveals the same pattern for all of the continents,
that is that men are at higher risk of suicide than women.

```{r fig.height=7, fig.width=7}
data %>% group_by(continent, sex, year) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep"
  ) %>%
  ggplot(aes(x = year, y = suicides_per_100k, color = sex)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ continent, ncol = 1, scales = "free_y", strip.position = "right") +
  scale_x_continuous(breaks = seq(1985, 2015, 2)) +
  theme(legend.position = "bottom")
```
By inspecting the progression of suicide rates of sexes within different continents, we can state that the suicide rates of women are consistently lower than those of men across all of the inspected continents.

### Age

Another factor that might play a role in the suicide risk is the age of the person.

```{r}
data %>% group_by(year, age) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep") %>%
  ggplot(aes(x = year, y = suicides_per_100k, color = age)) +
  geom_line() +
  geom_point(size = 0.9) +
  scale_x_continuous(breaks = seq(1985, 2015, 2)) +
  scale_color_discrete() +
  labs(title = "Relationship between suicide rate and age",
       x = "Year", y = "Suicides per 100k people")
```

It seems that if we inspect the suicide data globally, there in fact is a difference is suicide rates between different
age groups. More over, the risk of suicide seems to grow larger with the age of the person.

```{r fig.height=10, fig.width=8}
data %>% group_by(continent, year, age) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep") %>%
  ggplot(aes(x = year, y = suicides_per_100k, color = age)) +
  geom_line() +
  geom_point(size = 0.9) +
  scale_x_continuous(breaks = seq(1985, 2015, 2)) +
  scale_color_discrete() +
  facet_wrap(~continent, ncol = 1) +
  labs(title = "Relationship between suicide rate and age within individual continents",
       x = "Year", y = "Suicides per 100k people")
```

However, if we inspect the dependence on suicide rate on age within individual continents, we can see that the previous
observation does not hold true everywhere in the world.

The differences between age groups are most pronounced in the Asian and European countries. There is also some difference
between the age groups in the Americas, though they do not appear as pronounced as in the two previously mentioned continents.

On the other hand, if we take a look at the data from Oceania, we can see that there does not appear to be any striking dependence of the suicide rate on the age group.


### Correlation Between GDP and Suicide Rates

Using our dataset, we can also try to find out whether the gross domestic product
of a country affects its suicide rate. 

```{r}

data %>% group_by(continent, country) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    mean_gdp_per_capita = mean(gdp_per_capita),
    .groups = "keep") %>%
  ggplot(aes(x = mean_gdp_per_capita, y = suicides_per_100k, color = continent)) +
  geom_point() +
  labs(title = "Relationship between suicide rate and GDP per capita",
       subtitle = "Both variables are represented as means from the time period 1985-2015",
       x = "GDP per capita [$]", y = "Suicides per 100k people")
  

```
The first attempt to investigate the relationship between these two variables can be plotting the
overall suicide rate of a country against its mean GDP per capita during the available time period.
However, this figure does not seem to reveal any obvious relationship between these variables.

We can also try to calculate the correlations between these two variables over the same time interval
for each of the countries individually. The correlations were computed using Spearman's rho, since
the relationship between the two variables is not expected to be linear. The correlations have been
computed only for the countries where 10 or more years of data has been available.

```{r}

correlation.gdp.suicide <- data.frame()

for (c in levels(data$country)) {

  country.data <- data %>% filter(country == c) %>%
    group_by(country, year, gdp_per_capita) %>%
    summarise(
      population = sum(population),
      suicides_no = sum(suicides_no),
      suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep")
  
  if (nrow(country.data) < 10) {
    next
  }
  
  corr <- cor.test(x = country.data$gdp_per_capita, y = country.data$suicides_per_100k, method = "spearman",
                   exact = FALSE)
  
  correlation.gdp.suicide <- rbind(
    correlation.gdp.suicide,
    list(
      country = c,
      correlation = corr$estimate,
      p.value = corr$p.value,
      significant = corr$p.value < 0.05
    )
  )
}


correlation.gdp.suicide <- arrange(correlation.gdp.suicide, correlation.gdp.suicide$correlation)

```

```{r}
head(correlation.gdp.suicide)
tail(correlation.gdp.suicide)
```

We can see that this provided us with a much more better insight into the data, and that there truly is a relationship between the suicide rate in a country and the gross domestic product of that country.
This relationship seems to be quite strong for a considerable amount of countries.
Some of the countries achieve a large positive correlation and some achieve a large negative correlation.
However, they are not distributed equally, as negative correlations are present more often than the positive ones.
All the correlations (with the exception of Cyprus) whose absolute value is larger than $0.4$ are statistically significant with confidence level $\alpha = 0.05$.

```{r}
correlation.gdp.suicide %>%
  filter(p.value < 0.05) %>%
  mutate(isNotPositive = correlation <= 0) %>%
  ggplot(aes(x = correlation, fill = isNotPositive)) +
  geom_histogram(binwidth = 0.1, center = 0.1, color = "#e9ecef") +
  scale_x_continuous(breaks = seq(-1, 1, 0.1)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  labs(title = "Histogram of statistically significant correlations of suicide rate and GDP") +
  theme(legend.position = "none")
```

Let's inspect this relationship further by plotting the suicide rate against GDP
for six of the countries that achieved the largest negative correlations.

```{r fig.width=8}

top.corr.countries <- head(correlation.gdp.suicide)$country
top.corr.correlations <- head(correlation.gdp.suicide)$correlation

my_labeller <- function(countries) {
  correlation <- map(
    countries,
    function(x) correlation.gdp.suicide[correlation.gdp.suicide$country == x, ]$correlation
  ) %>% unlist
  correlation <- round(correlation, digits = 2)
  result <- paste(as.character(countries), ", rho = ", correlation, sep = "")
}

data %>% filter(country %in% top.corr.countries) %>%
  group_by(country, year, gdp_per_capita) %>%
  summarise( 
      population = sum(population),
      suicides_no = sum(suicides_no),
      suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep") %>%
  mutate(country_order = factor(country, levels = top.corr.countries)) %>%
  ggplot(aes(x = gdp_per_capita, y = suicides_per_100k, color = country_order)) +
  geom_point() +
  scale_colour_viridis_d(end = 0.8) +
  facet_wrap(~ country_order, scales = "free",
             labeller = as_labeller(my_labeller)) +
  labs(title = "Countries with the largest negative correlation of gross domestic product and suicide rate",
       subtitle = "Correlation is reported using Spearman's rho",
       x = "GDP [$]", y = "Suicides per 100k people") +
  theme(legend.position = "none")

```

We can see that plotting the data allows us to easily notice the relationship between these two variables.
It is also worth noting that all of these countries achieving the largest negative correlations are European.

We can do the same for countries that achieved the largest positive correlation.

```{r fig.width=8}

top.corr.countries <- head(arrange(correlation.gdp.suicide, desc(correlation.gdp.suicide$correlation)))
top.corr.countries <- top.corr.countries$country

my_labeller <- function(countries) {
  correlation <- map(
    countries,
    function(x) correlation.gdp.suicide[correlation.gdp.suicide$country == x, ]$correlation
  ) %>% unlist
  correlation <- round(correlation, digits = 2)
  result <- paste(as.character(countries), ", rho = ", correlation, sep = "")
}

data %>% filter(country %in% top.corr.countries) %>%
  group_by(country, year, gdp_per_capita) %>%
  summarise(
      population = sum(population),
      suicides_no = sum(suicides_no),
      suicides_per_100k = (suicides_no / population) * 100000,
    .groups = "keep") %>%
  mutate(country_order = factor(country, levels = top.corr.countries)) %>%
  ggplot(aes(x = gdp_per_capita, y = suicides_per_100k, color = country_order)) +
  geom_point() +
  scale_colour_viridis_d(begin = 0.8, end = 0) +
  facet_wrap(~ country_order, scales = "free",
             labeller = as_labeller(my_labeller)) +
  labs(title = "Countries with the largest correlation of gross domestic product and suicide rate",
       subtitle = "Correlation is reported using Spearman's rho",
       x = "GDP [$]", y = "Suicides per 100k people") +
  theme(legend.position = "none")
```

Once again, the interaction between suicide rate and GDP is easily recognizable, although in some cases
(e.g. Montenegro) it does not seem to be as well structured as it has been for the negative correlations.

Another thing I would like to investigate is whether there is a connection between the overall suicide rate
of a country and its dependence on the GDP of the country.


```{r}
data %>%
  filter(country %in% correlation.gdp.suicide$country) %>%
  group_by(continent, country) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    gdp_per_capita = mean(gdp_per_capita),
    .groups = "keep") %>%
  inner_join(correlation.gdp.suicide, by = "country") %>%
  # filter(p.value < 0.05) %>% 
  ggplot(aes(x = correlation, y = suicides_per_100k, color = continent, shape = significant)) +
  geom_point() +
  labs(title = "Relationship between suicide rate and its correlation with GDP",
       subtitle = "Correlation is reported using Spearman's Rho",
       x = "Correlation", y = "Suicides per 100k people")
```
The plot includes the correlations even if they were not statistically significant.
It seems that the countries, where the suicide rate is highly correlated with the gross domestic
product, tend to have higher suicide rates than the countries where the correlation with GDP
is not very strong. This can be observed on both ends of the correlation's value range, however it is
far more pronounced for countries that achieve a negative correlation.

We can also observe that the suicide rates of countries that achieve a large negative correlation tend
to be larger than of countries that achieve a large positive correlation.

Once again, it is worth noting that the majority of countries that achieve a negative correlation is European.
Let's see what correlations did individual European countries achieve.

```{r fig.height=4, fig.width=8}

countries.continent <- distinct(data[c("country", "continent")])

correlation.gdp.suicide %>%
  inner_join(countries.continent, by = "country") %>%
  filter(continent == "Europe") %>%
  ggplot(aes(x = reorder(country, correlation), y = correlation, fill = correlation)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_c() +
  scale_y_continuous(limits = c(-1, 1), breaks = seq(-1, 1, 0.1)) +
  labs(title = "Correlations of European countries",
       x = "Country", y = "Correlation") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        legend.position = "none")

```

We can see that most of the European countries truly achieve large negative correlations of
suicide rates and GDP. However, there are a few exceptions, where these two variables
are correlated positively.


### Clustering

The plot of an average suicide rate of a country against its correlation with
the country's GDP per capita was really interesting to me, and therefore I would
like to try to separate the individual countries using K-means clustering.

First, let us select the data that will be used for clustering.

```{r}
clustering.data <- data %>%
  filter(country %in% correlation.gdp.suicide$country) %>%
  group_by(country) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    gdp_per_capita = mean(gdp_per_capita)) %>%
  inner_join(correlation.gdp.suicide, by = "country") %>%
  mutate(suicides_per_100k = scale(suicides_per_100k),
         correlation = scale(correlation)) %>%
  dplyr::select("correlation", "suicides_per_100k")
```

We can utilize the elbow method in order to select the number of clusters.

```{r}
set.seed(42)
k.max <- 15
wss <- sapply(1:k.max, 
              function(k) { kmeans(clustering.data, k, nstart=50,iter.max = 15 )$tot.withinss })

elbow <- data.frame(k = c(1:k.max), wss = wss)

elbow %>%
  ggplot(aes(x = k, y = wss, color = "orange")) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(1, k.max, 1)) +
  labs(title = "The change of the total within cluster sum of squares with number of clusters",
       x = "Number of clusters", y = "Total within cluster sum of squares") +
  theme(legend.position = "none")

```

We can see that the decline slows down after creating 4 clusters, and therefore
we will set $k = 4$.

```{r}
clustering.data <- data %>%
  filter(country %in% correlation.gdp.suicide$country) %>%
  group_by(country) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    gdp_per_capita = mean(gdp_per_capita)) %>%
  inner_join(correlation.gdp.suicide, by = "country") %>%
  mutate(suicides_per_100k = scale(suicides_per_100k),
         correlation = scale(correlation)) %>%
  dplyr::select("correlation", "suicides_per_100k")

set.seed(1234)
clustering <- kmeans(clustering.data, 4)

fviz_cluster(clustering, data = clustering.data, labelsize = 0) +
  labs(title = "Clustering countries based on their dependence on GDP",
       subtitle = "Both features have been normalized",
       x = "GDP and Suicide rate correlation",
       y = "Suicides per 100k people")


```

It looks like that the first cluster is comprised of countries where the suicide rate
has a strong negative correlation with the GDP and the suicide rate is above average.
The second cluster is also made out of countries with strong correlation, however
it is positive instead of negative. The suicide rate is also higher than average for
most of the countries within this cluster.

The remaining two clusters are composed of countries whose suicide rates are mostly
below the average.


### Dimension Reduction

The $\texttt{generation}$ feature is redundant, since we can assign a generation to a group of population by combining the information
from the features $\texttt{year}$ and $\texttt{age}$.

Furthermore, the features $\texttt{gdp_for_year}$ and $\texttt{gdp_per_capita}$ are mutually redundant. That is because we can always
use one of these features in combination with the sum of total population in a country in a certain year to get the value of the other
feature, e.g. $\texttt{gdp_per_capita} = \frac{\texttt{gdp_for_year}}{\texttt{total_population}}$.

```{r}
data %>% group_by(country, year, gdp_for_year, gdp_per_capita) %>%
  summarise(population = sum(population), .groups = "keep") %>%
  mutate(x = as.integer(round(gdp_for_year / population)),
         y = abs(x - gdp_per_capita)) %>%
  pull(y) %>%
  mean
```


### Exploratory Data Analysis Conclusion

The exploratory data analysis helped us uncover many interesting insights. First of all, we have seen that
the world experienced a sudden rise in suicide rates during 1989-1995, however these rates these have been on
decline since then. We also inspected the difference in suicide rates between the individual continents.
This revealed that Europe has the highest suicide rate, followed by Asia. Inspection of the suicide rates
among European countries has shown us that Lithuania has the highest suicide rate, however the number of suicides has been the largest in Russia.

We have also seen that men are at a higher risk of committing suicide than women,
and that this is true across all of the continents and the investigated time period.

It also became apparent that the age plays a role in the risk of committing suicide.
There seems to be a difference in suicide rates between the age groups, although
this difference has not been seen in every continent.

An interesting relationship between the suicide rate within a country and the country's gross domestic
product has been revealed as well. This relationship appears to be quite substantial
in many cases, and is almost always present within the European countries. There also appears
to be a connection between the correlation of suicide rate and GDP and the suicide rate as such.


## 2. Hypotheses Testing

I would like to test several hypotheses based on the analysis that we have produced earlier.
The confidence level for all of the hypothesis tests is set to be $\alpha = 0.05$.

### Suicide rates of men and women

The first hypothesis that we will test is that that men tend to commit
suicide more often than women. Therefore, we will test if the suicide rate of men
was higher than that of women during the time interval 1985-2015.

Let $\mu_m$ and $\mu_w$ be the means of suicide rates of men and women, respectively.

The null hypothesis is then
$$ \mu_m \leq \mu_w $$ 
and the alternative hypothesis is
$$ \mu_m > \mu_w $$
where $\mu_m$ and $\mu_w$ are the means of suicide rates of men and women during
the time period 1985-2015, respectively. The confidence level is set to be $\alpha = 0.05$.

In order to test this hypothesis, we will make use of the Welch's Two Sample t-test,
which is a modification of Student's t-test supporting unequal variances of the samples.


```{r}

sex.hypothesis.data <- data %>% group_by(year, sex) %>%
    summarise(
      population = sum(population),
      suicides_no = sum(suicides_no),
      suicides_per_100k = (suicides_no / population) * 100000,
      .groups = "keep"
    )

rates.male <- sex.hypothesis.data %>% filter(sex == "Male") %>% pull(suicides_per_100k)
rates.female <- sex.hypothesis.data %>% filter(sex == "Female") %>% pull(suicides_per_100k)

t.test(
  x = rates.male,
  y = rates.female,
  alternative = "greater"
)

```

We can see that the resulting p-value of the Welch's Two Sample t-test
is smaller than our confidence level, and because of that we can reject the null hypothesis
in favor of the alternative hypothesis. We have confirmed our initial suspicion,
that is that men have higher suicide rates than women.


### Difference in suicide rates between age groups

Another interesting insight gained from the exploratory analysis was that there
is an apparent difference in suicide rates between individual age groups.

Usually, we could use ANOVA to test if at least one of the groups differs significantly
from the rest of the groups. Unfortunately, our data violates certain assumptions of this method,
namely the normality assumption, which requires every group to be normally distributed.

We can easily show the violation of this assumption by inspecting one of the age groups,
for example the group containing people older than 75 years. We can check whether it is
normally distributed by taking a look at the histogram of differences of suicide rates
throughout the years and their mean.

```{r}

age.old.rates <- data %>%
  filter(age == "75+") %>%
  group_by(year) %>%
  summarise(
      population = sum(population),
      suicides_no = sum(suicides_no),
      suicides_per_100k = (suicides_no / population) * 100000)

(data.frame(residual = age.old.rates$suicides_per_100k - mean(age.old.rates$suicides_per_100k))) %>%
  ggplot(aes(x = residual)) +
  scale_fill_viridis_c() +
  geom_histogram(binwidth = 2,
                 center = 1,
                 fill = viridis::viridis(1, begin = 0.7),
                 color = "#e9ecef") +
  labs(title = "Histogram of differences of yearly suicide rates and their mean",
       subtitle = "for people that are 75+ years old",
       x = "Residual", y = "Count")

```

This histogram indicates that the distribution of suicide rates in fact does not originate
from the normal distribution.

Due to this assumption being violated, we are forced to use a different method,
the Kruskal-Wallis test, which can be used even if the normality assumption does not hold.
The Kruska-Wallis test is a non-parametric test, and as such it does not assume anything about
the distribution of the data. It allows us to test a null hypothesis that is somewhat similar to that of ANOVA.

Let $g_1, \dots, g_6$ be the data of different age groups from our data.
The null hypothesis is that the data in groups $g_1, \dots, g_6$ comes from the same population.
The alternative hypothesis is that at least one of the groups comes from a different population.


```{r}

age.hypothesis.data <- data %>%
  group_by(year, age) %>%
  summarise(
      population = sum(population),
      suicides_no = sum(suicides_no),
      suicides_per_100k = (suicides_no / population) * 100000,
      .groups = "keep"
    )

kruskal.test(suicides_per_100k ~ age, data = age.hypothesis.data)

```

The p-value of the test is less than our confidence level, and so we reject the null hypothesis in
favour of the alternative hypothesis. The conclusion is that there is indeed a difference between
the individual age groups.

### Higher suicide rates in countries with negative correlation

During the analysis we have seen that countries that achieve large negative correlation have
a larger suicide rate than countries that achieve a large positive correlation. It might be
interesting to test this hypothesis.

Let us create two subsets of the countries based on their correlations. The first group will contain
countries that achieved correlations smaller than $-0.6$, and the other group will contain countries
whose suicide rate and GDP correlation was larger than $0.6$.

Let $\mu_l$ and $\mu_h$ be the mean suicide rates of countries of the first and second group, respectively.

The null hypothesis is then
$$ \mu_l \leq \mu_h $$
and the alternative hypothesis is
$$ \mu_l > \mu_h $$

Once again, we will make use of the Welch's Two Sample t-test.


```{r}

# Extract the suicide rates of the countries with significant correlations
signif.corr <- data %>%
  filter(country %in% correlation.gdp.suicide$country) %>%
  group_by(continent, country) %>%
  summarise(
    population = sum(population),
    suicides_no = sum(suicides_no),
    suicides_per_100k = (suicides_no / population) * 100000,
    gdp_per_capita = mean(gdp_per_capita),
    .groups = "keep") %>%
  inner_join(correlation.gdp.suicide, by = "country") %>%
  filter(p.value < 0.05)

t.test(x = signif.corr %>% filter(correlation < -0.6) %>% pull(suicides_per_100k),
       y = signif.corr %>% filter(correlation > 0.6) %>% pull(suicides_per_100k),
       alternative = "greater")
  
```

The resulting p-value is less than our predefined confidence level, therefore we reject the null hypothesis
in favor of the alternative hypothesis.

Another hypothesis that I wanted to test is that countries with large absolute values of the suicide rate
and GDP correlation have higher suicide rates than countries that have a small absolute correlation.
However, I did not want to base this hypothesis on the data contained in this dataset, since the small correlations did not turn out to be statistically significant.


## 3. Suicide Rate Prediction Model

Having done the analysis of the dataset, it might also be interesting to construct
a predictive model that would predict the suicide rate for a certain group of population.

For this purpose I will make use of LASSO, a regression method that performs
an automatic feature selection. We will try to fit two different models and then
evaluate them both using cross-validation and a test dataset.

### Encode categorical features

First off, we will have to encode our ordinal and nominal variables, so that
LASSO can make proper use of them.

```{r}

# Compute the suicide rate
ml.data <- data %>%
  mutate(suicide_rate = (suicides_no / population) * 100000)

selected.columns <- c("suicide_rate", "population", "age", "sex", "gdp_per_capita",
                      "gdp_for_year", "country", "continent", "year")

# Hot one encoding the nominal variables
dummies <- dummyVars( ~ sex + continent + country, data = ml.data)
ml.data.encoded.nominal <- predict(dummies, newdata = ml.data)

# Tidy up the column names so that no space characters are present
colnames(ml.data.encoded.nominal) <- make.names(colnames(ml.data.encoded.nominal), unique = TRUE)

# Append the encoded ordinal variables and put everything back together
ml.data.encoded <- as_tibble(cbind(ml.data[, c("population", "gdp_per_capita", "gdp_for_year", "year")],
                         age = as.numeric(ml.data$age),
                         generation = as.numeric(ml.data$generation),
                         ml.data.encoded.nominal))

```


### Simple Lasso

The first model that we can try to fit using LASSO is a simple linear model.

```{r}

set.seed(123)

formula = as.formula(" ~ .")

# Partition the dataset into the training and testing subsets
train.idx <- createDataPartition(ml.data$suicide_rate, p = 0.7, list = FALSE)

y.train <- ml.data[train.idx, ]$suicide_rate
y.test <- ml.data[-train.idx, ]$suicide_rate

# Encode the data into required model
X.train <- data.frame(ml.data.encoded[train.idx, ])
X.train <- sparse.model.matrix(formula, data = X.train)

X.test <- data.frame(ml.data.encoded[-train.idx, ])
X.test <- sparse.model.matrix(formula, data = X.test)

# Run cross-validated LASSO
lasso.base.model <- cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
lasso.base.model

```

```{r}
# Performance on the test dataset
print("Test dataset performance")
prediction <- predict(lasso.base.model, s = lasso.base.model$lambda.1se, newx = X.test)
postResample(pred = prediction, y.test)
```

The most regularized fit has been achieved when lambda equals `r lasso.base.model$lambda.1se`. The $R^2$ of this model
is fairly frequently around 0.45, which would mean that it explains 45% of the variance in our original data. I believe
we can do better than this.

### Lasso with polynomial features

We have seen that the previous model did not fit the data very well. The model can be made more complex in order
to provide it with more flexibility, which would hopefully allow it to perform better. We will extend the previous
model with polynomial features.

```{r}

set.seed(123)

# Create the model formula
# This formula includes all of the features as well as their squares and all of the
# interactions between different features up to second degree
formula <- as.formula(paste(" ~ .^2 + ",
                            paste("poly(",
                                  colnames(ml.data.encoded),
                                  ", degree = 2, raw = TRUE)[, 2]",
                                  collapse = " + ")))

# Partition the dataset into the training and testing subsets
train.idx <- createDataPartition(ml.data$suicide_rate, p = 0.7, list = FALSE)


y.train <- ml.data[train.idx, ]$suicide_rate
y.test <- ml.data[-train.idx, ]$suicide_rate

X.train <- data.frame(ml.data.encoded[train.idx, ])
X.train <- sparse.model.matrix(formula, data = X.train)

X.test <- data.frame(ml.data.encoded[-train.idx, ])
X.test <- sparse.model.matrix(formula, data = X.test)

# Run the cross-validated LASSO
lasso.poly.model <- cv.glmnet(x = X.train, y = y.train, type.measure = "mse")
lasso.poly.model

cat("\nMost regularized fit attained for lambda = ")
cat(lasso.poly.model$lambda.1se)

```

We can see that the most regularized fit has been attained with lambda equal to
`r lasso.poly.model$lambda.1se`.

```{r}
print("Test dataset performance")
prediction <- predict(lasso.poly.model, s = lasso.poly.model$lambda.1se, newx = X.test)
postResample(pred = prediction, y.test)
```

Based on the performance of this model on the test dataset, we can say that it is indeed better at making
predictions than our previous model was.

Let's take a look at what features where assigned the largest coefficients.

```{r}

coeffs.poly <- coef(lasso.poly.model, s = lasso.poly.model$lambda.1se)
head(coeffs.poly[order(abs(coeffs.poly), decreasing = TRUE), ], n = 20)

```

It seems that the model decided to base its prediction mainly on the interactions between the
categorical features sex and country. I am surprised that large coefficients were not assigned to
features containing the $\texttt{gdp_per_capita}$ features, as we have seen that it is often correlated
with the suicide rate.

## 4. Conclusion

After exploring this dataset we can say that we have found several suicide patterns.
It came to our attention that men are significantly more likely to commit suicide than
women, which we have also tested using a t-test. This was also found to be true by the WHO suicide
report. We also revealed that there is a difference in suicide rates between different age categories,
and that mainly in Europe and Asia, the suicide risk grows larger with the age of a person.
Last but not least, suicide rates within many countries were strongly correlated with
the GDP per capita of the country.

We found that Europe has the highest overall suicide rate out of all of the continents, with Russia
and Lithuania having the highest suicide rates. 

While working with this dataset, I wondered what other patterns or connections we might be
able to find, given that we would expand our dataset with new data. For example, It would be
interesting to check whether there is a relationship of a suicide rate within a country and the country's
happiness index. I would also be interested in the connection between a country's
suicide rate and the alcohol consumption level within its population.