Commit

4/8/2019
ffelite committed Apr 8, 2019
1 parent b8de711 commit bb3baff
Showing 11 changed files with 112 additions and 52 deletions.
79 changes: 56 additions & 23 deletions 08-Exploring-state-data-set.Rmd
Expand Up @@ -41,11 +41,14 @@ str(sta)
summary(sta)
```

## Exploring state data
## Visualizing data

Let's start by visualizing the distributions of the numeric variables. In many cases, we want to know whether our data follow a normal distribution. Here are some ways to check the normality of a variable.

- Histogram: Does the histogram approximate a normal density curve? If yes, then the variable more likely follows a normal distribution.

First, let's see whether the numeric variables are normally distributed. There are multiple ways to check the normality of numeric data:
- histogram: Does the histogram approximate a normal density curve? If yes, then the variable more likely follows a normal distribution.
- Q\-Q plot: Do the sample quantiles fall almost on a straight line? If yes, then the variable more likely follows a normal distribution.

- Shapiro-Wilk test: This is a widely used normality test. The null hypothesis is that the variable follows a normal distribution. A small p-value is evidence against normality.
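
Before looping over every variable, the three checks can be tried on a single one. The following is a minimal sketch, assuming the built-in `state.x77` matrix as the data source; the names `dat` and `p_values` are hypothetical and are not part of the chapter's own loop.

```r
# Minimal sketch: normality checks for one variable, then p-values for all.
dat <- data.frame(state.x77)  # built-in data; column names become Life.Exp, HS.Grad

x <- dat$Income
hist(x, main = "Hist. of Income", xlab = "Income")  # compare the shape with a bell curve
qqnorm(x, main = "Q-Q Plot of Income")              # points near a line suggest normality
qqline(x)                                           # reference line through the quartiles

# Shapiro-Wilk p-value for every numeric column (p_values is a hypothetical name).
p_values <- sapply(dat, function(v) shapiro.test(v)$p.value)
round(p_values, 4)
```

Collecting only the p-values in a named vector is one possible design choice; it makes the results easier to compare against the 0.05 threshold than a stack of full test objects.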

```{r message=FALSE, fig.width=8, fig.height=8}
Expand All @@ -61,7 +64,7 @@ for (i in 1:length(a)){ # Use a for loop to test normality of all variables list
# frame as a numeric vector. Functions can be applied to a vector directly.
hist(sub, main = paste("Hist. of", a[i], sep = " "), xlab = a[i])
qqnorm(sub, main = paste("Q-Q Plot of", a[i], sep = " ")) #Q-Q plot
qqline(sub) # Add a qq plot line.
qqline(sub) # Add a QQ plot line.
if (i == 1) {s.t <- shapiro.test(sub) # Normality test for population
} else {s.t <- rbind(s.t, shapiro.test(sub)) # Bind a new test result to previous results in row.
}
Expand All @@ -73,43 +76,44 @@ s.t

From the histograms and Q-Q plots we can see that the distributions of Population, Illiteracy and Area are right-skewed, with most values concentrated at the low end. Income and Life.Exp are close to normally distributed. The Shapiro-Wilk tests show no evidence against normality for Income, Life.Exp and Frost (p-values greater than 0.05), while Murder and HS.Grad have p-values close to 0.05. There is no evidence that Population, Illiteracy and Area are normally distributed.

As for the categorical variable region, here is the region information including the count and percentage of states.
In the state data there is one categorical variable, *Region*, with 4 levels. How is it distributed? Let's look at the number of observations (states) in each region and the corresponding percentages.

(ref:state-region) State count in each region

```{r state-region, fig.cap='(ref:state-region)', fig.align='center'}
counts <- sort(table(sta$Region), decreasing = TRUE)
percentages <- 100 * counts / length(sta$Region)
barplot(percentages, ylab = "Percentage", col = "lightblue")
text(x=seq(0.7, 5, 1.2), 2, paste("n=", counts))
counts <- sort(table(sta$Region), decreasing = TRUE) #Count the number of states in each region
percentages <- 100 * counts / length(sta$Region) #Calculate the percentage of states in each region
barplot(percentages, ylab = "Percentage", col = "lightblue") #Obtain bar chart based on percentages
text(x=seq(0.7, 5, 1.2), 2, paste("n=", counts)) #Add count to each bar
```

The bar plot shows that the South has the most states (16) and the Northeast the fewest (9). North Central and West have similar numbers of states (12 and 13).

To see whether California and New York already had larger populations than the other states, as they do today, or how the population of South Dakota compared with other states, we use a lollipop plot to show the population of all states.
To see whether California and New York already had larger populations than the other states, as they do today, or how the population of South Dakota compared with other states, we use a lollipop plot to show the population of all states. A lollipop plot is a hybrid of a scatter plot and a bar plot. It shows the relationship between two variables: one must be numeric, and the other can be numeric or categorical. A lollipop chart consists of a point (drawn with geom_point) and a segment (drawn with geom_segment), so you can modify these two components with the usual arguments: 'size', 'color', 'linetype', 'alpha', 'shape', etc. Here is a lollipop chart showing the relationship between state and population.

(ref:state-pop) Lollipop plot of population in each state

```{r state-pop, fig.cap='(ref:state-pop)', fig.align='center'}
library(ggplot2)
ggplot(sta, aes(x = State, y = Population)) +
geom_point(size = 3) +
geom_point(size=3, color="red", fill=alpha("orange", 0.3), alpha=0.5, shape=20, stroke=2) +
geom_segment(aes(x = State, xend = State, y = 0, yend = Population)) +
labs(title = "Lollipop Chart for Population") +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 65, vjust = 0.6))
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, vjust = 0.5))
```

From the plot we can see that even in the 1970s, California and New York were the top two states by population, while South Dakota had a small population.

Other questions we may ask: how was the murder rate distributed in the 1970s? Is it the same across states and regions? Which factors affect the murder rate most? Can we build a model from the other variables to explain their contribution to the murder rate?
Other questions we may ask: how was the murder rate distributed in the 1970s? Is it the same across states and regions? Which factors affect the murder rate most? Can we build a model from the other variables to explain their contribution to the murder rate? A choropleth map may give us an overall view.

(ref:state-map) Map of murder rate distribution

```{r state-map, fig.cap='(ref:state-map)', fig.align='center'}
library(maps)
sta$region <- tolower(state.name) # create new character vector with lowercase states names
states <- map_data("state") # extract state data
map <- merge(states, sta, by = "region", all.x = T) # merge states and state.x77 data
library(ggplot2)
sta$region <- tolower(state.name) # Create new character vector with lowercase states names
states <- map_data("state") # Extract state data
map <- merge(states, sta, by = "region", all.x = T) # Merge states and state.x77 data
map <- map[order(map$order), ]
ggplot(map, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = Murder)) +
Expand Down Expand Up @@ -137,27 +141,39 @@ ggplot(sta, aes(x = Murder, y = Region, fill = Region)) +
The ridgeline plot tells us that the murder rate is concentrated at lower values for the West, Northeast and North Central regions, but at higher values for the South, which agrees with the map above: the South has a higher murder rate than the other regions.

```{exercise}
Use lollipop plots to explore the distribution of Illiteracy in state.x77 data set and give brief interpretation. Hint: You can combine state.abb to state.x77 or use the row names of state.x77 data set directly.
Similar to Figure \@ref(fig:state-map), use lollipop plots to show Illiteracy by state in the state.x77 data set and give a brief interpretation. Hint: You can combine state.abb with state.x77 or use the row names of the state.x77 data set directly. You can start by importing the data:
```

```{r eval=F}
tem <- data.frame(state.x77)
sta <- cbind(state.abb, tem, state.region)
colnames(sta)[10] <- "Region"
......
```

---

```{exercise}
Use ridgeline plot to explore the regional distribution of Illiteracy for state.x77 and state.region data sets and interpret your figure.
Similar to Figure \@ref(fig:state-murder), use a ridgeline plot to explore the regional distribution of Illiteracy for the state.x77 and state.region data sets and interpret your figure.
```

---

## Analyzing the relationship among variables

To visualize the linear relationships among variables in one plot, a scatter matrix is a good choice. A scatter matrix is a pairwise scatter plot of multiple variables presented in matrix format. The correlation coefficient measures the strength of the linear relationship between two variables, and its range is [-1, 1]. A coefficient of -1 means two variables have a perfect negative linear relationship, such as $y=-x$, and a coefficient of 1 means a perfect positive linear relationship, such as $y=2x+1$. Here is an example of how to make a scatter matrix.
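
The two boundary cases of the correlation coefficient can be checked numerically. This is a minimal sketch on simulated data; the vector `x` and the names `r_neg`, `r_pos` and `r_quad` are hypothetical and only serve the illustration.

```r
# Minimal sketch: Pearson correlation for exact linear and nonlinear relationships.
set.seed(1)                  # make the simulated data reproducible
x <- rnorm(100)              # hypothetical sample

r_neg  <- cor(x, -x)         # y = -x: perfect negative linear relationship
r_pos  <- cor(x, 2 * x + 1)  # y = 2x + 1: perfect positive linear relationship
r_quad <- cor(x, x^2)        # y = x^2: related, but not linearly

c(negative = r_neg, positive = r_pos, quadratic = r_quad)
```

The first two coefficients equal -1 and 1 up to floating-point rounding, while the quadratic case lies strictly between them, which shows that the coefficient captures only linear association.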

(ref:state-corrplot) Corrplot for numeric variables

```{r state-corrplot, message=FALSE, fig.width=6, fig.height=6, fig.cap='(ref:state-corrplot)', fig.align='center'}
st <- sta[, 2:9] #take numeric variables as goal matrix
library(ellipse)
library(corrplot)
corMatrix <- cor(as.matrix(st)) # correlation matrix
col <- colorRampPalette(c("#7F0000", "red", "#FF7F00", "yellow", "#7FFF7F",
"cyan", "#007FFF", "blue", "#00007F"))
corMatrix <- cor(as.matrix(st)) # Calculate correlation matrix
col <- colorRampPalette(c("red", "yellow", "blue")) #Color values. Red, yellow and blue represent that the coefficients are -1, 0 and 1 respectively. You can use more than 3 colors to represent the coefficients ranging from -1 to 1.
corrplot.mixed(corMatrix, order = "AOE", lower = "number", lower.col = "black",
number.cex = .8, upper = "ellipse", upper.col = col(10),
diag = "u", tl.pos = "lt", tl.col = "black")
diag = "u", tl.pos = "lt", tl.col = "black") #Type ?corrplot.mixed in the Console to get help in detail.
```

In the top-right of the correlation figure, the narrow red ellipse between Murder and Life.Exp shows a strong negative correlation, and the narrow blue ellipse between Murder and Illiteracy shows a strong positive correlation. The red-orange ellipses between Murder and both Frost and HS.Grad show moderate negative correlations, the orange ellipse between Murder and Income shows a weak negative correlation, and the light-blue ellipses between Murder and both Area and Population show weak positive correlations.
Expand All @@ -167,9 +183,11 @@ The pearson and spearman correlation matrix on the bottom-left gives us the r va
The positive correlation between Murder and Illiteracy (r = 0.70) means that states with lower education levels tend to have higher murder rates. The negative correlations between Murder and Life.Exp (r = -0.78) and between Murder and Frost (r = -0.54) indicate that states with more murders tend to have shorter life expectancy, and that colder states tend to have fewer murders: too cold to murder?!

```{exercise}
According to the corrplot, Figure \@ref(fig:state-corrplot), explain the correlation between Illiteracy and other variables.
Similar to Figure \@ref(fig:state-corrplot), plot a scatter matrix among 7 variables: *mpg*, *cyl*, *disp*, *hp*, *drat*, *wt* and *qsec* in the data set *mtcars*. Give a brief interpretation of the scatter matrix plot.
```

---

Now let's see how these variables cluster.

(ref:state-dendrogram) Cluster dendrogram for state numeric variables
Expand All @@ -180,6 +198,12 @@ plot(hclust(as.dist(1 - cor(as.matrix(st))))) # hierarchical clustering

The cluster dendrogram shows two clusters of variables. Murder is closest to Illiteracy, and then to Population and Area. Similarly, HS.Grad is closest to Income, and then to Life.Exp and Frost. Although Illiteracy and HS.Grad are in different clusters, they are strongly negatively correlated: the lower the illiteracy, the higher the high school graduation rate. The r value of -0.66 between Illiteracy and HS.Grad in the corrplot tells the same story.
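
The two clusters can also be read off programmatically instead of from the picture. The sketch below assumes the built-in `state.x77` matrix and the default complete linkage of `hclust`; the names `dat`, `hc` and `groups` are hypothetical.

```r
# Minimal sketch: cluster the variables by correlation distance and cut into 2 groups.
dat <- data.frame(state.x77)  # column names become Life.Exp, HS.Grad

d  <- as.dist(1 - cor(dat))   # highly correlated variables get a small distance
hc <- hclust(d)               # hierarchical clustering (complete linkage by default)

groups <- cutree(hc, k = 2)   # assign each variable to one of two clusters
split(names(groups), groups)  # list the variables in each cluster
```

Using `1 - cor` as the distance treats strongly negatively correlated variables as far apart, which is why Illiteracy and HS.Grad end up in different clusters even though they are closely related.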

```{exercise}
Similar to Figure \@ref(fig:state-dendrogram), plot a cluster dendrogram of the 7 variables: *mpg*, *cyl*, *disp*, *hp*, *drat*, *wt* and *qsec* in the data set *mtcars*. Give a brief interpretation of your output.
```

---

We can use a density plot to see the distribution of Illiteracy by region.

(ref:state-illiteracy) Illiteracy distribution by region
Expand All @@ -190,6 +214,12 @@ ggplot(sta, aes(x = Illiteracy, fill = Region)) + geom_density(alpha = 0.3)

We can see that the North Central region has a narrow density, with most states having Illiteracy below 1 percent of the population. The South has a wide distribution, with Illiteracy ranging from 0.5 to 3 and most southern states between 1.5 and 2.2. The West also has a spread-out distribution, but it is concentrated at the low end, which means many western states still have Illiteracy below 1% of the population. Most Northeast states have Illiteracy below 1.5% of the population.

```{exercise}
Similar to Figure \@ref(fig:state-illiteracy), use a density plot to see the distribution of *mpg* by *cyl* in the data set *mtcars*.
```

---

Because Murder is related to both Population and Area, we add one more column, Pop.Density, for the population per square mile of area, to see the correlation between Murder and this density.
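
One way to build such a column is sketched below. It assumes the built-in `state.x77` matrix, where Population is recorded in thousands and Area in square miles, so the ratio is in thousands of people per square mile; the chapter's own definition may scale it differently. The names `dat` and `r_density` are hypothetical.

```r
# Minimal sketch: add a population density column and correlate it with Murder.
dat <- data.frame(state.x77)

# Population is in thousands, Area in square miles (assumed units of state.x77).
dat$Pop.Density <- dat$Population / dat$Area

r_density <- cor(dat$Murder, dat$Pop.Density)  # Pearson correlation with Murder
r_density

# States with the highest density, for a quick sanity check.
head(dat[order(-dat$Pop.Density), c("Population", "Area", "Pop.Density")], 3)
```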

(ref:state-population) Box plot of population by region
Expand Down Expand Up @@ -257,6 +287,8 @@ Most southern states has lower HS.Grad high, low Life.Exp but higher murder freq
Use scatter plots to analyze the correlation between Illiteracy and the variables in the other cluster shown in Figure \@ref(fig:state-dendrogram). Interpret your plots.
```

---

## Peeking at the whole picture of the data set

(ref:state-heatmap) Heat map for whole state data set
Expand Down Expand Up @@ -333,6 +365,7 @@ $Murder = 105.9 - 1.445 * Life.Exp + 0.000259 * Population + 1.861 * Illiteracy
Perform a linear model analysis for Illiteracy and interpret your result. Hint: Check the corrplot in Figure \@ref(fig:state-corrplot) and pay attention to the high correlation between murder rate and life expectancy.
```

---

## Conclusion

Binary file modified _bookdown_files/book_files/figure-html/state-corrplot-1.png
Binary file modified _bookdown_files/book_files/figure-html/state-pop-1.png
Binary file modified _bookdown_files/book_files/figure-html/unnamed-chunk-10-1.png
Binary file modified _bookdown_files/book_files/figure-html/unnamed-chunk-8-1.png
Binary file modified docs/book_files/figure-html/state-corrplot-1.png
Binary file modified docs/book_files/figure-html/state-pop-1.png
Binary file modified docs/book_files/figure-html/unnamed-chunk-10-1.png
Binary file modified docs/book_files/figure-html/unnamed-chunk-8-1.png