4/8/2019

gexijin · Apr 8, 2019 · bb3baff
1 parent b8de711
commit bb3baff
Showing 11 changed files with 112 additions and 52 deletions.
diff --git a/08-Exploring-state-data-set.Rmd b/08-Exploring-state-data-set.Rmd
@@ -41,11 +41,14 @@ str(sta)
 summary(sta)
 ```
 
-## Exploring state data
+## Visualizing data
+
+Let's start by visualizing the distributions of the numeric variables. In many cases, we want to know whether our data follow a normal distribution. Here are some ways to check the normality of a variable.
+
+- Histogram: Does the histogram approximate a normal density curve? If yes, then the variable more likely follows a normal distribution.
 
-Firstly, let's see whether the numeric variables are normally distributed or not.There are multiple ways to check the normality of numeric data: 
-- histogram: Does the histogram approaches to a normal density curve? If yes, then the variable more likely follows a normal distribution.
 - Q\-Q plot: Do the sample quantiles fall almost on a straight line? If yes, then the variable more likely follows a normal distribution.
+
 - Shapiro-Wilk test: This is a widely used normality test. The null hypothesis is that the variable follows a normal distribution. A small p-value indicates non-normality of the variable.
 
 ```{r message=FALSE, fig.width=8, fig.height=8}
@@ -61,7 +64,7 @@ for (i in 1:length(a)){ # Use a for loop to test normality of all variables list
                       # frame as a numeric vector. Functions can be applied to a vector directly.  
   hist(sub, main = paste("Hist. of", a[i], sep = " "), xlab = a[i])
   qqnorm(sub, main = paste("Q-Q Plot of", a[i], sep = " ")) #Q-Q plot
-  qqline(sub)           # Add a qq plot line. 
+  qqline(sub)           # Add a QQ plot line. 
   if (i == 1) {s.t <- shapiro.test(sub) # Normality test for population
   } else {s.t <- rbind(s.t, shapiro.test(sub)) # Bind a new test result to previous results in row. 
  }
@@ -73,43 +76,44 @@ s.t
 
 From the histograms and Q-Q plots we can see that the distributions of Population, Illiteracy and Area are right-skewed, with most values concentrated on the left and a long tail to the right. Income and Life.Exp are distributed close to normal. The Shapiro-Wilk tests show that Income, Life.Exp and Frost are consistent with a normal distribution, with p-values greater than 0.05, while Murder and HS.Grad are borderline, with p-values very close to 0.05. There is no evidence that Population, Illiteracy and Area are normally distributed.
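+
+As a compact alternative to binding the full test objects, the p-values alone can be collected in a named vector. The following is only a sketch: it assumes the built-in state.x77 data rather than the sta data frame used above, and num and p.values are hypothetical names introduced for illustration.
+
+```{r}
+num <- data.frame(state.x77)      # Built-in data; data.frame() turns "Life Exp" into Life.Exp
+# Run the Shapiro-Wilk test on every column and keep only the p-value
+p.values <- sapply(num, function(x) shapiro.test(x)$p.value)
+round(p.values, 4)
+names(p.values)[p.values > 0.05]  # Variables for which normality is not rejected at the 5% level
+```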
 
-As for the categorical variable region, here is the region information including the count and percentage of states.
+In the state data, there is a categorical variable, *Region*, with 4 levels. What is the distribution of this categorical variable? Let's take a look at the number of observations (states) in each region and the corresponding percentage.
 
 (ref:state-region) State count in each region
 
 ```{r state-region, fig.cap='(ref:state-region)', fig.align='center'}
-counts <- sort(table(sta$Region), decreasing = TRUE)
-percentages <- 100 * counts / length(sta$Region)
-barplot(percentages, ylab = "Percentage", col = "lightblue")
-text(x=seq(0.7, 5, 1.2), 2, paste("n=", counts))
+counts <- sort(table(sta$Region), decreasing = TRUE)  #Count the number of states in each region
+percentages <- 100 * counts / length(sta$Region)      #Calculate the percentage of states in each region
+barplot(percentages, ylab = "Percentage", col = "lightblue") #Obtain bar chart based on percentages
+text(x=seq(0.7, 5, 1.2), 2, paste("n=", counts))      #Add count to each bar
 ```
 
 The bar plot tells us that the South has the most states (16) and the Northeast the fewest (9). North Central and West have similar numbers of states (12 and 13).
 
-If we want to know whether the populations in California and New York are more than the other states like what we have in now days, or the population of South Dakota comparing with other states, we use Lollipop plot to show the population of all states.
+If we want to know whether California and New York were already more populous than the other states, as they are nowadays, or how the population of South Dakota compares with that of other states, we can use a lollipop plot to show the population of all states. A lollipop plot is a hybrid of a scatter plot and a bar plot. It shows the relationship between two variables: one must be numerical, and the other can be numerical or categorical. A lollipop chart consists of a point (made with geom_point) and a segment (made with geom_segment). You can therefore modify these two components using the usual arguments: ‘size’, ‘color’, ‘linetype’, ‘alpha’, ‘shape’, etc. Here is a lollipop chart showing the relationship between state and population.
 
 (ref:state-pop) Lollipop plot of population in each state
 
 ```{r state-pop, fig.cap='(ref:state-pop)', fig.align='center'}
 library(ggplot2)
 ggplot(sta, aes(x = State, y = Population)) +
-  geom_point(size = 3) +
+  geom_point(size=3, color="red", fill=alpha("orange", 0.3), alpha=0.5, shape=20, stroke=2) +
   geom_segment(aes(x = State, xend = State, y = 0, yend = Population)) +
   labs(title = "Lollipop Chart for Population") +
-  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 65, vjust = 0.6))
+  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, vjust = 0.5))
 ```
 
 From the plot we can see that even in the 1970s, California and New York were the top two states in population. South Dakota had a small population.
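+
+The same ranking can be checked numerically. This is a minimal sketch assuming the built-in state.x77 matrix, whose Population values are in thousands; pop is a hypothetical name used only here.
+
+```{r}
+pop <- state.x77[, "Population"]       # Named numeric vector, one entry per state
+head(sort(pop, decreasing = TRUE), 2)  # The two most populous states
+pop["South Dakota"]                    # Population of South Dakota
+rank(pop)["South Dakota"]              # Rank 1 is the least populous state
+```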
 
-Other questions we may ask are: how about the murder rate distribution in early days? Is it the same for different states and different regions? What are the main effect factors to murder rate? Can we use model of other factors to explain their contribution to murder rate?
+Other questions we may ask are: what did the murder rate distribution look like in the 1970s? Is it the same across states and regions? What are the main factors affecting the murder rate? Can we build a model of other factors to explain their contribution to the murder rate? A choropleth map may give us an overall view.
 
 (ref:state-map) Map of murder rate distribution
 
 ```{r state-map, fig.cap='(ref:state-map)', fig.align='center'}
 library(maps)
-sta$region <- tolower(state.name)  # create new character vector with lowercase states names
-states <- map_data("state")  # extract state data
-map <- merge(states, sta, by = "region", all.x = T)  # merge states and state.x77 data
+library(ggplot2)
+sta$region <- tolower(state.name)  # Create a new character vector with lowercase state names
+states <- map_data("state")        # Extract state data
+map <- merge(states, sta, by = "region", all.x = TRUE)  # Merge states and state.x77 data
 map <- map[order(map$order), ]
 ggplot(map, aes(x = long, y = lat, group = group)) +  
   geom_polygon(aes(fill = Murder)) +   
@@ -137,27 +141,39 @@ ggplot(sta, aes(x = Murder, y = Region, fill = Region)) +
 The ridgeline plot tells us that murder rates are concentrated at lower values for the West, Northeast and North Central regions but at higher values for the South, which agrees with the map above: the South has a higher murder rate than the other regions.
 
 ```{exercise}
-Use lollipop plots to explore the distribution of Illiteracy in state.x77 data set and give brief interpretation. Hint: You can combine state.abb to state.x77 or use the row names of state.x77 data set directly. 
+Similar to Figure \@ref(fig:state-map), use lollipop plots to obtain the Illiteracy map in the state.x77 data set and give a brief interpretation. Hint: You can combine state.abb with state.x77 or use the row names of the state.x77 data set directly. You can start by importing the data:
+```
+
+```{r eval=F}
+tem <- data.frame(state.x77) 
+sta <- cbind(state.abb, tem, state.region)
+colnames(sta)[10] <- "Region" 
+......
 ```
 
+---
+
 ```{exercise}
-Use ridgeline plot to explore the regional distribution of Illiteracy for state.x77 and state.region data sets and interpret your figure.
+Similar to Figure \@ref(fig:state-murder), use a ridgeline plot to explore the regional distribution of Illiteracy in the state.x77 and state.region data sets and interpret your figure.
 ```
 
+---
+
 ## Analyzing the relationship among variables
 
+To visualize the linear relationships among variables in one plot, a scatter matrix is a good choice. A scatter matrix is a pair-wise scatter plot of multiple variables presented in a matrix format. The correlation coefficient measures the strength of the linear relationship between two variables, and its range is [-1, 1]. A coefficient of -1 implies that two variables have a perfect negative linear relationship, such as $y=-x$, and a coefficient of 1 implies a perfect positive linear relationship, such as $y=2x+1$. Here is an example of how to make a scatter matrix.
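+
+The two boundary cases mentioned above can be verified directly with cor(). This is an illustrative sketch; x is an arbitrary made-up vector, not part of the state data.
+
+```{r}
+x <- c(1, 3, 4, 7, 10)   # Arbitrary example values
+cor(x, -x)               # Perfect negative linear relationship, y = -x
+cor(x, 2 * x + 1)        # Perfect positive linear relationship, y = 2x + 1
+```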
+
 (ref:state-corrplot) Corrplot for numeric variables
 
 ```{r state-corrplot, message=FALSE, fig.width=6, fig.height=6, fig.cap='(ref:state-corrplot)', fig.align='center'}
 st <- sta[, 2:9] #take numeric variables as goal matrix
 library(ellipse) 
 library(corrplot)
-corMatrix <- cor(as.matrix(st)) # correlation matrix
-col <- colorRampPalette(c("#7F0000", "red", "#FF7F00", "yellow", "#7FFF7F",
-                           "cyan", "#007FFF", "blue", "#00007F"))
+corMatrix <- cor(as.matrix(st)) # Calculate correlation matrix
+col <- colorRampPalette(c("red", "yellow", "blue"))   # Color ramp: red, yellow and blue represent coefficients of -1, 0 and 1, respectively. More than 3 colors can be used to cover the range from -1 to 1.
 corrplot.mixed(corMatrix, order = "AOE", lower = "number", lower.col = "black", 
                number.cex = .8, upper = "ellipse",  upper.col = col(10), 
-               diag = "u", tl.pos = "lt", tl.col = "black")
+               diag = "u", tl.pos = "lt", tl.col = "black") #Type ?corrplot.mixed in the Console to get help in detail.
 ```
 
 In the top-right of the correlation figure, we can see a narrow red ellipse between Murder and Life.Exp, which shows a high negative correlation; a narrow blue ellipse between Murder and Illiteracy, which shows a high positive correlation; narrow red-orange ellipses between Murder and both Frost and HS.Grad, which show moderate negative correlations; an orange ellipse between Murder and Income, which shows a small negative correlation; and light-blue ellipses between Murder and both Area and Population, which show small positive correlations.
@@ -167,9 +183,11 @@ The pearson and spearman correlation matrix on the bottom-left gives us the r va
 The positive correlation between Murder and Illiteracy, with an r value of 0.70, means that states with lower education levels tend to have higher murder rates. The negative correlations between Murder and both Life.Exp and Frost, with r values of -0.78 and -0.54, indicate that states with more murders tend to have shorter life expectancy, and that states with colder weather tend to have lower murder rates: too cold to murder?!
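+
+The r values quoted above can be read off the correlation matrix directly. This is a minimal sketch, assuming the built-in state.x77 data converted with data.frame() so that the column names match those used in this chapter; num and r.murder are hypothetical names.
+
+```{r}
+num <- data.frame(state.x77)      # Columns become Life.Exp, HS.Grad, etc.
+# Pearson correlations of Murder with three of the other variables
+r.murder <- cor(num)["Murder", c("Illiteracy", "Life.Exp", "Frost")]
+round(r.murder, 2)
+```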
 
 ```{exercise}
-According to the corrplot, Figure \@ref(fig:state-corrplot), explain the correlation between Illiteracy and other variables.
+Similar to Figure \@ref(fig:state-corrplot), plot a scatter matrix among 7 variables: *mpg*, *cyl*, *disp*, *hp*, *drat*, *wt* and *qsec* in the data set *mtcars*. Give a brief interpretation of the scatter matrix plot.
 ```
 
+---
+
 Now let's see how these variables cluster.
 
 (ref:state-dendrogram) Cluster dendrogram for state numeric variables
@@ -180,6 +198,12 @@ plot(hclust(as.dist(1 - cor(as.matrix(st)))))  # hierarchical clustering
 
 The cluster dendrogram tells us that there are two clusters among these variables. Murder is closest to Illiteracy, and then to Population and Area. Similarly, HS.Grad is closest to Income, and then to Life.Exp and Frost. Though Illiteracy and HS.Grad are in different clusters, we know that within a state, illiteracy is highly correlated with the high school graduation rate: the lower the illiteracy, the higher the high school graduation rate. The r value of -0.66 between Illiteracy and HS.Grad in the corrplot tells the same story.
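+
+To list the members of the two clusters instead of reading them off the dendrogram, the tree can be cut with cutree(). This is a sketch assuming the built-in state.x77 data and the same 1 - correlation distance as above; num, hc and groups are hypothetical names.
+
+```{r}
+num <- data.frame(state.x77)
+hc <- hclust(as.dist(1 - cor(num)))  # Same distance, default (complete) linkage
+groups <- cutree(hc, k = 2)          # Cut the tree into two clusters
+split(names(groups), groups)         # Variable names in each cluster
+```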
 
+```{exercise}
+Similar to Figure \@ref(fig:state-dendrogram), plot a cluster dendrogram of the 7 variables: *mpg*, *cyl*, *disp*, *hp*, *drat*, *wt* and *qsec* in the data set *mtcars*. Give a brief interpretation of your output.
+```
+
+---
+
 We can use a density plot to see the distribution of Illiteracy by region.
 
 (ref:state-illiteracy) Illiteracy distribution by region
@@ -190,6 +214,12 @@ ggplot(sta, aes(x = Illiteracy, fill = Region)) + geom_density(alpha = 0.3)
 
 We can see that the North Central region has a narrow density distribution, with most Illiteracy values below 1 percent of the population. The South has a wide distribution, with illiteracy ranging from 0.5 to 3, and most southern states have illiteracy between 1.5 and 2.2. The West also has a spread-out distribution, but its mass is concentrated on the left (it is right-skewed), which means there are still many western states with illiteracy below 1% of the population. Most states in the Northeast have illiteracy below 1.5% of the population.
 
+```{exercise}
+Similar to Figure \@ref(fig:state-illiteracy), use a density plot to see the distribution of *mpg* by *cyl* in the data set *mtcars*.
+```
+
+---
+
 Because of the relationship of Murder with both Population and Area, we add one more column, Pop.Density, for the population per square mile of area, to see the correlation between Murder and this density.
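+
+One possible way to build such a column is sketched below. It assumes the built-in state.x77 data, with Population in thousands and Area in square miles; the scaling by 1000 is an assumption made for illustration and may differ from the definition used in the rest of this chapter.
+
+```{r}
+num <- data.frame(state.x77)
+num$Pop.Density <- num$Population * 1000 / num$Area  # People per square mile (assumed scaling)
+cor(num$Murder, num$Pop.Density)                      # Correlation between Murder and density
+```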
 
 (ref:state-population) Box plot of population by region
@@ -257,6 +287,8 @@ Most southern states has lower HS.Grad high, low Life.Exp but higher murder freq
 Use scatter plots to analyze the correlation between Illiteracy and the variables in the other cluster shown in Figure \@ref(fig:state-dendrogram). Interpret your plots.
 ```
 
+---
+
 ## Peeking at the whole picture of the data set
 
 (ref:state-heatmap) Heat map for whole state data set
@@ -333,6 +365,7 @@ $Murder = 105.9 - 1.445 * Life.Exp + 0.000259 * Population + 1.861 * Illiteracy
 Perform a linear model analysis for Illiteracy and interpret your result. Hint: Check the corrplot, Figure \@ref(fig:state-corrplot), and pay attention to the high correlation between murder rate and life expectancy.
 ```
 
+---
 
 ## Conclusion
 

diff --git a/_bookdown_files/book_files/figure-html/state-corrplot-1.png b/_bookdown_files/book_files/figure-html/state-corrplot-1.png
diff --git a/_bookdown_files/book_files/figure-html/state-pop-1.png b/_bookdown_files/book_files/figure-html/state-pop-1.png
diff --git a/_bookdown_files/book_files/figure-html/unnamed-chunk-10-1.png b/_bookdown_files/book_files/figure-html/unnamed-chunk-10-1.png
diff --git a/_bookdown_files/book_files/figure-html/unnamed-chunk-8-1.png b/_bookdown_files/book_files/figure-html/unnamed-chunk-8-1.png
diff --git a/docs/book_files/figure-html/state-corrplot-1.png b/docs/book_files/figure-html/state-corrplot-1.png
diff --git a/docs/book_files/figure-html/state-pop-1.png b/docs/book_files/figure-html/state-pop-1.png
diff --git a/docs/book_files/figure-html/unnamed-chunk-10-1.png b/docs/book_files/figure-html/unnamed-chunk-10-1.png
diff --git a/docs/book_files/figure-html/unnamed-chunk-8-1.png b/docs/book_files/figure-html/unnamed-chunk-8-1.png