small updates fix typos
anasrana committed Oct 6, 2019
1 parent a6c77a5 commit 19487ca
Showing 3 changed files with 49 additions and 52 deletions.
64 changes: 32 additions & 32 deletions 01-linear-regression.Rmd
@@ -1,8 +1,8 @@
# Practical: Linear Regression

In this practical you will go through some of the basics of linear modelling in `R`, as well as how to simulate data. The practical contains the following elements:

- simulate a linear regression model (a sketch of the parameter setup follows this list)
- investigate parameters
- characterise prediction accuracy
- correlation of real world data
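The hunk that defines the simulation parameters is collapsed in this diff. A minimal sketch of that setup, using the variable names from the later code; `sigma2 = 0.5` matches the variance quoted further down, while the values of `N`, `b0`, and `b1` are illustrative assumptions:

```{r }
N <- 100      # number of observations (illustrative value)
b0 <- 1       # true intercept (illustrative value)
b1 <- 1.5     # true slope (illustrative value)
sigma2 <- 0.5 # noise variance, the value quoted later in the text
# simulate the covariate, as in the simulation loop further down
x <- rnorm(N, mean = 0, sd = 1)
```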
@@ -52,7 +52,7 @@ Next we simulate the error term:

```{r }
# simulate the noise terms, rnorm requires the standard deviation
e <- rnorm(N, mean = 0, sd = sqrt(sigma2))
```

Finally we have all the parameters and variables to simulate the response variable $y$:
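The line that generates the response is collapsed in this hunk; a minimal sketch, combining the signal and the noise term defined above (the same expression appears verbatim in the simulation loop later in the diff):

```{r }
# response = deterministic part + Gaussian noise
y <- b0 + b1 * x + e
```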
@@ -69,12 +69,12 @@ We will plot our data using `ggplot2` so the data need to be in a `data.frame` object

```{r }
sim_data <- data.frame(x = x, y = y)
# create a new scatter plot using ggplot2
ggplot(sim_data, aes(x = x, y = y)) +
  geom_point()
```

We define `y_true` to be the true linear relationship between the covariate and the response, without the noise.

```{r }
# Compute true y values
```

@@ -97,7 +97,7 @@ print(lr_plot)
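The rest of this hunk is collapsed. A sketch of the collapsed steps, assuming `ggplot2` is loaded and reusing `b0`, `b1`, and `sim_data` from above; the red-line plotting code mirrors the fitted-line plot that appears later in the diff:

```{r }
# the true relationship, without noise
y_true <- b0 + b1 * x
sim_data$y_true <- y_true
# scatter plot of the data with the true line in red
lr_plot <- ggplot(sim_data, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(x = x, y = y_true), colour = "red", size = 1.3)
print(lr_plot)
```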

## Fitting simple linear regression model

### Least squares estimation

Now that you have simulated data you can use it to regress $y$ on $x$. Since the data are simulated, we know the true parameters and can compare them with the estimates. In `R` we can use the function `lm()` for this; by default it implements a least squares estimate:
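The `lm()` call itself is collapsed in the diff below; a minimal sketch, reusing the `sim_data` data frame from above (the names `ls_fit` and `ls_coef` match those used in the next hunk):

```{r }
# least squares fit of y on x
ls_fit <- lm(y ~ x, data = sim_data)
summary(ls_fit)
# extract the estimated coefficients
ls_coef <- coef(ls_fit)
b0_hat <- ls_coef[1]
```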

@@ -125,7 +125,7 @@ b1_hat <- ls_coef[2] # alternative ls_fit$coefficients[2]
y_hat <- b0_hat + b1_hat * x
sim_data$y_hat <- y_hat # add to the existing data frame
# Create scatter plot and lines for the original and fitted
lr_plot <- ggplot(sim_data, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(x = x, y = y_true), colour = "red", size = 1.3) +
@@ -136,7 +136,7 @@ lr_plot <- ggplot(sim_data, aes(x = x, y = y)) +
print(lr_plot)
```

The estimated parameters and the plot show a good correspondence between the fitted regression line and the true relationship between $y$ and $x$. We can check this by plotting the residuals, which are stored in the `residuals` component of the `ls_fit` object.

```{r resid-scatter}
# Residuals
```

@@ -154,7 +154,7 @@ hist(ls_residual)
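The lines that extract the residuals are collapsed in the hunk above; a minimal sketch, assuming the `ls_fit` object from the least squares fit (plotting against the covariate is one natural choice):

```{r }
# residuals of the least squares fit; equivalently ls_fit$residuals
ls_residual <- residuals(ls_fit)
# residuals against the covariate
plot(x, ls_residual)
```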

We expect the mean and variance of the residuals to be close to the values used to generate the data.

```{r }
print(mean(ls_residual))
print(var(ls_residual))
```

@@ -164,7 +164,7 @@

This is as expected, since subtracting a good fit from the data leaves $\epsilon$, which has mean $0$ and variance $0.5$.

### Maximum likelihood estimation

Next you will look at maximum likelihood estimation based on the same data you simulated earlier. This is a bit more involved as it requires you to explicitly write the function you wish to minimise. The function we use is part of the `bbmle` package.
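The definition of `mle_ll` is collapsed in the diff below. A sketch of a negative log-likelihood with a matching signature; the parameter names `beta0`, `beta1`, and `sigma` are taken from the `mle2()` call that follows, while the body is an assumed, standard Gaussian form:

```{r }
library(bbmle)

mle_ll <- function(beta0, beta1, sigma) {
  # residuals implied by the proposed parameters
  resid <- y - (beta0 + beta1 * x)
  # negative log-likelihood of iid Gaussian noise
  -sum(dnorm(resid, mean = 0, sd = sigma, log = TRUE))
}
```

`mle2()` then minimises this function numerically, starting from the supplied values.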

@@ -196,17 +196,17 @@

```{r }
mle_fit <- mle2(mle_ll, start = list(beta0 = -1, beta1 = 20, sigma = 10))
summary(mle_fit)
```

The parameters estimated using maximum likelihood are also very good estimates of the true values.

## Effect of variance

Now investigate the predictions further by simulating more data sets and seeing how the variance affects the quality of the fit, as indicated by the mean-squared error (mse).

To start you will define some parameters for the simulations: the number of simulations to run for each variance, and the variance values to try.

```{r }
# number of simulations for each noise level
n_simulations <- 100
# A vector of noise levels to try
sigma_v <- c(0.1, 0.4, 1.0, 2.0, 4.0, 6.0, 8.0)
```

@@ -216,33 +216,33 @@

```{r }
n_sigma <- length(sigma_v)
mse_matrix <- matrix(0, nrow = n_simulations, ncol = n_sigma)
# name row and column
rownames(mse_matrix) <- c(1:n_simulations)
colnames(mse_matrix) <- sigma_v
```

Next we will write a nested `for` loop: the outer loop will be over the variances and the inner loop over the number of repeats. We will simulate the data and perform a fit with `lm()`. We can use the `fitted()` function on the resulting object to extract the fitted values $\hat{y}$ and use these to compute the mean-squared error against the observed values $y$.

```{r }
# loop over variance
for (i in 1:n_sigma) {
  sigma2 <- sigma_v[i]
  # for each simulation
  for (it in 1:n_simulations) {
    # Simulate the data
    x <- rnorm(N, mean = 0, sd = 1)
    e <- rnorm(N, mean = 0, sd = sqrt(sigma2))
    y <- b0 + b1 * x + e
    # set up a data frame and run lm()
    sim_data <- data.frame(x = x, y = y)
    lm_fit <- lm(y ~ x, data = sim_data)
    # compute the mean squared error between the fit and the actual y's
    y_hat <- fitted(lm_fit)
    mse_matrix[it, i] <- mean((y_hat - y)^2)
  }
}
```
@@ -256,29 +256,29 @@

```{r }
library(reshape2)
# convert the matrix into a data frame for ggplot2
mse_df <- melt(mse_matrix)
# rename the columns
names(mse_df) <- c("Simulation", "variance", "MSE")
# now use a boxplot to look at the relationship between
# mean-squared prediction error and variance
mse_plt <- ggplot(mse_df, aes(x = variance, y = MSE)) +
  geom_boxplot(aes(group = variance))
print(mse_plt)
```

You can see that both the value and the spread of the mse go up with increasing variance in the simulation.

What changes do you need to make to the above function to plot the accuracy of the estimated regression coefficients as a function of variance?
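One possible direction (a sketch, not the only answer): record the coefficient estimates instead of the mse inside the inner loop.

```{r }
# a matrix to hold the slope estimate for each simulated data set
b1_matrix <- matrix(0, nrow = n_simulations, ncol = n_sigma)
# inside the inner loop, replace the mse line with:
# b1_matrix[it, i] <- coef(lm_fit)[2]
```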

## Exercise

### Part I

Read in the data in `storks.txt`, compute the correlation, and comment on it.

The data represent the number of storks (column 1) in Oldenburg, Germany, from $1930$ to $1939$, and the number of people (column 2).
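A minimal sketch of one way to start, assuming `storks.txt` is a whitespace-separated file with the two columns described above:

```{r }
# read the data; adjust the header/sep arguments to match the file
storks <- read.table("storks.txt")
# correlation between the number of storks and the number of people
cor(storks[, 1], storks[, 2])
```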

### Part II

Fit a simple linear model to the two data sets supplied (`lr_data1.Rdata` and `lr_data2.Rdata`). In both files the $(x,y)$ data is saved in two vectors, $x$ and $y$.
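A sketch for the first data set (the same steps apply to `lr_data2.Rdata`); `load()` restores the saved vectors `x` and `y` into the workspace:

```{r }
load("lr_data1.Rdata") # provides the vectors x and y
plot(x, y)
```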

@@ -296,7 +296,7 @@ plot(x, y)
Fit the linear model and comment on the differences between the two data sets.


### Part III

Investigate how the sample size affects the quality of the fit using the mse; use the code for investigating the effect of variance as inspiration.
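A sketch of how the earlier simulation loop could be adapted, holding the noise level fixed and varying the sample size instead; `b0`, `b1`, and `n_simulations` are assumed to be defined as before:

```{r }
# sample sizes to try, with the noise level held fixed
N_v <- c(10, 20, 50, 100, 500)
sigma2 <- 0.5
mse_N <- matrix(0, nrow = n_simulations, ncol = length(N_v))
for (i in seq_along(N_v)) {
  N <- N_v[i]
  for (it in 1:n_simulations) {
    x <- rnorm(N, mean = 0, sd = 1)
    e <- rnorm(N, mean = 0, sd = sqrt(sigma2))
    y <- b0 + b1 * x + e
    lm_fit <- lm(y ~ x, data = data.frame(x = x, y = y))
    mse_N[it, i] <- mean((fitted(lm_fit) - y)^2)
  }
}
```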

2 changes: 1 addition & 1 deletion README.md
@@ -2,4 +2,4 @@

This repo contains practicals for Module 1 of the [MSc Bioinformatics 2018/19](https://www.birmingham.ac.uk/postgraduate/courses/taught/med/bioinformatics.aspx) at the University of Birmingham.

To access it you can go to the link [anasrana.github.io/module1-practical_Bham/](https://anasrana.github.io/module1-practical_Bham). The data for the module is kept in the data folder ([here](https://github.com/anasrana/module1-practical_Bham/tree/master/data)).
35 changes: 16 additions & 19 deletions index.Rmd
@@ -1,38 +1,35 @@
---
title: "Essentials of Mathematics and Statistics"
author: "Anas A Rana"
date: "`r Sys.Date()`"
bibliography:
- book.bib
- packages.bib
description: This contains practicals for the second week of module 1 (_Essentials of Mathematics and Statistics_).
documentclass: book
link-citations: yes
site: bookdown::gitbook
subtitle: 'Practical: Module 1'
biblio-style: apalike
url: 'https\://anasrana.github.io/module1-practical_Bham/'
github-repo: "anasrana/module1-practical_Bham"
---

# Introduction

This is part of the practical for [Module 1 - Essentials of Mathematics and Statistics](https://canvas.bham.ac.uk/courses/39247/pages/essentials-of-biology-mathematics-and-statistics), which is part of the [MSc Bioinformatics 2018/19](https://www.birmingham.ac.uk/postgraduate/courses/taught/med/bioinformatics.aspx) at the University of Birmingham.

This website hosts the practicals for week two of **Module 1**, which covers:

- Linear regression
- Principal Component Analysis (PCA)
- Multivariate Regression
- Generalised Linear Models

## Prerequisites

Ensure you have attended the Module 1 lectures and have installed `R` and/or `Rstudio` (`Rstudio` is recommended). Any packages required are mentioned at the beginning of each practical.
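If anything is missing, the packages used in these practicals can be installed in one go (a sketch; the list matches the packages loaded in the linear regression practical):

```{r eval=FALSE}
install.packages(c("ggplot2", "bbmle", "reshape2"))
```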

## Data sets

For each practical there are some datasets required; you will find links to those in the data folder [here](https://github.com/anasrana/module1-practical_Bham/tree/master/data).
