linear_regression/linear_regession.Rmd

---
title: 'Gradient Descent with Linear Regression'
author: "Ethen Liu"
date: "October 30, 2015"
output:
  rmdformats::readthedown:
    highlight: pygments
---

<style type="text/css">
p { /* Normal  */
   font-size: 18px;
}
body { /* Normal  */
   font-size: 18px;
}
td {  /* Table  */
   font-size: 14px;
}
h1 { /* Header 1 */
 font-size: 32px;
}
h2 { /* Header 2 */
 font-size: 26px;
}
h3 { /* Header 3 */
 font-size: 22px;
}
code.r{ /* Code block */
  font-size: 14px;
}
pre { /* Code block */
  font-size: 14px
}
</style>

> All the code (in the "linear_regession_code" folder) and the data (housing.txt) used in this documentation can be found [here](https://github.com/ethen8181/machine-learning/blob/master/linear_regression).


For supervised learning problems like linear regression, the way it works is when given some set of numbers input variable, we wish to predict another set of numbers. For instance given the number of bedrooms and the size of the house, we wish to predict the price in which the house will be sold. So what we want to know, is how much do "variables" such as the number of bedrooms or the size of the house affects the house's price. One easy approach to calculate these "variables" is by gradient descent.


# Getting Started With Gradient Descent 

Let's start from a basic formula $1.2\times(x-2)^2 + 3.2$. If you still remember basic calculus, the first derivative of a function gives you the optimal solution to it (whether it be for max or min problems). In this case, it's by solving $2\times1.2\times(x-2)=0$.

```{r, message=FALSE, warning=FALSE}

library(grid)
library(dplyr)
library(scales)
library(ggplot2)
setwd("/Users/ethen/machine-learning/linear_regression")


# original formula 
Formula <- function(x) 1.2 * (x-2)^2 + 3.2

# visualize the function, and the optimal solution
ggplot( data.frame( x = c(0, 4) ), aes(x) ) + 
stat_function(fun = Formula) + 
geom_point( data = data.frame( x = 2, y = Formula(2) ), aes(x, y), 
	        color = "blue", size = 3 ) + 
ggtitle( expression( 1.2 * (x-2)^2 + 3.2 ) )

```

By solving the first derivative of the function or simply eyeballing the graph, we can easily tell that the minimum value to the formula is when x equals 2. Plugging the value back into the formula gives us the solution `r Formula(2)`. 

That was easy, however if the function starts to get really complicated then solving this won't be this simple. This is where "gradient descent" comes in, and the formula to this buzzword is listed below.

$$\text{Repeat until converge} \{ x:=x-\alpha\triangledown F(x) \}$$

- The notation := stands for overwriting the value on the left of := with values on the right of := .
- $\triangledown$ stands for the gradient of a function, which is the collection of all its partial derivatives into a vector ( taking the first derivative of the function with respect to all possible x ).
- $\alpha$ stands for the learning rate which is set manually.

Let's break that down piece by piece. Putting the formula in plain English: Imagine gradient descent as when you're at the top of a mountain and you want to get down to the very bottom, you have to choose two things. First the direction you wish to descend and second the size of the steps you wish to take. After choosing both of these things, you will keep on taking that step size and that direction until you reach the bottom. Now back to the formula. $\alpha$ corresponds to the size of the steps you wish to take and $\triangledown F(x)$ gives you the direction that you should take for your given formula. The following code, is a small example starting with one iteration of the process. Note that in order for the formula to start calculating you will have to assign an initial value for x.

```{r}

# first derivative of the formula above
Derivative <- function(x) 2 * 1.2 * (x-2) 

# define the alpha value (learning rate)
learning_rate <- .6

# define the initial value 
x <- .1

( iteration <- data.frame( x = x, value = Formula(x) ) )

```

Here, we defined the learning rate to be `r learning_rate` for our algorithm and the initial value of x equals `r x` leads to the value `r Formula(x)`, which is still far off from the optimal solution `r Formula(2)`. Let's use these user-defined initial value and learning rate on our gradient descent algorithm.

```{r}
#### One iteration :
# apply the formula of gradient descent
x <- x- learning_rate * Derivative(x)

# output
rbind( iteration, c( x, Formula(x) ) )

```

Row one of the output denotes the intial x and the formula's value when plugging in x, and the second row denotes the value after the first iteraion. We can see that the gradient descent algorithm starts tuning x with the goal of finding smaller value for formula. 

Hopefully, after that one iteration example, everything is a now bit more clear. Before we apply the whole algorithm, I still owe you the definition of "repeat until converge" from the algorithm. 

There usually two ways of applying the notion "repeat until converge" into code. One : Keep updating the x value until the difference between this iteration and the last one, is smaller than epsilon (we use epsilon to denote a user-defined small value). Two : The process of updating the x value surpass a user-define iteration. Often times, you can use both in your first trial, since you probably have no idea when will the algorithm converge, and this is what you'll be seeing in the following code.

Now we're ready to apply the whole thing. 

```{r}

#### Gradient Descent implementation :

## Define parameters :
# x_new : initial guess for the x value
# x_old : assign a random value to start the first iteration 
x_new <- .1 
x_old <- 0

# define the alpha value (learning rate)
learning_rate <- .6

# define the epilson value, maximum iteration allowed 
epsilon <- .05
step <- 1
iteration <- 10

# records the x and y value for visualization ; add the inital guess 
xtrace <- list() ; ytrace <- list()
xtrace[[1]] <- x_new ; ytrace[[1]] <- Formula(x_new)
cbind( xtrace, ytrace )
```

Elaboration on what i defined :

- `x_old` and `x_new` to calculate the difference of the x value between two iterations, the two numbers are different so that the loop can still work for the first iteration of the while loop, as you'll see later.
- `epsilon` If the difference between `x_old` and `x_new` is smaller than this value then the algorithm will halt.
- `iteration` The maximum iteration to train the algorithm. That is, if the difference of the x value on the `r iteration`th iteration and `r iteration` still larger than the `epsilon` value, the algorithm will still halt.
- `xtrace` and `ytrace` stores the x and its corresponding formula value for each iteration. It's good to store these values to get a sense of how fast the algorithm converges.

```{r}

while( abs(x_new - x_old) > epsilon & step <= iteration ) {
	# update iteration count 
	step <- step + 1	
	
	# gradient descent
	x_old <- x_new
	x_new <- x_old - learning_rate * Derivative(x_old)
	
	# record keeping 
	xtrace[[step]] <- x_new
	ytrace[[step]] <- Formula(x_new)	
}

# create the data points' dataframe
record <- data.frame( x = do.call(rbind, xtrace), y = do.call(rbind, ytrace) )
record

```

From the output above, we can see that the algorithm converges at iteration `r nrow(record)` before the user-specified maximum iteration, with the x and formula value that's close to the original optimal value. Let's create the visualization for the process.

```{r}

# create the segment between each points (gradient steps)
segment <- data.frame( x = double(), y = double(), xend = double(), yend = double() )
for( i in 1:( nrow(record)-1 ) ) {
	segment[i, ] <- cbind( record[i, ], record[i + 1, ] )
}

# visualize the gradient descent's value 
ggplot( data.frame( x = c(0, 4) ), aes(x) ) + 
stat_function(fun = Formula) + 
ggtitle( expression( 1.2 * (x-2)^2 + 3.2 ) ) + 
geom_point( data = record, aes(x, y), color = "red", size = 3, alpha = .8, shape = 2 ) +
geom_segment( data = segment , aes(x = x, y = y, xend = xend, yend = yend), 
              color = "blue", alpha = .8, arrow = arrow( length = unit(0.25, "cm") ) )

```

The visualization gives us a clearer picture that after assigning an inital value of x and parameters such as `epsilon`, `learning_rate`, `iteration`, the gradient descent algorithm will start manipulate the x value until it meets the converge criteria, and an interesting feature is that when the algorithms starts to get closer and closer to the optimal value, it will take smaller "steps " and wander nearby before converging.

Some takeways for this section :

1. The paramters' value you choose will most likely affect your result. Especially for `learning_rate`, for this parameter, if you chosen a value that is too big, the algorithm may skip the optimal value and if you chosen a value that is too small, then the algorithm may take too long to converge. You can try it out yourself ~ .
2. The formula $1.2\times(x-2)^2 + 3.2$ is the "cost function" for this problem, that is we plug in value x into this function to determine whether or not we can reached the optimum. We'll be using this name in the following tutorial, so make sure you have it in your mind.

## Applying it to Linear Regression

Now all of that was just preamble, to get ourselves familiar with gradient descent, it's now time to apply the same concept to linear regression. We'll use the a simple housing data to illustrate the idea.

```{r}

housing <- read.table("housing.txt", header = TRUE, sep = ",")
list( head(housing), dim(housing) )

```

This is a dataset that contains `r ncol(housing)` columns (or so called variables) including the number of bedrooms, the area (size) of the house, and `r nrow(housing)` rows. For this example, what linear regression will try to do is to train a model using this dataset and in the future after only obtaining the area and bedroom number we want to be able to predict the prices of other houses. So how can we use gradient descent to formulate this linear regression model? Recall that the algorithm's formula is :

$$\text{Repeat until converge} \{ x:=x-\alpha\triangledown F(x) \}$$

Now all we have to do is to define the appropriate cost function of the linear regression and plug it back in to formula above! Before we give it to you, we'll first denote some simple math notations.

- m : the number of training examples. Here we will use all `r nrow(housing)` rows.
- n : the number of "input" variables, in this case, it is 2, the number of bedrooms and the area (size) of the house.
- $x^{(i)}$ : the ith row of the "input" variable in the dataset. This does not denote x raised to the power of i!
- $y^{(i)}$ : the ith row "output" variables. In this case the output variable is the price of the house.
- Formula for linear regression :

$$ h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dotsm + \theta_{n}x_{n}  $$

Here the $\theta_{j}$s (j going from 0 to n) denotes the weights or so called coeffficients of the model. Here we only have 2 variables, so j only goes up to 2, and the $h_{\theta}(x)$ for this dataset would be $\theta_{0} + \theta_{1}x_{area} + \theta_{2}x_{bedrooms}$. And for conciseness, suppose we define $x_{0}$ to be a vector of all 1s (This is a trick that's used in the gradient descent function!!). In that case the formula can be rewritten as :

$$ h_{\theta}(x) = \sum_{j=0}^n \theta_{j}x_{j} $$

So for linear regression, these parameters $\theta_{j}$s from the formula above are what we want find out or train. So given a training dataset, how do we learn them? One reasonable method is to make the value produced by function F(x) to be as close to the original $y^{(i)}$ as possible (at least for the training dataset we now have). That is after plugging in the combinations of the number of bedrooms and the area size of the house into the function, we want the house price calculated by the function to be as close to the original value of the house price as possible (for every row of dataset). This gives us the cost function $F(\theta)$ below.

$$ F(\theta) = \frac{1}{2m} \sum_{i=1}^m ( h_{\theta}(x^{(i)}) - y^{(i)} )^2 $$

The $\frac{1}{2}$ is there to minimize the math loading for later when we take the first derivative of the function. Again, the meaning for the formula above means after plugging in our input variables $x^{(i)}$ (recall that i denotes the ith row in the dataset) into the function and obtaining the value, which is the $h_{\theta}(x^{(i)})$ part. We will calculate its distance with the original $y^{(i)}$. Therefore the process of the gradient descent is to start some value for $\theta$ and keep updating it to reduce $F_{\theta}$, with the goal of minimizing the summed up differences for all rows. This summed of difference is often referred to as the sum squared error. 

Next, we'll state without proving that after taking the first derivative of the function $F_{\theta}$ and putting it back inside the gradient descent algorithm we'll obtain the formula below:

$$ \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^m ( h_{\theta}(x^{(i)}) - y^{(i)} ) x_{j}^{(i)} $$

Please refer to this [link](http://math.stackexchange.com/questions/70728/partial-derivative-in-gradient-descent-for-two-variables) if you're interested in the proof.

Now all that formula may look a might formidable. The notion is, after you have calculated difference (error) between $h_{\theta}(x^{(i)}) - y^{(i)}$ for each row i you multiply that difference to all of the $x_{j}$s for each row and sum them up. A small example using two rows :

```{r}

# suppose you've already calculated that the difference of 
# the two rows are 100 and 200 respectively, then 
# using the first two rows of the input variables
housing[ c(1, 2), -3 ]

# multiply 100 with row 1 
( row1 <- 100 * housing[1, -3] )

# multuply 200 with row 2
( row2 <- 200 * housing[1, -3] )

# sum each row up
list( area = sum( row1[1] + row2[1] ), bedrooms = sum( row1[2] + row2[2] ) )

```

These value will be the results you obtain for the $\sum_{i=1}^m ( h_{\theta}(x^{(i)}) - y^{(i)} ) x_{j}^{(i)}$ part with m equals 2. Hopefully that part is now clear !

Another important concept before getting to the code for gradient descent with linear regression. The tricky thing about this approach is that often times it requires normalization or so called feature scaling, or else the algorithm will most likely leads to really bad results. Take our example housing dataset for example. 

```{r}

head(housing)

```

You'll notice that houses' area (sizes) are approximately 1000 times larger than bedroom counts. Therefore when input variables differ by orders of magnitudes, performing some kind of normalization is often times required. Here we'll apply the widely used z-score normalization, where given a input variable (column) you subtract every row element with its mean and then divide it by its standard deviation (see code below). 

```{r}

# z-score normalization
# this is equivalent to the "scale" function built in with R
Normalize <- function(x) ( x - mean(x) ) / sd(x)

```

Just want to mention, after the z-score normalization, 0 means that the original number is the average for that column, and all other normalized-number means how many standard deviations is the number away from the mean.

- **Note** : Keep in mind that when performing the normalization you should also store the values that was used the scale each of the feature. Because after we're done learning the parameters of the model, next time when we're given a new house information containing only the bedroom number and house size we will like to predict its house price. And before directly plugging in the numbers to the model, we will also have to normalize them.

Now the [`GradientDescent`][GradientDescent] function for linear regression. Parameters for the function :

- `data` The whole data frame type data.   
- `target` Takes a character stating column name that serves as the output variable.  
- `learning_rate` Learning rate for the gradient descent algorithm. 
- `iteration` Halting criterion : maximum iteration allowed for training the gradient descent algorithm.
- `epsilon` Halting criterion : If the trained parameter's difference between the two iteration is smaller than this value then the algorithm will halt. Default to 0.001.
- `normalize` Boolean value indicating whether to performing z-score normalization for the input variables. Default to true.
- `method` Specify either "batch" or "stochastic" for the gradient descent method. Use batch for now, this will be explained later!! 
- The function returns a list consisting of : 
    - Data frame that records the theta value (model's parameter) for each iteration.
    - Data frame that records the cost value for each iteration.
    - Data frame that stores the noramalized mean and standard deviation for each input column.

```{r, message=FALSE}

# source in the function to concise the documentation
source("linear_regession_code/gradient_descent.R")

# parameters :
# learning rate of 0.05, 500 iteration, method = "batch"
trace_b <- GradientDescent( data = housing, target = "price",  
	                        learning_rate = 0.05, iteration = 500, method = "batch" )
# print out the final parameters
parameters_b <- trace_b$theta[ nrow(trace_b$theta), ]
parameters_b

```

We've now obtain the parameters of the linear model calculated by the gradient descent method. Let's use it and compare with R's built-in `lm` function, which also calculates the linear regression's parameters.

```{r}

# linear regression 
normed <- apply(housing[, -3], 2, scale)
normed_data <- data.frame( cbind(normed, price = housing$price) )
model <- lm(price ~ ., data = normed_data)
model

```

From the result above, after setting the learning rate to be 0.05 and training our gradient descent algorithm for 500 iterations, the parameters are really close to the results given by the linear regression function in R. So in this case the linear regression model obtained by our gradient descent algorithm is : 

$$ h_{\theta}(x) = `r comma( round(parameters_b[1]) )` + `r comma( round(parameters_b[2]) )`x_{area} + `r round(parameters_b[3])`x_{bedroom} $$

Now that we've gotten ourselves a bit famliar with gradient descent. I still own you an explanation about the `method` parameter from the gradient descent function call above. Let's recall the math formula for this approach : 
$$ \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^m ( h_{\theta}(x^{(i)}) - y^{(i)} ) x_{j}^{(i)} $$

First the more formal name to this approach is **batch gradient descent**. From the formula, the part $\sum_{i=1}^m$ tells you that this method looks at every example in the entire training set on every step of the update. Normally, however, the size of our training data is usually very large. For the example dataset we used in this document, the size of m was only `r nrow(housing)`, hence this approach will work out perfectly fine. But if we were to use this approach on a large training dataset, updating the parameters for each iteration will take us linear time, \mathcal{O}(m), because this method has to scan through the entire training set before taking a single step. An alternative approach to this is the **stochastic gradient descent**, math formula as follow :

$$ \text{for } i \text{ in } 1, 2, \dotsm, m :  $$
$$ \theta_{j} := \theta_{j} - \alpha ( h_{\theta}(x^{(i)}) - y^{(i)} ) x_{j}^{(i)} $$

The notion for stochastic gradient descent is, for each iteration, we will update the parameters according to the error of a single training example only, I repeat a single. Then for the next iteration, it will use the next training example, and so on. As for the for loop, the order of which training example of use can also be selected randomly. The trade off for this method is that, since it's computing the value to "descent" using only a single training example, the final parameters it obtain will oscillating around the optimal value, but never actually reaches it; though in practice most of the values near the minimum will be reasonably good approximations to the true minimum.

In sum, compared to batch gradient descent, this method is more widely used for LARGE training datasets, as it converges faster.


## Conclusions

Some takeways about gradient descent : 

1. We can visualize the costs obtained from each iteration.

```{r}

costs_df <- data.frame( iteration = 1:nrow(trace_b$cost), 
	                    costs = trace_b$cost / 1e+8 )

ggplot( costs_df, aes(iteration, costs) ) + geom_line()

```

As you can see from the plot, at the beginning when we randomly assign the theta value to our model, the magnitudes of the update is huge. But as it approaches the optimum, then the algorithm will find little need to update the parameters, and therefore, the value of the cost does not change much. This matches the behavior it had with our simple equation example in the Getting Started with Gradient Descent section.

2. Note that gradient descent does not equal to linear regression, there're other approaches to solve or obtain the model's parameters, however, this algorithm, with a little tuning, is still widely used in many other places, such as **artifical neural network**.

3. If your input variables or features comes in different magnitudes, then normalizing them is an important preprocessing step before applying gradient descent. Or else the results will most likely be heavily affected by the input variables with the larger magnitudes, and in that case the algorithm might either returns a really bad result or worse, will not even work.

4. Tuning parameters including the learning rate, epsilon and iterations will surely affect the result and the speed to converge, and this requires some trial and error. 


# Appendix on Linear Regression 

Notes on the output of the summary on linear regression.

```{r}

summary(model)

```

## Residuals

Linear regression is a model that assumes that the data is normally distributed and it's coefficient is estimated to minimize the sum of squares of the residuals. The term **residuals** is basically referring to the difference between the actual value of the output and the model's prediction. 

```{r}

summary(model$residuals)

# original price - predicted price by the model (stored in the fitted value)
summary(normed_data$price - model$fitted.values)

```

A ideal linear model's residual should have a mean of 0, median near 0 and the 1st and 3rd quantile should be symmetric with neither of them too large or too few when compared with the other.


## t-value

Before training the model each input variable has a coefficient of 0 with respect to the output ( indicating no linear relationship ) . The term **t-value** measures how many standard deviation is our estimated coefficient away from the original value of 0. The calculation is basically to divide our coefficient estimate by our standard error. It's actually ( coefficient - 0 ) / standard error, since the baseline assumption is 0 but the minus 0 is dropped since it doesn't affect the outcome. 

```{r, message=FALSE, warning=FALSE}
library(broom)
( coefficient <- broom::tidy(model) )

# t-value
coefficient$estimate / coefficient$std.error

```

A high t-value indicates that we should keep the corresponding feature in our final linear model, since it has a higher chance of being a nonzero coefficient.

## p-value

Probably one of the most important indicators. In order to compute this value, the first concept that we need to have is the degrees of freedom. You can think of it as the number of training data you have left considering the number of coefficients you wish to solve. Which happens to be the number of observations in our training data minus the model coefficient's total count. 

Back to the **p-value**. This is a value that turns the t-value into probabilities that tells us: How likely should our coefficient be zero instead of the current value that we've estimated? Or in a sense what is the probability of our t-value being larger or smaller than the ones that our model estimated?  

The probability is calculated based on the student's t distribution, which requires us the specify the degrees of freedom. We can calculate it by converting the t-value into absolute values first, then calculate the upper tail of the distribution to obtain the probabality of being larger and multiply it by 2 to include the probability of being smaller.  

```{r}
# degrees of freedom
( df <- nrow(normed_data) - nrow(coefficient) )

# p-value
pt( abs(coefficient$statistic), df = df, lower.tail = FALSE ) * 2

```

The general rule of thumb is that we should include coefficients that have p-value lesser than **0.05**. Other really important things to mention about this p-value is that:

- We should never compare p-values against each other to gauge which input variable is more important than the others.
- A input variable with a high p-value does not necessarily indicate that it has no linear relationship with the output. It only claims after including all the other input variables to our model, this feature does not provide any new information to the output.


## R Squared

When evaluating the fit of the linear model, another indicator that's worth your attention is the $R^2$ value. In short, we want the model we've trained to capture the output's variance as much as possible. The $R^2$ measures the proportion of the output variance that is explained by the model. It can be calculated by:

$$R^2 = 1 - \frac{RSS}{TSS} = 1- \frac{ \sum^n_{i=1} e_i^2 }{ \sum^n_{i=1} ( y_i - \bar{y} )^2 }$$

Where:

- `RSS` : Stands for Residual Sum of Squares ( the measurement that linear model tries of minimize ). This value captures the variance that is left between our prediction and the actual values of the output.
- `TSS` : Stands for Total Sum of Squares, which measures the total variance in the output variable. 
- $e_i$ : Residuals that we've mentioned above. 
- n : Total number of your data point.
- $y_i$ : Your output variable.
- $\bar{y}$ : The average of your original output variable ( pronounced as y bar. ).

```{r}

# @y  : original output value
# @py : predicted ouput value
RSquared <- function(y , py) {
	rss <- sum( (y - py)^2 )
	tss <- sum( ( y - mean(y) )^2 )
	return(1 - rss / tss)
}

summary(model)$r.squared
RSquared(normed_data$price, model$fitted.values)

```

This value ranges between 0 and 1. A value close to 1 indicates that the linear model is a good fit as it explains most of the output variable's variance. A low value suggests that you should include other input variables and try to improve the model's fit.

## Adjusted R Square

The adjusted $R^2$ is measuring the same thing as the $R^2$, but adjusted for the complexity of the model, i.e. the count of the model's coefficient. The general idea is, if we add another input variable to our current model, the $R^2$ of the new model will still increase even if the new added input variable has no statistical power at all. Hence, adjusted $R^2$ tries to account for this sense of overfitting. The math formula to it:

$$ R^2_{adjusted} = 1 - ( 1 - R^2 ) \times \frac{ n - 1 }{ n - k - 1 } $$

Where:

- n : Total number of your data point.
- k : Number of coefficients, remember to exclude the intercept.

```{r}

k  <- nrow(coefficient) - 1

# @y  : original output value
# @py : predicted ouput value
# @k  : number of the model's coefficient, excluding the intercept
AdjustedRSquared <- function(y, py, k) {	 
	n  <- length(y)
	r2 <- RSquared(y, py)	
	return( 1 - (1 - r2) * (n - 1) / (n - k - 1) )
}

summary(model)$adj.r.squared
AdjustedRSquared( normed_data$price, model$fitted.values, k )

```


## Visualizations

Useful visualizations when working with linear regression.

[`LMPlot`][LMPlot] : Four visualizations that works with linear regression. Input parameters : 

- `model` Linear regression model object.
- `actual` Data's actual (original) output value.
- The function returns a list consisting of :
    - plot : four side by side visualizations. Wrap grid.draw over it to plot it.
    - outlier : If there're outliers detected using the cooks distance measure, return the possbile outlier's index. If not return NULL.

```{r, fig.height=6, fig.width=10, message=FALSE, warning=FALSE}

source("linear_regession_code/LMPlot.R")
lm_plot <- LMPlot(model = model, actual = normed_data$price)
grid.draw(lm_plot$plot)

```

Interpretation of the plot :

- **Upper Left:** Visualizes value predicted by your model versus the actual value of your original data. If the model is considered to a good estimate of the outcome,  there should be strong correlation between the model’s predictions and its actual results, the line should be a close fit to y = x, going from bottom left to upper right.
- **Upper Right:** Visualizes the cooks distance and the predicted value. Cooks distance is a measurement that goes with linear models. The rule of thumb is, if a data point's cook distance larger than 1 or larger than the 4 / number of data, then that data point is considered to be a possible outlier.
- **Bottom Left:** Visualizes the residuals and the linear model’s predicted value. Ideally, the residuals should be symmetrically distributed around the lower digits of the y-axis, with no clear patterns what so ever. Meaning it should not be larger or smaller in a specific range. If there’re patterns detected, it could indicate the current input features does not have a linear relationship with the output. Or there’re still variance in the output that we did not capture and new input variables should be added.
- **Bottom Right:** Visualizes a QQ plot of the residuals. This is way of checking whether there's patterns for your residuals. The plot will be very close to the y = x straight line if the residuals is a close approximation to a normal distribution, which is the ideal scenario. 

As you can see this function will also color the outliers for you if it is detected by the cooks distance measurement ( If no outlier is detected then the plot will be single colored ). And the will return the indices of the possible outliers.

```{r}

lm_plot$outlier

```

This suggests that data point `r lm_plot$outlier` from our data are possible outliers and should be taken care of. We can simply remove it from our data when training the model if there's only a few of them, but if there's a significant amount of outliers then we should sit down and take a closer look as to why.

**Variance Inflation Factor:**

Feature selection is an extremely important task for every single statistical (machine learning) models. 

One common notions is that we should remove variables that have linear relationships with other variables as after obtaining one of them, you can use the relationship to trace back the result of the other. With that being said, collinearity is a property that denotes when two features are approximately in a linear relationship. A commonly seen illustration of collinearity is when seeing coefficients with an unexpected sign in your model. e.g. you obtain a negative coefficient on educational background for a linear model that predicts income, which does not match our intuition.

One simple approach to identify collinearity among input variables is the use of **variance inflation factors (VIF)**. The score for each input variable is obtained by fitting a linear regression model in which we treat each of the features as the output feature and all the remaining features as regular input features. Then compute VIF for our chosen feature using the formula: 

$$ VIF = \frac{1}{1 – R^2} $$

```{r, message=FALSE,warning=FALSE}
library(car)
car::vif(model)

# example of calculating the vif score for area
area_model <- lm(area ~ .-price, data = normed_data)
area_r2 <- RSquared(y = normed_data$area, py = area_model$fitted.values)
1 / (1 - area_r2)

```

The interpretation is quite straightforward, the higher the value, the higher the collinearity. The rule of thumb for this value is those with a score of 4 and above is a suspect; those with the scores larger than 10 indicates a strong likelihood of collinearity and that feature should be discarded.


# R Session Information  

```{r}
sessionInfo()
```

[GradientDescent]: https://github.com/ethen8181/machine-learning/blob/master/linear_regression/linear_regession_code/gradient_descent.R
[LMPlot]: https://github.com/ethen8181/machine-learning/blob/master/linear_regression/linear_regession_code/LMPlot.R

# References

- [Gradient Descent Example](http://www.r-bloggers.com/gradient-descent-in-r/)
- [Linear Regression with Gradient Descent](http://cs229.stanford.edu/notes/cs229-notes1.pdf)
- [Interpretation of linear regression's summary](http://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output)