Merge pull request #93 from ScPoEcon/tex
close issue #91. rename chapters. bump version.
floswald authored Oct 5, 2018
2 parents c15428f + 54f3b0c commit 9211da3
Showing 15 changed files with 58 additions and 48 deletions.
File renamed without changes.
File renamed without changes.
38 changes: 21 additions & 17 deletions 04-linear-reg.Rmd → 03-linear-reg.Rmd
@@ -74,9 +74,10 @@ abline(a = 0,b = 5,lw=3)

That is slightly better. However, the line sits at too high a level - the point at which it crosses the y-axis is called the *intercept*, and here it is too high. We just learned how to represent a *line*, namely with two numbers called *intercept* and *slope*. Let's write down a simple formula which represents a line where some outcome $z$ is related to a variable $x$:

-$$
+\begin{equation}
z = b_0 + b_1 x (\#eq:bline)
-$$
+\end{equation}

Here $b_0$ represents the value of the intercept (i.e. $z$ when $x=0$), and $b_1$ is the value of the slope. The question for us is now: how do we choose the numbers $b_0$ and $b_1$ such that the result is a **good** line?
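
To get a feel for this, here is a small sketch (simulated data and hand-picked values for $b_0$ and $b_1$, so the numbers are purely illustrative) of how different choices produce different lines:

```r
set.seed(1)
x <- runif(50, 0, 10)               # some made-up x values
z <- 2 + 3 * x + rnorm(50, sd = 2)  # an outcome scattered around a true line

plot(x, z)
abline(a = 0, b = 5, lwd = 2)               # candidate 1: intercept 0, slope 5
abline(a = 2, b = 3, lwd = 2, col = "red")  # candidate 2: intercept 2, slope 3
```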

### Choosing the Best Line
@@ -179,9 +180,10 @@ plot_data = generate_data(sigma = 2)

In order to be able to reason about a good or a bad line, we need notation for the *output* of equation \@ref(eq:bline). We call the value $\hat{y}_i$ the *predicted value* for observation $i$, after having chosen some particular values $b_0$ and $b_1$:

-$$
+\begin{equation}
\hat{y}_i = b_0 + b_1 x_i (\#eq:abline-pred)
-$$
+\end{equation}

In general it is likely that we won't be able to choose $b_0$ and $b_1$ in such a way as to provide a perfect prediction, i.e. one where $\hat{y}_i = y_i$ for all $i$. That is, we expect to make an *error* in our prediction $\hat{y}_i$, so let's denote this value $e_i$. If we acknowledge that we will make errors, let's at least make them as small as possible! Exactly this is going to be our task now.
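
As a small illustration of what *making the errors small* means, the following sketch computes each $e_i$ and their sum of squares for one hand-picked candidate line (simulated data, illustrative values):

```r
set.seed(2)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)

b0 <- 0.5                # a hand-picked intercept
b1 <- 2.5                # a hand-picked slope
y_hat <- b0 + b1 * x     # predictions from this candidate line
e     <- y - y_hat       # the errors (residuals)
sum(e^2)                 # total squared error: the quantity we want to be small
```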

Suppose we have the following set of `r nrow(plot_data)` observations on `x` and `y`, and we draw the *best* straight line through them that we can think of. It would look like this:
@@ -192,9 +194,10 @@ plot_unexp_dev(plot_data)

Here, the red arrows indicate the **distance** between the prediction (i.e. the black line) and each data point; in other words, each arrow is a particular $e_i$. An upward pointing arrow indicates a positive value of a particular $e_i$, and vice versa for downward pointing arrows. The errors are also called *residuals*, which comes from the way we can write the equation for this relationship between two particular values $(y_i,x_i)$ belonging to observation $i$:

-$$
+\begin{equation}
y_i = b_0 + b_1 x_i + e_i (\#eq:abline)
-$$
+\end{equation}

You realize of course that $\hat{y}_i = y_i - e_i$, which just means that our prediction is the observed value $y_i$ minus any error $e_i$ we make. In other words, $e_i$ is what is left to be explained on top of the line $b_0 + b_1 x_i$; hence it is a *residual* in explaining $y_i$. Here are $y$, $\hat{y}$ and the resulting $e$, as plotted in figure \@ref(fig:line-arrows):

```{r,echo=FALSE}
@@ -243,14 +246,15 @@ aboutApp("SSR_cone")

The method to compute (or *estimate*) $b_0$ and $b_1$ we illustrated above is called *Ordinary Least Squares*, or OLS. $b_0$ and $b_1$ are therefore also often called the *OLS coefficients*. By solving problem \@ref(eq:ols-min) one can derive an explicit formula for them:

-$$
+\begin{equation}
b_1 = \frac{cov(x,y)}{var(x)}, (\#eq:beta1hat)
-$$
+\end{equation}

i.e. the estimate of the slope coefficient is the covariance between $x$ and $y$ divided by the variance of $x$, both computed from our sample of data. With $b_1$ in hand, we can get the estimate for the intercept as

-$$
+\begin{equation}
b_0 = \bar{y} - b_1 \bar{x}. (\#eq:beta0hat)
-$$
+\end{equation}

where $\bar{z}$ denotes the sample mean of variable $z$. The interpretation of the OLS slope coefficient $b_1$ is as follows. Given a line as in $y = b_0 + b_1 x$,

@@ -261,9 +265,10 @@ where $\bar{z}$ denotes the sample mean of variable $z$. The interpretation of t

There are several important special cases for the linear regression introduced above. Let's start with the most obvious one: What is the meaning of running a regression *without any regressor*, i.e. without an $x$? Our line becomes very simple. Instead of \@ref(eq:bline), we get

-$$
+\begin{equation}
y = b_0. (\#eq:b0line)
-$$
+\end{equation}

This means that our minimization problem in \@ref(eq:ols-min) *also* becomes very simple: We only have to choose $b_0$! We have

$$
@@ -454,9 +459,9 @@ abline(reg=l1,lw=2)

Looking at figure \@ref(fig:non-line-cars-ols), one is not totally convinced that the straight line is a good summary of this relationship. For values $x\in[50,120]$ the line seems too low, then again too high, and it completely misses the right boundary. It is easy to address this shortcoming by including *higher order terms* of an explanatory variable. We would modify \@ref(eq:abline) to now read

-$$
+\begin{equation}
y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i (\#eq:abline2)
-$$
+\end{equation}

This is a special case of *multiple regression*, which we will talk about in chapter \@ref(multiple-reg). You can see that there are *multiple* slope coefficients. For now, let's just see how this performs:
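
A minimal sketch of how such a squared term is added to an `R` regression (simulated data, so this only shows the mechanics, not the chapter's actual figures):

```r
set.seed(3)
x <- runif(100, 0, 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(100)      # a clearly curved relationship

quadratic <- lm(y ~ x + I(x^2))              # I() protects the ^ inside the formula
summary(quadratic)

plot(x, y)
curve(predict(quadratic, newdata = data.frame(x = x)), add = TRUE, lwd = 2)
```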

@@ -497,9 +502,8 @@ The total variation in outcome $y$ (often called SST, or *total sum of squares*)

In our setup, there exists a convenient measure of how well a particular statistical model fits the data. It is called $R^2$ (*R squared*), also known as the *coefficient of determination*. We make use of the just introduced decomposition of variance, and write the formula as

-$$
-R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] (\#eq:Rsquared)
-$$
+\begin{equation}R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] (\#eq:Rsquared)
+\end{equation}

It is easy to see that a *good fit* is one where the sum of *explained* squares (SSE) is large relative to the total variation (SST). In such a case, we observe an $R^2$ close to one. In the opposite case, we will see an $R^2$ close to zero. Notice that a small $R^2$ does not imply that the model is useless, just that it explains a small fraction of the observed variation.
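
Both the coefficient formulas \@ref(eq:beta1hat) and \@ref(eq:beta0hat) and the definition of $R^2$ in \@ref(eq:Rsquared) are easy to verify by hand; a small sketch on simulated data (all numbers illustrative):

```r
set.seed(4)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

# slope and intercept by hand, as in \@ref(eq:beta1hat) and \@ref(eq:beta0hat)
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(fit)                      # lm() returns exactly the same numbers

# R^2 from the variance decomposition
SSR <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
1 - SSR / SST
summary(fit)$r.squared         # identical
```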

18 changes: 10 additions & 8 deletions 05-MultipleReg.Rmd → 04-MultipleReg.Rmd
@@ -5,9 +5,10 @@
We can extend the discussion from chapter \@ref(linreg) to more than one explanatory variable. For example, suppose that instead of only $x$ we now had $x_1$ and $x_2$ in order to explain $y$. Everything we've learned for the single variable case applies here as well. Instead of a regression *line*, we now get a regression *plane*, i.e. an object representable in 3 dimensions: $(x_1,x_2,y)$.
As an example, suppose we wanted to explain how many *miles per gallon* (`mpg`) a car can travel as a function of its *horse power* (`hp`) and its *weight* (`wt`). In other words, we want to estimate the equation

-$$
+\begin{equation}
mpg_i = b_0 + b_1 hp_i + b_2 wt_i + e_i (\#eq:abline2d)
-$$
+\end{equation}

on our built-in dataset of cars (`mtcars`):

```{r mtcarsdata}
@@ -69,9 +70,10 @@ We ask this kind of question all the time in econometrics. In figure \@ref(fig:p

As a matter of fact, the kind of question asked here is so common that it has got its own name: we'd say "*ceteris paribus*, what is the impact of `hp` on `mpg`?". *Ceteris paribus* is Latin and means *other things equal*, i.e. all other variables held fixed. In terms of our model in \@ref(eq:abline2d), we want to know the following quantity:

-$$
+\begin{equation}
\frac{\partial mpg_i}{\partial hp_i} = b_1 (\#eq:abline2d-deriv)
-$$
+\end{equation}

This means: *keeping all other variables fixed, what is the effect of `hp` on `mpg`?* In calculus, the answer to this is provided by the *partial derivative*, as shown in \@ref(eq:abline2d-deriv). The value of the coefficient $b_1$ is therefore also called the *partial effect* of `hp` on `mpg`. In terms of our dataset, we use `R` to run the following **multiple regression**:
<br>
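
Presumably something along these lines (the book's own chunk may differ in details, but the model is the one in \@ref(eq:abline2d)):

```r
fit_cars <- lm(mpg ~ hp + wt, data = mtcars)
coef(fit_cars)

# the partial effect of hp: raise hp by one unit while keeping wt fixed
two_cars <- data.frame(hp = c(100, 101), wt = c(3, 3))
diff(predict(fit_cars, newdata = two_cars))   # equals coef(fit_cars)["hp"]
```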

@@ -155,16 +157,16 @@ While you explore this plot, ask yourself the following question: if you could o
Interactions allow the *ceteris paribus* effect of a certain regressor, say `str`, to depend on the value of yet another regressor, for example `avginc`. In other words, do test scores respond differently to the student-teacher ratio depending on whether the average income in an area is high or low? Notice that `str` and `avginc` in isolation cannot answer that question (because the value of other variables is assumed *fixed*!). To measure such an effect, we would reformulate our model like this:


-$$
+\begin{equation}
\text{testscr}_i = b_0 + b_1 \text{str}_i + b_2 \text{avginc}_i + b_3 (\text{str}_i \times \text{avginc}_i)+ e_i (\#eq:caschool-inter)
-$$
+\end{equation}


The inclusion of the *product* of `str` and `avginc` amounts to having different slopes with respect to `str` for different values of `avginc` (and vice versa). This is easy to see if we take the partial derivative of \@ref(eq:caschool-inter) with respect to `str`:

-$$
+\begin{equation}
\frac{\partial \text{testscr}_i}{\partial \text{str}_i} = b_1 + b_3 \text{avginc}_i (\#eq:caschool-inter-deriv)
-$$
+\end{equation}


>You should go back to equation \@ref(eq:abline2d-deriv) to remind yourself of what a *partial effect* was, and how exactly the present \@ref(eq:caschool-inter-deriv) differs from what we saw there.
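
To see \@ref(eq:caschool-inter-deriv) in numbers, here is a sketch on simulated data; the column names merely mimic `testscr`, `str` and `avginc`, so none of the coefficients correspond to the real California schools data:

```r
set.seed(5)
n       <- 200
str     <- runif(n, 14, 26)                    # student-teacher ratio
avginc  <- runif(n, 5, 55)                     # average district income
testscr <- 700 - 2 * str + 1.5 * avginc + 0.05 * str * avginc + rnorm(n, sd = 5)

inter <- lm(testscr ~ str * avginc)            # expands to str + avginc + str:avginc
coef(inter)

# implied slope of testscr with respect to str at two income levels: b1 + b3 * avginc
b <- coef(inter)
unname(b["str"] + b["str:avginc"] * c(10, 40))
```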
22 changes: 11 additions & 11 deletions 06-Categorial-Vars.Rmd → 05-Categorial-Vars.Rmd
@@ -46,9 +46,9 @@ $$
$$
and let's suppose that $y_i$ is a measure of $i$'s annual labor income. Our model is

-$$
+\begin{equation}
y_i = b_0 + b_1 \text{is.female}_i + e_i (\#eq:dummy-reg)
-$$
+\end{equation}

and here is how we estimate this in `R`:
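
A minimal stand-alone sketch of such a dummy-variable regression, on simulated incomes (the 0/1 dummy and every number below are purely illustrative):

```r
set.seed(6)
n         <- 500
is.female <- rbinom(n, size = 1, prob = 0.5)               # 1 = female, 0 = male
income    <- 30000 - 4000 * is.female + rnorm(n, sd = 5000)

dummy_reg <- lm(income ~ is.female)
coef(dummy_reg)                  # b1 = average income difference, female vs male

tapply(income, is.female, mean)  # the same story from the raw group means
```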

@@ -201,9 +201,9 @@ This means a one-unit change in $x$ increases the logarithm of the outcome by $b

Going back to our example, let's say that a worker's wage depends only on his or her *experience*, measured as the number of years he/she worked full-time:

-$$
+\begin{equation}
\ln w_i = b_0 + b_1 exp_i + e_i (\#eq:wage-exp)
-$$
+\end{equation}


```{r}
@@ -222,9 +222,9 @@ ggplot(mapping = aes(y=lwage,x=exp), data=Wages) + geom_point(shape=1,alpha=0.6)

Now let's investigate whether this relationship is different for men and women. We can do this by just including the `factor` variable `sex`:

-$$
+\begin{equation}
\ln w_i = b_0 + b_1 exp_i + b_2 sex_i + e_i (\#eq:wage-sex)
-$$
+\end{equation}

In `R` we can do this easily by using the `update` function as follows:

@@ -234,7 +234,7 @@ summary(lm_sex)
```


-What's going on here? Remember from above that `sex` is a `factor` with 2 levels *female* and *male*. We see in the above output that `R` included a regressor called `sexmale` $=\mathbf{1}[sex_i=="male"]$. This is a combination of the variable name `sex` and the level which was included in the regression. In other words, `R` chooses a *reference category* (by default the first of all levels by order of appearance), which is excluded - here this is `sex=="female"`. The interpretation is that $b_2$ measures the effect of being male *relative* to being female. `R` automatically creates a dummy variable for each potential level, excluding the first category. In particular, if `sex` had a third category `dont want to say`, there would be an additional regressor called `sexdontwanttosay`.
+What's going on here? Remember from above that `sex` is a `factor` with 2 levels *female* and *male*. We see in the above output that `R` included a regressor called `sexmale` $=\mathbf{1}[sex_i==male]$. This is a combination of the variable name `sex` and the level which was included in the regression. In other words, `R` chooses a *reference category* (by default the first of all levels by order of appearance), which is excluded - here this is `sex=="female"`. The interpretation is that $b_2$ measures the effect of being male *relative* to being female. `R` automatically creates a dummy variable for each potential level, excluding the first category. In particular, if `sex` had a third category `dont want to say`, there would be an additional regressor called `sexdontwanttosay`.
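
The expansion of a `factor` into dummy regressors can be inspected directly; a quick sketch on a toy data frame (not the `Wages` data used above):

```r
toy <- data.frame(sex = factor(c("female", "male", "male", "female")))

levels(toy$sex)                   # "female" comes first and becomes the reference
model.matrix(~ sex, data = toy)   # an intercept plus one dummy column, 'sexmale'
```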

```{r wage-plot2,fig.align='center',echo=FALSE,fig.cap='log wage vs experience with different intercepts by sex'}
@@ -256,15 +256,15 @@ Figure \@ref(fig:wage-plot2) illustrates this. You can see that both male and fe

You can see above that we *restricted* male and female to have the same slope with respect to years of experience. This may or may not be a good assumption. Thankfully, the dummy variable regression machinery allows for a quick solution to this - so-called *interaction* effects. As already introduced in chapter \@ref(mreg-interactions), interactions allow the *ceteris paribus* effect of a certain regressor, say `exp`, to depend on the value of yet another regressor, for example `sex`. Suppose then we would like to see whether male and female not only have different intercepts, but also different slopes with respect to `exp` in figure \@ref(fig:wage-plot2). Therefore we formulate this version of our model:

-$$
+\begin{equation}
\ln w_i = b_0 + b_1 exp_i + b_2 sex_i + b_3 (sex_i \times exp_i) + e_i (\#eq:wage-sex-inter)
-$$
+\end{equation}

The inclusion of the *product* of `exp` and `sex` amounts to having a different slope with respect to `exp` for each category of `sex`. This is easy to see if we take the partial derivative of \@ref(eq:wage-sex-inter) with respect to `sex`:

-$$
+\begin{equation}
\frac{\partial \ln w_i}{\partial sex_i} = b_2 + b_3 exp_i (\#eq:wage-sex-inter-deriv)
-$$
+\end{equation}

Back in our `R` session, we can run the full interactions model like this:
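
Presumably along the lines of the following sketch (the object name `lm_inter` is illustrative; it assumes the `Wages` data used in the chunks above is available, for instance from the `Ecdat` package):

```r
data("Wages", package = "Ecdat")   # one possible source of the Wages data used above

lm_inter <- lm(lwage ~ exp * sex, data = Wages)   # exp + sex + exp:sex
summary(lm_inter)

# slope with respect to exp for each group: b1 for female, b1 + b3 for male
# (coefficient names assume the factor levels female/male, as in the output above)
b <- coef(lm_inter)
c(female = unname(b["exp"]), male = unname(b["exp"] + b["exp:sexmale"]))
```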

24 changes: 14 additions & 10 deletions 07-StdErrors.Rmd → 06-StdErrors.Rmd
@@ -37,9 +37,10 @@ A **statistical model** is simply a set of assumptions about how some data have
Let's bring back our simple model \@ref(eq:abline) to explain this concept.


-$$
+\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (\#eq:abline-5)
-$$
+\end{equation}

The smallest set of assumptions used to define the *classical regression model* as in \@ref(eq:abline-5) is the following:

1. The data are **not linearly dependent**: Each variable provides new information for the outcome, and it cannot be replicated as a linear combination of other variables. We have seen this in section \@ref(multicol). In the particular case of one regressor, as here, we require that $x$ exhibit some variation in the data, i.e. $Var(x)\neq 0$.
@@ -64,19 +65,20 @@ First, we *assumed* that \@ref(eq:abline-5) is the correct represenation of the
The standard deviation of the OLS parameters is generally called *standard error*. As such, it is just the square root of the parameter's variance.
Under assumptions 1. through 4. we can define the formula for the variance of our slope coefficient in the context of our single regressor model \@ref(eq:abline-5) as follows:

-$$
+\begin{equation}
Var(b_1|x_i) = \frac{\sigma^2}{\sum_i^N (x_i - \bar{x})^2} = \frac{\sigma^2}{Var(x)} (\#eq:var-ols)
-$$
+\end{equation}

In practice, we don't know the theoretical variance of $\varepsilon$, i.e. $\sigma^2$, but we form an estimate of it from our sample of data. A widely used estimate uses the already encountered SSR (sum of squared residuals) and is denoted $s^2$:

$$
s^2 = \frac{SSR}{n-p} = \frac{\sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2}{n-p} = \frac{\sum_{i=1}^n e_i^2}{n-p}
$$
where $n-p$ is the number of *degrees of freedom* available in this estimation, and $p$ is the number of parameters we wish to estimate (here: 2, the intercept $b_0$ and the slope $b_1$). So, the variance formula would become

-$$
+\begin{equation}
Var(b_1|x_i) = \frac{SSR}{(n-p)Var(x)} (\#eq:var-ols2)
-$$
+\end{equation}

You can clearly see that, as $n$ increases, the denominator increases, and therefore the variance of the estimate will decrease.
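
A quick simulated check of this last point (sample sizes and all numbers purely illustrative):

```r
set.seed(7)
se_of_slope <- function(n) {
  x   <- rnorm(n)
  y   <- 1 + 2 * x + rnorm(n)
  fit <- lm(y ~ x)
  summary(fit)$coefficients["x", "Std. Error"]
}

c(n_50 = se_of_slope(50), n_5000 = se_of_slope(5000))  # the standard error shrinks with n
```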

@@ -103,9 +105,9 @@ Great. But what does this *mean*? How could $x$ be correlated with something we

Imagine that we assume that

-$$
+\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (\#eq:DGP-h)
-$$
+\end{equation}

represents the DGP determining the sales price of houses ($y$) as a function of the number of bathrooms ($x$). We run OLS as

@@ -139,9 +141,11 @@ Keeping everything else fixed at the current value, what is the impact of $x$ on
```
<br>
It looks like our DGP in \@ref(eq:DGP-h) is the *wrong model*. Suppose instead that in reality sales prices are generated like this:
-$$
+\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i (\#eq:DGP-h2)
-$$
+\end{equation}

This would now mean that by running our regression, informed by the wrong DGP, what we estimate is in fact this:
$$
y_i = b_0 + b_1 x_i + (b_2 z_i + e_i) = b_0 + b_1 x_i + u_i.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: ScPoEconometrics
Type: Package
Title: ScPoEconometrics
-Date: 2018-09-22
-Version: 0.1.7
+Date: 2018-10-05
+Version: 0.1.8
Authors@R: c(
person("Florian", "Oswald", email = "[email protected]", role = c("aut","cre")),
person("Jean-Marc", "Robin", email = "[email protected]", role = "ctb"),
