
Commit

Changes from @mine-cetinkaya-rundel
hadley committed Jul 31, 2016
1 parent fb8f3e5 commit 9cf3bad
Showing 11 changed files with 152 additions and 86 deletions.
36 changes: 23 additions & 13 deletions EDA.Rmd
@@ -93,7 +93,8 @@ ggplot(data = diamonds) +
The height of the bars displays how many observations occurred with each x value. You can compute these values manually with `dplyr::count()`:

```{r}
diamonds %>% count(cut)
diamonds %>%
count(cut)
```

A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
@@ -106,15 +107,17 @@ ggplot(data = diamonds) +
You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:

```{r}
diamonds %>% count(cut_width(carat, 0.5))
diamonds %>%
count(cut_width(carat, 0.5))
```

A histogram divides the x axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.

You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.

```{r}
smaller <- diamonds %>% filter(carat < 3)
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
@@ -123,10 +126,12 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` instead of `geom_histogram()`. `geom_freqpoly()` performs the same calculation as `geom_histogram()`, but displays the counts with lines instead of bars. It's much easier to understand overlapping lines than overlapping bars.

```{r}
ggplot(data = smaller, mapping = aes(x = carat)) +
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
```

There are a few challenges with this type of plot, which we'll come back to in [visualising a categorical and a continuous variable](#cat-cont).

Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).

### Typical values
@@ -202,7 +207,8 @@ unusual

The `y` variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can't have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!

When you discover an outlier, it's a good idea to trace it back as far as possible. You'll be in a much stronger analytical position if you can figure out why it happened. If you can't figure it out, and want to just move on with your analysis, replace it with a missing value, which we'll discuss in the next section.
It's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
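
If you do decide to replace unusual values with missing values, one way is `mutate()` plus `ifelse()` (a sketch assuming the dplyr and ggplot2 packages, with `y` as the diamond width in mm):

```{r}
library(dplyr)
library(ggplot2)

# Replace implausible widths (below 3mm or above 20mm) with missing values
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```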


### Exercises

@@ -227,8 +233,9 @@ If you've encountered unusual values in your dataset, and simply want to move on

1. Drop the entire row with the strange values:

```{r}
diamonds2 <- diamonds %>% filter(between(y, 3, 20))
```{r, eval = FALSE}
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
```
I don't recommend this option because just because one measurement
@@ -289,7 +296,7 @@ However this plot isn't great because there are many more non-cancelled flights

If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

### A categorical and continuous variable
### A categorical and continuous variable {#cat-cont}

It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:

@@ -343,14 +350,16 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.

`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.
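
As a quick illustration of what `reorder()` does on its own (a minimal base R sketch with made-up values):

```{r}
x <- factor(c("a", "b", "c"))
# reorder the levels of `x` by the median of a companion numeric vector
x2 <- reorder(x, c(30, 10, 20), FUN = median)
levels(x2)
#> [1] "b" "c" "a"
```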

For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:

```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```

Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, reorder `x` variable with `reorder()`. This code reorders the `class` based on the median value of `hwy` in each group.
To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:

```{r fig.height = 3}
ggplot(data = mpg) +
@@ -410,7 +419,8 @@ The size of each circle in the plot displays how many observations occurred at e
Another approach is to compute the count with dplyr:

```{r}
diamonds %>% count(color, cut)
diamonds %>%
count(color, cut)
```

Then visualise with `geom_tile()` and the fill aesthetic:
@@ -419,7 +429,7 @@
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
geom_tile(aes(fill = n))
```

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
@@ -580,7 +590,7 @@ Sometimes we'll turn the end of a pipeline of data transformation into a plot. Wat
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
geom_tile()
```

If you want to learn more about ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
6 changes: 4 additions & 2 deletions datetimes.Rmd
@@ -287,8 +287,10 @@ update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
If values are too big, they will roll over:

```{r}
ymd("2015-02-01") %>% update(mday = 30)
ymd("2015-02-01") %>% update(hour = 400)
ymd("2015-02-01") %>%
update(mday = 30)
ymd("2015-02-01") %>%
update(hour = 400)
```

You can use `update()` to show the distribution of flights across the course of the day for every day of the year:
15 changes: 9 additions & 6 deletions model-assess.Rmd
@@ -116,8 +116,10 @@ my_model <- function(df) {
mod <- my_model(df)
rmse(mod, df)
grid <- df %>% expand(x = seq_range(x, 50))
preds <- grid %>% add_predictions(mod, var = "y")
grid <- df %>%
expand(x = seq_range(x, 50))
preds <- grid %>%
add_predictions(mod, var = "y")
df %>%
ggplot(aes(x, y)) +
@@ -156,10 +158,11 @@ But do you think this model will do well if we apply it to new data from the sam
In real life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.

```{r}
boot <- bootstrap(df, 100) %>% mutate(
mod = map(strap, my_model),
pred = map2(list(grid), mod, add_predictions)
)
boot <- bootstrap(df, 100) %>%
mutate(
mod = map(strap, my_model),
pred = map2(list(grid), mod, add_predictions)
)
boot %>%
unnest(pred) %>%
15 changes: 10 additions & 5 deletions model-basics.Rmd
@@ -125,7 +125,8 @@ sim1_dist <- function(a1, a2) {
measure_distance(c(a1, a2), sim1)
}
models <- models %>% mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models <- models %>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
```

@@ -245,7 +246,8 @@ It's also useful to see what the model doesn't capture, the so-called residuals
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `modelr::data_grid()`. Its first argument is a data frame, and for each subsequent argument it finds the unique values and then generates all combinations:
```{r}
grid <- sim1 %>% data_grid(x)
grid <- sim1 %>%
data_grid(x)
grid
```

@@ -254,7 +256,8 @@
Next we add predictions. We'll use `modelr::add_predictions()` which takes a data frame and a model. It adds the predictions from the model to a new column in the data frame:

```{r}
grid <- grid %>% add_predictions(sim1_mod)
grid <- grid %>%
add_predictions(sim1_mod)
grid
```

@@ -275,7 +278,8 @@ The flip-side of predictions are __residuals__. The predictions tell you the pa
We add residuals to the data with `add_residuals()`, which works much like `add_predictions()`. Note, however, that we use the original dataset, not a manufactured grid. This is because to compute residuals we need actual y values.

```{r}
sim1 <- sim1 %>% add_residuals(sim1_mod)
sim1 <- sim1 %>%
add_residuals(sim1_mod)
sim1
```
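
A residual is simply the observed value minus the model's prediction; a minimal base R sketch with made-up numbers:

```{r}
y     <- c(5, 7, 9)        # observed values (hypothetical)
pred  <- c(4.5, 7.2, 9.1)  # model predictions (hypothetical)
resid <- y - pred
resid
#> [1]  0.5 -0.2 -0.1
```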

@@ -392,7 +396,8 @@ ggplot(sim2, aes(x)) +
You can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident, so it's good to recognise this error message:

```{r, error = TRUE}
tibble(x = "e") %>% add_predictions(mod2)
tibble(x = "e") %>%
add_predictions(mod2)
```

### Interactions (continuous and categorical)
21 changes: 13 additions & 8 deletions model-building.Rmd
@@ -222,7 +222,8 @@ ggplot(daily, aes(wday, n)) +
Next we compute and visualise the residuals:

```{r}
daily <- daily %>% add_residuals(mod)
daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
@@ -248,7 +249,8 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
1. There are some days with far fewer flights than expected:
```{r}
daily %>% filter(resid < -100)
daily %>%
filter(resid < -100)
```
If you're familiar with American public holidays, you might spot New Year's
@@ -301,7 +303,8 @@ term <- function(date) {
)
}
daily <- daily %>% mutate(term = term(date))
daily <- daily %>%
mutate(term = term(date))
daily %>%
filter(wday == "Sat") %>%
@@ -367,10 +370,11 @@ If you're experimenting with many models and many visualisations, it's a good id

```{r}
compute_vars <- function(data) {
data %>% mutate(
term = term(date),
wday = wday(date, label = TRUE)
)
data %>%
mutate(
term = term(date),
wday = wday(date, label = TRUE)
)
}
```

@@ -413,7 +417,8 @@ How do you decide how many parameters to use for the spline? You can either eith
How would these days generalise to another year?

```{r}
daily %>% filter(resid > 80)
daily %>%
filter(resid > 80)
```
1. Create a new variable that splits the `wday` variable into terms, but only
21 changes: 14 additions & 7 deletions model-many.Rmd
@@ -156,8 +156,10 @@ by_country
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame take care of that for you:

```{r}
by_country %>% filter(continent == "Europe")
by_country %>% arrange(continent, country)
by_country %>%
filter(continent == "Europe")
by_country %>%
arrange(continent, country)
```

If your list of data frames and list of models were separate objects, you'd have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
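
To see why, here's a minimal base R sketch (hypothetical data) of two parallel vectors falling out of sync:

```{r}
countries <- c("b", "a")
stats     <- c(2, 1)          # stats[i] belongs to countries[i]
countries <- sort(countries)  # countries reordered...
# ...but `stats` was not reordered, so the pairing is now silently wrong
rbind(countries, stats)
```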
@@ -167,9 +169,10 @@ If your list of data frames and list of models were separate objects, you have t
Previously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. To compute the residuals, we need to call `add_residuals()` with each model-data pair:

```{r}
by_country <- by_country %>% mutate(
resids = map2(data, model, add_residuals)
)
by_country <- by_country %>%
mutate(
resids = map2(data, model, add_residuals)
)
by_country
```

@@ -233,7 +236,8 @@ glance
With this data frame in hand, we can start to look for models that don't fit well:

```{r}
glance %>% arrange(r.squared)
glance %>%
arrange(r.squared)
```

The worst models all appear to be in Africa. Let's double check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:
@@ -435,7 +439,10 @@ The advantage of this structure is that it generalises in a straightforward way
Now if you want to iterate over names and values in parallel, you can use `map2()`:

```{r}
df %>% mutate(smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1])))
df %>%
mutate(
smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
)
```

7 changes: 4 additions & 3 deletions pipes.Rmd
@@ -243,13 +243,14 @@ The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of th
* For assignment, magrittr provides the `%<>%` operator, which allows you to
  replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```{r, eval = FALSE}
mtcars <- mtcars %>%
transform(cyl = cyl * 2)
```
with
```R
```{r, eval = FALSE}
mtcars %<>% transform(cyl = cyl * 2)
```
