
Commit

Changes from @mine-cetinkaya-rundel
hadley committed Jul 31, 2016
1 parent fb8f3e5 commit 9cf3bad
Showing 11 changed files with 152 additions and 86 deletions.
36 changes: 23 additions & 13 deletions EDA.Rmd
@@ -93,7 +93,8 @@ ggplot(data = diamonds) +
The height of the bars displays how many observations occurred with each x value. You can compute these values manually with `dplyr::count()`:

```{r}
diamonds %>% count(cut)
diamonds %>%
count(cut)
```

A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
@@ -106,15 +107,17 @@ ggplot(data = diamonds) +
You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:

```{r}
diamonds %>% count(cut_width(carat, 0.5))
diamonds %>%
count(cut_width(carat, 0.5))
```

A histogram divides the x axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.

You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.

```{r}
smaller <- diamonds %>% filter(carat < 3)
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
@@ -123,10 +126,12 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` instead of `geom_histogram()`. `geom_freqpoly()` performs the same calculation as `geom_histogram()`, but displays the counts with lines instead of bars. It's much easier to understand overlapping lines than overlapping bars.

```{r}
ggplot(data = smaller, mapping = aes(x = carat)) +
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
```

There are a few challenges with this type of plot, which we'll come back to in [visualising a categorical and a continuous variable](#cat-cont).

Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).

### Typical values
@@ -202,7 +207,8 @@ unusual

The `y` variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can't have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!

When you discover an outlier, it's a good idea to trace it back as far as possible. You'll be in a much stronger analytical position if you can figure out why it happened. If you can't figure it out, and want to just move on with your analysis, replace it with a missing value, which we'll discuss in the next section.
It's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
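
If you do decide to replace unusual values with missing values, one way is `mutate()` plus `ifelse()` (a sketch assuming the dplyr and ggplot2 packages, with `y` as the diamond width in mm):

```{r}
library(dplyr)
library(ggplot2)

# Replace implausible widths (below 3mm or above 20mm) with missing values
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```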


### Exercises

@@ -227,8 +233,9 @@ If you've encountered unusual values in your dataset, and simply want to move on

1. Drop the entire row with the strange values:

```{r}
diamonds2 <- diamonds %>% filter(between(y, 3, 20))
```{r, eval = FALSE}
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
```
I don't recommend this option because just because one measurement
@@ -289,7 +296,7 @@ However this plot isn't great because there are many more non-cancelled flights

If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

### A categorical and continuous variable
### A categorical and continuous variable {#cat-cont}

It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:

@@ -343,14 +350,16 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.

`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.
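
As a quick illustration of what `reorder()` does on its own (a minimal base R sketch with made-up values):

```{r}
x <- factor(c("a", "b", "c"))
# reorder the levels of `x` by the median of a companion numeric vector
x2 <- reorder(x, c(30, 10, 20), FUN = median)
levels(x2)
#> [1] "b" "c" "a"
```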

For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:

```{r}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```

Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, reorder `x` variable with `reorder()`. This code reorders the `class` based on the median value of `hwy` in each group.
To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:

```{r fig.height = 3}
ggplot(data = mpg) +
@@ -410,7 +419,8 @@ The size of each circle in the plot displays how many observations occurred at e
Another approach is to compute the count with dplyr:

```{r}
diamonds %>% count(color, cut)
diamonds %>%
count(color, cut)
```

Then visualise with `geom_tile()` and the fill aesthetic:
@@ -419,7 +429,7 @@
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
geom_tile(aes(fill = n))
```

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
@@ -580,7 +590,7 @@ Sometimes we'll turn the end of a pipeline of data transformation into a plot. Wat
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
geom_tile()
```

If you want to learn more about ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
6 changes: 4 additions & 2 deletions datetimes.Rmd
@@ -287,8 +287,10 @@ update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
If values are too big, they will roll over:

```{r}
ymd("2015-02-01") %>% update(mday = 30)
ymd("2015-02-01") %>% update(hour = 400)
ymd("2015-02-01") %>%
update(mday = 30)
ymd("2015-02-01") %>%
update(hour = 400)
```

You can use `update()` to show the distribution of flights across the course of the day for every day of the year:
15 changes: 9 additions & 6 deletions model-assess.Rmd
@@ -116,8 +116,10 @@ my_model <- function(df) {
mod <- my_model(df)
rmse(mod, df)
grid <- df %>% expand(x = seq_range(x, 50))
preds <- grid %>% add_predictions(mod, var = "y")
grid <- df %>%
expand(x = seq_range(x, 50))
preds <- grid %>%
add_predictions(mod, var = "y")
df %>%
ggplot(aes(x, y)) +
@@ -156,10 +158,11 @@ But do you think this model will do well if we apply it to new data from the sam
In real life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.

```{r}
boot <- bootstrap(df, 100) %>% mutate(
mod = map(strap, my_model),
pred = map2(list(grid), mod, add_predictions)
)
boot <- bootstrap(df, 100) %>%
mutate(
mod = map(strap, my_model),
pred = map2(list(grid), mod, add_predictions)
)
boot %>%
unnest(pred) %>%
15 changes: 10 additions & 5 deletions model-basics.Rmd
@@ -125,7 +125,8 @@ sim1_dist <- function(a1, a2) {
measure_distance(c(a1, a2), sim1)
}
models <- models %>% mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models <- models %>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
```

@@ -245,7 +246,8 @@ It's also useful to see what the model doesn't capture, the so-called residuals
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `modelr::data_grid()`. Its first argument is a data frame, and for each subsequent argument it finds the unique values and then generates all combinations:
```{r}
grid <- sim1 %>% data_grid(x)
grid <- sim1 %>%
data_grid(x)
grid
```

@@ -254,7 +256,8 @@
Next we add predictions. We'll use `modelr::add_predictions()` which takes a data frame and a model. It adds the predictions from the model to a new column in the data frame:

```{r}
grid <- grid %>% add_predictions(sim1_mod)
grid <- grid %>%
add_predictions(sim1_mod)
grid
```

@@ -275,7 +278,8 @@ The flip-side of predictions are __residuals__. The predictions tell you the pa
We add residuals to the data with `add_residuals()`, which works much like `add_predictions()`. Note, however, that we use the original dataset, not a manufactured grid. This is because to compute residuals we need actual y values.

```{r}
sim1 <- sim1 %>% add_residuals(sim1_mod)
sim1 <- sim1 %>%
add_residuals(sim1_mod)
sim1
```
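
A residual is simply the observed value minus the model's prediction; a minimal base R sketch with made-up numbers:

```{r}
y     <- c(5, 7, 9)        # observed values (hypothetical)
pred  <- c(4.5, 7.2, 9.1)  # model predictions (hypothetical)
resid <- y - pred
resid
#> [1]  0.5 -0.2 -0.1
```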

@@ -392,7 +396,8 @@ ggplot(sim2, aes(x)) +
You can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident, so it's good to recognise this error message:

```{r, error = TRUE}
tibble(x = "e") %>% add_predictions(mod2)
tibble(x = "e") %>%
add_predictions(mod2)
```

### Interactions (continuous and categorical)
21 changes: 13 additions & 8 deletions model-building.Rmd
@@ -222,7 +222,8 @@ ggplot(daily, aes(wday, n)) +
Next we compute and visualise the residuals:

```{r}
daily <- daily %>% add_residuals(mod)
daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
@@ -248,7 +249,8 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
1. There are some days with far fewer flights than expected:
```{r}
daily %>% filter(resid < -100)
daily %>%
filter(resid < -100)
```
If you're familiar with American public holidays, you might spot New Year's
@@ -301,7 +303,8 @@ term <- function(date) {
)
}
daily <- daily %>% mutate(term = term(date))
daily <- daily %>%
mutate(term = term(date))
daily %>%
filter(wday == "Sat") %>%
@@ -367,10 +370,11 @@ If you're experimenting with many models and many visualisations, it's a good id

```{r}
compute_vars <- function(data) {
data %>% mutate(
term = term(date),
wday = wday(date, label = TRUE)
)
data %>%
mutate(
term = term(date),
wday = wday(date, label = TRUE)
)
}
```

@@ -413,7 +417,8 @@ How do you decide how many parameters to use for the spline? You can either eith
How would these days generalise to another year?

```{r}
daily %>% filter(resid > 80)
daily %>%
filter(resid > 80)
```
1. Create a new variable that splits the `wday` variable into terms, but only
21 changes: 14 additions & 7 deletions model-many.Rmd
@@ -156,8 +156,10 @@ by_country
This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame take care of that for you:

```{r}
by_country %>% filter(continent == "Europe")
by_country %>% arrange(continent, country)
by_country %>%
filter(continent == "Europe")
by_country %>%
arrange(continent, country)
```

If your list of data frames and list of models were separate objects, you'd have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
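
To see why, here's a minimal base R sketch (hypothetical data) of two parallel vectors falling out of sync:

```{r}
countries <- c("b", "a")
stats     <- c(2, 1)          # stats[i] belongs to countries[i]
countries <- sort(countries)  # countries reordered...
# ...but `stats` was not reordered, so the pairing is now silently wrong
rbind(countries, stats)
```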
@@ -167,9 +169,10 @@ If your list of data frames and list of models were separate objects, you have t
Previously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. To compute the residuals, we need to call `add_residuals()` with each model-data pair:

```{r}
by_country <- by_country %>% mutate(
resids = map2(data, model, add_residuals)
)
by_country <- by_country %>%
mutate(
resids = map2(data, model, add_residuals)
)
by_country
```

@@ -233,7 +236,8 @@ glance
With this data frame in hand, we can start to look for models that don't fit well:

```{r}
glance %>% arrange(r.squared)
glance %>%
arrange(r.squared)
```

The worst models all appear to be in Africa. Let's double check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:
@@ -435,7 +439,10 @@ The advantage of this structure is that it generalises in a straightforward way
Now if you want to iterate over names and values in parallel, you can use `map2()`:

```{r}
df %>% mutate(smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1])))
df %>%
mutate(
smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
)
```

7 changes: 4 additions & 3 deletions pipes.Rmd
@@ -243,13 +243,14 @@ The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of th
* For assignment, magrittr provides the `%<>%` operator, which allows you to
  replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```{r, eval = FALSE}
mtcars <- mtcars %>%
transform(cyl = cyl * 2)
```
with
```R
```{r, eval = FALSE}
mtcars %<>% transform(cyl = cyl * 2)
```
