Commit

Fixes from @schuess
Closes hadley#409
hadley committed Oct 3, 2016
1 parent 74cb7d5 commit a0eba42
Showing 13 changed files with 51 additions and 51 deletions.
22 changes: 11 additions & 11 deletions EDA.Rmd
@@ -8,7 +8,7 @@ This chapter will show you how to use visualisation and transformation to explor

1. Search for answers by visualising, transforming, and modelling your data.

-1. Use what you learn to refine your questions and or generate new questions.
+1. Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will hone in on a few particularly productive areas that you'll eventually write up and communicate to others.

@@ -34,7 +34,7 @@ library(dplyr)
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

-EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
+EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

@@ -97,7 +97,7 @@ diamonds %>%
count(cut)
```

-A variable is **continuous** if can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
+A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:

```{r}
ggplot(data = diamonds) +
@@ -111,7 +111,7 @@ diamonds %>%
count(cut_width(carat, 0.5))
```

-A histogram divides the x axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
+A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.

You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
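As a sketch of that advice (assuming ggplot2 and its built-in `diamonds` data are available; the cutoff and binwidths below are illustrative), the same variable plotted at two binwidths:

```{r}
library(ggplot2)

# Subset used for illustration: diamonds under three carats
smaller <- subset(diamonds, carat < 3)

# A coarse binwidth shows only the broad shape of the distribution
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

# A much finer binwidth reveals extra structure, such as spikes at
# round carat values
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
```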

@@ -132,7 +132,7 @@ ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +

There are a few challenges with this type of plot, which we will come back to in [visualising a categorical and a continuous variable](#cat-cont).

-Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
+Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).

### Typical values

@@ -244,7 +244,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
variable you might find that you don't have any data left!
1. Instead, I recommend replacing the unusual values with missing values.
-The easiest way to do this is use `mutate()` to replace the variable
+The easiest way to do this is to use `mutate()` to replace the variable
with a modified copy. You can use the `ifelse()` function to replace
unusual values with `NA`:
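A minimal sketch of that pattern (assuming dplyr and ggplot2 are loaded; the cutoffs below are illustrative values for the `y` dimension of `diamonds`):

```{r}
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Replace implausible widths (y) with NA rather than dropping whole rows
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```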
@@ -288,7 +288,7 @@ However this plot isn't great because there are many more non-cancelled flights
### Exercises

1. What happens to missing values in a histogram? What happens to missing
-values in bar chart? Why is there a difference?
+values in a bar chart? Why is there a difference?

1. What does `na.rm = TRUE` do in `mean()` and `sum()`?

@@ -298,7 +298,7 @@ If variation describes the behavior _within_ a variable, covariation describes t

### A categorical and continuous variable {#cat-cont}

-It's common to want to explore the distribution of a continuous variable broken down by a categorical, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
+It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:

```{r}
ggplot(data = diamonds, mapping = aes(x = price)) +
@@ -332,7 +332,7 @@ Another alternative to display the distribution of a continuous variable broken

* Visual points that display observations that fall more than 1.5 times the
IQR from either edge of the box. These outlying points are unusual
-so are plotted individually
+so are plotted individually.

* A line (or whisker) that extends from each end of the box and goes to the
farthest non-outlier point in the distribution.
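Put together, those pieces can be sketched with `geom_boxplot()` (assuming ggplot2 and `diamonds` are loaded; `cut` and `price` are just one illustrative pairing):

```{r}
library(ggplot2)

# Box = IQR with a median line; whiskers = farthest non-outlier points;
# individual points = observations beyond 1.5 * IQR from the box edges
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```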
@@ -593,8 +593,8 @@ diamonds %>%

## Learning more

-If you want learn more about the mechanics ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
+If you want to learn more about the mechanics of ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.

Another useful resource is the [_R Graphics Cookbook_](https://amzn.com/1449316956) by Winston Chang. Much of the contents are available online at <http://www.cookbook-r.com/Graphs/>.

-I also recommend [_Graphical Data Analysis with R_](https://amzn.com/1498715230), by Antony Unwin. This is a book length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.
+I also recommend [_Graphical Data Analysis with R_](https://amzn.com/1498715230), by Antony Unwin. This is a book-length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.
4 changes: 2 additions & 2 deletions explore.Rmd
@@ -8,9 +8,9 @@ The goal of the first part of this book is to get you up to speed with the basic
knitr::include_graphics("diagrams/data-science-explore.png")
```

-You will get frustrated when you start programming in R, because it is such a stickler for mistakes. Even one character out of place will cause it to complain. However, that frustration is both typical and temporary. It happens to everyone, and the only way to get over it is to keep trying.
+You will get frustrated when you start programming in R, because it is such a stickler. Even one character out of place will cause it to complain. However, that frustration is both typical and temporary. It happens to everyone, and the only way to get over it is to keep trying.

-The goal of this part of the book is to get you to some useful tools with an immediate payoff as quickly as possible:
+The goal of this part of the book is to get you some useful tools with an immediate payoff as quickly as possible:

* Visualisation is a great place to start with R programming, because the
payoff is so clear: you get to make elegant and informative plots that help
2 changes: 1 addition & 1 deletion import.Rmd
@@ -89,7 +89,7 @@ Another option that commonly needs tweaking is `na`: this specifies the value (o
read_csv("a,b,c\n1,2,.", na = ".")
```

-This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them in to R vectors.
+This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
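As a quick sketch, the inline-string style used above carries over directly (readr assumed loaded; the tiny dataset is made up for illustration):

```{r}
library(readr)

# Tab-separated input, with "." parsed as a missing value
read_tsv("x\ty\n1\t2\n3\t.", na = ".")
```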

### Compared to base R

4 changes: 2 additions & 2 deletions intro.Rmd
@@ -98,7 +98,7 @@ It's common to think about modelling as a tool for hypothesis confirmation, and

## Prerequisites

-We've made few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
+We've made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find [Hands on Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.

To run the code in this book, you will need to install both R and the RStudio IDE. Both are open source, free, and easy to install:

@@ -196,7 +196,7 @@ There are three things you need to include to make your example reproducible: re

Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.

-You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way to is follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
+You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org). This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.

To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world. If you're an active Twitter user, follow the `#rstats` hashtag. Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.

6 changes: 3 additions & 3 deletions program.Rmd
@@ -8,7 +8,7 @@ In this part of the book, you'll improve your programming skills. Programming is
knitr::include_graphics("diagrams/data-science-program.png")
```

-Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand your why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.
+Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.

Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)

@@ -35,7 +35,7 @@ In the following four chapters, you'll learn skills that will allow you to both

## Learning more

-The goal of these chapters is to teach you the minimum about programming that you need to practice data science, which turns out to be a reasonable amount. Once you have mastered the material in this book, I strongly believe you should invest further in your programming skills. Learning more about programming is a long-term investment: it won't pay off immediately, but in the long-term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.
+The goal of these chapters is to teach you the minimum about programming that you need to practice data science, which turns out to be a reasonable amount. Once you have mastered the material in this book, I strongly believe you should invest further in your programming skills. Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.

To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:

@@ -49,5 +49,5 @@ To learn more you need to study R as a programming language, not just an interac
* [_Advanced R_](https://amzn.com/1466586966) by Hadley Wickham. This dives into the
details of R the programming language. This is a great place to start if you
have existing programming experience. It's also a great next step once you've
-internalised the ideas in these chapters. You can read it online at at
+internalised the ideas in these chapters. You can read it online at
<http://adv-r.had.co.nz>.
4 changes: 2 additions & 2 deletions tibble.Rmd
@@ -2,7 +2,7 @@

## Introduction

-Throughout this book we work with "tibbles" instead of R's traditional data.frame. Tibbles _are_ data frames, but they tweak some older behaviours to make life a littler easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier.
+Throughout this book we work with "tibbles" instead of R's traditional data.frame. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier.

If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.
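For example, a minimal sketch of coercing an existing data frame (the tibble package assumed installed; `mtcars` is just a convenient built-in data.frame):

```{r}
library(tibble)

# mtcars is a classic data.frame; as_tibble() converts it without
# changing the underlying data
as_tibble(mtcars)
```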

@@ -146,7 +146,7 @@ The main reason that some older functions don't work with tibble is the `[` func

## Exercises

-1. How can you tell if an object is a tibble? (Hint: trying print `mtcars`,
+1. How can you tell if an object is a tibble? (Hint: try printing `mtcars`,
which is a regular data frame).

1. Practice referring to non-syntactic names by:
