---
title: "Data frames"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data frames}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(dplyr.print_min = 4L, dplyr.print_max = 4L)
library(dplyr)
```

## Creating

`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:

  * It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).
    
    ```{r}
    data.frame(x = letters) %>% sapply(class)
    data_frame(x = letters) %>% sapply(class)
    ```
    
    This makes it easier to use with list-columns:
    
    ```{r}
    data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
    ```
    
    List-columns are most commonly created by `do()`, but they can be useful to
    create by hand.
      
  * It never adjusts the names of variables:
  
    ```{r}
    data.frame(`crazy name` = 1) %>% names()
    data_frame(`crazy name` = 1) %>% names()
    ```

  * It evaluates its arguments lazily and sequentially:
  
    ```{r}
    data_frame(x = 1:5, y = x ^ 2)
    ```

  * It adds the `tbl_df` class to the output so that if you accidentally print a large 
    data frame you only get the first few rows.
    
    ```{r}
    data_frame(x = 1:5) %>% class()
    ```

  * It changes the behaviour of `[` to always return the same type of object:
    subsetting using `[` always returns a `tbl_df()` object; subsetting using 
    `[[` always returns a column.
    
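    You should be aware of one case where subsetting a `tbl_df` object
    produces a different result than subsetting a `data.frame`: selecting a
    single column with `[`, which base R simplifies to a vector but a `tbl_df`
    keeps as a one-column data frame: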
  
    ```{r}
    df <- data.frame(a = 1:2, b = 1:2)
    str(df[, "a"])
    
    tbldf <- tbl_df(df)
    str(tbldf[, "a"])
    ```
    
  * It never uses `row.names()`. The whole point of tidy data is to 
    store variables in a consistent way, so it never stores a variable as 
    a special attribute.
  
  * It only recycles vectors of length 1. This is because recycling vectors of greater lengths 
    is a frequent source of bugs.
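
The last two points can be seen by comparing against base R. This is only a sketch: the names `cars_tidy`, `recycled` and `mismatched` are invented for the example, and the `data_frame()` call that is expected to fail is wrapped in `try()` because the exact error message may differ between dplyr versions.

```{r}
library(dplyr)

# mtcars keeps the car model in the row.names attribute. Moving it into an
# ordinary column stores it like any other variable.
cars_tidy <- mtcars
cars_tidy$model <- rownames(cars_tidy)
rownames(cars_tidy) <- NULL
head(cars_tidy[, c("model", "mpg")], 3)

# Base R silently recycles the length-2 vector to fill 4 rows.
recycled <- data.frame(x = 1:4, y = 1:2)
recycled$y

# data_frame() recycles a length-1 value ...
data_frame(x = 1:4, y = 0)

# ... but is expected to reject a length-2 value, hence the try().
mismatched <- try(data_frame(x = 1:4, y = 1:2), silent = TRUE)
class(mismatched)
```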

## Coercion

To complement `data_frame()`, dplyr provides `as_data_frame()` to coerce lists into data frames. It does two things:

* It checks that the input list is valid for a data frame, i.e. that each element
  is named, is a 1d atomic vector or list, and all elements have the same 
  length.
  
* It sets the class and attributes of the list to make it behave like a data frame.
  This modification does not require a deep copy of the input list, so it's
  very fast.
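
The two steps can be sketched in base R. The helper below, `as_df_sketch()`, is hypothetical and is not dplyr's implementation. It only illustrates the idea of validating a list and then relabelling it, and it produces a plain `data.frame` rather than a `tbl_df`.

```{r}
# Hypothetical helper (not part of dplyr): validate a list, then set the
# attributes that make it behave like a data frame.
as_df_sketch <- function(x) {
  stopifnot(is.list(x), length(x) > 0)

  # Every element must be named.
  nms <- names(x)
  if (is.null(nms) || any(is.na(nms) | nms == "")) {
    stop("all elements must be named")
  }

  # Every element must be a 1d atomic vector or list.
  is_1d <- vapply(
    x,
    function(col) (is.atomic(col) || is.list(col)) && is.null(dim(col)),
    logical(1)
  )
  if (!all(is_1d)) stop("all elements must be 1d atomic vectors or lists")

  # All elements must have the same length.
  n <- unique(vapply(x, length, integer(1)))
  if (length(n) != 1) stop("all elements must have the same length")

  # Relabel the list: only attributes change, the columns are not rebuilt.
  attr(x, "row.names") <- .set_row_names(n)
  class(x) <- "data.frame"
  x
}

sketch_df <- as_df_sketch(list(a = 1:3, b = letters[1:3]))
is.data.frame(sketch_df)
dim(sketch_df)

# Columns of different lengths are rejected.
bad <- try(as_df_sketch(list(a = 1:3, b = 1:2)), silent = TRUE)
class(bad)
```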
  
This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))`, i.e. it coerces each component to a data frame and then `cbind()`s them all together. Consequently, `as_data_frame()` is much faster than `as.data.frame()`:

```{r}
l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l2),
  as.data.frame(l2)
)
```

The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.

## Memory

One of the reasons that dplyr is fast is that it is very careful about when it makes copies. This section describes how this works, and gives you some useful tools for understanding the memory usage of data frames in R.

The first tool we'll use is `dplyr::location()`. It tells us the memory location of three components of a data frame object:

* the data frame itself
* each column
* each attribute

```{r}
location(iris)
```

It's useful to know the memory address, because if the address changes, then you'll know that R has made a copy. Copies are bad because they take time to create. This isn't usually a bottleneck if you have a few thousand values, but if you have millions or tens of millions of values it starts to take significant amounts of time. Unnecessary copies are also bad because they take up memory.

R tries to avoid making copies where possible. For example, if you just assign `iris` to another variable, it continues to point to the same location:

```{r}
iris2 <- iris
location(iris2)
```

Rather than having to compare hard-to-read memory locations, we can instead use the `dplyr::changes()` function to highlight changes between two versions of a data frame. The code below shows us that `iris` and `iris2` are identical: both names point to the same location in memory.

```{r}
changes(iris2, iris)
```

What do you think happens if you modify a single column of `iris2`? In R 3.1.0 and above, R knows to modify only that one column and to leave the others pointing to their existing locations:

```{r}
iris2$Sepal.Length <- iris2$Sepal.Length * 2
changes(iris, iris2)
```

(This was not the case prior to version 3.1.0, where R created a deep copy of the entire data frame.)

dplyr is equally smart:

```{r}
iris3 <- mutate(iris, Sepal.Length = Sepal.Length * 2)
changes(iris3, iris)
```

It creates only one new column while all the other columns continue to point at their original locations. You might notice that the attributes are still copied. This has little impact on performance because attributes are usually short vectors, and copying them keeps the internal dplyr code considerably simpler.

dplyr never makes copies unless it has to:

* `tbl_df()` and `group_by()` don't copy columns.

* `select()` never copies columns, even when you rename them.

* `mutate()` never copies columns, except when you modify an existing column.

* `arrange()` must always copy all columns because you're changing the order of every one.
  This is an expensive operation for big data, but you can generally avoid
  it using the `order_by` argument to [window functions](window-functions.html).

* `summarise()` creates new data, but it's usually at least an order of
  magnitude smaller than the original data.

In short, dplyr lets you work with data frames with very little memory overhead.
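
As a rough illustration of the `arrange()` point above: `lag()` and `lead()` accept an `order_by` argument, so a value can be computed relative to an ordering without first sorting every column. The `sales` data frame and its values are invented for this sketch.

```{r}
library(dplyr)

# Rows are deliberately out of order by year.
sales <- data.frame(year = c(2002, 2000, 2001), value = c(30, 10, 20))

# order_by makes lag() look up the value from the previous year rather than
# from the previous row. The rows themselves keep their original order.
with_prev <- mutate(sales, prev = lag(value, order_by = year))
with_prev
```

Only the new `prev` column is created here. The existing `year` and `value` columns are not reordered.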

data.table takes this idea one step further: it provides functions that modify a data table in place. This avoids the need to make copies of pointers to existing columns and attributes, and speeds up operations when you have many columns. dplyr doesn't do this with data frames (although it could) because I think it's safer to keep data immutable: even if the resulting data frame shares practically all the data of the original data frame, all dplyr data frame methods return a new data frame.
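
A small check of that immutability, using only functions already shown in this vignette (the names `original` and `iris_doubled` are made up for the example):

```{r}
library(dplyr)

original <- iris$Sepal.Length
iris_doubled <- mutate(iris, Sepal.Length = Sepal.Length * 2)

# The input data frame is unchanged; only the returned one has new values.
identical(iris$Sepal.Length, original)
identical(iris_doubled$Sepal.Length, original * 2)
```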