03-loops-R.Rmd

---
layout: page
title: Programming with R
subtitle: Analyzing multiple data sets
minutes: 30
---

```{r, include = FALSE}
source("tools/chunk-options.R")
opts_chunk$set(fig.path = "fig/03-loops-R-")
```

> ## Learning Objectives {.objectives}
>
> * Explain what a `for` loop does.
> * Correctly write `for` loops to repeat simple calculations.
> * Trace changes to a loop variable as the loop runs.
> * Trace changes to other variables as they are updated by a `for` loop.
> * Use a function to get a list of filenames that match a simple pattern.
> * Use a `for` loop to process multiple files.

We have created a function called `analyze` that creates graphs of the minimum, average, and maximum daily inflammation rates for a single data set:

```{r inflammation-01}
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}

analyze("data/inflammation-01.csv")
```

We can use it to analyze other data sets one by one:

```{r inflammation-02}
analyze("data/inflammation-02.csv")
```

but we have a dozen data sets right now and more on the way.
We want to create plots for all our data sets with a single statement.
To do that, we'll have to teach the computer how to repeat things.

### For Loops

Suppose we want to print each word in a sentence.
One way is to use six `print` statements:

```{r}
best_practice <- c("Let", "the", "computer", "do", "the", "work")
print_words <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
  print(sentence[4])
  print(sentence[5])
  print(sentence[6])
}

print_words(best_practice)
```

but that's a bad approach for two reasons:

 1. It doesn't scale: if we want to print the elements in a vector that's hundreds long, we'd be better off just typing them in.

 2. It's fragile: if we give it a longer vector, it only prints part of the data, and if we give it a shorter input, it returns `NA` values because we're asking for elements that don't exist!

```{r}
best_practice[-6]
print_words(best_practice[-6])
```

> ## Tip {.callout}
>
> R has has a special variable, `NA`, for designating missing values that are
> **N**ot **A**vailable in a data set. See `?NA` and [An Introduction to R][na]
> for more details.

[na]: http://cran.r-project.org/doc/manuals/r-release/R-intro.html#Missing-values

Here's a better approach:

```{r}
print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}

print_words(best_practice)
```

This is shorter---certainly shorter than something that prints every character in a hundred-letter string---and more robust as well:

```{r}
print_words(best_practice[-6])
```

The improved version of `print_words` uses a [for loop](reference.html#for-loop) to repeat an operation---in this case, printing---once for each thing in a collection.
The general form of a loop is:

```{r, eval=FALSE}
for (variable in collection) {
  do things with variable
}
```

We can name the [loop variable](reference.html#loop-variable) anything we like (with a few [restrictions][], e.g. the name of the variable cannot start with a digit).
`in` is part of the `for` syntax.
Note that the body of the loop is enclosed in curly braces `{ }`.
For a single-line loop body, as here, the braces aren't needed, but it is good practice to include them as we did.

[restrictions]: http://cran.r-project.org/doc/manuals/R-intro.html#R-commands_003b-case-sensitivity-etc

Here's another loop that repeatedly updates a variable:

```{r}
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len
```

It's worth tracing the execution of this little program step by step.
Since there are five elements in the vector `vowels`, the statement inside the loop will be executed five times.
The first time around, `len` is zero (the value assigned to it on line 1) and `v` is `"a"`.
The statement adds 1 to the old value of `len`, producing 1, and updates `len` to refer to that new value.
The next time around, `v` is `"e"` and `len` is 1, so `len` is updated to be 2.
After three more updates, `len` is 5; since there is nothing left in the vector `vowels` for R to process, the loop finishes.

Note that a loop variable is just a variable that's being used to record progress in a loop.
It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:

```{r}
letter <- "z"
for (letter in c("a", "b", "c")) {
  print(letter)
}
# after the loop, letter is
letter
```

Note also that finding the length of a vector is such a common operation that R actually has a built-in function to do it called `length`:

```{r}
length(vowels)
```

`length` is much faster than any R function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven't met yet, so we should always use it when we can (see this [lesson](01-supp-data-structures.html) to learn more about the different ways to store data in R).


> ## Challenge - Using loops {.challenge}
>
> 1. R has a built-in function called `seq` that creates a list of numbers:
>
> ```{r}
> seq(3)
> ```
>
> Using `seq`, write a function that prints the first **N** natural numbers, one per line:
>
> ```{r, echo=-1}
> print_N <- function(N) {
>   nseq <- seq(N)
>   for (num in nseq) {
>     print(num)
>   }
> }
> print_N(3)
> ```
>
> 2. Write a function called `total` that calculates the sum of the values in a vector.
> (R has a built-in function called `sum` that does this for you.
> Please don't use it for this exercise.)
>
> ```{r, echo=-1}
> total <- function(vec) {
>   #calculates the sum of the values in a vector
>   vec_sum <- 0
>   for (num in vec) {
>     vec_sum <- vec_sum + num
>   }
>   return(vec_sum)
> }
> ex_vec <- c(4, 8, 15, 16, 23, 42)
> total(ex_vec)
> ```
>
> 3. Exponentiation is built into R:
>
> ```{r}
> 2^4
> ```
>
> Write a function called `expo` that uses a loop to calculate the same result.
> ```{r, echo=-1}
> expo <- function(base, power) {
>   result <- 1
>   for (i in seq(power)) {
>     result <- result * base
>   }
>   return(result)
> }
> expo(2, 4)
> ```
>

### Processing Multiple Files

We now have almost everything we need to process all our data files.
The only thing that's missing is a function that finds files whose names match a pattern.
We do not need to write it ourselves because R already has a function to do this called `list.files`.

If we run the function without any arguments, `list.files()`, it returns every file in the current working directory.
We can understand this result by reading the help file (`?list.files`).
The first argument, `path`, is the path to the directory to be searched, and it has the default value of `"."` (recall from the [lesson](http://swcarpentry.github.io/shell-novice/01-filedir.html) on the Unix Shell that `"."` is shorthand for the current working directory).
The second argument, `pattern`, is the pattern being searched, and it has the default value of `NULL`.
Since no pattern is specified to filter the files, all files are returned.

So to list all the csv files, we could run either of the following:

```{r}
list.files(path = "data", pattern = "csv")
list.files(path = "data", pattern = "inflammation")
```


> ## Tip {.callout}
>
> For larger projects, it is recommended to organize separate parts of the
> analysis into multiple subdirectories, e.g. one subdirectory for the raw data,
> one for the code, and one for the results like figures. We have done that here
> to some extent, putting all of our data files into the subdirectory "data".
> For more advice on this topic, you can read [A quick guide to organizing
> computational biology projects][Noble2009] by William Stafford Noble.

[Noble2009]: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424


As these examples show, `list.files` result is a vector of strings, which means we can loop over it to do something with each filename in turn.
In our case, the "something" we want is our `analyze` function.

Because we have put our data in separate subdirectory, if we want to access these files
using the output of `list.files` we also need to include the "path" portion of the file name.
We can do that by using the argument `full.names = TRUE`.

```{r}
list.files(path = "data", pattern = "csv", full.names = TRUE)
list.files(path = "data", pattern = "inflammation", full.names = TRUE)
```


Let's test out running our `analyze` function by using it on the first three files in the vector returned by `list.files`:


```{r loop-analyze, fig=FALSE}
filenames <- list.files(path = "data", pattern = "inflammation", full.names = TRUE)
filenames <- filenames[1:3]
for (f in filenames) {
  print(f)
  analyze(f)
}
```

Sure enough, the maxima of these data sets show exactly the same ramp as the first, and their minima show the same staircase structure.

> ## Tip {.callout}
>
> In this lesson we saw how to use a simple `for` loop to repeat an operation.
> As you progress with R, you will learn that there are multiple ways to
> accomplish this. Sometimes the choice of one method over another is more a
> matter of personal style, but other times it can have consequences for the
> speed of your code. For instruction on best practices, see this supplementary
> [lesson](03-supp-loops-in-depth.html) that demonstrates how to properly repeat
> operations in R.

> ## Challenge - Using loops to analyze multiple files {.challenge}
>
> 1. Write a function called `analyze_all` that takes a filename pattern as its sole argument and runs `analyze` for each file whose name matches the pattern.

```{r analyze_all, include=FALSE}
analyze_all <- function(pattern) {
  # Runs the function analyze for each file in the current working directory
  # that contains the given pattern.
  filenames <- list.files(path = "data", pattern = pattern, full.names = TRUE)
  for (f in filenames) {
    analyze(f)
  }
}

# analyze_all("csv")
```

> ## Key Points {.callout}
>
> * Use `for (variable in collection)` to process the elements of a collection one at a time.
> * The body of a `for` loop is surrounded by curly braces (`{ }`).
> * Use `length(thing)` to determine the length of something that contains other values.
> * Use `list.files(path = "path", pattern = "pattern", full.names = TRUE)` to create a list of files whose names match a pattern.

> ## Next Steps {.callout}
>
> We have now solved our original problem: we can analyze any number of data files with a single command.
> More importantly, we have met two of the most important ideas in programming:
>
> * Use functions to make code easier to re-use and easier to understand.
> * Use vectors and data frames to store related values, and loops to repeat operations on them.
>
> We have one more big idea to introduce...