diff --git a/docs/packages.html b/docs/packages.html
index dae381e..ab87638 100644
--- a/docs/packages.html
+++ b/docs/packages.html
@@ -411,7 +411,7 @@

3.1.2 library

Solution. You can make a histogram of x3 with qplot(x3, binwidth = 1). The histogram will look like a symmetric pyramid. The middle bar will have a height of 3 and will appear above [2, 3), but be sure to try it and see for yourself.

- You can use a histogram to display visually how common different values of x are. Numbers covered by a tall bar are no more common than numbers covered by a short bar.
+ You can use a histogram to display visually how common different values of x are. Numbers covered by a tall bar are more common than numbers covered by a short bar.

How can you use a histogram to check the accuracy of your dice?

Well, if you roll your dice many times and keep track of the results, you would expect some numbers to occur more than others. This is because there are more ways to get some numbers by adding two dice together than to get other numbers, as shown in Figure 3.3.

If you roll your dice many times and plot the results with qplot, the histogram will show you how often each sum appeared. The sums that occurred most often will have the highest bars. The histogram should look like the pattern in Figure 3.3 if the dice are fairly weighted.
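As a rough sketch of the check described above, the following base R code simulates many rolls and tabulates the sums next to the counts you would expect from fair dice. The helper name `roll`, the seed, and the 10,000-roll count are illustrative assumptions, and `table` stands in for the `qplot` histogram so the example runs without ggplot2.

```r
# Assumed helper (not defined in this section): roll two fair dice, return the sum
roll <- function() {
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE)
  sum(dice)
}

set.seed(1)                        # arbitrary seed, for repeatability only
rolls <- replicate(10000, roll())  # 10,000 simulated rolls

# Observed frequency of each sum (the bar heights a histogram would show)
observed <- table(factor(rolls, levels = 2:12))

# Number of ways to make each sum with two dice
ways <- table(outer(1:6, 1:6, "+"))

# Expected counts for fair dice, side by side with the observed counts
expected <- as.vector(ways) / 36 * length(rolls)
comparison <- data.frame(sum = 2:12,
                         observed = as.vector(observed),
                         expected = round(expected))
print(comparison)
```

If the dice are fair, the two columns should be close to each other. A large, persistent gap for particular sums would suggest weighted dice.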

diff --git a/docs/search_index.json b/docs/search_index.json
index d160783..dc8b1f3 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -3,7 +3,7 @@
["preface.html", "Preface 0.1 Conventions Used in This Book 0.2 Acknowledgments", " Preface This book will teach you how to program in R. You’ll go from loading data to writing your own functions (which will outperform the functions of other R users). But this is not a typical introduction to R. I want to help you become a data scientist, as well as a computer scientist, so this book will focus on the programming skills that are most related to data science. The chapters in the book are arranged according to three practical projects–given that they’re fairly substantial projects, they span multiple chapters. I chose these projects for two reasons. First, they cover the breadth of the R language. You will learn how to load data, assemble and disassemble data objects, navigate R’s environment system, write your own functions, and use all of R’s programming tools, such as if else statements, for loops, S3 classes, R’s package system, and R’s debugging tools. The projects will also teach you how to write vectorized R code, a style of lightning-fast code that takes advantage of all of the things R does best. But, more importantly, the projects will teach you how to solve the logistical problems of data science—and there are many logistical problems. When you work with data, you will need to store, retrieve, and manipulate large sets of values without introducing errors. As you work through the book, I will teach you not just how to program with R, but how to use the programming skills to support your work as a data scientist. Not every programmer needs to be a data scientist, so not every programmer will find this book useful.
You will find this book helpful if you’re in one of the following categories: You already use R as a statistical tool, but you would like to learn how to write your own functions and simulations with R. You would like to teach yourself how to program, and you see the sense of learning a language related to data science. One of the biggest surprises in this book is that I do not cover traditional applications of R, such as models and graphs; instead, I treat R purely as a programming language. Why this narrow focus? R is designed to be a tool that helps scientists analyze data. It has many excellent functions that make plots and fit models to data. As a result, many statisticians learn to use R as if it were a piece of software—they learn which functions do what they want, and they ignore the rest. This is an understandable approach to learning R. Visualizing and modeling data are complicated skills that require a scientist’s full attention. It takes expertise, judgement, and focus to extract reliable insights from a data set. I would not recommend that any data scientist distract herself with computer programming until she feels comfortable with the basic theory and practice of her craft. If you would like to learn the craft of data science, I recommend the book R for Data Science, my companion volume to this book, co-written with Hadley Wickham. However, learning to program should be on every data scientist’s to-do list. Knowing how to program will make you a more flexible analyst and augment your mastery of data science in every way. My favorite metaphor for describing this was introduced by Greg Snow on the R help mailing list in May 2006. Using functions in R is like riding a bus. Writing functions in R is like driving a car. Busses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare). 
Cars, on the other hand, require much more work: you need to have some type of map or directions (even if the map is in your head), you need to put gas in every now and then, you need to know the rules of the road (have some type of drivers license). The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between busses. Using this analogy, programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed. R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS. - Greg Snow Greg compares R to SPSS, but he assumes that you use the full powers of R; in other words, that you learn how to program in R. If you only use functions that preexist in R, you are using R like SPSS: it is a bus that can only take you to certain places. This flexibility matters to data scientists. The exact details of a method or simulation will change from problem to problem. If you cannot build a method tailored to your situation, you may find yourself tempted to make unrealistic assumptions just so you can use an ill-suited method that already exists. This book will help you make the leap from bus to car. I have written it for beginning programmers. I do not talk about the theory of computer science—there are no discussions of big O() and little o() in these pages. Nor do I get into advanced details such as the workings of lazy evaluation. These things are interesting if you think of computer science at the theoretical level, but they are a distraction when you first learn to program. 
Instead, I teach you how to program in R with three concrete examples. These examples are short, easy to understand, and cover everything you need to know. I have taught this material many times in my job as Master Instructor at RStudio. As a teacher, I have found that students learn abstract concepts much faster when they are illustrated by concrete examples. The examples have a second advantage, as well: they provide immediate practice. Learning to program is like learning to speak another language—you progress faster when you practice. In fact, learning to program is learning to speak another language. You will get the best results if you follow along with the examples in the book and experiment whenever an idea strikes you. The book is a companion to R for Data Science. In that book, Hadley Wickham and I explain how to use R to make plots, model data, and write reports. That book teaches these tasks as data-science skills, which require judgement and expertise—not as programming exercises, which they also are. This book will teach you how to program in R. It does not assume that you have mastered the data-science skills taught in R for Data Science (nor that you ever intend to). However, this skill set amplifies that one. And if you master both, you will be a powerful, computer-augmented data scientist, fit to command a high salary and influence scientific dialogue. 0.1 Conventions Used in This Book The following typographical conventions are used in this book: Italic:: Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width:: Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold:: Shows commands or other text that should be typed literally by the user. 
Constant width italic:: Shows text that should be replaced with user-supplied values or by values determined by context. To comment or ask technical questions about this book, please file an issue at github.com/rstudio-education/hopr. 0.2 Acknowledgments Many excellent people have helped me write this book, from my two editors, Courtney Nash and Julie Steele, to the rest of the O’Reilly team, who designed, proofread, and indexed the book. Also, Greg Snow generously let me quote him in this preface. I offer them all my heartfelt thanks. I would also like to thank Hadley Wickham, who has shaped the way I think about and teach R. Many of the ideas in this book come from Statistics 405, a course that I helped Hadley teach when I was a PhD student at Rice University. Further ideas came from the students and teachers of Introduction to Data Science with R, a workshop that I teach on behalf of RStudio. Thank you to all of you. I’d like to offer special thanks to my teaching assistants Josh Paulson, Winston Chang, Jaime Ramos, Jay Emerson, and Vivian Zhang. Thank you also to JJ Allaire and the rest of my colleagues at RStudio who provide the RStudio IDE, a tool that makes it much easier to use, teach, and write about R. Finally, I would like to thank my wife, Kristin, for her support and understanding while I wrote this book. "], ["project-1-weighted-dice.html", "1 Project 1: Weighted Dice", " 1 Project 1: Weighted Dice Computers let you assemble, manipulate, and visualize data sets, all at speeds that would have wowed yesterday’s scientists. In short, computers give you scientific superpowers! But if you wish to use them, you’ll need to pick up some programming skills. 
As a data scientist who knows how to program, you will improve your ability to: Memorize (store) entire data sets Recall data values on demand Perform complex calculations with large amounts of data Do repetitive tasks without becoming careless or bored Computers can do all of these things quickly and error free, which lets your mind do the things it does well: make decisions and assign meaning. Sound exciting? Great! Let’s begin. When I was a college student, I sometimes daydreamed of going to Las Vegas. I thought that knowing statistics might help me win big. If that’s what led you to data science, you better sit down; I have some bad news. Even a statistician will lose money in a casino over the long run. This is because the odds for each game are always stacked in the casino’s favor. However, there is a loophole to this rule. You can make money–and reliably too. All you have to do is be the casino. Believe it or not, R can help you do that. Over the course of the book, you will use R to build three virtual objects: a pair of dice that you can roll to generate random numbers, a deck of cards that you can shuffle and deal from, and a slot machine modeled after some real-life video lottery terminals. After that, you’ll just need to add some video graphics and a bank account (and maybe get a few government licenses), and you’ll be in business. I’ll leave those details to you. These projects are lighthearted, but they are also deep. As you complete them, you will become an expert at the skills you need to work with data as a data scientist. You will learn how to store data in your computer’s memory, how to access data that is already there, and how to transform data values in memory when necessary. You will also learn how to write your own programs in R that you can use to analyze data and run simulations. If simulating a slot machine (or dice, or cards) seems frivolous, think of it this way: playing a slot machine is a process.
Once you can simulate it, you’ll be able to simulate other processes, such as bootstrap sampling, Markov chain Monte Carlo, and other data-analysis procedures. Plus, these projects provide concrete examples for learning all of the components of R programming: objects, data types, classes, notation, functions, environments, if trees, loops, and vectorization. This first project will make it easier to study these things by teaching you the basics of R. Your first mission is simple: assemble R code that will simulate rolling a pair of dice, like at a craps table. Once you have done that, we’ll weight the dice a bit in your favor, just to keep things interesting. In this project, you will learn how to: Use the R and RStudio interfaces Run R commands Create R objects Write your own R functions and scripts Load and use R packages Generate random samples Create quick plots Get help when you need it Don’t worry if it seems like we cover a lot of ground fast. This project is designed to give you a concise overview of the R language. You will return to many of the concepts we meet here in projects 2 and 3, where you will examine the concepts in depth. You’ll need to have both R and RStudio installed on your computer before you can use them. Both are free and easy to download. See Appendix A for complete instructions. If you are ready to begin, open RStudio on your computer and read on. "], ["basics.html", "2 The Very Basics 2.1 The R User Interface 2.2 Objects 2.3 Functions 2.4 Writing Your Own Functions 2.5 Arguments 2.6 Scripts 2.7 Summary", " 2 The Very Basics This chapter provides a broad overview of the R language that will get you programming right away. In it, you will build a pair of virtual dice that you can use to generate random numbers. Don’t worry if you’ve never programmed before; the chapter will teach you everything you need to know. To simulate a pair of dice, you will have to distill each die into its essential features. 
You cannot place a physical object, like a die, into a computer (well, not without unscrewing some screws), but you can save information about the object in your computer’s memory. Which information should you save? In general, a die has six important pieces of information: when you roll a die, it can only result in one of six numbers: 1, 2, 3, 4, 5, and 6. You can capture the essential characteristics of a die by saving the numbers 1, 2, 3, 4, 5, and 6 as a group of values in your computer’s memory. Let’s work on saving these numbers first, and then consider a method for “rolling” our die. 2.1 The R User Interface Before you can ask your computer to save some numbers, you’ll need to know how to talk to it. That’s where R and RStudio come in. RStudio gives you a way to talk to your computer. R gives you a language to speak in. To get started, open RStudio just as you would open any other application on your computer. When you do, a window should appear on your screen like the one shown in Figure 2.1. Figure 2.1: Your computer does your bidding when you type R commands at the prompt in the bottom line of the console pane. Don’t forget to hit the Enter key. When you first open RStudio, the console appears in the pane on your left, but you can change this with File > Preferences in the menu bar. If you do not yet have R and RStudio installed on your computer–or do not know what I am talking about–visit Appendix A. The appendix will give you an overview of the two free tools and tell you how to download them. The RStudio interface is simple. You type R code into the bottom line of the RStudio console pane and then press Enter to run it. The code you type is called a command, because it will command your computer to do something for you. The line you type it into is called the command line. When you type a command at the prompt and hit Enter, your computer executes the command and shows you the results. Then RStudio displays a fresh prompt for your next command.
For example, if you type 1 + 1 and hit Enter, RStudio will display: > 1 + 1 [1] 2 > You’ll notice that a [1] appears next to your result. R is just letting you know that this line begins with the first value in your result. Some commands return more than one value, and their results may fill up multiple lines. For example, the command 100:130 returns 31 values; it creates a sequence of integers from 100 to 130. Notice that new bracketed numbers appear at the start of the second and third lines of output. These numbers just mean that the second line begins with the 14th value in the result, and the third line begins with the 27th value. You can mostly ignore the numbers that appear in brackets: > 100:130 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 [14] 113 114 115 116 117 118 119 120 121 122 123 124 125 [27] 126 127 128 129 130 The colon operator (:) returns every integer between two integers. It is an easy way to create a sequence of numbers. Isn’t R a language? You may hear me speak of R in the third person. For example, I might say, “Tell R to do this” or “Tell R to do that”, but of course R can’t do anything; it is just a language. This way of speaking is shorthand for saying, “Tell your computer to do this by writing a command in the R language at the command line of your RStudio console.” Your computer, and not R, does the actual work. Is this shorthand confusing and slightly lazy to use? Yes. Do a lot of people use it? Everyone I know–probably because it is so convenient. When do we compile? In some languages, like C, Java, and FORTRAN, you have to compile your human-readable code into machine-readable code (often 1s and 0s) before you can run it. If you’ve programmed in such a language before, you may wonder whether you have to compile your R code before you can use it. The answer is no. R is a dynamic programming language, which means R automatically interprets your code as you run it.
If you type an incomplete command and press Enter, R will display a + prompt, which means R is waiting for you to type the rest of your command. Either finish the command or hit Escape to start over: > 5 - + + 1 [1] 4 If you type a command that R doesn’t recognize, R will return an error message. If you ever see an error message, don’t panic. R is just telling you that your computer couldn’t understand or do what you asked it to do. You can then try a different command at the next prompt: > 3 % 5 Error: unexpected input in "3 % 5" > Once you get the hang of the command line, you can easily do anything in R that you would do with a calculator. For example, you could do some basic arithmetic: 2 * 3 ## 6 4 - 1 ## 3 6 / (4 - 1) ## 2 Did you notice something different about this code? I’ve left out the >’s and [1]’s. This will make the code easier to copy and paste if you want to put it in your own console. R treats the hashtag character, #, in a special way; R will not run anything that follows a hashtag on a line. This makes hashtags very useful for adding comments and annotations to your code. Humans will be able to read the comments, but your computer will pass over them. The hashtag is known as the commenting symbol in R. For the remainder of the book, I’ll use hashtags to display the output of R code. I’ll use a single hashtag to add my own comments and a double hashtag, ##, to display the results of code. I’ll avoid showing >s and [1]s unless I want you to look at them. Cancelling commands Some R commands may take a long time to run. You can cancel a command once it has begun by pressing ctrl + c. Note that it may also take R a long time to cancel the command. Exercise 2.1 (Magic with Numbers) That’s the basic interface for executing R code in RStudio. Think you have it? If so, try doing these simple tasks. If you execute everything correctly, you should end up with the same number that you started with: Choose any number and add 2 to it. 
Multiply the result by 3. Subtract 6 from the answer. Divide what you get by 3. Throughout the book, I’ll put exercises in chunks, like the one above. I’ll follow each exercise with a model answer, like the one below. Solution. You could start with the number 10, and then do the following steps: 10 + 2 ## 12 12 * 3 ## 36 36 - 6 ## 30 30 / 3 ## 10 2.2 Objects Now that you know how to use R, let’s use it to make a virtual die. The : operator from a couple of pages ago gives you a nice way to create a group of numbers from one to six. The : operator returns its results as a vector, a one-dimensional set of numbers: 1:6 ## 1 2 3 4 5 6 That’s all there is to how a virtual die looks! But you are not done yet. Running 1:6 generated a vector of numbers for you to see, but it didn’t save that vector anywhere in your computer’s memory. What you are looking at is basically the footprints of six numbers that existed briefly and then melted back into your computer’s RAM. If you want to use those numbers again, you’ll have to ask your computer to save them somewhere. You can do that by creating an R object. R lets you save data by storing it inside an R object. What is an object? Just a name that you can use to call up stored data. For example, you can save data into an object like a or b. Wherever R encounters the object, it will replace it with the data saved inside, like so: a <- 1 a ## 1 a + 2 ## 3 What just happened? To create an R object, choose a name and then use the less-than symbol, <, followed by a minus sign, -, to save data into it. This combination looks like an arrow, <-. R will make an object, give it your name, and store in it whatever follows the arrow. So a <- 1 stores 1 in an object named a. When you ask R what’s in a, R tells you on the next line. You can use your object in new R commands, too. Since a previously stored the value of 1, you’re now adding 1 to 2. 
So, for another example, the following code would create an object named die that contains the numbers one through six. To see what is stored in an object, just type the object’s name by itself: die <- 1:6 die ## 1 2 3 4 5 6 When you create an object, the object will appear in the environment pane of RStudio, as shown in Figure 2.2. This pane will show you all of the objects you’ve created since opening RStudio. Figure 2.2: The RStudio environment pane keeps track of the R objects you create. You can name an object in R almost anything you want, but there are a few rules. First, a name cannot start with a number. Second, a name cannot use some special symbols, like ^, !, $, @, +, -, /, or *: Good names Names that cause errors a 1trial b $ FOO ^mean my_var 2nd .day !bad Capitalization R is case-sensitive, so name and Name will refer to different objects: Name <- 1 name <- 0 Name + 1 ## 2 Finally, R will overwrite any previous information stored in an object without asking you for permission. So, it is a good idea to not use names that are already taken: my_number <- 1 my_number ## 1 my_number <- 999 my_number ## 999 You can see which object names you have already used with the function ls: ls() ## "a" "die" "my_number" "name" "Name" You can also see which names you have used by examining RStudio’s environment pane. You now have a virtual die that is stored in your computer’s memory. You can access it whenever you like by typing the word die. So what can you do with this die? Quite a lot. R will replace an object with its contents whenever the object’s name appears in a command. So, for example, you can do all sorts of math with the die. Math isn’t so helpful for rolling dice, but manipulating sets of numbers will be your stock in trade as a data scientist.
So let’s take a look at how to do that: die - 1 ## 0 1 2 3 4 5 die / 2 ## 0.5 1.0 1.5 2.0 2.5 3.0 die * die ## 1 4 9 16 25 36 If you are a big fan of linear algebra (and who isn’t?), you may notice that R does not always follow the rules of matrix multiplication. Instead, R uses element-wise execution. When you manipulate a set of numbers, R will apply the same operation to each element in the set. So for example, when you run die - 1, R subtracts one from each element of die. When you use two or more vectors in an operation, R will line up the vectors and perform a sequence of individual operations. For example, when you run die * die, R lines up the two die vectors and then multiplies the first element of vector 1 by the first element of vector 2. R then multiplies the second element of vector 1 by the second element of vector 2, and so on, until every element has been multiplied. The result will be a new vector the same length as the first two, as shown in Figure 2.3. Figure 2.3: When R performs element-wise execution, it matches up vectors and then manipulates each pair of elements independently. If you give R two vectors of unequal lengths, R will repeat the shorter vector until it is as long as the longer vector, and then do the math, as shown in Figure 2.4. This isn’t a permanent change–the shorter vector will be its original size after R does the math. If the length of the short vector does not divide evenly into the length of the long vector, R will return a warning message. This behavior is known as vector recycling, and it helps R do element-wise operations: 1:2 ## 1 2 1:4 ## 1 2 3 4 die ## 1 2 3 4 5 6 die + 1:2 ## 2 4 4 6 6 8 die + 1:4 ## 2 4 6 8 6 8 Warning message: In die + 1:4 : longer object length is not a multiple of shorter object length Figure 2.4: R will repeat a short vector to do element-wise operations with two vectors of uneven lengths. 
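A small sketch of what recycling amounts to: the result of `die + 1:2` matches what you would get by writing the repeated vector out yourself with `rep`. The object names `recycled` and `by_hand` are illustrative only.

```r
die <- 1:6

# R recycles 1:2 three times to match the six elements of die
recycled <- die + 1:2

# Writing the repetition out by hand gives the same answer
by_hand <- die + rep(1:2, times = 3)

identical(recycled, by_hand)
## TRUE
```

Because 2 divides evenly into 6, no warning appears here. With `1:4`, the recycled vector would be cut off partway through its second pass, which is what triggers the warning message.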
Element-wise operations are a very useful feature in R because they manipulate groups of values in an orderly way. When you start working with data sets, element-wise operations will ensure that values from one observation or case are only paired with values from the same observation or case. Element-wise operations also make it easier to write your own programs and functions in R. But don’t think that R has given up on traditional matrix multiplication. You just have to ask for it when you want it. You can do inner multiplication with the %*% operator and outer multiplication with the %o% operator: die %*% die ## 91 die %o% die ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 1 2 3 4 5 6 ## [2,] 2 4 6 8 10 12 ## [3,] 3 6 9 12 15 18 ## [4,] 4 8 12 16 20 24 ## [5,] 5 10 15 20 25 30 ## [6,] 6 12 18 24 30 36 You can also do things like transpose a matrix with t and take its determinant with det. Don’t worry if you’re not familiar with these operations. They are easy to look up, and you won’t need them for this book. Now that you can do math with your die object, let’s look at how you could “roll” it. Rolling your die will require something more sophisticated than basic arithmetic; you’ll need to randomly select one of the die’s values. And for that, you will need a function. 2.3 Functions R comes with many functions that you can use to do sophisticated tasks like random sampling. For example, you can round a number with the round function, or calculate its factorial with the factorial function. Using a function is pretty simple. Just write the name of the function and then the data you want the function to operate on in parentheses: round(3.1415) ## 3 factorial(3) ## 6 The data that you pass into the function is called the function’s argument. The argument can be raw data, an R object, or even the results of another R function. In this last case, R will work from the innermost function to the outermost, as in Figure 2.5. 
mean(1:6) ## 3.5 mean(die) ## 3.5 round(mean(die)) ## 4 Figure 2.5: When you link functions together, R will resolve them from the innermost operation to the outermost. Here R first looks up die, then calculates the mean of one through six, then rounds the mean. Lucky for us, there is an R function that can help “roll” the die. You can simulate a roll of the die with R’s sample function. sample takes two arguments: a vector named x and a number named size. sample will return size elements from the vector: sample(x = 1:4, size = 2) ## 3 2 To roll your die and get a number back, set x to die and sample one element from it. You’ll get a new (maybe different) number each time you roll it: sample(x = die, size = 1) ## 2 sample(x = die, size = 1) ## 1 sample(x = die, size = 1) ## 6 Many R functions take multiple arguments that help them do their job. You can give a function as many arguments as you like as long as you separate each argument with a comma. You may have noticed that I set die and 1 equal to the names of the arguments in sample, x and size. Every argument in every R function has a name. You can specify which data should be assigned to which argument by setting a name equal to data, as in the preceding code. This becomes important as you begin to pass multiple arguments to the same function; names help you avoid passing the wrong data to the wrong argument. However, using names is optional. You will notice that R users do not often use the name of the first argument in a function. So you might see the previous code written as: sample(die, size = 1) ## 2 Often, the name of the first argument is not very descriptive, and it is usually obvious what the first piece of data refers to anyways. But how do you know which argument names to use? 
If you try to use a name that a function does not expect, you will likely get an error: round(3.1415, corners = 2) ## Error in round(3.1415, corners = 2) : unused argument(s) (corners = 2) If you’re not sure which names to use with a function, you can look up the function’s arguments with args. To do this, place the name of the function in the parentheses behind args. For example, you can see that the round function takes two arguments, one named x and one named digits: args(round) ## function (x, digits = 0) ## NULL Did you notice that args shows that the digits argument of round is already set to 0? Frequently, an R function will take optional arguments like digits. These arguments are considered optional because they come with a default value. You can pass a new value to an optional argument if you want, and R will use the default value if you do not. For example, round will round your number to 0 digits past the decimal point by default. To override the default, supply your own value for digits: round(3.1415) ## 3 round(3.1415, digits = 2) ## 3.14 You should write out the names of each argument after the first one or two when you call a function with multiple arguments. Why? First, this will help you and others understand your code. It is usually obvious which argument your first input refers to (and sometimes the second input as well). However, you’d need a large memory to remember the third and fourth arguments of every R function. Second, and more importantly, writing out argument names prevents errors. If you do not write out the names of your arguments, R will match your values to the arguments in your function by order. For example, in the following code, the first value, die, will be matched to the first argument of sample, which is named x. The next value, 1, will be matched to the next argument, size: sample(die, 1) ## 2 As you provide more arguments, it becomes more likely that your order and R’s order may not align. 
As a result, values may get passed to the wrong argument. Argument names prevent this. R will always match a value to its argument name, no matter where it appears in the order of arguments: sample(size = 1, x = die) ## 2 2.3.1 Sample with Replacement If you set size = 2, you can almost simulate a pair of dice. Before we run that code, think for a minute why that might be the case. sample will return two numbers, one for each die: sample(die, size = 2) ## 3 4 I said this “almost” works because this method does something funny. If you use it many times, you’ll notice that the second die never has the same value as the first die, which means you’ll never roll something like a pair of threes or snake eyes. What is going on? By default, sample builds a sample without replacement. To see what this means, imagine that sample places all of the values of die in a jar or urn. Then imagine that sample reaches into the jar and pulls out values one by one to build its sample. Once a value has been drawn from the jar, sample sets it aside. The value doesn’t go back into the jar, so it cannot be drawn again. So if sample selects a six on its first draw, it will not be able to select a six on the second draw; six is no longer in the jar to be selected. Although sample creates its sample electronically, it follows this seemingly physical behavior. One side effect of this behavior is that each draw depends on the draws that come before it. In the real world, however, when you roll a pair of dice, each die is independent of the other. If the first die comes up six, it does not prevent the second die from coming up six. In fact, it doesn’t influence the second die in any way whatsoever. You can recreate this behavior in sample by adding the argument replace = TRUE: sample(die, size = 2, replace = TRUE) ## 5 5 The argument replace = TRUE causes sample to sample with replacement. 
Our jar example provides a good way to understand the difference between sampling with replacement and without. When sample uses replacement, it draws a value from the jar and records the value. Then it puts the value back into the jar. In other words, sample replaces each value after each draw. As a result, sample may select the same value on the second draw. Each value has a chance of being selected each time. It is as if every draw were the first draw. Sampling with replacement is an easy way to create independent random samples. Each value in your sample will be a sample of size one that is independent of the other values. This is the correct way to simulate a pair of dice: sample(die, size = 2, replace = TRUE) ## 2 4 Congratulate yourself; you’ve just run your first simulation in R! You now have a method for simulating the result of rolling a pair of dice. If you want to add up the dice, you can feed your result straight into the sum function: dice <- sample(die, size = 2, replace = TRUE) dice ## 2 4 sum(dice) ## 6 What would happen if you call dice multiple times? Would R generate a new pair of dice values each time? Let’s give it a try: dice ## 2 4 dice ## 2 4 dice ## 2 4 Nope. Each time you call dice, R will show you the result of that one time you called sample and saved the output to dice. R won’t rerun sample(die, 2, replace = TRUE) to create a new roll of the dice. This is a relief in a way. Once you save a set of results to an R object, those results do not change. Programming would be quite hard if the values of your objects changed each time you called them. However, it would be convenient to have an object that can re-roll the dice whenever you call it. You can make such an object by writing your own R function. 
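If you would like to check the difference between the two sampling modes for yourself, here is a minimal sketch. It assumes the die object from above, borrows replicate (a function introduced in Packages and Help Pages) to repeat each draw 1,000 times, and fixes an arbitrary seed with set.seed so the run is repeatable. The names without and with_repl are made up for this illustration:

```r
die <- 1:6
set.seed(1)  # arbitrary seed, chosen only so the run is repeatable

# Each call to sample() returns a pair, so replicate() builds a matrix
# with 2 rows (first die, second die) and 1,000 columns (one per roll).
without   <- replicate(1000, sample(die, size = 2))
with_repl <- replicate(1000, sample(die, size = 2, replace = TRUE))

# Count the rolls where both dice show the same value.
doubles_without <- sum(without[1, ] == without[2, ])
doubles_with    <- sum(with_repl[1, ] == with_repl[2, ])

doubles_without  # 0: without replacement, a value cannot be drawn twice
doubles_with     # some positive count; about one roll in six on average
```

Doubles never appear without replacement, which is exactly the "funny" behavior described above.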
2.4 Writing Your Own Functions To recap, you already have working R code that simulates rolling a pair of dice: die <- 1:6 dice <- sample(die, size = 2, replace = TRUE) sum(dice) You can retype this code into the console anytime you want to re-roll your dice. However, this is an awkward way to work with the code. It would be easier to use your code if you wrapped it into its own function, which is exactly what we’ll do now. We’re going to write a function named roll that you can use to roll your virtual dice. When you’re finished, the function will work like this: each time you call roll(), R will return the sum of rolling two dice: roll() ## 8 roll() ## 3 roll() ## 7 Functions may seem mysterious or fancy, but they are just another type of R object. Instead of containing data, they contain code. This code is stored in a special format that makes it easy to reuse the code in new situations. You can write your own functions by recreating this format. 2.4.1 The Function Constructor Every function in R has three basic parts: a name, a body of code, and a set of arguments. To make your own function, you need to replicate these parts and store them in an R object, which you can do with the function function. To do this, call function() and follow it with a pair of braces, {}: my_function <- function() {} function will build a function out of whatever R code you place between the braces. For example, you can turn your dice code into a function by calling: roll <- function() { die <- 1:6 dice <- sample(die, size = 2, replace = TRUE) sum(dice) } Notice that I indented each line of code between the braces. This makes the code easier for you and me to read but has no impact on how the code runs. R ignores spaces and line breaks and executes one complete expression at a time. Just hit the Enter key between each line after the first brace, {. R will wait for you to type the last brace, }, before it responds. Don’t forget to save the output of function to an R object. 
This object will become your new function. To use it, write the object’s name followed by an open and closed parenthesis: roll() ## 9 You can think of the parentheses as the “trigger” that causes R to run the function. If you type in a function’s name without the parentheses, R will show you the code that is stored inside the function. If you type in the name with the parentheses, R will run that code: roll ## function() { ## die <- 1:6 ## dice <- sample(die, size = 2, replace = TRUE) ## sum(dice) ## } roll() ## 6 The code that you place inside your function is known as the body of the function. When you run a function in R, R will execute all of the code in the body and then return the result of the last line of code. If the last line of code doesn’t return a value, neither will your function, so you want to ensure that your final line of code returns a value. One way to check this is to think about what would happen if you ran the body of code line by line in the command line. Would R display a result after the last line, or would it not? Here’s some code that would display a result: dice 1 + 1 sqrt(2) And here’s some code that would not: dice <- sample(die, size = 2, replace = TRUE) two <- 1 + 1 a <- sqrt(2) Do you notice the pattern? These lines of code do not return a value to the command line; they save a value to an object. 2.5 Arguments What if we removed one line of code from our function and changed the name die to bones, like this? roll2 <- function() { dice <- sample(bones, size = 2, replace = TRUE) sum(dice) } Now I’ll get an error when I run the function. The function needs the object bones to do its job, but there is no object named bones to be found: roll2() ## Error in sample(bones, size = 2, replace = TRUE) : ## object 'bones' not found You can supply bones when you call roll2 if you make bones an argument of the function. 
To do this, put the name bones in the parentheses that follow function when you define roll2: roll2 <- function(bones) { dice <- sample(bones, size = 2, replace = TRUE) sum(dice) } Now roll2 will work as long as you supply bones when you call the function. You can take advantage of this to roll different types of dice each time you call roll2. Dungeons and Dragons, here we come! Remember, we’re rolling pairs of dice: roll2(bones = 1:4) ## 3 roll2(bones = 1:6) ## 10 roll2(1:20) ## 31 Notice that roll2 will still give an error if you do not supply a value for the bones argument when you call roll2: roll2() ## Error in sample(bones, size = 2, replace = TRUE) : ## argument "bones" is missing, with no default You can prevent this error by giving the bones argument a default value. To do this, set bones equal to a value when you define roll2: roll2 <- function(bones = 1:6) { dice <- sample(bones, size = 2, replace = TRUE) sum(dice) } Now you can supply a new value for bones if you like, and roll2 will use the default if you do not: roll2() ## 9 You can give your functions as many arguments as you like. Just list their names, separated by commas, in the parentheses that follow function. When the function is run, R will replace each argument name in the function body with the value that the user supplies for the argument. If the user does not supply a value, R will replace the argument name with the argument’s default value (if you defined one). To summarize, function helps you construct your own R functions. You create a body of code for your function to run by writing code between the braces that follow function. You create arguments for your function to use by supplying their names in the parentheses that follow function. Finally, you give your function a name by saving its output to an R object, as shown in Figure 2.6. Once you’ve created your function, R will treat it like every other function in R. Think about how useful this is. 
Have you ever tried to create a new Excel option and add it to Microsoft’s menu bar? Or a new slide animation and add it to Powerpoint’s options? When you work with a programming language, you can do these types of things. As you learn to program in R, you will be able to create new, customized, reproducible tools for yourself whenever you like. Project 3: Slot Machine will teach you much more about writing functions in R. Figure 2.6: Every function in R has the same parts, and you can use function to create these parts. Assign the result to a name, so you can call the function later. 2.6 Scripts What if you want to edit roll2 again? You could go back and retype each line of code in roll2, but it would be so much easier if you had a draft of the code to start from. You can create a draft of your code as you go by using an R script. An R script is just a plain text file that you save R code in. You can open an R script in RStudio by going to File > New File > R script in the menu bar. RStudio will then open a fresh script above your console pane, as shown in Figure 2.7. I strongly encourage you to write and edit all of your R code in a script before you run it in the console. Why? This habit creates a reproducible record of your work. When you’re finished for the day, you can save your script and then use it to rerun your entire analysis the next day. Scripts are also very handy for editing and proofreading your code, and they make a nice copy of your work to share with others. To save a script, click the scripts pane, and then go to File > Save As in the menu bar. Figure 2.7: When you open an R Script (File > New File > R Script in the menu bar), RStudio creates a fourth pane above the console where you can write and edit your code. RStudio comes with many built-in features that make it easy to work with scripts. First, you can automatically execute a line of code in a script by clicking the Run button, as shown in Figure 2.8. 
R will run whichever line of code your cursor is on. If you have a whole section highlighted, R will run the highlighted code. Alternatively, you can run the entire script by clicking the Source button. Don’t like clicking buttons? You can use Control + Return as a shortcut for the Run button. On Macs, that would be Command + Return. Figure 2.8: You can run a highlighted portion of code in your script if you click the Run button at the top of the scripts pane. You can run the entire script by clicking the Source button. If you’re not convinced about scripts, you soon will be. It becomes a pain to write multi-line code in the console’s single-line command line. Let’s avoid that headache and open your first script now before we move to the next chapter. Extract function RStudio comes with a tool that can help you build functions. To use it, highlight the lines of code in your R script that you want to turn into a function. Then click Code > Extract Function in the menu bar. RStudio will ask you for a function name to use and then wrap your code in a function call. It will scan the code for undefined variables and use these as arguments. You may want to double-check RStudio’s work. It assumes that your code is correct, so if it does something surprising, you may have a problem in your code. 2.7 Summary You’ve covered a lot of ground already. You now have a virtual die stored in your computer’s memory, as well as your own R function that rolls a pair of dice. You’ve also begun speaking the R language. As you’ve seen, R is a language that you can use to talk to your computer. You write commands in R and run them at the command line for your computer to read. Your computer will sometimes talk back–for example, when you commit an error–but it usually just does what you ask and then displays the result. The two most important components of the R language are objects, which store data, and functions, which manipulate data. 
R also uses a host of operators like +, -, *, /, and <- to do basic tasks. As a data scientist, you will use R objects to store data in your computer’s memory, and you will use functions to automate tasks and do complicated calculations. We will examine objects in more depth later in Project 2: Playing Cards and dig further into functions in Project 3: Slot Machine. The vocabulary you have developed here will make each of those projects easier to understand. However, we’re not done with your dice yet. In Packages and Help Pages, you’ll run some simulations on your dice and build your first graphs in R. You’ll also look at two of the most useful components of the R language: R packages, which are collections of functions written by R’s talented community of developers, and R documentation, which is a collection of help pages built into R that explains every function and data set in the language. "], -["packages.html", "3 Packages and Help Pages 3.1 Packages 3.2 Getting Help with Help Pages 3.3 Summary 3.4 Project 1 Wrap-up", " 3 Packages and Help Pages You now have a function that simulates rolling a pair of dice. Let’s make things a little more interesting by weighting the dice in your favor. The house always wins, right? Let’s make the dice roll high numbers slightly more often than it rolls low numbers. Before we weight the dice, we should make sure that they are fair to begin with. Two tools will help you do this: repetition and visualization. By coincidence, these tools are also two of the most useful superpowers in the world of data science. We will repeat our dice rolls with a function called replicate, and we will visualize our rolls with a function called qplot. qplot does not come with R when you download it; qplot comes in a standalone R package. Many of the most useful R tools come in R packages, so let’s take a moment to look at what R packages are and how you can use them. 3.1 Packages You’re not the only person writing your own functions with R.
Many professors, programmers, and statisticians use R to design tools that can help people analyze data. They then make these tools free for anyone to use. To use these tools, you just have to download them. They come as preassembled collections of functions and objects called packages. Appendix 2: R Packages contains detailed instructions for downloading and updating R packages, but we’ll look at the basics here. We’re going to use the qplot function to make some quick plots. qplot comes in the ggplot2 package, a popular package for making graphs. Before you can use qplot, or anything else in the ggplot2 package, you need to download and install it. 3.1.1 install.packages Each R package is hosted at http://cran.r-project.org, the same website that hosts R. However, you don’t need to visit the website to download an R package; you can download packages straight from R’s command line. Here’s how: Open RStudio. Make sure you are connected to the Internet. Run install.packages("ggplot2") at the command line. That’s it. R will have your computer visit the website, download ggplot2, and install the package in your hard drive right where R wants to find it. You now have the ggplot2 package. If you would like to install another package, replace ggplot2 with your package name in the code. 3.1.2 library Installing a package doesn’t place its functions at your fingertips just yet: it simply places them in your hard drive. To use an R package, you next have to load it in your R session with the command library("ggplot2"). If you would like to load a different package, replace ggplot2 with your package name in the code. To see what this does, try an experiment. First, ask R to show you the qplot function. 
R won’t be able to find qplot because qplot lives in the ggplot2 package, which you haven’t loaded: qplot ## Error: object 'qplot' not found Now load the ggplot2 package: library("ggplot2") If you installed the package with install.packages as instructed, everything should go fine. Don’t worry if you don’t see any results or messages. No news is fine news when loading a package. Don’t worry if you do see a message either; ggplot2 sometimes displays helpful start up messages. As long as you do not see anything that says “Error,” you are doing fine. Now if you ask to see qplot, R will show you quite a bit of code (qplot is a long function): qplot ## (quite a bit of code) Appendix 2: R Packages contains many more details about acquiring and using packages. I recommend that you read it if you are unfamiliar with R’s package system. The main thing to remember is that you only need to install a package once, but you need to load it with library each time you wish to use it in a new R session. R will unload all of its packages each time you close RStudio. Now that you’ve loaded qplot, let’s take it for a spin. qplot makes “quick plots.” If you give qplot two vectors of equal lengths, qplot will draw a scatterplot for you. qplot will use the first vector as a set of x values and the second vector as a set of y values. Look for the plot to appear in the Plots tab of the bottom-right pane in your RStudio window. The following code will make the plot that appears in Figure 3.1. Until now, we’ve been creating sequences of numbers with the : operator; but you can also create vectors of numbers with the c function. Give c all of the numbers that you want to appear in the vector, separated by a comma. 
c stands for concatenate, but you can think of it as “collect” or “combine”: x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1) x ## -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 y <- x^3 y ## -1.000 -0.512 -0.216 -0.064 -0.008 0.000 0.008 ## 0.064 0.216 0.512 1.000 qplot(x, y) Figure 3.1: qplot makes a scatterplot when you give it two vectors. You don’t need to name your vectors x and y. I just did that to make the example clear. As you can see in Figure 3.1, a scatterplot is a set of points, each plotted according to its x and y values. Together, the vectors x and y describe a set of 11 points. How did R match up the values in x and y to make these points? With element-wise execution, as we saw in Figure 2.3. Scatterplots are useful for visualizing the relationship between two variables. However, we’re going to use a different type of graph, a histogram. A histogram visualizes the distribution of a single variable; it displays how many data points appear at each value of x. Let’s take a look at a histogram to see if this makes sense. qplot will make a histogram whenever you give it only one vector to plot. The following code makes the left-hand plot in Figure 3.2 (we’ll worry about the right-hand plot in just a second). To make sure our graphs look the same, use the extra argument binwidth = 1: x <- c(1, 2, 2, 2, 3, 3) qplot(x, binwidth = 1) Figure 3.2: qplot makes a histogram when you give it a single vector. This plot shows that our vector contains one value in the interval [1, 2) by placing a bar of height 1 above that interval. Similarly, the plot shows that the vector contains three values in the interval [2, 3) by placing a bar of height 3 in that interval. It shows that the vector contains two values in the interval [3, 4) by placing a bar of height 2 in that interval. In these intervals, the hard bracket, [, means that the first number is included in the interval. The parenthesis, ), means that the last number is not included.
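If the interval notation feels abstract, you can tally the same bins by hand. The following is a rough sketch in base R, not what qplot runs internally: it uses cut with right = FALSE to build left-closed, right-open intervals like the ones described above, and table to count the values in each. The name bins is made up for this illustration:

```r
x <- c(1, 2, 2, 2, 3, 3)

# breaks = 1:4 marks the bin edges at 1, 2, 3, and 4.
# right = FALSE includes the left edge of each bin and excludes the right
# edge, which gives the intervals [1,2), [2,3), and [3,4).
bins <- cut(x, breaks = 1:4, right = FALSE)

# Count how many values of x landed in each interval.
table(bins)
## bins
## [1,2) [2,3) [3,4) 
##     1     3     2
```

The counts 1, 3, and 2 match the bar heights in the left-hand plot of Figure 3.2.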
Let’s try another histogram. This code makes the right-hand plot in Figure 3.2. Notice that there are five points with a value of 1 in x2. The histogram displays this by plotting a bar of height 5 above the interval x2 = [1, 2): x2 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4) qplot(x2, binwidth = 1) Exercise 3.1 (Visualize a Histogram) Let x3 be the following vector: x3 <- c(0, 1, 1, 2, 2, 2, 3, 3, 4) Imagine what a histogram of x3 would look like. Assume that the histogram has a bin width of 1. How many bars will the histogram have? Where will they appear? How high will each be? When you are done, plot a histogram of x3 with binwidth = 1, and see if you are right. Solution. You can make a histogram of x3 with qplot(x3, binwidth = 1). The histogram will look like a symmetric pyramid. The middle bar will have a height of 3 and will appear above [2, 3), but be sure to try it and see for yourself. You can use a histogram to display visually how common different values of x are. Numbers covered by a tall bar are no more common than numbers covered by a short bar. How can you use a histogram to check the accuracy of your dice? Well, if you roll your dice many times and keep track of the results, you would expect some numbers to occur more than others. This is because there are more ways to get some numbers by adding two dice together than to get other numbers, as shown in Figure 3.3. If you roll your dice many times and plot the results with qplot, the histogram will show you how often each sum appeared. The sums that occurred most often will have the highest bars. The histogram should look like the pattern in Figure 3.3 if the dice are fairly weighted. This is where replicate comes in. replicate provides an easy way to repeat an R command many times. To use it, first give replicate the number of times you wish to repeat an R command, and then give it the command you wish to repeat. 
replicate will run the command multiple times and store the results as a vector: replicate(3, 1 + 1) ## 2 2 2 replicate(10, roll()) ## 3 7 5 3 6 2 3 8 11 7 Figure 3.3: Each individual dice combination should occur with the same frequency. As a result, some sums will occur more often than others. With fair dice, each sum should appear in proportion to the number of combinations that make it. A histogram of your first 10 rolls probably won’t look like the pattern shown in Figure 3.3. Why not? There is too much randomness involved. Remember that we use dice in real life because they are effective random number generators. Patterns of long run frequencies will only appear over the long run. So let’s simulate 10,000 dice rolls and plot the results. Don’t worry; qplot and replicate can handle it. The results appear in Figure 3.4: rolls <- replicate(10000, roll()) qplot(rolls, binwidth = 1) The results suggest that the dice are fair. Over the long run, each number occurs in proportion to the number of combinations that generate it. Now how can you bias these results? The previous pattern occurs because each underlying combination of dice (e.g., (3,4)) occurs with the same frequency. If you could increase the probability that a 6 is rolled on either die, then any combination with a six in it will occur more often than any combination without a six in it. The combination (6, 6) would occur most of all. This won’t make the dice add up to 12 more often than they add up to seven, but it will skew the results toward the higher numbers. Figure 3.4: The behavior of our dice suggests that they are fair. Seven occurs more often than any other number, and frequencies diminish in proportion to the number of die combinations that create each number. To put it another way, the probability of rolling any single number on a fair die is 1/6. 
I’d like you to change the probability to 1/8 for each number below six, and then increase the probability of rolling a six to 3/8: Number Fair probability Weighted probability 1 1/6 1/8 2 1/6 1/8 3 1/6 1/8 4 1/6 1/8 5 1/6 1/8 6 1/6 3/8 You can change the probabilities by adding a new argument to the sample function. I’m not going to tell you what the argument is; instead I’ll point you to the help page for the sample function. What’s that? R functions come with help pages? Yes they do, so let’s learn how to read one. 3.2 Getting Help with Help Pages There are over 1,000 functions at the core of R, and new R functions are created all of the time. This can be a lot of material to memorize and learn! Luckily, each R function comes with its own help page, which you can access by typing the function’s name after a question mark. For example, each of these commands will open a help page. Look for the pages to appear in the Help tab of RStudio’s bottom-right pane: ?sqrt ?log10 ?sample Help pages contain useful information about what each function does. These help pages also serve as code documentation, so reading them can be bittersweet. They often seem to be written for people who already understand the function and do not need help. Don’t let this bother you—you can gain a lot from a help page by scanning it for information that makes sense and glossing over the rest. This technique will inevitably bring you to the most helpful part of each help page: the bottom. Here, almost every help page includes some example code that puts the function in action. Running this code is a great way to learn by example. If a function comes in an R package, R won’t be able to find its help page unless the package is loaded. 3.2.1 Parts of a Help Page Each help page is divided into sections. Which sections appear can vary from help page to help page, but you can usually expect to find these useful topics: Description - A short summary of what the function does. 
Usage - An example of how you would type the function. Each argument of the function will appear in the order R expects you to supply it (if you don’t use argument names). Arguments - A list of each argument the function takes, what type of information R expects you to supply for the argument, and what the function will do with the information. Details - A more in-depth description of the function and how it operates. The details section also gives the function author a chance to alert you to anything you might want to know when using the function. Value - A description of what the function returns when you run it. See Also - A short list of related R functions. Examples - Example code that uses the function and is guaranteed to work. The examples section of a help page usually demonstrates a couple different ways to use a function. This helps give you an idea of what the function is capable of. If you’d like to look up the help page for a function but have forgotten the function’s name, you can search by keyword. To do this, type two question marks followed by a keyword in R’s command line. R will pull up a list of links to help pages related to the keyword. You can think of this as the help page for the help page: ??log Let’s take a stroll through sample’s help page. Remember: we’re searching for anything that could help you change the probabilities involved in the sampling process. I’m not going to reproduce the whole help page here (just the juiciest parts), so you should follow along on your computer. First, open the help page. It will appear in the same pane in RStudio as your plots did (but in the Help tab, not the Plots tab): ?sample What do you see? Starting from the top: Random Samples and Permutations Description sample takes a sample of the specified size from the elements of x using either with or without replacement. So far, so good. You knew all of that. The next section, Usage, has a possible clue. 
It mentions an argument called prob: Usage sample(x, size, replace = FALSE, prob = NULL) If you scroll down to the arguments section, the description of prob sounds very promising: A vector of probability weights for obtaining the elements of the vector being sampled. The Details section confirms our suspicions. In this case, it also tells you how to proceed: The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be nonnegative and not all zero. Although the help page does not say it here, these weights will be matched up to the elements being sampled in element-wise fashion. The first weight will describe the first element, the second weight the second element, and so on. This is common practice in R. Reading on: If replace is true, Walker's alias method (Ripley, 1987) is used... Okay, that looks like time to start skimming. We should have enough info now to figure out how to weight our dice. Exercise 3.2 (Roll a Pair of Dice) Rewrite the roll function below to roll a pair of weighted dice: roll <- function() { die <- 1:6 dice <- sample(die, size = 2, replace = TRUE) sum(dice) } You will need to add a prob argument to the sample function inside of roll. This argument should tell sample to sample the numbers one through five with probability 1/8 and the number 6 with probability 3/8. When you are finished, read on for a model answer. Solution. To weight your dice, you need to add a prob argument with a vector of weights to sample, like this: roll <- function() { die <- 1:6 dice <- sample(die, size = 2, replace = TRUE, prob = c(1/8, 1/8, 1/8, 1/8, 1/8, 3/8)) sum(dice) } This will cause roll to pick 1 through 5 with probability 1/8 and 6 with probability 3/8. Overwrite your previous version of roll with the new function (by running the previous code snippet in your command line). Then visualize the new long-term behavior of your dice.
I’ve put the results in Figure 3.5 next to our original results: rolls <- replicate(10000, roll()) qplot(rolls, binwidth = 1) This confirms that we’ve effectively weighted the dice. High numbers occur much more often than low numbers. The remarkable thing is that this behavior will only be apparent when you examine long-term frequencies. On any single roll, the dice will appear to behave randomly. This is great news if you play Settlers of Catan (just tell your friends you lost the dice), but it should be disturbing if you analyze data, because it means that bias can easily occur without anyone noticing it in the short run. Figure 3.5: The dice are now clearly biased towards high numbers, since high sums occur much more often than low sums. 3.2.2 Getting More Help R also comes with a super active community of users that you can turn to for help on the R-help mailing list. You can email the list with questions, but there’s a great chance that your question has already been answered. Find out by searching the archives. Even better than the R-help list is Stack Overflow, a website that allows programmers to answer questions and users to rank answers based on helpfulness. Personally, I find the Stack Overflow format to be more user-friendly than the R-help email list (and the respondents to be more human friendly). You can submit your own question or search through Stack Overflow’s previously answered questions related to R. There are over 30,000. Best of all is community.rstudio.com, a friendly, inclusive place to share questions related to R. community.rstudio.com is a very active forum focused on R. Don’t be surprised if you ask a question about an R package, and the author of the package shows up to answer. For all of the R help list, Stack Overflow, and community.rstudio.com, you’re more likely to get a useful answer if you provide a reproducible example with your question. 
This means pasting in a short snippet of code that users can run to arrive at the bug or question you have in mind. 3.3 Summary R’s packages and help pages can make you a more productive programmer. You saw in The Very Basics that R gives you the power to write your own functions that do specific things, but often the function that you want to write will already exist in an R package. Professors, programmers, and scientists have developed over 13,000 packages for you to use, which can save you valuable programming time. To use a package, you need to install it to your computer once with install.packages, and then load it into each new R session with library. R’s help pages will help you master the functions that appear in R and its packages. Each function and data set in R has its own help page. Although help pages often contain advanced content, they also contain valuable clues and examples that can help you learn how to use a function. You have now seen enough of R to learn by doing, which is the best way to learn R. You can make your own R commands, run them, and get help when you need to understand something that I have not explained. I encourage you to experiment with your own ideas in R as you read through the next two projects. 3.4 Project 1 Wrap-up You’ve done more in this project than enable fraud and gambling; you’ve also learned how to speak to your computer in the language of R. R is a language like English, Spanish, or German, except R helps you talk to computers, not humans. You’ve met the nouns of the R language, objects. And hopefully you guessed that functions are the verbs (I suppose function arguments would be the adverbs). When you combine functions and objects, you express a complete thought. By stringing thoughts together in a logical sequence, you can build eloquent, even artistic statements. In that respect, R is not that different than any other language. 
R shares another characteristic of human languages: you won’t feel very comfortable speaking R until you build up a vocabulary of R commands to use. Fortunately, you don’t have to be bashful. Your computer will be the only one to “hear” you speak R. Your computer is not very forgiving, but it also doesn’t judge. Not that you need to worry; you’ll broaden your R vocabulary tremendously between here and the end of the book. Now that you can use R, it is time to become an expert at using R to do data science. The foundation of data science is the ability to store large amounts of data and recall values on demand. From this, all else follows—manipulating data, visualizing data, modeling data, and more. However, you cannot easily store a data set in your mind by memorizing it. Nor can you easily store a data set on paper by writing it down. The only efficient way to store large amounts of data is with a computer. In fact, computers are so efficient that their development over the last three decades has completely changed the type of data we can accumulate and the methods we can use to analyze it. In short, computer data storage has driven the revolution in science that we call data science. Project 2: Playing Cards will make you part of this revolution by teaching you how to use R to store data sets in your computer’s memory and how to retrieve and manipulate data once it’s there. "], +["packages.html", "3 Packages and Help Pages 3.1 Packages 3.2 Getting Help with Help Pages 3.3 Summary 3.4 Project 1 Wrap-up", " 3 Packages and Help Pages You now have a function that simulates rolling a pair of dice. Let’s make things a little more interesting by weighting the dice in your favor. The house always wins, right? Let’s make the dice roll high numbers slightly more often than it rolls low numbers. Before we weight the dice, we should make sure that they are fair to begin with. Two tools will help you do this: repetition and visualization. 
By coincidence, these tools are also two of the most useful superpowers in the world of data science. We will repeat our dice rolls with a function called replicate, and we will visualize our rolls with a function called qplot. qplot does not come with R when you download it; qplot comes in a standalone R package. Many of the most useful R tools come in R packages, so let’s take a moment to look at what R packages are and how you can use them. 3.1 Packages You’re not the only person writing your own functions with R. Many professors, programmers, and statisticians use R to design tools that can help people analyze data. They then make these tools free for anyone to use. To use these tools, you just have to download them. They come as preassembled collections of functions and objects called packages. Appendix 2: R Packages contains detailed instructions for downloading and updating R packages, but we’ll look at the basics here. We’re going to use the qplot function to make some quick plots. qplot comes in the ggplot2 package, a popular package for making graphs. Before you can use qplot, or anything else in the ggplot2 package, you need to download and install it. 3.1.1 install.packages Each R package is hosted at http://cran.r-project.org, the same website that hosts R. However, you don’t need to visit the website to download an R package; you can download packages straight from R’s command line. Here’s how: Open RStudio. Make sure you are connected to the Internet. Run install.packages("ggplot2") at the command line. That’s it. R will have your computer visit the website, download ggplot2, and install the package in your hard drive right where R wants to find it. You now have the ggplot2 package. If you would like to install another package, replace ggplot2 with your package name in the code. 3.1.2 library Installing a package doesn’t place its functions at your fingertips just yet: it simply places them in your hard drive. 
To use an R package, you next have to load it in your R session with the command library("ggplot2"). If you would like to load a different package, replace ggplot2 with your package name in the code. To see what this does, try an experiment. First, ask R to show you the qplot function. R won’t be able to find qplot because qplot lives in the ggplot2 package, which you haven’t loaded: qplot ## Error: object 'qplot' not found Now load the ggplot2 package: library("ggplot2") If you installed the package with install.packages as instructed, everything should go fine. Don’t worry if you don’t see any results or messages. No news is fine news when loading a package. Don’t worry if you do see a message either; ggplot2 sometimes displays helpful start up messages. As long as you do not see anything that says “Error,” you are doing fine. Now if you ask to see qplot, R will show you quite a bit of code (qplot is a long function): qplot ## (quite a bit of code) Appendix 2: R Packages contains many more details about acquiring and using packages. I recommend that you read it if you are unfamiliar with R’s package system. The main thing to remember is that you only need to install a package once, but you need to load it with library each time you wish to use it in a new R session. R will unload all of its packages each time you close RStudio. Now that you’ve loaded qplot, let’s take it for a spin. qplot makes “quick plots.” If you give qplot two vectors of equal lengths, qplot will draw a scatterplot for you. qplot will use the first vector as a set of x values and the second vector as a set of y values. Look for the plot to appear in the Plots tab of the bottom-right pane in your RStudio window. The following code will make the plot that appears in Figure 3.1. Until now, we’ve been creating sequences of numbers with the : operator; but you can also create vectors of numbers with the c function. 
Give c all of the numbers that you want to appear in the vector, separated by commas. c stands for concatenate, but you can think of it as “collect” or “combine”: x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1) x ## -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 y <- x^3 y ## -1.000 -0.512 -0.216 -0.064 -0.008 0.000 0.008 ## 0.064 0.216 0.512 1.000 qplot(x, y) Figure 3.1: qplot makes a scatterplot when you give it two vectors. You don’t need to name your vectors x and y. I just did that to make the example clear. As you can see in Figure 3.1, a scatterplot is a set of points, each plotted according to its x and y values. Together, the vectors x and y describe a set of 11 points. How did R match up the values in x and y to make these points? With element-wise execution, as we saw in Figure 2.3. Scatterplots are useful for visualizing the relationship between two variables. However, we’re going to use a different type of graph, a histogram. A histogram visualizes the distribution of a single variable; it displays how many data points appear at each value of x. Let’s take a look at a histogram to see if this makes sense. qplot will make a histogram whenever you give it only one vector to plot. The following code makes the left-hand plot in Figure 3.2 (we’ll worry about the right-hand plot in just a second). To make sure our graphs look the same, use the extra argument binwidth = 1: x <- c(1, 2, 2, 2, 3, 3) qplot(x, binwidth = 1) Figure 3.2: qplot makes a histogram when you give it a single vector. This plot shows that our vector contains one value in the interval [1, 2) by placing a bar of height 1 above that interval. Similarly, the plot shows that the vector contains three values in the interval [2, 3) by placing a bar of height 3 in that interval. It shows that the vector contains two values in the interval [3, 4) by placing a bar of height 2 in that interval. 
In these intervals, the hard bracket, [, means that the first number is included in the interval. The parenthesis, ), means that the last number is not included. Let’s try another histogram. This code makes the right-hand plot in Figure 3.2. Notice that there are five points with a value of 1 in x2. The histogram displays this by plotting a bar of height 5 above the interval x2 = [1, 2): x2 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4) qplot(x2, binwidth = 1) Exercise 3.1 (Visualize a Histogram) Let x3 be the following vector: x3 <- c(0, 1, 1, 2, 2, 2, 3, 3, 4) Imagine what a histogram of x3 would look like. Assume that the histogram has a bin width of 1. How many bars will the histogram have? Where will they appear? How high will each be? When you are done, plot a histogram of x3 with binwidth = 1, and see if you are right. Solution. You can make a histogram of x3 with qplot(x3, binwidth = 1). The histogram will look like a symmetric pyramid. The middle bar will have a height of 3 and will appear above [2, 3), but be sure to try it and see for yourself. You can use a histogram to display visually how common different values of x are. Numbers covered by a tall bar are more common than numbers covered by a short bar. How can you use a histogram to check the accuracy of your dice? Well, if you roll your dice many times and keep track of the results, you would expect some numbers to occur more than others. This is because there are more ways to get some numbers by adding two dice together than to get other numbers, as shown in Figure 3.3. If you roll your dice many times and plot the results with qplot, the histogram will show you how often each sum appeared. The sums that occurred most often will have the highest bars. The histogram should look like the pattern in Figure 3.3 if the dice are fairly weighted. This is where replicate comes in. replicate provides an easy way to repeat an R command many times. 
To use it, first give replicate the number of times you wish to repeat an R command, and then give it the command you wish to repeat. replicate will run the command multiple times and store the results as a vector: replicate(3, 1 + 1) ## 2 2 2 replicate(10, roll()) ## 3 7 5 3 6 2 3 8 11 7 Figure 3.3: Each individual dice combination should occur with the same frequency. As a result, some sums will occur more often than others. With fair dice, each sum should appear in proportion to the number of combinations that make it. A histogram of your first 10 rolls probably won’t look like the pattern shown in Figure 3.3. Why not? There is too much randomness involved. Remember that we use dice in real life because they are effective random number generators. Patterns of long run frequencies will only appear over the long run. So let’s simulate 10,000 dice rolls and plot the results. Don’t worry; qplot and replicate can handle it. The results appear in Figure 3.4: rolls <- replicate(10000, roll()) qplot(rolls, binwidth = 1) The results suggest that the dice are fair. Over the long run, each number occurs in proportion to the number of combinations that generate it. Now how can you bias these results? The previous pattern occurs because each underlying combination of dice (e.g., (3,4)) occurs with the same frequency. If you could increase the probability that a 6 is rolled on either die, then any combination with a six in it will occur more often than any combination without a six in it. The combination (6, 6) would occur most of all. This won’t make the dice add up to 12 more often than they add up to seven, but it will skew the results toward the higher numbers. Figure 3.4: The behavior of our dice suggests that they are fair. Seven occurs more often than any other number, and frequencies diminish in proportion to the number of die combinations that create each number. To put it another way, the probability of rolling any single number on a fair die is 1/6. 
I’d like you to change the probability to 1/8 for each number below six, and then increase the probability of rolling a six to 3/8: Number Fair probability Weighted probability 1 1/6 1/8 2 1/6 1/8 3 1/6 1/8 4 1/6 1/8 5 1/6 1/8 6 1/6 3/8 You can change the probabilities by adding a new argument to the sample function. I’m not going to tell you what the argument is; instead I’ll point you to the help page for the sample function. What’s that? R functions come with help pages? Yes they do, so let’s learn how to read one. 3.2 Getting Help with Help Pages There are over 1,000 functions at the core of R, and new R functions are created all of the time. This can be a lot of material to memorize and learn! Luckily, each R function comes with its own help page, which you can access by typing the function’s name after a question mark. For example, each of these commands will open a help page. Look for the pages to appear in the Help tab of RStudio’s bottom-right pane: ?sqrt ?log10 ?sample Help pages contain useful information about what each function does. These help pages also serve as code documentation, so reading them can be bittersweet. They often seem to be written for people who already understand the function and do not need help. Don’t let this bother you—you can gain a lot from a help page by scanning it for information that makes sense and glossing over the rest. This technique will inevitably bring you to the most helpful part of each help page: the bottom. Here, almost every help page includes some example code that puts the function in action. Running this code is a great way to learn by example. If a function comes in an R package, R won’t be able to find its help page unless the package is loaded. 3.2.1 Parts of a Help Page Each help page is divided into sections. Which sections appear can vary from help page to help page, but you can usually expect to find these useful topics: Description - A short summary of what the function does. 
Usage - An example of how you would type the function. Each argument of the function will appear in the order R expects you to supply it (if you don’t use argument names). Arguments - A list of each argument the function takes, what type of information R expects you to supply for the argument, and what the function will do with the information. Details - A more in-depth description of the function and how it operates. The details section also gives the function author a chance to alert you to anything you might want to know when using the function. Value - A description of what the function returns when you run it. See Also - A short list of related R functions. Examples - Example code that uses the function and is guaranteed to work. The examples section of a help page usually demonstrates a couple different ways to use a function. This helps give you an idea of what the function is capable of. If you’d like to look up the help page for a function but have forgotten the function’s name, you can search by keyword. To do this, type two question marks followed by a keyword in R’s command line. R will pull up a list of links to help pages related to the keyword. You can think of this as the help page for the help page: ??log Let’s take a stroll through sample’s help page. Remember: we’re searching for anything that could help you change the probabilities involved in the sampling process. I’m not going to reproduce the whole help page here (just the juiciest parts), so you should follow along on your computer. First, open the help page. It will appear in the same pane in RStudio as your plots did (but in the Help tab, not the Plots tab): ?sample What do you see? Starting from the top: Random Samples and Permutations Description sample takes a sample of the specified size from the elements of x using either with or without replacement. So far, so good. You knew all of that. The next section, Usage, has a possible clue. 
It mentions an argument called prob: Usage sample(x, size, replace = FALSE, prob = NULL) If you scroll down to the arguments section, the description of prob sounds very promising: A vector of probability weights for obtaining the elements of the vector being sampled. The Details section confirms our suspicions. In this case, it also tells you how to proceed: The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be nonnegative and not all zero. Although the help page does not say it here, these weights will be matched up to the elements being sampled in element-wise fashion. The first weight will describe the first element, the second weight the second element, and so on. This is common practice in R. Reading on: If replace is true, Walker's alias method (Ripley, 1987) is used... Okay, that looks like time to start skimming. We should have enough info now to figure out how to weight our dice. Exercise 3.2 (Roll a Pair of Dice) Rewrite the roll function below to roll a pair of weighted dice: roll <- function() { die <- 1:6 dice <- sample(die, size = 2, replace = TRUE) sum(dice) } You will need to add a prob argument to the sample function inside of roll. This argument should tell sample to sample the numbers one through five with probability 1/8 and the number 6 with probability 3/8. When you are finished, read on for a model answer. Solution. To weight your dice, you need to add a prob argument with a vector of weights to sample, like this: roll <- function() { die <- 1:6 dice <- sample(die, size = 2, replace = TRUE, prob = c(1/8, 1/8, 1/8, 1/8, 1/8, 3/8)) sum(dice) } This will cause roll to pick 1 through 5 with probability 1/8 and 6 with probability 3/8. Overwrite your previous version of roll with the new function (by running the previous code snippet in your command line). Then visualize the new long-term behavior of your dice. 
I’ve put the results in Figure 3.5 next to our original results: rolls <- replicate(10000, roll()) qplot(rolls, binwidth = 1) This confirms that we’ve effectively weighted the dice. High numbers occur much more often than low numbers. The remarkable thing is that this behavior will only be apparent when you examine long-term frequencies. On any single roll, the dice will appear to behave randomly. This is great news if you play Settlers of Catan (just tell your friends you lost the dice), but it should be disturbing if you analyze data, because it means that bias can easily occur without anyone noticing it in the short run. Figure 3.5: The dice are now clearly biased towards high numbers, since high sums occur much more often than low sums. 3.2.2 Getting More Help R also comes with a super active community of users that you can turn to for help on the R-help mailing list. You can email the list with questions, but there’s a great chance that your question has already been answered. Find out by searching the archives. Even better than the R-help list is Stack Overflow, a website that allows programmers to answer questions and users to rank answers based on helpfulness. Personally, I find the Stack Overflow format to be more user-friendly than the R-help email list (and the respondents to be more human friendly). You can submit your own question or search through Stack Overflow’s previously answered questions related to R. There are over 30,000. Best of all is community.rstudio.com, a friendly, inclusive place to share questions related to R. community.rstudio.com is a very active forum focused on R. Don’t be surprised if you ask a question about an R package, and the author of the package shows up to answer. For all of the R help list, Stack Overflow, and community.rstudio.com, you’re more likely to get a useful answer if you provide a reproducible example with your question. 
This means pasting in a short snippet of code that users can run to arrive at the bug or question you have in mind. 3.3 Summary R’s packages and help pages can make you a more productive programmer. You saw in The Very Basics that R gives you the power to write your own functions that do specific things, but often the function that you want to write will already exist in an R package. Professors, programmers, and scientists have developed over 13,000 packages for you to use, which can save you valuable programming time. To use a package, you need to install it to your computer once with install.packages, and then load it into each new R session with library. R’s help pages will help you master the functions that appear in R and its packages. Each function and data set in R has its own help page. Although help pages often contain advanced content, they also contain valuable clues and examples that can help you learn how to use a function. You have now seen enough of R to learn by doing, which is the best way to learn R. You can make your own R commands, run them, and get help when you need to understand something that I have not explained. I encourage you to experiment with your own ideas in R as you read through the next two projects. 3.4 Project 1 Wrap-up You’ve done more in this project than enable fraud and gambling; you’ve also learned how to speak to your computer in the language of R. R is a language like English, Spanish, or German, except R helps you talk to computers, not humans. You’ve met the nouns of the R language, objects. And hopefully you guessed that functions are the verbs (I suppose function arguments would be the adverbs). When you combine functions and objects, you express a complete thought. By stringing thoughts together in a logical sequence, you can build eloquent, even artistic statements. In that respect, R is not that different than any other language. 
R shares another characteristic of human languages: you won’t feel very comfortable speaking R until you build up a vocabulary of R commands to use. Fortunately, you don’t have to be bashful. Your computer will be the only one to “hear” you speak R. Your computer is not very forgiving, but it also doesn’t judge. Not that you need to worry; you’ll broaden your R vocabulary tremendously between here and the end of the book. Now that you can use R, it is time to become an expert at using R to do data science. The foundation of data science is the ability to store large amounts of data and recall values on demand. From this, all else follows—manipulating data, visualizing data, modeling data, and more. However, you cannot easily store a data set in your mind by memorizing it. Nor can you easily store a data set on paper by writing it down. The only efficient way to store large amounts of data is with a computer. In fact, computers are so efficient that their development over the last three decades has completely changed the type of data we can accumulate and the methods we can use to analyze it. In short, computer data storage has driven the revolution in science that we call data science. Project 2: Playing Cards will make you part of this revolution by teaching you how to use R to store data sets in your computer’s memory and how to retrieve and manipulate data once it’s there. "], ["project-2-playing-cards.html", "4 Project 2: Playing Cards", " 4 Project 2: Playing Cards This project–which spans the next four chapters–will teach you how to store, retrieve, and change data values in your computer’s memory. These skills will help you save and manage data without accumulating errors. In the project, you’ll design a deck of playing cards that you can shuffle and deal from. Best of all, the deck will remember which cards have been dealt–just like a real deck. You can use the deck to play card games, tell fortunes, and test card-counting strategies. 
Along the way, you will learn how to: Save new types of data, like character strings and logical values Save a data set as a vector, matrix, array, list, or data frame Load and save your own data sets with R Extract individual values from a data set Change individual values within a data set Write logical tests Use R’s missing-value symbol, NA To keep the project simple, I’ve divided it into four tasks. Each task will teach you a new skill for managing data with R: Task 1: build the deck In R Objects, you will design and build a virtual deck of playing cards. This will be a complete data set, just like the ones you will use as a data scientist. You’ll need to know how to use R’s data types and data structures to make this work. Task 2: write functions that deal and shuffle Next, in R Notation, you will write two functions to use with the deck. One function will deal cards from the deck, and the other will reshuffle the deck. To write these functions, you’ll need to know how to extract values from a data set with R. Task 3: change the point system to suit your game In Modifying Values, you will use R’s notation system to change the point values of your cards to match the card games you may wish to play, like war, hearts, or blackjack. This will help you change values in place in existing data sets. Task 4: manage the state of the deck Finally, in Environments, you will make sure that your deck remembers which cards it has dealt. This is an advanced task, and it will introduce R’s environment system and scoping rules. To do it successfully, you will need to learn the minute details of how R looks up and uses the data that you have stored in your computer. "], ["r-objects.html", "5 R Objects 5.1 Atomic Vectors 5.2 Attributes 5.3 Matrices 5.4 Arrays 5.5 Class 5.6 Coercion 5.7 Lists 5.8 Data Frames 5.9 Loading Data 5.10 Saving Data 5.11 Summary", " 5 R Objects In this chapter, you’ll use R to assemble a deck of 52 playing cards. 
You’ll start by building simple R objects that represent playing cards and then work your way up to a full-blown table of data. In short, you’ll build the equivalent of an Excel spreadsheet from scratch. When you are finished, your deck of cards will look something like this: face suit value king spades 13 queen spades 12 jack spades 11 ten spades 10 nine spades 9 eight spades 8 ... Do you need to build a data set from scratch to use it in R? Not at all. You can load most data sets into R with one simple step, see Loading Data. But this exercise will teach you how R stores data, and how you can assemble—or disassemble—your own data sets. You will also learn about the various types of objects available for you to use in R (not all R objects are the same!). Consider this exercise a rite of passage; by doing it, you will become an expert on storing data in R. We’ll start with the very basics. The most simple type of object in R is an atomic vector. Atomic vectors are not nuclear powered, but they are very simple and they do show up everywhere. If you look closely enough, you’ll see that most structures in R are built from atomic vectors. 5.1 Atomic Vectors An atomic vector is just a simple vector of data. In fact, you’ve already made an atomic vector, your die object from Project 1: Weighted Dice. You can make an atomic vector by grouping some values of data together with c: die <- c(1, 2, 3, 4, 5, 6) die ## 1 2 3 4 5 6 is.vector(die) ## TRUE is.vector is.vector tests whether an object is an atomic vector. It returns TRUE if the object is an atomic vector and FALSE otherwise. You can also make an atomic vector with just one value. R saves single values as an atomic vector of length 1: five <- 5 five ## 5 is.vector(five) ## TRUE length(five) ## 1 length(die) ## 6 length length returns the length of an atomic vector. Each atomic vector stores its values as a one-dimensional vector, and each atomic vector can only store one type of data. 
You can save different types of data in R by using different types of atomic vectors. Altogether, R recognizes six basic types of atomic vectors: doubles, integers, characters, logicals, complex, and raw. To create your card deck, you will need to use different types of atomic vectors to save different types of information (text and numbers). You can do this by using some simple conventions when you enter your data. For example, you can create an integer vector by including a capital L with your input. You can create a character vector by surrounding your input in quotation marks: int <- 1L text <- "ace" Each type of atomic vector has its own convention (described below). R will recognize the convention and use it to create an atomic vector of the appropriate type. If you’d like to make atomic vectors that have more than one element in them, you can combine an element with the c function from Packages and Help Pages. Use the same convention with each element: int <- c(1L, 5L) text <- c("ace", "hearts") You may wonder why R uses multiple types of vectors. Vector types help R behave as you would expect. For example, R will do math with atomic vectors that contain numbers, but not with atomic vectors that contain character strings: sum(int) ## 6 sum(text) ## Error in sum(text) : invalid 'type' (character) of argument But we’re getting ahead of ourselves! Get ready to say hello to the six types of atomic vectors in R. 5.1.1 Doubles A double vector stores regular numbers. The numbers can be positive or negative, large or small, and have digits to the right of the decimal place or not. In general, R will save any number that you type in R as a double. So, for example, the die you made in Project 1: Weighted Dice was a double object: die <- c(1, 2, 3, 4, 5, 6) die ## 1 2 3 4 5 6 You’ll usually know what type of object you are working with in R (it will be obvious), but you can also ask R what type of object an object is with typeof. 
For example: typeof(die) ## "double" Some R functions refer to doubles as “numerics,” and I will often do the same. Double is a computer science term. It refers to the specific number of bytes your computer uses to store a number, but I find “numeric” to be much more intuitive when doing data science. 5.1.2 Integers Integer vectors store integers, numbers that can be written without a decimal component. As a data scientist, you won’t use the integer type very often because you can save integers as a double object. You can specifically create an integer in R by typing a number followed by an uppercase L. For example: int <- c(-1L, 2L, 4L) int ## -1 2 4 typeof(int) ## "integer" Note that R won’t save a number as an integer unless you include the L. Integer numbers without the L will be saved as doubles. The only difference between 4 and 4L is how R saves the number in your computer’s memory. Integers are defined more precisely in your computer’s memory than doubles (unless the integer is very large or small). Why would you save your data as an integer instead of a double? Sometimes a difference in precision can have surprising effects. Your computer allocates 64 bits of memory to store each double in an R program. This allows a lot of precision, but some numbers cannot be expressed exactly in 64 bits, the equivalent of a sequence of 64 ones and zeroes. For example, the number \\(\\pi\\) contains an endless sequence of digits to the right of the decimal place. Your computer must round \\(\\pi\\) to something close to, but not exactly equal to, \\(\\pi\\) to store \\(\\pi\\) in its memory. Many decimal numbers share a similar fate. As a result, each double is accurate to about 16 significant digits. This introduces a little bit of error. In most cases, this rounding error will go unnoticed. However, in some situations, the rounding error can cause surprising results. 
For example, you may expect the result of the expression below to be zero, but it is not: sqrt(2)^2 - 2 ## 4.440892e-16 The square root of two cannot be expressed exactly in 16 significant digits. As a result, R has to round the quantity, and the expression resolves to something very close to—but not quite—zero. These errors are known as floating-point errors, and doing arithmetic in these conditions is known as floating-point arithmetic. Floating-point arithmetic is not a feature of R; it is a feature of computer programming. Usually floating-point errors won’t be enough to ruin your day. Just keep in mind that they may be the cause of surprising results. You can avoid floating-point errors by avoiding decimals and only using integers. However, this is not an option in most data-science situations. You cannot do much math with integers before you need a noninteger to express the result. Luckily, the errors caused by floating-point arithmetic are usually insignificant (and when they are not, they are easy to spot). As a result, you’ll generally use doubles instead of integers as a data scientist. 5.1.3 Characters A character vector stores small pieces of text. You can create a character vector in R by typing a character or string of characters surrounded by quotes: text <- c("Hello", "World") text ## "Hello" "World" typeof(text) ## "character" typeof("Hello") ## "character" The individual elements of a character vector are known as strings. Note that a string can contain more than just letters. You can assemble a character string from numbers or symbols as well. Exercise 5.1 (Character or Number?) Can you spot the difference between a character string and a number? Here’s a test: Which of these are character strings and which are numbers? 1, "1", "one". Solution. "1" and "one" are both character strings. Character strings can contain number characters, but that doesn’t make them numeric. They’re just strings that happen to have numbers in them. 
You can tell strings from real numbers because strings come surrounded by quotes. In fact, anything surrounded by quotes in R will be treated as a character string, no matter what appears between the quotes. It is easy to confuse R objects with character strings. Why? Because both appear as pieces of text in R code. For example, x is the name of an R object named “x,” while "x" is a character string that contains the character “x.” One is an object that contains raw data, the other is a piece of raw data itself. Expect an error whenever you forget your quotation marks; R will start looking for an object that probably does not exist. 5.1.4 Logicals Logical vectors store TRUEs and FALSEs, R’s form of Boolean data. Logicals are very helpful for doing things like comparisons: 3 > 4 ## FALSE Any time you type TRUE or FALSE in capital letters (without quotation marks), R will treat your input as logical data. R also assumes that T and F are shorthand for TRUE and FALSE, unless they are defined elsewhere (e.g. T <- 500). Since the meaning of T and F can change, it’s best to stick with TRUE and FALSE: logic <- c(TRUE, FALSE, TRUE) logic ## TRUE FALSE TRUE typeof(logic) ## "logical" typeof(F) ## "logical" 5.1.5 Complex and Raw Doubles, integers, characters, and logicals are the most common types of atomic vectors in R, but R also recognizes two more types: complex and raw. It is doubtful that you will ever use these to analyze data, but here they are for the sake of thoroughness. Complex vectors store complex numbers. To create a complex vector, add an imaginary term to a number with i: comp <- c(1 + 1i, 1 + 2i, 1 + 3i) comp ## 1+1i 1+2i 1+3i typeof(comp) ## "complex" Raw vectors store raw bytes of data. Making raw vectors gets complicated, but you can make an empty raw vector of length n with raw(n). 
See the help page of raw for more options when working with this type of data: raw(3) ## 00 00 00 typeof(raw(3)) ## "raw" Exercise 5.2 (Vector of Cards) Create an atomic vector that stores just the face names of the cards in a royal flush, for example, the ace of spades, king of spades, queen of spades, jack of spades, and ten of spades. The face name of the ace of spades would be “ace,” and “spades” is the suit. Which type of vector will you use to save the names? Solution. A character vector is the most appropriate type of atomic vector in which to save card names. You can create one with the c function if you surround each name with quotation marks: hand <- c("ace", "king", "queen", "jack", "ten") hand ## "ace" "king" "queen" "jack" "ten" typeof(hand) ## "character" This creates a one-dimensional group of card names—great job! Now let’s make a more sophisticated data structure, a two-dimensional table of card names and suits. You can build a more sophisticated object from an atomic vector by giving it some attributes and assigning it a class. 5.2 Attributes An attribute is a piece of information that you can attach to an atomic vector (or any R object). The attribute won’t affect any of the values in the object, and it will not appear when you display your object. You can think of an attribute as “metadata”; it is just a convenient place to put information associated with an object. R will normally ignore this metadata, but some R functions will check for specific attributes. These functions may use the attributes to do special things with the data. You can see which attributes an object has with attributes. attributes will return NULL if an object has no attributes. An atomic vector, like die, won’t have any attributes unless you give it some: attributes(die) ## NULL NULL R uses NULL to represent the null set, an empty object. NULL is often returned by functions whose values are undefined. You can create a NULL object by typing NULL in capital letters. 
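As a minimal sketch of the idea that an attribute is just metadata attached to an object, the following uses base R's attr function to add one to a copy of die. It assumes die holds the numbers 1 through 6, as in the earlier chapters; the names die_copy and "source" are made up for illustration and do not appear elsewhere in the book.

```r
# Assumption: die is the six-value vector used throughout the book.
die <- c(1, 2, 3, 4, 5, 6)

# Work on a copy so that die itself stays free of attributes.
die_copy <- die

# A fresh atomic vector carries no attributes.
attributes(die_copy)

# Attach a piece of metadata; "source" is a hypothetical attribute name.
attr(die_copy, "source") <- "hand-typed"

# The attribute is now stored alongside the values...
attributes(die_copy)

# ...but the values themselves are unchanged.
sum(die_copy)
```

Working on a copy keeps die unchanged, so the results shown for die in the following sections still hold.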
5.2.1 Names The most common attributes to give an atomic vector are names, dimensions (dim), and classes. Each of these attributes has its own helper function that you can use to give attributes to an object. You can also use the helper functions to look up the value of these attributes for objects that already have them. For example, you can look up the value of the names attribute of die with names: names(die) ## NULL NULL means that die does not have a names attribute. You can give one to die by assigning a character vector to the output of names. The vector should include one name for each element in die: names(die) <- c("one", "two", "three", "four", "five", "six") Now die has a names attribute: names(die) ## "one" "two" "three" "four" "five" "six" attributes(die) ## $names ## [1] "one" "two" "three" "four" "five" "six" R will display the names above the elements of die whenever you look at the vector: die ## one two three four five six ## 1 2 3 4 5 6 However, the names won’t affect the actual values of the vector, nor will the names be affected when you manipulate the values of the vector: die + 1 ## one two three four five six ## 2 3 4 5 6 7 You can also use names to change the names attribute or remove it altogether. To change the names, assign a new set of labels to names: names(die) <- c("uno", "dos", "tres", "quatro", "cinco", "seis") die ## uno dos tres quatro cinco seis ## 1 2 3 4 5 6 To remove the names attribute, set it to NULL: names(die) <- NULL die ## 1 2 3 4 5 6 5.2.2 Dim You can transform an atomic vector into an n-dimensional array by giving it a dimensions attribute with dim. To do this, set the dim attribute to a numeric vector of length n. R will reorganize the elements of the vector into n dimensions. Each dimension will have as many rows (or columns, etc.) as the nth value of the dim vector. 
For example, you can reorganize die into a 2 × 3 matrix (which has 2 rows and 3 columns): dim(die) <- c(2, 3) die ## [,1] [,2] [,3] ## [1,] 1 3 5 ## [2,] 2 4 6 or a 3 × 2 matrix (which has 3 rows and 2 columns): dim(die) <- c(3, 2) die ## [,1] [,2] ## [1,] 1 4 ## [2,] 2 5 ## [3,] 3 6 or a 1 × 2 × 3 hypercube (which has 1 row, 2 columns, and 3 “slices”). This is a three-dimensional structure, but R will need to show it slice by slice by slice on your two-dimensional computer screen: dim(die) <- c(1, 2, 3) die ## , , 1 ## ## [,1] [,2] ## [1,] 1 2 ## ## , , 2 ## ## [,1] [,2] ## [1,] 3 4 ## ## , , 3 ## ## [,1] [,2] ## [1,] 5 6 R will always use the first value in dim for the number of rows and the second value for the number of columns. In general, rows always come first in R operations that deal with both rows and columns. You may notice that you don’t have much control over how R reorganizes the values into rows and columns. For example, R always fills up each matrix by columns, instead of by rows. If you’d like more control over this process, you can use one of R’s helper functions, matrix or array. They do the same thing as changing the dim attribute, but they provide extra arguments to customize the process. 5.3 Matrices Matrices store values in a two-dimensional array, just like a matrix from linear algebra. To create one, first give matrix an atomic vector to reorganize into a matrix. Then, define how many rows should be in the matrix by setting the nrow argument to a number. matrix will organize your vector of values into a matrix with the specified number of rows. 
Alternatively, you can set the ncol argument, which tells R how many columns to include in the matrix: m <- matrix(die, nrow = 2) m ## [,1] [,2] [,3] ## [1,] 1 3 5 ## [2,] 2 4 6 matrix will fill up the matrix column by column by default, but you can fill the matrix row by row if you include the argument byrow = TRUE: m <- matrix(die, nrow = 2, byrow = TRUE) m ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 4 5 6 matrix also has other default arguments that you can use to customize your matrix. You can read about them at matrix’s help page (accessible by ?matrix). 5.4 Arrays The array function creates an n-dimensional array. For example, you could use array to sort values into a cube of three dimensions or a hypercube in 4, 5, or n dimensions. array is not as customizable as matrix and basically does the same thing as setting the dim attribute. To use array, provide an atomic vector as the first argument, and a vector of dimensions as the second argument, now called dim: ar <- array(c(11:14, 21:24, 31:34), dim = c(2, 2, 3)) ar ## , , 1 ## ## [,1] [,2] ## [1,] 11 13 ## [2,] 12 14 ## ## , , 2 ## ## [,1] [,2] ## [1,] 21 23 ## [2,] 22 24 ## ## , , 3 ## ## [,1] [,2] ## [1,] 31 33 ## [2,] 32 34 Exercise 5.3 (Make a Matrix) Create the following matrix, which stores the name and suit of every card in a royal flush. ## [,1] [,2] ## [1,] "ace" "spades" ## [2,] "king" "spades" ## [3,] "queen" "spades" ## [4,] "jack" "spades" ## [5,] "ten" "spades" Solution. There is more than one way to build this matrix, but in every case, you will need to start by making a character vector with 10 values. If you start with the following character vector, you can turn it into a matrix with any of the following three commands: hand1 <- c("ace", "king", "queen", "jack", "ten", "spades", "spades", "spades", "spades", "spades") matrix(hand1, nrow = 5) matrix(hand1, ncol = 2) dim(hand1) <- c(5, 2) You can also start with a character vector that lists the cards in a slightly different order. 
In this case, you will need to ask R to fill the matrix row by row instead of column by column: hand2 <- c("ace", "spades", "king", "spades", "queen", "spades", "jack", "spades", "ten", "spades") matrix(hand2, nrow = 5, byrow = TRUE) matrix(hand2, ncol = 2, byrow = TRUE) 5.5 Class Notice that changing the dimensions of your object will not change the type of the object, but it will change the object’s class attribute: dim(die) <- c(2, 3) typeof(die) ## "double" class(die) ## "matrix" A matrix is a special case of an atomic vector. For example, the die matrix is a special case of a double vector. Every element in the matrix is still a double, but the elements have been arranged into a new structure. R added a class attribute to die when you changed its dimensions. This class describes die’s new format. Many R functions will specifically look for an object’s class attribute, and then handle the object in a predetermined way based on the attribute. Note that an object’s class attribute will not always appear when you run attributes; you may need to specifically search for it with class: attributes(die) ## $dim ## [1] 2 3 You can apply class to objects that do not have a class attribute. class will return a value based on the object’s atomic type. Notice that the “class” of a double is “numeric,” an odd deviation, but one I am thankful for. I think that the most important property of a double vector is that it contains numbers, a property that “numeric” makes obvious: class("Hello") ## "character" class(5) ## "numeric" You can also use class to set an object’s class attribute, but this is usually a bad idea. R will expect objects of a class to share certain traits, such as attributes, that your object may not possess. You’ll learn how to make and use your own classes in Project 3: Slot Machine. 5.5.1 Dates and Times The attribute system lets R represent more types of data than just doubles, integers, characters, logicals, complexes, and raws. 
The time looks like a character string when you display it, but its data type is actually "double", and its class is "POSIXct" "POSIXt" (it has two classes): now <- Sys.time() now ## "2014-03-17 12:00:00 UTC" typeof(now) ## "double" class(now) ## "POSIXct" "POSIXt" POSIXct is a widely used framework for representing dates and times. In the POSIXct framework, each time is represented by the number of seconds that have passed between the time and 12:00 AM January 1st 1970 (in the Universal Time Coordinated (UTC) zone). For example, the time above occurs 1,395,057,600 seconds after then. So in the POSIXct system, the time would be saved as 1395057600. R creates the time object by building a double vector with one element, 1395057600. You can see this vector by removing the class attribute of now, or by using the unclass function, which does the same thing: unclass(now) ## 1395057600 R then gives the double vector a class attribute that contains two classes, "POSIXct" and "POSIXt". This attribute alerts R functions that they are dealing with a POSIXct time, so they can treat it in a special way. For example, R functions will use the POSIXct standard to convert the time into a user-friendly character string before displaying it. You can take advantage of this system by giving the POSIXct class to random R objects. For example, have you ever wondered what day it was a million seconds after 12:00 a.m. Jan. 1, 1970? mil <- 1000000 mil ## 1e+06 class(mil) <- c("POSIXct", "POSIXt") mil ## "1970-01-12 13:46:40 UTC" Jan. 12, 1970. Yikes. A million seconds goes by faster than you would think. This conversion worked well because the POSIXct class does not rely on any additional attributes, but in general, forcing the class of an object is a bad idea. There are many different classes of data in R and its packages, and new classes are invented every day. It would be difficult to learn about every class, but you do not have to. Most classes are only useful in specific situations. 
Since each class comes with its own help page, you can wait to learn about a class until you encounter it. However, there is one class of data that is so ubiquitous in R that you should learn about it alongside the atomic data types. That class is factors. 5.5.2 Factors Factors are R’s way of storing categorical information, like ethnicity or eye color. Think of a factor as something like a gender; it can only have certain values (male or female), and these values may have their own idiosyncratic order (ladies first). This arrangement makes factors very useful for recording the treatment levels of a study and other categorical variables. To make a factor, pass an atomic vector into the factor function. R will recode the data in the vector as integers and store the results in an integer vector. R will also add a levels attribute to the integer, which contains a set of labels for displaying the factor values, and a class attribute, which contains the class factor: gender <- factor(c("male", "female", "female", "male")) typeof(gender) ## "integer" attributes(gender) ## $levels ## [1] "female" "male" ## ## $class ## [1] "factor" You can see exactly how R is storing your factor with unclass: unclass(gender) ## [1] 2 1 1 2 ## attr(,"levels") ## [1] "female" "male" R uses the levels attribute when it displays the factor, as you will see. R will display each 1 as female, the first label in the levels vector, and each 2 as male, the second label. If the factor included 3s, they would be displayed as the third label, and so on: gender ## male female female male ## Levels: female male Factors make it easy to put categorical variables into a statistical model because the variables are already coded as numbers. However, factors can be confusing since they look like character strings but behave like integers. R will often try to convert character strings to factors when you load and create data. 
In general, you will have a smoother experience if you do not let R make factors until you ask for them. I’ll show you how to do this when we start reading in data. You can convert a factor to a character string with the as.character function. R will retain the display version of the factor, not the integers stored in memory: as.character(gender) ## "male" "female" "female" "male" Now that you understand the possibilities provided by R’s atomic vectors, let’s make a more complicated type of playing card. Exercise 5.4 (Write a Card) Many card games assign a numerical value to each card. For example, in blackjack, each face card is worth 10 points, each number card is worth between 2 and 10 points, and each ace is worth 1 or 11 points, depending on the final score. Make a virtual playing card by combining “ace,” “heart,” and 1 into a vector. What type of atomic vector will result? Check if you are right. Solution. You may have guessed that this exercise would not go well. Each atomic vector can only store one type of data. As a result, R coerces all of your values to character strings: card <- c("ace", "hearts", 1) card ## "ace" "hearts" "1" This will cause trouble if you want to do math with that point value, for example, to see who won your game of blackjack. Data types in vectors If you try to put multiple types of data into a vector, R will convert the elements to a single type of data. Since matrices and arrays are special cases of atomic vectors, they suffer from the same behavior. Each can only store one type of data. This creates a couple of problems. First, many data sets contain multiple types of data. Simple programs like Excel and Numbers can save multiple types of data in the same data set, and you should hope that R can too. Don’t worry, it can. Second, coercion is a common behavior in R, so you’ll want to know how it works. 5.6 Coercion R’s coercion behavior may seem inconvenient, but it is not arbitrary. 
R always follows the same rules when it coerces data types. Once you are familiar with these rules, you can use R’s coercion behavior to do surprisingly useful things. So how does R coerce data types? If a character string is present in an atomic vector, R will convert everything else in the vector to character strings. If a vector only contains logicals and numbers, R will convert the logicals to numbers; every TRUE becomes a 1, and every FALSE becomes a 0, as shown in Figure 5.1. Figure 5.1: R always uses the same rules to coerce data to a single type. If character strings are present, everything will be coerced to a character string. Otherwise, logicals are coerced to numerics. This arrangement preserves information. It is easy to look at a character string and tell what information it used to contain. For example, you can easily spot the origins of "TRUE" and "5". You can also easily back-transform a vector of 1s and 0s to TRUEs and FALSEs. R uses the same coercion rules when you try to do math with logical values. So the following code: sum(c(TRUE, TRUE, FALSE, FALSE)) will become: sum(c(1, 1, 0, 0)) ## 2 This means that sum will count the number of TRUEs in a logical vector (and mean will calculate the proportion of TRUEs). Neat, huh? You can explicitly ask R to convert data from one type to another with the as functions. R will convert the data whenever there is a sensible way to do so: as.character(1) ## "1" as.logical(1) ## TRUE as.numeric(FALSE) ## 0 You now know how R coerces data types, but this won’t help you save a playing card. To do that, you will need to avoid coercion altogether. You can do this by using a new type of object, a list. Before we look at lists, let’s address a question that might be on your mind. Many data sets contain multiple types of information. The inability of vectors, matrices, and arrays to store multiple data types seems like a major limitation. So why bother with them? 
In some cases, using only a single type of data is a huge advantage. Vectors, matrices, and arrays make it very easy to do math on large sets of numbers because R knows that it can manipulate each value the same way. Operations with vectors, matrices, and arrays also tend to be fast because the objects are so simple to store in memory. In other cases, allowing only a single type of data is not a disadvantage. Vectors are the most common data structure in R because they store variables very well. Each value in a variable measures the same property, so there’s no need to use different types of data. 5.7 Lists Lists are like atomic vectors because they group data into a one-dimensional set. However, lists do not group together individual values; lists group together R objects, such as atomic vectors and other lists. For example, you can make a list that contains a numeric vector of length 31 in its first element, a character vector of length 1 in its second element, and a new list of length 2 in its third element. To do this, use the list function. list creates a list the same way c creates a vector. Separate each element in the list with a comma: list1 <- list(100:130, "R", list(TRUE, FALSE)) list1 ## [[1]] ## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 ## [14] 113 114 115 116 117 118 119 120 121 122 123 124 125 ## [27] 126 127 128 129 130 ## ## [[2]] ## [1] "R" ## ## [[3]] ## [[3]][[1]] ## [1] TRUE ## ## [[3]][[2]] ## [1] FALSE I left the [1] notation in the output so you can see how it changes for lists. The double-bracketed indexes tell you which element of the list is being displayed. The single-bracket indexes tell you which subelement of an element is being displayed. For example, 100 is the first subelement of the first element in the list. "R" is the first sub-element of the second element. This two-system notation arises because each element of a list can be any R object, including a new vector (or list) with its own indexes. 
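To make the structure of the list above easier to see, here is a small sketch that rebuilds list1 and inspects it with base R's length, sapply, and str functions. The list is the one from the text; the choice of inspection functions is an illustration, not something the chapter prescribes.

```r
# The same list as in the text above.
list1 <- list(100:130, "R", list(TRUE, FALSE))

# length() counts the top-level elements, not the values inside them.
length(list1)

# Each element keeps its own type; sapply() applies typeof() to every element.
# Note that 100:130 creates an integer vector.
sapply(list1, typeof)

# str() prints a compact summary of the whole nested structure.
str(list1)
```

The length of the list is 3 even though its first element alone holds 31 values, because a list counts the objects it groups together rather than the values inside them.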
Lists are a basic type of object in R, on par with atomic vectors. Like atomic vectors, they are used as building blocks to create many more sophisticated types of R objects. As you can imagine, the structure of lists can become quite complicated, but this flexibility makes lists a useful all-purpose storage tool in R: you can group together anything with a list. However, not every list needs to be complicated. You can store a playing card in a very simple list. Exercise 5.5 (Use a List to Make a Card) Use a list to store a single playing card, like the ace of hearts, which has a point value of one. The list should save the face of the card, the suit, and the point value in separate elements. Solution. You can create your card like this. In the following example, the first element of the list is a character vector (of length 1). The second element is also a character vector, and the third element is a numeric vector: card <- list("ace", "hearts", 1) card ## [[1]] ## [1] "ace" ## ## [[2]] ## [1] "hearts" ## ## [[3]] ## [1] 1 You can also use a list to store a whole deck of playing cards. Since you can save a single playing card as a list, you can save a deck of playing cards as a list of 52 sublists (one for each card). But let’s not bother; there’s a much cleaner way to do the same thing. You can use a special class of list, known as a data frame. 5.8 Data Frames Data frames are the two-dimensional version of a list. They are far and away the most useful storage structure for data analysis, and they provide an ideal way to store an entire deck of cards. You can think of a data frame as R’s equivalent to the Excel spreadsheet because it stores data in a similar format. Data frames group vectors together into a two-dimensional table. Each vector becomes a column in the table. As a result, each column of a data frame can contain a different type of data; but within a column, every cell must be the same type of data, as in Figure 5.2. 
Figure 5.2: Data frames store data as a sequence of columns. Each column can be a different data type. Every column in a data frame must be the same length. Creating a data frame by hand takes a lot of typing, but you can do it (if you like) with the data.frame function. Give data.frame any number of vectors, each separated with a comma. Each vector should be set equal to a name that describes the vector. data.frame will turn each vector into a column of the new data frame: df <- data.frame(face = c("ace", "two", "six"), suit = c("clubs", "clubs", "clubs"), value = c(1, 2, 3)) df ## face suit value ## ace clubs 1 ## two clubs 2 ## six clubs 3 You’ll need to make sure that each vector is the same length (or can be made so with R’s recycling rules; see Figure 2.4), as data frames cannot combine columns of different lengths. In the previous code, I named the arguments in data.frame face, suit, and value, but you can name the arguments whatever you like. data.frame will use your argument names to label the columns of the data frame. Names You can also give names to a list or vector when you create one of these objects. Use the same syntax as with data.frame: list(face = "ace", suit = "hearts", value = 1) c(face = "ace", suit = "hearts", value = "one") The names will be stored in the object’s names attribute. If you look at the type of a data frame, you will see that it is a list. In fact, each data frame is a list with class data.frame. You can see what types of objects are grouped together by a list (or data frame) with the str function: typeof(df) ## "list" class(df) ## "data.frame" str(df) ## 'data.frame': 3 obs. of 3 variables: ## $ face : Factor w/ 3 levels "ace","six","two": 1 3 2 ## $ suit : Factor w/ 1 level "clubs": 1 1 1 ## $ value: num 1 2 3 Notice that R saved your character strings as factors. I told you that R likes factors! 
It is not a very big deal here, but you can prevent this behavior by adding the argument stringsAsFactors = FALSE to data.frame: df <- data.frame(face = c("ace", "two", "six"), suit = c("clubs", "clubs", "clubs"), value = c(1, 2, 3), stringsAsFactors = FALSE) A data frame is a great way to build an entire deck of cards. You can make each row in the data frame a playing card, and each column a type of value—each with its own appropriate data type. The data frame would look something like this: ## face suit value ## king spades 13 ## queen spades 12 ## jack spades 11 ## ten spades 10 ## nine spades 9 ## eight spades 8 ## seven spades 7 ## six spades 6 ## five spades 5 ## four spades 4 ## three spades 3 ## two spades 2 ## ace spades 1 ## king clubs 13 ## queen clubs 12 ## jack clubs 11 ## ten clubs 10 ## ... and so on. You could create this data frame with data.frame, but look at the typing involved! You need to write three vectors, each with 52 elements: deck <- data.frame( face = c("king", "queen", "jack", "ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "ace", "king", "queen", "jack", "ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "ace", "king", "queen", "jack", "ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "ace", "king", "queen", "jack", "ten", "nine", "eight", "seven", "six", "five", "four", "three", "two", "ace"), suit = c("spades", "spades", "spades", "spades", "spades", "spades", "spades", "spades", "spades", "spades", "spades", "spades", "spades", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "clubs", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "diamonds", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts", "hearts"), value = c(13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 
3, 2, 1, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1) ) You should avoid typing large data sets in by hand whenever possible. Typing invites typos and errors, not to mention RSI. It is always better to acquire large data sets as a computer file. You can then ask R to read the file and store the contents as an object. I’ve created a file for you to load that contains a data frame of playing-card information, so don’t worry about typing in the code. Instead, turn your attention toward loading data into R. 5.9 Loading Data You can load the deck data frame from the file deck.csv. Please take a moment to download the file before reading on. Visit the website, click “Download Zip,” and then unzip and open the folder that your web browser downloads. deck.csv will be inside. deck.csv is a comma-separated values file, or CSV for short. CSVs are plain-text files, which means you can open them in a text editor (as well as many other programs). If you open deck.csv, you’ll notice that it contains a table of data that looks like the following table. Each row of the table is saved on its own line, and a comma is used to separate the cells within each row. Every CSV file shares this basic format: "face","suit","value" "king","spades",13 "queen","spades",12 "jack","spades",11 "ten","spades",10 "nine","spades",9 ... and so on. Most data-science applications can open plain-text files and export data as plain-text files. This makes plain-text files a sort of lingua franca for data science. To load a plain-text file into R, click the Import Dataset icon in RStudio, shown in Figure 5.3. Then select “From text file.” Figure 5.3: You can import data from plain-text files with RStudio’s Import Dataset. RStudio will ask you to select the file you want to import, then it will open a wizard to help you import the data, as in Figure 5.4. Use the wizard to tell RStudio what name to give the data set. 
You can also use the wizard to tell RStudio which character the data set uses as a separator, which character it uses to represent decimals (usually a period in the United States and a comma in Europe), and whether or not the data set comes with a row of column names (known as a header). To help you out, the wizard shows you what the raw file looks like, as well as what your loaded data will look like based on the input settings. You can also unclick the box “Strings as factors” in the wizard. I recommend doing this. If you do, R will load all of your character strings as character strings. If you do not, R will convert them to factors. Figure 5.4: RStudio’s import wizard. Once everything looks right, click Import. RStudio will read in the data and save it to a data frame. RStudio will also open a data viewer, so you can see your new data in a spreadsheet format. This is a good way to check that everything came through as expected. If all worked well, your file should appear in a View tab of RStudio, like in Figure 5.5. You can examine the data frame in the console with head(deck). Online data You can load a plain-text file straight from the Internet by clicking the “From Web URL…” option under Import Dataset. The file will need to have its own URL, and you will need to be connected. Figure 5.5: When you import a data set, RStudio will save the data to a data frame and then display the data frame in a View tab. You can open any data frame in a View tab at any time with the View function. Now it is your turn. Download deck.csv and import it into RStudio. Be sure to save the output to an R object called deck: you’ll use it in the next few chapters. If everything goes correctly, the first few lines of your data frame should look like this: head(deck) ## face suit value ## king spades 13 ## queen spades 12 ## jack spades 11 ## ten spades 10 ## nine spades 9 ## eight spades 8 head and tail are two functions that provide an easy way to peek at large data sets. 
head will return just the first six rows of the data set, and tail will return just the last six rows. To see a different number of rows, give head or tail a second argument, the number of rows you would like to view, for example, head(deck, 10). R can open many types of files, not just CSVs. Visit Loading and Saving Data in R to learn how to open other common types of files in R. 5.10 Saving Data Before we go any further, let’s save a copy of deck as a new .csv file. That way you can email it to a colleague, store it on a thumb drive, or open it in a different program. You can save any data frame in R to a .csv file with the command write.csv. To save deck, run: write.csv(deck, file = "cards.csv", row.names = FALSE) R will turn your data frame into a plain-text file with the comma-separated values format and save the file to your working directory. To see where your working directory is, run getwd(). To change the location of your working directory, visit Session > Set Working Directory > Choose Directory in the RStudio menu bar. You can customize the save process with write.csv’s large set of optional arguments (see ?write.csv for details). However, there are three arguments that you should use every time you run write.csv. First, you should give write.csv the name of the data frame that you wish to save. Next, you should provide a file name to give your file. R will take this name quite literally, so be sure to provide an extension. Finally, you should add the argument row.names = FALSE. This will prevent R from adding a column of numbers at the start of your data frame. These numbers will identify your rows from 1 to 52, but it is unlikely that whatever program you open cards.csv in will understand the row name system. More than likely, the program will assume that the row names are the first column of data in your data frame. In fact, this is exactly what R will assume if you reopen cards.csv. 
If you save and open cards.csv several times in R, you’ll notice duplicate columns of row numbers forming at the start of your data frame. I can’t explain why R does this, but I can explain how to avoid it: use row.names = FALSE whenever you save data with write.csv. For more details about saving files, including how to compress saved files and how to save files in other formats, see Loading and Saving Data in R. Good work. You now have a virtual deck of cards to work with. Take a breather, and when you come back, we’ll start writing some functions to use on your deck. 5.11 Summary You can save data in R with five different objects, which let you store different types of values in different types of relationships, as in Figure 5.6. Of these objects, data frames are by far the most useful for data science. Data frames store one of the most common forms of data used in data science, tabular data. Figure 5.6: R’s most common data structures are vectors, matrices, arrays, lists, and data frames. You can load tabular data into a data frame with RStudio’s Import Dataset button—so long as the data is saved as a plain-text file. This requirement is not as limiting as it sounds. Most software programs can export data as a plain-text file. So if you have an Excel file (for example) you can open the file in Excel and export the data as a CSV to use with R. In fact, opening a file in its original program is good practice. Excel files use metadata, like sheets and formulas, that help Excel work with the file. R can try to extract raw data from the file, but it won’t be as good at doing this as Microsoft Excel is. No program is better at converting Excel files than Excel. Similarly, no program is better at converting SAS Xport files than SAS, and so on. However, you may find yourself with a program-specific file, but not the program that created it. You wouldn’t want to buy a multi-thousand-dollar SAS license just to open a SAS file. 
Thankfully R can open many types of files, including files from other programs and databases. R even has its own program-specific formats that can help you save memory and time if you know that you will be working entirely in R. If you’d like to know more about all of your options for loading and saving data in R, see Loading and Saving Data in R. R Notation will build upon the skills you learned in this chapter. Here, you learned how to store data in R. In R Notation, you will learn how to access values once they’ve been stored. You’ll also write two functions that will let you start using your deck, a shuffle function and a deal function. "], ["r-notation.html", "6 R Notation 6.1 Selecting Values 6.2 Deal a Card 6.3 Shuffle the Deck 6.4 Dollar Signs and Double Brackets 6.5 Summary", " 6 R Notation Now that you have a deck of cards, you need a way to do card-like things with it. First, you’ll want to reshuffle the deck from time to time. And next, you’ll want to deal cards from the deck (one card at a time, whatever card is on top—we’re not cheaters). To do these things, you’ll need to work with the individual values inside your data frame, a task essential to data science. For example, to deal a card from the top of your deck, you’ll need to write a function that selects the first row of values in your data frame, like this deal(deck) ## face suit value ## king spades 13 You can select values within an R object with R’s notation system. 6.1 Selecting Values R has a notation system that lets you extract values from R objects. To extract a value or set of values from a data frame, write the data frame’s name followed by a pair of hard brackets: deck[ , ] Between the brackets will go two indexes separated by a comma. The indexes tell R which values to return. R will use the first index to subset the rows of the data frame and the second index to subset the columns. You have a choice when it comes to writing indexes. 
There are six different ways to write an index for R, and each does something slightly different. They are all very simple and quite handy, so let’s take a look at each of them. You can create indexes with: Positive integers Negative integers Zero Blank spaces Logical values Names The simplest of these to use is positive integers. 6.1.1 Positive Integers R treats positive integers just like ij notation in linear algebra: deck[i,j] will return the value of deck that is in the ith row and the jth column, Figure 6.1. Notice that i and j only need to be integers in the mathematical sense. They can be saved as numerics in R head(deck) ## face suit value ## king spades 13 ## queen spades 12 ## jack spades 11 ## ten spades 10 ## nine spades 9 ## eight spades 8 deck[1, 1] ## "king" To extract more than one value, use a vector of positive integers. For example, you can return the first row of deck with deck[1, c(1, 2, 3)] or deck[1, 1:3]: deck[1, c(1, 2, 3)] ## face suit value ## king spades 13 R will return the values of deck that are in both the first row and the first, second, and third columns. Note that R won’t actually remove these values from deck. R will give you a new set of values which are copies of the original values. You can then save this new set to an R object with R’s assignment operator: new <- deck[1, c(1, 2, 3)] new ## face suit value ## king spades 13 Repetition If you repeat a number in your index, R will return the corresponding value(s) more than once in your “subset.” This code will return the first row of deck twice: deck[c(1, 1), c(1, 2, 3)] ## face suit value ## king spades 13 ## king spades 13 Figure 6.1: R uses the ij notation system of linear algebra. The commands in this figure will return the shaded values. R’s notation system is not limited to data frames. You can use the same syntax to select values in any R object, as long as you supply one index for each dimension of the object. 
So, for example, you can subset a vector (which has one dimension) with a single index: vec <- c(6, 1, 3, 6, 10, 5) vec[1:3] ## 6 1 3 Indexing begins at 1 In some programming languages, indexing begins with 0. This means that 0 returns the first element of a vector, 1 returns the second element, and so on. This isn’t the case with R. Indexing in R behaves just like indexing in linear algebra. The first element is always indexed by 1. Why is R different? Maybe because it was written for mathematicians. Those of us who learned indexing from a linear algebra course wonder why computer programmers start with 0. drop = FALSE If you select two or more columns from a data frame, R will return a new data frame: deck[1:2, 1:2] ## face suit ## king spades ## queen spades However, if you select a single column, R will return a vector: deck[1:2, 1] ## "king" "queen" If you would prefer a data frame instead, you can add the optional argument drop = FALSE between the brackets: deck[1:2, 1, drop = FALSE] ## face ## king ## queen This method also works for selecting a single column from a matrix or an array. 6.1.2 Negative Integers Negative integers do the exact opposite of positive integers when indexing. R will return every element except the elements in a negative index. For example, deck[-1, 1:3] will return everything but the first row of deck. deck[-(2:52), 1:3] will return the first row (and exclude everything else): deck[-(2:52), 1:3] ## face suit value ## king spades 13 Negative integers are a more efficient way to subset than positive integers if you want to include the majority of a data frame’s rows or columns. 
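Negative indexes work the same way on a one-dimensional object. The following is a minimal sketch, assuming the small vector vec from the earlier vector example (it is redefined here so the code stands on its own):

```r
# Assumed example vector, redefined so the sketch is self-contained
vec <- c(6, 1, 3, 6, 10, 5)

# Return everything except the first element
vec[-1]

# Exclude elements 2 through 6, which leaves only the first element
vec[-(2:6)]
```

The parentheses in -(2:6) matter: without them, -2:6 would build the sequence from -2 to 6, which mixes negative and positive values in a single index.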
R will return an error if you try to pair a negative integer with a positive integer in the same index: deck[c(-1, 1), 1] ## Error in xj[i] : only 0's may be mixed with negative subscripts However, you can use both negative and positive integers to subset an object if you use them in different indexes (e.g., if you use one in the rows index and one in the columns index, like deck[-1, 1]). 6.1.3 Zero What would happen if you used zero as an index? Zero is neither a positive integer nor a negative integer, but R will still use it to do a type of subsetting. R will return nothing from a dimension when you use zero as an index. This creates an empty object: deck[0, 0] ## data frame with 0 columns and 0 rows To be honest, indexing with zero is not very helpful. 6.1.4 Blank Spaces You can use a blank space to tell R to extract every value in a dimension. This lets you subset an object on one dimension but not the others, which is useful for extracting entire rows or columns from a data frame: deck[1, ] ## face suit value ## king spades 13 6.1.5 Logical Values If you supply a vector of TRUEs and FALSEs as your index, R will match each TRUE and FALSE to a row in your data frame (or a column depending on where you place the index). R will then return each row that corresponds to a TRUE, Figure 6.2. It may help to imagine R reading through the data frame and asking, "Should I return the _i_th row of the data structure?" and then consulting the _i_th value of the index for its answer. For this system to work, your vector must be as long as the dimension you are trying to subset: deck[1, c(TRUE, TRUE, FALSE)] ## face suit ## king spades rows <- c(TRUE, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F) deck[rows, ] ## face suit value ## king spades 13 Figure 6.2: You can use vectors of TRUEs and FALSEs to tell R exactly which values you want to extract and which you do not. 
The command would return just the numbers 1, 6, and 5. This system may seem odd—who wants to type so many TRUEs and FALSEs?—but it will become very powerful in Modifying Values. 6.1.6 Names Finally, you can ask for the elements you want by name—if your object has names (see Names). This is a common way to extract the columns of a data frame, since columns almost always have names: deck[1, c("face", "suit", "value")] ## face suit value ## king spades 13 # the entire value column deck[ , "value"] ## 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 ## 7 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 ## 1 13 12 11 10 9 8 7 6 5 4 3 2 1 6.2 Deal a Card Now that you know the basics of R’s notation system, let’s put it to use. Exercise 6.1 (Deal a Card) Complete the following code to make a function that returns the first row of a data frame: deal <- function(cards) { # ? } Solution. You can use any of the systems that return the first row of your data frame to write a deal function. I’ll use positive integers and blanks because I think they are easy to understand: deal <- function(cards) { cards[1, ] } The function does exactly what you want: it deals the top card from your data set. However, the function becomes less impressive if you run deal over and over again: deal(deck) ## face suit value ## king spades 13 deal(deck) ## face suit value ## king spades 13 deal(deck) ## face suit value ## king spades 13 deal always returns the king of spades because deck doesn’t know that we’ve dealt the card away. Hence, the king of spades stays where it is, at the top of the deck ready to be dealt again. This is a difficult problem to solve, and we will deal with it in Environments. In the meantime, you can fix the problem by shuffling your deck after every deal. Then a new card will always be at the top. Shuffling is a temporary compromise: the probabilities at play in your deck will not match the probabilities that occur when you play a game with a single deck of cards. 
For example, there will still be a probability that the king of spades appears twice in a row. However, things are not as bad as they may seem. Most casinos use five or six decks at a time in card games to prevent card counting. The probabilities that you would encounter in those situations are very close to the ones we will create here. 6.3 Shuffle the Deck When you shuffle a real deck of cards, you randomly rearrange the order of the cards. In your virtual deck, each card is a row in a data frame. To shuffle the deck, you need to randomly reorder the rows in the data frame. Can this be done? You bet! And you already know everything you need to do it. This may sound silly, but start by extracting every row in your data frame: deck2 <- deck[1:52, ] head(deck2) ## face suit value ## king spades 13 ## queen spades 12 ## jack spades 11 ## ten spades 10 ## nine spades 9 ## eight spades 8 What do you get? A new data frame whose order hasn’t changed at all. What if you asked R to extract the rows in a different order? For example, you could ask for row 2, then row 1, and then the rest of the cards: deck3 <- deck[c(2, 1, 3:52), ] head(deck3) ## face suit value ## queen spades 12 ## king spades 13 ## jack spades 11 ## ten spades 10 ## nine spades 9 ## eight spades 8 R complies. You’ll get all the rows back, and they’ll come in the order you ask for them. If you want the rows to come in a random order, then you need to sort the integers from 1 to 52 into a random order and use the results as a row index. How could you generate such a random collection of integers? 
With our friendly neighborhood sample function: random <- sample(1:52, size = 52) random ## 35 28 39 9 18 29 26 45 47 48 23 22 21 16 32 38 1 15 20 ## 11 2 4 14 49 34 25 8 6 10 41 46 17 33 5 7 44 3 27 ## 50 12 51 40 52 24 19 13 42 37 43 36 31 30 deck4 <- deck[random, ] head(deck4) ## face suit value ## five diamonds 5 ## queen diamonds 12 ## ace diamonds 1 ## five spades 5 ## nine clubs 9 ## jack diamonds 11 Now the new set is truly shuffled. You’ll be finished once you wrap these steps into a function. Exercise 6.2 (Shuffle a Deck) Use the preceding ideas to write a shuffle function. shuffle should take a data frame and return a shuffled copy of the data frame. Solution. Your shuffle function will look like the one that follows: shuffle <- function(cards) { random <- sample(1:52, size = 52) cards[random, ] } Nice work! Now you can shuffle your cards between each deal: deal(deck) ## face suit value ## king spades 13 deck2 <- shuffle(deck) deal(deck2) ## face suit value ## jack clubs 11 6.4 Dollar Signs and Double Brackets Two types of object in R obey an optional second system of notation. You can extract values from data frames and lists with the $ syntax. You will encounter the $ syntax again and again as an R programmer, so let’s examine how it works. To select a column from a data frame, write the data frame’s name and the column name separated by a $. Notice that no quotes should go around the column name: deck$value ## 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 7 ## 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 1 13 ## 12 11 10 9 8 7 6 5 4 3 2 1 R will return all of the values in the column as a vector. This $ notation is incredibly useful because you will often store the variables of your data sets as columns in a data frame. From time to time, you’ll want to run a function like mean or median on the values in a variable. 
In R, these functions expect a vector of values as input, and deck$value delivers your data in just the right format: mean(deck$value) ## 7 median(deck$value) ## 7 You can use the same $ notation with the elements of a list, if they have names. This notation has an advantage with lists, too. If you subset a list in the usual way, R will return a new list that has the elements you requested. This is true even if you only request a single element. To see this, make a list: lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c")) lst ## $numbers ## [1] 1 2 ## $logical ## [1] TRUE ## $strings ## [1] "a" "b" "c" And then subset it: lst[1] ## $numbers ## [1] 1 2 The result is a smaller list with one element. That element is the vector c(1, 2). This can be annoying because many R functions do not work with lists. For example, sum(lst[1]) will return an error. It would be horrible if once you stored a vector in a list, you could only ever get it back as a list: sum(lst[1]) ## Error in sum(lst[1]) : invalid 'type' (list) of argument When you use the $ notation, R will return the selected values as they are, with no list structure around them: lst$numbers ## 1 2 You can then immediately feed the results to a function: sum(lst$numbers) ## 3 If the elements in your list do not have names (or you do not wish to use the names), you can use two brackets, instead of one, to subset the list. This notation will do the same thing as the $ notation: lst[[1]] ## 1 2 In other words, if you subset a list with single-bracket notation, R will return a smaller list. If you subset a list with double-bracket notation, R will return just the values that were inside an element of the list. You can combine this feature with any of R’s indexing methods: lst["numbers"] ## $numbers ## [1] 1 2 lst[["numbers"]] ## 1 2 This difference is subtle but important. In the R community, there is a popular, and helpful, way to think about it, Figure 6.3. 
Imagine that each list is a train and each element is a train car. When you use single brackets, R selects individual train cars and returns them as a new train. Each car keeps its contents, but those contents are still inside a train car (i.e., a list). When you use double brackets, R actually unloads the car and gives you back the contents. Figure 6.3: It can be helpful to think of your list as a train. Use single brackets to select train cars, double brackets to select the contents inside of a car. Never attach In R’s early days, it became popular to use attach() on a data set once you had it loaded. Don’t do this! attach recreates a computing environment similar to those used in other statistics applications like Stata and SPSS, which crossover users liked. However, R is not Stata or SPSS. R is optimized to use the R computing environment, and running attach() can cause confusion with some R functions. What does attach() do? On the surface, attach saves you typing. If you attach the deck data set, you can refer to each of its variables by name; instead of typing deck$face, you can just type face. But typing isn’t bad. It gives you a chance to be explicit, and in computer programming, explicit is good. Attaching a data set creates the possibility that R will confuse two variable names. If this occurs within a function, you’re likely to get unusable results and an unhelpful error message to explain what happened. Now that you are an expert at retrieving values stored in R, let’s summarize what you’ve accomplished. 6.5 Summary You have learned how to access values that have been stored in R. You can retrieve a copy of values that live inside a data frame and use the copies for new computations. In fact, you can use R’s notation system to access values in any R object. To use it, write the name of an object followed by brackets and indexes. If your object is one-dimensional, like a vector, you only need to supply one index. 
If it is two-dimensional, like a data frame, you need to supply two indexes separated by a comma. And, if it is n-dimensional, you need to supply n indexes, each separated by a comma. In Modifying Values, you’ll take this system a step further and learn how to change the actual values that are stored inside your data frame. This is all adding up to something special: complete control of your data. You can now store your data in your computer, retrieve individual values at will, and use your computer to perform correct calculations with those values. Does this sound basic? It may be, but it is also powerful and essential for efficient data science. You no longer need to memorize everything in your head, nor worry about doing mental arithmetic wrong. This low-level control over your data is also a prerequisite for more efficient R programs, the subject of Project 3: Slot Machine. "],