forked from rstudio-conf-2020/r-for-excel
# R for Excel Users

Julie Lowndes & Allison Horst

2021-01-08

# Chapter 1 Welcome

Hello! This is a course taught by Dr. Julie Stewart Lowndes and Dr. Allison Horst at the RStudio Conference: January 27-28 in San Francisco, California.

This course is for Excel users who want to add or integrate R and RStudio into their existing data analysis toolkit. It is a friendly intro to becoming a modern R user, full of tidyverse, RMarkdown, GitHub, collaboration & reproducibility.

This book is written to be used as a reference, to teach, or as self-paced learning. And also, awesomely, it's created with the same tools and practices we will be talking about: R and RStudio (specifically bookdown) and GitHub. It is being fine-tuned, but the most recent version is always available:

- This book: https://rstudio-conf-2020.github.io/r-for-excel/
- Book GitHub repo: https://github.com/rstudio-conf-2020/r-for-excel
- Accompanying slides: Google Slides
- Blog: https://education.rstudio.com/blog/2020/02/conf20-r-excel/

## About us

We are environmental scientists who use and teach R in our daily work. We both work at the University of California Santa Barbara, USA. Julie Lowndes is a Senior Fellow and Director of Openscapes at the National Center for Ecological Analysis and Synthesis. Allison Horst is a Lecturer of Data Science & Statistics at the Bren School of Environmental Science and Management. She is also Artist in Residence at RStudio!

## 1.1 Agenda

| Time        | Day 1                                 | Day 2                             |
|-------------|---------------------------------------|-----------------------------------|
| 9-10:30     | Overview, R & RStudio, RMarkdown (JL) | Tidying (AH)                      |
| *break*     |                                       |                                   |
| 11-12:30    | Intro to GitHub (JL)                  | Filters & joins (AH)              |
| *lunch*     |                                       |                                   |
| 13:30-15:00 | Graphs with ggplot2 (AH)              | Collaborating & getting help (JL) |
| *break*     |                                       |                                   |
| 15:30-17:00 | Pivot Tables & dplyr (JL)             | Synthesis (AH)                    |

## 1.2 Prerequisites

Before the training, please do the following (20 minutes). All software is free.

1. Download and install R and RStudio
   - R: https://cloud.r-project.org/
   - RStudio: http://www.rstudio.com/download
   - Follow your operating system's normal installation process
2. Create a GitHub account
   - GitHub: https://github.com
   - Follow optional advice on choosing your username
   - Remember your username, email and password; we will need them for the workshop!
3. Download and install Git
   - Git: https://git-scm.com/downloads
   - Follow your operating system's normal installation process. Note: you will not see an application called Git listed, but if the installation process completed, it was likely successful, and we will confirm together
4. Download workshop data
   - Google Drive folder: r-for-excel-data
   - Save it temporarily somewhere you will remember; we will move it together

## 1.3 Data citations

We use the following data from the Santa Barbara Coastal Long Term Ecological Research (SBC LTER) program and the National Oceanic and Atmospheric Administration (NOAA) in this workshop:

**fish.csv**
- Description: Reef fish abundance, SB coast
- Link: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=17&revision=newest
- Citation: Reed D. 2018. SBC LTER: Reef: Kelp Forest Community Dynamics: Fish abundance. Environmental Data Initiative. doi.

**inverts.xlsx**
- Description: Invertebrate counts, SB coast
- Link: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=19&revision=newest
- Citation: Reed D. 2018. SBC LTER: Reef: Kelp Forest Community Dynamics: Invertebrate and algal density. Environmental Data Initiative. doi.

**kelp.xlsx**
- Description: Giant kelp abundance and size, SB coast
- Link: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=18&revision=newest
- Citation: Reed D. 2018. SBC LTER: Reef: Kelp Forest Community Dynamics: Abundance and size of Giant Kelp (Macrocystis pyrifera), ongoing since 2000. Environmental Data Initiative. doi.

**lobsters.xlsx and lobsters2.xlsx**
- Description: Lobster size, abundance and fishing pressure (SB coast)
- Link: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=77&revision=newest
- Citation: Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. doi.

**noaa_fisheries.csv**
- Description: NOAA Commercial Fisheries Landing data (1950-2017)
- Link: https://www.st.nmfs.noaa.gov/commercial-fisheries/commercial-landings/
- Source: Fisheries Statistics Division of NOAA Fisheries

**substrate.xlsx**
- Description: Algal cover, invertebrates and substrates near Santa Cruz Island
- Link: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=38&revision=newest
- Citation: Schmitt R. J., S. J. Holbrook. 2012. SBC LTER: Santa Cruz Island: Cover of Algae, Invertebrates and Benthic Substrate. Environmental Data Initiative. doi.

**ca_np.csv and ci_np.xlsx**
- Description: US National Parks visitation data (1904-2016)
- Link: https://data.world/inform8n/us-national-parks-visitation-1904-2016-with-boundaries
- Source: Data originally accessed from the US Department of the Interior National Park Service's Integrated Resource Management Applications data portal (https://irma.nps.gov/)

# Chapter 2 Overview

## 2.1 Welcome!

In this workshop you will learn hands-on how to begin to interoperate between Excel and R. But this workshop is not only about learning R; we will learn R using additional software: RStudio and GitHub. These tools will help us develop good habits for working in a reproducible and collaborative way, which are critical attributes of the modern analyst. It's going to be fun and empowering!

Accompanying slides: Google Slides

### 2.1.1 You are all welcome here

This is going to be a fun workshop.
This workshop will give you hands-on experience and confidence with R, and with how to interoperate between Excel and R; it is not about wholesale replacing everything you do in Excel with R. We will learn technical skills that you can incrementally incorporate into your existing workflows.

But a big part of interfacing between Excel and R is not only skillsets, it is mindsets: how we think about data, how we shape, organize, and analyze data, and how what we do now can make our analytical life better in the future. A modern R user has a workflow framed around collaboration, and uses an ecosystem of tools and practices. We will be learning three main things all at the same time:

- coding with best practices (R/RStudio/tidyverse)
- collaborative bookkeeping (Git/GitHub)
- reporting and publishing (RMarkdown/GitHub)

We are going to go through a lot in these two days, and it's less important that you remember it all. More importantly, you'll have experience with it and confidence that you can do it. The main thing to take away is that there are good ways to work between R and Excel; we will teach you to expect that so you can find what you need and use it! A theme throughout is that tools exist and are being developed by real, and extraordinarily nice, people to meet you where you are and help you do what you need to do.

You are all welcome here; please be respectful of one another. Everyone in this workshop is coming from a different place with different experiences and expectations. But everyone will learn something new here, because there is so much innovation in the data science world. Instructors and helpers learn something new every time, from each other and from your questions. If you are already familiar with some of this material, focus on how we teach, and how you might teach it to others.

Use these workshop materials not only as a reference in the future, but also as talking points so you can communicate the importance of these tools to your communities. A big part of this training is not only for you to learn these skills, but for you to also teach others and increase the value and practice of open data science in science as a whole.

## 2.2 Guiding principles / recurring themes

- "Keep the raw data raw": a hard line separating raw data and analyses. In R, we have data in one file and written computational commands saved as a separate file.
- Think ahead for Future You, Future Us.
- Help make lives easier, first and foremost your own.
- Create breadcrumbs for yourselves and others: document and share your work.

# Chapter 3 R & RStudio, RMarkdown

## 3.1 Summary

We will begin learning R through RMarkdown, which helps you tell your story of data analysis because you can write text alongside the code. We are actually learning two languages at once: R and Markdown.

### 3.1.1 Objectives

In this lesson we will get familiar with:

- the RStudio interface
- RMarkdown
- functions, packages, help pages, and error messages
- assigning variables and commenting
- configuring GitHub with RStudio

### 3.1.2 Resources

- What is RMarkdown? (awesome 1-minute video by RStudio)
- R for Data Science by Hadley Wickham and Garrett Grolemund
- STAT 545 by Jenny Bryan
- Happy Git with R by Jenny Bryan
- R for Excel Users by Gordon Shotwell
- Welcome to the tidyverse by Hadley Wickham et al.
- A GIF-based introduction to RStudio (Shannon Pileggi, Piping Hot Data)

## 3.2 RStudio Orientation

What is the RStudio IDE (integrated development environment)? The RStudio IDE is software that greatly improves your R experience.
I think that R is your airplane, and the RStudio IDE is your airport. You are the pilot, and you use R to go places! With practice you'll gain skills and confidence; you can fly further distances and get through tricky situations. You will become an awesome pilot and can fly your plane anywhere. And the RStudio IDE provides support! Runways, communication, community, and other services that make your life as a pilot much easier. It provides not only the infrastructure but a hub for the community that you can interact with.

To launch RStudio, double-click on the RStudio icon. Launching RStudio also launches R, and you will probably never open R by itself. Notice the default panes:

- Console (entire left)
- Environment/History (tabbed in upper right)
- Files/Plots/Packages/Help (tabbed in lower right)

We won't click through all of this immediately, but we will become familiar with more of the options and capabilities throughout the next few days. Something critical to know now is that you can make everything you see BIGGER by going to the navigation pane: View > Zoom In. Learn these keyboard shortcuts; being able to see what you're typing will help avoid typos & help us help you.

An important first question: where are we? If you have opened RStudio for the first time, you'll be in your Home directory. This is noted by the `~/` at the top of the Console. You can see, too, that the Files pane in the lower right shows what is in the Home directory where you are. You can navigate around within that Files pane and explore, but note that you won't change where you are: even as you click through, you'll still be Home: `~/`.

We are going to have our first experience with R through RMarkdown, so let's do the following.

## 3.3 Intro to RMarkdown

An RMarkdown file is a plain text file that allows us to write code and text together, and when it is "knit," the code will be evaluated and the text formatted so that it creates a reproducible report or document that is nice to read as a human.
This is really critical to reproducibility, and it also saves time. This document will recreate your figures for you in the same document where you are writing text. So no more doing analysis, saving a plot, pasting that plot into Word, redoing the analysis, re-saving, re-pasting, etc.

This 1-minute video does the best job of introducing RMarkdown: What is RMarkdown?. Now let's experience this a bit ourselves, and then we'll talk about it more.

### 3.3.1 Create an RMarkdown file

Let's do this together: File > New File > RMarkdown… (or alternatively you can click the green plus in the top left > RMarkdown). Let's title it "Testing" and write our name as author, then click OK with the recommended Default Output Format, which is HTML.

OK, first off: by opening a file, we are seeing the 4th pane of the RStudio IDE, which here is a text editor. This lets us dock and organize our files within RStudio instead of having a bunch of different windows open (but there are options to pop them out if that is what you prefer).

Let's have a look at this file. It's not blank; some initial text is already provided for you. Let's have a high-level look through it:

- The top part has the Title and Author we provided, as well as today's date and the output type as an HTML document, like we selected above.
- There are white and grey sections. These are the 2 main languages that make up an RMarkdown file.
  - Grey sections are R code
  - White sections are Markdown text
- There is black and blue text (we'll ignore the green text for now).

### 3.3.2 Knit your RMarkdown file

Let's go ahead and "Knit" by clicking the blue yarn at the top of the RMarkdown file. It's going to ask us to save first; I'll name mine "testing.Rmd". Note that by default this is going to save the file in your Home directory, `~/`. Since this is a testing document, it is fine to save here; we will get more organized about where we save files very soon. Once you click Save, the knit process will be able to continue.
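For orientation, the default document that RStudio generates looks roughly like the sketch below (your title, author, and date will differ; the exact template text varies slightly between RStudio versions):

````markdown
---
title: "Testing"
author: "Your Name"
date: "2021-01-08"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown

This is an R Markdown document.

```{r cars}
summary(cars)
```
````

The part between the `---` lines is the YAML header (the Title/Author/output we chose); the ```` ```{r} ```` fences mark the grey R code chunks we'll look at next.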
OK, so how cool is this: we've just made an html file! This is a single webpage that we are viewing locally on our own computers. Knitting this RMarkdown document has rendered (we also say formatted) both the Markdown text (white) and the R code (grey), and it also executed (we also say ran) the R code. Let's have a look at them side-by-side, and take a deeper look at these two files. So much of learning to code is looking for patterns.

#### 3.3.2.1 Activity

Introduce yourself to the person sitting next to you. Discuss what you notice with these two files. Then we will have a brief share-out with the group. (5 mins)

### 3.3.3 Markdown text

Let's look more deeply at the Markdown text. Markdown is a formatting language for plain text, and there are only a handful of rules to know. Notice the syntax for:

- headers with `#` or `##`
- bold with `**`

To see more of the rules, let's look at RStudio's built-in reference: Help > Markdown Quick Reference. There are also good cheatsheets available online.

### 3.3.4 R code

Let's look at the R code that we see executed in our knitted document. We see that:

- `summary(cars)` produces a table with information about cars
- `plot(pressure)` produces a plot with information about pressure

There are a couple of things going on here. `summary()` and `plot()` are called functions; they are operations, and these ones come installed with R. We call functions installed with R "base R" functions. This is similar to Excel's functions and formulas. `cars` and `pressure` are small datasets that come installed with R. We'll talk more about functions and data shortly.

### 3.3.5 Code chunks

R code is written in code chunks, which are grey. Each of them starts with 3 backticks and `{r label}`, which signify that R code will follow. Anything inside the brackets (`{ }`) is instructions for RMarkdown about how to run that code. For example:

- the first chunk, labeled "setup", says `include=FALSE`, and we don't see it included in the HTML document.
- the second chunk, labeled "cars", has no additional instructions, and in the HTML document we see both the code and the evaluation of that code (a summary table).
- the third chunk, labeled "pressure", says `echo=FALSE`, and in the HTML document we do not see the code echoed; we only see the plot when the code is executed.

Aside: code chunk labels. It is possible to label your code chunks. This helps us navigate between them and keep them organized. In our example Rmd, our three chunks say `r` as the language and have a label (setup, cars, pressure). Labels are optional, but will become powerful as you become a powerful R user. If you do label your code chunks, the labels must be unique.

Notice how the word FALSE is in all capitals. Capitalization matters in R; TRUE/FALSE is something that R can interpret as a binary yes/no or 1/0. There are many more options available that we will discuss as we get more familiar with RMarkdown.

#### 3.3.5.1 New code chunks

We can create a new chunk in our RMarkdown file in one of these ways:

- click "Insert > R" at the top of the editor pane (with the green plus and green box)
- type it by hand: `` ```{r} `` on one line and `` ``` `` on the next
- copy-paste an existing chunk, but remember to relabel it something unique! (we'll explore this more in a moment)

Aside: chunks don't have to contain only R; other languages are supported.

Let's create a new code chunk at the end of our document. Now, let's write some code in R. Let's say we want to see the summary of the pressure data. I'm going to press enter to add some extra carriage returns, because sometimes I find it more pleasant to look at my code, and it helps in troubleshooting, which is often about identifying typos. R lets you use as much whitespace as you would like.

```r
summary(pressure)
```

We can knit this and see the summary of pressure. This is the same data that we see with the plot just above.

Troubleshooting: Did trying to knit your document produce an error? Start by looking at your code again. Do you have both open `(` and close `)` parentheses? Are your code chunk fences (`` ``` ``) correct?

## 3.4 R code in the Console

So far we have been telling R to execute our code only when we knit the document, but we can also write code in the Console to interact with the live R process. The Console (bottom left pane of the RStudio IDE) is where you can interact with the R engine and run code directly.

Let's type `summary(pressure)` in the Console and hit enter. We see the pressure summary table returned; it is the same information that we saw in our knitted html document. By default, R will display (we also say "print") the executed result in the Console:

```r
summary(pressure)
```

We can also do math as we can in Excel: type the following and press enter.

```r
8*22.3
```

### 3.4.1 Error messages

When you code in R or any language, you will encounter errors. We will discuss troubleshooting tips more deeply tomorrow in Collaborating & getting help; here we will just get a little comfortable with them.

#### 3.4.1.1 R error messages

Error messages are your friends. What do they look like? I'll demo typing in the Console:

```r
summary(pressur)
#> Error in summary(pressur): object 'pressur' not found
```

Error messages are R's way of saying that it didn't understand what you said. This is like in English when we say "What?" or "Pardon?" And like in spoken language, some error messages are more helpful than others, like if someone says "Sorry, could you repeat that last word?" rather than only "What?" In this case, R is saying "I didn't understand `pressur`." R tracks the datasets it has available as objects, as well as any additional objects that you make. `pressur` is not among them, so it says that it is not found.

The first step of becoming a proficient R user is to move past the exasperation of "it's not working!" and read the error message. Errors will be less frustrating with the mindset that most likely the problem is your typo or misuse, and not that R is broken or hates you. Read the error message to learn what is wrong.
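One way to confirm this diagnosis is to ask R directly whether a name is known to the session. A minimal sketch using base R's `exists()` (the misspelled name is the hypothetical typo from above):

```r
# pressure ships with R, so its name is known to the session
exists("pressure")
#> [1] TRUE

# the misspelled name is not an object R knows about
exists("pressur")
#> [1] FALSE
```

This is the same bookkeeping R does when it reports `object 'pressur' not found`.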
#### 3.4.1.2 RMarkdown error messages

Errors can also occur in RMarkdown. I said a moment ago that if you label your code chunks, the labels need to be unique. Let's see what happens if they're not. If I (re)name our `summary(pressure)` chunk "cars", I will see an error when I try to knit:

```
processing file: testing.Rmd
Error in parse_block(g[-1], g[1], params.src) : duplicate label 'cars'
Calls: <Anonymous> ... process_file -> split_file -> lapply -> FUN -> parse_block
Execution halted
```

There are two things to focus on here.

First: this error message starts out in a pretty cryptic way; I don't expect you to know what `parse_block(g[-1]...` means. But, expecting that the error message is really trying to help me, I continue scanning the message, which allows me to identify the problem: `duplicate label 'cars'`.

Second: this error is in the "R Markdown" tab on the bottom left of the RStudio IDE; it is not in the Console. That is because when RMarkdown is knitted, it actually spins up an R workspace separate from what is passed to the Console; this is one of the ways that RMarkdown enables reproducibility, because it is a self-contained instance of R. You can click back and forth between the Console and the R Markdown tab; this is something to look out for as we continue. We will work in both the Console and RMarkdown, and will discuss strategies for where and how to work as we go. Let's click back to the Console now.

### 3.4.2 Running RMarkdown code chunks

So far we have written code in our RMarkdown file that is executed when we knit the file. We have also written code directly in the Console that is executed when we press enter/return. Additionally, we can write code in an RMarkdown code chunk and execute it by sending it into the Console (i.e., we can execute code without knitting the document). How do we do it? There are several ways. Let's do each of these with `summary(pressure)`.

First approach: send R code to the Console. This approach involves selecting (highlighting) only the R code (`summary(pressure)`), not any of the backticks/fences from the code chunk.

Troubleshooting: If you see `Error: attempt to use zero-length variable name`, it is because you have accidentally highlighted the backticks along with the R code. Try again, and don't forget that you can add spaces within the code chunk or make your RStudio session bigger (View > Zoom In)!

Do this by selecting the code and then:

- copy-pasting into the Console and pressing enter/return, or
- clicking 'Run' from the RStudio IDE. This is available from:
  - the bar above the file (green arrow)
  - the menu bar: Code > Run Selected Line(s)
  - keyboard shortcut: command-return

Second approach: run the full code chunk. Since we are already grouping relevant code together in chunks, it's reasonable that we might want to run it all together at once. Do this by placing your cursor within a code chunk and then clicking the little black down arrow next to the green Run arrow and selecting Run Current Chunk. Notice there are also options to run all chunks, or all chunks above or below…

### 3.4.3 Writing code in a file vs. the Console

When should you write code in a file (.Rmd or .R script) and when should you write it in the Console? We write things in the file that are necessary for our analysis and that we want to preserve for reproducibility; we will be doing this throughout the workshop to give you a good sense of this. A file is also a great way for you to take notes to yourself. The Console is good for doing quick calculations like `8*22.3`, testing functions, calling help pages, and installing packages. We'll explore these things next.

## 3.5 R functions

Like Excel, the power of R comes not from doing small operations individually (like `8*22.3`). R's power comes from being able to operate on whole suites of numbers and datasets.
And also like Excel, some of the biggest power in R is that there are built-in functions that you can use in your analyses (and, as we'll see, R users can easily create and share functions, and it is this open source developer and contributor community that makes R so awesome).

R has a mind-blowing collection of built-in functions that are all used with the same syntax: a function name with parentheses around what the function needs in order to do what it is supposed to do. We've seen a few functions already: `plot()` and `summary()`. Functions always have the same structure: a name, parentheses, and arguments that you can specify: `function_name(arguments)`. When we talk about function names, we use the convention `function_name()` (the name with empty parentheses), but in practice we usually supply arguments to the function, `function_name(arguments)`, so that it works on some data. Let's see a few more function examples.

Like in Excel, there is a function called "sum" to calculate a total. In R, it is spelled lowercase: `sum()`. (As I type in the Console, R will provide suggestions.) Let's use the `sum()` function to calculate the sum of all the distances traveled in the `cars` dataset. We specify a single column of a dataset using the `$` operator:

```r
sum(cars$dist)
```

Another function is simply called `c()`, which combines values together. Let's create a new R code chunk and write:

```r
c(1, 7:9)
## [1] 1 7 8 9
```

Aside: some functions don't require arguments: try typing `date()` into the Console. Be sure to type the parentheses (`date()`); otherwise R will return the code behind the `date()` function rather than the output that you want/expect.

So you can see that this combines these values all into the same place, which is called a vector here. We could also do this with non-numeric examples, which are called "strings":

```r
c("San Francisco", "Cal Academy")
## [1] "San Francisco" "Cal Academy"
```

We need to put quotes around non-numeric values so that R does not interpret them as objects; it would definitely get grumpy and give us an error saying that it did not have objects by these names. And you see that R also prints them in quotes.

We can also put functions inside of other functions. These are called nested functions. When we add another function inside a function, R will evaluate them from the inside out.

```r
c(sum(cars$dist), "San Francisco", "Cal Academy")
## [1] "2149"          "San Francisco" "Cal Academy"
```

So R first evaluates `sum(cars$dist)`, and then evaluates the `c()` statement. This example demonstrates another key idea in R: the idea of classes. The output R provides is called a vector, and everything within that vector has to be the same type of thing: we can't have both numbers and words inside. So here R is able to first calculate `sum(cars$dist)` as a number, but then `c()` turns that number into text, called a "string" in R: you see that it is in quotes. It is no longer numeric; it is a string.

This is a big difference between R and Excel, since Excel allows you to have a mix of text and numeric values in the same column or row. R's way can feel restrictive, but it is also more predictable. In Excel, you might have a single number in your whole sheet that Excel is silently interpreting as text, causing errors in the analyses. In R, the whole column will be the same type. This can still cause trouble, but that is where the good practices that we are learning together can help minimize that kind of trouble.

We will not discuss classes or work with nested functions very much in this workshop (the tidyverse design and the pipe operator make nested functions less prevalent). But we wanted to introduce them to you because they will be something you encounter as you continue on your journey with R.

## 3.6 Help pages

Every function available to you should have a help page, and you access it by typing a question mark preceding the function name in the Console. Let's have a deeper look at the arguments for `plot()`, using the help pages.
```r
?plot
```

This opens up the correct page in the Help tab in the bottom right of the RStudio IDE. You can also click on the tab and type the function name in the search bar.

All help pages have the same format; here is how I look at one. The help page gives the name of the package in the top left, and is broken down into sections:

- Description: an extended description of what the function does.
- Usage: the arguments of the function and their default values.
- Arguments: an explanation of the data each argument is expecting.
- Details: any important details to be aware of.
- Value: the data the function returns.
- See Also: any related functions you might find useful.
- Examples: some examples of how to use the function.

When I look at a help page, I start with the Description to see if I am in the right place for what I need to do. Reading the description for plot lets me know that yup, this is the function I want.

I next look at the Usage and Arguments, which give me a more concrete view into what the function does. `plot` requires arguments for `x` and `y`. But we passed only one argument to `plot()`: we passed the `cars` dataset (`plot(cars)`). R is able to understand that it should use the two columns in that dataset as `x` and `y`, and it does so based on order: the first column, "speed", becomes `x` and the second column, "dist", becomes `y`. The `...` means that there are many other arguments we can pass to `plot()`, which we should expect: I think we can all agree that it would be nice to have the option of making this figure a little more beautiful and compelling. Glancing at some of the arguments, we can understand them to be about the style of the plots.

Next, I usually scroll down to the Examples at the bottom. This is where I can actually see how the function is used, and I can also paste those examples into the Console to see their output. Let's try it:

```r
plot(sin, -pi, 2*pi)
```

## 3.7 Commenting

I've been working in the Console to illustrate working interactively with the live R process. But it is likely that you may want to write some of these things as notes in your RMarkdown file. That's great! But you may not want everything you type to be run when you knit your document. You can tell R not to run something by "commenting it out." This is done with one or more pound/hash/number signs: `#`. So if I wanted to write a note to myself about using `?` to open the help pages, I would write this in my RMarkdown code chunk:

```r
## open help pages with ?:
# ?plot
```

RStudio color-codes comments as green so they are easier to see.

Notice that my convention is to use two `#`'s for my notes, and only one for code that I don't want to run now, but might want to run at other times. I like this convention because in RStudio you can uncomment/recomment multiple lines of code at once if you use just one `#`: do this by going to the menu Code > Comment/Uncomment Lines (keyboard shortcut on my Mac: Shift-Command-C).

Aside: note that the hashtag `#` is used differently in Markdown and in R. In R, a hashtag indicates a comment that will not be evaluated, and you can use as many as you want: `#` is equivalent to `######`. In Markdown, a hashtag indicates a level of header, and the number you use matters: `#` is a "level one header," meaning the biggest font and the top of the hierarchy. `###` is a level three header, and will show up nested below the `#` and `##` headers.

## 3.8 Assigning objects with <-

In Excel, data are stored in the spreadsheet. In R, they are stored in objects. Data can be a variety of formats, for example numeric and strings like we just talked about. We will be working with data objects that are rectangular in shape. If they have only one column or one row, they are also called vectors. And we assign these objects names.
This is a big difference from Excel, where you usually identify data by its location on the grid, like `$A$1:$D$20`. (You can do something similar in Excel by naming ranges of cells, but many people don't do this.)

We assign an object a name by writing the name along with the assignment operator `<-`. Let's try it by creating a variable called "x" and assigning it the value 10:

```r
x <- 10
```

When I see this written, in my head I hear "x gets 10." When we send this to the Console (I do this with Command-Enter), notice how nothing is printed in return. This is because when we assign a variable, by default the value is not returned. We can see what `x` is by typing it in the Console and hitting enter.

We can also assign objects using existing objects. Let's say we want to have the distance traveled by cars in its own variable, multiplied by 1000 (assuming these data are in km and we want m):

```r
dist_m <- cars$dist * 1000
```

Object names can be whatever you want, although it is wise not to name objects after functions that you know exist, for example "c" or "false." Additionally, names cannot start with a digit and cannot contain spaces. Different folks have different conventions; you will be wise to adopt a convention for demarcating words in names.

```r
## i_use_snake_case
## other.people.use.periods
## evenOthersUseCamelCase
## also-there-is-kebab-case
```

## 3.9 R Packages

So far we've been using a couple of functions that are included with R out of the box, such as `plot()` and `c()`. We say that these functions are from "base R." But one of the amazing things about R is that a vast user community is always creating new functions and packages that expand R's capabilities.

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation (including what is needed to create the help pages), and tests, and is easy to share with others. Packages increase the power of R by improving existing base R functionality, or by adding new functionality.
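You can also ask your R installation which packages it already has. A minimal sketch using base R's `installed.packages()` (the exact list printed will differ on every computer):

```r
## peek at the names of a few installed packages
head(rownames(installed.packages()))

## check whether a specific package is installed;
## "stats" ships with every R installation
"stats" %in% rownames(installed.packages())
#> [1] TRUE
```

This can be handy when you are unsure whether a package needs installing or only attaching.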
The traditional place to download packages is from CRAN, the Comprehensive R Archive Network, which is where you downloaded R. CRAN is like a grocery store or iTunes for vetted R packages. Aside: You can also install packages from GitHub; see devtools::install_github() You don’t need to go to CRAN’s website to install packages; this can be accomplished within R using the command install.packages(\"package-name-in-quotes\"). 3.9.1 How do you know what packages/functions exist? How do you know what packages exist? Well, how do you know what movies exist on iTunes? You learn what’s available based on your needs, your interests, and the community around you. We’ll introduce you to several really powerful packages that we work with (for example, ggplot2 for graphs and dplyr for data wrangling) and help you find others that might be of interest to you. 3.9.2 Installing R Packages Let’s install several packages that we will be using shortly. Write this in your R Markdown document and run it: ## setup packages install.packages("usethis") And after you run it, comment it out: ## setup packages # install.packages("usethis") Now we’ve installed the package, but we need to tell R that we are going to use the functions within the usethis package. We do this by using the function library(). In my mind, this is analogous to needing to wire your house for electricity: this is something you do once; this is install.packages(). But then you need to turn on the lights each time you need them; this is library(), which you do once per R session. It’s a nice convention to do this on the same line as your commented-out install.packages() line; this makes it easier for someone (including you at a future time or on a different computer) to install the package easily. ## setup packages library(usethis) # install.packages("usethis") When usethis is successfully attached, you won’t get any feedback in the Console. So unless you get an error, this worked for you. Now let’s do the same with the here package.
library(here) # install.packages("here") ## here() starts at /Users/lowndes/github/rstudio-conf-2020/r-for-excel # here() starts at /Users/lowndes here also successfully attached but isn’t quiet about it. It is a “chatty” package; when we attached it, it responded with the filepath where we are working from. This is the same as ~/ which we saw earlier. Finally, let’s install the tidyverse package. # install.packages("tidyverse") “The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy.” - Joseph Rickert: What is the tidyverse?, RStudio Community Blog. This may take a little while to complete.
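Putting this convention together for all three packages, the setup chunk in your R Markdown document might look like the following sketch (run the install.packages() lines once, then keep them commented out):

```r
## setup packages
library(usethis)    # install.packages("usethis")
library(here)       # install.packages("here")
library(tidyverse)  # install.packages("tidyverse")
```

Anyone opening your file can then uncomment the ends of those lines to install whatever they are missing.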
You will need to remember your GitHub username, the email address you created your GitHub account with, and your GitHub password. We will be using the use_git_config() function from the usethis package we just installed. Since we already installed and attached this package, type this into your Console: ## use_git_config function with my username and email as arguments use_git_config(user.name = "jules32", user.email = "[email protected]") If you see Error in use_git_config() : could not find function \"use_git_config\", please run library(\"usethis\") and try again. 3.10.2 Ensure that Git/GitHub/RStudio are communicating We are going to go through a few steps to ensure that Git and GitHub are communicating with RStudio. 3.10.2.1 RStudio: New Project Click on New Project. There are a few different ways; you could also go to File > New Project…, or click the little green + with the R box in the top left (also in the File menu). 3.10.2.2 Select Version Control 3.10.2.3 Select Git Since we are using git. Do you see what I see? If yes, hooray! Time for a break! If no, we will help you troubleshoot. Double check that your GitHub username and email are correct Troubleshooting, starting with HappyGitWithR’s troubleshooting chapter which git (Mac, Linux, or anything running a bash shell) where git (Windows, when not in a bash shell) Potentially set up an RStudio Cloud account: https://rstudio.cloud/ 3.10.3 Troubleshooting 3.10.3.1 Configure git from Terminal If usethis fails, the following is the classic approach to configuring git.
Open the Git Bash program (Windows) or the Terminal (Mac) and type the following: # display your version of git git --version # replace USER with your GitHub user account git config --global user.name USER # replace [email protected] with the email you used to register with GitHub git config --global user.email [email protected] # list your config to confirm user.* variables set git config --list This will configure git with global (--global) commands, which means it will apply ‘globally’ to all your future git repositories, rather than only to this one now. Note for PCs: We’ve seen PC failures correct themselves by doing the above but omitting --global. (Then you will need to configure git for every repo you clone, but that is fine for now.) 3.10.3.2 Troubleshooting All troubleshooting starts with reading Happy Git With R’s RStudio, Git, GitHub Hell troubleshooting chapter. 3.10.3.2.1 New(ish) Error on a Mac We’ve also seen the following errors from RStudio: error key does not contain a section --global terminal and fatal: not in a git directory To solve this, go to the Terminal and type: which git Look at the filepath that is returned. Does it say anything to do with Apple? -> If yes, then the Git you downloaded isn’t installed; please redownload if necessary, and follow the instructions to install it. -> If no (in the example image, the filepath does not say anything with Apple), then proceed below: In RStudio, navigate to: Tools > Global Options > Git/SVN. Does the “Git executable” filepath match what the Terminal returned? If not, click the browse button and navigate there. Note: on my laptop, even though I navigated to /usr/local/bin/git, it then automatically redirected because /usr/local/bin/git was an alias on my computer. That is fine. Click OK. 3.10.4 END RStudio/RMarkdown session! "],["github.html", "Chapter 4 GitHub 4.1 Summary 4.2 Why should R users use Github? 
4.3 Github Configuration 4.4 Create a repository on Github.com 4.5 Clone your repository using RStudio 4.6 Sync from RStudio (local) to GitHub (remote) 4.7 Commit history 4.8 Project-oriented workflows 4.9 Project-oriented workflows in action (aka our analytical setup) 4.10 Committing - how often? Tracking changes in your files 4.11 Issues", " Chapter 4 GitHub 4.1 Summary We will learn about version control and practice a workflow with GitHub and RStudio that streamlines working with our most important collaborator: Future You. 4.1.1 Objectives Today, we’ll interface with GitHub from our local computers using RStudio. Aside: There are many other ways to interact with GitHub, including GitHub’s Desktop App and the command line (here is Jenny Bryan’s list of git clients). You have the largest suite of options if you interface through the command line, but the most common things you’ll do can be done through one of these other applications (i.e. RStudio). Here’s what we’ll do, since we’ve already set up git on your computers in the previous session (Chapter 3): create a repository on Github.com (remote) clone locally using RStudio sync local to remote: pull, stage, commit, push explore github.com files, commit history, README project-oriented workflows project-oriented workflows in action 4.1.2 Resources Excuse me, do you have a moment to talk about version control? by Jenny Bryan Happy Git with R by Jenny Bryan, specifically Detect Git from RStudio What They Forgot to Teach You About R by Jenny Bryan, specifically Project-oriented workflows GitHub Quickstart by Melanie Frazier GitHub for Project Management by Openscapes 4.2 Why should R users use Github? Modern R users use GitHub because it helps make coding collaborative and social while also providing huge benefits to organization, archiving, and being able to find your files easily when you need them.
One of the most compelling reasons for me is that it ends (or nearly ends) the horror of keeping track of versions. Basically, we get away from this: This is a nightmare not only because I have NO idea which is truly the version we used in that analysis we need to update, but because it is going to take a lot of detective work to see what actually changed between each file. Also, it is very sad to think about the amount of time everyone involved is spending on bookkeeping: is everyone downloading an attachment, dragging it to wherever they organize this on their own computers, and then renaming everything? Hours and hours of all of our lives. But then there is GitHub. In GitHub, in this example you will likely only see a single file, which is the most recent version. GitHub’s job is to track who made any changes and when (so no need to save a copy with your name or date at the end), and it also requires that you write something human-readable that will be a breadcrumb for you in the future. It is also designed to be easy to compare versions, and you can easily revert to previous versions. GitHub also supercharges you as a collaborator. First and foremost with Future You, but it also sets you up to collaborate with Future Us! GitHub, especially in combination with RStudio, is also game-changing for publishing and distributing. You can — and we will — publish and share files openly on the internet. 4.2.1 What is Github? And Git? OK so what is GitHub? And Git? Git is a program that you install on your computer: it is version control software that tracks changes to your files over time. Github is a website that is essentially a social media platform for your git-versioned files. GitHub stores all your versioned files as an archive, but also allows you to interact with other people’s files and has management tools for the social side of software projects.
It has many nice features for visualizing differences between images, rendering and diffing map data files, rendering text data files, and tracking changes in text. GitHub was developed for software development, so much of the functionality and terminology that is exciting for professional programmers (e.g., branches and pull requests) isn’t necessarily the right place for us as new R users to get started. So we will be learning and practicing GitHub’s features and terminology on a “need to know basis” as we start managing our projects with GitHub. 4.3 Github Configuration We just configured GitHub at the end of Chapter 3. So skip to the next section if you’ve just completed this! However, if you’re dropping in on this chapter to set up GitHub, make sure you first configure GitHub with these instructions before continuing. 4.4 Create a repository on Github.com Let’s get started by going to https://github.com and going to our user profile. You can do this by typing your username in the URL (github.com/username), or after signing in, by clicking on the top-right button and going to your profile. This will have an overview of you and your work, and then you can click on the Repositories tab. Repositories are the main “unit” of GitHub: they are what GitHub tracks. They are essentially project-level folders that will contain everything associated with a project. It’s where we’ll start too. We create a new repository (called a “repo”) by clicking “New repository.” Choose a name. Call it whatever you want (the shorter the better), or follow me for convenience. I will call mine r-workshop. Also, add a description, make it public, create a README file, and create your repo! The Add gitignore option adds a document where you can identify files or file-types you want GitHub to ignore. These files will stay in the local folder (the one on your computer), but will not be uploaded onto the web version of GitHub.
The Add a license option adds a license that describes how other people can use your GitHub files (e.g., open source, but no one can profit from them, etc.). We won’t worry about this today. Check out our new repository! Great! So now we have our new repository that exists in the Cloud. Let’s get it established locally on our computers: that is called “cloning.” 4.5 Clone your repository using RStudio Let’s clone this repo to our local computer using RStudio. Unlike downloading, cloning keeps all the version control and user information bundled with the files. 4.5.1 Copy the repo address First, copy the web address of the repository you want to clone. We will use HTTPS. Aside: HTTPS is the default, but you could alternatively set up with SSH. This is more advanced than we will get into here, but allows 2-factor authentication. See Happy Git with R for more information. 4.5.2 RStudio: New Project Now go back to RStudio, and click on New Project. There are a few different ways; you could also go to File > New Project…, or click the little green + with the R box in the top left (also in the File menu). 4.5.3 Select Version Control 4.5.4 Select Git Since we are using git. 4.5.5 Paste the repo address Paste the repo address (which is still in your clipboard) into the “Repository URL” field. The “Project directory name” should autofill; if it does not, press tab or type it in. It is best practice to keep the “Project directory name” THE SAME as the repository name. When cloned, this repository is going to become a folder on your computer. At this point you can save this repo anywhere. There are different schools of thought, but we think it is useful to create a high-level folder where you will keep your github repos to keep them organized. We call ours github and keep it in our root folder (~/github), and so that is what we will demonstrate here — you are welcome to do the same. Press “Browse…” to navigate to a folder, where you will also have the option of creating a new folder.
Finally, click Create Project. 4.5.6 Admire your local repo If everything went well, the repository will show up in RStudio! The repository is also saved to the location you specified, and you can navigate to it as you normally would in Finder or Windows Explorer: Hooray! 4.5.7 Inspect your local repo Let’s notice a few things: First, our working directory is set to ~/github/r-workshop, and r-workshop is also named in the top right hand corner. Second, we have a Git tab in the top right pane! Let’s click on it. Our Git tab has 2 items: .gitignore file .Rproj file These have been added to our repo by RStudio — we can also see them in the Files pane in the bottom right of RStudio. These are helper files that RStudio has added to streamline our workflow with GitHub and R. We will talk about these a bit more soon. One thing to note about these files is that they begin with a period (.), which means they are hidden files: they show up in the Files pane of RStudio but won’t show up in your Finder or Windows Explorer. Going back to the Git tab, both these files have little yellow icons with question marks ?. This is GitHub’s way of saying: “I am responsible for tracking everything that happens in this repo, but I’m not sure what is going on with these files yet. Do you want me to track them too?” We will handle this in a moment; first let’s look at the README.md file. 4.5.8 Edit your README file Let’s also open up the README.md. This is a Markdown file, which is the same language we just learned with R Markdown. It’s like an R Markdown file without the ability to run R code. We will edit the file and illustrate how GitHub tracks files that have been modified (to complement seeing how it tracks files that have been added). README files are common in programming; they are the first place that someone will look to see why code exists and how to run it. In my README, I’ll write: This repo is for my analyses at RStudio::conf(2020).
When I save this, notice how it shows up in my Git tab. It has a blue “M”: GitHub is already tracking this file, and tracking it line-by-line, so it knows that something is different: it’s Modified with an M. Great. Now let’s sync back to GitHub in 4 steps. 4.6 Sync from RStudio (local) to GitHub (remote) Syncing to GitHub.com means 4 steps: Pull Stage Commit Push We start off this whole process by clicking on the Commit section. 4.6.1 Pull We start off by “Pulling” from the remote repository (GitHub.com) to make sure that our local copy has the most up-to-date information that is available online. Right now, since we just created the repo and are the only ones that have permission to work on it, we can be pretty confident that there isn’t new information available. But we pull anyways because this is a very safe habit to get into for when you start collaborating with yourself across computers or with others. Best practice is to pull often: it costs nothing (other than an internet connection). Pull by clicking the teal Down Arrow. (Notice also how when you highlight a filename, a preview of the differences displays below.) 4.6.2 Stage Let’s click the boxes next to each file. This is called “staging a file”: you are indicating that you want GitHub to track this file, and that you will be syncing it shortly. Notice: .Rproj and .gitignore files: the question marks turn into an A because these are new files that have been added to your repo (automatically by RStudio, not by you). README.md file: the M indicates that this was modified (by you) These are the codes used to describe how the files are changed (from the RStudio cheatsheet): 4.6.3 Commit Committing is different from saving our files (which we still have to do! RStudio will indicate a file is unsaved with red text and an asterisk). We commit a single file or a group of files when we are ready to save a snapshot in time of the progress we’ve made.
Maybe this is after a big part of the analysis was done, or when you’re done working for the day. Committing our files is a 2-step process. First, you write a “commit message,” which is a human-readable note about what has changed that will accompany GitHub’s non-human-readable alphanumeric code to track our files. I think of commit messages like breadcrumbs to my Future Self: how can I use this space to be useful for me if I’m trying to retrace my steps (perhaps in a panic)? Second, you press Commit. When we have committed successfully, we get a rather unsuccessful-looking pop-up message. You can read this message as “Congratulations! You’ve successfully committed 3 files, 2 of which are new!” It is also providing you with the alphanumeric SHA code that GitHub is using to track these files. If our attempt was not successful, we will see an Error. Otherwise, interpret this message as a joyous one. Does your pop-up message say “Aborting commit due to empty commit message”? GitHub is really serious about writing human-readable commit messages. When we close this window there is going to be (in my opinion) a very subtle indication that we are not done with the syncing process. We have successfully committed our work as a breadcrumb-message-approved snapshot in time, but it still only exists locally on our computer. We can commit without an internet connection; we have not done anything yet to tell GitHub that we want this pushed to the remote repo at GitHub.com. So as the last step, we push. 4.6.4 Push The last step in the syncing process is to Push! Awesome! We’re done here in RStudio for the moment; let’s check out the remote on GitHub.com. 4.7 Commit history The files you added should be on github.com. Notice how the README.md file we created is automatically displayed at the bottom. Since it is good practice to have a README file that identifies what code does (i.e. why it exists), GitHub will display a Markdown file called README nicely formatted.
Let’s also explore the commit history. The 2 commits we’ve made (the first was when we originally initiated the repo from GitHub.com) are there! 4.8 Project-oriented workflows Let’s go back to RStudio and how we set up well-organized projects and workflows for our data analyses. This GitHub repository is now also an RStudio Project (capital P Project). This just means that RStudio has saved this additional file with extension .Rproj (ours is r-workshop.Rproj) to store specific settings for this project. It’s a bit of technology to help us get into the good habit of having a project-oriented workflow. A project-oriented workflow means that we are going to organize all of the relevant things we need for our analyses in the same place. That means that this is the place where we keep all of our data, code, figures, notes, etc. R Projects are great for reproducibility, because our self-contained working directory will be the first place R looks for files. 4.8.1 Working directory Now that we have our Project, let’s revisit this important question: where are we? Now we are in our Project. Everything we do will by default be saved here so we can be nice and organized. And this is important because if Allison clones this repository that you just made and saves it in Allison/my/projects/way/over/here, she will still be able to interact with your files the same way you do here. 4.9 Project-oriented workflows in action (aka our analytical setup) Let’s get a bit organized. First, let’s create a new R Markdown file where we will do our analyses. This will be nice because you can also write notes to yourself in this document. 4.9.1 Create a new Rmd file So let’s do this (again): File > New File > R Markdown … (or click the green plus in the top left corner). Let’s set up this file so we can use it for the rest of the day. I’m going to update the header with a new title and add my name, and then I’m going to delete the rest of the document so that we have a clean start.
Efficiency Tip: I use Shift - Command - Down Arrow to highlight text from my cursor to the end of the document --- title: "Creating graphs in R with `ggplot2`" author: "Julie Lowndes" date: "01/27/2020" output: html_document --- # Plots with ggplot2 We are going to make plots in R and it's going to be amazing. Now, let’s save it. I’m going to call my file plots-ggplot.Rmd. Notice that when we save this file, it pops up in our Git tab. Git knows that there is something new in our repo. Let’s also knit this file. And look: Git also sees the knitted .html. And let’s practice syncing our file to GitHub: pull, stage, commit, push Troubleshooting: What if a file doesn’t show up in the Git tab and you expect that it should? Check to make sure you’ve saved the file. If the filename is red with an asterisk, there have been changes since it was saved. Remember to save before syncing to GitHub! 4.9.2 Create data and figures folders Let’s create a few folders to be organized. Let’s have one for the raw data, and one for the figures we will output. We can do this in RStudio, in the bottom right Files pane by clicking the New Folder button: folder called “data” folder called “figures” We can press the refresh button in the top-right of this pane (next to the “More” button) to have these show up in alphabetical order. Now let’s go to our Finder or Windows Explorer: our new folders are there as well! 4.9.3 Move data files to data folder You downloaded several files for this workshop from the r-for-excel-data folder, and we’ll move these data into our repo now. These data files are a mix of comma-separated value (.csv) files and Excel spreadsheets (.xlsx): ca_np.csv ci_np.xlsx fish.csv inverts.xlsx kelp_fronds.xlsx lobsters.xlsx lobsters2.xlsx noaa_landings.csv substrate.xlsx Copy-paste or drag all of these files into the ‘data’ subfolder of your R project. Make sure you do not also copy the original folder; we don’t need any subfolders in our data folder.
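Once the data files are in place, the here package (which we attached earlier) can build a path to this subfolder that works no matter where the project lives on a computer. A quick sketch (the printed path is illustrative; yours will differ):

```r
library(here)

## build a path to a file inside the project's data subfolder
here("data", "fish.csv")
## e.g. "/Users/lowndes/github/r-workshop/data/fish.csv"
```

This is why keeping data inside the project folder matters: collaborators who clone your repo get working paths for free.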
Now let’s go back to RStudio. We can click on the data folder in the Files tab and see them. The data folder also shows up in your Git tab. But the figures folder does not. That is because GitHub cannot track an empty folder; it can only track files within a folder. Let’s sync these data files (we will be able to sync the figures folder shortly). We can stage multiple files at once by typing Command - A and clicking “Stage” (or using the space bar). To Sync: pull - stage - commit - push! 4.9.4 Activity Edit your README and practice syncing (pull, stage, commit, push). For example, "We use the following data from the Santa Barbara Coastal Long Term Ecological Research and the National Oceanic and Atmospheric Administration in our analyses" Explore your Commit History, and discuss with your neighbor. 4.10 Committing - how often? Tracking changes in your files Whenever you make changes to the files in your repo, you will walk through the Pull -> Stage -> Commit -> Push steps. I tend to do this every time I finish a task (basically when I start getting nervous that I will lose my work). Once something is committed, it is very difficult to lose it. 4.11 Issues Let’s go back to our repo on GitHub.com, and talk about Issues. Issues “track ideas, enhancements, tasks, or bugs for work on GitHub.” - GitHub help article. You can create an issue for a topic, track progress, let others ask questions, provide links and updates, and close the issue when completed. In a public repo, anyone with a username can create and comment on issues. In a private repo, only users with permission can create and comment on issues, or see them at all. GitHub search is awesome – it will search both code and issues! 4.11.1 Issues in the wild! Here are some examples of “traditional” and “less traditional” Issues: Bug reports, code, feature, & help requests: ggplot2 Project submissions and progress tracking: MozillaFestival Private conversations and archiving: OHI Fellows (private) 4.11.2 END GitHub session! 
We’ll continue practicing GitHub throughout the rest of the book, but see Chapter 9 for explicit instructions on collaborating in GitHub. "],["ggplot2.html", "Chapter 5 Graphs with ggplot2 5.1 Summary 5.2 Getting started - In existing .Rmd, attach packages 5.3 Read in .xlsx and .csv files 5.4 Our first ggplot graph: Visitors to Channel Islands NP 5.5 Intro to customizing ggplot graphs 5.6 Mapping variables onto aesthetics 5.7 ggplot2 complete themes 5.8 Updating axis labels and titles 5.9 Combining compatible geoms 5.10 Multi-series ggplot graphs 5.11 Faceting ggplot graphs 5.12 Exporting a ggplot graph with ggsave()", " Chapter 5 Graphs with ggplot2 5.1 Summary In this session, we’ll first learn how to read some external data (from .xls, .xlsx, and CSV files) into R with the readr and readxl packages (both part of the tidyverse). Then, we’ll write reproducible code to build graphs piece-by-piece. In Excel, graphs are made by manually selecting options - which, as we’ve discussed previously, may not be the best option for reproducibility. Also, if we haven’t built a graph with reproducible code, then we might not be able to easily recreate a graph or use that code again to make the same style of graph with different data. Using ggplot2, the graphics package within the tidyverse, we’ll write reproducible code to manually and thoughtfully build our graphs. “ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.” - R4DS So yeah…that gg is from “grammar of graphics.” We’ll use the ggplot2 package, but the function we use to initialize a graph will be ggplot(), which works best for data in tidy format (i.e., a column for every variable, and a row for every observation). Graphics with ggplot are built step-by-step, adding new elements as layers with a plus sign (+) between layers (note: this is different from the pipe operator, %>%). 
Adding layers in this fashion allows for extensive flexibility and customization of plots. 5.1.1 Objectives Read in external data (Excel files, CSVs) with readr and readxl Initial data exploration Build several common types of graphs (scatterplot, column, line) in ggplot2 Customize gg-graph aesthetics (color, style, themes, etc.) Update axis labels and titles Combine compatible graph types (geoms) Build multiseries graphs Split up data into faceted graphs Export figures with ggsave() 5.1.2 Resources readr documentation from tidyverse.org readxl documentation from tidyverse.org readxl workflows from tidyverse.org Chapter 3 - Data Visualization in R for Data Science by Grolemund and Wickham ggplot2-cheatsheet-2.1.pdf Graphs with ggplot2 - Cookbook for R “Why I use ggplot2” - David Robinson Blog Post 5.2 Getting started - In existing .Rmd, attach packages In your existing plots-ggplot.Rmd from Session 2, remove everything below the first code chunk. The ggplot2 package is part of the tidyverse, so we don’t need to attach it separately. Attach the tidyverse, readxl and here packages in the top-most code chunk of your .Rmd. library(tidyverse) library(readxl) library(here) 5.3 Read in .xlsx and .csv files In this session, we’ll use data for parks visitation from two files: A comma-separated-value (CSV) file containing visitation data for all National Parks in California (ca_np.csv) A single Excel worksheet containing only visitation for Channel Islands National Park (ci_np.xlsx) 5.3.1 read_csv() to read in comma-separated-value (.csv) files There are many types of files containing data that you might want to work with in R. A common one is a comma separated value (CSV) file, which contains values with each column entry separated by a comma delimiter. CSVs can be opened, viewed, and worked with in Excel just like an .xls or .xlsx file - but let’s learn how to get data directly from a CSV into R where we can work with it more reproducibly. 
To read in the ca_np.csv file, we need to: insert a new code chunk use read_csv() to read in the file use here() within read_csv() to tell it where to look assign the stored data an object name (we’ll store ours as ca_np) ca_np <- read_csv(here("data", "ca_np.csv")) Look in your Environment to see that ca_np now shows up. Click on the object in the Environment, and R will automatically run the View() function for you to pull up your data in a separate viewing tab. Now we can look at it in the spreadsheet format we’re used to. We can explore our data frame a bit more to see what it contains. For example: names(): to see the variable (column) names head(): to see the first x rows (6 is the default) summary(): see a quick summary of each variable (Remember that names() is the name of the function but names(ca_np) is how we use it on the data.) Cool! Next, let’s read in ci_np.xlsx (an Excel file) using read_excel(). 5.3.2 readxl to read in Excel files We also have an Excel file (ci_np.xlsx) that contains observations only for Channel Islands National Park visitation. Both readr and readxl are part of the tidyverse, which means we should expect their functions to have similar syntax and structure. Note: If readxl is part of the tidyverse, then why did I have to attach it separately? Great question! To keep the tidyverse manageable, there are “core” packages (readr, dplyr, tidyr, ggplot2, forcats, purrr, tibble, stringr) that you would expect to use frequently, and those are automatically attached when you use library(tidyverse). But there are also more specialized tidyverse packages (e.g. readxl, reprex, lubridate, rvest) that are built with a similar design philosophy, but are not automatically attached with library(tidyverse). Those specialized packages are installed along with the tidyverse, but need to be attached individually (e.g. with library(readxl)).
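As a quick recap before reading in the Excel file, the exploration functions mentioned above can be applied to the ca_np data frame we just read in:

```r
names(ca_np)    # variable (column) names
head(ca_np)     # first 6 rows (6 is the default)
summary(ca_np)  # quick summary of each variable
```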
Use read_excel() to get the ci_np.xlsx data into R: ci_np <- read_excel(here("data", "ci_np.xlsx")) Note: If you want to explicitly read in an .xlsx or .xls file, you can use read_xlsx() or read_xls() instead. read_excel() will make its best guess about which type of Excel file you’re reading in, so it is the generic form. Explore the ci_np data frame as above using functions like View(), names(), head(), and summary(). Now that we have read in the National Parks visitation data, let’s use ggplot2 to visualize it. 5.4 Our first ggplot graph: Visitors to Channel Islands NP To create a bare-bones ggplot graph, we need to tell R three basic things: We’re using ggplot2::ggplot() Data we’re using & variables we’re plotting (i.e., what is x and/or y?) What type of graph we’re making (the type of geom) Generally, that structure will look like this: ggplot(data = df_name, aes(x = x_var_name, y = y_var_name)) + geom_type() Breaking that down: First, tell R you’re using ggplot() Then, tell it the object name where variables exist (data = df_name) Next, tell it the aesthetics aes() to specify which variables you want to plot Then add a layer for the type of geom (graph type) with geom_*() - for example, geom_point() is a scatterplot, geom_line() is a line graph, geom_col() is a column graph, etc. Let’s do that to create a line graph of visitors to Channel Islands National Park: ggplot(data = ci_np, aes(x = year, y = visitors)) + geom_line() We’re going to be doing a lot of plot variations with those same variables. 
Let’s store the first line as object gg_base so that we don’t need to retype it each time: gg_base <- ggplot(data = ci_np, aes(x = year, y = visitors)) Or, we could change that to a scatterplot just by updating the geom_*: gg_base + geom_point() We could even do that for a column graph: gg_base + geom_col() Or an area plot… gg_base + geom_area() We can see that updating to different geom_* types is quick, so long as the types of graphs we’re switching between are compatible. The data are there, now let’s do some data viz customization. 5.5 Intro to customizing ggplot graphs First, we’ll customize some aesthetics (e.g. colors, styles, axis labels, etc.) of our graphs based on non-variable values. We can change the aesthetics of elements in a ggplot graph by adding arguments within the layer where that element is created. Some common arguments we’ll use first are: color = or colour =: update point or line colors fill =: update fill color for objects with areas linetype =: update the line type (dashed, long dash, etc.) pch =: update the point style size =: update the element size (e.g. of points or line thickness) alpha =: update element opacity (1 = opaque, 0 = transparent) Building on our first line graph, let’s update the line color to “purple” and make the line type “dashed”: gg_base + geom_line( color = "purple", linetype = "dashed" ) How do we know which color names ggplot will recognize? If you google “R colors ggplot2” you’ll find a lot of good resources. Here’s one: SAPE ggplot2 colors quick reference guide Now let’s update the color, style, and size of points on our previous scatterplot graph using color =, size =, and pch = (see ?pch for the different point styles, which can be further customized). gg_base + geom_point(color = "purple", pch = 17, size = 4, alpha = 0.5) 5.5.1 Activity: customize your own ggplot graph Update one of the example graphs you created above to customize at least an element color and size! 
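For the activity above, one possible approach is sketched below. This builds on the stored gg_base object; the specific color, point style, and size are arbitrary choices for illustration, not part of the lesson:

```r
# One hypothetical activity solution: a scatterplot with a
# custom color, point style (pch), size, and opacity
gg_base +
  geom_point(color = "firebrick", pch = 19, size = 3, alpha = 0.7)
```

Any combination of the aesthetic arguments listed above (color =, fill =, linetype =, pch =, size =, alpha =) would satisfy the activity.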
5.6 Mapping variables onto aesthetics In the examples above, we have customized aesthetics based on constants that we input as arguments (e.g., the color / style / size isn’t changing based on a variable characteristic or value). Sometimes, however, we do want the aesthetics of a graph to depend on a variable. To do that, we’ll map variables onto graph aesthetics, meaning we’ll change how an element on the graph looks based on a variable characteristic (usually, character or value). When we want to customize a graph element based on a variable’s characteristic or value, add the argument within aes() in the appropriate geom_*() layer. In short, if updating aesthetics based on a variable, make sure to put that argument inside of aes(). Example: Create a ggplot scatterplot graph where the size and color of the points change based on the number of visitors, and make all points the same level of opacity (alpha = 0.5). Notice the aes() around the size = and color = arguments. Note that this example is overmapped and unnecessary; avoid excessive or overcomplicated aesthetic mapping in data visualization. gg_base + geom_point( aes(size = visitors, color = visitors), alpha = 0.5 ) In the example above, notice that the two arguments that do depend on variables are within aes(), but since alpha = 0.5 doesn’t depend on a variable then it is outside the aes() but still within the geom_point() layer. 5.6.1 Activity: map variables onto graph aesthetics Create a column plot of Channel Islands National Park visitation over time, where the fill color (argument: fill =) changes based on the number of visitors. gg_base + geom_col(aes(fill = visitors)) Sync your project with your GitHub repo. 5.7 ggplot2 complete themes While every element of a ggplot graph is manually customizable, there are also built-in themes (theme_*()) that you can add to your ggplot code to make some major headway before making smaller tweaks manually. 
Here are a few to try today (but also notice all the options that appear as we start typing theme_ into our ggplot graph code!): theme_light() theme_minimal() theme_bw() Here, let’s update our previous graph with theme_minimal(): gg_base + geom_point( aes(size = visitors, color = visitors), alpha = 0.5 ) + theme_minimal() 5.8 Updating axis labels and titles Use labs() to update axis labels, and add a title and/or subtitle to your ggplot graph. gg_base + geom_line(linetype = "dotted") + theme_bw() + labs( x = "Year", y = "Annual park visitors", title = "Channel Islands NP Visitation", subtitle = "(1963 - 2016)" ) Note: If you want to update the formatting of axis values (for example, to convert to comma format instead of scientific format above), you can use the scales package options (see more from the R Cookbook). 5.9 Combining compatible geoms As long as the geoms are compatible, we can layer them on top of one another to further customize a graph. For example, adding points to a line graph: gg_base + geom_line(color = "purple") + geom_point(color = "orange", aes(size = year), alpha = 0.5) Or, combine a column and line graph (not sure why you’d want to do this, but you can): gg_base + geom_col(fill = "orange", color = "purple") + geom_line(color = "green") 5.10 Multi-series ggplot graphs In the examples above, we only had a single series - visitation at Channel Islands National Park. Often we’ll want to visualize multiple series. For example, from the ca_np object we have stored, we might want to plot visitation for all California National Parks. To do that, we need to add an aesthetic that lets ggplot know how things are going to be grouped. A demonstration of why that’s important - what happens if we don’t let ggplot know how to group things? 
ggplot(data = ca_np, aes(x = year, y = visitors)) + geom_line() Well that’s definitely a mess, and it’s because ggplot has no idea that these should be different series based on the different parks that appear in the ‘park_name’ column. We can make sure R does know by adding an explicit grouping argument (group =), or by updating an aesthetic based on park_name: ggplot(data = ca_np, aes(x = year, y = visitors, group = park_name)) + geom_line() Note: You could also add an aesthetic (color = park_name) in the geom_line() layer to create groupings, instead of in the topmost ggplot() layer. Let’s store that topmost line so that we can use it more quickly later on in the lesson: gg_np <- ggplot(data = ca_np, aes(x = year, y = visitors, group = park_name)) 5.11 Faceting ggplot graphs When we facet graphs, we split them up into multiple plotting panels, where each panel contains a subset of the data. In our case, we’ll split the graph above into different panels, each containing visitation data for a single park. Also notice that any general theme changes made will be applied to all of the graphs. gg_np + geom_line(show.legend = FALSE) + theme_light() + labs(x = "year", y = "annual visitors") + facet_wrap(~ park_name) 5.12 Exporting a ggplot graph with ggsave() If we want our graph to appear in a knitted html, then we don’t need to do anything else. But often we’ll need a saved image file, of specific size and resolution, to share or for publication. ggsave() will export the most recently run ggplot graph by default (plot = last_plot()), unless you give it the name of a different saved ggplot object. 
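For example, rather than relying on the plot = last_plot() default, you can save a stored plot explicitly. This is a sketch only: np_plot is a hypothetical object name, built from the gg_np object stored above:

```r
# Store the faceted graph as a named object...
np_plot <- gg_np +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ park_name)

# ...then pass that object to ggsave() explicitly via the plot = argument
ggsave(here("figures", "np_plot.png"), plot = np_plot)
```

Naming the plot this way means ggsave() exports exactly the object you intend, even if other plots have been run more recently.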
Some common arguments for ggsave(): width =: set exported image width (default inches) height =: set exported image height (default inches) dpi =: set dpi (dots per inch) So to export the faceted graph above at 180 dpi, with a width of 8\" and a height of 7\", we can use: ggsave(here("figures", "np_graph.jpg"), dpi = 180, width = 8, height = 7) Notice that a .jpg image of that name and size is now stored in the figures folder within your working directory. You can change the type of exported image, too (e.g. pdf, tiff, eps, png, bmp, svg). Sync your project with your GitHub repo. Stage Commit Pull (to check for remote changes) Push! 5.12.1 End ggplot session! "],["pivot-tables.html", "Chapter 6 Pivot Tables with dplyr 6.1 Summary 6.2 Overview & setup 6.3 Pivot table demo 6.4 group_by() %>% summarize() 6.5 Oh no, they sent the wrong data! 6.6 mutate() 6.7 select()", " Chapter 6 Pivot Tables with dplyr 6.1 Summary Pivot tables are powerful tools in Excel for summarizing data in different ways. We will create these tables using the group_by and summarize functions from the dplyr package (part of the Tidyverse). We will also learn how to format tables and practice creating a reproducible report using RMarkdown and sharing it with GitHub. Data used in the synthesis section: File name: lobsters.xlsx and lobsters2.xlsx Description: Lobster size, abundance and fishing pressure (Santa Barbara coast) Link: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=77&revision=newest Citation: Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. doi. 6.1.1 Objectives In R, we can use the dplyr package for pivot tables by using two functions, group_by and summarize, together with the pipe operator %>%. We will also continue to emphasize reproducibility in all our analyses. 
Discuss pivot tables in Excel Introduce group_by() %>% summarize() from the dplyr package Learn mutate() and select() to work column-wise Practice our reproducible workflow with RMarkdown and GitHub 6.1.2 Resources dplyr website: dplyr.tidyverse.org R for Data Science: Transform Chapter by Hadley Wickham & Garrett Grolemund Intro to Pivot Tables I-III videos by Excel Campus Data organization in spreadsheets by Karl Broman & Kara Woo 6.2 Overview & setup Wikipedia describes a pivot table as a “table of statistics that summarizes the data of a more extensive table…this summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.” Aside: Wikipedia also says that “Although pivot table is a generic term, Microsoft trademarked PivotTable in the United States in 1994.” Pivot tables are a really powerful tool for summarizing data, and we can have similar functionality in R — as well as nicely automating and reporting these tables. We will first have a look at our data, demo using pivot tables in Excel, and then create reproducible tables in R. 6.2.1 View data in Excel When reading in Excel files (or really any data that isn’t yours), it can be a good idea to open the data and look at it so you know what you’re up against. Let’s open the lobsters.xlsx data in Excel. It’s one sheet, and it’s rectangular. In this data set, every row is a unique observation. This is called “uncounted” data; you’ll see there is no row for how many lobsters were seen because each row is an observation, or an “n of 1.” But also notice that the data doesn’t start until line 5; there are 4 lines of metadata — data about the data that is super important! — that we don’t want to muddy our analyses. Now your first idea might be to delete these 4 rows from this Excel sheet and save them on another, but we also know that we need to keep the raw data raw. So let’s not touch this data in Excel, we’ll remove these lines in R. 
Let’s do that first so then we’ll be all set. 6.2.2 RMarkdown setup Let’s start a new RMarkdown file in our repo, at the top-level (where it will be created by default in our Project). I’ll call mine pivot_lobsters.Rmd. In the setup chunk, let’s attach our libraries and read in our lobster data. In addition to the tidyverse package we will also use the skimr package. You will have to install it, but you don’t want it to be installed every time you run your code. The following is a nice convention for having the install instructions available (on the same line) as the library() call. ## attach libraries library(tidyverse) library(readxl) library(here) library(skimr) # install.packages('skimr') library(kableExtra) # install.packages('kableExtra') We used read_excel() before, which is the generic function that reads both .xls and .xlsx files. Since we know that this is a .xlsx file, we will demo using the read_xlsx() function. We can expect that someone in the history of R and especially the history of the readxl package has needed to skip lines at the top of an Excel file before. So let’s look at the help pages ?read_xlsx: there is an argument called skip that we can set to 4 to skip 4 lines. ## read in data lobsters <- read_xlsx(here("data/lobsters.xlsx"), skip=4) Great. We’ve seen this data in Excel so I don’t feel the need to use head() here like we’ve done before, but I do like having a look at summary statistics and classes. 6.2.2.1 skimr::skim To look at summary statistics we’ve used summary, which is good for numeric columns, but it doesn’t give a lot of useful information for non-numeric data. For example, it wouldn’t tell us how many unique sites there are in this dataset. To have a look there I like using the skimr package: # explore data skimr::skim(lobsters) This skimr:: notation is a reminder to me that skim is from the skimr package. It is a nice convention: it’s a reminder to others (especially you!). skim lets us look more at each variable. 
Here we can look at our character variables and see that there are 5 unique sites (in the n_unique output). Also, I particularly like looking at missing data. There are 6 missing values in the size_mm variable. 6.2.3 Our task So now we have an idea of our data. But now we have a task: we’ve been asked by a colleague to report about how the average size of lobsters has changed for each site across time. We will complete this task with R by using the dplyr package for data wrangling, which we will do after demoing how we would do it with pivot tables in Excel. 6.3 Pivot table demo I will demo how we will make a pivot table with our lobster data. You are welcome to sit back and watch rather than following along. First let’s summarize how many lobsters were counted each year. This means I want a count of rows by year. So to do this in Excel we would initiate the Pivot Table Process: Excel will ask what data I would like to include, and it will do its best to suggest coordinates for my data within the spreadsheet (it can have difficulty with non-rectangular or “non-tidy” data). It does a good job here of ignoring those top lines of data description. It will also suggest we make our PivotTable in a new worksheet. And then we’ll see our new sheet and a little wizard to help us create the PivotTable. 6.3.1 pivot one variable I want to start by summarizing by year, so I first drag the year variable down into the “Rows” box. What I see at this point are the years listed: this confirms that I’m going to group by years. And then, to summarize the counts for each year, I actually drag the same year variable into the “Values” box. And it will create a Pivot Table for me! But “sum” is the default summary statistic; this doesn’t make a whole lot of sense for summarizing years. I can click the little “i” icon to change this summary statistic to what I want: Count of year. 
A few things to note: The pivot table is a separate entity from our data (it’s on a different sheet); the original data has not been affected. This “keeps the raw data raw,” which is great practice. The pivot table summarizes on the variables you request, meaning that we don’t see other columns (like date, month, or site). Excel also calculates the Grand total for all sites (in bold). This is nice for communicating about data. But it can be problematic in the future, because it might not be clear that this is a calculation and not data. It could be easy to take a total of this column and introduce errors by doubling the total count. So pivot tables are great because they summarize the data and keep the raw data raw — they even promote good practice because by default they ask you if you’d like to present the data in a new sheet rather than in the same sheet. 6.3.2 pivot two variables We can include multiple variables in our PivotTable. If we want to add site as a second variable, we can drag it down: But this is comparing sites within a year; we want to compare years within a site. We can reverse the order easily enough by dragging (you just have to remember to do all of these steps the next time you’d want to repeat this): So in terms of our full task, which is to compare the average lobster size by site and year, we are on our way! I’ll leave this as a cliff-hanger here in Excel and we will carry forward in R. Just to recap what we did here: we told Excel we wanted to group by something (here: year and site) and then summarize by something (here: count, not sum!) 6.4 group_by() %>% summarize() In R, we can create the functionality of pivot tables with the same logic: we will tell R to group by something and then summarize by something. Visually, it looks like this: This graphic is from RStudio’s old-school data wrangling cheatsheet; all cheatsheets available from https://rstudio.com/resources/cheatsheets). 
It’s incredibly powerful to visualize what we are talking about with our data when we do these kinds of operations. And in code, it looks like this: data %>% group_by() %>% summarize() It reads: “Take the data and then group by something and then summarize by something.” The pipe operator %>% is a really critical feature of the dplyr package, originally created for the magrittr package. It lets us chain together steps of our data wrangling, enabling us to tell a clear story about our entire data analysis. This is not only a written story to archive what we’ve done, but it will be a reproducible story that can be rerun and remixed. It is not difficult to read as a human, and it is not a series of clicks to remember. Let’s try it out! 6.4.1 group_by one variable Let’s use group_by() %>% summarize() with our lobsters data, just like we did in Excel. We will first group_by year and then summarize by count, using the function n() (in the dplyr package). n() counts the number of times an observation shows up, and since this is uncounted data, this will count each row. We can say this out loud while we write it: “take the lobsters data and then group_by year and then summarize by count in a new column we’ll call count_by_year.” lobsters %>% group_by(year) %>% summarize(count_by_year = n()) Notice how together, group_by and summarize minimize the amount of information we see. We also saw this with the pivot table. We lose the other columns that aren’t involved here. Question: What if you don’t group_by first? Let’s try it and discuss what’s going on. lobsters %>% summarize(count = n()) ## # A tibble: 1 x 1 ## count ## <int> ## 1 2893 So if we don’t group_by first, we will get a single summary statistic (the total row count, in this case) for the whole dataset. Another question: what if we only group_by? 
lobsters %>% group_by(year) ## # A tibble: 2,893 x 7 ## # Groups: year [5] ## year month date site transect replicate size_mm ## <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> ## 1 2012 8 8/20/12 ivee 3 A 70 ## 2 2012 8 8/20/12 ivee 3 B 60 ## 3 2012 8 8/20/12 ivee 3 B 65 ## 4 2012 8 8/20/12 ivee 3 B 70 ## 5 2012 8 8/20/12 ivee 3 B 85 ## 6 2012 8 8/20/12 ivee 3 C 60 ## 7 2012 8 8/20/12 ivee 3 C 65 ## 8 2012 8 8/20/12 ivee 3 C 67 ## 9 2012 8 8/20/12 ivee 3 D 70 ## 10 2012 8 8/20/12 ivee 4 B 85 ## # … with 2,883 more rows R doesn’t summarize our data, but you can see from the output that it is indeed grouped. However, we haven’t done anything to the original data: we are only exploring. We are keeping the raw data raw. To convince ourselves, let’s now check the lobsters variable. We can do this by clicking on lobsters in the Environment pane in RStudio. We see that we haven’t changed any of our original data that was stored in this variable. (Just like how the pivot table didn’t affect the raw data on the original sheet). Aside: You’ll also see that when you click on the variable name in the Environment pane, View(lobsters) shows up in your Console. View() (capital V) is the R function to view any variable in the viewer. So this is something that you can write in your RMarkdown script, although RMarkdown will not be able to knit this view feature into the formatted document. So, if you want to include View() in your RMarkdown document you will need to either comment it out #View() or add eval=FALSE to the top of the code chunk so that the full line reads {r, eval=FALSE}. 6.4.2 group_by multiple variables Great. Now let’s summarize by both year and site like we did in the pivot table. We are able to group_by more than one variable. Let’s do this together: lobsters %>% group_by(site, year) %>% summarize(count_by_siteyear = n()) We put the site first because that is what we want as an end product. But we could easily have put year first. 
We saw visually what would happen when we did this in the Pivot Table. Great. 6.4.3 summarize multiple variables We can summarize multiple variables at a time. So far we’ve summarized the count of lobster observations. Let’s also calculate the mean and standard deviation. First let’s use the mean() function to calculate the mean. We do this within the same summarize() function, but we can add a new line to make it easier to read. Notice how when you put your cursor within the parentheses and hit return, the indentation will automatically align. lobsters %>% group_by(site, year) %>% summarize(count_by_siteyear = n(), mean_size_mm = mean(size_mm)) Aside: Command-I will properly indent selected lines. Great! But this will actually calculate some of the means as NA because one or more values in that year are NA. So we can pass an argument that says to remove NAs first before calculating the average. Let’s do that, and then also calculate the standard deviation with the sd() function: lobsters %>% group_by(site, year) %>% summarize(count_by_siteyear = n(), mean_size_mm = mean(size_mm, na.rm=TRUE), sd_size_mm = sd(size_mm, na.rm=TRUE)) So we can make the equivalent of Excel’s pivot table in R with group_by() %>% summarize(). Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting. So let’s add a variable assignment to that first line: siteyear_summary <- lobsters %>% group_by(site, year) %>% summarize(count_by_siteyear = n(), mean_size_mm = mean(size_mm, na.rm = TRUE), sd_size_mm = sd(size_mm, na.rm = TRUE)) ## `summarise()` regrouping output by 'site' (override with `.groups` argument) ## inspect our new variable siteyear_summary 6.4.4 Table formatting with kable() There are several options for formatting tables in RMarkdown; we’ll show one here from the kableExtra package and learn more about it tomorrow. 
It works nicely with the pipe operator, so we can do this from our new object: ## make a table with our new variable siteyear_summary %>% kable() 6.4.5 R code in-line in RMarkdown Before we let you try this on your own, let’s go outside of our code chunk and write in Markdown. I want to demo something that is a really powerful RMarkdown feature that we can already leverage with what we know in R. Write this in Markdown but replace the # with a backtick (`): “There are #r nrow(lobsters)# total lobsters included in this report.” Let’s knit to see what happens. I hope you can start to imagine the possibilities. If you wanted to write which year had the most observations, or which site had a decreasing trend, you would be able to. 6.4.6 Activity Build from our analysis and calculate the median lobster size for each site and year. Your calculation will use the size_mm variable and a function to calculate the median (Hint: ?median) create and ggsave() a plot. Then, save, commit, and push your .Rmd, .html, and .png. Solution (no peeking): siteyear_summary <- lobsters %>% group_by(site, year) %>% summarize(count_by_siteyear = n(), mean_size_mm = mean(size_mm, na.rm = TRUE), sd_size_mm = sd(size_mm, na.rm = TRUE), median_size_mm = median(size_mm, na.rm = TRUE)) ## `summarise()` regrouping output by 'site' (override with `.groups` argument) ## a ggplot option: ggplot(data = siteyear_summary, aes(x = year, y = median_size_mm, color = site)) + geom_line() ggsave(here("figures", "lobsters-line.png")) ## Saving 7 x 5 in image ## another option: ggplot(siteyear_summary, aes(x = year, y = median_size_mm)) + geom_col() + facet_wrap(~site) ggsave(here("figures", "lobsters-col.png")) ## Saving 7 x 5 in image Don’t forget to knit, commit, and push! Nice work everybody. 6.5 Oh no, they sent the wrong data! Oh no! 
After all our analyses and everything we’ve done, our colleague just emailed us at 4:30pm on Friday that he sent the wrong data and we need to redo all our analyses with a new .xlsx file: lobsters2.xlsx, not lobsters.xlsx. Aaaaah! If we were doing this in Excel, this would be a bummer; we’d have to rebuild our pivot table and click through all of our logic again. And then export our figures and save them into our report. But, since we did it in R, we are much safer. R’s power lies not only in analysis, but also in automation and reproducibility. This means we can go back to the top of our RMarkdown file, and read in this new data file, and then re-knit. We will still need to check that everything outputs correctly (and that column headers haven’t been renamed), but our first pass will be to update the filename and re-knit: ## read in data lobsters <- read_xlsx(here("data/lobsters2.xlsx"), skip=4) And now we can see that our plot updated as well: siteyear_summary <- lobsters %>% group_by(site, year) %>% summarize(count_by_siteyear = n(), mean_size_mm = mean(size_mm, na.rm = TRUE), sd_size_mm = sd(size_mm, na.rm = TRUE), median_size_mm = median(size_mm, na.rm = TRUE), ) ## `summarise()` regrouping output by 'site' (override with `.groups` argument) siteyear_summary ## a ggplot option: ggplot(data = siteyear_summary, aes(x = year, y = median_size_mm, color = site)) + geom_line() ggsave(here("figures", "lobsters-line.png")) ## another option: ggplot(siteyear_summary, aes(x = year, y = median_size_mm)) + geom_col() + facet_wrap(~site) ggsave(here("figures", "lobsters-col.png")) 6.5.1 Knit, push, & show differences on GitHub So cool. 6.5.2 dplyr::count() Now that we’ve spent time with group_by %>% summarize, there is a shortcut if you only want to summarize by count. This is with a function called count(), and it will group_by your selected variable, count, and then also ungroup. 
It looks like this: lobsters %>% count(site, year) ## This is the same as: lobsters %>% group_by(site, year) %>% summarize(n = n()) %>% ungroup() Hey, we could update our RMarkdown text knowing this: There are #r count(lobsters)# total lobsters included in this summary. Switching gears… 6.6 mutate() There are a lot of times where you don’t want to summarize your data, but you do want to operate beyond the original data. This is often done by adding a column. We do this with the mutate() function from dplyr. Let’s try this with our original lobsters data. The sizes are in millimeters but let’s say it was important for them to be in meters. We can add a column with this calculation: lobsters %>% mutate(size_m = size_mm / 1000) If we want to add a column that has the same value repeated, we can pass it just one value, either a number or a character string (in quotes). And let’s save this as a variable called lobsters_detailed lobsters_detailed <- lobsters %>% mutate(size_m = size_mm / 1000, millenia = 2000, observer = "Allison Horst") 6.7 select() We will end with one final function, select. This is how to choose, retain, and move your data by columns: Let’s say that we want to present this data finally with only columns for date, site, and size in meters. We would do this: lobsters_detailed %>% select(date, site, size_m) One last time, let’s knit, save, commit, and push to GitHub. 6.7.1 END dplyr-pivot-tables session! "],["tidying.html", "Chapter 7 Tidying 7.1 Summary 7.2 Set-up 7.3 tidyr::pivot_longer() to reshape from wider-to-longer format 7.4 tidyr::pivot_wider() to convert from longer-to-wider format 7.5 janitor::clean_names() to clean up column names 7.6 tidyr::unite() and tidyr::separate() to combine or separate information in column(s) 7.7 stringr::str_replace() to replace a pattern", " Chapter 7 Tidying 7.1 Summary In previous sessions, we learned to read in data, do some wrangling, and create a graph and table. 
Here, we’ll continue by reshaping data frames (converting from long-to-wide, or wide-to-long format), separating and uniting variable (column) contents, and finding and replacing string patterns. 7.1.1 Tidy data “Tidy” might sound like a generic way to describe non-messy looking data, but it is actually a specific data structure. When data is tidy, it is rectangular with each variable as a column, each row an observation, and each cell contains a single value (see: Ch. 12 in R for Data Science by Grolemund & Wickham). 7.1.2 Objectives In this session we’ll learn some tools to help make our data tidy and more coder-friendly. Those include: Use tidyr::pivot_wider() and tidyr::pivot_longer() to reshape data frames janitor::clean_names() to make column headers more manageable tidyr::unite() and tidyr::separate() to merge or separate information from different columns Detect or replace a string with stringr functions 7.1.3 Resources – Ch. 12 Tidy Data, in R for Data Science by Grolemund & Wickham - tidyr documentation from tidyverse.org - janitor repo / information from Sam Firke 7.2 Set-up 7.2.1 Create a new R Markdown and attach packages Open your project from Day 1 (click on the .Rproj file) PULL to make sure your project is up to date Create a new R Markdown file called my_tidying.Rmd Remove all example code / text below the first code chunk Attach the packages we’ll use here (library(package_name)): tidyverse here janitor readxl Knit and save your new .Rmd within the project folder. # Attach packages library(tidyverse) library(janitor) library(here) library(readxl) 7.2.2 read_excel() to read in data from an Excel worksheet We’ve used both read_csv() and read_excel() to import data from spreadsheets into R. Use read_excel() to read in the inverts.xlsx data as an object called inverts. 
inverts <- read_excel(here("data", "inverts.xlsx")) Be sure to explore the imported data a bit: View(inverts) names(inverts) summary(inverts) 7.3 tidyr::pivot_longer() to reshape from wider-to-longer format If we look at inverts, we can see that the year variable is actually split over 3 columns, so we’d say this is currently in wide format. There may be times when you want to have data in wide format, but often with code it is more efficient to convert to long format by gathering together observations for a variable that is currently split into multiple columns. Schematically, converting from wide to long format using pivot_longer() looks like this: We’ll use tidyr::pivot_longer() to gather data from all years in inverts (columns 2016, 2017, and 2018) into two columns: one called year, which contains the year; one called sp_count, containing the number of each species observed. The new data frame will be stored as inverts_long: # Note: Either single-quotes, double-quotes, OR backticks around years work! inverts_long <- pivot_longer(data = inverts, cols = '2016':'2018', names_to = "year", values_to = "sp_count") The outcome is the new long-format inverts_long data frame: inverts_long ## # A tibble: 165 x 5 ## month site common_name year sp_count ## <chr> <chr> <chr> <chr> <dbl> ## 1 7 abur california cone snail 2016 451 ## 2 7 abur california cone snail 2017 28 ## 3 7 abur california cone snail 2018 762 ## 4 7 abur california spiny lobster 2016 17 ## 5 7 abur california spiny lobster 2017 17 ## 6 7 abur california spiny lobster 2018 16 ## 7 7 abur orange cup coral 2016 24 ## 8 7 abur orange cup coral 2017 24 ## 9 7 abur orange cup coral 2018 24 ## 10 7 abur purple urchin 2016 48 ## # … with 155 more rows Hooray, long format! 
One thing that isn’t obvious at first (but would become obvious if you continued working with this data) is that since those year numbers were initially column names (characters), when they are stacked into the year column, their class wasn’t auto-updated to numeric. Explore the class of year in inverts_long: class(inverts_long$year) ## [1] "character" That’s a good thing! We don’t want R to update classes of our data without our instruction. We’ll use dplyr::mutate() in a different way here: to create a new column (that’s how we’ve used mutate() previously) that has the same name as an existing column, in order to update and overwrite the existing column. In this case, we’ll mutate() to add a column called year, which contains an as.numeric() version of the existing year variable: # Coerce "year" class to numeric: inverts_long <- inverts_long %>% mutate(year = as.numeric(year)) Checking the class again, we see that year has been updated to a numeric variable: class(inverts_long$year) ## [1] "numeric" 7.4 tidyr::pivot_wider() to convert from longer-to-wider format In the previous example, we had information spread over multiple columns that we wanted to gather. Sometimes, we’ll have data that we want to spread over multiple columns. For example, imagine that starting from inverts_long we want each species in the common_name column to exist as its own column. In that case, we would be converting from a longer to a wider format, and will use tidyr::pivot_wider().
Specifically for our data, we’ll use pivot_wider() to spread the common_name across multiple columns as follows: inverts_wide <- inverts_long %>% pivot_wider(names_from = common_name, values_from = sp_count) inverts_wide ## # A tibble: 33 x 8 ## month site year `california con… `california spi… `orange cup cor… ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 7 abur 2016 451 17 24 ## 2 7 abur 2017 28 17 24 ## 3 7 abur 2018 762 16 24 ## 4 7 ahnd 2016 27 16 24 ## 5 7 ahnd 2017 24 16 24 ## 6 7 ahnd 2018 24 16 24 ## 7 7 aque 2016 4971 48 1526 ## 8 7 aque 2017 1752 48 1623 ## 9 7 aque 2018 2616 48 1859 ## 10 7 bull 2016 1735 24 36 ## # … with 23 more rows, and 2 more variables: `purple urchin` <dbl>, `rock ## # scallop` <dbl> We can see that now each species has its own column (wider format). But also notice that those column headers (since they have spaces) might not be in the most coder-friendly format… 7.5 janitor::clean_names() to clean up column names The janitor package by Sam Firke is a great collection of functions for some quick data cleaning, like: janitor::clean_names(): update column headers to a case of your choosing janitor::get_dupes(): see all rows that are duplicates within variables you choose janitor::remove_empty(): remove empty rows and/or columns janitor::adorn_*(): jazz up tables Here, we’ll use janitor::clean_names() to convert all of our column headers to a more convenient case - the default is lower_snake_case, which means all spaces and symbols are replaced with an underscore (or a word describing the symbol), all characters are lowercase, and a few other nice adjustments. For example, janitor::clean_names() would update these nightmare column names into much nicer forms: My...RECENT-income! 
becomes my_recent_income SAMPLE2.!test1 becomes sample2_test1 ThisIsTheName becomes this_is_the_name 2015 becomes x2015 If we wanted to then use these columns (which we probably would, since we created them), we could clean the names to get them into more coder-friendly lower_snake_case with janitor::clean_names(): inverts_wide <- inverts_wide %>% clean_names() names(inverts_wide) ## [1] "month" "site" ## [3] "year" "california_cone_snail" ## [5] "california_spiny_lobster" "orange_cup_coral" ## [7] "purple_urchin" "rock_scallop" And there are other case options in clean_names(), like: “snake” produces snake_case (the default) “lower_camel” or “small_camel” produces lowerCamel “upper_camel” or “big_camel” produces UpperCamel “screaming_snake” or “all_caps” produces ALL_CAPS “lower_upper” produces lowerUPPER “upper_lower” produces UPPERlower 7.6 tidyr::unite() and tidyr::separate() to combine or separate information in column(s) Sometimes we’ll want to separate contents of a single column into multiple columns, or combine entries from different columns into a single column. For example, the following data frame has genus and species in separate columns: We may want to combine the genus and species into a single column, scientific_name: Or we may want to do the reverse (separate information from a single column into multiple columns). Here, we’ll learn tidyr::unite() and tidyr::separate() to help us do both. 7.6.1 tidyr::unite() to merge information from separate columns Use tidyr::unite() to combine information from multiple columns into a single column (as for the scientific name example above) To demonstrate uniting information from separate columns, we’ll make a single column that has the combined information from site abbreviation and year in inverts_long. 
We need to give tidyr::unite() several arguments: data: the data frame containing columns we want to combine (or pipe into the function from the data frame) col: the name of the new “united” column the columns you are uniting sep: the symbol, value or character to put between the united information from each column inverts_unite <- inverts_long %>% unite(col = "site_year", # What to name the new united column c(site, year), # The columns we'll unite (site, year) sep = "_") # How to separate the things we're uniting ## # A tibble: 6 x 4 ## month site_year common_name sp_count ## <chr> <chr> <chr> <dbl> ## 1 7 abur_2016 california cone snail 451 ## 2 7 abur_2017 california cone snail 28 ## 3 7 abur_2018 california cone snail 762 ## 4 7 abur_2016 california spiny lobster 17 ## 5 7 abur_2017 california spiny lobster 17 ## 6 7 abur_2018 california spiny lobster 16 7.6.1.1 Activity: Task: Create a new object called ‘inverts_moyr,’ starting from inverts_long, that unites the month and year columns into a single column named “mo_yr,” using a slash “/” as the separator. Then try updating the separator to something else! Like “hello!” Solution: inverts_moyr <- inverts_long %>% unite(col = "mo_yr", # What to name the new united column c(month, year), # The columns we'll unite (month, year) sep = "/") Merging information from > 2 columns (not done in workshop) tidyr::unite() can also combine information from more than two columns.
For example, to combine the site, common_name and year columns from inverts_long, we could use: # Uniting more than 2 columns: inverts_triple_unite <- inverts_long %>% tidyr::unite(col = "year_site_name", c(year, site, common_name), sep = "-") # Note: this is a dash head(inverts_triple_unite) ## # A tibble: 6 x 3 ## month year_site_name sp_count ## <chr> <chr> <dbl> ## 1 7 2016-abur-california cone snail 451 ## 2 7 2017-abur-california cone snail 28 ## 3 7 2018-abur-california cone snail 762 ## 4 7 2016-abur-california spiny lobster 17 ## 5 7 2017-abur-california spiny lobster 17 ## 6 7 2018-abur-california spiny lobster 16 7.6.2 tidyr::separate() to separate information into multiple columns While tidyr::unite() allows us to combine information from multiple columns, it’s more likely that you’ll start with a single column that you want to split up into pieces. For example, I might want to split up a column containing the genus and species (Scorpaena guttata) into two separate columns (Scorpaena | guttata), so that I can count how many Scorpaena organisms exist in my dataset at the genus level. Use tidyr::separate() to “separate a character column into multiple columns using a regular expression separator.” Let’s start again with inverts_unite, where we have combined the site and year into a single column called site_year. If we want to separate those, we can use: inverts_sep <- inverts_unite %>% tidyr::separate(site_year, into = c("my_site", "my_year")) 7.7 stringr::str_replace() to replace a pattern Was data entered in a way that’s difficult to code with, or is just plain annoying? Did someone wrongly enter “fish” as “fsh” throughout the spreadsheet, and you want to update it everywhere? Use stringr::str_replace() to automatically replace a string pattern. 
Warning: The pattern will be replaced everywhere - so if you ask to replace “fsh” with “fish,” then “offshore” would be updated to “offishore.” Be careful to ensure that when you think you’re making one replacement, you’re not also replacing something else unexpectedly. Starting with inverts, let’s replace “california” with the abbreviation “CA” anywhere it appears: ca_abbr <- inverts %>% mutate( common_name = str_replace(common_name, pattern = "california", replacement = "CA") ) Now, check to confirm that “california” has been replaced with “CA.” 7.7.1 END tidying session! "],["filter-join.html", "Chapter 8 Filters and joins 8.1 Summary 8.2 Set-up: Create a new .Rmd, attach packages & get data 8.3 dplyr::filter() to conditionally subset by rows 8.4 dplyr::*_join() to merge data frames 8.5 An HTML table with kable() and kableExtra", " Chapter 8 Filters and joins 8.1 Summary In previous sessions, we’ve learned to do some basic wrangling and find summary information with functions in the dplyr package, which exists within the tidyverse. In this session, we’ll expand our data wrangling toolkit using: filter() to conditionally subset our data by rows, and *_join() functions to merge data frames together And we’ll make a nicely formatted HTML table with kable() and kableExtra The combination of filter() and *_join() - to return rows satisfying a condition we specify, and merging data frames by like variables - is analogous to the useful VLOOKUP function in Excel.
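As a preview of that VLOOKUP analogy, here is a minimal sketch using hypothetical toy tibbles (obs, lookup, and site_name are invented for illustration, not workshop data):

```r
library(tidyverse)

# Toy observations and a toy "lookup table" (hypothetical values)
obs    <- tibble(site = c("abur", "mohk", "abur"), count = c(4, 7, 2))
lookup <- tibble(site = c("abur", "mohk"),
                 site_name = c("Arroyo Burro", "Mohawk Reef"))

# Like VLOOKUP: pull site_name into each row of obs by matching on site
obs %>% left_join(lookup, by = "site")
```

Where VLOOKUP matches one cell at a time, the join matches every row of obs against the lookup table in a single step.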
8.1.1 Objectives Use filter() to subset data frames, returning rows that satisfy variable conditions Use full_join(), left_join(), and inner_join() to merge data frames, with different endpoints in mind Use filter() and *_join() as part of a wrangling sequence 8.1.2 Resources filter() documentation from tidyverse.org join() documentation from tidyverse.org Chapters 5 and 13 in R for Data Science by Garrett Grolemund and Hadley Wickham “Create awesome HTML tables with knitr::kable() and kableExtra” by Hao Zhu 8.2 Set-up: Create a new .Rmd, attach packages & get data Create a new R Markdown document in your r-workshop project and knit to save as filter_join.Rmd. Remove all the example code (everything below the set-up code chunk). In this session, we’ll attach four packages: tidyverse readxl here kableExtra Attach the packages in the setup code chunk in your .Rmd: library(tidyverse) library(readxl) library(here) library(kableExtra) Then create a new code chunk to read in two files from your ‘data’ subfolder: fish.csv kelp_fronds.xlsx (read in only the “abur” worksheet by adding argument sheet = "abur" to read_excel()) # Read in data: fish <- read_csv(here("data", "fish.csv")) kelp_abur <- read_excel(here("data", "kelp_fronds.xlsx"), sheet = "abur") We should always explore the data we’ve read in. Use functions like View(), names(), summary(), head() and tail() to check them out. Now, let’s use filter() to decide which observations (rows) we’ll keep or exclude in new subsets, similar to using Excel’s VLOOKUP function or filter tool. 8.3 dplyr::filter() to conditionally subset by rows Use filter() to let R know which rows you want to keep or exclude, based on whether or not their contents match conditions that you set for one or more variables.
Some examples in words that might inspire you to use filter(): “I only want to keep rows where the temperature is greater than 90°F.” “I want to keep all observations except those where the tree type is listed as unknown.” “I want to make a new subset with only data for mountain lions (the species variable) in California (the state variable).” When we use filter(), we need to let R know a couple of things: What data frame we’re filtering from What condition(s) we want observations to match and/or not match in order to keep them in the new subset Here, we’ll learn some common ways to use filter(). 8.3.1 Filter rows by matching a single character string Let’s say we want to keep all observations from the fish data frame where the common name is “garibaldi” (fun fact: that’s California’s official marine state fish, protected in California coastal waters!). Here, we need to tell R to only keep rows from the fish data frame when the common name (common_name variable) exactly matches garibaldi. Use == to ask R to look for exact matches: fish_garibaldi <- fish %>% filter(common_name == "garibaldi") Check out the fish_garibaldi object to ensure that only garibaldi observations remain. 8.3.1.1 Activity Task: Create a subset starting from the fish data frame, stored as object fish_mohk, that only contains observations from Mohawk Reef (site entered as “mohk”). Solution: fish_mohk <- fish %>% filter(site == "mohk") Explore the subset you just created to ensure that only Mohawk Reef observations are returned. 8.3.2 Filter rows based on numeric conditions Use expected operators (>, <, >=, <=, ==) to set conditions for a numeric variable when filtering. For this example, we only want to retain observations when the total_count column value is >= 50: fish_over50 <- fish %>% filter(total_count >= 50) 8.3.3 Filter to return rows that match this OR that OR that What if we want to return a subset of the fish df that contains garibaldi, blacksmith OR black surfperch? 
There are several ways to write an “OR” statement for filtering, which will keep any observations that match Condition A or Condition B or Condition C. In this example, we will create a subset from fish that only contains rows where the common_name is garibaldi or blacksmith or black surfperch. Way 1: Use the vertical line operator | to indicate “OR”: fish_3sp <- fish %>% filter(common_name == "garibaldi" | common_name == "blacksmith" | common_name == "black surfperch") Alternatively, if you’re looking for multiple matches in the same variable, you can use the %in% operator instead. Use %in% to ask R to look for any matches within a vector: fish_3sp <- fish %>% filter(common_name %in% c("garibaldi", "blacksmith", "black surfperch")) Notice that the two methods above return the same thing. Critical thinking: In what scenario might you NOT want to use %in% for an “or” filter statement? Hint: What if the “or” conditions aren’t different outcomes for the same variable? 8.3.3.1 Activity Task: Create a subset from fish called fish_gar_2016 that keeps all observations if the year is 2016 OR the common name is “garibaldi.” Solution: fish_gar_2016 <- fish %>% filter(year == 2016 | common_name == "garibaldi") 8.3.4 Filter to return observations that match this AND that In the examples above, we learned to keep observations that matched any of a number of conditions (or statements). Sometimes we’ll only want to keep observations that satisfy multiple conditions (e.g., to be kept, an observation must satisfy this condition AND that condition).
For example, we may want to create a subset that only returns rows from fish where the year is 2018 and the site is Arroyo Quemado “aque” In filter(), add a comma (or ampersand ‘&’) between arguments for multiple “and” conditions: aque_2018 <- fish %>% filter(year == 2018, site == "aque") Check it out to see that only observations where the site is “aque” in 2018 are retained: aque_2018 ## # A tibble: 5 x 4 ## year site common_name total_count ## <dbl> <chr> <chr> <dbl> ## 1 2018 aque black surfperch 2 ## 2 2018 aque blacksmith 1 ## 3 2018 aque garibaldi 1 ## 4 2018 aque rock wrasse 4 ## 5 2018 aque senorita 36 Like most things in R, there are other ways to do the same thing. For example, you could do the same thing using & (instead of a comma) between “and” conditions: # Use the ampersand (&) to add another condition "and this must be true": aque_2018 <- fish %>% filter(year == 2018 & site == "aque") Or you could just do two filter steps in sequence: # Written as sequential filter steps: aque_2018 <- fish %>% filter(year == 2018) %>% filter(site == "aque") 8.3.5 Activity: combined filter conditions Challenge task: Create a subset from the fish data frame, called low_gb_wr that only contains: Observations for garibaldi or rock wrasse AND the total_count is less than or equal to 10 Solution: low_gb_wr <- fish %>% filter(common_name %in% c("garibaldi", "rock wrasse"), total_count <= 10) 8.3.6 stringr::str_detect() to filter by a partial pattern Sometimes we’ll want to keep observations that contain a specific string pattern within a variable of interest. For example, consider the fantasy data below: id species 1 rainbow rockfish 2 blue rockfish 3 sparkle urchin 4 royal blue fish There might be a time when we would want to use observations that: Contain the string “fish,” in isolation or within a larger string (like “rockfish”) Contain the string “blue” In those cases, it would be useful to detect a string pattern, and potentially keep any rows that contain it. 
Here, we’ll use stringr::str_detect() to find and keep observations that contain our specified string pattern. Let’s detect and keep observations from fish where the common_name variable contains string pattern “black.” Note that there are two fish, blacksmith and black surfperch, that would satisfy this condition. Using filter() + str_detect() in combination to find and keep observations where the common_name variable contains pattern “black”: fish_bl <- fish %>% filter(str_detect(common_name, pattern = "black")) So str_detect() returns a series of TRUE/FALSE responses for each row, based on whether or not they contain the specified pattern. In that example, any row that does contain “black” returns TRUE, and any row that does not contain “black” returns FALSE. 8.3.7 Activity Task: Create a new object called fish_it, starting from fish, that only contains observations if the common_name variable contains the string pattern “it.” What species remain? Solution: fish_it <- fish %>% filter(str_detect(common_name, pattern = "it")) # blacksmITh and senorITa remain! We can also exclude observations that contain a set string pattern by adding the negate = TRUE argument within str_detect(). Sync your local project to your repo on GitHub. 8.4 dplyr::*_join() to merge data frames There are a number of ways to merge data frames in R. We’ll use full_join(), left_join(), and inner_join() in this session. From R Documentation (?join): full_join(): “returns all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.” Basically, nothing gets thrown out, even if a match doesn’t exist - making full_join() the safest option for merging data frames. When in doubt, full_join(). left_join(): “return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.
If there are multiple matches between x and y, all combinations of the matches are returned.” inner_join(): “returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.” This will drop observations that don’t have a match between the merged data frames, which makes it a riskier merging option if you’re not sure what you’re trying to do. Schematic (from RStudio data wrangling cheat sheet): We will use kelp_abur as our “left” data frame, and fish as our “right” data frame, to explore different join outcomes. 8.4.1 full_join() to merge data frames, keeping everything When we join data frames in R, we need to tell R a couple of things (and it does the hard joining work for us): Which data frames we want to merge together Which variables to merge by Use full_join() to safely combine two data frames, keeping everything from both and populating with NA as necessary. Example: use full_join() to combine kelp_abur and fish: abur_kelp_fish <- kelp_abur %>% full_join(fish, by = c("year", "site")) Let’s look at the merged data frame with View(abur_kelp_fish). A few things to notice about how full_join() has worked: All columns that existed in both data frames still exist All observations are retained, even if they don’t have a match. In this case, notice that for other sites (not ‘abur’) the observation for fish still exists, even though there was no corresponding kelp data to merge with it. The kelp frond data is joined to all observations where the joining variables (year, site) are a match, which is why it is repeated 5 times for each year (once for each fish species). Because all data (observations & columns) are retained, full_join() is the safest option if you’re unclear about how to merge data frames.
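The NA-filling behavior of full_join() is easy to see with two tiny, partially overlapping data frames (kelp_toy and fish_toy are hypothetical, invented just for this sketch):

```r
library(tidyverse)

# Toy data frames sharing only some sites (hypothetical values)
kelp_toy <- tibble(site = c("abur", "aque"), fronds = c(12, 8))
fish_toy <- tibble(site = c("aque", "mohk"), total_count = c(30, 5))

# Nothing is dropped: unmatched rows get NA in the columns they lack
full_join(kelp_toy, fish_toy, by = "site")
```

All three sites appear in the result: "abur" has NA for total_count, "mohk" has NA for fronds, and only "aque" (the shared site) is complete.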
8.4.2 left_join(x,y) to merge data frames, keeping everything in the ‘x’ data frame and only matches from the ‘y’ data frame Now, we want to keep all observations in kelp_abur, and merge them with fish while only keeping observations from fish that match an observation within kelp_abur. When we use left_join(), any observations from fish that don’t have a match (by year and site) in kelp_abur won’t be retained, because they wouldn’t have a match in the left data frame. kelp_fish_left <- kelp_abur %>% left_join(fish, by = c("year","site")) Notice when you look at kelp_fish_left, data for other sites that exist in fish do not get joined, because left_join(df_a, df_b) will only keep observations from df_b if they have a match in df_a! 8.4.3 inner_join() to merge data frames, only keeping observations with a match in both Use inner_join() if you only want to retain observations that have matches across both data frames. Caution: this is built to exclude any observations that don’t match across data frames by joined variables - double check to make sure this is actually what you want to do! For example, if we use inner_join() to merge fish and kelp_abur, then we are asking R to only return observations where the joining variables (year and site) have matches in both data frames. Let’s see what the outcome is: kelp_fish_injoin <- kelp_abur %>% inner_join(fish, by = c("year", "site")) # kelp_fish_injoin Here, we see that only observations (rows) where there is a match for year and site in both data frames are returned. 8.4.4 filter() and join() in a sequence Now let’s combine what we’ve learned about piping, filtering and joining!
Let’s complete the following as part of a single sequence (remember, check to see what you’ve produced after each step) to create a new data frame called my_fish_join: Start with the fish data frame Filter fish to only include observations for 2017 at Arroyo Burro Join the kelp_abur data frame to the resulting subset using left_join() Add a new column that contains the ‘fish per kelp fronds’ density (total_count / total_fronds) That sequence might look like this: my_fish_join <- fish %>% filter(year == 2017, site == "abur") %>% left_join(kelp_abur, by = c("year", "site")) %>% mutate(fish_per_frond = total_count / total_fronds) Explore the resulting my_fish_join data frame. 8.5 An HTML table with kable() and kableExtra With any data frame, you can make a nicer-looking table in your knitted HTML using knitr::kable() and functions in the kableExtra package. Start by using kable() with my_fish_join, and see what the default HTML table looks like in your knitted document: kable(my_fish_join) Simple, but quick to get a clear & useful table! Now let’s spruce it up a bit with kableExtra::kable_styling() to modify HTML table styles: my_fish_join %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE) …with many other options for customizing HTML tables! Make sure to check out “Create awesome HTML tables with knitr::kable() and kableExtra” by Hao Zhu for more examples and options. Sync your project with your repo on GitHub 8.5.1 End filter() + _join() section! "],["collaborating.html", "Chapter 9 Collaborating & getting help 9.1 Summary 9.2 R communities 9.3 How to use Twitter for #rstats 9.4 Getting help 9.5 Collaborating with GitHub 9.6 Merge conflicts 9.7 Create your collaborative website", " Chapter 9 Collaborating & getting help 9.1 Summary Since the GitHub session (Chapter 4), we have been practicing using GitHub with RStudio to collaborate with our most important collaborator: Future You.
Here we will practice using GitHub with RStudio to collaborate with others now, with a mindset towards Future Us (your colleagues that you know and have yet to meet). We will also learn how to engage with the #rstats community, including how to engage on Twitter, and how to ask for help. We are going to teach you the simplest way to collaborate with someone, which is for both of you to have privileges to directly edit and add files to a repository. GitHub is built for software developer teams, and there are a lot of features that limit who can directly edit files (which leads to “pull requests”), but we won’t cover that today. 9.1.1 Objectives intro to R communities How to effectively ask for help Googling. Error messages are your friends How to use Twitter for #rstats Create a reproducible example with reprex create a new repo and give permission to a collaborator publish webpages online 9.1.2 Resources ESM 206 Intro to data science & stats, specifically ESM Lecture 2 - by Allison Horst Finding the YOU in the R community - by Thomas Mock reprex.tidyverse.org Reprex webinar - by Jenny Bryan Getting help in R: do as I say, not as I’ve done by Sam Tyner Making free websites with RStudio’s R Markdown - by Julie Lowndes 9.2 R communities We are going to start off by talking about communities that exist around R and how you can engage with them. R communities connect online and in person. And we use Twitter as a platform to connect with each other. Yes, Twitter is a legit tool for data science. Most communities have some degree of in-person and online presence, with Twitter being a big part of that online presence, and it enables you to talk directly with people. On Twitter, we connect using the #rstats hashtag, and are thus often called the “rstats community” (more on Twitter in a moment). This is a small (and incomplete!) sampling to give you a sense of a few communities. Please see Thomas Mock’s presentation Finding the YOU in the R community for more details.
9.2.0.1 RStudio Community What is it: Online community forum for all R & RStudio questions Location: online at community.rstudio.com Also: RStudio on Twitter 9.2.0.2 RLadies RLadies is a world-wide organization to promote gender diversity in the R community. Location: online at rladies.org, on Twitter at rladiesglobal Also: WeAreRLadies 9.2.0.3 rOpenSci What is it: rOpenSci builds software with a community of users and developers, and educates scientists about transparent research practices. Location: online at ropensci.org, on Twitter at ropensci Also: roknowtifier, rocitations 9.2.0.4 R User Groups What is it: R User Groups (“RUGs”) are in-person meetups supported by The R Consortium. Location: local chapters. See a list of RUGs and conferences. Also: example: Los Angeles R Users Group 9.2.0.5 The Carpentries What is it: Network teaching foundational data science skills to researchers worldwide Location: online at carpentries.org, on Twitter at thecarpentries, local workshops worldwide 9.2.0.6 R4DS Community What is it: A community of R learners at all skill levels working together to improve our skills. Location: on Twitter: R4DScommunity, on Slack — sign up from rfordatasci.com Also: #tidytuesday, R4DS_es 9.2.1 Community awesomeness Example with Sam Firke’s janitor package: sfirke.github.io/janitor, highlighting the excel_numeric_to_date function and learning about it through Twitter. 9.3 How to use Twitter for #rstats Twitter is how we connect with other R users, learn from each other, develop together, and become friends. Especially at an event like RStudio::conf, it is a great way to connect and stay connected with folks you meet. Twitter is definitely a firehose of information, but if you use it deliberately, you can hear the signal through the noise. I was super skeptical of Twitter. I thought it was a megaphone for angry people. But it turns out it is a place to have small, thoughtful conversations and be part of innovative and friendly communities.
9.3.1 Examples Here are a few examples of how to use Twitter for #rstats. When I saw this tweet by Md_Harris, this was my internal monologue: Cool visualization! I want to represent my data this way He includes his code that I can look at to understand what he did, and I can run and remix The package is from sckottie — who I know from rOpenSci, which is a really amazing software developer community for science rnoaa is a package making NOAA [US environmental] data more accessible! I didn’t know about this, it will be so useful for my colleagues I will retweet so my network can benefit as well Another example, this tweet where JennyBryan is asking for feedback on a super useful package for interfacing between R and excel: readxl. My internal monologue: Yay, readxl is awesome, and also getting better thanks to Jenny Do I have any spreadsheets to contribute? In any case, I will retweet so others can contribute. And I’ll like it too because I appreciate this work 9.3.2 How to Twitter My advice for Twitter is to start off small and deliberately. Curate who you follow and start by listening. I use Twitter deliberately for R and science communities, so that is the majority of the folks I follow (but of course I also follow Mark Hamill). So start using Twitter to listen and learn, and then as you gradually build up courage, you can like and retweet things. And remember that liking and retweeting is not only a way to engage with the community yourself, but it is also a way to welcome and amplify other people. Sometimes I just reply saying how cool something is. Sometimes I like it. Sometimes I retweet. Sometimes I retweet with a quote/comment. But I also miss a lot of things since I limit how much time I give to Twitter, and that’s OK. You will always miss things but you are part of the community and they are there for you like you are for them.
If you’re joining Twitter to learn R, I suggest following: hadleywickham JennyBryan rOpenSci WeAreRLadies Listen to what they say and who joins those conversations, and follow other people and organizations. You could also look at who they are following. Also, check out the #rstats hashtag. This is not something that you can follow (although you can have it as a column in software like TweetDeck), but you can search it and you’ll see that the people you follow use it to help tag conversations. You’ll find other useful tags as well, within your domain, as well as other R-related interests, e.g. #rspatial. When I read marine science papers, I see if the authors are on Twitter; I sometimes follow them, ask them questions, or just tell them I liked their work! You can also follow us: juliesquid allison_horst jamiecmonty ECOuture9 These are just a few ways to learn and build community on Twitter. And as you feel comfortable, you can start sharing your ideas or your links too. Live-tweeting is a really great way to engage as well, and bridge in-person conferences with online communities. And of course, in addition to engaging on Twitter, check whether there are local RLadies chapters or other R meetups, and join! Or perhaps start one? So Twitter is a place to engage with folks and learn, and while it is also a place to ask questions, there are other places to look first, depending on your question.
9.4.1 Read the error message As we’ve talked about before, they may be red, they may be unfamiliar, but error messages are your friends. There are multiple types of messages that R will print. Read the message to figure out what it’s trying to tell you. Error: There’s a fatal error in your code that prevented it from being run through successfully. You need to fix it for the code to run. Warning: Non-fatal (these don’t stop the code from running, but they flag a potential problem that you should know about). Message: Here’s some helpful information about the code you just ran (you can hide these if you want to). 9.4.2 Googling The internet has the answer to all of your R questions, hopes, and dreams. When you get an error you don’t understand, copy it and paste it into Google. You can also add “rstats” or “tidyverse” or something similar to help Google (although it’s getting really good without it too). For error messages, copy-pasting the exact message is best. But if you have a “how do I…?” type question you can also enter this into Google. You’ll develop the vocabulary you need to refine your search terms as you become more familiar with R. It’s a continued learning process. And just as important as Googling your error message is being able to identify a useful result. Something I can’t emphasize enough: pay attention to URLs. They tell you the source, and they help you find pages again; often remembering a few things about a page will let you either google it again or navigate back there yourself. Check the date, check the source, check the relevance. Is this a modern solution, or one from 2013? Do I trust the person responding? Is this about my question or on a different topic? You will see links from many places, particularly: RStudio Community Stack Overflow Books, blogs, tutorials, courses, webinars GitHub Issues 9.4.3 Create a reprex A “reprex” is a REPRoducible EXample: code that you need help with and want to ask someone about. 
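To see the three message types side by side, here is a small sketch you can run in the Console (tryCatch() and withCallingHandlers() are only used here to capture the conditions so the script keeps running):

```r
# Error: fatal -- execution stops unless the error is caught.
err <- tryCatch(
  log("a"),  # non-numeric argument to a math function
  error = function(e) conditionMessage(e)
)

# Warning: non-fatal -- the code still returns a result (NaN here).
val <- withCallingHandlers(
  log(-1),   # produces NaN plus a "NaNs produced" warning
  warning = function(w) invokeRestart("muffleWarning")
)

# Message: purely informational, printed to the console.
message("Heads up: this is just helpful information.")
```

Running log("a") or log(-1) directly in the Console shows the red error and warning text that this sketch captures.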
Jenny Bryan made the reprex package because “conversations about code are more productive with code that actually runs, that I don’t have to run, and that I can easily run.” Let me demo an example, and then you will do it yourself. This is Jenny’s summary from her reprex webinar of what I’ll do: reprex is part of the tidyverse, so we all already have it installed, but we do need to attach it: library(reprex) First let me create a little example that I have a question about. I want to know how I change the color of the geom_points in my ggplot. (Reminder: this example is to illustrate reprex, not how you would actually look in the help pages!!!) I’ll type into our RMarkdown script: library(tidyverse) ggplot(cars, aes(speed, dist)) + geom_point() So this is the code I have a question about. My next step is to select it all and copy it into my clipboard. Then I go to my Console and type: reprex() Reprex does its thing, making sure this is a reproducible example — this wouldn’t be without library(tidyverse)! — and displaying it in my Viewer on the bottom-right of the RStudio IDE. reprex includes the output — experienced programmers who you might be asking for help can often read your code and know where the problem lies, especially when they can see the output. When it finishes, I also have what I see in the Viewer copied in my clipboard. I can paste it anywhere! In an email, a Google Doc, in Slack. I’m going to paste mine in an Issue for my r-workshop repository. When I paste it: Notice that following the backticks, there is only r, not r{}. This is because what we have pasted is formatted so that GitHub can display it as R code, but it won’t be executed by R through RMarkdown. I can click on the Preview button in the Issues to see how this will render: and it will show my code, nicely formatted for R. So in this example I might write at the top of the comment: “Practicing a reprex and Issues. allison_horst how do I change the point color to cyan?” reprex is a “workflow package”. 
That means that it’s something we don’t put in Rmds, scripts, or anything else. We use it in the Console when we are preparing to ask for help — from ourselves or someone else. 9.4.4 Activity Make a reprex using the built-in mtcars dataset and paste it in the Issues for your repository. (Have a look: head(mtcars); skimr::skim(mtcars)) install and attach the reprex package For your reprex: take the mtcars dataset and then filter it for observations where mpg is more than 26. Navigate to github.com/your_username/r-workshop/issues Hint: remember to read the error message. “could not find function %>%” means you’ve forgotten to attach the appropriate package with library() 9.4.4.1 Solution (no peeking) ## setup: run in Rmd or Console library(reprex) ## reprex code: run in Rmd or Console library(tidyverse) # or library(dplyr) or library(magrittr) mtcars %>% filter(mpg > 26) ## copy the above ## reprex call: run in Console reprex() ## paste in Issue! 9.5 Collaborating with GitHub Now we’re going to collaborate with a partner and set up for our last session, which will tie together everything we’ve been learning. 9.5.1 Create repo (Partner 1) Team up with a partner sitting next to you. Partner 1 will create a new repository. We will do this in the same way that we did in Chapter 4: Create a repository on Github.com. Let’s name it r-collab. 9.5.2 Create a gh-pages branch (Partner 1) We aren’t going to talk about branches very much, but they are a powerful feature of git/GitHub. I think of it as creating a copy of your work that becomes a parallel universe that you can modify safely because it’s not affecting your original work. And then you can choose to merge the universes back together if and when you want. By default, when you create a new repo you begin with one branch, and it is named master. When you create new branches, you can name them whatever you want. However, if you name one gh-pages (all lowercase, with a - and no spaces), this will let you create a website. 
And that’s our plan. So, Partner 1, do this to create a gh-pages branch: On the homepage for your repo on GitHub.com, click the button that says “Branch:master.” Here, you can switch to another branch (right now there aren’t any others besides master), or create one by typing a new name. Let’s type gh-pages. Let’s also change gh-pages to the default branch and delete the master branch: this will be a one-time-only thing that we do here: First click to control branches: And then click to change the default branch to gh-pages. I like to then delete the master branch when it has the little red trash can next to it. It will make you confirm that you really want to delete it, which I do! 9.5.3 Give your collaborator privileges (Partner 1 and 2) Now, Partner 1, go into Settings > Collaborators > enter Partner 2’s (your collaborator’s) username. Partner 2 then needs to check their email and accept as a collaborator. Notice that your collaborator has “Push access to the repository” (highlighted below): 9.5.4 Clone to a new R Project (Partner 1) Now let’s have Partner 1 clone the repository to their local computer. We’ll do this through RStudio like we did before (see Chapter 4: Clone your repository using RStudio), but with a final additional step before hitting “Create Project”: select “Open in a new Session.” Opening this Project in a new Session opens up a new world of awesomeness from RStudio. Having different RStudio project sessions allows you to keep your work separate and organized. So you can collaborate with this collaborator on this repository while also working on your other repository from this morning. I tend to have a lot of projects going at one time: Have a look in your git tab. Like we saw this morning, when you first clone a repo through RStudio, RStudio will add an .Rproj file to your repo. And if you didn’t add a .gitignore file when you originally created the repo on GitHub.com, RStudio will also add this for you. 
So, Partner 1, let’s go ahead and sync this back to GitHub.com. Remember: Let’s confirm that this was synced by looking at GitHub.com again. You may have to refresh the page, but you should see this commit where you added the .Rproj file. 9.5.5 Clone to a new R Project (Partner 2) Now it’s Partner 2’s turn! Partner 2, clone this repository following the same steps that Partner 1 just did. When you clone it, RStudio should not create any new files — why? Partner 1 already created and pushed the .Rproj and .gitignore files, so they already exist in the repo. 9.5.6 Create data folder (Partner 2) Partner 2, let’s create a folder for our data and copy our noaa_landings.csv there. And now let’s sync back to GitHub: Pull, Stage, Commit, Push When we inspect on GitHub.com and click to view all the commits, we’ll see commits logged from both Partners 1 and 2! Question: Would you still be able to clone a repository that you are not a collaborator on? What do you think would happen? Try it! Can you sync back? 9.5.7 State of the Repository OK, so where do things stand right now? GitHub.com has the most recent versions of all the repository’s files. Partner 2 also has these most recent versions locally. How about Partner 1? Partner 1 does not have the most recent versions of everything on their computer. Question: How can we change that? Or how could we even check? Answer: PULL. Let’s have Partner 1 go back to RStudio and Pull. If their files aren’t up-to-date, this will pull the most recent versions to their local computer. And if they already did have the most recent versions? Well, pulling doesn’t cost anything (other than an internet connection), so if everything is up-to-date, pulling is fine too. I recommend pulling every time you come back to a collaborative repository. Whether you haven’t opened RStudio in a month or you’ve just been away for a lunch break, pull. It might not be necessary, but it can save a lot of heartache later. 
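If you are curious what the RStudio Git pane buttons are doing under the hood, the Pull / Stage / Commit / Push cycle maps onto plain git commands. Here is a minimal, self-contained sketch (it builds a throwaway local repo in a temp directory so the stage and commit steps can run anywhere; pull and push are shown as comments because they need a real GitHub remote, and the file and identity below are hypothetical):

```shell
set -e
repo=$(mktemp -d)            # throwaway stand-in for your r-collab clone
cd "$repo"
git init -q
git config user.email "partner2@example.com"   # hypothetical identity
git config user.name  "Partner 2"

# git pull                   # "Pull" button: fetch + merge from GitHub.com
mkdir data
echo "year,dollars_usd" > data/noaa_landings.csv
git add data/noaa_landings.csv                 # "Stage" checkbox
git commit -q -m "Add NOAA landings data"      # "Commit" button
git log --oneline                              # our commit is recorded
# git push                   # "Push" button: send commits to GitHub.com
```

You never need the command line for this course, but seeing the mapping can demystify what RStudio is doing for you.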
9.6 Merge conflicts What kind of heartache are we talking about? Merge conflicts. Within a file, GitHub tracks changes line-by-line. So you can have collaborators working on different lines within the same file, and GitHub will be able to weave those changes into each other – that’s its job! It’s when you have collaborators working on the same lines within the same file that you can have merge conflicts. This is when there is a conflict within the same line so that GitHub can’t merge automatically. It needs a human to help decide what information to keep (which is good, because you don’t want GitHub to decide for you). Merge conflicts can be frustrating, but like R’s error messages, they are actually trying to help you. So let’s experience this together: we will create and solve a merge conflict. Stop and watch me demo how to create and solve a merge conflict with my Partner 2, and then you will do the same with your partner. Here’s what I am going to do: 9.6.1 Pull (Partners 1 and 2) Both partners go to RStudio and pull so you have the most recent versions of all your files. 9.6.2 Create a conflict (Partners 1 and 2) Now, Partners 1 and 2, both go to the README.md, and on Line 4, write something, anything. I’m not going to give any examples because when you do this I want to be sure that both Partners write something different. Save the README. 9.6.3 Sync (Partner 2) OK. Now, let’s have Partner 2 sync: pull, stage, commit, push. Just like normal. Great. 9.6.4 Sync attempts & fixes (Partner 1) Now, let’s have Partner 1 (me) try. When I try to Pull, I get the first error we will see today: “Your local changes to README.md would be overwritten by merge.” GitHub is telling me that it knows I’ve modified my README, but since I haven’t staged and committed those changes, it can’t do its job and merge my conflicts with whatever is different about the version from GitHub.com. 
This is good: the alternative would be GitHub deciding which one to keep, and it’s better that we have that kind of control and decision making. GitHub provides some guidance: either commit this work first, or “stash it,” which you can interpret as moving the README temporarily to another folder somewhere outside of this GitHub repository so that you can successfully pull and then decide your next steps. Let’s follow their advice and have Partner 1 commit. Great. Now let’s try pulling again. New error: “Merge conflict in README…fix conflicts and then commit the result.” So this error is different from the previous one: GitHub knows what has changed line-by-line in my file here, and it knows what has changed line-by-line in the version on GitHub.com. And it knows there is a conflict between them. So it’s asking me to now compare these changes, choose a preference, and commit. Note: if Partner 2 and I were not intentionally editing exactly the same lines in this demo, GitHub likely could have done its job and merged this file successfully after our first error fix above. We will again follow GitHub’s advice to fix the conflicts. Let’s close this window and inspect. Did you notice two other things that happened along with this message? First, in the Git tab, next to the README listing there are orange Us; this means that there is an unresolved conflict. It means my file is not staged with a check anymore because modifications have occurred to the file since it was staged. Second, the README file itself changed; there is new text and symbols. (We got a preview in the diff pane also.) <<<<<<< HEAD Julie is collaborating on this README. ======= **Allison is adding text here.** >>>>>>> 05a189b23372f0bdb5b42630f8cb318003cee19b In this example, Partner 1 is Julie and Partner 2 is Allison. GitHub is displaying the line that Julie wrote and the line Allison wrote, separated by =======. 
These are the two choices that I (Partner 1) have to decide between: which one do I want to keep? And where does this decision start and end? The lines are bounded by <<<<<<< HEAD and >>>>>>> followed by a long commit identifier. So, to resolve this merge conflict, Partner 1 has to choose which one to keep. And I tell GitHub my choice by deleting everything in this bundle of text except the line I want. So, Partner 1 will delete the <<<<<<< HEAD, =======, and >>>>>>> long-commit-identifier lines, along with whichever of Julie’s or Allison’s lines I don’t want to keep. I’ll do this, and then commit again. In this example, we’ve kept Allison’s line: Then I’ll stage, and write a commit message. I often write “resolving merge conflict” or something similar. When I stage the file, notice how now my edits look like a simple line replacement (compare with the image above before it was re-staged): And we’re done! We can inspect on GitHub.com that I am the most recent contributor to this repository. And if we look in the commit history we will see both Allison’s and my original commits, along with our merge conflict fix. 9.6.5 Activity Create a merge conflict with your partner, following the steps that we just did in the demo above. Practice different approaches to solving errors: for example, try stashing instead of committing. 9.6.6 How do you avoid merge conflicts? Merge conflicts can occur when you collaborate with others — I find most often it is collaborating with ME from a different computer. They will happen, but you can minimize them by getting into good habits. To minimize merge conflicts, pull often so that you are aware of anything that is different, and deal with it early. Similarly, commit and push often so that your contributions do not become too unwieldy for yourself or others later on. Also, talk with your collaborators. Are they working on the exact same file right now that you need to be? If so, coordinate with them (in person, GChat, Slack, email). 
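To make the deletion step concrete, here is the same README snippet before and after Partner 1 resolves the conflict (keeping Allison’s line, as in the demo):

```
Before:

<<<<<<< HEAD
Julie is collaborating on this README.
=======
**Allison is adding text here.**
>>>>>>> 05a189b23372f0bdb5b42630f8cb318003cee19b

After (ready to stage and commit):

**Allison is adding text here.**
```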
For example: “I’m working on X part and will push my changes before my meeting — then you can work on it and I’ll pull when I’m back.” Also, if you find yourself always working on the exact same file, you could consider breaking it into different files to minimize problems. But merge conflicts will occur, and some of them will be heartbreaking and demoralizing. They happen to me when I collaborate with myself between my work computer and laptop. We demoed small conflicts with just one file, but they can occur across many files, particularly when your code is generating figures, scripts, or HTML files. Sometimes the best approach is the burn-it-all-down method, where you delete your local copy of the repo and re-clone. Protect yourself by pulling and syncing often! 9.7 Create your collaborative website OK. Let’s have both Partners create a new RMarkdown file and name it my_name_fisheries.Rmd. Here’s what you will do: Pull Create a new RMarkdown file and name it my_name_fisheries.Rmd. Let’s do it all lowercase. These will become pages for our website We’ll start by testing: let’s simply change the title inside the Rmd, call it “My Name’s Fisheries Analysis” Knit Save and sync your .Rmd and your .html files (pull, stage, commit, push) Go to Partner 1’s repo, mine is https://github.com/jules32/r-collab/ GitHub also supports this as a website (because we set up our gh-pages branch) Where is it? Figure out your website’s url from your github repo’s url — pay attention to urls. Note that the website url starts with my username (jules32.github.io): my github repo: https://github.com/jules32/r-collab/ my website url: https://jules32.github.io/r-collab/ right now this displays the README as the “home page” for our website. Now navigate to your web page! For example: my github repo: https://github.com/jules32/r-collab/julie_fisheries my website url: https://jules32.github.io/r-collab/julie_fisheries ProTip Pay attention to URLs. 
An unsung skill of the modern analyst is to be able to navigate the internet by keeping an eye on patterns. So cool! You and your partner have created individual webpages here, but they do not talk to each other (i.e. you can’t navigate between them or even know that one exists from the other). We will not organize these pages into a website today, but you can practice this on your own with this hour-long tutorial: Making free websites with RStudio’s R Markdown. Aside: On websites, if something is called index.html, that defaults to the home page. So https://jules32.github.io/r-collab/ is the same as https://jules32.github.io/r-collab/index.html. So as you think about building websites you can develop your index.Rmd file rather than your README.md as your homepage. 9.7.0.1 Troubleshooting 404 error? Remove trailing / from the url Wants you to download? Remove trailing .Rmd from the url 9.7.1 END collaborating session! "],["synthesis.html", "Chapter 10 Synthesis 10.1 Summary 10.2 Attach packages, read in and explore the data 10.3 Some data cleaning to get salmon landings by species 10.4 Find total annual US value ($) for each salmon subgroup 10.5 Make a graph of US commercial fisheries value by species over time with ggplot2 10.6 Built-in color palettes 10.7 Sync with GitHub remote 10.8 Add an image to your partner’s document", " Chapter 10 Synthesis 10.1 Summary In this session, we’ll pull together many of the skills that we’ve learned so far. Working in our existing yourname_fisheries.Rmd file within your collaborative project/repo from the previous session (r-collab), we’ll wrangle and visualize data from spreadsheets in R Markdown, communicate between RStudio (locally) and GitHub (remotely) to keep our updates safe, then add something new to our collaborator’s document. And we’ll learn a few new things along the way! 
Data used in the synthesis section: File name: noaa_fisheries.csv Description: NOAA Commercial Fisheries Landing data (1950 - 2017) Accessed from: https://www.st.nmfs.noaa.gov/commercial-fisheries/commercial-landings/ Source: Fisheries Statistics Division of the NOAA Fisheries Note on the data: “aggregate” here means “These names represent aggregations of more than one species. They are not inclusive, but rather represent landings where we do not have species-specific data. Selecting “Sharks”, for example, will not return all sharks but only those where we do not have more specific information.” 10.1.1 Objectives Synthesize data wrangling and visualization skills learned so far Add a few new tools for data cleaning from stringr Work collaboratively in an R Markdown file Publish your collaborative work as a webpage 10.1.2 Resources Project oriented workflows by Jenny Bryan 10.2 Attach packages, read in and explore the data In your .Rmd, attach the necessary packages in the topmost code chunk: library(tidyverse) library(here) library(janitor) library(paletteer) # install.packages("paletteer") Open the noaa_landings.csv file in Excel. Note that cells we want to be stored as NA actually have the words “no data” - but we can include an additional argument in read_csv() to specify what we want to replace with NA. Read in the noaa_landings.csv data as object us_landings, adding the argument na = "no data" to automatically reassign any “no data” entries to NA during import: us_landings <- read_csv(here("data","noaa_landings.csv"), na = "no data") Go exploring a bit: summary(us_landings) View(us_landings) names(us_landings) head(us_landings) tail(us_landings) 10.3 Some data cleaning to get salmon landings by species Now that we have our data in R, let’s think about some ways that we might want to make it more coder- and analysis-friendly. Brainstorm with your partner about ways you might clean the data up a bit. Things to consider: Do you like typing in all caps? 
Are the column names manageable? Do we want symbols alongside values? If your answer to all three is “no,” then we’re flying on the same plane. Here we’ll do some wrangling led by your recommendations for step-by-step cleaning. Which of these would it make sense to do first, to make any subsequent steps easier for coding? We’ll start with janitor::clean_names() to get all column names into lowercase_snake_case: salmon_clean <- us_landings %>% clean_names() Continue building on that sequence to: Convert everything to lower case with mutate() + str_to_lower() Remove dollar signs in value column (mutate() + parse_number()) Keep only observations that include “salmon” (filter() + str_detect()) Separate “salmon” from any additional refined information on species (separate()) The entire thing might look like this: salmon_clean <- us_landings %>% clean_names() %>% # Make column headers snake_case mutate(afs_name = str_to_lower(afs_name)) %>% # Convert the afs_name column to lowercase mutate(dollars_num = parse_number(dollars_usd)) %>% # Just keep numbers from $ column filter(str_detect(afs_name, pattern = "salmon")) %>% # Only keep entries w/"salmon" separate(afs_name, into = c("group", "subgroup"), sep = ", ") %>% # Note comma-space drop_na(dollars_num) # Drop (listwise deletion) any observations with NA for dollars_num Explore salmon_clean. 10.4 Find total annual US value ($) for each salmon subgroup Find the annual total US landings and dollar value (summing across all states) for each type of salmon using group_by() + summarize(). Think about what data/variables we want to use here: If we want to find annual values by subgroup, then what variables are we going to group by? Are we going to start from us_landings, or from salmon_clean? 
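If you want to see what parse_number() and str_detect() do on their own before running the full pipeline, here is a tiny sketch with made-up values that mimic the landings columns (the vectors below are hypothetical, not from the dataset):

```r
library(readr)    # parse_number()
library(stringr)  # str_to_lower(), str_detect()

dollars_usd <- c("$1,200", "$85", "$3,400,000")
afs_name    <- c("SALMON, CHINOOK", "CRAB, DUNGENESS", "Salmon, Pink")

parse_number(dollars_usd)                     # 1200 85 3400000
str_detect(str_to_lower(afs_name), "salmon")  # TRUE FALSE TRUE
```

Lowercasing first means the pattern "salmon" matches entries regardless of the original capitalization.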
salmon_us_annual <- salmon_clean %>% group_by(year, subgroup) %>% summarize(tot_value = sum(dollars_num, na.rm = TRUE)) ## `summarise()` regrouping output by 'year' (override with `.groups` argument) 10.5 Make a graph of US commercial fisheries value by species over time with ggplot2 salmon_gg <- ggplot(salmon_us_annual, aes(x = year, y = tot_value, group = subgroup)) + geom_line(aes(color = subgroup)) + theme_bw() + labs(x = "year", y = "US commercial salmon value (USD)") salmon_gg 10.6 Built-in color palettes Want to change the color scheme of your graph? Using a consistent theme and color scheme is a great way to make reports more cohesive within groups or organizations, and means less time is spent manually updating graphs to maintain consistency! Luckily, there are many pre-built color palettes. For a glimpse, check out the ReadMe for paletteer by Emil Hvitfeldt, which is an aggregate package of many existing color palette packages. In fact, let’s go ahead and install it by running install.packages("paletteer") in the Console. Question: Once we have paletteer installed, what do we have to do to actually use the functions & palettes in paletteer? Answer: Attach the package! Update the topmost code chunk with library(paletteer) to make sure all of its functions are available. Now, explore the different available packages and color palettes by typing (in the Console) View(palettes_d_names). Then, add a new color palette from the list to your discrete series by adding a ggplot layer that looks like this: scale_color_paletteer_d("package_name::palette_name") Note: Beware of the palette length - we have 7 subgroups of salmon, so we will want to pick palettes that have a length of at least 7. 
Once we add that layer, our entire graph code will look something like this (here, using the OkabeIto palette from the colorblindr package): salmon_gg <- ggplot(salmon_us_annual, aes(x = year, y = tot_value, group = subgroup)) + geom_line(aes(color = subgroup)) + theme_bw() + labs(x = "year", y = "US commercial salmon value (USD)") + scale_color_paletteer_d("colorblindr::OkabeIto") salmon_gg Looking again at palettes_d_names, choose another color palette and update your gg-graph. 10.7 Sync with GitHub remote Stage, commit, (pull), and push your updates to GitHub for safe storage & sharing. Check to make sure that the changes have been stored in your shared r-collab repo. 10.8 Add an image to your partner’s document Now, let’s collaborate with our partner. First: Pull again to make sure you have your partner’s most updated versions (there shouldn’t be any conflicts since you’ve been working in different .Rmd files) Now, open your partner’s .Rmd in RStudio Second: Go to octodex.github.com and find a version of octocat that you like On the image, right click and choose “Copy image location” In your partner’s .Rmd, add the image at the end using: ![](paste_image_location_you_just_copied_here) Knit the .Rmd and check to ensure that your octocat shows up in their document Save, stage, commit, pull, then push to send your contributions back Pull again to make sure your .Rmd has updates from your collaborator Check out your document as a webpage published with gh-pages! 10.8.0.1 Reminder for gh-pages link: username.github.io/repo-name/file-name 10.8.0.2 Troubleshooting gh-pages viewing revisited: 404 error? Remove trailing / from the url Wants you to download? Remove trailing .Rmd from the url 10.8.1 End Synthesis session! "]]