forked from elmoallistair/google-data-analytics
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 9ae6e82, commit 91b3620
Showing 10 changed files with 120,043 additions and 0 deletions.
96 changes: 96 additions & 0 deletions
07_data-analysis-r/03_working-with-data-in-r/activity/Lesson2_Dataframe_Solutions.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
--- | ||
title: "Lesson 2: Dataframe Solutions" | ||
output: html_document | ||
--- | ||
|
||
## Create a data frame solutions | ||
This document contains the solutions for the create a data frame activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions. | ||
|
||
If you experience errors, remember that you can search the internet and the RStudio community for help: | ||
https://community.rstudio.com/# | ||
|
||
## Step 1: Load packages | ||
Start by installing the required package; in this case, you will want to install `tidyverse`. If you have already installed and loaded `tidyverse` in this session, feel free to skip the code chunks in this step. | ||
|
||
```{r} | ||
install.packages("tidyverse") | ||
``` | ||
```{r} | ||
library(tidyverse) | ||
``` | ||
|
||
## Step 2: Create data frame | ||
|
||
Sometimes you will need to generate a data frame directly in `R`. There are a number of ways to do this; one of the most common is to create individual vectors of data and then combine them into a data frame using the `data.frame()` function. | ||
|
||
Here's how this works. First, create a vector of names: | ||
```{r} | ||
names <- c("Peter", "Jennifer", "Julie", "Alex") | ||
``` | ||
|
||
Then create a vector of ages: | ||
|
||
```{r} | ||
age <- c(15, 19, 21, 25) | ||
``` | ||
|
||
With these two vectors, you can create a new data frame called `people`: | ||
```{r} | ||
people <- data.frame(names, age) | ||
``` | ||
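As a minimal sketch of the same technique, the chunk below passes the column names explicitly to `data.frame()` and then checks the shape of the result. The names `person_names`, `person_ages`, and `people_named` are hypothetical and not part of the original activity; only base R is assumed.

```r
# Hypothetical vectors mirroring the ones in the activity
person_names <- c("Peter", "Jennifer", "Julie", "Alex")
person_ages <- c(15, 19, 21, 25)

# Name the columns explicitly; each vector becomes one column
people_named <- data.frame(name = person_names, age = person_ages)

# A data frame has one row per element and one column per vector
dim(people_named)       # number of rows, then number of columns
colnames(people_named)  # the column names given above
```

Naming the columns in the call is optional, but it keeps the column names independent of the vector names.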
|
||
## Step 3: Inspect the data frame | ||
|
||
Now that you have this data frame, you can use some different functions to inspect it. | ||
|
||
One common function you can use to preview the data is the `head()` function, which returns the columns and the first several rows of data. You can check out how the `head()` function works by running the chunk below: | ||
|
||
```{r} | ||
head(people) | ||
``` | ||
|
||
In addition to `head()`, there are a number of other useful functions to summarize or preview your data. For example, the `str()` and `glimpse()` functions will both provide summaries of each column in your data arranged horizontally. You can check out these two functions in action by running the code chunks below: | ||
|
||
```{r} | ||
str(people) | ||
``` | ||
|
||
```{r} | ||
glimpse(people) | ||
``` | ||
|
||
You can also use `colnames()` to get a list of the column names in your data set. Run the code chunk below to check out this function: | ||
|
||
```{r} | ||
colnames(people) | ||
``` | ||
|
||
Now that you have a data frame, you can work with it using all of the tools in `R`. For example, you could use `mutate()` if you wanted to create a new variable that would capture each person's age in twenty years. The code chunk below creates that new variable: | ||
|
||
```{r} | ||
mutate(people, age_in_20 = age + 20) | ||
``` | ||
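One detail worth checking for yourself: `mutate()` returns a new data frame and leaves its input alone unless you assign the result. The sketch below assumes the `dplyr` package (part of `tidyverse`) is installed; `people_demo` and `people_plus` are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical copy of the activity's data frame
people_demo <- data.frame(
  names = c("Peter", "Jennifer", "Julie", "Alex"),
  age = c(15, 19, 21, 25)
)

# mutate() returns a new data frame; people_demo itself is unchanged
mutate(people_demo, age_in_20 = age + 20)
ncol(people_demo)

# Assign the result to keep the new column
people_plus <- mutate(people_demo, age_in_20 = age + 20)
people_plus$age_in_20
```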
|
||
## Step 4: Try it yourself | ||
|
||
To get more familiar with creating and using data frames, use the code chunks below to create your own custom data frame. | ||
|
||
First, create a vector of any five different fruits. You can type directly into the code chunk below; just place your cursor in the box and click to type. Once you have input the fruits you want in your data frame, run the code chunk. | ||
|
||
```{r} | ||
fruit <- c("Lemon", "Blueberry", "Grapefruit", "Mango", "Strawberry") | ||
``` | ||
|
||
Now, create a new vector with a number representing your own personal rank for each fruit. Give a 1 to the fruit you like the most, and a 5 to the fruit you like the least. Remember, the scores need to be in the same order as the fruit above. So if your favorite fruit is last in the list above, the score `1` needs to be in the last position in the list below. Once you have input your rankings, run the code chunk. | ||
|
||
```{r} | ||
rank <- c(4, 2, 5, 3, 1) | ||
``` | ||
|
||
Finally, combine the two vectors into a data frame. You can call it `fruit_ranks`. Edit the code chunk below and run it to create your data frame. | ||
|
||
```{r} | ||
fruit_ranks <- data.frame(fruit, rank) | ||
``` | ||
|
||
After you run this code chunk, it will create a data frame with your fruits and rankings. |
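
To see why the order of the two vectors matters, here is a small sketch that looks up the fruit ranked first. It uses base R only; the `_demo` names and the `favorite` variable are hypothetical, and the rankings are the sample values from the solution above.

```r
fruit_demo <- c("Lemon", "Blueberry", "Grapefruit", "Mango", "Strawberry")
rank_demo <- c(4, 2, 5, 3, 1)

# stringsAsFactors = FALSE keeps the fruit column as plain text in older R versions
fruit_ranks_demo <- data.frame(
  fruit = fruit_demo,
  rank = rank_demo,
  stringsAsFactors = FALSE
)

# Values are paired by position, so the fifth fruit gets the fifth rank
favorite <- fruit_ranks_demo$fruit[fruit_ranks_demo$rank == 1]
favorite
```

If the ranks were entered in a different order, the same lookup would return a different fruit.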
77 changes: 77 additions & 0 deletions
07_data-analysis-r/03_working-with-data-in-r/activity/Lesson2_Import_Solutions.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
--- | ||
title: "Lesson 2: Import Solutions" | ||
output: html_document | ||
--- | ||
|
||
## Importing and working with data activity solutions | ||
This document contains the solutions for the importing and working with data activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions. | ||
|
||
If you experience errors, remember that you can search the internet and the RStudio community for help: | ||
https://community.rstudio.com/# | ||
|
||
## Step 1: Load packages | ||
|
||
Start by installing your required package. If you have already installed and loaded `tidyverse` in this session, feel free to skip the code chunks in this step. | ||
|
||
```{r} | ||
install.packages("tidyverse") | ||
``` | ||
```{r} | ||
library(tidyverse) | ||
``` | ||
## Step 2: Import data | ||
|
||
The data in this example is originally from the article Hotel Booking Demand Datasets (https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. | ||
|
||
The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020 (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md). | ||
|
||
You can learn more about the dataset here: | ||
https://www.kaggle.com/jessemostipak/hotel-booking-demand | ||
|
||
In the chunk below, you will use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `bookings_df`: | ||
|
||
```{r} | ||
bookings_df <- read_csv("hotel_bookings.csv") | ||
``` | ||
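If you do not have "hotel_bookings.csv" at hand, you can still try `read_csv()` on a small file. The sketch below writes a tiny, made-up CSV to a temporary file and reads it back; it assumes the `readr` package (part of `tidyverse`) is installed, and `csv_path` and `demo_df` are hypothetical names.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (not the real hotel data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("hotel,lead_time", "City Hotel,10", "Resort Hotel,342"), csv_path)

# read_csv() returns a tibble and guesses a type for each column
demo_df <- read_csv(csv_path, show_col_types = FALSE)
nrow(demo_df)
colnames(demo_df)
```

The `show_col_types = FALSE` argument only silences the column type message; leave it out if you want to see what types were guessed.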
|
||
Now that you have the `bookings_df`, you can work with it using all of the `R` functions you have learned so far. | ||
|
||
## Step 3: Inspect & clean data | ||
|
||
One common function you can use to preview the data is the `head()` function, which returns the columns and first several rows of data. Check out the `head()` function by running the chunk below: | ||
|
||
```{r} | ||
head(bookings_df) | ||
``` | ||
|
||
Check out the `str()` function by running the code chunk below: | ||
|
||
```{r} | ||
str(bookings_df) | ||
``` | ||
|
||
To find out what columns you have in your data frame, try running the `colnames()` function in the code chunk below: | ||
|
||
```{r} | ||
colnames(bookings_df) | ||
``` | ||
|
||
If you want to create another data frame using `bookings_df` that focuses on the average daily rate, which is referred to as `adr` in the data frame, and `adults`, you can use the following code chunk to do that: | ||
|
||
```{r} | ||
new_df <- select(bookings_df, `adr`, adults) | ||
``` | ||
|
||
To create new variables in your data frame, you can use the `mutate()` function. `mutate()` returns a new data frame with the added column; it does not change `new_df` unless you assign the result, and the source data you imported remains unchanged. | ||
|
||
```{r} | ||
mutate(new_df, total = `adr` / adults) | ||
``` | ||
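The sketch below repeats the `select()` and `mutate()` steps on a few made-up rows so you can see one edge case: a booking with zero adults. It assumes `dplyr` is installed; `bookings_demo`, `new_demo`, and `rate_demo` are hypothetical names, and the values are invented rather than taken from the hotel data.

```r
library(dplyr)

# Made-up rows; the real bookings_df is not needed for this sketch
bookings_demo <- data.frame(
  adr = c(100, 80, 60),
  adults = c(2, 1, 0),
  children = c(0, 1, 0)
)

# Keep two columns, then store the mutated result under a new name
new_demo <- select(bookings_demo, adr, adults)
rate_demo <- mutate(new_demo, total = adr / adults)

colnames(new_demo)  # only the selected columns remain
rate_demo$total     # dividing by zero adults gives Inf, not an error
```

If rows like the last one exist in your data, you may want to filter them out before dividing.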
|
||
## Step 4: Import your own data | ||
|
||
Now you can find your own .csv to import! Using the RStudio Cloud interface, import and save the file in the same folder as this R Markdown document. Then write code in the chunk below to read that data into `R`: | ||
```{r} | ||
own_df <- read_csv("<filename.csv>") | ||
``` | ||
|
202 changes: 202 additions & 0 deletions
07_data-analysis-r/03_working-with-data-in-r/activity/Lesson3_Change_Solutions.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,202 @@ | ||
--- | ||
title: "Lesson 3: Change Solutions" | ||
output: html_document | ||
--- | ||
|
||
## Changing data solutions | ||
This document contains the solutions for the changing activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions. | ||
|
||
If you experience errors, remember that you can search the internet and the RStudio community for help: | ||
https://community.rstudio.com/# | ||
|
||
## Step 1: Load packages | ||
|
||
Start by installing the required packages. If you have already installed and loaded `tidyverse`, `skimr`, and `janitor` in this session, feel free to skip the code chunks in this step. | ||
|
||
```{r install packages} | ||
install.packages("tidyverse") | ||
install.packages("skimr") | ||
install.packages("janitor") | ||
``` | ||
|
||
Once a package is installed, you can load it by running the `library()` function with the package name inside the parentheses: | ||
|
||
```{r load packages} | ||
library(tidyverse) | ||
library(skimr) | ||
library(janitor) | ||
``` | ||
|
||
## Step 2: Import data | ||
|
||
The data in this example is originally from the article Hotel Booking Demand Datasets (https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. | ||
|
||
The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020 (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md). | ||
|
||
You can learn more about the dataset here: | ||
https://www.kaggle.com/jessemostipak/hotel-booking-demand | ||
|
||
In the chunk below, you will use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `hotel_bookings`: | ||
|
||
```{r load dataset} | ||
hotel_bookings <- read_csv("hotel_bookings.csv") | ||
``` | ||
|
||
## Step 3: Getting to know your data | ||
|
||
Like you have been doing in other examples, use the `head()` function to preview the columns and the first several rows of data by running the chunk below: | ||
|
||
```{r head function} | ||
head(hotel_bookings) | ||
``` | ||
### Practice Quiz Answers | ||
|
||
1. How many columns are in this data set? | ||
A: 45 | ||
B: 100 | ||
C: 32 | ||
D: 60 | ||
Answer: C. There are 32 columns in this data set. The `head()` function returns a preview of the data set, including the first six rows of data and as many columns as will fit on the screen. At the bottom left of the table, it states that it is previewing 1-4 of 32 columns. | ||
|
||
2. The `arrival_date_month` variable is `chr` (character) type data. | ||
A: True | ||
B: False | ||
Answer: A. The `arrival_date_month` variable is `chr` (character) type data. Underneath the column name in the preview table, there is a description of the data type for each column. | ||
|
||
In addition to `head()`, you can also use the `str()` and `glimpse()` functions to get summaries of each column in your data arranged horizontally. You can try these two functions by running the code chunks below: | ||
|
||
```{r str function} | ||
str(hotel_bookings) | ||
``` | ||
|
||
You can see the different column names and some sample values to the right of the colon. | ||
|
||
```{r glimpse function} | ||
glimpse(hotel_bookings) | ||
``` | ||
|
||
You can also use `colnames()` to get the names of the columns in your data set. Run the code chunk below to get the column names: | ||
|
||
```{r colnames function} | ||
colnames(hotel_bookings) | ||
``` | ||
|
||
## Manipulating your data | ||
|
||
Let's say you want to arrange the data by most lead time to least lead time because you want to focus on bookings that were made far in advance. You decide you want to try using the `arrange()` function and run the following command: | ||
|
||
```{r arrange function} | ||
arrange(hotel_bookings, lead_time) | ||
``` | ||
|
||
`arrange()` sorts in ascending order by default, so you need to tell it explicitly when you want descending order, as in the code chunk below: | ||
|
||
```{r arrange function descending} | ||
arrange(hotel_bookings, desc(lead_time)) | ||
``` | ||
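To compare the two sort orders side by side, here is a minimal sketch on a four-row, made-up data frame. It assumes `dplyr` is installed; `lead_demo`, `ascending`, and `descending` are hypothetical names.

```r
library(dplyr)

# Made-up bookings; the values are for illustration only
lead_demo <- data.frame(
  booking = c("a", "b", "c", "d"),
  lead_time = c(14, 737, 0, 120)
)

ascending <- arrange(lead_demo, lead_time)         # smallest first (default)
descending <- arrange(lead_demo, desc(lead_time))  # largest first

ascending$lead_time
descending$lead_time
```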
## Practice Quiz Answers | ||
|
||
What is the highest lead time for a hotel booking in this data set? | ||
A: 737 | ||
B: 709 | ||
C: 629 | ||
D: 0 | ||
|
||
Answer: A. The highest lead time for a hotel booking in this data set is 737 days. After using the `arrange()` function to sort `hotel_bookings` by lead time in descending order, you will notice that 737 is the lead time in the first row. That is over two years in advance! | ||
|
||
Notice that when you just run `arrange()` without saving your data to a new data frame, it does not alter the existing data frame. Check it out by running `head()` again to find out if the highest lead times are first: | ||
|
||
```{r head function part two} | ||
head(hotel_bookings) | ||
``` | ||
|
||
If you wanted to create a new data frame that had those changes saved, you would use the assignment operator `<-`, as written in the code chunk below, to store the arranged data in a data frame named `hotel_bookings_v2`: | ||
|
||
```{r new dataframe} | ||
hotel_bookings_v2 <- | ||
arrange(hotel_bookings, desc(lead_time)) | ||
``` | ||
|
||
Check out the new data frame: | ||
|
||
```{r new dataframe part two} | ||
head(hotel_bookings_v2) | ||
``` | ||
|
||
You can also find out the maximum and minimum lead times without sorting the whole data set with `arrange()`. Try it out using the `max()` and `min()` functions below: | ||
|
||
```{r} | ||
max(hotel_bookings$lead_time) | ||
``` | ||
|
||
```{r} | ||
min(hotel_bookings$lead_time) | ||
``` | ||
|
||
Remember, in this case, you need to specify which data set and which column using the `$` symbol between their names. Try running the chunk below to see what happens if you forget one of those pieces: | ||
|
||
```{r} | ||
min(lead_time) | ||
``` | ||
|
||
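This is a common error that R users encounter: R reports that the object `lead_time` was not found, because `lead_time` is a column inside `hotel_bookings` rather than a standalone object. | ||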
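If you want to look at the error without stopping your script, one option is to wrap the call in `tryCatch()`. This is a sketch in base R, assuming a fresh session in which no object named `lead_time` exists; `stays_demo` and `result` are hypothetical names.

```r
# Made-up data frame standing in for hotel_bookings
stays_demo <- data.frame(lead_time = c(14, 737, 0, 120))

# Referring to the column on its own fails because no object has that name
result <- tryCatch(
  min(lead_time),
  error = function(e) conditionMessage(e)
)
result

# Naming the data frame and the column with $ works
min(stays_demo$lead_time)
```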
|
||
Now, let's say you just want to know what the average lead time for booking is because your boss asks you how early you should run promotions for hotel rooms. You can use the `mean()` function to answer that question: | ||
|
||
```{r mean} | ||
mean(hotel_bookings$lead_time) | ||
``` | ||
|
||
You should get the same answer even if you use the v2 data frame that you created with `arrange()`, because sorting changes the order of the rows but not their values. | ||
|
||
```{r mean part two} | ||
mean(hotel_bookings_v2$lead_time) | ||
``` | ||
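The same point can be checked on a plain vector. This base R sketch uses made-up values; `lead_times`, `reordered`, and `same_mean` are hypothetical names.

```r
# Made-up lead times in their original order
lead_times <- c(14, 737, 0, 120)

# The same values, sorted from largest to smallest
reordered <- sort(lead_times, decreasing = TRUE)

# all.equal() compares numbers with a small tolerance for rounding
same_mean <- isTRUE(all.equal(mean(lead_times), mean(reordered)))
same_mean
```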
|
||
## Practice Quiz Answers | ||
|
||
What is the average lead time? | ||
A: 100 | ||
B: 45 | ||
C: 14 | ||
D: 104.0114 | ||
|
||
Answer: D. The average lead time is 104.0114 days. You were able to calculate this using the mean() function on the lead_time column in your data set. | ||
|
||
You were able to report to your boss what the average lead time before booking is, but now they want to know what the average lead time before booking is for just city hotels. They want to focus the promotion they're running by targeting major cities. | ||
|
||
You know that your first step will be creating a new data set that only contains data about city hotels. You can do that using the `filter()` function, and name your new data frame 'hotel_bookings_city': | ||
|
||
```{r filter} | ||
hotel_bookings_city <- | ||
filter(hotel_bookings, hotel == "City Hotel") | ||
``` | ||
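Inside `filter()`, you can refer to a column by its bare name, so the data frame name does not need to be repeated with `$`. The sketch below shows this on made-up rows; it assumes `dplyr` is installed, and `hotels_demo` and `city_demo` are hypothetical names.

```r
library(dplyr)

# Made-up bookings for two hotel types
hotels_demo <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  lead_time = c(30, 200, 90),
  stringsAsFactors = FALSE
)

# Keep only the rows where the hotel column equals "City Hotel"
city_demo <- filter(hotels_demo, hotel == "City Hotel")

nrow(city_demo)
mean(city_demo$lead_time)
```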
|
||
Check out your new data set: | ||
|
||
```{r new dataset} | ||
head(hotel_bookings_city) | ||
``` | ||
|
||
You quickly check what the average lead time for this set of hotels is, just like you did for all hotels before: | ||
|
||
```{r average lead time city hotels} | ||
mean(hotel_bookings_city$lead_time) | ||
``` | ||
|
||
Now, your boss wants to know a lot more information about city hotels, including the maximum and minimum lead time. They are also interested in how city hotels differ from resort hotels. You don't want to run each line of code over and over again, so you decide to use the `group_by()` and `summarize()` functions. You can also use the pipe operator to make your code easier to follow. You will store the new data set in a data frame named `hotel_summary`: | ||
|
||
```{r group and summarize} | ||
hotel_summary <- | ||
hotel_bookings %>% | ||
group_by(hotel) %>% | ||
summarise(average_lead_time=mean(lead_time), | ||
min_lead_time=min(lead_time), | ||
max_lead_time=max(lead_time)) | ||
``` | ||
|
||
Check out your new data set using `head()` again: | ||
|
||
```{r} | ||
head(hotel_summary) | ||
``` |
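
To see what the grouped summary looks like on a data set small enough to check by hand, here is a sketch with four made-up bookings. It assumes `dplyr` is installed; `hotels_demo` and `summary_demo` are hypothetical names, and the values are invented.

```r
library(dplyr)

# Two made-up bookings per hotel type
hotels_demo <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 100)
)

# One output row per hotel type, with three summary columns
summary_demo <- hotels_demo %>%
  group_by(hotel) %>%
  summarise(
    average_lead_time = mean(lead_time),
    min_lead_time = min(lead_time),
    max_lead_time = max(lead_time)
  )

summary_demo
```

`summarise()` and `summarize()` are two spellings of the same `dplyr` function, so either works here.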