add c7 week3
elmoallistair committed Apr 19, 2021
1 parent 9ae6e82 commit 91b3620
Showing 10 changed files with 120,043 additions and 0 deletions.
@@ -0,0 +1,96 @@
---
title: "Lesson 2: Dataframe Solutions"
output: html_document
---

## Create a data frame solutions
This document contains the solutions for the create a data frame activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions.

If you experience errors, remember that you can search the internet and the RStudio community for help:
https://community.rstudio.com/#

## Step 1: Load packages
Start by installing the required package; in this case, you will want to install `tidyverse`. If you have already installed and loaded `tidyverse` in this session, feel free to skip the code chunks in this step.

```{r}
install.packages("tidyverse")
```
```{r}
library(tidyverse)
```
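
If you rerun this document often, reinstalling a package every time is slow. As a minimal sketch, you could install a package only when R cannot find it. The helper name `install_if_missing` is hypothetical and not part of the activity; `requireNamespace()` and `install.packages()` are standard R functions.

```{r}
# Hypothetical helper: install a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be found.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded here.
install_if_missing("stats")
```

Calling `install_if_missing("tidyverse")` in the same way would skip the download whenever the package is already installed.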

## Step 2: Create data frame

Sometimes you will need to generate a data frame directly in `R`. There are a number of ways to do this; one of the most common is to create individual vectors of data and then combine them into a data frame using the `data.frame()` function.

Here's how this works. First, create a vector of names:
```{r}
names <- c("Peter", "Jennifer", "Julie", "Alex")
```

Then create a vector of ages:

```{r}
age <- c(15, 19, 21, 25)
```

With these two vectors, you can create a new data frame called `people`:
```{r}
people <- data.frame(names, age)
```
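
`data.frame()` expects the vectors to line up row by row, so they should have the same length; if the lengths cannot be matched up, R stops with an error. The sketch below uses illustrative vectors (`pet_names` and `pet_age` are assumptions, not part of the activity) and also shows that you can choose column names explicitly with `name = vector`.

```{r}
# Illustrative vectors (not the activity's data); both have three elements.
pet_names <- c("Rex", "Tom", "Kiwi")
pet_age <- c(3, 7, 1)

# Column names default to the vector names; `name = vector` overrides them.
pets <- data.frame(name = pet_names, age_years = pet_age)

dim(pets)        # number of rows, then number of columns
colnames(pets)   # the names chosen above
```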

## Step 3: Inspect the data frame

Now that you have this data frame, you can use some different functions to inspect it.

One common function you can use to preview the data is the `head()` function, which returns the columns and the first several rows of data. You can check out how the `head()` function works by running the chunk below:

```{r}
head(people)
```

In addition to `head()`, there are a number of other useful functions to summarize or preview your data. For example, the `str()` and `glimpse()` functions will both provide summaries of each column in your data arranged horizontally. You can check out these two functions in action by running the code chunks below:

```{r}
str(people)
```

```{r}
glimpse(people)
```

You can also use `colnames()` to get a list of the column names in your data set. Run the code chunk below to check out this function:

```{r}
colnames(people)
```

Now that you have a data frame, you can work with it using all of the tools in `R`. For example, you could use `mutate()` if you wanted to create a new variable that would capture each person's age in twenty years. The code chunk below creates that new variable:

```{r}
mutate(people, age_in_20 = age + 20)
```
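
Note that `mutate()` returns a new data frame rather than changing the one you pass in. The sketch below illustrates this on a small made-up data frame (`demo_people` is an assumption used only for this example); it loads `dplyr`, which is one of the packages that `library(tidyverse)` attaches.

```{r}
library(dplyr)

# Illustrative data, separate from the `people` data frame above.
demo_people <- data.frame(person = c("A", "B"), age = c(30, 40))

# Without assignment, the result is printed and then discarded.
mutate(demo_people, age_in_20 = age + 20)
colnames(demo_people)   # still only "person" and "age"

# Assigning the result keeps the new column.
demo_people_v2 <- mutate(demo_people, age_in_20 = age + 20)
colnames(demo_people_v2)
```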

## Step 4: Try it yourself

To get more familiar with creating and using data frames, use the code chunks below to create your own custom data frame.

First, create a vector of any five different fruits. You can type directly into the code chunk below; just place your cursor in the box and click to type. Once you have input the fruits you want in your data frame, run the code chunk.

```{r}
fruit <- c("Lemon", "Blueberry", "Grapefruit", "Mango", "Strawberry")
```

Now, create a new vector with a number representing your own personal rank for each fruit. Give a 1 to the fruit you like the most, and a 5 to the fruit you like the least. Remember, the scores need to be in the same order as the fruit above. So if your favorite fruit is last in the list above, the score `1` needs to be in the last position in the list below. Once you have input your rankings, run the code chunk.

```{r}
rank <- c(4, 2, 5, 3, 1)
```

Finally, combine the two vectors into a data frame. You can call it `fruit_ranks`. Edit the code chunk below and run it to create your data frame.

```{r}
fruit_ranks <- data.frame(fruit, rank)
```

Running this code chunk creates a data frame with your fruits and rankings.
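
To check that each rank lines up with the intended fruit, it can help to display the rows from most to least preferred. This is a sketch using base R's `order()` on illustrative vectors (`demo_fruit` and `demo_rank` are assumptions; substitute your own):

```{r}
# Illustrative ranking; substitute your own vectors.
demo_fruit <- c("Kiwi", "Pear", "Plum")
demo_rank <- c(2, 3, 1)
demo_ranks <- data.frame(demo_fruit, demo_rank)

# order() returns the row positions that sort the rank column in ascending order.
demo_ranks[order(demo_ranks$demo_rank), ]
```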
@@ -0,0 +1,77 @@
---
title: "Lesson 2: Import Solutions"
output: html_document
---

## Importing and working with data activity solutions
This document contains the solutions for the importing and working with data activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions.

If you experience errors, remember that you can search the internet and the RStudio community for help:
https://community.rstudio.com/#

## Step 1: Load packages

Start by installing your required package. If you have already installed and loaded `tidyverse` in this session, feel free to skip the code chunks in this step.

```{r}
install.packages("tidyverse")
```
```{r}
library(tidyverse)
```

## Step 2: Import data

The data in this example is originally from the article Hotel Booking Demand Datasets (https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020 (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

You can learn more about the dataset here:
https://www.kaggle.com/jessemostipak/hotel-booking-demand

In the chunk below, you will use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `bookings_df`:

```{r}
bookings_df <- read_csv("hotel_bookings.csv")
```
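
`read_csv()` looks for a relative file name such as "hotel_bookings.csv" in the current working directory (when knitting, that is normally the folder containing the .Rmd file). If you want to try the function without the hotel data, the sketch below writes a tiny made-up file to a temporary location and reads it back. The names `demo_path` and `demo_bookings` and the two rows of data are assumptions for illustration only.

```{r}
library(readr)

# Write a tiny illustrative file to a temporary location.
demo_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(hotel = c("City Hotel", "Resort Hotel"), lead_time = c(10, 200)),
  demo_path
)

# Read it back; read_csv() guesses a type for each column.
demo_bookings <- read_csv(demo_path)
demo_bookings
```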

Now that you have the `bookings_df`, you can work with it using all of the `R` functions you have learned so far.

## Step 3: Inspect & clean data

One common function you can use to preview the data is the `head()` function, which returns the columns and first several rows of data. Check out the `head()` function by running the chunk below:

```{r}
head(bookings_df)
```

Check out the `str()` function by running the code chunk below:

```{r}
str(bookings_df)
```

To find out what columns you have in your data frame, try running the `colnames()` function in the code chunk below:

```{r}
colnames(bookings_df)
```

If you want to create another data frame using `bookings_df` that focuses on the average daily rate, which is referred to as `adr` in the data frame, and `adults`, you can use the following code chunk to do that:

```{r}
new_df <- select(bookings_df, `adr`, adults)
```

To create new variables in your data frame, you can use the `mutate()` function. `mutate()` returns a modified copy of the data frame: `new_df` itself only changes if you assign the result back to it, and the .csv file you imported remains unchanged either way.

```{r}
mutate(new_df, total = `adr` / adults)
```
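
One thing to watch for when dividing by a column: if any row has `adults` equal to 0 (worth checking in your own data, for example with `min()`), the division returns `Inf` for that row. The sketch below shows one possible guard on made-up values, using dplyr's `na_if()` to convert 0 to `NA` first. The data frame `demo_rates` is an assumption for illustration.

```{r}
library(dplyr)

# Illustrative rows; the second booking lists zero adults.
demo_rates <- data.frame(adr = c(100, 80, 90), adults = c(2, 0, 3))

# Plain division gives Inf where adults is 0.
mutate(demo_rates, total = adr / adults)

# One option: turn 0 into NA first, so the result is NA instead of Inf.
demo_rates_safe <- mutate(demo_rates, total = adr / na_if(adults, 0))
demo_rates_safe
```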

## Step 4: Import your own data

Now you can find your own .csv to import! Using the RStudio Cloud interface, import and save the file in the same folder as this R Markdown document. Then write code in the chunk below to read that data into `R`:
```{r}
own_df <- read_csv("<filename.csv>")
```

@@ -0,0 +1,202 @@
---
title: "Lesson 3: Change Solutions"
output: html_document
---

## Changing data solutions
This document contains the solutions for the changing data activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions.

If you experience errors, remember that you can search the internet and the RStudio community for help:
https://community.rstudio.com/#

## Step 1: Load packages

Start by installing the required packages. If you have already installed and loaded `tidyverse`, `skimr`, and `janitor` in this session, feel free to skip the code chunks in this step.

```{r install packages}
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
```

Once a package is installed, you can load it by running the `library()` function with the package name inside the parentheses:

```{r load packages}
library(tidyverse)
library(skimr)
library(janitor)
```

## Step 2: Import data

The data in this example is originally from the article Hotel Booking Demand Datasets (https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020 (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

You can learn more about the dataset here:
https://www.kaggle.com/jessemostipak/hotel-booking-demand

In the chunk below, you will use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `hotel_bookings`:

```{r load dataset}
hotel_bookings <- read_csv("hotel_bookings.csv")
```

## Step 3: Getting to know your data

Like you have been doing in other examples, use the `head()` function to preview the columns and the first several rows of data by running the chunk below:

```{r head function}
head(hotel_bookings)
```

### Practice Quiz Answers

1. How many columns are in this data set?
   - A: 45
   - B: 100
   - C: 32
   - D: 60

   Answer: C. There are 32 columns in this data set. The `head()` function returns a preview of the data set, including the first six rows of data and as many columns as will fit on the screen. At the bottom left of the table, it states that it is previewing 1-4 of 32 columns.

2. The `arrival_date_month` variable is `chr` or character type data.
   - A: True
   - B: False

   Answer: A. The `arrival_date_month` variable is `chr` or character type data. Underneath the column name in the preview table, there is a description of the data type for each column.

In addition to `head()`, you can also use the `str()` and `glimpse()` functions to get summaries of each column in your data arranged horizontally. You can try these two functions by running the code chunks below:

```{r str function}
str(hotel_bookings)
```

You can see the different column names and some sample values to the right of the colon.

```{r glimpse function}
glimpse(hotel_bookings)
```

You can also use `colnames()` to get the names of the columns in your data set. Run the code chunk below to get the column names:

```{r colnames function}
colnames(hotel_bookings)
```
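
If you only need the number of columns or rows rather than their names, base R's `ncol()`, `nrow()`, and `dim()` report them directly. A small sketch on an illustrative data frame (`demo_df` is an assumption, not part of the activity):

```{r}
# Illustrative data frame with 2 rows and 3 columns.
demo_df <- data.frame(hotel = c("City Hotel", "Resort Hotel"),
                      lead_time = c(10, 200),
                      adults = c(2, 1))

ncol(demo_df)   # number of columns
nrow(demo_df)   # number of rows
dim(demo_df)    # rows, then columns

# The number of column names always matches ncol().
length(colnames(demo_df)) == ncol(demo_df)
```

Running `ncol(hotel_bookings)` in the same way gives the column count for the hotel data.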

## Manipulating your data

Let's say you want to arrange the data by most lead time to least lead time because you want to focus on bookings that were made far in advance. You decide you want to try using the `arrange()` function and run the following command:

```{r arrange function}
arrange(hotel_bookings, lead_time)
```

`arrange()` sorts in ascending order by default. To sort in descending order, wrap the column in `desc()`, as in the code chunk below:

```{r arrange function descending}
arrange(hotel_bookings, desc(lead_time))
```

### Practice Quiz Answers

What is the highest lead time for a hotel booking in this data set?

- A: 737
- B: 709
- C: 629
- D: 0

Answer: A. The highest lead time for a hotel booking in this data set is 737 days. After using the `arrange()` function to sort `hotel_bookings` by lead time in descending order, you will notice that 737 is the value in the first row. That is over two years in advance!

Notice that when you just run `arrange()` without saving your data to a new data frame, it does not alter the existing data frame. Check it out by running `head()` again to find out if the highest lead times are first:

```{r head function part two}
head(hotel_bookings)
```
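
The same behavior can be checked on a small example where the expected order is easy to verify by eye. This is a sketch on made-up lead times (`demo_lead` is an assumption for illustration):

```{r}
library(dplyr)

# Illustrative lead times in days.
demo_lead <- data.frame(booking = c("a", "b", "c"), lead_time = c(30, 5, 120))

arrange(demo_lead, lead_time)$lead_time         # ascending: 5 30 120
arrange(demo_lead, desc(lead_time))$lead_time   # descending: 120 30 5
demo_lead$lead_time                             # original order kept: 30 5 120
```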

If you wanted to create a new data frame that had those changes saved, you would use the assignment operator `<-`, as written in the code chunk below, to store the arranged data in a data frame named `hotel_bookings_v2`:

```{r new dataframe}
hotel_bookings_v2 <-
  arrange(hotel_bookings, desc(lead_time))
```

Check out the new data frame:

```{r new dataframe part two}
head(hotel_bookings_v2)
```

You can also find the maximum and minimum lead times without sorting the whole data set with `arrange()`. Try it out using the `max()` and `min()` functions below:

```{r}
max(hotel_bookings$lead_time)
```

```{r}
min(hotel_bookings$lead_time)
```

Remember, in this case, you need to specify both the data set and the column, with the `$` operator between their names. Try running the chunk below to see what happens if you forget one of those pieces:

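```{r error=TRUE}
# Intentionally incomplete: there is no standalone object named lead_time, so this call fails.
# The error=TRUE chunk option lets the document keep knitting after the error.
min(lead_time)
```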

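This is a common error that R users encounter: without the `hotel_bookings$` prefix, R searches for a standalone object named `lead_time`, finds none, and reports that the object was not found.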
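
There are several ways to tell R which data frame a column belongs to. The sketch below shows three that give the same result on made-up data: the `$` operator, base R's `with()`, and dplyr's `pull()` in a pipe. The data frame `demo_stays` is an assumption for illustration.

```{r}
library(dplyr)

# Illustrative data; `demo_stays` is not part of the activity.
demo_stays <- data.frame(lead_time = c(30, 5, 120))

# Three equivalent ways to point R at the column.
min(demo_stays$lead_time)
with(demo_stays, min(lead_time))
demo_stays %>% pull(lead_time) %>% min()
```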

Now, let's say you just want to know the average lead time for bookings because your boss asks you how early you should run promotions for hotel rooms. You can use the `mean()` function to answer that question:

```{r mean}
mean(hotel_bookings$lead_time)
```

You should get the same answer even if you use the v2 data set that was sorted with `arrange()`, because the mean does not depend on row order.

```{r mean part two}
mean(hotel_bookings_v2$lead_time)
```
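
As a quick sanity check of that claim, the sketch below compares the mean before and after sorting on made-up values (`demo_times` is an assumption for illustration). `all.equal()` is used for the comparison because it tolerates tiny floating-point differences.

```{r}
library(dplyr)

# Illustrative lead times, then the same rows sorted in descending order.
demo_times <- data.frame(lead_time = c(30, 5, 120))
demo_times_sorted <- arrange(demo_times, desc(lead_time))

mean(demo_times$lead_time)
mean(demo_times_sorted$lead_time)

# TRUE when the two means agree.
all.equal(mean(demo_times$lead_time), mean(demo_times_sorted$lead_time))
```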

### Practice Quiz Answers

What is the average lead time?

- A: 100
- B: 45
- C: 14
- D: 104.0114

Answer: D. The average lead time is 104.0114 days. You were able to calculate this using the `mean()` function on the `lead_time` column in your data set.

You were able to report to your boss what the average lead time before booking is, but now they want to know what the average lead time before booking is for just city hotels. They want to focus the promotion they're running by targeting major cities.

You know that your first step will be creating a new data set that only contains data about city hotels. You can do that using the `filter()` function, and name your new data frame `hotel_bookings_city`:

```{r filter}
hotel_bookings_city <-
  filter(hotel_bookings, hotel == "City Hotel")
```
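
Inside `filter()`, dplyr looks up column names in the data frame you pass in, so the `hotel_bookings$` prefix is optional there. The comparison uses `==` (two equals signs) to test for equality. A minimal sketch on made-up rows (`demo_hotels` is an assumption for illustration):

```{r}
library(dplyr)

# Illustrative bookings: two city hotel rows and one resort hotel row.
demo_hotels <- data.frame(hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
                          lead_time = c(10, 200, 40))

# Keep only the rows where the hotel column equals "City Hotel".
demo_city <- filter(demo_hotels, hotel == "City Hotel")

nrow(demo_city)             # 2 rows remain
mean(demo_city$lead_time)   # (10 + 40) / 2 = 25
```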

Check out your new data set:

```{r new dataset}
head(hotel_bookings_city)
```

You quickly check the average lead time for this set of hotels, just like you did for all hotels before:

```{r average lead time city hotels}
mean(hotel_bookings_city$lead_time)
```

Now, your boss wants to know a lot more information about city hotels, including the maximum and minimum lead time. They are also interested in how city hotels differ from resort hotels. You don't want to run each line of code over and over again, so you decide to use the `group_by()` and `summarize()` functions (`summarise()`, used in the chunk below, is the same function under its British spelling). You can also use the pipe operator (`%>%`) to make your code easier to follow. You will store the new data set in a data frame named `hotel_summary`:

```{r group and summarize}
hotel_summary <-
  hotel_bookings %>%
  group_by(hotel) %>%
  summarise(average_lead_time = mean(lead_time),
            min_lead_time = min(lead_time),
            max_lead_time = max(lead_time))
```

Check out your new data set using `head()` again:

```{r}
head(hotel_summary)
```
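
To see how `group_by()` and `summarise()` work together, it can help to run them on data small enough to verify by hand. In this sketch (the data frame `demo_stays_by_hotel` and its values are assumptions for illustration), each hotel type has two bookings, so the result has one row per hotel type.

```{r}
library(dplyr)

# Illustrative bookings: two per hotel type.
demo_stays_by_hotel <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 200, 40, 100)
)

# group_by() splits the rows by hotel; summarise() collapses each group to one row.
demo_summary <- demo_stays_by_hotel %>%
  group_by(hotel) %>%
  summarise(average_lead_time = mean(lead_time),
            min_lead_time = min(lead_time),
            max_lead_time = max(lead_time))

demo_summary   # City Hotel: 25, 10, 40; Resort Hotel: 150, 100, 200
```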