Skip to content

Latest commit

 

History

History
798 lines (649 loc) · 21.8 KB

practiceCode.md

File metadata and controls

798 lines (649 loc) · 21.8 KB

Outline and concepts for workshop

This <- command basically takes anything on the right hand side and puts it into the left hand side. Like an equation. This is called variable assignment.

df <- read.csv('train.csv')

head checks only the first few rows of the dataset.

head(df)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

dim checks the dimensions of the dataset (number of rows and number of columns, in that order).

dim(df)
## [1] 891  12

str checks the structure of the dataset, showing what the df object is, what each item (or column) in the dataset is, such as numeric, factor, etc. This is pretty useful as it can give you a quick overview of what the variables have been classified as and if there are any problems.

str(df)
## 'data.frame':	891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

class checks only the type of object you are asking, not the contents (unlike str).

class(df)
## [1] "data.frame"

summary is an extremely useful command to check the basic descriptive statistics (mean, median, range, count for factors). I usually use this anytime I want to quickly look at my dataset, to get a sense of it.

summary(df)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

If you want to see specific columns or rows, you can use the [ command. The first number is the row [row, ], and the second number is the column [, column]. So together: [row, column].

df[1:2, 1:5]
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
##                                                  Name    Sex
## 1                             Braund, Mr. Owen Harris   male
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female

Also, numbers can be put together like so:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
-1:10
##  [1] -1  0  1  2  3  4  5  6  7  8  9 10
1:-10
##  [1]   1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
-10:1
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1

In dataframes, a negative number means remove:

df[1:2, ]
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
##   Parch    Ticket    Fare Cabin Embarked
## 1     0 A/5 21171  7.2500              S
## 2     0  PC 17599 71.2833   C85        C
df[1:2, -2:-4]
##   PassengerId    Sex Age SibSp Parch    Ticket    Fare Cabin Embarked
## 1           1   male  22     1     0 A/5 21171  7.2500              S
## 2           2 female  38     1     0  PC 17599 71.2833   C85        C

You can use strings (real words) to select a specific column.

df[1:2, 'Age']
## [1] 22 38

You can use the combine command c() to put two strings or numbers together.

df[c(1, 4), c('Age', 'Sex')]
##   Age    Sex
## 1  22   male
## 4  35 female

You can also subset the data using these commands:

head(df[df$Sex == 'male', ])
##    PassengerId Survived Pclass                           Name  Sex Age
## 1            1        0      3        Braund, Mr. Owen Harris male  22
## 5            5        0      3       Allen, Mr. William Henry male  35
## 6            6        0      3               Moran, Mr. James male  NA
## 7            7        0      1        McCarthy, Mr. Timothy J male  54
## 8            8        0      3 Palsson, Master. Gosta Leonard male   2
## 13          13        0      3 Saundercock, Mr. William Henry male  20
##    SibSp Parch    Ticket    Fare Cabin Embarked
## 1      1     0 A/5 21171  7.2500              S
## 5      0     0    373450  8.0500              S
## 6      0     0    330877  8.4583              Q
## 7      0     0     17463 51.8625   E46        S
## 8      3     1    349909 21.0750              S
## 13     0     0 A/5. 2151  8.0500              S
head(df[c(df$Sex == 'male', df$Age < 40), ])
##    PassengerId Survived Pclass                           Name  Sex Age
## 1            1        0      3        Braund, Mr. Owen Harris male  22
## 5            5        0      3       Allen, Mr. William Henry male  35
## 6            6        0      3               Moran, Mr. James male  NA
## 7            7        0      1        McCarthy, Mr. Timothy J male  54
## 8            8        0      3 Palsson, Master. Gosta Leonard male   2
## 13          13        0      3 Saundercock, Mr. William Henry male  20
##    SibSp Parch    Ticket    Fare Cabin Embarked
## 1      1     0 A/5 21171  7.2500              S
## 5      0     0    373450  8.0500              S
## 6      0     0    330877  8.4583              Q
## 7      0     0     17463 51.8625   E46        S
## 8      3     1    349909 21.0750              S
## 13     0     0 A/5. 2151  8.0500              S

dplyr and tidyr approach

However, these are a bit complicated, and hard to read! There is a better way. Install and/or load these packages:

install.packages('dplyr')
install.packages('tidyr')
library(dplyr)
library(tidyr)

To do the same as the above, use:

df %>%
  filter(Sex == 'male', Age < 40) %>%
  ## You can keep chaining
  tbl_df() %>%
  ## ... and chaining
  select(Sex, Age, Pclass, Parch) %>%
  ## ... and chaining
  summary()
##      Sex           Age            Pclass          Parch       
##  female:  0   Min.   : 0.42   Min.   :1.000   Min.   :0.0000  
##  male  :344   1st Qu.:19.00   1st Qu.:2.000   1st Qu.:0.0000  
##               Median :25.00   Median :3.000   Median :0.0000  
##               Mean   :24.28   Mean   :2.494   Mean   :0.2878  
##               3rd Qu.:31.00   3rd Qu.:3.000   3rd Qu.:0.0000  
##               Max.   :39.00   Max.   :3.000   Max.   :5.0000

The extremely useful %>% chain, or pipe command, is just like in the shell/terminal. It takes the output of the previous command and inputs it into the next command. Otherwise, without the %>% pipe, it looks like:

summary(select(tbl_df(filter(df, Sex == 'male', Age < 40)), Sex, Age, Pclass, Parch))
##      Sex           Age            Pclass          Parch       
##  female:  0   Min.   : 0.42   Min.   :1.000   Min.   :0.0000  
##  male  :344   1st Qu.:19.00   1st Qu.:2.000   1st Qu.:0.0000  
##               Median :25.00   Median :3.000   Median :0.0000  
##               Mean   :24.28   Mean   :2.494   Mean   :0.2878  
##               3rd Qu.:31.00   3rd Qu.:3.000   3rd Qu.:0.0000  
##               Max.   :39.00   Max.   :3.000   Max.   :5.0000

The pipe does this by basically making the output be named ., so really, the pipe is doing this:

df %>% select(., Fare, Sex) %>%
  filter(., Sex == 'male') %>%
  select(., Fare) %>%
  round(., 3) %>%
  head(.)
##     Fare
## 1  7.250
## 2  8.050
## 3  8.458
## 4 51.862
## 5 21.075
## 6  8.050

tbl_df makes the dataframe also a tbl object, so that the outout can be printed easily. The verbs for dplyr are:

  • select
  • filter
  • mutate
  • summarise
  • arrange
  • group_by

For more explanation of dplyr, check the documentation: https://github.com/hadley/dplyr or run this command vignette('introduction', package = 'dplyr')

The tbl_df() function makes the printing prettier.

df <- tbl_df(df)
df %>% summary
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186
df %>%
  ## subset the data by SibSp
  filter(SibSp >= 2) %>%
  ## select only the relevant columns
  select(Age, Sex, Survived, Cabin, Fare) %>%
  ## order the data (in descending) by Age
  arrange(Age) %>%
  ## create a new column
  mutate(d.Fare = cut(Fare, 3, labels = c('Low', 'Middle', 'High')))
## Source: local data frame [74 x 6]
## 
##     Age    Sex Survived Cabin    Fare d.Fare
## 1  0.75 female        1       19.2583    Low
## 2  0.75 female        1       19.2583    Low
## 3  1.00   male        0       39.6875    Low
## 4  1.00   male        1    F4 39.0000    Low
## 5  1.00   male        0       46.9000    Low
## 6  2.00   male        0       21.0750    Low
## 7  2.00   male        0       29.1250    Low
## 8  2.00 female        0       31.2750    Low
## 9  2.00 female        0       27.9000    Low
## 10 2.00   male        0       39.6875    Low
## ..  ...    ...      ...   ...     ...    ...

To do even more interesting things, we can combine the dplyr package with the tidyr package. The tidyr has basically two main verbs:

  • gather
  • spread
df %>%
  filter(SibSp >= 2) %>%
  select(Age, Sex, Survived, Cabin, Fare) %>%
  ## convert the data into a very long format
  gather(Measure, Value, -Sex) %>%
  ## make each summarise command run on the groups Sex and Measure
  group_by(Sex, Measure) %>%
  ## create summary statistics, in this cause the sample in each group
  ## (Sex and Measure)
  summarise(n = n()) %>%
  ## convert the data into a wide format
  spread(Sex, n)
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Source: local data frame [4 x 3]
## 
##    Measure female male
## 1      Age     34   40
## 2 Survived     34   40
## 3    Cabin     34   40
## 4     Fare     34   40

Check the content again:

df %>% summary
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

Compare the means of continuous variables of those who survived and those who didn't.

prep.table <- df %>%
  select(Survived, Age, Pclass, SibSp, Parch, Fare) %>%
  gather(Measure, Value, -Survived) %>%
  group_by(Survived, Measure) %>%
  ## remove missing values
  na.omit() %>%
  ## create a summary statistic (means)
  summarise(mean = mean(Value) %>% round(2)) %>%
  spread(Survived, mean)

prep.table
## Source: local data frame [5 x 3]
## 
##   Measure     0     1
## 1     Age 30.63 28.34
## 2  Pclass  2.53  1.95
## 3   SibSp  0.55  0.47
## 4   Parch  0.33  0.46
## 5    Fare 22.12 48.40

This can be created into a markdown table, so that it can be easily put into a manuscript or report. A very useful package is called pander which allows you to create markdown tables.

install.packages('pander')
library(pander)
prep.table %>% pander(style = 'rmarkdown')
Measure 0 1
Age 30.63 28.34
Pclass 2.53 1.95
SibSp 0.55 0.47
Parch 0.33 0.46
Fare 22.12 48.4

There is also join commands from dplyr:

  • left_join
  • outer_join
  • inner_join
  • anti_join

Code used in workshop

Exact code used in the workshop

ds <- df
ds %>% select(., Sex, Cabin, Fare) %>%
    filter(., Sex == 'female', Fare > 10)
## Source: local data frame [250 x 3]
## 
##       Sex Cabin    Fare
## 1  female   C85 71.2833
## 2  female  C123 53.1000
## 3  female       11.1333
## 4  female       30.0708
## 5  female    G6 16.7000
## 6  female  C103 26.5500
## 7  female       16.0000
## 8  female       18.0000
## 9  female       21.0750
## 10 female       31.3875
## ..    ...   ...     ...
filter(select(ds, Sex, Cabin, Fare),
       Sex == 'female')
## Source: local data frame [314 x 3]
## 
##       Sex Cabin    Fare
## 1  female   C85 71.2833
## 2  female        7.9250
## 3  female  C123 53.1000
## 4  female       11.1333
## 5  female       30.0708
## 6  female    G6 16.7000
## 7  female  C103 26.5500
## 8  female        7.8542
## 9  female       16.0000
## 10 female       18.0000
## ..    ...   ...     ...
ds %>%
    select(Sex, Cabin, Fare) %>%
    mutate(newcol = 10 * Fare) %>%
    arrange(Fare)
## Source: local data frame [891 x 4]
## 
##     Sex Cabin Fare newcol
## 1  male          0      0
## 2  male   B94    0      0
## 3  male          0      0
## 4  male          0      0
## 5  male          0      0
## 6  male          0      0
## 7  male          0      0
## 8  male          0      0
## 9  male          0      0
## 10 male          0      0
## ..  ...   ...  ...    ...
ds %>%
    select(Sex, Age, Fare) %>%
    group_by(Sex) %>%
    na.omit() %>%
    summarise(n = n(),
              mean = mean(Age))
## Source: local data frame [2 x 3]
## 
##      Sex   n     mean
## 1 female 261 27.91571
## 2   male 453 30.72664

Remove a column using the - sign.

ds %>% select(-Age)
## Source: local data frame [891 x 11]
## 
##    PassengerId Survived Pclass
## 1            1        0      3
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 5            5        0      3
## 6            6        0      3
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
## ..         ...      ...    ...
## Variables not shown: Name (fctr), Sex (fctr), SibSp (int), Parch (int),
##   Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)
prep.table2 <- ds %>%
    select(Sex, Fare, Age, Pclass, SibSp) %>%
    mutate(AgeDouble = Age * 2) %>%
    gather(Measure, Value, -Sex) %>%
    na.omit() %>%
    group_by(Sex, Measure) %>%
    ## create a column with the mean and standard deviation
    summarise(meanSD = paste0(mean(Value) %>% round(2),
                             ' (', sd(Value) %>% round(2),
                             ')')) %>%
    spread(Sex, meanSD)

prep.table2
## Source: local data frame [5 x 3]
## 
##     Measure        female          male
## 1      Fare    44.48 (58) 25.52 (43.14)
## 2       Age 27.92 (14.11) 30.73 (14.68)
## 3    Pclass   2.16 (0.86)   2.39 (0.81)
## 4     SibSp   0.69 (1.16)   0.43 (1.06)
## 5 AgeDouble 55.83 (28.22) 61.45 (29.36)

Create a table, with a caption!

pander(prep.table2, style = 'rmarkdown',
       caption = 'Testing caption for this table!')
Measure female male
Fare 44.48 (58) 25.52 (43.14)
Age 27.92 (14.11) 30.73 (14.68)
Pclass 2.16 (0.86) 2.39 (0.81)
SibSp 0.69 (1.16) 0.43 (1.06)
AgeDouble 55.83 (28.22) 61.45 (29.36)

Table: Testing caption for this table!

To use the grid.table function instead of pander, load the gridExtra package. You may need to install first (install.packages('gridExtra')).

library(gridExtra)
grid.table(prep.table2, rows = NULL)

plot of chunk unnamed-chunk-22

A package that does a tutorial within R! I've heard good things from it.

install.packages('swirl')

To see help files, you can run the vignette command.

vignette('introduction', package = 'dplyr')