RDesign.tex


\documentclass[11pt]{article}
\usepackage{graphicx}

\usepackage{times}
\usepackage{listings}
\usepackage{parcolumns}
\usepackage{hyperref}

\setlength{\parindent}{0in}
\setlength{\parskip}{0.1in}

\title{R Design Patterns, Base-R vs.\ Tidyverse \\
   With a view toward the teaching of R beginners}
\author{Norman Matloff \\
      Dept. of Computer Science, University of California, Davis}

\begin{document}

\maketitle

This document enables the reader to see at a glance the difference
between base-R and the tidyverse in common R design settings.  I believe
the base-R versions are generally simpler, thus more appropriate for R
learners.  Relative to learning via base-R, learners who are taught Tidy
take longer to reach proficiency, and even then are proficient only in a
very narrow range of operations.

I discuss this in much more detail (e.g. an entire section on
pipes) in \url{http://github.com/matloff/TidyverseSkeptic/README.md},
where I cite many other problems besides lack of simplicity, but I
believe the examples here begin to illustrate why Tidy is a bad vehicle
to use in teaching R beginners.  By the way, I use the latter term to
mean R learners with no prior programming background.

The examples here might also serve those who use base-R but wish to
learn Tidy, or vice versa.

All examples use R's built-in datasets.  Note that, after changing a data
frame, it is restored for the next example, e.g. \textbf{data(mtcars)}.
The examples are presented roughly in order of how often these
operations tend to be performed by R users.

As this document is aimed at comparing base-R and the tidyverse in terms
of teaching new R learners, advanced functions from either base-R (e.g.
\textbf{with()} or the tidyverse (e.g. \textbf{mutate\_at()} are excluded
here.  The same is true for advanced arguments of less-advanced
functions.

It is certainly not claimed here that all possible operations are
simpler in base-R, and indeed \textbf{ I've long called for teaching R
with a blend of the two approaches}, as seen in one of the examples
here.  

\section*{Reading a specific cell in a data frame}

What is the MPG value for the third car? 

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars$mpg[3]

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars %>%
select(mpg) %>% 
slice(3)

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:}  The Tidy version could be shortened a bit by using
\lstinline{select(mtcars,mpg)} instead of 
\lstinline{mtcars %>% select(mpg)}, 
but the latter seems to be the preferred form,
e.g. on the official tidyverse page, \textit{https://dplyr.tidyverse.org/}.  

\section*{Adding a column to a data frame}

Create a new variable, the horsepower-to-weight ratio.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars$hwratio 
  <- mtcars$hp / mtcars$wt

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars %>% 
mutate(hwratio=hp/wt) -> mtcars

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\bigskip

\textbf{Comment:}
Of course, typically Tidy coders would use \lstinline{<-}
rather than \lstinline{->}.  I feel that the former is more
consistent with the ``left to right flow'' of pipes.  But in any case,
the point about code complexity is the same either way.

\section*{Extracting rows}

Find all the cars with 8 cylinders.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtc8 <- 
  subset(mtcars,cyl==8)

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars %>%
filter(cyl == 8) -> mtc8

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\section*{Mean by group}

Find the mean MPG, broken down by number of cylinders.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

tapply(mtcars$mpg,
  mtcars$cyl,mean)

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars %>%
group_by(cyl) %>%
  summarize(meanMPG = 
    mean(mpg,))

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:}  Strangely, many Tidy advocates dismiss the
\textbf{tapply()} function, treating it as a niche function in base-R.
On the contrary, it is a workhorse in classic base-R applications.  For
example, consider the \textbf{ggplot2} package, written by the (later)
inventor of the tidyverse, Hadley Wickham. Hadley calls
\textbf{tapply()} 7 times in the \textbf{ggplot2} code!

\section*{Row means}

For each day in the market, find the mean of 4 major EU stock indexes.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

rowMeans(EuStockMarkets)

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

EuStockMarkets %>% 
  as.data.frame %>% 
  rowwise() %>% 
  mutate(m = 
    rowMeans(across(everything()))) 
  %>% select(m)

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:} The row means operation is quite common in R usage.
Here the Tidy user must go to much more trouble than in base-R.

\section*{Row operations, custom function}

For each market day, find the spread between the smallest index and the
largest.


\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mM <- function(x) 
  max(x) - min(x)
apply(EuStockMarkets,1,mM) 

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mM <- function(x) 
   max(x) - min(x)
EuStockMarkets %>% 
  as.data.frame %>% rowwise() %>% 
  mutate(m = 
     mM(across(everything()))) %>% 
  select(m)

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:} The \textbf{apply()} function is of central importance
in traditional R.  Here again, the Tidy user must go to much more
trouble.

\section*{Means, grouped by more than one variable}

Find the mean MPG, broken down by both number of cylinders and type of
transmission.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

tapply(mtcars$mpg,
  list(mtcars$cyl,
    mtcars$am),
  mean)

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars %>% 
  group_by(cyl,am) %>% 
  summarize(m = mean(mpg))

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:}  
In this example, the two sets of code are not fully comparable,
as there is quite a difference in type of output:

\begin{lstlisting}

> tapply(mtcars$mpg,list(mtcars$cyl,mtcars$am),mean)
       0        1
4 22.900 28.07500
6 19.125 20.56667
8 15.050 15.40000
> mtcars %>% group_by(cyl,am) %>% summarize(m = mean(mpg))
# Groups:   cyl [3]
    cyl    am     m
  <dbl> <dbl> <dbl>
1     4     0  22.9
2     4     1  28.1
3     6     0  19.1
4     6     1  20.6
5     8     0  15.0
6     8     1  15.4

\end{lstlisting}

The base-R form returns a 3x2 table, which is often what one needs for
reports, research papers and so.  The Tidy version is less useful in
such contexts, but would be of greater value in some other applications.

\section*{Binary recoding of a vector}

Record the river levels to 'high' if $>$ 1000, 'low' otherwise.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}
NileHiLow <- 
  ifelse(Nile >= 1000,
    'high','low')
\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

Nile %>% as.data.frame %>% 
  mutate(
    HighLow = case_when
    (x < 1000~'low',
     x >= 1000~'high')
  ) %>%
  select(HighLow) %>%
  as.vector -> HighLow


\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:}
The Tidy step of conversion back to a vector at the end is needed for
the many R packages in which vector input is required.  

\section*{Deleting columns from a data frame}

Remove the 'drat' and 'carb' variables from the data.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars[c('drat','carb')] 
  <- NULL

\end{lstlisting}

\end{minipage}

}

\hspace{0.1in}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

mtcars %>%
select(-c(drat,carb)) 
  -> mtcars

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\section*{Isn't It More Reasonable to Teach a Mix of Base-R and Tidy?}

I believe most people react skeptically to extreme ideologies.  I've
advocated teaching a mix of base-R and Tidy.

Here is an example.  (The Tidy version appeared in a presentation by a
Tidy advocate.)  In the \textbf{mtcars} dataset, say we wish to recode
the number of gears into English.

\begin{parcolumns}[rulebetween=true]{2}

\colchunk{

\begin{minipage}{0.40\linewidth}

\begin{lstlisting}

gr <- mtcars$gear
mtcars$gear <-
  case_when(
    gr == 3 ~ 'three',
    gr == 4 ~ 'four',
    gr == 5 ~ 'five')

\end{lstlisting}

\end{minipage}

}

\hspace{0.05in}

\colchunk{

\begin{minipage}{0.45\linewidth}

\begin{lstlisting}

mtcars %>%
 mutate(
   gear =
   case_when(
     gear == 3 ~ "three",
     gear == 4 ~ "four",
     gear == 5 ~ "five"
   )
 ) -> mtcars

\end{lstlisting}

\end{minipage}

}

\end{parcolumns}

\textbf{Comment:}  A purely-base-R version might use nested
\textbf{ifelse()} calls.  This would be shorter than the Tidy version,
but not as clear.  But by using a blend of base-R and Tidy---use of '\$'
to reference a data frame column and use of the Tidy function
\textbf{case\_when()}---we still end up with shorter and simpler code,
compared to the pure-Tidy version.

\end{document}