README.Rmd

https://conferences.oreilly.com/strata/strata-ca-2017/public/schedule/detail/55791
Strata + Hadoop World 
MAKE DATA WORK
MARCH 13–14, 2017: TRAINING
MARCH 14–16, 2017: TUTORIALS & CONFERENCE
SAN JOSE, CA
SCHEDULE
SPEAKERS
SPONSORS
EVENTS
VENUE/HOTEL
ABOUT
RESOURCES
ACCOUNT
NetworkJoin Attendee Network  
Add to your personal scheduleAdd to Your Schedule 
Modeling big data with R, sparklyr, and Apache Spark
John Mount (Win-Vector LLC)
1:30pm–5:00pm Tuesday, March 14, 2017
Data science & advanced analytics
Location: LL21 C/D
Level: Intermediate
Secondary topics:  R
Average rating:  ***** (4.83, 6 ratings)
Who is this presentation for?
Data scientists, data analysts, modelers, R users, Spark users, statisticians, and those in IT
Prerequisite knowledge
Basic familiarity with R
Experience using the dplyr R package (If you have not used dplyr before, please read this chapter before coming to class.)
Materials or downloads needed in advance
A laptop with a wireless network connection and a web browser (i.e., Chrome, Safari, or Firefox) able to run the RStudo Server web client (requires JavaScript and the ability to turn off pop-up and ad blockers selectively) installed. (You will get a free temporary RStudio server URL and credentials during the workshop.)
All course materials will be made public at https://github.com/WinVector/BigDataRStrata2017 and will also be preloaded in the free accounts. (To conserve wireless bandwidth for other seminars, please wait until after the session to download these materials.)
What you'll learn
Learn how to quickly set up a local Spark instance, store big data in Spark and then connect to the data with R, use R to apply machine-learning algorithms to big data stored in Spark, and filter and aggregate big data stored in Spark and then import the results into R for analysis and visualization
Understand how to extend R (sparklyr) to access the entire Spark API
Description
Sparklyr, developed by RStudio in conjunction with IBM, Cloudera, and H2O, provides an R interface to Spark’s distributed machine-learning algorithms and much more. Sparklyr makes practical machine learning scalable and easy. With sparklyr, you can interactively manipulate Spark data using both dplyr and SQL (via DBI); filter and aggregate Spark datasets then bring them into R for analysis and visualization; orchestrate distributed machine learning from R using either Spark MLlib or H2O SparkingWater; create extensions that call the full Spark API and provide interfaces to Spark packages; and establish Spark connections and browse Spark data frames within the RStudio IDE.

John Mount demonstrates how to use sparklyr to analyze big data in Spark, covering filtering and manipulating Spark data to import into R and using R to run machine-learning algorithms on data in Spark. John also explores the sparklyr integration built into the RStudio IDE.

Photo of John Mount
John Mount
Win-Vector LLC
John Mount is a principal consultant at Win-Vector LLC, a San Francisco data science consultancy. John has worked as a computational scientist in biotechnology and a stock-trading algorithm designer and has managed a research team for Shopping.com (now an eBay company). He is the coauthor of Practical Data Science with R (Manning Publications, 2014). John started his advanced education in mathematics at UC Berkeley and holds a PhD in computer science from Carnegie Mellon (specializing in the design and analysis of randomized algorithms). He currently blogs about technical issues at the Win-Vector blog, tweets at @WinVectorLLC, and is active in the Rotary. Please contact jmount@win-vector.com for projects and collaborations.

Website
Comments on this page are now closed.

Comments
Picture of John Mount
John Mount | PRINCIPAL CONSULTANT
03/20/2017 7:18am PDT
I want to say I very much appreciated the chance to speak in front of you, appreciate how hard you worked in the workshop. It was a privilege to work with you and I very much appreciate your support.

Thank you all.

I have also put up a short video showing how to install Spark from R https://youtu.be/qnINvPqcRvE (the linked Github repository has the additional steps to install h2o).

Lolo Fernandez | DATA SCIENTIST
03/18/2017 9:49am PDT
I found very relevant this session and thanks for been very professional presenting it. Every sentence/phrase was just on the spot. Great content and outstanding delivered. I will use SparklyR right away at work. Thanks.

Picture of John Mount
John Mount | PRINCIPAL CONSULTANT
02/24/2017 12:40am PST
We are going to have everything loaded on RStudio Server Pro instances (generously supplied by RStudio), so bringing a ready to go laptop with wireless and an appropriate Javascript enabled web browser should be all you need. Also on the day of the workshop (and after) we will share all slides, code, exercises, solutions, and materials here https://github.com/WinVector/BigDataRStrata2017 . So there will be no need to copy anything to/from your laptop or the servers (though it should be easy to do so if you wish).

---
output:
  md_document:
    variant: markdown_github
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

Materials for:


#### [Modeling big data with R, sparklyr, and Apache Spark](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/55791)

    1:30pm–5:00pm Tuesday, March 14, 2017
    Data science & advanced analytics
    Location: LL21 C/D
    Level: Intermediate
    Secondary topics:  R

    John Mount  (Win-Vector LLC)
    

We have a short video showing how to install [Spark](http://spark.apache.org) using [R](https://cran.r-project.org) and [RStudio](https://www.rstudio.com) [here](https://youtu.be/qnINvPqcRvE).

Also please click through for slides from Edgar Ruiz's excellent [Strata Sparklyr presentation](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/55800) and [cheat-sheet](http://spark.rstudio.com/images/sparklyr-cheatsheet.pdf).
    

Description from Strata announcement
-------------
  
  
  Modeling big data with R, sparklyr, and Apache Spark
  
[John Mount](http://www.win-vector.com/site/staff/john-mount/) ([Win Vector LLC](http://www.win-vector.com/))
1:30pm–5:00pm Tuesday, March 14, 2017
Data science & advanced analytics
Location: LL21 C/D
Level: Intermediate
Secondary topics:  R

#### Who is this presentation for?

Data scientists, data analysts, modelers, R users, Spark users, statisticians, and those in IT


#### Prerequisite knowledge


##### Basic familiarity with R

Experience using the [dplyr](https://CRAN.R-project.org/package=dplyr) R package (If you have not used dplyr before, please read this chapter before coming to class.)
Materials or downloads needed in advance.

A WiFi-enabled laptop (You'll be provided an [RStudio Server Pro](https://www.rstudio.com/products/rstudio-server-pro/) login for students to use on the day of the workshop.)

##### What you'll learn

Learn how to quickly set up a local Spark instance, store big data in Spark and then connect to the data with R, use R to apply machine-learning algorithms to big data stored in Spark, and filter and aggregate big data stored in Spark and then import the results into R for analysis and visualization
Understand how to extend R and use [sparkly](http://spark.rstudio.com)) to access the entire Spark API

### Description

Sparklyr, developed by RStudio in conjunction with IBM, Cloudera, and [H2O](http://www.h2o.ai), provides an R interface to Spark’s distributed machine-learning algorithms and much more. Sparklyr makes practical machine learning scalable and easy. With sparklyr, you can interactively manipulate Spark data using both dplyr and SQL (via DBI); filter and aggregate Spark datasets then bring them into R for analysis and visualization; orchestrate distributed machine learning from R using either Spark MLlib or H2O SparkingWater; create extensions that call the full Spark API and provide interfaces to Spark packages; and establish Spark connections and browse Spark data frames within the RStudio IDE.

John Mount demonstrates how to use sparklyr to analyze big data in Spark, covering filtering and manipulating Spark data to import into R and using R to run machine-learning algorithms on data in Spark. John also also explores the sparklyr integration built into the RStudio IDE.


Derived from [R for big data](https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/52369) (GitHub"" [https://github.com/rstudio/Strata2016](https://github.com/rstudio/Strata2016)).

Public repository is: [https://github.com/WinVector/BigDataRStrata2017](https://github.com/WinVector/BigDataRStrata2017).


###### config


Current list of CRAN packages used:

```{r cranpackges, eval=FALSE}
# often a good idea, though try "n" to build source
# may interfere with us pinning h2o to a specific version
# update.packages(ask=FALSE) 
cranpkgs <- c(
 'babynames',
 'caret',
 'DBI',
 'devtools',
 'dplyr',
 'dygraphs',
 'e1071',
 'formatR',
 'ggplot2',
  # 'h2o', # installed a bit later
 'lubridate',
 'nycflights13',
 'plotly',
 'rbokeh',
 'rsparkling',
 'RSQLite',
 'sparklyr',
 'tidyr',
 'tidyverse',
 'titanic',
 'xtable'
 )
install.packages(cranpkgs)
```

```{r githubddevpkgs, eval=FALSE}
devpkgs <- c(
  'RStudio/EDAWR',
  'WinVector/replyr',
  'WinVector/WVPlots' )

for(pkgi in devpkgs) {
  devtools::install_github(pkgi)
}
```


Also it is critical to look at [Exercises/solutions/RsparklingExample.Rmd](Exercises/solutions/RsparklingExample.Rmd) as it installs and configures some packages.  A refresh of all packages will break the matching version numbers required by `h2o` and `rsparkling`.  So please work through the details in `RsparklingExample.Rmd` after updating and installing all the above packages.

A copy of those note are below (but it is better to look at `RsparklingExample.Rmd`).


```{r installh2o, eval=FALSE}
# updated from https://gist.github.com/edgararuiz/6453d44a91c85a87998cfeb0dfed9fa9
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods", "statmod", "stats",
          "graphics", "RCurl", "jsonlite",
          "tools", "utils")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) {
     install.packages(pkg)
  }
}

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-turnbull/2/R")

# Installing 'rsparkling' from CRAN
install.packages("rsparkling")
options(rsparkling.sparklingwater.version = "2.0.3")
# Reinstalling 'sparklyr' 
install.packages("sparklyr")
sparklyr::spark_install(version = "2.0.0")
```

Note: please note using `dplyr::compute()` (or `sparklyr::sdf_checkpoint()`) with `sparklyr` can have issues (see [sparklyr issue 721](https://github.com/rstudio/sparklyr/issues/721)).