typos
grantmcdermott committed May 26, 2021
1 parent 5344b69 commit a3ca5ad
Showing 6 changed files with 19 additions and 16 deletions.
14 changes: 7 additions & 7 deletions 17-spark/17-spark.Rmd
@@ -204,16 +204,16 @@ The [Wikipedia page](https://en.wikipedia.org/wiki/Apache_Spark) is a good place

Spark's tagline is that it provides a *unified analytics engine for large-scale data processing*. **Translation:** Spark provides a common platform for tackling all of the problems that we typically encounter in big data pipelines. This includes distributed data storage, data wrangling, and analytics (machine learning and co.). Spark achieves this by bringing together various sub-components:

- Spark Core (the foundation of the project, enabling things like distributed computing. We'll get back to this in a second.)
- Spark SQL (exactly what it sounds like: a SQL implementation for querying and storing data in Spark)
- MLlib (an extensive machine learning library)
- **Spark Core**. The foundation of the project, enabling things like distributed computing and memory-mapping.
- **Spark SQL**. A SQL implementation for efficiently querying and storing data in Spark.
- **MLlib**. An extensive machine learning library.
- etc.

Now, at this point you may be asking yourself: "Hang on a minute, haven't we been learning to do all these things in R? Why do we need this Spark stuff if R already provides a (quote unquote) 'unified framework' for our data science needs?"

The short answer is that Spark can scale in ways that R simply can't. You can move your analysis from a test dataset living on your local computer, to a massive dataset running over a cluster of distributed machines with minimal changes to your code. But at a deeper level, asking yourself the above questions is to miss the point. We will be running all of our Spark functions and analysis from R, thanks to the **sparklyr** package. It will look like R code and it will return R outputs.

**Aside:** You'll note that Spark is often compared to a predecessor framework for cluster computing and distributed data processing called [(Hadoop) MapReduce](https://en.wikipedia.org/wiki/MapReduce). The key difference between these two frameworks is that Spark allows for in-memory processing, whereas MapReduce relies solely on I/O to disk. (See [here](https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce) and [here](https://www.quora.com/What-is-the-difference-in-idea-design-and-code-between-Apache-Spark-and-Apache-Hadoop/answer/Shubham-Sinha-202.)) You don't need to worry too much about all this now. It basically means that Spark is faster and better suited to our present needs. Spark also comes with a bunch of really cool extension libraries, which we'll barely scratch the surface of today.
> **Aside:** You'll note that Spark is often compared to a predecessor framework for cluster computing and distributed data processing, i.e. [(Hadoop) MapReduce](https://en.wikipedia.org/wiki/MapReduce). The key difference between these two frameworks is that Spark allows for in-memory processing, whereas MapReduce relies solely on I/O to disk. (See [here](https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce) and [here](https://www.quora.com/What-is-the-difference-in-idea-design-and-code-between-Apache-Spark-and-Apache-Hadoop/answer/Shubham-Sinha-202.)) You don't need to worry too much about all this now. It essentially means that Spark is faster and better suited to our present needs. Spark also comes with a bunch of really cool extension libraries, of which we'll barely scratch the surface today.
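To make that last point concrete, here is a minimal sketch of the **sparklyr** workflow: connect, copy data into Spark, and then wrangle it with ordinary **dplyr** verbs. The local connection settings and the `nycflights13` example data below are purely illustrative.

```r
library(sparklyr)     ## Spark <-> R interface
library(dplyr)        ## Familiar data wrangling verbs
library(nycflights13) ## Example data (illustrative)

sc = spark_connect(master = "local") ## Start (or connect to) a local Spark instance

## Copy an R data frame into Spark; returns a lazy table reference, not the data itself
flights_tbl = copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

## Ordinary dplyr code, translated to Spark SQL and executed inside Spark
flights_tbl %>%
  group_by(month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect() ## Bring the (small) summarised result back into R

spark_disconnect(sc) ## Close the connection when done
```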
## Big data and Spark

@@ -431,7 +431,7 @@ flights_cached

> **Tip:** Open the Spark web UI (`spark_web(sc)`) and click the "Storage" tab to see which of your tables are cached and held in memory.
Remembering that "flights_cached" still only exists as a Spark table, here's something cool: We can use the **dbplot** package ([link](https://db.rstudio.com/dbplot/)) to perform plot calculations *inside* the Spark connection (i.e. database). While this might seem like overkill for this particular example --- and it is --- this database plotting functionality is extremely useful extracting insights from large database connections.
Remembering that "flights_cached" still only exists as a Spark table, here's something cool: We can use the **dbplot** package ([link](https://db.rstudio.com/dbplot/)) to perform plot calculations *inside* the Spark connection (i.e. database). While this might seem like overkill for this particular example --- and it is --- this database plotting functionality is extremely useful for extracting insights from large database connections.
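For example, a histogram of departure delays computed inside Spark might look roughly like the following sketch; the variable, bin count, and labels are illustrative, and the lecture's own chunk follows.

```r
library(dbplot)  ## Pushes plot calculations down to the database/Spark
library(ggplot2) ## dbplot returns ggplot2 objects

## Spark computes the bin counts; only the summarised result is returned to R
flights_cached %>%
  dbplot_histogram(dep_delay, bins = 30) +
  labs(x = "Departure delay (minutes)", y = "Count") ## Illustrative labels
```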

```{r flights_cached_plot}
# library(dbplot) ## Already loaded
@@ -444,13 +444,13 @@ And there we have our same plot from before, but now executed from inside the Sp

## Machine Learning

Some of the most exciting applications of Spark involve machine learning; both through its built-in [MLlib](http://spark.rstudio.com/mlib/) library and its (seamless) interface to external platforms like H2O's [Sparkling Water](https://www.h2o.ai/sparkling-water/). There's a lot to cover here and I'm not going to be able to explain everything in depth. (We'll have a dedicated intro-to-machine-learning lecture next week.) But here's a simple example that uses our cached flights data to predict arrival delays with different algorithms.
Some of the most exciting applications of Spark involve machine learning, both through its built-in [MLlib](http://spark.rstudio.com/mlib/) library and its (seamless) interface to external platforms like H2O's [Sparkling Water](https://www.h2o.ai/sparkling-water/). There's a lot to cover here and I'm not going to be able to explain everything in depth. But in this section I'll demonstrate by running several different models (algorithms) on our cached flights dataset. In each case our task will be exactly the same: Predict whether a plane will be late arriving at its destination based on when it departed.
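Each model will follow the same fit, predict, evaluate pattern. As a rough preview, using a logistic regression and assuming training/test splits like the ones created in the next subsection (the formula and object names here are illustrative):

```r
## Fit on the training data (formula is illustrative)
log_mod  = ml_logistic_regression(flights_train, late ~ dep_delay)

## Predict on the held-out test data
log_pred = ml_predict(log_mod, flights_test)

## Summary metrics
ml_binary_classification_evaluator(log_pred)     ## Area under the ROC curve
ml_multiclass_classification_evaluator(log_pred) ## F1 score
```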

### Prepare the data

First, I'm going to prepare the data. For this simple example, I'm not going to worry about scaling and any of the other feature engineering tricks that we'd normally heed in a real-life ML problem. Instead, I'm just going to create a binary (1/0) variable called "late", which records whether a flight arrived more than 15 minutes behind schedule. In other words, this is going to be a *classification* problem.

As ever with prediction algorithms, we also need to partition our data into random samples for testing and validation. This will help us avoid overfitting. In this case, I'm only going to use 10% of the data for model building and testing. This 10% sample will in turn be split 30-70 between training and testing data, respectively.
As ever with prediction algorithms, we also need to partition our data into random samples for testing and validation. This will help us avoid overfitting. In this case, I'm only going to use 10% of the data for model building and testing to reduce estimation time.^[Estimating models on random samples is very common practice in industries where you are working with extremely large datasets.] This 10% sample will then be split 30-70 between training and testing data, respectively.
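A rough sketch of these preparation and partitioning steps using **sparklyr** helpers is shown below. The object names, the sampling call, and the seed are illustrative rather than the exact code used in the chunk that follows.

```r
splits =
  flights_cached %>%
  filter(!is.na(arr_delay)) %>%
  ## Binary outcome: did the flight arrive more than 15 minutes behind schedule?
  mutate(late = as.integer(arr_delay > 15)) %>%
  ## Keep a random 10% of the data to reduce estimation time
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 123) %>%
  ## Split that sample 30% training / 70% testing
  sdf_random_split(train = 0.3, test = 0.7, seed = 123)

flights_train = splits$train
flights_test  = splits$test
```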

```{r flights_sample, cache=TRUE}
flights_sample =
21 changes: 12 additions & 9 deletions 17-spark/17-spark.html
@@ -439,14 +439,16 @@ <h2>What is Spark?</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Apache_Spark">Wikipedia page</a> is a good place to start.</p>
<p>Spark’s tagline is that it provides a <em>unified analytics engine for large-scale data processing</em>. <strong>Translation:</strong> Spark provides a common platform for tackling all of the problems that we typically encounter in big data pipelines. This includes distributed data storage, data wrangling, and analytics (machine learning and co.). Spark achieves this by bringing together various sub-components:</p>
<ul>
<li>Spark Core (the foundation of the project, enabling things like distributed computing. We’ll get back to this in a second.)</li>
<li>Spark SQL (exactly what it sounds like: a SQL implementation for querying and storing data in Spark)</li>
<li>MLlib (an extensive machine learning library)</li>
<li><strong>Spark Core</strong>. The foundation of the project, enabling things like distributed computing and memory-mapping.</li>
<li><strong>Spark SQL</strong>. A SQL implementation for efficiently querying and storing data in Spark.</li>
<li><strong>MLlib</strong>. An extensive machine learning library.</li>
<li>etc.</li>
</ul>
<p>Now, at this point you may be asking yourself: “Hang on a minute, haven’t we been learning to do all these things in R? Why do we need this Spark stuff if R already provides a (quote unquote) ‘unified framework’ for our data science needs?”</p>
<p>The short answer is that Spark can scale in ways that R simply can’t. You can move your analysis from a test dataset living on your local computer, to a massive dataset running over a cluster of distributed machines with minimal changes to your code. But at a deeper level, asking yourself the above questions is to miss the point. We will be running all of our Spark functions and analysis from R, thanks to the <strong>sparklyr</strong> package. It will look like R code and it will return R outputs.</p>
<p><strong>Aside:</strong> You’ll note that Spark is often compared to a predecessor framework for cluster computing and distributed data processing called <a href="https://en.wikipedia.org/wiki/MapReduce">(Hadoop) MapReduce</a>. The key difference between these two frameworks is that Spark allows for in-memory processing, whereas MapReduce relies solely on I/O to disk. (See <a href="https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce">here</a> and <a href="https://www.quora.com/What-is-the-difference-in-idea-design-and-code-between-Apache-Spark-and-Apache-Hadoop/answer/Shubham-Sinha-202.">here</a>) You don’t need to worry too much about all this now. It basically means that Spark is faster and better suited to our present needs. Spark also comes with a bunch of really cool extension libraries, which we’ll barely scratch the surface of today.</p>
<blockquote>
<p><strong>Aside:</strong> You’ll note that Spark is often compared to a predecessor framework for cluster computing and distributed data processing, i.e. <a href="https://en.wikipedia.org/wiki/MapReduce">(Hadoop) MapReduce</a>. The key difference between these two frameworks is that Spark allows for in-memory processing, whereas MapReduce relies solely on I/O to disk. (See <a href="https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce">here</a> and <a href="https://www.quora.com/What-is-the-difference-in-idea-design-and-code-between-Apache-Spark-and-Apache-Hadoop/answer/Shubham-Sinha-202.">here</a>) You don’t need to worry too much about all this now. It essentially means that Spark is faster and better suited to our present needs. Spark also comes with a bunch of really cool extension libraries, of which we’ll barely scratch the surface today.</p>
</blockquote>
</div>
<div id="big-data-and-spark" class="section level2">
<h2>Big data and Spark</h2>
@@ -668,7 +670,7 @@ <h3>Option 2: Distributed analysis using <strong>sparklyr</strong></h3>
<blockquote>
<p><strong>Tip:</strong> Open the Spark web UI (<code>spark_web(sc)</code>) and click the “Storage” tab to see which of your tables are cached and held in memory.</p>
</blockquote>
<p>Remembering that “flights_cached” still only exists as a Spark table, here’s something cool: We can use the <strong>dbplot</strong> package (<a href="https://db.rstudio.com/dbplot/">link</a>) to perform plot calculations <em>inside</em> the Spark connection (i.e. database). While this might seem like overkill for this particular example — and it is — this database plotting functionality is extremely useful extracting insights from large database connections.</p>
<p>Remembering that “flights_cached” still only exists as a Spark table, here’s something cool: We can use the <strong>dbplot</strong> package (<a href="https://db.rstudio.com/dbplot/">link</a>) to perform plot calculations <em>inside</em> the Spark connection (i.e. database). While this might seem like overkill for this particular example — and it is — this database plotting functionality is extremely useful for extracting insights from large database connections.</p>
<div class="sourceCode" id="cb33"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a><span class="co"># library(dbplot) ## Already loaded</span></span>
<span id="cb33-2"><a href="#cb33-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-3"><a href="#cb33-3" aria-hidden="true" tabindex="-1"></a>flights_cached <span class="sc">%&gt;%</span></span>
@@ -679,11 +681,11 @@
</div>
<div id="machine-learning" class="section level2">
<h2>Machine Learning</h2>
<p>Some of the most exciting applications of Spark involve machine learning; both through its built-in <a href="http://spark.rstudio.com/mlib/">MLlib</a> library and its (seamless) interface to external platforms like H2O’s <a href="https://www.h2o.ai/sparkling-water/">Sparkling Water</a>. There’s a lot to cover here and I’m not going to be able to explain everything in depth. (We’ll have a dedicated intro-to-machine-learning lecture next week.) But here’s a simple example that uses our cached flights data to predict arrival delays with different algorithms.</p>
<p>Some of the most exciting applications of Spark involve machine learning, both through its built-in <a href="http://spark.rstudio.com/mlib/">MLlib</a> library and its (seamless) interface to external platforms like H2O’s <a href="https://www.h2o.ai/sparkling-water/">Sparkling Water</a>. There’s a lot to cover here and I’m not going to be able to explain everything in depth. But in this section I’ll demonstrate by running several different models (algorithms) on our cached flights dataset. In each case our task will be exactly the same: Predict whether a plane will be late arriving at its destination based on when it departed.</p>
<div id="prepare-and-the-data" class="section level3">
<h3>Prepare the data</h3>
<p>First, I’m going to prepare the data. For this simple example, I’m not going to worry about scaling and any of the other feature engineering tricks that we’d normally heed in a real-life ML problem. Instead, I’m just going to create a binary (1/0) variable called “late”, which records whether a flight arrived more than 15 minutes behind schedule. In other words, this is going to be a <em>classification</em> problem.</p>
<p>As ever with prediction algorithms, we also need to partition our data into random samples for testing and validation. This will help us avoid overfitting. In this case, I’m only going to use 10% of the data for model building and testing. This 10% sample will in turn be split 30-70 between training and testing data, respectively.</p>
<p>As ever with prediction algorithms, we also need to partition our data into random samples for testing and validation. This will help us avoid overfitting. In this case, I’m only going to use 10% of the data for model building and testing to reduce estimation time.<a href="#fn9" class="footnote-ref" id="fnref9"><sup>9</sup></a> This 10% sample will then be split 30-70 between training and testing data, respectively.</p>
<div class="sourceCode" id="cb34"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true" tabindex="-1"></a>flights_sample <span class="ot">=</span> </span>
<span id="cb34-2"><a href="#cb34-2" aria-hidden="true" tabindex="-1"></a> flights_cached <span class="sc">%&gt;%</span></span>
<span id="cb34-3"><a href="#cb34-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(<span class="sc">!</span><span class="fu">is.na</span>(arr_delay)) <span class="sc">%&gt;%</span></span>
@@ -747,7 +749,7 @@ <h4>Logistic regression</h4>
<p>How did we do overall?</p>
<div class="sourceCode" id="cb42"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a><span class="do">## Summary metrics:</span></span>
<span id="cb42-2"><a href="#cb42-2" aria-hidden="true" tabindex="-1"></a><span class="fu">ml_binary_classification_evaluator</span>(log_pred) <span class="do">## area under ROC</span></span></code></pre></div>
<pre><code>## [1] 0.9288696</code></pre>
<pre><code>## [1] 0.9288729</code></pre>
<div class="sourceCode" id="cb44"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb44-1"><a href="#cb44-1" aria-hidden="true" tabindex="-1"></a><span class="fu">ml_multiclass_classification_evaluator</span>(log_pred) <span class="do">## F1 score</span></span></code></pre></div>
<pre><code>## [1] 0.9371875</code></pre>
<div class="sourceCode" id="cb46"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb46-1"><a href="#cb46-1" aria-hidden="true" tabindex="-1"></a><span class="do">## We&#39;ll also create a confusion matrix for use later</span></span>
@@ -808,7 +810,7 @@ <h4>Neural network</h4>
<span id="cb52-18"><a href="#cb52-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb52-19"><a href="#cb52-19" aria-hidden="true" tabindex="-1"></a><span class="do">## Summary metrics:</span></span>
<span id="cb52-20"><a href="#cb52-20" aria-hidden="true" tabindex="-1"></a><span class="fu">ml_binary_classification_evaluator</span>(nnet_pred) <span class="do">## area under ROC</span></span></code></pre></div>
<pre><code>## [1] 0.928711</code></pre>
<pre><code>## [1] 0.9287353</code></pre>
<div class="sourceCode" id="cb54"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb54-1"><a href="#cb54-1" aria-hidden="true" tabindex="-1"></a><span class="fu">ml_multiclass_classification_evaluator</span>(nnet_pred) <span class="do">## F1 score</span></span></code></pre></div>
<pre><code>## [1] 0.9371784</code></pre>
<div class="sourceCode" id="cb56"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb56-1"><a href="#cb56-1" aria-hidden="true" tabindex="-1"></a><span class="do">## Create confusion matrix</span></span>
@@ -886,6 +888,7 @@ <h2>Further reading and resources</h2>
<li id="fn6"><p>In the background, these are getting translated to Spark SQL.<a href="#fnref6" class="footnote-back">↩︎</a></p></li>
<li id="fn7"><p>You might note that that I’m not using <strong>lubridate</strong> for the date conversions in the query that follows. That’s because Spark relies on Hive functions for built-in aggregation, including datetime operations. See <a href="https://spark.rstudio.com/dplyr/#hive-functions">here</a>.<a href="#fnref7" class="footnote-back">↩︎</a></p></li>
<li id="fn8"><p>By default, local Spark instances only request 2 GB of memory for the Spark driver. However, you can easily change this by requesting a larger memory allocation (among other features) via the <code>config</code> argument. See the <strong>sparklyr</strong> <a href="https://spark.rstudio.com/deployment/#configuration">website</a> or <a href="https://therinspark.com/tuning.html">Chapter 9</a> of <em>Mastering Spark with R</em> for more details.<a href="#fnref8" class="footnote-back">↩︎</a></p></li>
<li id="fn9"><p>Estimating models on random samples is very common practice in industries where you are working with extremely large datasets.<a href="#fnref9" class="footnote-back">↩︎</a></p></li>
</ol>
</div>

Binary file modified 17-spark/17-spark.pdf
Binary file not shown.
Binary file modified 17-spark/17-spark_files/figure-latex/cmat_plot-1.pdf
Binary file not shown.
Binary file not shown.
Binary file modified 17-spark/17-spark_files/figure-latex/flights_cached_plot-1.pdf
Binary file not shown.
