[SPARK-42445][R] Fix SparkR install.spark function
### What changes were proposed in this pull request?

This PR fixes the `SparkR` `install.spark` function; the session below reproduces the failure with SparkR 3.3.2.

```
$ curl -LO https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/SparkR_3.3.2.tar.gz
$ R CMD INSTALL SparkR_3.3.2.tar.gz
$ R

R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196

> install.spark(hadoopVersion="3")
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 3 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196
```
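Both failures trace back to the same helper: before this patch, `hadoopVersionName` in `R/pkg/R/install.R` only recognized `"x.y"`-style versions, so the default `"2.7"` produced a `hadoop2.7` suffix that is no longer published for Spark 3.3+, and `"3"` fell through to the verbatim branch. For reference, the pre-fix helper (taken from the diff at the end of this page, with comments added here for illustration):

```r
# Pre-fix helper from R/pkg/R/install.R; illustrative comments added.
hadoopVersionName <- function(hadoopVersion) {
  if (hadoopVersion == "without") {
    "without-hadoop"
  } else if (grepl("^[0-9]+\\.[0-9]+$", hadoopVersion, perl = TRUE)) {
    paste0("hadoop", hadoopVersion)  # "2.7" -> "hadoop2.7" (not published for 3.3+)
  } else {
    hadoopVersion                    # "3" -> "3", yielding spark-3.3.2-bin-3.tgz
  }
}
```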

Note that this is a regression introduced in Spark 3.3.0; it is not a blocker for the ongoing Spark 3.3.2 RC vote.

### Why are the changes needed?

https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#ref-usage
![Screenshot 2023-02-14 at 10 07 49 PM](https://user-images.githubusercontent.com/9700541/218946460-ab7eab1b-65ae-4cb2-bc7c-5810ad359ac9.png)

First, the existing Spark 2.0.0 example link in the `install.spark` documentation is broken.
- https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#details
- http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz (Broken)

Second, Spark 3.3.0 changed the Hadoop suffix pattern in the distribution file names, so the function fails as described above (see the sketch after this list).
- http://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop2.7.tgz (Old Pattern)
- http://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz (New Pattern)
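A minimal sketch of the patched mapping (it mirrors the `hadoopVersionName` change in the diff below): only whole-number Hadoop versions now get the `hadoop` prefix, matching the new file-name pattern.

```r
# Patched mapping, mirroring the hadoopVersionName change in the diff below.
hadoopVersionName <- function(hadoopVersion) {
  if (hadoopVersion == "without") {
    "without-hadoop"
  } else if (grepl("^[0-9]+$", hadoopVersion, perl = TRUE)) {
    paste0("hadoop", hadoopVersion)
  } else {
    hadoopVersion
  }
}
hadoopVersionName("3")        # "hadoop3"        -> spark-3.3.2-bin-hadoop3.tgz
hadoopVersionName("without")  # "without-hadoop" -> spark-3.3.2-bin-without-hadoop.tgz
```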

### Does this PR introduce _any_ user-facing change?

No. This fixes a bug so that `install.spark` works again, as it did in Spark 3.2.3 and older versions.

### How was this patch tested?

Pass the CI and manual testing. Note that the generated link pattern is correct, although the download itself fails because Spark 3.5.0 is not published yet.
```
$ NO_MANUAL=1 ./dev/make-distribution.sh --r
$ R CMD INSTALL R/SparkR_3.5.0-SNAPSHOT.tar.gz
$ R
> library(SparkR)
> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.5.0 for Hadoop 3 from:
- https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196
```
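For context, a tiny sketch of the URL the patched code is expected to generate, following the documented layout `<mirrorUrl>/spark-<version>/spark-<version>-bin-hadoop<N>.tgz` (`previewSparkUrl` is a hypothetical helper for illustration, not part of SparkR or this patch):

```r
# Hypothetical helper: preview the download URL the patched code should build.
previewSparkUrl <- function(version, hadoopVersion = "3",
                            mirrorUrl = "https://dlcdn.apache.org/spark") {
  sprintf("%s/spark-%s/spark-%s-bin-hadoop%s.tgz",
          mirrorUrl, version, version, hadoopVersion)
}
previewSparkUrl("3.5.0")
# [1] "https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz"
```

This matches the URL logged in the manual test above.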

Closes apache#40031 from dongjoon-hyun/SPARK-42445.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun committed Feb 15, 2023
1 parent 9843c7c commit b47d29a
Showing 1 changed file with 7 additions and 8 deletions.
```diff
--- a/R/pkg/R/install.R
+++ b/R/pkg/R/install.R
@@ -29,19 +29,18 @@
 #' \code{mirrorUrl} specifies the remote path to a Spark folder. It is followed by a subfolder
 #' named after the Spark version (that corresponds to SparkR), and then the tar filename.
 #' The filename is composed of four parts, i.e. [Spark version]-bin-[Hadoop version].tgz.
-#' For example, the full path for a Spark 2.0.0 package for Hadoop 2.7 from
-#' \code{http://apache.osuosl.org} has path:
-#' \code{http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz}.
+#' For example, the full path for a Spark 3.3.1 package from
+#' \code{https://archive.apache.org} has path:
+#' \code{http://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz}.
 #' For \code{hadoopVersion = "without"}, [Hadoop version] in the filename is then
 #' \code{without-hadoop}.
 #'
-#' @param hadoopVersion Version of Hadoop to install. Default is \code{"2.7"}. It can take other
-#'                      version number in the format of "x.y" where x and y are integer.
+#' @param hadoopVersion Version of Hadoop to install. Default is \code{"3"}.
 #'                      If \code{hadoopVersion = "without"}, "Hadoop free" build is installed.
 #'                      See
 #'                      \href{https://spark.apache.org/docs/latest/hadoop-provided.html}{
 #'                      "Hadoop Free" Build} for more information.
-#'                      Other patched version names can also be used, e.g. \code{"cdh4"}
+#'                      Other patched version names can also be used.
 #' @param mirrorUrl base URL of the repositories to use. The directory layout should follow
 #'                  \href{https://www.apache.org/dyn/closer.lua/spark/}{Apache mirrors}.
 #' @param localDir a local directory where Spark is installed. The directory contains
@@ -65,7 +64,7 @@
 #' @note install.spark since 2.1.0
 #' @seealso See available Hadoop versions:
 #'          \href{https://spark.apache.org/downloads.html}{Apache Spark}
-install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL,
+install.spark <- function(hadoopVersion = "3", mirrorUrl = NULL,
                           localDir = NULL, overwrite = FALSE) {
   sparkHome <- Sys.getenv("SPARK_HOME")
   if (isSparkRShell()) {
@@ -251,7 +250,7 @@ defaultMirrorUrl <- function() {
 hadoopVersionName <- function(hadoopVersion) {
   if (hadoopVersion == "without") {
     "without-hadoop"
-  } else if (grepl("^[0-9]+\\.[0-9]+$", hadoopVersion, perl = TRUE)) {
+  } else if (grepl("^[0-9]+$", hadoopVersion, perl = TRUE)) {
     paste0("hadoop", hadoopVersion)
   } else {
     hadoopVersion
```
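After the fix, typical calls look as follows. A hedged sketch: it assumes a patched SparkR build and a matching published release on the mirror; the parameters are the documented ones from the diff above.

```r
library(SparkR)

install.spark()                           # default: Hadoop 3 build
install.spark(hadoopVersion = "without")  # "Hadoop free" build
install.spark(mirrorUrl = "https://archive.apache.org/dist/spark",
              overwrite = TRUE)           # explicit mirror, refresh the cache
```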
