[SPARK-42445][R] Fix SparkR `install.spark` function
### What changes were proposed in this pull request?

This PR fixes the `SparkR` `install.spark` method.

```
$ curl -LO https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/SparkR_3.3.2.tar.gz
$ R CMD INSTALL SparkR_3.3.2.tar.gz
$ R

R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196

> install.spark(hadoopVersion="3")
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 3 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196
```

Note that this is a regression introduced in Spark 3.3.0 and is not a blocker for the ongoing Spark 3.3.2 RC vote.

### Why are the changes needed?

https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#ref-usage

![Screenshot 2023-02-14 at 10 07 49 PM](https://user-images.githubusercontent.com/9700541/218946460-ab7eab1b-65ae-4cb2-bc7c-5810ad359ac9.png)

First, the existing Spark 2.0.0 link is broken.
- https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#details
- http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz (Broken)

Second, Spark 3.3.0 changed the Hadoop postfix pattern in the distribution file names, so the function fails as shown above.
- http://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop2.7.tgz (Old Pattern)
- http://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz (New Pattern)

### Does this PR introduce _any_ user-facing change?

No. This fixes a bug so that the function works again as it did in Spark 3.2.3 and older versions.

### How was this patch tested?

Pass the CI and manual testing.

Note that the link pattern is correct; the download below fails only because 3.5.0 is not published yet.

```
$ NO_MANUAL=1 ./dev/make-distribution.sh --r
$ R CMD INSTALL R/SparkR_3.5.0-SNAPSHOT.tar.gz
$ R

> library(SparkR)
> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.5.0 for Hadoop 3 from:
- https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196
```

Closes apache#40031 from dongjoon-hyun/SPARK-42445.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
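
For illustration only, the new distribution file name pattern (`-bin-hadoop<major>.tgz`) can be sketched as a small shell helper. This is not the SparkR implementation: the function name `spark_tarball_url` and its three arguments are hypothetical, and the sketch assumes the Hadoop version is passed as a bare major number such as `2` or `3`.

```shell
#!/bin/sh
# Hypothetical helper (not part of SparkR): compose the download URL of a
# Spark binary distribution using the "-bin-hadoop<major>.tgz" pattern.
spark_tarball_url() {
  base_url="$1"       # mirror or archive root, e.g. http://archive.apache.org/dist/spark
  spark_version="$2"  # Spark release, e.g. 3.3.0
  hadoop_major="$3"   # Hadoop major version only, e.g. 2 or 3 (assumption)
  printf '%s/spark-%s/spark-%s-bin-hadoop%s.tgz\n' \
    "$base_url" "$spark_version" "$spark_version" "$hadoop_major"
}

# Reproduces the "New Pattern" link listed in the commit message.
spark_tarball_url "http://archive.apache.org/dist/spark" "3.3.0" "2"
```

Keeping the `hadoop` prefix inside the format string rather than in the argument avoids the `spark-3.3.2-bin-3.tgz` style of URL seen in the failing run, where only the bare version number ended up in the file name.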