[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3

### What changes were proposed in this pull request?

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport` (replacement usage for this and `createTable` is sketched after this list)
- Remove KinesisUtils.createStream methods (deprecated in 2.2.0), plus tests of the deprecated methods
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master values in favor of master 'yarn' with an explicit deploy mode ('cluster' or 'client')
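
A minimal sketch of the non-deprecated replacements for the table-creation and Hive entry points above (the app name, table name, format, and path are illustrative assumptions, not values from this PR):

```scala
import org.apache.spark.sql.SparkSession

// Hive support now comes from the session builder; the removed HiveContext
// entry point is no longer needed.
val spark = SparkSession.builder()
  .appName("deprecation-migration-sketch") // illustrative name
  .enableHiveSupport()
  .getOrCreate()

// Catalog.createTable replaces the removed createExternalTable variants.
spark.catalog.createTable(
  "my_table",                              // hypothetical table name
  "parquet",
  Map("path" -> "/tmp/my_table_data"))     // hypothetical location

spark.stop()
```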

Notes:

- I was not able to remove the deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but it is still needed to support Pyspark's .json() method, which can't use a Dataset (the surviving Dataset variant is sketched after these notes).
- It looks like SQLContext.createExternalTable was never actually deprecated in Pyspark, though it almost certainly was meant to be; Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as they are hard to remove (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe they should not have been deprecated.
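
For reference, a minimal sketch of the surviving, non-deprecated `DataFrameReader.json(Dataset[String])` overload mentioned in the first note (the app name and sample records are illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-dataset-sketch") // illustrative name
  .getOrCreate()
import spark.implicits._

// The RDD overload stays only to back Pyspark's .json(); Scala/Java callers
// can use the Dataset[String] overload instead.
val jsonLines: Dataset[String] = Seq(
  """{"name": "a", "value": 1}""", // illustrative records
  """{"name": "b", "value": 2}""").toDS()

val df = spark.read.json(jsonLines)
df.show()
```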

### Why are the changes needed?

Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated more recently, as of Spark 2.3, which is only about 18 months old.

### Does this PR introduce any user-facing change?

Yes, in that deprecated items are removed from some public APIs.

### How was this patch tested?

Existing tests.

Closes apache#25684 from srowen/SPARK-28980.

Lead-authored-by: Sean Owen <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
srowen and HyukjinKwon committed Sep 9, 2019
1 parent 3d6b33a commit 6378d4b
Showing 64 changed files with 224 additions and 2,656 deletions.
7 changes: 4 additions & 3 deletions R/pkg/R/sparkR.R
@@ -266,11 +266,12 @@ sparkR.sparkContext <- function(
#' df <- read.json(path)
#'
#' sparkR.session("local[2]", "SparkR", "/home/spark")
#' sparkR.session("yarn-client", "SparkR", "/home/spark",
#' list(spark.executor.memory="4g"),
#' sparkR.session("yarn", "SparkR", "/home/spark",
#' list(spark.executor.memory="4g", spark.submit.deployMode="client"),
#' c("one.jar", "two.jar", "three.jar"),
#' c("com.databricks:spark-avro_2.12:2.0.1"))
#' sparkR.session(spark.master = "yarn-client", spark.executor.memory = "4g")
#' sparkR.session(spark.master = "yarn", spark.submit.deployMode = "client",
#'                  spark.executor.memory = "4g")
#'}
#' @note sparkR.session since 2.0.0
sparkR.session <- function(
4 changes: 2 additions & 2 deletions R/pkg/tests/fulltests/test_sparkR.R
@@ -36,8 +36,8 @@ test_that("sparkCheckInstall", {

# "yarn-client, mesos-client" mode, SPARK_HOME was not set
sparkHome <- ""
master <- "yarn-client"
deployMode <- ""
master <- "yarn"
deployMode <- "client"
expect_error(sparkCheckInstall(sparkHome, master, deployMode))
sparkHome <- ""
master <- ""
17 changes: 0 additions & 17 deletions core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -548,23 +548,6 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
}
}

if (contains("spark.master") && get("spark.master").startsWith("yarn-")) {
val warning = s"spark.master ${get("spark.master")} is deprecated in Spark 2.0+, please " +
"instead use \"yarn\" with specified deploy mode."

get("spark.master") match {
case "yarn-cluster" =>
logWarning(warning)
set("spark.master", "yarn")
set(SUBMIT_DEPLOY_MODE, "cluster")
case "yarn-client" =>
logWarning(warning)
set("spark.master", "yarn")
set(SUBMIT_DEPLOY_MODE, "client")
case _ => // Any other unexpected master will be checked when creating scheduler backend.
}
}

if (contains(SUBMIT_DEPLOY_MODE)) {
get(SUBMIT_DEPLOY_MODE) match {
case "cluster" | "client" =>
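
With this alias translation removed from SparkConf, a configuration that previously relied on the "yarn-cluster" or "yarn-client" master has to name the master and deploy mode separately, as the updated tests below do. A minimal sketch of the explicit form (the app name is an illustrative assumption):

```scala
import org.apache.spark.SparkConf

// Spark 3.0: "yarn-cluster" / "yarn-client" are no longer rewritten for you,
// so the master and deploy mode are set explicitly.
val conf = new SparkConf()
  .setAppName("yarn-migration-sketch")       // illustrative name
  .setMaster("yarn")
  .set("spark.submit.deployMode", "cluster") // or "client"
```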
19 changes: 0 additions & 19 deletions core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -229,10 +229,6 @@ private[spark] class SparkSubmit extends Logging {
// Set the cluster manager
val clusterManager: Int = args.master match {
case "yarn" => YARN
case "yarn-client" | "yarn-cluster" =>
logWarning(s"Master ${args.master} is deprecated since 2.0." +
" Please use master \"yarn\" with specified deploy mode instead.")
YARN
case m if m.startsWith("spark") => STANDALONE
case m if m.startsWith("mesos") => MESOS
case m if m.startsWith("k8s") => KUBERNETES
@@ -251,22 +247,7 @@
-1
}

// Because the deprecated way of specifying "yarn-cluster" and "yarn-client" encapsulate both
// the master and deploy mode, we have some logic to infer the master and deploy mode
// from each other if only one is specified, or exit early if they are at odds.
if (clusterManager == YARN) {
(args.master, args.deployMode) match {
case ("yarn-cluster", null) =>
deployMode = CLUSTER
args.master = "yarn"
case ("yarn-cluster", "client") =>
error("Client deploy mode is not compatible with master \"yarn-cluster\"")
case ("yarn-client", "cluster") =>
error("Cluster deploy mode is not compatible with master \"yarn-client\"")
case (_, mode) =>
args.master = "yarn"
}

// Make sure YARN is included in our build if we're trying to use it
if (!Utils.classIsLoadable(YARN_CLUSTER_SUBMIT_CLASS) && !Utils.isTesting) {
error(
@@ -22,14 +22,15 @@ import org.apache.hadoop.fs.Path
import org.scalatest.Matchers

import org.apache.spark.{SparkConf, SparkFunSuite}
import org.apache.spark.internal.config.STAGING_DIR
import org.apache.spark.internal.config.{STAGING_DIR, SUBMIT_DEPLOY_MODE}

class HadoopFSDelegationTokenProviderSuite extends SparkFunSuite with Matchers {
test("hadoopFSsToAccess should return defaultFS even if not configured") {
val sparkConf = new SparkConf()
val defaultFS = "hdfs://localhost:8020"
val statingDir = "hdfs://localhost:8021"
sparkConf.set("spark.master", "yarn-client")
sparkConf.setMaster("yarn")
sparkConf.set(SUBMIT_DEPLOY_MODE, "client")
sparkConf.set(STAGING_DIR, statingDir)
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", defaultFS)
@@ -437,7 +437,7 @@ class BlacklistTrackerSuite extends SparkFunSuite with BeforeAndAfterEach with M
}

test("check blacklist configuration invariants") {
val conf = new SparkConf().setMaster("yarn-cluster")
val conf = new SparkConf().setMaster("yarn").set(config.SUBMIT_DEPLOY_MODE, "cluster")
Seq(
(2, 2),
(2, 3)
1 change: 0 additions & 1 deletion dev/sparktestsupport/modules.py
@@ -362,7 +362,6 @@ def __hash__(self):
"pyspark.sql.window",
"pyspark.sql.avro.functions",
# unittests
"pyspark.sql.tests.test_appsubmit",
"pyspark.sql.tests.test_arrow",
"pyspark.sql.tests.test_catalog",
"pyspark.sql.tests.test_column",
28 changes: 0 additions & 28 deletions docs/mllib-evaluation-metrics.md
@@ -577,31 +577,3 @@ variable from a number of independent variables.
</tr>
</tbody>
</table>

**Examples**

<div class="codetabs">
The following code snippets illustrate how to load a sample dataset, train a linear regression algorithm on the data,
and evaluate the performance of the algorithm by several regression metrics.

<div data-lang="scala" markdown="1">
Refer to the [`RegressionMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RegressionMetrics) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/RegressionMetricsExample.scala %}

</div>

<div data-lang="java" markdown="1">
Refer to the [`RegressionMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RegressionMetrics.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaRegressionMetricsExample.java %}

</div>

<div data-lang="python" markdown="1">
Refer to the [`RegressionMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RegressionMetrics) for more details on the API.

{% include_example python/mllib/regression_metrics_example.py %}

</div>
</div>
14 changes: 0 additions & 14 deletions docs/mllib-feature-extraction.md
@@ -348,17 +348,3 @@ Refer to the [`ElementwiseProduct` Python docs](api/python/pyspark.mllib.html#py

A feature transformer that projects vectors to a low-dimensional space using PCA.
Details you can read at [dimensionality reduction](mllib-dimensionality-reduction.html).

### Example

The following code demonstrates how to compute principal components on a `Vector`
and use them to project the vectors into a low-dimensional space while keeping associated labels
for calculation a [Linear Regression](mllib-linear-methods.html)

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.PCA) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/PCAExample.scala %}
</div>
</div>
51 changes: 0 additions & 51 deletions docs/mllib-linear-methods.md
@@ -360,57 +360,6 @@ regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) u
regularization. For all of these models, the average loss or training error, $\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$, is
known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_error).

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
The following example demonstrates how to load training data, parse it as an RDD of LabeledPoint.
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).

Refer to the [`LinearRegressionWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD) and [`LinearRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionModel) for details on the API.

{% include_example scala/org/apache/spark/examples/mllib/LinearRegressionWithSGDExample.scala %}

[`RidgeRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
and [`LassoWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD) can be used in a similar fashion as `LinearRegressionWithSGD`.

</div>

<div data-lang="java" markdown="1">
All of MLlib's methods use Java-friendly types, so you can import and call them there the same
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
calling `.rdd()` on your `JavaRDD` object. The corresponding Java example to
the Scala snippet provided, is presented below:

Refer to the [`LinearRegressionWithSGD` Java docs](api/java/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html) and [`LinearRegressionModel` Java docs](api/java/org/apache/spark/mllib/regression/LinearRegressionModel.html) for details on the API.

{% include_example java/org/apache/spark/examples/mllib/JavaLinearRegressionWithSGDExample.java %}
</div>

<div data-lang="python" markdown="1">
The following example demonstrate how to load training data, parse it as an RDD of LabeledPoint.
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).

Note that the Python API does not yet support model save/load but will in the future.

Refer to the [`LinearRegressionWithSGD` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LinearRegressionWithSGD) and [`LinearRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LinearRegressionModel) for more details on the API.

{% include_example python/mllib/linear_regression_with_sgd_example.py %}
</div>
</div>

In order to run the above application, follow the instructions
provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
section of the Spark
quick-start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.

### Streaming linear regression

When data arrive in a streaming fashion, it is useful to fit regression models online,
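
The removed examples covered the RDD-based `LinearRegressionWithSGD` family; their deprecation notices generally pointed users toward the DataFrame-based `spark.ml` estimators instead. A minimal sketch of that route, assuming the sample libsvm dataset shipped with the Spark source tree and illustrative hyperparameters:

```scala
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lr-sketch") // illustrative name
  .getOrCreate()

// Assumed path: the sample dataset bundled with the Spark distribution.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val lr = new LinearRegression()
  .setMaxIter(10)       // illustrative hyperparameters
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val model = lr.fit(training)
println(s"RMSE: ${model.summary.rootMeanSquaredError}")
```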
5 changes: 5 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -23,6 +23,11 @@ license: |
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0

- In Spark 3.0, the deprecated methods `SQLContext.createExternalTable` and `SparkSession.createExternalTable` have been removed in favor of their replacement, `createTable`.

- In Spark 3.0, the deprecated `HiveContext` class has been removed. Use `SparkSession.builder.enableHiveSupport()` instead.

- Since Spark 3.0, the configuration `spark.sql.crossJoin.enabled` has become an internal configuration and is true by default, so by default Spark will not raise an exception on SQL with an implicit cross join.

- Since Spark 3.0, we have reversed the argument order of the trim function from `TRIM(trimStr, str)` to `TRIM(str, trimStr)` to be compatible with other databases.
2 changes: 1 addition & 1 deletion docs/streaming-kinesis-integration.md
@@ -81,7 +81,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build();

See the [API docs](api/java/index.html?org/apache/spark/streaming/kinesis/KinesisUtils.html)
See the [API docs](api/java/index.html?org/apache/spark/streaming/kinesis/KinesisInputDStream.html)
and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/external/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java). Refer to the [Running the Example](#running-the-example) subsection for instructions to run the example.

</div>
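
With the deprecated `KinesisUtils.createStream` entry points gone, the updated link above documents the `KinesisInputDStream` builder instead. A minimal Scala sketch of that builder (stream name, endpoint, region, and app names are illustrative placeholders):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val ssc = new StreamingContext("local[*]", "kinesis-sketch", Seconds(1))

// All names and endpoints below are placeholders, not values from this PR.
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("my-kinesis-stream")
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .initialPosition(new KinesisInitialPositions.Latest())
  .checkpointAppName("kinesis-sketch-app")
  .checkpointInterval(Milliseconds(2000))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
```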
4 changes: 2 additions & 2 deletions docs/streaming-programming-guide.md
@@ -2488,13 +2488,13 @@ additional effort may be necessary to achieve exactly-once semantics. There are
* [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) and
[DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream)
* [KafkaUtils](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
[KinesisUtils](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisUtils$),
[KinesisUtils](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisInputDStream),
- Java docs
* [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html),
[JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html) and
[JavaPairDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaPairDStream.html)
* [KafkaUtils](api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html),
[KinesisUtils](api/java/index.html?org/apache/spark/streaming/kinesis/KinesisUtils.html)
[KinesisUtils](api/java/index.html?org/apache/spark/streaming/kinesis/KinesisInputDStream.html)
- Python docs
* [StreamingContext](api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext) and [DStream](api/python/pyspark.streaming.html#pyspark.streaming.DStream)
* [KafkaUtils](api/python/pyspark.streaming.html#pyspark.streaming.kafka.KafkaUtils)
