[SPARK-33605][BUILD] Add gcs-connector to hadoop-cloud module
### What changes were proposed in this pull request?

This PR aims to add `gcs-connector` shaded jar to `hadoop-cloud` module.

### Why are the changes needed?

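To make Google Cloud Storage easier to use: with the connector bundled in the `hadoop-cloud` distribution, `gs://` paths can be read and written without adding the jar separately.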

### Does this PR introduce _any_ user-facing change?

Only one shaded jar file is added when the distribution is built with `-Phadoop-cloud`.
```
$ ls -alh gcs*
-rw-r--r-- 1 dongjoon  staff    32M Aug 31 11:14 gcs-connector-hadoop3-2.2.7-shaded.jar
```
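
One way to confirm that the jar landed in a build is to look for it in the distribution's jar directory. The sketch below assumes the `dist/jars` output location of `dev/make-distribution.sh`; the helper name `find_gcs_jar` is hypothetical and not part of Spark.

```shell
# Hypothetical helper (not part of Spark): print any shaded gcs-connector jars
# found in the given directory, and return non-zero if there are none.
find_gcs_jar() {
  dir="${1:-dist/jars}"   # assumed output directory of make-distribution.sh
  found=1
  for jar in "$dir"/gcs-connector-*-shaded.jar; do
    # When nothing matches, the loop sees the unexpanded pattern; skip it.
    [ -e "$jar" ] || continue
    echo "$jar"
    found=0
  done
  return "$found"
}
```

For example, `find_gcs_jar dist/jars || echo "gcs-connector jar not found" >&2` reports a build that was made without `-Phadoop-cloud`.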

### How was this patch tested?

**BUILD**
```
$ dev/make-distribution.sh -Phadoop-cloud
```

**RUN**
```
$ export KEYFILE=YOUR-credentials.json
$ export EMAIL=$(jq -r '.client_email' < $KEYFILE)
$ export PRIVATE_KEY_ID=$(jq -r '.private_key_id' < $KEYFILE)
$ export PRIVATE_KEY="$(jq -r '.private_key' < $KEYFILE)"
$ bin/spark-shell \
-c spark.hadoop.fs.gs.auth.service.account.email=$EMAIL \
-c spark.hadoop.fs.gs.auth.service.account.private.key.id=$PRIVATE_KEY_ID \
-c spark.hadoop.fs.gs.auth.service.account.private.key="$PRIVATE_KEY"
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/31 11:56:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1661972165062).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0-SNAPSHOT
      /_/

Using Scala version 2.12.16 (OpenJDK 64-Bit Server VM, Java 17.0.4)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.text("gs://apache-spark-bucket/README.md").count()
res0: Long = 124

scala> spark.read.orc("examples/src/main/resources/users.orc").write.orc("gs://apache-spark-bucket/users.orc")

scala> spark.read.orc("gs://apache-spark-bucket/users.orc").show()
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
```
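
The three `-c` options in the RUN steps are built from exported variables, so an empty value (for example, from a wrong key-file path) may only surface later as an authentication failure. A minimal sketch of a pre-flight check follows; the function name `check_gcs_env` is hypothetical, and the variable names are the ones exported above.

```shell
# Hypothetical pre-flight check (not part of Spark): make sure the variables
# exported in the RUN steps are set and non-empty before starting spark-shell.
check_gcs_env() {
  status=0
  for name in EMAIL PRIVATE_KEY_ID PRIVATE_KEY; do
    eval "value=\${$name:-}"   # POSIX-compatible indirect variable lookup
    # jq -r prints the literal string "null" when a field is absent from the JSON.
    if [ -z "$value" ] || [ "$value" = "null" ]; then
      echo "missing: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running `check_gcs_env && bin/spark-shell ...` would then start the shell only when all three values are present.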

Closes apache#37745 from dongjoon-hyun/SPARK-33605.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun committed Sep 7, 2022
1 parent 4e3e627 commit 871152b
Showing 5 changed files with 17 additions and 0 deletions.
1 change: 1 addition & 0 deletions LICENSE-binary
@@ -408,6 +408,7 @@ org.datanucleus:javax.jdo
com.tdunning:json
org.apache.velocity:velocity
org.apache.yetus:audience-annotations
com.google.cloud.bigdataoss:gcs-connector

core/src/main/java/org/apache/spark/util/collection/TimSort.java
core/src/main/resources/org/apache/spark/ui/static/bootstrap*
1 change: 1 addition & 0 deletions dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -64,6 +64,7 @@ datanucleus-rdbms/4.1.19//datanucleus-rdbms-4.1.19.jar
derby/10.14.2.0//derby-10.14.2.0.jar
dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar
gcs-connector/hadoop2-2.2.7/shaded/gcs-connector-hadoop2-2.2.7-shaded.jar
generex/1.0.2//generex-1.0.2.jar
gmetric4j/1.0.10//gmetric4j-1.0.10.jar
gson/2.2.4//gson-2.2.4.jar
1 change: 1 addition & 0 deletions dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -61,6 +61,7 @@ datanucleus-rdbms/4.1.19//datanucleus-rdbms-4.1.19.jar
derby/10.14.2.0//derby-10.14.2.0.jar
dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
flatbuffers-java/1.12.0//flatbuffers-java-1.12.0.jar
gcs-connector/hadoop3-2.2.7/shaded/gcs-connector-hadoop3-2.2.7-shaded.jar
generex/1.0.2//generex-1.0.2.jar
gmetric4j/1.0.10//gmetric4j-1.0.10.jar
gson/2.2.4//gson-2.2.4.jar
12 changes: 12 additions & 0 deletions hadoop-cloud/pom.xml
@@ -135,6 +135,18 @@
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcs-connector</artifactId>
      <version>${gcs-connector.version}</version>
      <classifier>shaded</classifier>
      <exclusions>
        <exclusion>
          <groupId>*</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

<!--
Add joda time to ensure that anything downstream which doesn't pull in spark-hive
2 changes: 2 additions & 0 deletions pom.xml
@@ -151,6 +151,7 @@
<aws.java.sdk.version>1.11.655</aws.java.sdk.version>
<!-- the producer is used in tests -->
<aws.kinesis.producer.version>0.12.8</aws.kinesis.producer.version>
<gcs-connector.version>hadoop3-2.2.7</gcs-connector.version>
<!-- org.apache.httpcomponents/httpclient-->
<commons.httpclient.version>4.5.13</commons.httpclient.version>
<commons.httpcore.version>4.4.14</commons.httpcore.version>
@@ -3503,6 +3504,7 @@
<hadoop-client-api.artifact>hadoop-client</hadoop-client-api.artifact>
<hadoop-client-runtime.artifact>hadoop-yarn-api</hadoop-client-runtime.artifact>
<hadoop-client-minicluster.artifact>hadoop-client</hadoop-client-minicluster.artifact>
<gcs-connector.version>hadoop2-2.2.7</gcs-connector.version>
<!-- SPARK-36547: Please don't upgrade the version below, otherwise there will be an error on building Hadoop 2.7 package -->
<scala-maven-plugin.version>4.3.0</scala-maven-plugin.version>
</properties>
