Spark application template for running on Google Cloud
- Add the Spark library dependencies in build.sbt:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.3.0",
"org.apache.spark" % "spark-sql_2.11" % "2.3.0"
)
- Note that Spark 2.3.0 only works with Scala versions below 2.12 and Java versions below 9, according to its documentation
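A minimal application skeleton that fits these dependencies might look like the sketch below; the object name, application name, and input path are placeholders, not part of the original template.

import org.apache.spark.sql.SparkSession

object CloudSparkApp {
  def main(args: Array[String]): Unit = {
    // Build the Spark session; on Dataproc the master is supplied by the cluster
    val spark = SparkSession.builder()
      .appName("CloudSparkApp")
      .getOrCreate()

    // Read an input dataset from Cloud Storage (placeholder path)
    val df = spark.read
      .option("header", "true")
      .csv("gs://your-bucket/datasets/input.csv")

    df.show()

    spark.stop()
  }
}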
- Compile and package the Scala program using sbt:
sbt compile
sbt package
- Upload the datasets to Google Cloud Storage - example below
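For instance, a local dataset can be copied to a bucket with gsutil; the bucket name and paths here are placeholders.

gsutil cp data/*.csv gs://your-bucket/datasets/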
- Run the Spark application on Google Cloud Dataproc. A tutorial can be found here
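Assuming an existing Dataproc cluster, a job submission from the command line could look roughly like this; the cluster name, region, main class, and jar path are placeholders and should match your own project.

gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.CloudSparkApp \
  --jars=target/scala-2.11/cloud-spark-app_2.11-0.1.jar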
- Save the output as Parquet files to Google Cloud Storage
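Inside the application this can be a single DataFrame write, sketched below; the DataFrame name and output path are placeholders.

// Write the result as Parquet files to a Cloud Storage bucket
resultDf.write
  .mode("overwrite")
  .parquet("gs://your-bucket/output/result")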
- Import the Parquet output into Google BigQuery and process it further
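One way to do this is to load the Parquet files with the bq command-line tool; the dataset, table, and path below are placeholders.

bq load --source_format=PARQUET my_dataset.my_table "gs://your-bucket/output/result/*.parquet"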