docs: Add simplest possible Spark SQL example
Often I look for a simple "hello world" example in the Kudu Spark docs
and I remember that there isn't one. I've added a quick-and-dirty Spark
SQL example.

Change-Id: I2cf4c00f3a1dc92fd93458aa3c1b1d2cd4f38f78
Reviewed-on: http://gerrit.cloudera.org:8080/11095
Tested-by: Kudu Jenkins
Reviewed-by: Grant Henke <[email protected]>
Reviewed-by: Alexey Serbin <[email protected]>
mpercy authored and granthenke committed Jun 12, 2019
1 parent 53ea7c6 commit d25c6f1
Showing 1 changed file with 28 additions and 9 deletions.
37 changes: 28 additions & 9 deletions docs/developing.adoc
@@ -109,7 +109,26 @@ on the link:http://kudu.apache.org/releases/[releases page].
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.10.0
----

-then import kudu-spark and create a dataframe:
+Below is a minimal Spark SQL "select" example for a Kudu table created with
+Impala in the "default" database. We first import the kudu-spark package,
+then create a DataFrame, and then create a view from the DataFrame. After those
+steps, the table is accessible from Spark SQL.

+[source,scala]
+----
+import org.apache.kudu.spark.kudu._
+// Create a DataFrame that points to the Kudu table we want to query.
+val df = spark.read.options(Map("kudu.master" -> "master1.foo.com,master2.foo.com,master3.foo.com",
+                                "kudu.table" -> "default.my_table")).kudu
+// Create a view from the DataFrame to make it accessible from Spark SQL.
+df.createOrReplaceTempView("my_table")
+// Now we can run Spark SQL queries against our view of the Kudu table.
+spark.sql("select * from my_table").show()
+----
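
The same read can also be written against Spark's generic DataSource API, which the upsert example later in this file uses (`.format("kudu").load`); a minimal sketch, reusing the illustrative master addresses and table name from the example above:

[source,scala]
----
// Sketch: the equivalent read through the generic "kudu" data source.
// Master addresses and table name are the placeholders from the example above.
val df2 = spark.read
  .options(Map("kudu.master" -> "master1.foo.com,master2.foo.com,master3.foo.com",
               "kudu.table" -> "default.my_table"))
  .format("kudu").load

df2.createOrReplaceTempView("my_table")
spark.sql("select * from my_table").show()
----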

+Below is a more sophisticated example that includes both reads and writes:

[source,scala]
----
import org.apache.kudu.client._
@@ -124,14 +143,14 @@ val df = spark.read
df.select("id").filter("id >= 5").show()
// ...or register a temporary table and use SQL
-df.registerTempTable("kudu_table")
+df.createOrReplaceTempView("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5")
filteredDF.show()
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)
-// Create a new Kudu table from a dataframe schema
-// NB: No rows from the dataframe are inserted into the table
+// Create a new Kudu table from a DataFrame schema
+// NB: No rows from the DataFrame are inserted into the table
kuduContext.createTable(
"test_table", df.schema, Seq("key"),
new CreateTableOptions()
@@ -170,15 +189,15 @@ kuduContext.deleteTable("unwanted_table")

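As a quick sketch of the write path introduced by the `// Use KuduContext to create, delete, or write to Kudu tables` comment above, assuming the `kuduContext`, `df`, and `test_table` names from that example (`insertRows`, like the `upsertRows` call shown in the next section, takes a DataFrame and a table name):

[source,scala]
----
// Sketch: write every row of the DataFrame into the table created above.
kuduContext.insertRows(df, "test_table")
// Upserts work the same way: existing rows are updated, new rows are inserted.
kuduContext.upsertRows(df, "test_table")
----
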
=== Upsert option in Kudu Spark
The upsert operation in kudu-spark supports an extra write option of `ignoreNull`. If set to true,
-it will avoid setting existing column values in Kudu table to Null if the corresponding dataframe
+it will avoid setting existing column values in Kudu table to Null if the corresponding DataFrame
column values are Null. If unspecified, `ignoreNull` is false by default.
[source,scala]
----
-val dataDF = spark.read
+val dataFrame = spark.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> simpleTableName))
  .format("kudu").load
-dataDF.registerTempTable(simpleTableName)
-dataDF.show()
+dataFrame.createOrReplaceTempView(simpleTableName)
+dataFrame.show()
// Below is the original data in the table 'simpleTableName'
+---+---+
|key|val|
@@ -191,7 +210,7 @@ val nullDF = spark.createDataFrame(Seq((0, null.asInstanceOf[String]))).toDF("ke
val wo = new KuduWriteOptions
wo.ignoreNull = true
kuduContext.upsertRows(nullDF, simpleTableName, wo)
-dataDF.show()
+dataFrame.show()
// The val field stays unchanged
+---+---+
|key|val|
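
For contrast, a sketch of the default behavior when `ignoreNull` is left unset (false), reusing `nullDF`, `dataFrame`, and `simpleTableName` from the example above: the null in the DataFrame overwrites the stored value.

[source,scala]
----
// Sketch: with default write options (ignoreNull = false), the null value
// in nullDF replaces the existing 'val' for key 0.
kuduContext.upsertRows(nullDF, simpleTableName)
dataFrame.show()
// 'val' is now null for the upserted row.
----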
