Add R example and fix Python examples.
In this PR, R examples are added and the Python examples are fixed.

Author: hyukjinkwon <[email protected]>

Closes databricks#36 from HyukjinKwon/r-example.
HyukjinKwon committed Dec 23, 2015
1 parent b4bcfc6 commit 8827df2
Showing 1 changed file with 46 additions and 6 deletions.

README.md
@@ -288,7 +288,7 @@ customSchema = StructType([ \
    StructField("id", StringType(), True), \
    StructField("price", DoubleType(), True), \
    StructField("publish_date", StringType(), True), \
-   StructField("title", StringType(), True)]) \
+   StructField("title", StringType(), True)])

df = sqlContext.read \
    .format('com.databricks.spark.xml') \
@@ -319,16 +319,56 @@ from pyspark.sql.types import *

sqlContext = SQLContext(sc)
customSchema = StructType([ \
-   StructField("year", IntegerType(), True), \
-   StructField("make", StringType(), True), \
-   StructField("model", StringType(), True), \
-   StructField("comment", StringType(), True), \
-   StructField("blank", StringType(), True)])
+   StructField("author", StringType(), True), \
+   StructField("description", StringType(), True), \
+   StructField("genre", StringType(), True), \
+   StructField("id", StringType(), True), \
+   StructField("price", DoubleType(), True), \
+   StructField("publish_date", StringType(), True), \
+   StructField("title", StringType(), True)])

df = sqlContext.load(source="com.databricks.spark.xml", rowTag = 'book', schema = customSchema, path = 'books.xml')
df.select("author", "id").save(path = 'newbooks.xml', source = 'com.databricks.spark.xml', rootTag = 'books', rowTag = 'book')
```


### R API
__Spark 1.4+:__

Automatically infer schema (data types)
```R
library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:1.3.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "books.xml", source = "com.databricks.spark.xml", rowTag = "book")

# Since no write options are given, `rootTag` defaults to "ROWS" and `rowTag` defaults to "ROW".
write.df(df, "newbooks.xml", "com.databricks.spark.xml", "overwrite")
```
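The output tags can also be set explicitly instead of relying on the defaults. A minimal sketch of this, assuming the same `df` as above and that `write.df` forwards extra arguments as data source options (just as `read.df` does with `rowTag`); the output file name is illustrative:

```R
# Pass rootTag/rowTag through write.df's "..." so the output uses
# <books> as the root element and <book> for each row.
write.df(df, "newbooks.xml", "com.databricks.spark.xml", "overwrite",
         rootTag = "books", rowTag = "book")
```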

You can manually specify the schema:
```R
library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:1.3.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)
customSchema <- structType(
  structField("author", "string"),
  structField("description", "string"),
  structField("genre", "string"),
  structField("id", "string"),
  structField("price", "double"),
  structField("publish_date", "string"),
  structField("title", "string"))

df <- read.df(sqlContext, "books.xml", source = "com.databricks.spark.xml", schema = customSchema, rowTag = "book")

# Since no write options are given, `rootTag` defaults to "ROWS" and `rowTag` defaults to "ROW".
write.df(df, "newbooks.xml", "com.databricks.spark.xml", "overwrite")
```
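To sanity-check the round trip, the file written above can be read back. Since it was written with the default tags, `rowTag` must be "ROW" here; this is a sketch, with `df2` as an illustrative name:

```R
# Read the freshly written file back in; rows were written under <ROW>
# elements inside a <ROWS> root, so rowTag must be "ROW".
df2 <- read.df(sqlContext, "newbooks.xml", source = "com.databricks.spark.xml", rowTag = "ROW")
printSchema(df2)
```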

## Building From Source
This library is built with [SBT](http://www.scala-sbt.org/0.13/docs/Command-Line-Reference.html), which is automatically downloaded by the included shell script. To build a JAR file, simply run `sbt/sbt package` from the project root. The build configuration includes support for both Scala 2.10 and 2.11.
