Add R example and fix Python examples.
In this PR, R examples are added and the Python examples are fixed.

Author: hyukjinkwon <[email protected]>

Closes databricks#36 from HyukjinKwon/r-example.
HyukjinKwon committed Dec 23, 2015
1 parent b4bcfc6 commit 8827df2
Showing 1 changed file with 46 additions and 6 deletions.

README.md
@@ -288,7 +288,7 @@ customSchema = StructType([ \
    StructField("id", StringType(), True), \
    StructField("price", DoubleType(), True), \
    StructField("publish_date", StringType(), True), \
-   StructField("title", StringType(), True)]) \
+   StructField("title", StringType(), True)])

df = sqlContext.read \
    .format('com.databricks.spark.xml') \
@@ -319,16 +319,56 @@ from pyspark.sql.types import *

sqlContext = SQLContext(sc)
customSchema = StructType([ \
-   StructField("year", IntegerType(), True), \
-   StructField("make", StringType(), True), \
-   StructField("model", StringType(), True), \
-   StructField("comment", StringType(), True), \
-   StructField("blank", StringType(), True)])
+   StructField("author", StringType(), True), \
+   StructField("description", StringType(), True), \
+   StructField("genre", StringType(), True), \
+   StructField("id", StringType(), True), \
+   StructField("price", DoubleType(), True), \
+   StructField("publish_date", StringType(), True), \
+   StructField("title", StringType(), True)])

df = sqlContext.load(source="com.databricks.spark.xml", rowTag = 'book', schema = customSchema, path = 'books.xml')
df.select("author", "id").save(path = 'newbooks.xml', source = 'com.databricks.spark.xml', rootTag = 'books', rowTag = 'book')
```


### R API
__Spark 1.4+:__

Automatically infer schema (data types)
```R
library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:1.3.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "books.xml", source = "com.databricks.spark.xml", rowTag = "book")

# Since no write options are given, `rootTag` defaults to "ROWS" and `rowTag` defaults to "ROW".
write.df(df, "newbooks.xml", "com.databricks.spark.xml", "overwrite")
```
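The output tags can also be set explicitly instead of relying on the defaults. A minimal sketch of this, assuming the same `df` as above and that `write.df` forwards extra arguments as data source options (just as `read.df` does with `rowTag`); the output file name is illustrative:

```R
# Pass rootTag/rowTag through write.df's "..." so the output uses
# <books> as the root element and <book> for each row.
write.df(df, "newbooks.xml", "com.databricks.spark.xml", "overwrite",
         rootTag = "books", rowTag = "book")
```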

You can manually specify the schema:
```R
library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-xml_2.10:1.3.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)
customSchema <- structType(
  structField("author", "string"),
  structField("description", "string"),
  structField("genre", "string"),
  structField("id", "string"),
  structField("price", "double"),
  structField("publish_date", "string"),
  structField("title", "string"))

df <- read.df(sqlContext, "books.xml", source = "com.databricks.spark.xml", schema = customSchema, rowTag = "book")

# Since no write options are given, `rootTag` defaults to "ROWS" and `rowTag` defaults to "ROW".
write.df(df, "newbooks.xml", "com.databricks.spark.xml", "overwrite")
```
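To sanity-check the round trip, the file written above can be read back. Since it was written with the default tags, `rowTag` must be "ROW" here; this is a sketch, with `df2` as an illustrative name:

```R
# Read the freshly written file back in; rows were written under <ROW>
# elements inside a <ROWS> root, so rowTag must be "ROW".
df2 <- read.df(sqlContext, "newbooks.xml", source = "com.databricks.spark.xml", rowTag = "ROW")
printSchema(df2)
```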

## Building From Source
This library is built with [SBT](http://www.scala-sbt.org/0.13/docs/Command-Line-Reference.html), which is automatically downloaded by the included shell script. To build a JAR file, simply run `sbt/sbt package` from the project root. The build configuration includes support for both Scala 2.10 and 2.11.
