[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode

## What changes were proposed in this pull request?

CSV files with Windows-style CRLF ('\r\n') line endings don't work in multiline mode. They work in single-line mode because line splitting is done by Hadoop, which handles all the common line separators. This PR fixes the problem by enabling Univocity's line separator detection in multiline mode, which automatically detects '\r\n', '\r', or '\n', as Hadoop does in single-line mode.
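
For illustration only, here is a minimal standalone sketch of the Univocity setting this change enables. It assumes the `univocity-parsers` library is on the classpath; the object name `CrlfDetectionSketch` and its `parse` helper are hypothetical and are not part of Spark or of this patch. It feeds the same records through the parser with different line endings and prints the parsed rows.

```scala
import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Hypothetical helper object for illustration; not part of Spark or this commit.
object CrlfDetectionSketch {

  // Parse CSV text with Univocity's line separator auto-detection enabled,
  // which is the setting this change turns on for multiLine reads.
  def parse(input: String): Seq[Seq[String]] = {
    val settings = new CsvParserSettings()
    settings.setLineSeparatorDetectionEnabled(true)
    val parser = new CsvParser(settings)
    // parseAll returns a java.util.List[Array[String]]; copy it into Scala Seqs.
    val rows = parser.parseAll(new StringReader(input))
    (0 until rows.size).map(i => rows.get(i).toSeq)
  }

  def main(args: Array[String]): Unit = {
    // The same three records, terminated by CRLF, CR, and LF in turn.
    val inputs = Seq("\r\n", "\r", "\n").map { eol =>
      Seq("year,make", "2012,Tesla", "1997,Ford").mkString("", eol, eol)
    }
    inputs.foreach { in =>
      println(parse(in).map(_.mkString("|")).mkString("; "))
    }
  }
}
```

With detection enabled, the caller does not need to know in advance which separator the file uses, which is why the patch ties the setting to `multiLine` rather than hard-coding a separator.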

## How was this patch tested?

Added a unit test that reads a file with CRLF line endings.

Closes apache#22503 from justinuang/fix-clrf-multiline.

Authored-by: Justin Uang
Signed-off-by: hyukjinkwon
Justin Uang authored and HyukjinKwon committed Oct 19, 2018
1 parent d0ecff2 commit 1e6c1d8
Showing 3 changed files with 21 additions and 0 deletions.
@@ -212,6 +212,8 @@ class CSVOptions(
    settings.setEmptyValue(emptyValueInRead)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    settings.setLineSeparatorDetectionEnabled(multiLine == true)

    settings
  }
}
7 changes: 7 additions & 0 deletions sql/core/src/test/resources/test-data/cars-crlf.csv
@@ -0,0 +1,7 @@

year,make,model,comment,blank
"2012","Tesla","S","No comment",

1997,Ford,E350,"Go get one now they are going fast",
2015,Chevy,Volt

@@ -52,6 +52,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
  private val carsNullFile = "test-data/cars-null.csv"
  private val carsEmptyValueFile = "test-data/cars-empty-value.csv"
  private val carsBlankColName = "test-data/cars-blank-column-name.csv"
  private val carsCrlf = "test-data/cars-crlf.csv"
  private val emptyFile = "test-data/empty.csv"
  private val commentsFile = "test-data/comments.csv"
  private val disableCommentsFile = "test-data/disable_comments.csv"
@@ -220,6 +221,17 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
    }
  }

  test("crlf line separators in multiline mode") {
    val cars = spark
      .read
      .format("csv")
      .option("multiLine", "true")
      .option("header", "true")
      .load(testFile(carsCrlf))

    verifyCars(cars, withHeader = true)
  }

  test("test aliases sep and encoding for delimiter and charset") {
    // scalastyle:off
    val cars = spark