[SPARK-28471][SQL] Replace yyyy by uuuu in date-timestamp patterns without era

## What changes were proposed in this pull request?

In this PR, I propose to use `uuuu` (year) instead of `yyyy` (year-of-era) for years in date/timestamp patterns that do not contain the era pattern `G` (see https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). **Parsing/formatting of positive years (current era) will be the same.** The difference is in the formatting of negative years, which belong to the previous era, BC (Before Christ).

I replaced the `yyyy` pattern with `uuuu` everywhere except:
1. Tests, suites, and benchmarks, because existing tests must work as is.
2. Patterns passed to `SimpleDateFormat`, because it doesn't support `uuuu` as a year pattern.
3. Comments and examples (except comments related to already replaced patterns).
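
To illustrate exception 2: in `java.text.SimpleDateFormat` (Java 7 and later) the letter `u` means the day number of the week (1 = Monday, ..., 7 = Sunday), not the year, so a `uuuu` pattern silently produces a wrong result there. The following is a minimal sketch using only the JDK; the class and method names (`LegacyPatternDemo`, `formatLegacy`) are hypothetical and are not part of this patch.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not part of the Spark code base.
class LegacyPatternDemo {
    // Formats an epoch-millisecond instant in UTC with a legacy SimpleDateFormat pattern.
    static String formatLegacy(String pattern, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // The Unix epoch, 1970-01-01, was a Thursday (ISO day number 4).
        System.out.println(formatLegacy("yyyy-MM-dd", 0L)); // 1970-01-01
        // Here `uuuu` is the zero-padded day number of the week, not the year.
        System.out.println(formatLegacy("uuuu-MM-dd", 0L)); // 0004-01-01
    }
}
```

This is why patterns that end up in `SimpleDateFormat` keep `yyyy`.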

Before the changes, the common-era year `100` and the BC-era year `-99` were formatted identically, as `0100`. After the changes, negative years are formatted with the `-` sign.

Before:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+----------+
|     value|
+----------+
|0100-01-01|
+----------+
```

After:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+-----------+
|      value|
+-----------+
|-0099-01-01|
+-----------+
```
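
The same difference can be reproduced with `java.time` directly, outside Spark. The sketch below assumes only the JDK; `YearPatternDemo`, `format`, and `parse` are hypothetical names used for illustration. It shows that `yyyy` drops the era for a BC date, that `uuuu` keeps the sign, and that positive years are formatted identically by both patterns.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class, not part of the Spark code base.
class YearPatternDemo {
    // Formats a date with the given java.time pattern.
    static String format(String pattern, LocalDate date) {
        return DateTimeFormatter.ofPattern(pattern).format(date);
    }

    // Parses text with the given pattern using the default (SMART) resolver style.
    static LocalDate parse(String pattern, String text) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        LocalDate bc = LocalDate.of(-99, 1, 1); // proleptic year -99 is 100 BC

        // `y` is year-of-era; without `G` the era is lost.
        System.out.println(format("yyyy-MM-dd", bc)); // 0100-01-01
        // `u` is the signed proleptic year.
        System.out.println(format("uuuu-MM-dd", bc)); // -0099-01-01

        // Round trip: `uuuu` restores the original date, `yyyy` lands in year 100 CE.
        System.out.println(parse("uuuu-MM-dd", "-0099-01-01")); // -0099-01-01
        System.out.println(parse("yyyy-MM-dd", "0100-01-01"));  // 0100-01-01

        // Positive years are unaffected by the change.
        LocalDate ce = LocalDate.of(2019, 7, 29);
        System.out.println(format("yyyy-MM-dd", ce).equals(format("uuuu-MM-dd", ce))); // true
    }
}
```

The round trip is the practical reason for the change: with `yyyy` a BC date written out by one formatter is read back as a different date.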

## How was this patch tested?

By existing test suites and by new tests for negative years added to `DateFormatterSuite` and `TimestampFormatterSuite`.

Closes apache#25230 from MaxGekk/year-pattern-uuuu.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
MaxGekk authored and dongjoon-hyun committed Jul 29, 2019
1 parent a428f40 commit a5a5da7
Showing 19 changed files with 67 additions and 54 deletions.
4 changes: 2 additions & 2 deletions R/pkg/R/functions.R
@@ -2741,7 +2741,7 @@ setMethod("format_string", signature(format = "character", x = "Column"),
#' head(tmp)}
#' @note from_unixtime since 1.5.0
setMethod("from_unixtime", signature(x = "Column"),
function(x, format = "yyyy-MM-dd HH:mm:ss") {
function(x, format = "uuuu-MM-dd HH:mm:ss") {
jc <- callJStatic("org.apache.spark.sql.functions",
"from_unixtime",
x@jc, format)
@@ -3029,7 +3029,7 @@ setMethod("unix_timestamp", signature(x = "Column", format = "missing"),
#' @aliases unix_timestamp,Column,character-method
#' @note unix_timestamp(Column, character) since 1.5.0
setMethod("unix_timestamp", signature(x = "Column", format = "character"),
function(x, format = "yyyy-MM-dd HH:mm:ss") {
function(x, format = "uuuu-MM-dd HH:mm:ss") {
jc <- callJStatic("org.apache.spark.sql.functions", "unix_timestamp", x@jc, format)
column(jc)
})
6 changes: 3 additions & 3 deletions python/pyspark/sql/functions.py
@@ -1247,7 +1247,7 @@ def last_day(date):

@ignore_unicode_prefix
@since(1.5)
def from_unixtime(timestamp, format="yyyy-MM-dd HH:mm:ss"):
def from_unixtime(timestamp, format="uuuu-MM-dd HH:mm:ss"):
"""
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string
representing the timestamp of that moment in the current system time zone in the given
@@ -1264,9 +1264,9 @@ def from_unixtime(timestamp, format="yyyy-MM-dd HH:mm:ss"):


@since(1.5)
def unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss'):
def unix_timestamp(timestamp=None, format='uuuu-MM-dd HH:mm:ss'):
"""
Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default)
Convert time string with given pattern ('uuuu-MM-dd HH:mm:ss', by default)
to Unix time stamp (in seconds), using the default timezone and the default
locale, return null if fail.
16 changes: 8 additions & 8 deletions python/pyspark/sql/readwriter.py
@@ -222,12 +222,12 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
:param dateFormat: sets the string that indicates a date format. Custom date formats
follow the formats at ``java.time.format.DateTimeFormatter``. This
applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
default value, ``uuuu-MM-dd``.
:param timestampFormat: sets the string that indicates a timestamp format.
Custom date formats follow the formats at
``java.time.format.DateTimeFormatter``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
:param multiLine: parse one record, which may span multiple lines, per file. If None is
set, it uses the default value, ``false``.
:param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
@@ -404,12 +404,12 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
:param dateFormat: sets the string that indicates a date format. Custom date formats
follow the formats at ``java.time.format.DateTimeFormatter``. This
applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
default value, ``uuuu-MM-dd``.
:param timestampFormat: sets the string that indicates a timestamp format.
Custom date formats follow the formats at
``java.time.format.DateTimeFormatter``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
:param maxColumns: defines a hard limit of how many columns a record can have. If None is
set, it uses the default value, ``20480``.
:param maxCharsPerColumn: defines the maximum number of characters allowed for any given
@@ -806,12 +806,12 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
:param dateFormat: sets the string that indicates a date format. Custom date formats
follow the formats at ``java.time.format.DateTimeFormatter``. This
applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
default value, ``uuuu-MM-dd``.
:param timestampFormat: sets the string that indicates a timestamp format.
Custom date formats follow the formats at
``java.time.format.DateTimeFormatter``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
:param encoding: specifies encoding (charset) of saved json files. If None is set,
the default UTF-8 charset will be used.
:param lineSep: defines the line separator that should be used for writing. If None is
@@ -909,12 +909,12 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
:param dateFormat: sets the string that indicates a date format. Custom date formats
follow the formats at ``java.time.format.DateTimeFormatter``. This
applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
default value, ``uuuu-MM-dd``.
:param timestampFormat: sets the string that indicates a timestamp format.
Custom date formats follow the formats at
``java.time.format.DateTimeFormatter``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
:param ignoreLeadingWhiteSpace: a flag indicating whether or not leading whitespaces from
values being written should be skipped. If None is set, it
uses the default value, ``true``.
8 changes: 4 additions & 4 deletions python/pyspark/sql/streaming.py
@@ -464,12 +464,12 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
:param dateFormat: sets the string that indicates a date format. Custom date formats
follow the formats at ``java.time.format.DateTimeFormatter``. This
applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
default value, ``uuuu-MM-dd``.
:param timestampFormat: sets the string that indicates a timestamp format.
Custom date formats follow the formats at
``java.time.format.DateTimeFormatter``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
:param multiLine: parse one record, which may span multiple lines, per file. If None is
set, it uses the default value, ``false``.
:param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
@@ -640,12 +640,12 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
:param dateFormat: sets the string that indicates a date format. Custom date formats
follow the formats at ``java.time.format.DateTimeFormatter``. This
applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
default value, ``uuuu-MM-dd``.
:param timestampFormat: sets the string that indicates a timestamp format.
Custom date formats follow the formats at
``java.time.format.DateTimeFormatter``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
:param maxColumns: defines a hard limit of how many columns a record can have. If None is
set, it uses the default value, ``20480``.
:param maxCharsPerColumn: defines the maximum number of characters allowed for any given
@@ -478,7 +478,7 @@ object CatalogColumnStat extends Logging {
val VERSION = 2

private def getTimestampFormatter(): TimestampFormatter = {
TimestampFormatter(format = "yyyy-MM-dd HH:mm:ss.SSSSSS", zoneId = ZoneOffset.UTC)
TimestampFormatter(format = "uuuu-MM-dd HH:mm:ss.SSSSSS", zoneId = ZoneOffset.UTC)
}

/**
@@ -146,10 +146,10 @@ class CSVOptions(
// A language tag in IETF BCP 47 format
val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)

val dateFormat: String = parameters.getOrElse("dateFormat", "yyyy-MM-dd")
val dateFormat: String = parameters.getOrElse("dateFormat", "uuuu-MM-dd")

val timestampFormat: String =
parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
parameters.getOrElse("timestampFormat", "uuuu-MM-dd'T'HH:mm:ss.SSSXXX")

val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

@@ -579,7 +579,7 @@ case class ToUnixTimestamp(
copy(timeZoneId = Option(timeZoneId))

def this(time: Expression) = {
this(time, Literal("yyyy-MM-dd HH:mm:ss"))
this(time, Literal("uuuu-MM-dd HH:mm:ss"))
}

override def prettyName: String = "to_unix_timestamp"
@@ -616,7 +616,7 @@ case class UnixTimestamp(timeExp: Expression, format: Expression, timeZoneId: Op
copy(timeZoneId = Option(timeZoneId))

def this(time: Expression) = {
this(time, Literal("yyyy-MM-dd HH:mm:ss"))
this(time, Literal("uuuu-MM-dd HH:mm:ss"))
}

def this() = {
@@ -786,7 +786,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[
override def prettyName: String = "from_unixtime"

def this(unix: Expression) = {
this(unix, Literal("yyyy-MM-dd HH:mm:ss"))
this(unix, Literal("uuuu-MM-dd HH:mm:ss"))
}

override def dataType: DataType = StringType
@@ -82,10 +82,10 @@ private[sql] class JSONOptions(
val zoneId: ZoneId = DateTimeUtils.getZoneId(
parameters.getOrElse(DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId))

val dateFormat: String = parameters.getOrElse("dateFormat", "yyyy-MM-dd")
val dateFormat: String = parameters.getOrElse("dateFormat", "uuuu-MM-dd")

val timestampFormat: String =
parameters.getOrElse("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
parameters.getOrElse("timestampFormat", "uuuu-MM-dd'T'HH:mm:ss.SSSXXX")

val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

@@ -43,7 +43,7 @@ class Iso8601DateFormatter(
}

object DateFormatter {
val defaultPattern: String = "yyyy-MM-dd"
val defaultPattern: String = "uuuu-MM-dd"
val defaultLocale: Locale = Locale.US

def apply(format: String, locale: Locale): DateFormatter = {
@@ -82,7 +82,7 @@ class FractionTimestampFormatter(zoneId: ZoneId)
}

object TimestampFormatter {
val defaultPattern: String = "yyyy-MM-dd HH:mm:ss"
val defaultPattern: String = "uuuu-MM-dd HH:mm:ss"
val defaultLocale: Locale = Locale.US

def apply(format: String, zoneId: ZoneId, locale: Locale): TimestampFormatter = {
@@ -95,4 +95,9 @@ class DateFormatterSuite extends SparkFunSuite with SQLHelper {
val daysSinceEpoch = formatter.parse("2018 Dec")
assert(daysSinceEpoch === LocalDate.of(2018, 12, 1).toEpochDay)
}

test("formatting negative years with default pattern") {
val epochDays = LocalDate.of(-99, 1, 1).toEpochDay.toInt
assert(DateFormatter().format(epochDays) === "-0099-01-01")
}
}
@@ -123,4 +123,12 @@ class TimestampFormatterSuite extends SparkFunSuite with SQLHelper {
assert(formatter.format(900000) === "1970-01-01 00:00:00.9")
assert(formatter.format(1000000) === "1970-01-01 00:00:01")
}

test("formatting negative years with default pattern") {
val instant = LocalDateTime.of(-99, 1, 1, 0, 0, 0)
.atZone(ZoneOffset.UTC)
.toInstant
val micros = DateTimeUtils.instantToMicros(instant)
assert(TimestampFormatter(ZoneOffset.UTC).format(micros) === "-0099-01-01 00:00:00")
}
}
@@ -395,10 +395,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* <li>`dateFormat` (default `uuuu-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at `java.time.format.DateTimeFormatter`.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* <li>`timestampFormat` (default `uuuu-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.time.format.DateTimeFormatter`. This applies to timestamp type.</li>
* <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
@@ -615,10 +615,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* value.</li>
* <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity
* value.</li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* <li>`dateFormat` (default `uuuu-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at `java.time.format.DateTimeFormatter`.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* <li>`timestampFormat` (default `uuuu-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.time.format.DateTimeFormatter`. This applies to timestamp type.</li>
* <li>`maxColumns` (default `20480`): defines a hard limit of how many columns
@@ -568,10 +568,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* <li>`dateFormat` (default `uuuu-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at `java.time.format.DateTimeFormatter`.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* <li>`timestampFormat` (default `uuuu-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.time.format.DateTimeFormatter`. This applies to timestamp type.</li>
* <li>`encoding` (by default it is not set): specifies encoding (charset) of saved json
@@ -687,10 +687,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* <li>`dateFormat` (default `uuuu-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at `java.time.format.DateTimeFormatter`.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* <li>`timestampFormat` (default `uuuu-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.time.format.DateTimeFormatter`. This applies to timestamp type.</li>
* <li>`ignoreLeadingWhiteSpace` (default `true`): a flag indicating whether or not leading
@@ -60,7 +60,7 @@ object PartitionSpec {

object PartitioningUtils {

val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.S]"
val timestampPartitionPattern = "uuuu-MM-dd HH:mm:ss[.S]"

private[datasources] case class PartitionValues(columnNames: Seq[String], literals: Seq[Literal])
{