Skip to content

Commit

Permalink
Update 30-file-format-options.md
Browse files Browse the repository at this point in the history
  • Loading branch information
soyeric128 committed Dec 20, 2022
1 parent 1dc1ca5 commit e09cb1f
Showing 1 changed file with 127 additions and 14 deletions.
141 changes: 127 additions & 14 deletions docs/doc/11-integrations/30-file-format-options.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,26 +2,17 @@
title: Input File Formats
---

Databend accepts a variety of file formats as a source where you can load or query data from using:
- [COPY INTO command](../14-sql-commands/10-dml/dml-copy-into-table.md)
- [Streaming Load API](../11-integrations/00-api/03-streaming-load.md)

When you select a file to load or query data from, you need to tell Databend what the file looks like in the following format:
Databend accepts a variety of file formats as a source where you can load data from with the [COPY INTO command](../14-sql-commands/10-dml/dml-copy-into-table.md) or [Streaming Load API](../11-integrations/00-api/03-streaming-load.md). When you select a file to do that, you need to tell Databend what the file looks like using the following format:

```sql
FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | XML} [ formatTypeOptions ] )
```

`Type`: Specifies the file format. Must be one of the following formats that Databend supports:
- CSV
- TSV
- NDJSON
- PARQUET
- XML
`Type`: Specifies the file format. Must be one of the ones listed above that Databend supports.

`formatTypeOptions`: Describes other format details about the file. The options may vary depending on the file format. See the sections below for the available options of each supported file format.
`formatTypeOptions`: Includes one or more options to describe other format details about the file. The options vary depending on the file format. See the sections below to find out the available options for each supported file format.

```
```sql
formatTypeOptions ::=
RECORD_DELIMITER = '<character>'
FIELD_DELIMITER = '<character>'
Expand All @@ -35,10 +26,132 @@ formatTypeOptions ::=

## CSV Options

Databend accepts CVS files that are compliant with [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180) and is subject to the following conditions:

- A string must be quoted if it contains the character of a [QUOTE](#quote), [ESCAPE](#escape), [RECORD_DELIMITER](#record_delimiter), or [FIELD_DELIMITER](#field_delimiter).
- No character will be escaped in a quoted string except `Quote`.
- No space should be left between a `FIELD_DELIMITER` and a `Quote`.
- A record should not contain a trailing `FIELD_DELIMITER`.
- A string will be quoted in CSV if it comes from a serialized Array or Struct field.
- If you develop a program and generate the CSV files from it, Databend recommends using the CSV library from the programing language.
- Databend does not recognize the files unloaded from MySQL as the CSV format unless the following conditions are satisfied:
- `ESCAPED BY` is empty.
- `ENCLOSED BY` is not empty.
:::note
Files will be recognized as the TSV format if the conditions above are not satisfied. For more information about the clauses `ESCAPED BY` and `ENCLOSED BY`, refer to https://dev.mysql.com/doc/refman/8.0/en/load-data.html.
:::

### RECORD_DELIMITER

Separates records in an input file.

**Available Values**: `\r`,`\n`, or use a character with the escape char: `\b`, `\f`, `\r`, `\n`, `\t`, `\0`, `\xHH`

**Default**: `\n`

### FIELD_DELIMITER

Separates fields in an input file.

**Available Values**: Use a character with the escape char: `\b`, `\f`, `\r`, `\n`, `\t`, `\0`, `\xHH`

**Default**: `,` (comma)

### QUOTE

Quotes strings in a CSV file. For data loading, the quote is not necessary unless a string contains the character of a [QUOTE](#quote), [ESCAPE](#escape), [RECORD_DELIMITER](#record_delimiter), or [FIELD_DELIMITER](#field_delimiter).

**Available Values**: `\'` or `\"`.

**Default**: `\"`

### ESCAPE

Escapes a quote in a quoted string.

**Available Values**: `\'` or `\"` or `\\`.

**Default**: `\"`

### SKIP_HEADER

Used for data loading only to specify how many lines to be skipped from the beginning of the file.

**Default**: `0`

### NAN_DISPLAY

**Available Values**: Must be literal `'nan'` or `'null'` (case-insensitive)

**Default**: `'NaN'`

### COMPRESSION

Specifies the compression algorithm.

**Default**: `NONE`

**Available Values**:

| Values | Notes |
| ------------- | --------------------------------------------------------------- |
| `AUTO` | Auto detect compression via file extensions |
| `GZIP` | |
| `BZ2` | |
| `BROTLI` | Must be specified if loading/unloading Brotli-compressed files. |
| `ZSTD` | Zstandard v0.8 (and higher) is supported. |
| `DEFLATE` | Deflate-compressed files (with zlib header, RFC1950). |
| `RAW_DEFLATE` | Deflate-compressed files (without any header, RFC1951). |
| `XZ` | |
| `NONE` | Indicates that the files have not been compressed. |

## TSV Options

Databend is subject to the following conditions when dealing with a TSV file:

- These characters in a TSV file will be escaped: `\b`, `\f`, `\r`, `\n`, `\t`, `\0`, `\\`, `\'`, [RECORD_DELIMITER](#record_delimiter-1), [FIELD_DELIMITER](#field_delimiter-1).
- Neither quoting nor enclosing is currently supported.
- A string will be quoted in CSV if it comes from a serialized Array or Struct field.
- Null is serialized as `\N`.

### RECORD_DELIMITER

Separates records in an input file.

**Available Values**: `\r`,`\n`, or use a character with the escape char: `\b`, `\f`, `\r`, `\n`, `\t`, `\0`, `\xHH`

**Default**: `\n`

### FIELD_DELIMITER

Separates fields in an input file.

**Available Values**: Use a character with the escape char: `\b`, `\f`, `\r`, `\n`, `\t`, `\0`, `\xHH`

**Default**: `\t` (TAB)

### COMPRESSION

Same as [the COMPRESSION option for CSV](#compression).

## NDJSON Options

### COMPRESSION

Same as [the COMPRESSION option for CSV](#compression).

## PARQUET Options

## XML Options
No available options.

## XML Options

### COMPRESSION

Same as [the COMPRESSION option for CSV](#compression).

### ROW_TAG

Used to select XML elements to be decoded as a record.

**Default**: `'row'`

0 comments on commit e09cb1f

Please sign in to comment.