---
title: "Metadata"
description: >
Learn how Arrow uses Schemas to document structure of data objects,
and how R metadata are supported in Arrow
output: rmarkdown::html_vignette
---
This article describes the various data and metadata object types supplied by arrow, and documents how these objects are structured.
```{r include=FALSE}
library(arrow, warn.conflicts = FALSE)
```
## Arrow metadata classes
The arrow package defines the following classes for representing metadata:

- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object;
- A `Field` specifies a character string name and a `DataType`; and
- A `DataType` is an attribute controlling how values are represented.

For example, converting a data frame to an Arrow Table infers a schema automatically:
```{r}
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)
tb$schema
```
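The relationship between the three classes can also be explored programmatically. The following is a minimal sketch, assuming the `names` binding and `GetFieldByName()` method on `Schema`, and the `name` and `type` bindings on `Field`; the variable `fld` is introduced here purely for illustration.

```{r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)

# A Schema knows the names of the Fields it contains
tb$schema$names

# Look up a single Field by name (fld is an illustrative variable name)
fld <- tb$schema$GetFieldByName("y")

# Each Field pairs a character string name with a DataType
fld$name
fld$type
```

Here the `Field` for column `y` reports the string type that was inferred from the character column of the data frame.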
The automatically inferred schema could also be created manually:
```{r}
schema(
field(name = "x", type = int32()),
field(name = "y", type = utf8())
)
```
The `schema()` function allows the following shorthand to define fields:
```{r}
schema(x = int32(), y = utf8())
```
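To confirm that the explicit and shorthand forms describe the same structure, the two schemas can be compared. This sketch assumes the `Equals()` method on `Schema`; the names `long_form` and `short_form` are illustrative only.

```{r}
library(arrow, warn.conflicts = FALSE)

# Schema built from explicit Field objects
long_form <- schema(
  field(name = "x", type = int32()),
  field(name = "y", type = utf8())
)

# Schema built with the name = type shorthand
short_form <- schema(x = int32(), y = utf8())

# Both forms should describe the same fields and types
long_form$Equals(short_form)
```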
Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types:
```{r}
arrow_table(df, schema = schema(x = int64(), y = utf8()))
arrow_table(df, schema = schema(x = float64(), y = utf8()))
```
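One way to check that a manually specified schema took effect is to compare the column type against the default. This is a small sketch under the assumption that `GetFieldByName()` and `Equals()` behave as described in the package documentation; `tb_default` and `tb_int64` are illustrative names.

```{r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Without a schema, the R integer column is inferred as int32
tb_default <- arrow_table(df)
tb_default$schema$GetFieldByName("x")$type

# With an explicit schema, the same column is stored as int64
tb_int64 <- arrow_table(df, schema = schema(x = int64(), y = utf8()))
tb_int64$schema$GetFieldByName("x")$type
```

The values in the two Tables are the same; only the Arrow representation of column `x` differs.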
## R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the `data.frame` or to its columns in the Schema of the Arrow object. Attributes added to objects in this fashion are stored under the `r` key, as shown below:
```{r}
# data frame with custom metadata
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df, "df_meta") <- "custom data frame metadata"
attr(df$y, "col_meta") <- "custom column metadata"
# when converted to a Table, the metadata is preserved
tb <- arrow_table(df)
tb$metadata
```
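Because the metadata belongs to the Schema, it should be reachable through either the Table or its Schema. The following sketch only inspects the metadata keys, since the exact printed form of the `r` entry may differ between package versions.

```{r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df, "df_meta") <- "custom data frame metadata"
attr(df$y, "col_meta") <- "custom column metadata"
tb <- arrow_table(df)

# Metadata keys as seen from the Table
names(tb$metadata)

# The same keys as seen from the Table's Schema
names(tb$schema$metadata)
```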
It is also possible to assign additional string metadata under any other key you wish, using a command like this:
```{r}
tb$metadata$new_key <- "new value"
```
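As a further illustration, several keys can be assigned and read back in the same way. The key names `source` and `n_rows` below are hypothetical, chosen only for this sketch, and the row count is converted with `as.character()` because metadata values are strings.

```{r}
library(arrow, warn.conflicts = FALSE)

tb <- arrow_table(data.frame(x = 1:3))

# Hypothetical keys; values are supplied as strings
tb$metadata$source <- "sensor-A"
tb$metadata$n_rows <- as.character(tb$num_rows)

# Read the values back by key
tb$metadata$source
tb$metadata$n_rows
```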
Metadata attached to a Schema is preserved when writing the Table to Arrow/Feather or Parquet formats. When reading those files into R, or when calling `as.data.frame()` on a Table or RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow.
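The round trip described above can be sketched with a temporary Feather file. This example assumes a plain string attribute (rather than a `haven` or `vctrs` class) to keep it free of extra dependencies; `path` and `df2` are illustrative names.

```{r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df$y, "col_meta") <- "custom column metadata"

# Write to a temporary Feather file and read it back into R
path <- tempfile(fileext = ".arrow")
write_feather(df, path)
df2 <- read_feather(path)

# The column attribute is restored on the resulting data frame
attr(df2$y, "col_meta")

unlink(path)
```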
Note that the attributes stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read it in pandas, the `haven` metadata won't be recognized there. Similarly, pandas writes its own custom metadata, which the R package does not consume. You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.
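A custom convention of this kind might look like the sketch below, in which an application records a units string under its own key and recovers it after a Feather round trip. The key name `units` and its value format are hypothetical; reading with `as_data_frame = FALSE` keeps the result as a Table so that the Schema metadata can be inspected directly.

```{r}
library(arrow, warn.conflicts = FALSE)

tb <- arrow_table(data.frame(x = 1:3))

# Hypothetical application-level convention: a plain string under a custom key
tb$metadata$units <- "x=metres"

path <- tempfile(fileext = ".arrow")
write_feather(tb, path)

# Read back as an Arrow Table rather than a data frame
tb2 <- read_feather(path, as_data_frame = FALSE)
tb2$metadata$units

unlink(path)
```

Because the value is an ordinary string stored outside the `r` key, other Arrow implementations can read it as well, although interpreting it remains up to the application.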
## Further reading
- To learn more about arrow metadata, see the documentation for `schema()`.
- To learn more about data types, see the [data types article](./data_types.html).