---
title: "Integrating Arrow, Python, and R"
description: >
  Learn how to use arrow and reticulate to efficiently transfer data
  between R and Python without making unnecessary copies
output: rmarkdown::html_vignette
---
The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This article provides a brief overview.
Code in this article assumes arrow and reticulate are both loaded:
```r
library(arrow, warn.conflicts = FALSE)
library(reticulate, warn.conflicts = FALSE)
```
## Motivation
One reason you might want to use PyArrow in R is to take advantage of functionality that is currently better supported in Python than in R. For example, at one point the R arrow package didn't support `concat_arrays()` but PyArrow did, so that would have been a good use case at the time. At the time of writing, PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package (but see [the article on Flight support in arrow](./flight.html)), so that is another case in which PyArrow benefits R users.

A second reason that R users may want to use PyArrow is to pass data objects between R and Python efficiently. With large data sets, the copy and convert operations required to translate a native data structure in R (e.g., a data frame) into an analogous structure in Python (e.g., a Pandas DataFrame), and vice versa, can be costly in both time and CPU cycles. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance.
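
The distinction between copying data and sharing it can be sketched without Arrow at all. The example below is an analogy only: it uses Python's standard `array` and `memoryview` types rather than Arrow or reticulate, and it mutates the buffer purely to make the sharing visible (Arrow arrays themselves are immutable).

```python
# Conceptual illustration only: standard library, not Arrow.
import array

# One contiguous buffer of 64-bit floats.
values = array.array("d", [1.0, 2.0, 3.0])

# Copy: allocates a second, independent buffer with the same contents.
copied = array.array("d", values)

# Zero-copy: a view that points at the original buffer.
shared = memoryview(values)

# Change the original buffer to see which object notices.
values[0] = 99.0

print(copied[0])  # 1.0
print(shared[0])  # 99.0
```

The copy is unaffected by the change, while the view reflects it because it reads the same memory. Sharing Arrow data between R and Python follows the same principle: the two languages exchange references to existing buffers rather than duplicating them.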
## Installing PyArrow
To use Arrow in Python, the `pyarrow` library needs to be installed. For example, you may wish to create a Python [virtual environment](https://docs.python.org/3/library/venv.html) containing the `pyarrow` library. A virtual environment is a specific Python installation created for one project or purpose. It is a good practice to use specific environments in Python so that updating a package doesn't impact packages in other projects.
You can perform the setup from within R. Suppose you want to call your virtual environment `my-pyarrow-env`. Your setup code would look like this:
```r
virtualenv_create("my-pyarrow-env")
install_pyarrow("my-pyarrow-env")
```
If you want to install a development version of `pyarrow` to the virtual environment, add `nightly = TRUE` to the `install_pyarrow()` command:
```r
install_pyarrow("my-pyarrow-env", nightly = TRUE)
```
Note that you don't have to use virtual environments. If you prefer [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html), you can use this setup code:
```r
conda_create("my-pyarrow-env")
install_pyarrow("my-pyarrow-env")
```
To learn more about installing and configuring Python from R,
see the [reticulate documentation](https://rstudio.github.io/reticulate/articles/python_packages.html), which discusses the topic in more detail.
## Importing PyArrow
Assuming that arrow and reticulate are both loaded in R, your first step is to make sure that the correct Python environment is being used. To do that with a virtual environment, use a command like this:
```r
use_virtualenv("my-pyarrow-env")
```
For a conda environment, use the following:
```r
use_condaenv("my-pyarrow-env")
```
Once you have done this, the next step is to import `pyarrow` into the Python session as shown below:
```r
pa <- import("pyarrow")
```
Executing this command in R is the equivalent of the following import in Python:
```python
import pyarrow as pa
```
It may be a good idea to check your `pyarrow` version too, as shown below:
```r
pa$`__version__`
```
```
## [1] "8.0.0"
```
Support for passing data to and from R is included in `pyarrow` versions 0.17 and greater.
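
If a script depends on that minimum version, the check can be automated. The sketch below is a plain-Python illustration that compares a version string against 0.17. `meets_minimum` is a hypothetical helper, not part of pyarrow, and the literal `"8.0.0"` stands in for `pa.__version__`.

```python
def meets_minimum(version, minimum=(0, 17)):
    """Return True if the major.minor part of `version` is at least `minimum`."""
    # Only the first two components are compared; they are assumed to be
    # plain integers, as in "8.0.0" or "0.17.1".
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


# "8.0.0" is the version printed above; in a live session you would pass
# pa.__version__ instead of a literal string.
print(meets_minimum("8.0.0"))   # True
print(meets_minimum("0.16.0"))  # False
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where `"0.9"` would sort after `"0.17"`.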
## Using PyArrow
You can use the reticulate function `r_to_py()` to pass objects from R to Python, and similarly you can use `py_to_r()` to pull objects from the Python session into R. To illustrate this, let's create two objects in R: `df_random` is an R data frame containing 100 million rows of random data, and `tb_random` is the same data stored as an Arrow Table:
```r
set.seed(1234)
nrows <- 10^8
df_random <- data.frame(
  x = rnorm(nrows),
  y = rnorm(nrows),
  subset = sample(10, nrows, replace = TRUE)
)
tb_random <- arrow_table(df_random)
```
Transferring the data from R to Python without Arrow is a time-consuming process because the underlying object has to be copied and converted to a Python data structure:
```r
system.time({
  df_py <- r_to_py(df_random)
})
```
```
## user system elapsed
## 0.307 5.172 5.529
```
In contrast, sending the Arrow Table across happens almost instantaneously:
```r
system.time({
  tb_py <- r_to_py(tb_random)
})
```
```
## user system elapsed
## 0.004 0.000 0.003
```
"Send", however, isn't really the correct word. Internally, we're passing pointers to the data between the R and Python interpreters running together in the same process, without copying anything. Nothing is being sent: we're sharing and accessing the same internal Arrow memory buffers.
It is also possible to send data in the other direction. For example, let's create an `Array` in PyArrow:
```r
a <- pa$array(c(1, 2, 3))
a
```
```
## Array
## <double>
## [
## 1,
## 2,
## 3
## ]
```
Notice that `a` is now an `Array` object in your R session, even though you created it in Python, and you can apply R methods to it:
```r
a[a > 1]
```
```
## Array
## <double>
## [
## 2,
## 3
## ]
```
Similarly, you can combine this object with Arrow objects created in R, and you can use PyArrow methods like `pa$concat_arrays()` to do so:
```r
b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b
```
```
## Array
## <double>
## [
## 1,
## 2,
## 3,
## 5,
## 6,
## 7,
## 8,
## 9
## ]
```
Now you have a single Array in R.
## Further reading
- To learn more about installing and configuring Python from R,
see the [reticulate documentation](https://rstudio.github.io/reticulate/articles/python_packages.html).
- To learn PyArrow, see the official [PyArrow Documentation](https://arrow.apache.org/docs/python/) and [Apache Arrow Python Cookbook](https://arrow.apache.org/cookbook/py/).
- R/Python integration in Arrow is also discussed in the [PyArrow Integrations Documentation](https://arrow.apache.org/docs/python/integration/python_r.html), in this [blog post about reticulate integration in Arrow](https://voltrondata.com/blog/passing-arrow-data-between-r-and-python-with-reticulate/), and in this [blog post about rpy2 integration in Arrow](https://voltrondata.com/blog/data-transfer-between-python-and-r-with-rpy2-and-apache-arrow/).
- The integration between R Arrow and PyArrow is supported through the [Arrow C data interface](https://arrow.apache.org/docs/format/CDataInterface.html#c-data-interface).
- To learn more about Arrow data objects, see the [data objects article](./data_objects.html).