# Data {#sec-data}
```{r, echo = FALSE}
source("common.R")
```
It's often useful to include data in a package.
If the primary purpose of a package is to distribute useful functions, example datasets make it easier to write excellent documentation.
These datasets can be hand-crafted to provide compelling use cases for the functions in the package.
Here are some examples of this type of package data:
- [tidyr](https://tidyr.tidyverse.org/reference/index.html#data): `billboard` (song rankings), `who` (tuberculosis data from the World Health Organization)
- [dplyr](https://dplyr.tidyverse.org/reference/index.html#data): `starwars` (Star Wars characters), `storms` (storm tracks)
At the other extreme, some packages exist solely for the purpose of distributing data, along with its documentation.
These are sometimes called "data packages".
A data package can be a nice way to share example data across multiple packages.
It is also a useful technique for getting relatively large, static files out of a more function-oriented package, which might require more frequent updates.
Here are some examples of data packages:
- [nycflights13](https://nycflights13.tidyverse.org)
- [babynames](https://github.com/hadley/babynames)
Finally, many packages benefit from having internal data that is used for internal purposes, but that is not directly exposed to the users of the package.
In this chapter we describe useful mechanisms for including data in your package.
The practical details differ depending on who needs access to the data, how often it changes, and what they will do with it:
- If you want to store R objects and make them available to the user, put them in `data/`.
This is the best place to put example datasets.
All the concrete examples above for data in a package and data as a package use this mechanism.
See section @sec-data-data.
- If you want to store R objects for your own use as a developer, put them in `R/sysdata.rda`.
This is the best place to put internal data that your functions need.
See section @sec-data-sysdata.
- If you want to store data in some raw, non-R-specific form and make it available to the user, put it in `inst/extdata/`.
For example, readr and readxl each use this mechanism to provide a collection of delimited files and Excel workbooks, respectively.
See section @sec-data-extdata.
- If you want to store dynamic data that reflects the internal state of your package within a single R session, use an environment.
This technique is not as common or well-known as those above, but can be very useful in specific situations.
See section @sec-data-state.
- If you want to store data persistently across R sessions, such as configuration or user-specific data, use one of the officially sanctioned locations.
See section @sec-data-persistent.
## Exported data {#sec-data-data}
The most common location for package data is (surprise!) `data/`.
We recommend that each file in this directory be an `.rda` file created by `save()` containing a single R object, with the same name as the file.
The easiest way to achieve this is to use `usethis::use_data()`.
```{r, eval = FALSE}
my_pkg_data <- sample(1000)
usethis::use_data(my_pkg_data)
```
Let's imagine we are working on a package named "pkg".
The snippet above creates `data/my_pkg_data.rda` inside the source of the pkg package and adds `LazyData: true` in your `DESCRIPTION`.
This makes the `my_pkg_data` R object available to users of pkg via `pkg::my_pkg_data` or, after attaching pkg with `library(pkg)`, as `my_pkg_data`.
The snippet above is something the maintainer executes once (or every time they need to update `my_pkg_data`).
This is workflow code and should **not** appear in the `R/` directory of the source package.
(We'll talk about a suitable place to keep this code below.)
For larger datasets, you may want to experiment with the compression setting, which is under the control of the `compress` argument.
The default is "bzip2", but sometimes "gzip" or "xz" can create smaller files.
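As a rough illustration of that experiment, the sketch below saves one object with each compression method and compares the resulting file sizes.
It uses base `save()` on a made-up object, `demo_data`, written to temporary files rather than `use_data()` inside a real package, so the sizes it reports are specific to this toy data.

```{r, eval = FALSE}
# Made-up data, only for illustration.
demo_data <- data.frame(
  id = seq_len(5000),
  group = rep(letters[1:5], each = 1000)
)

# Save the same object once per compression method and record the file size.
sizes <- c(gzip = NA_real_, bzip2 = NA_real_, xz = NA_real_)
for (method in names(sizes)) {
  path <- tempfile(fileext = ".rda")
  save(demo_data, file = path, compress = method)
  sizes[[method]] <- file.size(path)
  unlink(path)
}

# File sizes in bytes; pick the method that gives the smallest file.
sizes
```

Whichever method wins here, the same value can then be passed as `compress` in the `use_data()` call.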
It's possible to use other types of files below `data/`, but we don't recommend it because `.rda` files are already fast, small, and explicit.
The other possibilities are described in the documentation for `utils::data()` and in the [Data in packages](https://rstudio.github.io/r-manuals/r-exts/Creating-R-packages.html#data-in-packages) section of Writing R Extensions.
In terms of advice to package authors, the help topic for `data()` seems to implicitly make the same recommendations as we do above:
- Store one R object in each `data/*.rda` file
- Use the same name for that object and its `.rda` file
- Use lazy-loading, by default
If the `DESCRIPTION` contains `LazyData: true`, then datasets will be lazily loaded.
This means that they won't occupy any memory until you use them.
The following example shows memory usage before and after loading the nycflights13 package.
You can see that memory usage doesn't change significantly until you inspect the `flights` dataset stored inside the package.
```{r}
lobstr::mem_used()
library(nycflights13)
lobstr::mem_used()
invisible(flights)
lobstr::mem_used()
```
We recommend that you include `LazyData: true` in your `DESCRIPTION` if you are shipping `.rda` files below `data/`.
If you use `use_data()` to create such datasets, it will automatically make this modification to `DESCRIPTION` for you.
::: callout-warning
It is important to note that lazily-loaded datasets do **not** need to be pre-loaded with `utils::data()` and, in fact, it's usually best to avoid doing so.
Above, once we did `library(nycflights13)`, we could immediately access `flights`.
There is no call to `data(flights)`, because it is not necessary.
There are specific downsides to `data(some_pkg_data)` calls that support a policy of only using `data()` when it is actually necessary, i.e. for datasets that would not be available otherwise:
- By default, `data(some_pkg_data)` creates one or more objects in the user's global workspace. There is the potential to silently overwrite pre-existing objects with new values.
- There is also no guarantee that `data(foo)` will create exactly one object named "foo". It could create more than one object and/or objects with totally different names.
One argument in favor of calls like `data(some_pkg_data, package = "pkg")` that are not strictly necessary is that it clarifies which package provides `some_pkg_data`.
We prefer alternatives that don't modify the global workspace, such as a code comment or access via `pkg::some_pkg_data`.
This excerpt from the documentation of `data()` conveys that it is largely of historical importance:
> `data()` was originally intended to allow users to load datasets from packages for use in their examples, and as such it loaded the datasets into the workspace `.GlobalEnv`.
> This avoided having large datasets in memory when not in use: that need has been almost entirely superseded by lazy-loading of datasets.
:::
### Preserve the origin story of package data {#sec-data-data-raw}
Often, the data you include in `data/` is a cleaned up version of raw data you've gathered from elsewhere.
We highly recommend taking the time to include the code used to do this in the source version of your package.
This makes it easy for you to update or reproduce your version of the data.
This data-creating script is also a natural place to leave comments about important properties of the data, i.e. which features are important for downstream usage in package documentation.
We suggest that you keep this code in one or more `.R` files below `data-raw/`.
You don't want it in the bundled version of your package, so this folder should be listed in `.Rbuildignore`.
usethis has a convenience function that can be called when you first adopt the `data-raw/` practice or when you add an additional `.R` file to the folder:
```{r, eval = FALSE}
usethis::use_data_raw()
usethis::use_data_raw("my_pkg_data")
```
`use_data_raw()` creates the `data-raw/` folder and lists it in `.Rbuildignore`.
A typical script in `data-raw/` includes code to prepare a dataset and ends with a call to `use_data()`.
These data packages all use the approach recommended here for `data-raw/`:
- [babynames](https://github.com/hadley/babynames)
- [nycflights13](https://github.com/hadley/nycflights13)
- [gapminder](https://github.com/jennybc/gapminder)
::: callout-tip
## ggplot2: A cautionary tale
We have a confession to make: the origins of many of ggplot2's example datasets have been lost in the sands of time.
In the grand scheme of things, this is not a huge problem, but maintenance is certainly more pleasant when a package's assets can be reconstructed *de novo* and easily updated as necessary.
:::
::: callout-warning
## Submitting to CRAN
Generally, package data should be smaller than a megabyte; if it's larger, you'll need to argue for an exemption.
This is usually easier to do if the data is in its own package and won't be updated frequently, i.e. if you approach this as a dedicated "data package".
For reference, the babynames and nycflights13 packages have had a release once every one to two years, since they first appeared on CRAN.
If you are bumping up against size issues, you should be intentional with regards to the method of data compression.
The default for `usethis::use_data(compress =)` is "bzip2", whereas the default for `save(compress =)` is (effectively) "gzip", and "xz" is yet another valid option.
You'll have to experiment with different compression methods and make this decision empirically.
`tools::resaveRdaFiles("data/")` automates this process, but doesn't inform you of which compression method was chosen.
You can learn this after the fact with `tools::checkRdaFiles()`.
Assuming you are keeping track of the code to generate your data, it would be wise to update the corresponding `use_data(compress =)` call below `data-raw/` and re-generate the `.rda` cleanly.
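Here is a minimal sketch of that workflow, run against a temporary directory that stands in for `data/`.
The directory and the `demo_data` object are made up for illustration.

```{r, eval = FALSE}
# A throwaway directory playing the role of data/.
demo_dir <- file.path(tempdir(), "rda-demo")
dir.create(demo_dir, showWarnings = FALSE)

demo_data <- rep(seq_len(1000), times = 50)
save(demo_data, file = file.path(demo_dir, "demo_data.rda"), compress = "gzip")

# Re-save every .rda file in the directory, letting R choose the method.
tools::resaveRdaFiles(demo_dir, compress = "auto")

# Report the size and the compression method that was actually used.
rda_info <- tools::checkRdaFiles(demo_dir)
rda_info[, c("size", "compress")]
```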
:::
### Documenting datasets {#sec-documenting-data}
Objects in `data/` are always effectively exported (they use a slightly different mechanism than `NAMESPACE` but the details are not important).
This means that they must be documented.
Documenting data is like documenting a function with a few minor differences.
Instead of documenting the data directly, you document the name of the dataset and save it in `R/`.
For example, the roxygen2 block used to document the `who` data in tidyr is saved in `R/data.R` and looks something like this:
```{r, eval = FALSE}
#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report ...
#'
#' @format ## `who`
#' A data frame with 7,240 rows and 60 columns:
#' \describe{
#' \item{country}{Country name}
#' \item{iso2, iso3}{2 & 3 letter ISO country codes}
#' \item{year}{Year}
#' ...
#' }
#' @source <https://www.who.int/teams/global-tuberculosis-programme/data>
"who"
```
There are two roxygen tags that are especially important for documenting datasets:
- `@format` gives an overview of the dataset.
For data frames, you should include a definition list that describes each variable.
It's usually a good idea to describe variables' units here.
- `@source` provides details of where you got the data, often a URL.
Never `@export` a data set.
### Non-ASCII characters in data {#sec-data-non-ascii}
The R objects you store in `data/*.rda` often contain strings, with the most common example being character columns in a data frame.
If you can constrain these strings to only use ASCII characters, it certainly makes things simpler.
But of course, there are plenty of legitimate reasons why package data might include non-ASCII characters.
In that case, we recommend that you embrace the [UTF-8 Everywhere manifesto](http://utf8everywhere.org) and use the UTF-8 encoding.
The `DESCRIPTION` file placed by `usethis::create_package()` always includes `Encoding: UTF-8`, so by default a devtools-produced package already advertises that it will use UTF-8.
Making sure that the strings embedded in your package data have the intended encoding is something you accomplish in your data preparation code, i.e. in the R scripts below `data-raw/`.
You can use `Encoding()` to learn the current encoding of the elements in a character vector and functions such as `enc2utf8()` or `iconv()` to convert between encodings.
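As a small sketch of what such a conversion might look like in a `data-raw/` script, the example below starts from a string whose bytes are Latin-1 and converts it to UTF-8 in two equivalent ways.
The string itself is just an illustration.

```{r, eval = FALSE}
# The bytes of this string are Latin-1, so declare that encoding explicitly.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
Encoding(x)

# Convert to UTF-8 and confirm the declared encoding of the result.
y <- enc2utf8(x)
Encoding(y)

# iconv() does the same job when you name the source encoding yourself.
z <- iconv(x, from = "latin1", to = "UTF-8")
identical(y, z)
```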
::: callout-warning
## Submitting to CRAN
If you have UTF-8-encoded strings in your package data, you may see this from `R CMD check`:
```
- checking data for non-ASCII characters ... NOTE
Note: found 352 marked UTF-8 strings
```
This `NOTE` is truly informational.
It requires no action from you.
As long as you actually intend to have UTF-8 strings in your package data, all is well.
Ironically, this `NOTE` is actually suppressed by `R CMD check --as-cran`, despite the fact that this note does appear in the check results once a package is on CRAN (which implies that CRAN does not necessarily check with `--as-cran`).
By default, `devtools::check()` sets the `--as-cran` flag and therefore does not transmit this `NOTE`.
But you can surface it with `check(cran = FALSE, env_vars = c("_R_CHECK_PACKAGE_DATASETS_SUPPRESS_NOTES_" = "false"))`.
:::
<!-- TODO: Offer some advice for those who have non-ASCII strings in their package and it is a surprise (so, it's not intentional). The best resource I have found so far for this is `tools:::.check_package_datasets()`. Perhaps devtools should get a function that does a ground-up implementation of such a search for non-ASCII strings. -->
<!-- https://github.com/wch/r-source/blob/f6737799b169710006b040f72f9abc5e09180229/src/library/tools/R/QC.R#L4672 -->
## Internal data {#sec-data-sysdata}
Sometimes your package functions need access to pre-computed data.
If you put these objects in `data/`, they'll also be available to package users, which is not appropriate.
Sometimes the objects you need are small and simple enough that you can define them with `c()` or `data.frame()` in the code below `R/`, perhaps in `R/data.R`.
Larger or more complicated objects should be stored in your package's internal data in `R/sysdata.rda`, so they are lazy-loaded on demand.
Here are some examples of internal package data:
- Two colour-related packages, [munsell](https://github.com/cwickham/munsell) and [dichromat](https://cran.r-project.org/web/packages/dichromat/index.html), use `R/sysdata.rda` to store large tables of colour data.
- [googledrive](https://github.com/tidyverse/googledrive) and [googlesheets4](https://github.com/tidyverse/googlesheets4) wrap the Google Drive and Google Sheets APIs, respectively. Both use `R/sysdata.rda` to store data derived from a so-called [Discovery Document](https://developers.google.com/discovery/v1/reference/apis) which "describes the surface of the API, how to access the API and how API requests and responses are structured".
<!-- Another example I noted: readr + data-raw/date-symbols.R + date_symbols -->
The easiest way to create `R/sysdata.rda` is to use `usethis::use_data(internal = TRUE)`:
```{r, eval = FALSE}
internal_this <- ...
internal_that <- ...
usethis::use_data(internal_this, internal_that, internal = TRUE)
```
Unlike `data/`, where you use one `.rda` file per exported data object, you store all of your internal data objects together in the single file `R/sysdata.rda`.
Let's imagine we are working on a package named "pkg".
The snippet above creates `R/sysdata.rda` inside the source of the pkg package.
This makes the objects `internal_this` and `internal_that` available for use inside of the functions defined below `R/` and in the tests.
During interactive development, `internal_this` and `internal_that` are available after a call to `devtools::load_all()`, just like an internal function.
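To make the "several objects, one file" idea concrete, here is a sketch that uses base `save()` and `load()` on a temporary file.
It only mimics the storage format; in a real package you would let `use_data(internal = TRUE)` write `R/sysdata.rda` for you, and the object contents below are invented.

```{r, eval = FALSE}
# Two made-up internal objects.
internal_this <- c(a = 1, b = 2)
internal_that <- data.frame(code = c("x", "y"), label = c("first", "second"))

# Store both objects together in a single .rda file.
path <- file.path(tempdir(), "sysdata.rda")
save(internal_this, internal_that, file = path)

# Load the file into a fresh environment and list what it contains.
restored <- new.env(parent = emptyenv())
load(path, envir = restored)
sort(ls(restored))
```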
Much of the advice given for external data holds for internal data as well:
- It's a good idea to store the code that generates your individual internal data objects, as well as the `use_data()` call that writes all of them into `R/sysdata.rda`. This is workflow code that belongs below `data-raw/`, not below `R/`.
- `usethis::use_data_raw()` can be used to initiate the use of `data-raw/` or to initiate a new `.R` script there.
- If your package is uncomfortably large, experiment with different values of `compress` in `use_data(internal = TRUE)`.
There are also key distinctions, where the handling of internal and external data differs:
- Objects in `R/sysdata.rda` are not exported (they shouldn't be), so they don't need to be documented.
- The `LazyData` field in the package `DESCRIPTION` has no impact on `R/sysdata.rda` but is strictly about the exported data below `data/`. Internal data is always lazy-loaded.
## Raw data file {#sec-data-extdata}
If you want to show examples of loading/parsing raw data, put the original files in `inst/extdata/`.
When the package is installed, all files (and folders) in `inst/` are moved up one level to the top-level directory, which is why they can't have names that conflict with standard parts of an R package, like `R/` or `DESCRIPTION`.
The files below `inst/extdata/` in the source package will be located below `extdata/` in the corresponding installed package.
You may want to revisit @fig-package-files to review the file structure for different package states.
The main reason to include such files is when a key part of a package's functionality is to act on an external file.
Examples of such packages include:
- readr, which reads rectangular data out of delimited files
- readxl, which reads rectangular data out of Excel spreadsheets
- xml2, which can read XML and HTML from file
- archive, which can read archive files, such as tar or ZIP
All of these packages have one or more example files below `inst/extdata/`, which are useful for writing documentation and tests.
It is also common for data packages to provide, e.g., a csv version of the package data that is also provided as an R object.
Examples of such packages include:
- palmerpenguins: `penguins` and `penguins_raw` are also represented as `extdata/penguins.csv` and `extdata/penguins_raw.csv`
- gapminder: `gapminder`, `continent_colors`, and `country_colors` are also represented as `extdata/gapminder.tsv`, `extdata/continent-colors.tsv`, and `extdata/country-colors.tsv`
This has two payoffs.
First, it gives teachers and other expositors more to work with once they decide to use a specific dataset.
If you've started teaching R with `palmerpenguins::penguins` or `gapminder::gapminder` and you want to introduce data import, it can be helpful to students if their first use of a new command, like `readr::read_csv()` or `read.csv()`, is applied to a familiar dataset.
They have pre-existing intuition about the expected result.
Second, if package data evolves over time, having a csv or other plain text representation in the source package can make it easier to see what's changed.
### Filepaths {#sec-data-system-file}
The path to a package file found below `extdata/` clearly depends on the local environment, i.e. it depends on where installed packages live on that machine.
The base function `system.file()` can report the full path to files distributed with an R package.
It can also be useful to *list* the files distributed with an R package.
```{r}
system.file("extdata", package = "readxl") |> list.files()
system.file("extdata", "clippy.xlsx", package = "readxl")
```
These filepaths present yet another workflow dilemma: When you're developing your package, you engage with it in its source form, but your users engage with it as an installed package.
Happily, devtools provides a shim for `base::system.file()` that is activated by `load_all()`.
This makes interactive calls to `system.file()` from the global environment and calls from within the package namespace "just work".
Be aware that, by default, `system.file()` returns the empty string, not an error, for a file that does not exist.
```{r}
system.file("extdata", "I_do_not_exist.csv", package = "readr")
```
If you want to force a failure in this case, specify `mustWork = TRUE`:
```{r error = TRUE}
system.file("extdata", "I_do_not_exist.csv", package = "readr", mustWork = TRUE)
```
The [fs package](https://fs.r-lib.org) offers `fs::path_package()`.
This is essentially `base::system.file()` with a few added features that we find advantageous, whenever it's reasonable to take a dependency on fs:
- It errors if the filepath does not exist.
- It throws distinct errors when the package does not exist vs. when the file does not exist within the package.
- During development, it works for interactive calls, calls from within the loaded package's namespace, and even for calls originating in dependencies.
```{r error = TRUE}
fs::path_package("extdata", package = "idonotexist")
fs::path_package("extdata", "I_do_not_exist.csv", package = "readr")
fs::path_package("extdata", "chickens.csv", package = "readr")
```
````{=html}
<!--
```
during development after installation
/path/to/local/package /path/to/some/installed/package
├── DESCRIPTION ├── DESCRIPTION
├── ... ├── ...
├── inst └── some-installed-file.txt
│ └── some-installed-file.txt
└── ...
```
`fs::path_package("some-installed_file.txt")` builds the correct path in both cases.
A common theme you've now encountered in multiple places is that devtools and related packages try to eliminate hard choices between having a smooth interactive development experience and arranging things correctly in your package.
-->
````
### `pkg_example()` path helpers {#sec-data-example-path-helper}
We like to offer convenience functions that make example files easy to access.
These are just user-friendly wrappers around `system.file()` or `fs::path_package()`, but can have added features, such as the ability to list the example files.
Here's the definition and some usage of `readxl::readxl_example()`:
```{r, eval = FALSE}
readxl_example <- function(path = NULL) {
if (is.null(path)) {
dir(system.file("extdata", package = "readxl"))
} else {
system.file("extdata", path, package = "readxl", mustWork = TRUE)
}
}
```
```{r}
readxl::readxl_example()
readxl::readxl_example("clippy.xlsx")
```
## Internal state {#sec-data-state}
Sometimes there's information that multiple functions from your package need to access that:
- Must be determined at load time (or even later), not at build time. It might even be dynamic.
- Doesn't make sense to pass in via a function argument. Often it's some obscure detail that a user shouldn't even know about.
A great way to manage such data is to use an *environment*.[^data-1]
This environment must be created at build time, but you can populate it with values after the package has been loaded and update those values over the course of an R session.
This works because environments have reference semantics (whereas more pedestrian R objects, such as atomic vectors, lists, or data frames have value semantics).
[^data-1]: If you don't know much about R environments and what makes them special, a great resource is the [Environments chapter](https://adv-r.hadley.nz/environments.html) of Advanced R.
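The difference between value and reference semantics is easy to see in a small experiment.
In this sketch, the function and object names are ours, chosen for illustration: the same "increment a counter" operation is applied to a list and to an environment.

```{r, eval = FALSE}
bump_list <- function(x) {
  x$count <- x$count + 1 # modifies a local copy only
  invisible(x)
}

bump_env <- function(e) {
  e$count <- e$count + 1 # modifies the one shared environment
  invisible(e)
}

counts_list <- list(count = 0)
counts_env <- new.env(parent = emptyenv())
counts_env$count <- 0

bump_list(counts_list)
bump_env(counts_env)

counts_list$count
#> [1] 0
counts_env$count
#> [1] 1
```

The list seen by the caller is unchanged, whereas the change made to the environment is visible everywhere that environment is referenced.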
Consider a package that can store the user's favorite letters or numbers.
You might start out with code like this in a file below `R/`:
```{r eval = FALSE}
favorite_letters <- letters[1:3]
#' Report my favorite letters
#' @export
mfl <- function() {
favorite_letters
}
#' Change my favorite letters
#' @export
set_mfl <- function(l = letters[24:26]) {
old <- favorite_letters
favorite_letters <<- l
invisible(old)
}
```
`favorite_letters` is initialized to ("a", "b", "c") when the package is built.
The user can then inspect `favorite_letters` with `mfl()`, at which point they'll probably want to register *their* favorite letters with `set_mfl()`.
Note that we've used the super assignment operator `<<-` in `set_mfl()` in the hope that this will reach up into the package environment and modify the internal data object `favorite_letters`.
But a call to `set_mfl()` fails like so:[^data-2]
[^data-2]: This example will execute without error if you define `favorite_letters`, `mfl()`, and `set_mfl()` in the global workspace and call `set_mfl()` in the console.
But this code will fail once `favorite_letters`, `mfl()`, and `set_mfl()` are defined *inside a package*.
```{r, eval = FALSE}
mfl()
#> [1] "a" "b" "c"
set_mfl(c("j", "f", "b"))
#> Error in set_mfl() :
#> cannot change value of locked binding for 'favorite_letters'
```
Because `favorite_letters` is a regular character vector, modification requires making a copy and rebinding the name `favorite_letters` to this new value.
And that is what's disallowed: you can't change the binding for objects in the package namespace (well, at least not without trying harder than this).
Defining `favorite_letters` this way only works if you will never need to modify it.
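You can reproduce the essence of this failure outside a package.
The sketch below uses a locked environment, `fake_ns`, as a stand-in for a package namespace, which R locks once the package has been loaded.
The name `fake_ns` is ours; the rest mirrors the example above.

```{r, eval = FALSE}
# A locked environment that plays the role of the package namespace.
fake_ns <- new.env()
fake_ns$favorite_letters <- letters[1:3]
lockEnvironment(fake_ns, bindings = TRUE)

set_mfl <- function(l = letters[24:26]) {
  old <- favorite_letters
  favorite_letters <<- l
  invisible(old)
}
# Make the function look up (and super-assign) names in fake_ns.
environment(set_mfl) <- fake_ns

# Capture the error instead of stopping.
result <- tryCatch(
  set_mfl(c("j", "f", "b")),
  error = function(e) e
)
inherits(result, "error")
#> [1] TRUE

# The original value is untouched.
fake_ns$favorite_letters
#> [1] "a" "b" "c"
```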
However, if we maintain state within an internal package environment, we **can** modify objects contained in the environment (and even add completely new objects).
Here's an alternative implementation that uses an internal environment named "the".
```{r, eval = FALSE}
the <- new.env(parent = emptyenv())
the$favorite_letters <- letters[1:3]
#' Report my favorite letters
#' @export
mfl2 <- function() {
the$favorite_letters
}
#' Change my favorite letters
#' @export
set_mfl2 <- function(l = letters[24:26]) {
old <- the$favorite_letters
the$favorite_letters <- l
invisible(old)
}
```
Now a user *can* register their favorite letters:
```{r, eval = FALSE}
mfl2()
#> [1] "a" "b" "c"
set_mfl2(c("j", "f", "b"))
mfl2()
#> [1] "j" "f" "b"
```
Note that this new value for `the$favorite_letters` persists only for the remainder of the current R session (or until the user calls `set_mfl2()` again).
More precisely, the altered state persists only until the next time the package is loaded (including via `load_all()`).
At load time, the environment `the` is reset to an environment containing exactly one object, named `favorite_letters`, with value ("a", "b", "c").
It's like the movie Groundhog Day.
(We'll discuss more persistent package- and user-specific data in the next section.)
Jim Hester introduced our group to the nifty idea of using "the" as the name of an internal package environment.
This lets you refer to the objects inside in a very natural way, such as `the$token`, meaning "*the* token".
It is also important to specify `parent = emptyenv()` when defining an internal environment, as you generally don't want the environment to inherit from any other (nonempty) environment.
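A quick way to see what `parent = emptyenv()` buys you is to compare lookups that follow parent environments, such as `exists()`.
The environment names in this sketch are made up for illustration.

```{r, eval = FALSE}
leaky <- new.env() # parent defaults to the calling environment
sealed <- new.env(parent = emptyenv()) # no parent to inherit from

# `letters` lives in the base package, which `leaky` can reach via its parents.
exists("letters", envir = leaky)
#> [1] TRUE
exists("letters", envir = sealed)
#> [1] FALSE
```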
As seen in the example above, the definition of the environment should happen as a top-level assignment in a file below `R/`.
(In particular, this is a legitimate reason to define a non-function at the top-level of a package; see section @sec-code-when-executed for why this should be rare.)
As for where to place this definition, there are two considerations:
- Define it before you use it.
If other top-level calls refer to the environment, the definition must come first when the package code is being executed at build time.
This is why `R/aaa.R` is a common and safe choice.
- Make it easy to find later when you're working on related functionality.
If an environment is only used by one family of functions, define it there.
If environment usage is sprinkled around the package, define it in a file with package-wide connotations.
<!-- Examples where we name such an environment "the": rcmdcheck, itdepends, cpp11, gmailr, rlang, covr -->
Here are some examples of how packages use an internal environment:
- googledrive: Various functions need to know the file ID for the current user's home directory on Google Drive. This requires an API call (a relatively expensive and error-prone operation) which yields an eye-watering string of \~40 seemingly random characters that only a computer can love. It would be inhumane to expect a user to know this or to pass it into every function. It would also be inefficient to rediscover the ID repeatedly. Instead, googledrive determines the ID upon first need, then caches it for later use.
- usethis: Most functions need to know the active project, i.e. which directory to target for file modification. This is often the current working directory, but that is not an invariant usethis can rely upon. One potential design is to make it possible to specify the target project as an argument of every function in usethis. But this would create significant clutter in the user interface, as well as internal fussiness. Instead, we determine the active project upon first need, cache it, and provide methods for (re)setting it.
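The "determine upon first need, then cache" pattern that both packages use can be sketched as follows.
This is not the actual googledrive or usethis code: `slow_lookup()`, `root_id()`, and `reset_root_id()` are hypothetical names, and the lookup is a stand-in for an expensive operation such as an API call.

```{r, eval = FALSE}
the <- new.env(parent = emptyenv())
the$n_lookups <- 0L

# Hypothetical stand-in for an expensive operation, such as an API call.
slow_lookup <- function() {
  the$n_lookups <- the$n_lookups + 1L
  "abc123"
}

# Look the value up on first need, then serve it from the cache.
root_id <- function() {
  if (is.null(the$root_id)) {
    the$root_id <- slow_lookup()
  }
  the$root_id
}

# Clear the cached value so that the next call looks it up again.
reset_root_id <- function() {
  the$root_id <- NULL
  invisible(NULL)
}

root_id()
root_id()
the$n_lookups
#> [1] 1
```

The second call to `root_id()` is answered from `the`, so the expensive lookup runs only once per session unless the cache is reset.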
The blog post [Package-Wide Variables/Cache in R Packages](https://trestletech.com/2013/04/package-wide-variablescache-in-r-package/) gives a more detailed development of this technique.
<!-- I've always felt like Hermione's beaded bag / Bag of Holding is a great analogy for this environment technique. As long as you've got this bag, you can keep whatever you like in it. Should I try to develop this? -->
## Persistent user data {#sec-data-persistent}
Sometimes there is data that your package obtains, on behalf of itself or the user, that should persist *even across R sessions*.
This is our last and probably least common form of storing package data.
For the data to persist this way, it has to be stored on disk and the big question is where to write such a file.
This problem is hardly unique to R.
Many applications need to leave notes to themselves.
It is best to comply with external conventions, which in this case means the [XDG Base Directory Specification](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html).
You need to use the official locations for persistent file storage, because it's the responsible and courteous thing to do and also to comply with CRAN policies.
::: callout-warning
## Submitting to CRAN
You can't just write persistent data into the user's home directory.
Here's a relevant excerpt from the CRAN policy at the time of writing:
> Packages should not write in the user's home filespace (including clipboards), nor anywhere else on the file system apart from the R session's temporary directory ....
>
> For R version 4.0 or later (hence a version dependency is required or only conditional use is possible), packages may store user-specific data, configuration and cache files in their respective user directories obtained from `tools::R_user_dir()`, provided that by \[sic\] default sizes are kept as small as possible and the contents are actively managed (including removing outdated material).
:::
The primary function you should use to derive acceptable locations for user data is `tools::R_user_dir()`[^data-3].
Here are some examples of the generated filepaths:
[^data-3]: Note that `tools::R_user_dir()` first appeared in R 4.0.
If you need to support older versions of R, then you should use the [rappdirs package](https://rappdirs.r-lib.org), which is a port of the Python appdirs module, and which follows the [tidyverse policy regarding R version support](https://www.tidyverse.org/blog/2019/04/r-version-support/), meaning the minimum supported R version is advancing and will eventually slide past R 4.0.
rappdirs produces different filepaths than `tools::R_user_dir()`.
However, both tools implement something that is consistent with the XDG spec, just with different opinions about how to create filepaths beyond what the spec dictates.
```{r}
tools::R_user_dir("pkg", which = "data")
tools::R_user_dir("pkg", which = "config")
tools::R_user_dir("pkg", which = "cache")
```
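Building on those paths, here is one possible shape for a pair of cache helpers.
The helper names (`cache_path()`, `cache_write()`, `cache_read()`) and the choice of `.rds` files are assumptions made for this sketch, and the demonstration deliberately targets a temporary directory so that nothing is written to the real user cache.

```{r, eval = FALSE}
# Build the path of a cache file; by default, below the package's cache dir.
cache_path <- function(name, dir = tools::R_user_dir("pkg", which = "cache")) {
  file.path(dir, paste0(name, ".rds"))
}

# Create the directory on first use, then write the object.
cache_write <- function(x, name, dir = tools::R_user_dir("pkg", which = "cache")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(x, cache_path(name, dir))
  invisible(x)
}

# Return the cached object, or NULL if nothing has been cached under `name`.
cache_read <- function(name, dir = tools::R_user_dir("pkg", which = "cache")) {
  path <- cache_path(name, dir)
  if (!file.exists(path)) {
    return(NULL)
  }
  readRDS(path)
}

# Demonstration against a temporary directory.
demo_dir <- file.path(tempdir(), "pkg-cache-demo")
cache_write(list(answer = 42), "settings", dir = demo_dir)
cache_read("settings", dir = demo_dir)
cache_read("missing", dir = demo_dir)
```

Making the directory an argument keeps the helpers easy to test and lets the default fall back to the sanctioned location.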
One last thing you should consider with respect to persistent data is: does this data *really* need to persist?
Do you *really* need to be the one responsible for storing it?
If the data is potentially sensitive, such as user credentials, it is recommended to obtain the user's consent to store it, i.e. to require interactive consent when initiating the cache.
Also consider that the user's operating system or command line tools might provide a means of secure storage that is superior to any DIY solution that you might implement.
The packages [keyring](https://cran.r-project.org/package=keyring), [gitcreds](https://gitcreds.r-lib.org), and [credentials](https://docs.ropensci.org/credentials/) are examples of packages that tap into externally-provided tooling.
Before embarking on any creative solution for storing secrets, consider that your effort is probably better spent integrating with an established tool.
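If you do decide to store something yourself, requiring interactive consent could look roughly like this.
The helper name `ok_to_cache()` and the wording of the prompt are hypothetical; the sketch treats a non-interactive session as "no consent".

```{r, eval = FALSE}
ok_to_cache <- function() {
  # Without a user to ask, do not assume consent.
  if (!interactive()) {
    return(FALSE)
  }
  # askYesNo() returns TRUE, FALSE, or NA (cancel); only TRUE counts as consent.
  isTRUE(utils::askYesNo(
    "Is it OK to cache this information on disk?",
    default = FALSE
  ))
}
```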