% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tbl-cube.r
\name{tbl_cube}
\alias{tbl_cube}
\title{A data cube tbl.}
\usage{
tbl_cube(dimensions, measures)
}
\arguments{
\item{dimensions}{A named list of vectors. A dimension is a variable
whose values are known before the experiment is conducted; they are
fixed by design (in \pkg{reshape2} they are known as id variables).
\code{tbl_cube}s are dense, which means that almost every combination of
the dimensions should have associated measurements: missing values require
an explicit NA, so if the variables are nested, not crossed, the
majority of the data structure will be empty. Dimensions are typically,
but not always, categorical variables.}
\item{measures}{A named list of arrays. A measure is something that is
actually measured, and is not known in advance. The \code{dim()} of each
array should equal the lengths of the dimension vectors, in order. Measures are
typically, but not always, continuous values.}
}
\description{
A cube tbl stores data in a compact array format where dimension
names are not needlessly repeated. Cube tbls are particularly appropriate for
experimental data where all combinations of factors are tried (e.g.
complete factorial designs), or for storing the result of aggregations.
Compared to data frames, they will occupy much less memory when variables
are crossed, not nested.
}
\details{
\code{tbl_cube} support is currently experimental and little performance
optimisation has been done, but you may find them useful if your data
already comes in this form, or you struggle with the memory overhead of the
sparse/crossed representation of data frames. There is no support for hierarchical
indices (although I think that would be a relatively straightforward
extension to storing data frames for indices rather than vectors).
}
\section{Implementation}{
Manipulation functions (M = operates on measures, D = operates on dimensions):
\itemize{
\item \code{select} (M)
\item \code{summarise} (M), corresponds to roll-up, but rather more
limited since there are no hierarchies.
\item \code{filter} (D), corresponds to slice/dice.
\item \code{mutate} (M) is not implemented, but should be relatively
straightforward given the implementation of \code{summarise}.
\item \code{arrange} (D?) is not implemented: it is not obvious how much
sense it would make.
}
Joins: not implemented. See \code{vignettes/joins.graffle} for ideas.
Probably straightforward if you get the indexes right, and that's probably
some straightforward array/tensor operation.
}
\examples{
# The built in nasa dataset records meteorological data (temperature,
# cloud cover, ozone etc) for a 4d spatio-temporal dataset (lat, long,
# month and year)
nasa
head(as.data.frame(nasa))
titanic <- as.tbl_cube(Titanic)
head(as.data.frame(titanic))
admit <- as.tbl_cube(UCBAdmissions)
head(as.data.frame(admit))
as.tbl_cube(esoph, dim_names = 1:3)
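# A minimal sketch of building a cube by hand with tbl_cube(). The names
# `dims`, `resp` and `trial` are illustrative assumptions, not part of the
# package. Each measure is an array whose dim() equals the lengths of the
# dimension vectors, in order; a missing combination needs an explicit NA.
dims <- list(dose = c("low", "high"), day = 1:3)
resp <- array(c(1.2, 3.4, 1.5, 3.9, NA, 4.1), dim = c(2L, 3L))
trial <- tbl_cube(dimensions = dims, measures = list(response = resp))
trial
as.data.frame(trial)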
# Some manipulation examples with the NASA dataset --------------------------
# select() operates only on measures: it doesn't affect dimensions in any way
select(nasa, cloudhigh:cloudmid)
select(nasa, matches("temp"))
# filter() operates only on dimensions
filter(nasa, lat > 0, year == 2000)
# Each component can only refer to one dimension, ensuring that you always
# create a rectangular subset
\dontrun{filter(nasa, lat > long)}
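# Why one dimension per condition gives a rectangular result, sketched with
# a plain base R array (an illustration only, not the package internals;
# `m`, `keep_lat`, `keep_long` and `sub` are hypothetical names). Each
# condition becomes an index along a single axis, so the subset is again
# an array.
m <- array(1:12, dim = c(3L, 4L),
  dimnames = list(lat = as.character(1:3), long = as.character(1:4)))
keep_lat <- c(TRUE, FALSE, TRUE)
keep_long <- 2:3
sub <- m[keep_lat, keep_long, drop = FALSE]
dim(sub)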
# Arrange is meaningless for tbl_cubes
# summarise() rolls up the measures within each group of dimension values
by_loc <- group_by(nasa, lat, long)
summarise(by_loc, pressure = max(pressure), temp = mean(temperature))
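# A rough base R sketch of the memory argument for crossed data. The
# variables `a`, `b`, `d`, `crossed` and `dense` are illustrative
# assumptions, and exact sizes vary by platform. A fully crossed design
# stored as a data frame repeats every dimension value on every row; a
# dense array stores only the measure (a cube also keeps one short vector
# per dimension, which is negligible here).
a <- seq(0, 1, length.out = 24)
b <- seq(0, 1, length.out = 24)
d <- 1:72
crossed <- expand.grid(a = a, b = b, d = d)
crossed$value <- 0
dense <- array(0, dim = c(length(a), length(b), length(d)))
object.size(crossed)
object.size(dense)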
}
\seealso{
\code{\link{as.tbl_cube}} for ways of coercing existing data
structures into a \code{tbl_cube}.
}