bind_rows()
on a list of data.tables creates corrupt output
#6676
Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. Please make sure your reprex is created with the reprex package as it gives nicely formatted output and avoids a number of common pitfalls.
So the reprex package gives me this:
I already spent a lot of time reproducing this. It might not use the reprex package, but the example is solid, isn't it? I work on high-security servers with no internet access, on data with social security numbers that I would be fired and possibly prosecuted for disclosing. Reproducing it outside those servers took a while, as the bug is so subtle. I have already spent 5 days finding the bug and correcting the issues caused by it, and there is more work to be done making sure that the bug hasn't caused errors elsewhere in other projects. I don't have any more time to give, I'm sorry. It would have to wait a couple of weeks.
Here's a reprex:
library(data.table)
library(dplyr, warn.conflicts = FALSE)
data <- data.table(
var1 = c("a", "b"),
key = "var1"
)
data[var1 %in% c('a', 'b')]
#> var1
#> 1: a
#> 2: b
# doesn't work
dt <- bind_rows(data[2, ], data[1, ])
dt[var1 %in% c('a', 'b')]
#> var1
#> 1: b
# works
dt <- bind_rows(data[1, ], data[2, ])
dt[var1 %in% c('a', 'b')]
#> var1
#> 1: a
#> 2: b
Created on 2023-02-02 with reprex v2.0.2
A bit more information. It looks like
library(data.table)
library(dplyr, warn.conflicts = FALSE)
data <- data.table(
var1 = c("a", "b"),
key = "var1"
)
# Knows it is sorted by `var1`
attributes(data)
#> $names
#> [1] "var1"
#>
#> $row.names
#> [1] 1 2
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x7fc95800d4e0>
#>
#> $sorted
#> [1] "var1"
# These are still sorted
attributes(data[1,])$sorted
#> [1] "var1"
attributes(data[2,])$sorted
#> [1] "var1"
# As an example, this is no longer sorted
attributes(data[2:1,])$sorted
#> NULL
both <- bind_rows(
data[2, ],
data[1, ]
)
# But this looks sorted because of attribute reconstruction by `dplyr_reconstruct()`
attributes(both)$sorted
#> [1] "var1"
# gforce's fast chin must use the `sorted` attribute so this is wrong
both[var1 %in% c("a", "b"), verbose = TRUE]
#> Optimized subsetting with key 'var1'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.var1 has same type (character) as x.var1. No coercion needed.
#> on= matches existing key, using key
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> var1
#> 1: b
Ironically
library(data.table)
data <- data.table(
var1 = c("a", "b"),
key = "var1"
)
data <- vctrs::vec_rbind(
data[2,],
data[1,]
)
# No longer sorted, correct
attributes(data)$sorted
#> NULL
Here is another example with
library(data.table)
library(dplyr, warn.conflicts = FALSE)
data <- data.table(
var1 = c("a", "b"),
key = "var1"
)
# Knows it is sorted by `var1`
attributes(data)
#> $names
#> [1] "var1"
#>
#> $row.names
#> [1] 1 2
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x7fd6b700d4e0>
#>
#> $sorted
#> [1] "var1"
data[var1 %in% c("a", "b")]
#> var1
#> 1: a
#> 2: b
data <- mutate(data, var1 = rev(var1))
# Oops, still thinks it is sorted
attributes(data)$sorted
#> [1] "var1"
data[var1 %in% c("a", "b")]
#> var1
#> 1: b
It is possible we need a
We'd have to consider whether or not the reconstruct method should try and restore "miscellaneous" attributes that look unrelated to data.table, but that might be fragile. We try to do that here, but I feel like we should reconsider how wise that is:
Lines 240 to 244 in e8702df
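Until the reconstruction behaviour is settled, one possible user-side workaround is to clear the key on the bound result so that data.table no longer trusts the row order. This is only a minimal sketch using the reprex data from this thread; it assumes `setkey(x, NULL)` (data.table's documented way to remove a key) is acceptable in your pipeline, and it is not a fix in dplyr itself.

```r
library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(var1 = c("a", "b"), key = "var1")

# Bind in reverse order; the result may still carry the `sorted` attribute
both <- bind_rows(data[2, ], data[1, ])

# Drop the (possibly stale) key so data.table stops assuming sorted rows
setkey(both, NULL)

key(both)
both[var1 %in% c("a", "b")]
```

Clearing the key costs nothing if the attribute was already absent, so the call can be applied unconditionally after `bind_rows()`.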
A little more context, prompted by https://stackoverflow.com/q/76895536/3358272: the user does not have to explicitly (or even knowingly) add an index to a
Using the example from that SO question:
mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018] # this appears to assign `year` as an index to mydt1
rbindlist(list(mydt1, mydt2))[year == 2017]
### produces expected output:
# year month
# 1: 2017 1
# 2: 2017 2
# 3: 2017 3
# 4: 2017 4
# 5: 2017 5
# 6: 2017 6
subset(bind_rows(mydt1, mydt2), year == 2017)
### produces the same output as above
bind_rows(mydt1, mydt2)[year == 2017]
# year month
# 1: 2017 1
# 2: 2017 2
# 3: 2017 3
The last output is incorrect because
As stated in the answer in the Stack question and in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-secondary-indices-and-auto-indexing.html, setting options(datatable.use.index = FALSE) prevents auto-indexing like this (though I think that only has explanatory value here, no benefit to a change in
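To make the auto-indexing case above easier to inspect, here is a small sketch using the same two tables. It relies on data.table's `indices()` to list secondary indices and `setindex(x, NULL)` to drop them; whether an index is actually created by the first filter depends on the `datatable.auto.index` option, so the `indices()` result is not shown. Treat this as a user-side workaround under those assumptions, not as the intended behaviour of `bind_rows()`.

```r
library(data.table)
library(dplyr, warn.conflicts = FALSE)

mydt1 <- data.table(year = rep(2017:2018, each = 3), month = rep(1:3, times = 2))
mydt2 <- data.table(year = rep(2016:2017, each = 3), month = rep(4:6, times = 2))

mydt1[year == 2018]  # with default options this may auto-create an index on `year`
indices(mydt1)       # lists any secondary indices currently attached to mydt1

combined <- bind_rows(mydt1, mydt2)

# Drop all secondary indices on the bound result, stale or not
setindex(combined, NULL)

combined[year == 2017]
```

With the index removed, the filter has to scan (or re-index) the combined rows rather than reuse an index that was built for `mydt1` alone.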
When using bind_rows on a list of data.tables with keys, with either map_dfr or a do.call, the keys are not removed. This is a problem because the keys are no longer correct, which means that later queries use an index that is wrong.
I am not sure whether I should report this under data.table or dplyr, but since the issue is in bind_rows, I guess it is dplyr.
Here is the issue I filed under data.table
The same thing happens with map_dfr, which uses bind_rows under the hood. This is how I found the issue. It is very common in my workflow, and I suspect I am not the only one.
Doing the same with rbind removes the keys, which is the expected behaviour, as the index from the key is no longer valid.
Curiously, using split() instead of lapply circumvents the issue, but this is not a solution: typically you need to apply a function to the data, so splitting it with split() and then recombining it makes no sense. I am including it here for completeness.
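For comparison, a minimal sketch of binding a list of keyed data.tables with data.table's own `rbindlist()`, which likewise returns a result without a key. The `pieces` list is a stand-in for whatever `lapply()` produces in a real workflow and is purely illustrative.

```r
library(data.table)

data <- data.table(var1 = c("a", "b"), key = "var1")

# Hypothetical list of keyed pieces, deliberately in reverse order
pieces <- list(data[2, ], data[1, ])

# data.table's row-binder for lists; the result carries no key
combined <- rbindlist(pieces)

key(combined)
combined[var1 %in% c("a", "b")]
```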