-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45300: [R] Remove data.table from class attribute in metadata #45346
base: main
Are you sure you want to change the base?
Conversation
|
@github-actions crossbow submit test-r-extra-packages |
Revision: 9827ff1 Submitted crossbow builds: ursacomputing/crossbow @ actions-e869f440e0
|
I haven't read the SO post yet, but FWIW, we added some extra tests confirming that generally speaking data.table objects do roundtrip. #43634 has some more details, but IIUC the pointed not being roundtripped is ok and data.table should recreate it when it needs to. |
I guess an alternative to dropping the data.table class, if we wanted to round-trip data.table objects, would be when applying R metadata on read, if class contains data.table, call |
Thanks! I'll take a look at the setDT idea later today. I'm also considering whether it would work to just drop the |
I thought we already did--it's an |
IIUC data.table self-heals in situations like these. You might have a performance hit the first time you do something when it recreates the pointer, but that's expected with any serialization of data.table (it's not unique to arrow or parquet). @TysonStanley might be able to correct me if I'm off here (or advise if there's a better way for this to happen) |
You're both right about the behavior. The > library(data.table)
> dt_in <- data.table(x = 1:10)
> attributes(dt_in)
$names
[1] "x"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x13b821ae0>
> arrow::write_parquet(dt_in, "test.parquet")
> dt_out <- read_parquet("test.parquet")
> attributes(dt_out)
$names
[1] "x"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
> setDT(dt_out)
> attributes(dt_out)
$names
[1] "x"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x13b821ae0> It doesn't make sense to me how the Edit: Reading the SO post now. |
Yeah, this is even true if you don't attach dt_in <- data.table::data.table(x = 1:10)
attributes(dt_in)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
arrow::write_parquet(dt_in, "test.parquet")
dt_out <- arrow::read_parquet("test.parquet")
attributes(dt_out)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
data.table::setDT(dt_out)
attributes(dt_out)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
dt_in[, y := 10:1]
attributes(dt_in)
#> $names
#> [1] "x" "y"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
attributes(dt_out)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x141813ee0>
print(dt_in)
#> x y
#> <int> <int>
#> 1: 1 10
#> 2: 2 9
#> 3: 3 8
#> 4: 4 7
#> 5: 5 6
#> 6: 6 5
#> 7: 7 4
#> 8: 8 3
#> 9: 9 2
#> 10: 10 1
print(dt_out)
#> x
#> <int>
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> 6: 6
#> 7: 7
#> 8: 8
#> 9: 9
#> 10: 10 |
Nothing unexpected. It will even create the same library(data.table)
dt = data.table(x=1:3)
attributes(dt)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310>
dt2 = data.table(x=1:3)
attributes(dt2)
#> $names
#> [1] "x"
#>
#> $row.names
#> [1] 1 2 3
#>
#> $class
#> [1] "data.table" "data.frame"
#>
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310> But you can always check that they are not the same by checking |
I thought that's what we're already doing, so maybe I misunderstand the original report. |
I consider the initial report as WAI, although our warning could be better since we also warn about the serializing case when |
Revert "Remove data.table from class attribute" This reverts commit 9827ff1. Use setDT when applying metadata
aa03eec
to
459bde8
Compare
We can silence the warning if we do what @nealrichardson suggested in #45346 (comment) though I'm not sure whether we want to or not. I pushed up 459bde8 for discussion. |
Rationale for this change
Fixes #45300.
What changes are included in this PR?
Updated check in
arrow_attributes
in metadata.R.Are these changes tested?
Yes.
Are there any user-facing changes?
Users may be depending on data.tables round-tripping so this may be considered a very minor breaking change.