Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-45300: [R] Remove data.table from class attribute in metadata #45346

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

amoeba
Copy link
Member

@amoeba amoeba commented Jan 24, 2025

Rationale for this change

Fixes #45300.

What changes are included in this PR?

Updated check in arrow_attributes in metadata.R.

Are these changes tested?

Yes.

Are there any user-facing changes?

Users may be depending on data.tables round-tripping so this may be considered a very minor breaking change.

Copy link

⚠️ GitHub issue #45300 has been automatically assigned in GitHub to PR creator.

@jonkeane
Copy link
Member

@github-actions crossbow submit test-r-extra-packages

Copy link

Revision: 9827ff1

Submitted crossbow builds: ursacomputing/crossbow @ actions-e869f440e0

Task Status
test-r-extra-packages GitHub Actions

@jonkeane
Copy link
Member

I haven't read the SO post yet, but FWIW, we added some extra tests confirming that generally speaking data.table objects do roundtrip. #43634 has some more details, but IIUC the pointed not being roundtripped is ok and data.table should recreate it when it needs to.

@nealrichardson
Copy link
Member

I guess an alternative to dropping the data.table class, if we wanted to round-trip data.table objects, would be when applying R metadata on read, if class contains data.table, call setDT() on the data frame?

@amoeba
Copy link
Member Author

amoeba commented Jan 24, 2025

Thanks! I'll take a look at the setDT idea later today. I'm also considering whether it would work to just drop the .internal.selfref attribute.

@nealrichardson
Copy link
Member

nealrichardson commented Jan 24, 2025

Thanks! I'll take a look at the setDT idea later today. I'm also considering whether it would work to just drop the .internal.selfref attribute.

I thought we already did--it's an externalptr right? My understanding of the original issue was that we were dropping the pointer but not the class, so data.table was confused to see an object that declared itself to be a data.table but that didn't have the pointer attached.

@jonkeane
Copy link
Member

so data.table was confused to see an object that declared itself to be a data.table but that didn't have the pointer attached.

IIUC data.table self-heals in situations like these. You might have a performance hit the first time you do something when it recreates the pointer, but that's expected with any serialization of data.table (it's not unique to arrow or parquet). @TysonStanley might be able to correct me if I'm off here (or advise if there's a better way for this to happen)

@amoeba
Copy link
Member Author

amoeba commented Jan 24, 2025

You're both right about the behavior. The .internal.selfref attribute is removed and data.table self-heals like @jonkeane said,

> library(data.table)
> dt_in <- data.table(x = 1:10)
> attributes(dt_in)
$names
[1] "x"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x13b821ae0>
> arrow::write_parquet(dt_in, "test.parquet")
> dt_out <- read_parquet("test.parquet")
> attributes(dt_out)
$names
[1] "x"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"
> setDT(dt_out)
> attributes(dt_out)
$names
[1] "x"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x13b821ae0>

It doesn't make sense to me how the .internal.selfref pointer is the same on both sides though.

Edit: Reading the SO post now.

@TysonStanley
Copy link

Yeah, this is even true if you don't attach data.table but interestingly even with the same .internal.selfref modifying dt_in does not result in changes in dt_out so they are seen as different objects with the same .internal.selfref.

dt_in <- data.table::data.table(x = 1:10)
attributes(dt_in)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>
arrow::write_parquet(dt_in, "test.parquet")
dt_out <- arrow::read_parquet("test.parquet")
attributes(dt_out)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
data.table::setDT(dt_out)
attributes(dt_out)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>
dt_in[, y := 10:1]

attributes(dt_in)
#> $names
#> [1] "x" "y"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>
attributes(dt_out)
#> $names
#> [1] "x"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x141813ee0>

print(dt_in)
#>         x     y
#>     <int> <int>
#>  1:     1    10
#>  2:     2     9
#>  3:     3     8
#>  4:     4     7
#>  5:     5     6
#>  6:     6     5
#>  7:     7     4
#>  8:     8     3
#>  9:     9     2
#> 10:    10     1
print(dt_out)
#>         x
#>     <int>
#>  1:     1
#>  2:     2
#>  3:     3
#>  4:     4
#>  5:     5
#>  6:     6
#>  7:     7
#>  8:     8
#>  9:     9
#> 10:    10

@ben-schwen
Copy link

Yeah, this is even true if you don't attach data.table but interestingly even with the same .internal.selfref modifying dt_in does not result in changes in dt_out so they are seen as different objects with the same .internal.selfref.

Nothing unexpected. It will even create the same .internal.selfref for different objects.

library(data.table)

dt = data.table(x=1:3)
attributes(dt)
#> $names
#> [1] "x"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310>

dt2 = data.table(x=1:3)
attributes(dt2)
#> $names
#> [1] "x"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x5640ce7bc310>

But you can always check that they are not the same by checking address(dt) != address(dt2). "Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory.

@nealrichardson
Copy link
Member

"Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory.

I thought that's what we're already doing, so maybe I misunderstand the original report.

@ben-schwen
Copy link

"Safe option" would be not to drop the data.table class but only drop the .internal.selfref attribute, so data.table will detect that and allocate the right memory.

I thought that's what we're already doing, so maybe I misunderstand the original report.

I consider the initial report as WAI, although our warning could be better since we also warn about the serializing case when saveRDS and subsequent readRDS.

Revert "Remove data.table from class attribute"

This reverts commit 9827ff1.

Use setDT when applying metadata
@amoeba amoeba force-pushed the fix/GH-45300--r-data.table-metadata branch from aa03eec to 459bde8 Compare January 25, 2025 01:42
@amoeba
Copy link
Member Author

amoeba commented Jan 25, 2025

We can silence the warning if we do what @nealrichardson suggested in #45346 (comment) though I'm not sure whether we want to or not. I pushed up 459bde8 for discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] Arrow write_parquet() breaks data.table ability to set columns by reference
5 participants