Skip to content

Readers should raise an error for HDF files using the compact storage layout. #527

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 tasks
sharkinsspatial opened this issue Apr 2, 2025 · 3 comments
Open
2 tasks
Assignees

Comments

@sharkinsspatial
Copy link
Collaborator

While investigating possible HDF storage scenarios for scalar values from #523 I discovered that HDF also supports a "compact" storage layout where extremely small datasets or values (<64KB) are inlined into the file header https://support.hdfgroup.org/documentation/hdf5/latest/_l_b_dset_layout.html. The HDF5 lib has no support for inferring the offset and size of datasets stored using the compact layout so we have no way of creating a ChunkManifest for them and should raise an unsupported exception.

  • Create a test fixture with a scalar stored in the compact storage layout using the low-level h5py.h5d API.
  • Update HDFVirtualBackend to check the dataset's storage layout.
@sharkinsspatial sharkinsspatial self-assigned this Apr 2, 2025
@TomNicholas
Copy link
Member

TomNicholas commented Apr 3, 2025

Hmm, this is potentially an issue with the whole "readers as creators of ManifestStores" idea. We can put this inlined data into a virtual dataset and into Icechunk, it just can't be a virtual variable. (Or at least the HDF library won't help us if we want to make that virtual variable.)

If reader implementations had the ability to say "nah actually you're getting this variable in memory" then we could deal with this situation gracefully.

A compromise might be to have the error message suggest explicitly loading that particular variable.

@sharkinsspatial
Copy link
Collaborator Author

sharkinsspatial commented Apr 3, 2025

@TomNicholas I think including the suggestion to load the problematic variables is probably the way to go 👍. I'm hopeful that this is will be a fairly infrequent case.

@TomNicholas
Copy link
Member

BTW a similar thing would happen if you try to load kerchunk references that are inlined into the kerchunk reference file. But in that case it's easier to generate a reference to the data (which lives in the kerchunk json file).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants