Skip to content

Tiff test error #526

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
RichardScottOZ opened this issue Apr 2, 2025 · 25 comments
Open

Tiff test error #526

RichardScottOZ opened this issue Apr 2, 2025 · 25 comments
Labels
bug Something isn't working

Comments

@RichardScottOZ
Copy link

/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py:47: UserWarning: storage_options have been dropped from reader_options as they are not supported by kerchunk.tiff.tiff_to_zarr
  warnings.warn(
Traceback (most recent call last):
  File "/home/ubuntu/data/model-framework/test_vz.py", line 20, in <module>
    vds = open_virtual_dataset("s3://bananasplits/rasters/fleagle.tif", reader_options={'storage_options': aws_credentials})
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/backend.py", line 200, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py", line 55, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/translators/kerchunk.py", line 55, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

Installed vz half an hour ago, python 3.11.

@maxrjones
Copy link
Member

I'm working on a rewrite of our TIFF virtualization over in #524, since this has been broken for a long while. Are you able to share what file you're trying to virtualize so that I can test it works?

@TomNicholas TomNicholas added the bug Something isn't working label Apr 2, 2025
@TomNicholas
Copy link
Member

@RichardScottOZ the TIFF reader currently in the library is broken and not intended to be used (#291). Sorry about that - I thought we had a NotImplementedError but apparently not.

As @maxrjones says we hopefully will have a shiny new TIFF reader available soon!

FYI even for a reader that is working we would need a much more reproducible example to help you.

@maxrjones
Copy link
Member

To expand a bit on the status update after reading the context in OSGeo/gdal#11824, #524 is not yet ready for others to try. A list of GeoTIFFs/COGs that you'd like to make sure are supported would really speed up development, since it'll expand the list of compression schemes, etc. that are implemented. I will aim to get an MVP finished this week for experimentation.

I'm confident that using async_tiff, as done in that PR is the best path forward, in contrast to the other open/closed PRs you've seen, because it makes it simple to work with Zarr V3 internally. In contrast, TIFFFile (used by Kerchunk) has indicated that they do not plan to support Zarr V3 style references.

@TomNicholas
Copy link
Member

I'm confident that using async_tiff, as done in that PR is the best path forward, in contrast to the other open/closed PRs you've seen, because it makes it simple to work with Zarr V3 internally. In contrast, TIFFFile (used by Kerchunk) has indicated that they do not plan to support Zarr V3 style references.

@maxrjones do you foresee any scenario that your async_tiff approach couldn't handle that the tifffile approach could? Should we just close the tifffile-related issues now?

@RichardScottOZ
Copy link
Author

I can make a more complete thing later today with a bit of luck

@RichardScottOZ
Copy link
Author

I'm working on a rewrite of our TIFF virtualization over in #524, since this has been broken for a long while. Are you able to share what file you're trying to virtualize so that I can test it works?

I can do something very similar at least Max.

@RichardScottOZ
Copy link
Author

import os
import configparser
import contextlib
import xarray as xr
from virtualizarr import open_virtual_dataset

def get_aws_credentials():
    parser = configparser.RawConfigParser()
    parser.read(os.path.expanduser('~/.aws/credentials'))
    credentials = parser.items('default')
    all_credentials = {key.upper(): value for key, value in [*credentials]}
    with contextlib.suppress(KeyError):
        all_credentials["AWS_REGION"] = all_credentials.pop("REGION")
    return all_credentials

creds = get_aws_credentials()

aws_credentials = {"key": creds['AWS_ACCESS_KEY_ID'], "secret": creds['AWS_SECRET_ACCESS_KEY']}
vds = open_virtual_dataset("s3://banana/test_reference.tif", reader_options={'storage_options': aws_credentials})

print(vds.mean())

#error

python test_vz.py 
/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py:47: UserWarning: storage_options have been dropped from reader_options as they are not supported by kerchunk.tiff.tiff_to_zarr
  warnings.warn(
Traceback (most recent call last):
  File "/home/ubuntu/data/model-framework/test_vz.py", line 20, in <module>
    vds = open_virtual_dataset("s3://banana/test_reference.tif", reader_options={'storage_options': aws_credentials})
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/backend.py", line 200, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py", line 55, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/translators/kerchunk.py", line 55, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

@RichardScottOZ
Copy link
Author

@maxrjones
Copy link
Member

thank you @RichardScottOZ! what timezone are you in? I know there's a lot of shared interest right now on Virtual approaches with TIFFs including you, me (e.g., https://github.com/maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug smashing session on the TIFF reader next week if you're interested

@norlandrhagen
Copy link
Collaborator

Would love to join!

@mdsumner
Copy link
Contributor

mdsumner commented Apr 2, 2025

@mdsumner
Copy link
Contributor

mdsumner commented Apr 2, 2025

I'll dig up other large files that aren't COGs but famous ones are GEBCO and IBCSO:

https://www.gebco.net/data_and_products/gridded_bathymetry_data/ (the tiff link is a directory that is the file in a zip, so it's weird to have that as a canon url and while you can stream it it doesn't work in some places)

https://doi.pangaea.de/10.1594/PANGAEA.937574?format=html#download (https://download.pangaea.de/dataset/937574/files/IBCSO_v2_ice-surface_WGS84.tif)

@RichardScottOZ
Copy link
Author

thank you @RichardScottOZ! what timezone are you in? I know there's a lot of shared interest right now on Virtual approaches with TIFFs including you, me (e.g., https://github.com/maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug smashing session on the TIFF reader next week if you're interested

Next week Australian Central Time - GMT +9.5 - half an hour behind Michael basically.

@maxrjones
Copy link
Member

thank you @RichardScottOZ! what timezone are you in? I know there's a lot of shared interest right now on Virtual approaches with TIFFs including you, me (e.g., maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug smashing session on the TIFF reader next week if you're interested

Next week Australian Central Time - GMT +9.5 - half an hour behind Michael basically.

We don't have much overlap in waking hours, but I sent an invite to y'all for Monday evening GMT-4 Tuesday / Tuesday morning GMT+9.5. No worries if it doesn't work out, the examples shared here will give me plenty to work from.

@RichardScottOZ
Copy link
Author

That should be ok I think Max.

@claytharrison
Copy link

claytharrison commented Apr 4, 2025

To jump in on this - here's a set of GeoTIFFs I'm working on virtualizing: SENTINEL1 Sigma Nought (SIG0) Backscatter in 20 meter resolution.

I've been following a manual approach with tifffile following this tutorial and the lessons from this issue fsspec/kerchunk#78, but the lack of access to intra-file chunks makes it a nonstarter for our usecase unfortunately. We're fitting models to multi-year timeseries stacks per pixel, which requires loading the entire stack of images per 15000x15000 tile into memory when you can't access sub-file chunks. Too much for our beefy machines. I can build the virtual zarr just fine and use it to load data without a problem, but this specific processing case is important and just doesn't work well right now.

Note the nonstandard blocking of 5x15000. I've heard tell from coworkers that there may be some images in there with different blocking but I haven't run any checks to confirm that yet. I do wonder if this would cause some issues when using the proposed TIF reader if true.

I admire all the hard work being done here to enable big geodata workflows, huge thanks to everyone involved :)

@rabernat
Copy link
Collaborator

rabernat commented Apr 4, 2025

We're fitting models to multi-year timeseries stacks per pixel, which requires loading the entire stack of images per 15000x15000 tile into memory when you can't access sub-file chunks. Too much for our beefy machines. I can build the virtual zarr just fine and use it to load data without a problem, but this specific processing case is important and just doesn't work well right now.

This sounds like a scenario where you're better off rechunking your data to align more with your access pattern (i.e. contiguous in time; chunked in space) and then storing it as native (not virtual Zarr).

@claytharrison
Copy link

This sounds like a scenario where you're better off rechunking your data to align more with your access pattern (i.e. contiguous in time; chunked in space) and then storing it as native (not virtual Zarr).

I absolutely agree, and this is where we're slowly headed, but this is a PB-scale dataset that plenty of operational workflows depend on, so it's going to take a while before that's done unfortunately.

Virtualizing it would be a really nice in-between step, and it feels so close if only we could represent those tiff chunks.

@TomNicholas
Copy link
Member

@claytharrison that makes sense, and is a pattern I expect to see a lot in the future. First virtualize the data with the original chunks for cloud-optimized but still-suboptimal performance, then later create additional rechunked copies optimized for expected query patterns.

@cgohlke
Copy link

cgohlke commented Apr 4, 2025

I've been following a manual approach with tifffile following this tutorial and the lessons from this issue fsspec/kerchunk#78, but the lack of access to intra-file chunks makes it a nonstarter for our usecase unfortunately.

Tifffile provides access to intra-file chunks. Did you try to create reference files for the individual files (for example via tiff2fsspec) and merge them (assuming image sizes, chunk sizes, data types, compressions, etc match).

you foresee any scenario that your async_tiff approach couldn't handle that the tifffile approach could?

Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables? Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

@TomNicholas
Copy link
Member

FYI we can (and it sounds like we should) have multiple TIFF readers for virtualizarr. After #498 implementing one of these readers will be as simple as writing one function like reader_func(path: str) -> ManifestStore:, so we can and should have multiple to serve all possible requirements.

@maxrjones
Copy link
Member

Thanks for joining the discussion @cgohlke and for all your work on TIFFFile and imagecodecs!

Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables?

Probably not given it's relatively new and motivated primarily by geospatial use-cases, though @kylebarron would know more as the author of async_tiff.

Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

This is really helpful context, thanks for sharing. I don't expect any of this to be easier with async_tiff, since we still need to translate the metadata to be Zarr-compatible and will run up against the same absence of float24 or complex integer dtypes, for example. I also expect that we'll want to register imagecodecs codecs in the Array metadata (if that's possible) meaning all codec limitations will be the same.

@maxrjones
Copy link
Member

maxrjones commented Apr 7, 2025

FYI we can (and it sounds like we should) have multiple TIFF readers for virtualizarr. After #498 implementing one of these readers will be as simple as writing one function like reader_func(path: str) -> ManifestStore:, so we can and should have multiple to serve all possible requirements.

+1. I'm currently debating whether it would even be better to develop the async_tiff reader separately from the VirtualiZarr repo rather than continuing to build it as a refactor of TIFFVirtualBackend as in #524. Here are my reasons for considering building it out separately (currently experimenting in https://github.com/maxrjones/virtual-tiff):

  • If there is a single canonical TIFF reader, it might make more sense to base it off TIFFFile as a more established library than async_tiff
  • I'd like to be really comprehensive in testing, which might be discouraged considering the VirtualiZarr's primary objective is not readers. As one example, I'm currently downloading and using all of GDAL's test TIFF files (thanks to Michael for the pointer to these)
  • Developing the async_tiff reader separately would speed up tests all around
  • VirtualiZarr is approaching a stable API while also getting a lot more engagement, so it might be time in general to consider splitting off readers to achieve a more sustainable maintenance model

@kylebarron
Copy link
Contributor

I agree that there's scope for multiple TIFF readers. tifffile has been around longer, is in much wider use, and seems to be very stable, so it's probably the better default to use. But judging from other obstore performance improvements (zarr-developers/zarr-python#1661 (comment)) I think there's still potential in an opt-in async-tiff backend (they each use the same Rust IO code under the hood).

Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables?

The current API of async-tiff is quite minimal and mostly consists of parsing and then exposing raw IFD metadata to the user.

I would say that TIFF-like formats that are not actually TIFF are not in scope. It is able to parse the JPEG tables from the IFD metadata though.

Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

I think this question is less about parsing TIFF and more about how to represent the parsed metadata in virtual references? In that case those questions go back to @maxrjones and @TomNicholas, as I haven't been following that side of the work stream. I've only been working on the low-level format parsing.

@TomNicholas
Copy link
Member

I'm currently debating whether it would even be better to develop the async_tiff reader separately from the VirtualiZarr repo

I would be fine with that. But we should have some default TIFF reader somewhere because it's such a widely-used format.

Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

I think this question is less about parsing TIFF and more about how to represent the parsed metadata in virtual references?

Yes, but mostly it's actually a zarr issue rather than a virtualizarr issue (requiring new codecs or data types for example). It's easy to represent any codec or data type in virtualizarr, iff it is actually supported by zarr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

9 participants