Tiff test error #526
I'm working on a rewrite of our TIFF virtualization over in #524, since this has been broken for a long while. Are you able to share the file you're trying to virtualize, so that I can test that it works? |
@RichardScottOZ the TIFF reader currently in the library is broken and not intended to be used (#291). Sorry about that. As @maxrjones says, we hopefully will have a shiny new TIFF reader available soon! FYI, even for a reader that is working, we would need a much more reproducible example to help you. |
To expand a bit on the status update after reading the context in OSGeo/gdal#11824, #524 is not yet ready for others to try. A list of GeoTIFFs/COGs that you'd like to make sure are supported would really speed up development, since it'll expand the list of compression schemes, etc. that are implemented. I will aim to get an MVP finished this week for experimentation. I'm confident that using |
@maxrjones do you foresee any scenario that your |
I can make a more complete thing later today, with a bit of luck.
I can do something very similar at least, Max. |
import os
import configparser
import contextlib

from virtualizarr import open_virtual_dataset

def get_aws_credentials():
    """Read the default profile from ~/.aws/credentials."""
    parser = configparser.RawConfigParser()
    parser.read(os.path.expanduser('~/.aws/credentials'))
    credentials = parser.items('default')
    all_credentials = {key.upper(): value for key, value in credentials}
    # fsspec expects "region" under a different key than the credentials file uses
    with contextlib.suppress(KeyError):
        all_credentials["AWS_REGION"] = all_credentials.pop("REGION")
    return all_credentials

creds = get_aws_credentials()
aws_credentials = {"key": creds['AWS_ACCESS_KEY_ID'], "secret": creds['AWS_SECRET_ACCESS_KEY']}

vds = open_virtual_dataset("s3://banana/test_reference.tif", reader_options={'storage_options': aws_credentials})
print(vds.mean())
#error
python test_vz.py
/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py:47: UserWarning: storage_options have been dropped from reader_options as they are not supported by kerchunk.tiff.tiff_to_zarr
warnings.warn(
Traceback (most recent call last):
File "/home/ubuntu/data/model-framework/test_vz.py", line 20, in <module>
vds = open_virtual_dataset("s3://banana/test_reference.tif", reader_options={'storage_options': aws_credentials})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/backend.py", line 200, in open_virtual_dataset
vds = backend_cls.open_virtual_dataset(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py", line 55, in open_virtual_dataset
refs = extract_group(refs, group)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/translators/kerchunk.py", line 55, in extract_group
raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of [] |
Thank you @RichardScottOZ! What timezone are you in? I know there's a lot of shared interest right now in virtual approaches to TIFFs, including you, me (e.g., https://github.com/maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug-smashing session on the TIFF reader next week if you're interested. |
Would love to join! |
Here are a few COGs that I would expect to work - I thought I was just doing something wrong (but also didn't pursue it very deeply):
https://projects.pawsey.org.au/idea-gebco-tif/GEBCO_2024.tif (4 GB)
https://github.com/mdsumner/rema-ovr/raw/refs/heads/main/rema_mosaic_1km_v2.0_filled_cop30_dem.tif (60 MB)
https://github.com/mdsumner/ibcso-cog/raw/main/IBCSO_v2_ice-surface_cog.tif (220 MB)
I would also add the autotest suite from GDAL to really get a comprehensive try at file vagaries. I've listed all current "\.tif$" files here (prepend 'https://raw.githubusercontent.com/OSGeo/gdal/refs/heads/master/autotest' to get the full path). |
I'll dig up other large files that aren't COGs, but famous ones are GEBCO and IBCSO: https://www.gebco.net/data_and_products/gridded_bathymetry_data/ (the TIFF link is a directory that is the file in a zip, so it's weird to have that as a canonical URL, and while you can stream it, it doesn't work in some places) https://doi.pangaea.de/10.1594/PANGAEA.937574?format=html#download (https://download.pangaea.de/dataset/937574/files/IBCSO_v2_ice-surface_WGS84.tif) |
Next week works. I'm on Australian Central Time - GMT+9.5 - half an hour behind Michael, basically.
We don't have much overlap in waking hours, but I sent an invite to y'all for Monday evening GMT-4 / Tuesday morning GMT+9.5. No worries if it doesn't work out; the examples shared here will give me plenty to work from. |
That should be ok I think Max. |
To jump in on this - here's a set of GeoTIFFs I'm working on virtualizing: SENTINEL1 Sigma Nought (SIG0) Backscatter at 20 meter resolution. I've been following a manual approach. Note the nonstandard blocking of 5x15000. I've heard tell from coworkers that there may be some images in there with different blocking, but I haven't run any checks to confirm that yet. I do wonder whether this would cause issues with the proposed TIFF reader if true. I admire all the hard work being done here to enable big geodata workflows - huge thanks to everyone involved :) |
This sounds like a scenario where you're better off rechunking your data to align more with your access pattern (i.e. contiguous in time; chunked in space) and then storing it as native (not virtual Zarr). |
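A minimal sketch of that kind of rechunking with xarray and dask (the variable name, array shape, and chunk sizes here are illustrative assumptions, not taken from this thread):

```python
import numpy as np
import xarray as xr

# Hypothetical stack of per-date rasters: one time step per chunk,
# small spatial tiles - roughly how scene-based archives are laid out.
ds = xr.Dataset(
    {"sig0": (("time", "y", "x"), np.zeros((8, 64, 64), dtype="float32"))}
).chunk({"time": 1, "y": 16, "x": 16})

# Rechunk to be contiguous in time and chunked in space,
# matching a "time series at a point" access pattern.
rechunked = ds.chunk({"time": -1, "y": 32, "x": 32})

print(rechunked["sig0"].data.chunksize)  # (8, 32, 32)
# rechunked.to_zarr("sig0_rechunked.zarr")  # then store as native (not virtual) Zarr
```

For PB-scale data you would use a distributed rechunking tool rather than a single `.chunk()` call, but the target layout is the same idea.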
I absolutely agree, and this is where we're slowly headed, but this is a PB-scale dataset that plenty of operational workflows depend on, so it's going to take a while before that's done unfortunately. Virtualizing it would be a really nice in-between step, and it feels so close if only we could represent those tiff chunks. |
@claytharrison that makes sense, and is a pattern I expect to see a lot in the future. First virtualize the data with the original chunks for cloud-optimized but still-suboptimal performance, then later create additional rechunked copies optimized for expected query patterns. |
Tifffile provides access to intra-file chunks. Did you try creating reference files for the individual files (for example via tiff2fsspec) and merging them (assuming image sizes, chunk sizes, data types, compression, etc. match)?
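For anyone wanting to try that route, here is a minimal sketch using tifffile's tiff2fsspec. The file name, target URL, and output path are made-up placeholders; merging many such reference files would then be a separate step (e.g. with kerchunk):

```python
import json
import numpy as np
import tifffile

# Write a tiny tiled TIFF to stand in for a real COG.
tifffile.imwrite("example.tif", np.zeros((64, 64), dtype="uint8"), tile=(32, 32))

# Generate an fsspec/kerchunk-style reference file describing the
# byte ranges of the chunks inside the TIFF. The url argument is where
# the file would live remotely (placeholder bucket here).
tifffile.tiff2fsspec("example.tif", url="s3://my-bucket", out="example.json")

refs = json.load(open("example.json"))
print(len(refs))  # Zarr metadata entries plus per-chunk byte-range references
```

The resulting JSON can be opened through fsspec's reference filesystem; combining references across many files only works if the arrays are homogeneous, as noted above.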
Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables? Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips). |
FYI we can (and it sounds like we should) have multiple TIFF readers for virtualizarr. After #498, implementing one of these readers will be as simple as writing one function like |
Thanks for joining the discussion @cgohlke and for all your work on TIFFFile and imagecodecs!
Probably not, given it's relatively new and motivated primarily by geospatial use-cases, though @kylebarron would know more as the author of |
This is really helpful context, thanks for sharing. I don't expect any of this to be easier with |
+1. I'm currently debating whether it would even be better to develop the
I agree that there's scope for multiple TIFF readers. tifffile has been around longer, is in much wider use, and seems to be very stable, so it's probably the better default to use. But judging from other obstore performance improvements (zarr-developers/zarr-python#1661 (comment)) I think there's still potential in an opt-in async-tiff backend (they each use the same Rust IO code under the hood).
The current API of async-tiff is quite minimal and mostly consists of parsing and then exposing raw IFD metadata to the user. I would say that TIFF-like formats that are not actually TIFF are not in scope. It is able to parse the JPEG tables from the IFD metadata though.
I think this question is less about parsing TIFF and more about how to represent the parsed metadata in virtual references? In that case those questions go back to @maxrjones and @TomNicholas, as I haven't been following that side of the work stream. I've only been working on the low-level format parsing. |
I would be fine with that. But we should have some default TIFF reader somewhere because it's such a widely-used format.
Yes, but mostly it's actually a zarr issue rather than a virtualizarr issue (requiring new codecs or data types for example). It's easy to represent any codec or data type in virtualizarr, iff it is actually supported by zarr. |
Installed virtualizarr half an hour ago, Python 3.11.