Extreme Slowness/Timeouts with Large Dataset (Chunked, Multi-Band Zarr Store) #3085
Comments
Hi @mickyals, two follow-up questions:
This code does not work for me with Zarr Python 3.0.8, for several reasons. What version of Zarr are you using? Most critically, this code seems impossible:

```python
for i in range(0, shape[0], chunks[0]):
    band[i:i+chunks[0]] = np.random.randint(0, 65535, (chunks[0],) + chunks[1:], dtype="uint16")
```

You're trying to assign to a region of shape (24, 8133, 8130) with an array of shape (24, 512, 512). I would also not bother using Zstd together with PCodec. Just PCodec is enough; Zstd does not buy you any additional compression.
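A corrected version of that loop (a sketch, assuming `band` is the target Zarr array and `shape`/`chunks` are defined as above) would fill each time-slab across the full spatial extent:

```python
# The right-hand side must span the whole spatial extent, i.e.
# (slab_len,) + shape[1:], not (chunks[0],) + chunks[1:].
for i in range(0, shape[0], chunks[0]):
    stop = min(i + chunks[0], shape[0])  # the last slab may be partial
    band[i:stop] = np.random.randint(
        0, 65535, (stop - i,) + shape[1:], dtype="uint16"
    )
```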
Honestly, sorry about that. The full pipeline is very long, so I tried to summarise the ~400 lines into a simpler method with DeepSeek to make it easier for others to understand, as I'm not great at articulating issues. Here is the version from my own code. Some variables are not shown at initialization but are mentioned in the code snippet; their absence should not affect the overall logical flow when reading the code.
I suspect the issue stems from the size of the dataset and the chunking selection. I opted for 24 in the time dimension, since higher values resulted in less efficient compression, and 512x512 for lat and lon, since our experiments showed the spatial dimensions impacted compression the least. The initial focus was simply on not ballooning the size of the GOES NetCDFs when regridding from the geostationary projection to lat/lon grids. I'd like to make this data publicly available, given that GOES data currently comes as discrete .nc files, but this is hindered by the slowness of operations on the data.
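For context on why this chunking hurts time-slice access, here is a back-of-the-envelope calculation using the shapes from this thread:

```python
import math

shape = (8741, 8133, 8130)   # (t, lat, lon) from the issue
chunks = (24, 512, 512)

# A single-time-step plot touches every spatial chunk once:
spatial_chunks = math.ceil(shape[1] / chunks[1]) * math.ceil(shape[2] / chunks[2])
print(spatial_chunks)        # 16 * 16 = 256 chunks

# and each of those chunks stores 24 time steps, so roughly 24x more
# data must be read and decompressed than the one slice actually needs.
print(chunks[0])             # 24
```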
Also, the compression combination used proved to significantly reduce the overall storage footprint: without PCodec the size balloons, and the same happens, though to a lesser extent, without Zstd. This specific combination was the only one I found that kept the final dataset size comparable to the original combined .nc file size. Again, really sorry if I confused things more; I'm working mostly on my own on this project and have been using Zarr for less than 3 months. Let me know if more detail is needed @ianhi @rabernat
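One way to settle whether Zstd adds anything on top of PCodec is to measure both on a representative chunk. A minimal sketch using the plain numcodecs API (the smooth synthetic field is a stand-in for real imagery, not data from the pipeline):

```python
import numpy as np
from numcodecs import Zstd
from numcodecs.pcodec import PCodec  # needs numcodecs[pcodec]

# A smooth synthetic field; closer to imagery statistics than pure noise.
x = np.linspace(0, 20, 512)
field = (np.outer(np.sin(x), np.cos(x)) * 20000 + 32768).astype("uint16")
chunk = np.broadcast_to(field, (24, 512, 512)).copy()

pco = PCodec().encode(chunk)
pco_zstd = Zstd(level=5).encode(np.frombuffer(pco, dtype="uint8"))
print(f"raw={chunk.nbytes}  pcodec={len(pco)}  pcodec+zstd={len(pco_zstd)}")
```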
@mickyals It's quite difficult to help if we aren't able to run the code. That means that, in the example, every variable the code uses needs to be defined. Please also check that it runs on its own and reproduces the error. A great way to do this is to declare all the script's dependencies according to PEP 723 (https://peps.python.org/pep-0723/#example). See also: https://stackoverflow.com/help/minimal-reproducible-example

In your smaller code snippet I can open the zarr (which is very helpful for debugging!), but the

Please make sure that the script runs on its own without other files, does so from a fresh Python environment, and demonstrates the slowness. Then we can help narrow down the issue.
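For reference, a PEP 723 header for this thread's environment might look like the following (version pins taken from the issue body; a sketch, not a tested lockfile):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr==3.0.6",
#     "numcodecs[pcodec]==0.15.1",
#     "xarray==2025.3.1",
#     "dask==2024.2.0",
# ]
# ///
```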
Thank you so much for that. Below is code to generate a dummy version of another dataset where I encountered a relatively drastic difference in execution times, GeoTIFF vs Zarr in this case. The original dataset, WorldClim 2.1, comes as discrete GeoTIFFs.
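A minimal sketch of such a generator, writing a dummy WorldClim-like store with xarray (variable name, grid size, and chunking are assumptions, not the actual pipeline):

```python
import numpy as np
import xarray as xr

# Dummy monthly climatology on a coarse grid; the real WorldClim 2.1
# grids are far larger.
months, lats, lons = 12, 2048, 4096
ds = xr.Dataset(
    {
        "tavg": (
            ("month", "lat", "lon"),
            np.random.randint(0, 65535, (months, lats, lons), dtype="uint16"),
        )
    },
    coords={
        "month": np.arange(1, months + 1),
        "lat": np.linspace(-90, 90, lats),
        "lon": np.linspace(-180, 180, lons),
    },
)
ds.to_zarr(
    "dummy_worldclim.zarr",
    mode="w",
    encoding={"tavg": {"chunks": (12, 512, 512)}},
)
```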
The above code converts the discrete files fine and is not the issue here. The issue arises, for example, when attempting to plot variables from the newly created Zarr store.
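A read of roughly this shape (hypothetical store path and variable name, matching the dummy generator above) is where the slowdown shows up:

```python
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.open_zarr("dummy_worldclim.zarr")
ds["tavg"].isel(month=0).plot()  # this render is the slow step
plt.show()
```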
The question, really, is whether there is anything on the Zarr end of my pipeline that I can do to improve the speed of downstream operations. Rendering the plots takes the longest, but execution is still pretty slow even after attempting precomputation to speed up other tasks.
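One common mitigation, if time-slice plotting is the dominant access pattern, is to write a second copy rechunked for that pattern. A sketch with xarray (store paths and chunk sizes are assumptions; the encoding reset is needed because the opened dataset remembers its on-disk chunks):

```python
import xarray as xr

ds = xr.open_zarr("dummy_worldclim.zarr")

# Rechunk so a single month maps to one (or a few) chunk(s) on disk.
plot_ds = ds.chunk({"month": 1, "lat": 2048, "lon": 2048})

# Drop the chunk encoding inherited from the source store; otherwise
# to_zarr tries to reuse the old (12, 512, 512) spec and errors out.
for var in plot_ds.variables:
    plot_ds[var].encoding.pop("chunks", None)

plot_ds.to_zarr("dummy_worldclim_plot.zarr", mode="w")
```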
From the GOES dataset, which can be recreated from the above code by changing
Zarr version
v3.0.6
Environment
OS: Linux Ubuntu
Python: 3.11.7
Dependencies: numcodecs[pcodec]==0.15.1, xarray==2025.3.1, dask==2024.2.0
Description
A Zarr store containing GOES-16 satellite data (dimensions: t:8741, lat:8133, lon:8130, chunks (24, 512, 512)) exhibits severe performance issues.
The final dataset of all 2023 GOES imagery is about 11 TB, with 6 more years of data left to process.
Steps to Reproduce
The minimal code logic for reproduction is below
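A scaled-down, runnable stand-in with the same dtype and chunking (store path and array name are assumptions; the slab loop includes the shape fix discussed in the comments above):

```python
import numpy as np
import zarr

shape = (48, 1024, 1024)   # scaled-down stand-in for (8741, 8133, 8130)
chunks = (24, 512, 512)    # chunking from the report

band = zarr.create_array(
    store="dummy_goes.zarr",
    shape=shape,
    chunks=chunks,
    dtype="uint16",
    overwrite=True,
)

# Fill one full time-slab at a time; each slab spans the whole
# spatial extent, and the last slab may be partial.
for i in range(0, shape[0], chunks[0]):
    stop = min(i + chunks[0], shape[0])
    band[i:stop] = np.random.randint(
        0, 65535, (stop - i,) + shape[1:], dtype="uint16"
    )
```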
Questions for Zarr Devs
Additional Context
Full Dataset creation code: https://github.com/mickyals/goes2zarr/blob/main/convert_goes_to_zarr.py