incremental writing to a sharded array slower than without sharding. #3014
Comments
hi @iampritishpatil, thanks for this report. First of all, the easiest way to create an array with sharding is to use
As for the questions about sharding performance, I will refer to @normanrz for the official answer but I suspect that our sharding code is not currently optimized for the case where the chunks are not compressed. This could lead to the poor performance you observed.
The documentation makes it sound like a shard is written all at once, rather than the chunks being concurrently written. Is this the case?
https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding
Yes, that is the case. Shards are written when all inner chunks are encoded whereas non-sharded chunks are written as soon as they are encoded.
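To make the consequence of that answer concrete, here is a rough, dependency-free sketch that counts how many times each shard is touched when an array is filled in small pieces versus shard-sized pieces. It does not call Zarr at all. The function name `shard_write_counts` and all sizes are hypothetical, and it assumes that every write landing in a shard causes that whole shard to be written again.

```python
from collections import Counter


def shard_write_counts(n_rows, shard_rows, step):
    """Count how often each shard is touched when writing `step` rows at a time.

    Assumption (not measured): each write that overlaps a shard forces that
    whole shard to be written again.
    """
    counts = Counter()
    for start in range(0, n_rows, step):
        stop = min(start + step, n_rows)
        first = start // shard_rows        # index of the first shard touched
        last = (stop - 1) // shard_rows    # index of the last shard touched
        for shard in range(first, last + 1):
            counts[shard] += 1
    return counts


# Hypothetical sizes: 1000 rows, shards of 250 rows, inner chunks of 50 rows.
chunk_sized = shard_write_counts(1000, 250, 50)   # one write per inner chunk
shard_sized = shard_write_counts(1000, 250, 250)  # one write per shard

print(sum(chunk_sized.values()))  # 20 shard writes in total
print(sum(shard_sized.values()))  # 4 shard writes in total
```

Under that assumption, writing one inner chunk at a time touches each shard five times, while writing one shard at a time touches each shard once.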
Zarr version
v3.0.7
Numcodecs version
Python Version
3.11.12
Operating System
Windows
Installation
uv add zarr
Description
Hi Zarr team,
I’m trying to convert a large .npy file to Zarr v3 format and enable sharding, but I can’t figure out how to correctly set up the ShardingCodec in my script. I’ve attached the full script below. It creates a large random .npy file, then attempts to load it and write it to a Zarr store with chunking and (ideally) sharding enabled.
I’m using Zarr v3, and I saw that sharding is supported via ShardingCodec, but I can’t tell where and how to specify the codec during array creation. I tried importing it and defining a codec instance, but I’m not sure how or where to actually apply it.
Could you advise how to modify this script to properly use sharding? Thanks in advance!
Here's some explanation of what I've tried.
I mustn't try to load the whole array into memory, as the use case will be huge. However, zarr.open_array doesn't let me specify sharding.
If I use zarr.create_array, it is extremely slow with shards but fast without shards.
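For reference, a minimal sketch of the kind of conversion described here, not the attached script: it memory-maps the .npy file with `numpy.load(..., mmap_mode="r")`, creates the array with the `shards` argument of `zarr.create_array`, and copies one shard-sized slab of rows per write. The helper names `shard_aligned_slices` and `npy_to_sharded_zarr`, the paths, and the shapes are hypothetical, and whether shard-aligned writes remove the slowdown is an assumption I have not benchmarked.

```python
def shard_aligned_slices(n_rows, shard_rows):
    """Yield row slices that each cover one shard's worth of rows along axis 0."""
    for start in range(0, n_rows, shard_rows):
        yield slice(start, min(start + shard_rows, n_rows))


def npy_to_sharded_zarr(npy_path, store_path, chunks, shards):
    """Copy a .npy file into a sharded Zarr v3 array without loading it whole."""
    # Imported here so the helper above has no third-party dependencies.
    import numpy as np
    import zarr

    src = np.load(npy_path, mmap_mode="r")  # memory-mapped, not read into RAM
    dst = zarr.create_array(
        store=store_path,
        shape=src.shape,
        dtype=src.dtype,
        chunks=chunks,  # inner chunk shape
        shards=shards,  # shard shape, a multiple of the chunk shape
    )
    # Each assignment spans all trailing axes, so it covers whole shards only.
    for rows in shard_aligned_slices(src.shape[0], shards[0]):
        dst[rows] = src[rows]
    return dst
```

A call might look like `npy_to_sharded_zarr("big.npy", "big.zarr", chunks=(64, 1024), shards=(1024, 1024))`, where both file names and shapes are placeholders.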
update:
I looked into this a bit more, and I think the main issue I notice is that the sharded array gets written to much more slowly than when not using sharding. I believe it should be exactly the opposite.
See attached script
Steps to reproduce
Additional output
Either the slow sharded writes are the bug, or the way I'm enabling sharding in zarr.create_array is wrong. I'm not sure which it is.
Here's my pyproject.toml