incremental writing to a sharded array slower than without sharding. #3014

iampritishpatil · 2025-04-24T14:39:28Z

Zarr version

v3.0.7

Numcodecs version

Python Version

3.11.12

Operating System

Windows

Installation

uv add zarr

Description

Hi Zarr team,

I’m trying to convert a large .npy file to Zarr v3 format and enable sharding, but I can’t figure out how to correctly set up the ShardingCodec in my script. I’ve attached the full script below. It creates a large random .npy file, then attempts to load it and write it to a Zarr store with chunking and (ideally) sharding enabled.

I’m using Zarr v3, and I saw that sharding is supported via ShardingCodec, but I can’t tell where and how to specify the codec during array creation. I tried importing it and defining a codec instance, but I’m not sure how or where to actually apply it.

Could you advise how to modify this script to properly use sharding? Thanks in advance!

Here's some explanation of what I've tried.
I mustn't try to load the whole array into memory, as the use case will be huge. However zarr.open_array doesn't let me specify sharding.
If I use zarr.create_array, it is extremely slow with shards but fast without shards.

update:

I looked into this a bit more, and I think the main issue I notice is that the shared array gets written to much slower than when not using sharding. i belive it should exactly be the opposite.

See attached script

Steps to reproduce

#%%
from pathlib import Path
import numpy as np
import zarr

# Path to folder containing .npy file
folder = Path(r'data')
folder.mkdir(parents=True, exist_ok=True)  # Create folder if it doesn't exist
assert folder.exists(), "Folder does not exist"
filename = folder / 'test.npy'
# create a big random npy file.
shape = (2**12, 2**9, 2**9)
#%%
fp = np.lib.format.open_memmap(
    filename,
    mode='w+',
    dtype=np.int16,
    shape=shape
)

for i in range(shape[0]):
    fp[i,:,:] = (np.random.random(shape[1:])*2**14).astype(np.int16)
    print(i)
#%%
fp.flush()


#%%
#restart now for the next part.
#%%
from pathlib import Path
import numpy as np
import zarr

# Path to folder containing .npy file
folder = Path(r'data')
folder.mkdir(parents=True, exist_ok=True)  # Create folder if it doesn't exist
assert folder.exists(), "Folder does not exist"
filename = folder / 'test.npy'

#%%
fp=np.load(filename)
# npy_file = next(folder.glob('*.npy'))
# print(f"Using file: {npy_file}")

# Load using memory mapping
# data = np.load(npy_file, mmap_mode='r')
print(f"Data shape: {fp.shape}, dtype: {fp.dtype}")

# Define chunk size
chunk_size = 128

# Create Zarr array

# Create Zarr array with sharding
from zarr.codecs import ShardingCodec

chunk_size = 128
shard_depth = chunk_size * 8

# Define the sharding codec
from zarr.codecs import ShardingCodec, ZstdCodec

chunk_size = 128
shard_depth = chunk_size * 4


z = zarr.open_array(
    store=folder / 'test.zarr',
    mode='w',
    shape=fp.shape,
    chunks=(chunk_size, chunk_size, chunk_size),
    dtype=fp.dtype,
    zarr_version=3,
)

# z = zarr.create_array(
#     store=folder / 'test.zarr',
#     # mode='w',
#     shape=fp.shape,
#     chunks=(chunk_size, chunk_size, chunk_size),
#     shards=(shard_depth, shard_depth, shard_depth),
#     dtype=fp.dtype,
#     overwrite=True,
# )

# Write data in chunks
for i in range(0, fp.shape[0], shard_depth):
    z[i:i+shard_depth] = fp[i:i+shard_depth]
    print(f"Wrote slice {i} to {i+shard_depth}")
#last chunk
z[i:] = fp[i:]
print("Zarr conversion complete.")
# %%

Additional output

Either sharding being slow is the bug, or enabling sharding in create array. Not sure what is the correct thing.

Here's my pyproject.toml

[project]
name = "some-name"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "ipykernel>=6.29.5",
    "matplotlib>=3.10.1",
    "numpy>=2.2.5",
    "pydantic>=2.11.3",
    "scipy>=1.15.2",
    "zarr>=2.18.3",
]

The text was updated successfully, but these errors were encountered:

d-v-b · 2025-04-24T16:00:15Z

hi @iampritishpatil, thanks for this report. First of all, the easiest way to create an array with sharding is to use create_array, which I see you tried. open_array uses a slightly different API which is less convenient for sharding (because it was designed for the previous version of the zarr format).

As for the questions about sharding performance, I will refer to @normanrz for the official answer but I suspect that our sharding code is not currently optimized for the case where the chunks are not compressed. This could lead to the poor performance you observed.

rbavery · 2025-05-07T21:30:50Z

The documentation makes it sounds like a shard is written all at once, rather than the chunks being concurrently written. Is this the case?

With sharding, you could use a shard size of 1 GB. This would result in 1000 chunks per file and 100 files in total, which seems manageable for most storage systems. You would still be able to read each 1 MB chunk independently, but you would need to write your data in 1 GB increments.

https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding

normanrz · 2025-05-13T11:27:16Z

The documentation makes it sounds like a shard is written all at once, rather than the chunks being concurrently written. Is this the case?

With sharding, you could use a shard size of 1 GB. This would result in 1000 chunks per file and 100 files in total, which seems manageable for most storage systems. You would still be able to read each 1 MB chunk independently, but you would need to write your data in 1 GB increments.

https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding

Yes, that is the case. Shards are written when all inner chunks are encoded whereas non-sharded chunks are written as soon as they are encoded.

iampritishpatil added the bug Potential issues with the zarr-python library label Apr 24, 2025

iampritishpatil changed the title ~~Unable to shard local storage array without compression.~~ incremental writing to a sharded array slower than without sharding. Apr 24, 2025

This was referenced May 1, 2025

Monthly issue metrics report #3030

Open

Monthly issue metrics report sanketverma1704/zarr-python#10

Open

dstansby added the performance Potential issues with Zarr performance (I/O, memory, etc.) label May 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

incremental writing to a sharded array slower than without sharding. #3014

incremental writing to a sharded array slower than without sharding. #3014

iampritishpatil commented Apr 24, 2025 •

edited

Loading

d-v-b commented Apr 24, 2025

Uh oh!

rbavery commented May 7, 2025 •

edited

Loading

Uh oh!

normanrz commented May 13, 2025

Uh oh!

Uh oh!

incremental writing to a sharded array slower than without sharding. #3014

incremental writing to a sharded array slower than without sharding. #3014

Comments

iampritishpatil commented Apr 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Zarr version

Numcodecs version

Python Version

Operating System

Installation

Description

update:

Steps to reproduce

Additional output

d-v-b commented Apr 24, 2025

Uh oh!

rbavery commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

normanrz commented May 13, 2025

Uh oh!

iampritishpatil commented Apr 24, 2025 •

edited

Loading

rbavery commented May 7, 2025 •

edited

Loading