
refactor v3 data types #2874


Open
wants to merge 129 commits into base: main

Conversation

@d-v-b (Contributor) commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more of them. Accordingly, this PR refactors data types for the zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, confining the v2 / v3 differences to the places where they vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes is probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
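As a rough sketch of the idea (the names here are illustrative, not the PR's actual API): a parametric dtype becomes a class whose instances carry the parameters, and each class owns its default value and JSON conversion:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedLengthUnicode:
    """Hypothetical wrapper for numpy's parametric "U" dtype family."""

    length: int  # the parameter: "U2", "U3", ... are instances of one class

    @property
    def name(self) -> str:
        # the name that would feed into the zarr metadata
        return f"U{self.length}"

    def default_value(self) -> str:
        # each dtype class defines its own default fill value
        return ""

    def to_json_value(self, value: str) -> str:
        # and its own scalar -> JSON conversion
        return str(value)
```

Under this model, `FixedLengthUnicode(3)` plays the role that the bare string "U3" plays in main.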

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Feb 28, 2025

d-v-b commented Feb 28, 2025

Copying a comment @nenb made in this zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc, as instances of the same class. If instead we bind a concrete numpy dtype as class attributes, we need a separate class for U2, U3, etc, which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
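Concretely (plain numpy behavior, not PR code), the character code names the parametric type while each instance carries its own length:

```python
import numpy as np

# "U2" and "U3" are distinct dtype instances that share the character
# code "U"; the code identifies the parametric family, the instance
# carries the length parameter.
u2 = np.dtype("U2")
u3 = np.dtype("U3")

assert u2.char == "U" and u3.char == "U"
assert u2.itemsize == 8   # 2 code points * 4 bytes (UCS-4)
assert u3.itemsize == 12  # 3 code points * 4 bytes
```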

name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]] # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
Contributor Author

Child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (which gets assigned automatically from the generic type parameter), a string kind (we use this for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that matches dtype to default value, but we could potentially do that).

Comment on lines 268 to 283
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
kind = "numeric"

@classmethod
def from_dtype(cls, dtype: TDType) -> Self:
return cls()

def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
return int(data)

def from_json_value(
self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
) -> TScalar:
if check_json_int(data):
return self.to_dtype(endianness=endianness).type(data)
raise TypeError(f"Invalid type: {data}. Expected an integer.")
Contributor Author

I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
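A stand-alone analogue of that pattern (hypothetical names, not the PR's classes): the base holds the shared JSON logic, and each subclass only binds its concrete numpy dtype:

```python
import numpy as np


class IntBase:
    """Shared to/from JSON logic for all integer dtypes."""

    np_dtype: np.dtype  # bound by each subclass

    def to_json_value(self, data) -> int:
        return int(data)

    def from_json_value(self, data):
        # reject bools explicitly: bool is a subclass of int in Python
        if isinstance(data, int) and not isinstance(data, bool):
            return self.np_dtype.type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")


# subclasses differ only in the concrete dtype they bind
class Int8(IntBase):
    np_dtype = np.dtype("int8")


class UInt16(IntBase):
    np_dtype = np.dtype("uint16")
```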


nenb commented Mar 3, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but won't be part of the dtype on disk (as required by the spec).


d-v-b commented Mar 4, 2025


Thanks for the summary! I have implemented the proposed solution.

@d-v-b mentioned this pull request Mar 5, 2025

d-v-b commented May 17, 2025

Have not reviewed everything but commenting early to get this part out:

I think that the float to/from JSON infrastructure for fill values is not complete. As it stands, if xarray were to transition to using from_json_value on the dtype to read fill values on loading, it will not be able to read zarr files it wrote with prior versions of zarr-python, because of how those versions encoded float fill values.

Thanks for checking this! Can you give me instructions for replicating the failing xarray tests?


ianhi commented May 17, 2025

Install this branch of xarray: https://github.com/ianhi/xarray/tree/zarr-dtype-tests where I've swapped the reading of fill values to use the from_json_values with this commit: ianhi/xarray@70a7e8b

I didn't swap the writing of the values yet, to simulate data written with xarray in the past. Then run this test:

pytest "xarray/tests/test_backends.py::TestZarrRegionAuto::test_zarr_region[3]"

which should give:

self = <xarray.tests.test_backends.TestZarrRegionAuto object at 0x10adbb110>

    def test_zarr_region(self):
        with self.create() as (target, ds):
            ds_transposed = ds.transpose("y", "x")
            ds_region = 1 + ds_transposed.isel(x=[0], y=[0])
>           self.save(target, ds_region, region={"x": slice(0, 1), "y": slice(0, 1)})

/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:6428:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:6353: in save
    ds.to_zarr(target, **kwargs)
/Users/ian/Documents/dev/xarray/xarray/core/dataset.py:2278: in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
/Users/ian/Documents/dev/xarray/xarray/backends/api.py:2239: in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
/Users/ian/Documents/dev/xarray/xarray/backends/api.py:1975: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:1027: in store
    k: self.open_store_variable(name=k) for k in existing_variable_names
/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:892: in open_store_variable
    zarr_array.metadata.data_type.from_json_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Float64(endianness='little'), data = 'AAAAAAAA+H8='

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> TFloatScalar_co:
        """
        Read a JSON-serializable value as a numpy float.

        Parameters
        ----------
        data : JSON
            The JSON-serializable value.
        zarr_format : ZarrFormat
            The zarr format version.

        Returns
        -------
        TScalar_co
            The numpy float.
        """
        if check_json_float(data, zarr_format=zarr_format):
            return self._cast_value_unsafe(float_from_json(data, zarr_format=zarr_format))
>       raise TypeError(
            f"Invalid type: {data}. Expected a float or a special string encoding of a float."
        )
E       TypeError: Invalid type: AAAAAAAA+H8=. Expected a float or a special string encoding of a float.


ianhi commented May 17, 2025

I didn't get that quite right. What xarray is doing is storing its own attribute "_FillValue" separately from the zarr fill value and is storing it as a base64 encoded string https://github.com/pydata/xarray/blob/7551a7a4fcad82a27b3701cfa0af85f74404a45c/xarray/backends/zarr.py#L135

So ignore that test of xarray for the sake of this PR. Though it was helpful in leading me to that part of the spec. The complete spec still requires being able to specify fill values as

the byte representation of the floating point number as an unsigned integer. For example, for float32, "NaN" is equivalent to "0x7fc00000". This representation is the only way to specify a NaN value other than the specific NaN value denoted by "NaN".

edit:
The endianness of encoding was discussed here: zarr-developers/zarr-specs#279

The conclusion is that using big-endian as the argument to encode is the correct choice.


What's confusing to me here is that it seems like the spec does not explicitly call out the endianness of the encoded float in the metadata. Though from the example for NaN it seems to implicitly state big-endian encoding is used for the float as fill value.

import struct
import numpy as np
print(struct.pack(">f",np.nan).hex())
print(struct.pack("<f",np.nan).hex())

gives

7fc00000
0000c07f

helpful link: https://docs.python.org/3/library/struct.html#format-characters
format strings for struct:

float16: e
float32: f
float64: d
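A sketch of what the byte-representation form implies, assuming big-endian byte order per the spec discussion above (float32_from_hex is a hypothetical helper, not zarr-python API):

```python
import math
import struct


def float32_from_hex(s: str) -> float:
    """Decode a fill value like "0x7fc00000" into a float32 value."""
    raw = int(s, 16).to_bytes(4, byteorder="big")
    return struct.unpack(">f", raw)[0]


# "NaN" and "0x7fc00000" denote the same float32 fill value:
assert math.isnan(float32_from_hex("0x7fc00000"))

# Round-trip: packing NaN big-endian reproduces the hex digits.
assert struct.pack(">f", math.nan).hex() == "7fc00000"
```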


d-v-b commented May 19, 2025

Thanks for the detective work @ianhi! I added the v3-specific float-as-hex-string functionality in e4a0372

@nenb left a comment

I took another look.

I left a bunch of nits, and some minor comments that could be addressed in follow-up PRs (if you agree that they are worth following up on).

I certainly won't block merging this, though; it has already received a lot of eyes and has had a lot of work put into it. Thank you for all of this work! 🙏

"""


def _collect_entrypoints() -> list[Registry[Any]]:
"""
Collects codecs, pipelines, buffers and ndbuffers from entrypoints.
Collects codecs, pipelines, dtypes, buffers and ndbuffers from entrypoints.

Nit: _collect_entrypoints appears to be designed to return all Registrys. The dtype registry is not included in the returned value though. Could this be confusing? It appears that the return value of _collect_entrypoints is not used anywhere, and so another alternative could be just to return nothing.

Contributor Author

I am at baseline confused by entrypoints. Thanks for flagging this; hopefully I can sort it out.

# Note: this check is numpy-dtype-specific
# For single-byte (e.g., uint8) or 0-byte (e.g., S0) dtypes,
# endianness does not apply.
if array_spec.dtype.to_dtype().itemsize < 2:

I think to be safe, we should also have this on the dtype interface. WDYT?

Again, if you agree, this could be a follow-up PR.

Contributor Author

I think it's a good idea to put this on the dtype interface. We will need to distinguish between data types with a variable size per element, data types with a static size per array element (i.e., numpy's itemsize), container data types with a static length (e.g., np.dtype('U10')), and container data types with a variable length. I don't think this should be too hard.
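On the numpy side, itemsize already separates some of these categories (illustrative numpy behavior, not the proposed interface):

```python
import numpy as np

assert np.dtype("uint8").itemsize == 1    # single byte: endianness is moot
assert np.dtype("float64").itemsize == 8  # static size per element
assert np.dtype("U10").itemsize == 40     # static-length container: 10 UCS-4 code points

# variable-length data sits behind a pointer, so itemsize reports the
# pointer size rather than the payload size:
assert np.dtype("O").itemsize in (4, 8)
```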

Contributor Author

See 8c90d2c

shape = parse_shapelike(shape)

if chunks is not None and chunk_shape is not None:
raise ValueError("Only one of chunk_shape or chunks can be provided.")

if chunks:
_chunks = normalize_chunks(chunks, shape, dtype_parsed.itemsize)
_chunks = normalize_chunks(chunks, shape, dtype_parsed.to_dtype().itemsize)

Same comment as earlier on moving itemsize to dtype interface in future PR.

Contributor Author

I went ahead and did it here: 8c90d2c

dtype = parse_dtype(dtype, zarr_format=2)

# inject VLenUTF8 for str dtype if not already present
if np.issubdtype(dtype, np.str_):

Just want to confirm that it was intentional to drop this logic updating the filters for v2?

Contributor Author

Yes, this was intentional, but thank you for flagging it because it still needs some polish.

# these categories are keys in a configuration dictionary.
# it is not a part of the ZDType class because these categories are more of an implementation detail
# of our config system rather than a useful attribute of any particular data type.
DTypeCategory = Literal["variable-length-string", "default"]

Nit: Should this just be 'variable-length' rather than 'variable-length-string'?

Contributor Author

Variable-length strings get encoded with the vlen-utf8 codec, which is string-specific. I don't think we would want to use that codec for arbitrary variable-length data types.


(Think this is just user-error on my part but):

How are we dealing with variable length bytes (which were previously supported AFAIU)? These were in the old version of the code, I presume that they are still supported somehow?

def get(self, key: str) -> type[ZDType[TBaseDType, TBaseScalar]]:
return self.contents[key]

def match_dtype(self, dtype: TBaseDType) -> ZDType[TBaseDType, TBaseScalar]:
Contributor

Echoing the docstrings comment here. I was trying to register a dtype and ran into an error here and spent a few minutes confused about why it was done this way until I remembered the issue with exhaustively listing dtypes.

One question here: could it be worth trying a key lookup first, then falling back to iterating over items?

Contributor Author

I personally prefer key-value lookup. The big problem here is the (cursed) numpy object dtype, which potentially resolves to multiple zarr dtypes. I opened an issue to talk about this particular problem here.

Contributor

Makes sense. All I'm saying is that, in the absence of a resolution of #3077, adding a comment explaining that will be very helpful to someone in the future.

@@ -136,7 +136,7 @@ def to_dict(self) -> dict[str, JSON]:
}

def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
dtype = array_spec.dtype
dtype = array_spec.dtype.to_dtype()
Contributor

Or maybe as_native_dtype

Comment on lines +154 to +155
class Float16(BaseFloat[np.dtypes.Float16DType, np.float16]):
dtype_cls = np.dtypes.Float16DType
Contributor

@d-v-b how would I specify the dtype_cls for a non-native numpy dtype that has been registered with numpy? I'm thinking specifically of these: https://github.com/jax-ml/ml_dtypes?tab=readme-ov-file#ml_dtypes

For context, I tried to make a very minimal implementation of bfloat16 (https://github.com/zarr-developers/zarr-extensions/tree/main/data-types/bfloat16) to better understand the mechanisms at work here. To keep it minimal I tried to inherit from BaseFloat, but I couldn't figure out a version of dtype_cls that would not throw errors when to_dtype was invoked.

from dataclasses import dataclass

import ml_dtypes
import numpy as np
from ml_dtypes import bfloat16

import zarr
from zarr.core.dtype import data_type_registry
from zarr.core.dtype.npy.float import BaseFloat


@dataclass(frozen=True, kw_only=True, slots=True)
class Zbfloat16(BaseFloat):
    """
    Extremely minimal wrapper for bfloat16
    """

    _zarr_v3_name = "bfloat16"
    dtype_cls = np.dtype("bfloat16")
    
    # Directly using the ml_dtypes class here does not
    # avoid the error
    dtype_cls = bfloat16

    def to_dtype(self) -> bfloat16:
        return np.dtype("bfloat16")


import zarr

store = zarr.storage.MemoryStore()


# surprised that this works before registering?
# registering is only to pick up the string based names?
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)


# register
data_type_registry.register("bfloat16", Zbfloat16)

# string based access breaks
# due to something with `to_dtype` not working within `match_dtype`
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype="bfloat16")

Gives a long error message that ultimately ends with:

--> 134 return data_type_registry.match_dtype(na_dtype)

File [~/Documents/dev/zarr-python/src/zarr/core/dtype/registry.py:46](http://localhost:8888/lab/tree/tmp/~/Documents/dev/zarr-python/src/zarr/core/dtype/registry.py#line=45), in DataTypeRegistry.match_dtype(self, dtype)
     44     except DataTypeValidationError:
     45         pass
---> 46 raise ValueError(f"No data type wrapper found that matches dtype '{dtype}'")

ValueError: No data type wrapper found that matches dtype 'bfloat16'

My approach here is based on looking at the examples of the numpy dtypes, and I think it would be pretty representative of what someone might try when implementing their own. From that, I'm concluding that a short example showing bfloat16 with some explanatory comments would go a really long way towards good documentation of how to write a custom dtype.

Contributor Author
@d-v-b May 21, 2025

Thanks so much for working through this, and I totally agree that it would be a great part of the documentation. I am 99% sure your problems were due to this line:

    dtype_cls = np.dtype("bfloat16")

dtype_cls should be bound to the class that creates the dtype. You are binding it to a concrete dtype. The whole idea of ZDType is to allow zarr-python to abstract over the process of creating variants of the same dtype class (e.g., big-endian float64 or little-endian float64). That means we can't associate a specific dtype instance with a ZDType, we have to use the dtype class instead. So in your case it should be:

    dtype_cls = type(np.dtype("bfloat16"))

This is necessary because (annoyingly) the ml_dtypes package doesn't provide a stand-alone bfloat16 class.

I will double-check that this works later today when I'm not on a train. And in any case, this absolutely needs to be documented better. Clearly the attribute name dtype_cls doesn't convey the idea clearly enough.
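The class-vs-instance distinction can be seen with float64, so the snippet doesn't need ml_dtypes (note: the np.dtypes classes assume a recent numpy, >= 1.25):

```python
import numpy as np

concrete = np.dtype("float64")  # a dtype *instance*
dtype_cls = type(concrete)      # the dtype *class* (np.dtypes.Float64DType)

# zarr-python needs the class so it can construct fresh instances,
# e.g. different byte-order variants of the same logical dtype:
assert dtype_cls() == concrete
```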

addendum:
this line was also problematic:

z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)

You are passing Zbfloat16 in as a class, but it should be an instance (Zbfloat16())

Unfortunately my branch is happy with the invalid input and somehow matched it to the variable length string dtype! so definitely an interesting bug.

Contributor Author
@d-v-b May 21, 2025

here is an example that works:

# /// script
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/fixed-length-strings",
#   "ml_dtypes"
# ]
# ///

import ml_dtypes
import numpy as np
from ml_dtypes import bfloat16

import zarr
from zarr.core.dtype import data_type_registry
from zarr.core.dtype.npy.float import BaseFloat
from dataclasses import dataclass

@dataclass(frozen=True, kw_only=True, slots=True)
class Zbfloat16(BaseFloat):
    """
    Extremely minimal wrapper for bfloat16
    """

    _zarr_v3_name = "bfloat16"
    dtype_cls = type(np.dtype("bfloat16"))

    def to_dtype(self) -> bfloat16:
        return np.dtype("bfloat16")


import zarr

store = zarr.storage.MemoryStore()
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16())
z[:] = 1.0
print(z[:])
# [1 1 1 1 1]
print(z[:].dtype)
# bfloat16

# register
data_type_registry.register("bfloat16", Zbfloat16)

z2 = zarr.open_array(store=store, mode='r')
print(z2[:])
# [1 1 1 1 1]
print(z2[:].dtype)
# bfloat16

Contributor Author
@d-v-b May 21, 2025

To address this comment:

surprised that this works before registering?
registering is only to pick up the string based names?

Right, we need to infer a zdtype from a string (or dict) when reading dtype metadata from JSON. In the other direction, we have to infer a zdtype from a python object (like a numpy dtype) that represents an in-memory dtype. At the moment you can bypass the in-memory-object -> zdtype path by providing a zdtype instance explicitly. We could also offer a similar bypass for reading data from disk, but I haven't implemented that yet.

Contributor

Thanks for the example! Very helpful

this line was also problematic:

z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)

You are passing Zbfloat16 in as a class, but it should be an instance (Zbfloat16())

Unfortunately my branch is happy with the invalid input and somehow matched it to the variable length string dtype! so definitely an interesting bug.

On the wrapper, from_dtype is defined as a classmethod, which is why this works.

This is necessary because (annoyingly) the ml_dtypes package doesn't provide a stand-alone bfloat16 class.

Ok I thought that might be the case

idea is to allow zarr-python to abstract over the process of creating variants of the same dtype class (e.g., big-endian float64 or little-endian float64)

Notably, the ml_dtypes are exclusively little-endian. That's why I rewrote to_dtype to avoid the endianness handling in BaseFloat.

I totally agree that it would be a great part of the documentation.

As an easy step in this direction, I suggest taking what you wrote in this thread and putting it into the docs, maybe with a small caveat that the typing is not implemented and that this shouldn't inherit from HasEndianness.


self.lazy_load_list.clear()

def register(self: Self, key: str, cls: type[ZDType[TBaseDType, TBaseScalar]]) -> None:
Contributor

I think it makes sense to add an unregister method as well.


ilan-gold commented May 22, 2025

@d-v-b I have a little bug report here on this PR (which may be intentional):

import zarr
z = zarr.open("foo.zarr", zarr_format=2)
import numcodecs
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()]) 

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/_compat.py:61, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     51 warnings.warn(
     52     (
     53         f"Pass {formatted_args_msg} as keyword args. From version "
   (...)     58     stacklevel=2,
     59 )
     60 kwargs.update(zip(sig.parameters, args, strict=False))
---> 61 return f(**kwargs)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:2477, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   2382 """Create an array within this group.
   2383 
   2384 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   2471 AsyncArray
   2472 """
   2473 compressors = _parse_deprecated_compressor(
   2474     compressor, compressors, zarr_format=self.metadata.zarr_format
   2475 )
   2476 return Array(
-> 2477     self._sync(
   2478         self._async_group.create_array(
   2479             name=name,
   2480             shape=shape,
   2481             dtype=dtype,
   2482             chunks=chunks,
   2483             shards=shards,
   2484             fill_value=fill_value,
   2485             attributes=attributes,
   2486             chunk_key_encoding=chunk_key_encoding,
   2487             compressors=compressors,
   2488             serializer=serializer,
   2489             dimension_names=dimension_names,
   2490             order=order,
   2491             filters=filters,
   2492             overwrite=overwrite,
   2493             storage_options=storage_options,
   2494             config=config,
   2495         )
   2496     )
   2497 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
    205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    206     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    207     # this should allow us to better type the sync wrapper
--> 208     return sync(
    209         coroutine,
    210         timeout=config.get("async.timeout"),
    211     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1102, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   1007 """Create an array within this group.
   1008 
   1009 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   1097 
   1098 """
   1099 compressors = _parse_deprecated_compressor(
   1100     compressor, compressors, zarr_format=self.metadata.zarr_format
   1101 )
-> 1102 return await create_array(
   1103     store=self.store_path,
   1104     name=name,
   1105     shape=shape,
   1106     dtype=dtype,
   1107     chunks=chunks,
   1108     shards=shards,
   1109     filters=filters,
   1110     compressors=compressors,
   1111     serializer=serializer,
   1112     fill_value=fill_value,
   1113     order=order,
   1114     zarr_format=self.metadata.zarr_format,
   1115     attributes=attributes,
   1116     chunk_key_encoding=chunk_key_encoding,
   1117     dimension_names=dimension_names,
   1118     storage_options=storage_options,
   1119     overwrite=overwrite,
   1120     config=config,
   1121 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:4466, in create_array(store, name, shape, dtype, data, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config, write_data)
   4461 mode: Literal["a"] = "a"
   4463 store_path = await make_store_path(
   4464     store, path=name, mode=mode, storage_options=storage_options
   4465 )
-> 4466 return await init_array(
   4467     store_path=store_path,
   4468     shape=shape_parsed,
   4469     dtype=dtype_parsed,
   4470     chunks=chunks,
   4471     shards=shards,
   4472     filters=filters,
   4473     compressors=compressors,
   4474     serializer=serializer,
   4475     fill_value=fill_value,
   4476     order=order,
   4477     zarr_format=zarr_format,
   4478     attributes=attributes,
   4479     chunk_key_encoding=chunk_key_encoding,
   4480     dimension_names=dimension_names,
   4481     overwrite=overwrite,
   4482     config=config,
   4483 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:4240, in init_array(store_path, shape, dtype, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, overwrite, config)
   4237     else:
   4238         order_parsed = order
-> 4240     meta = AsyncArray._create_metadata_v2(
   4241         shape=shape_parsed,
   4242         dtype=dtype_wrapped,
   4243         chunks=chunk_shape_parsed,
   4244         dimension_separator=chunk_key_encoding_parsed.separator,
   4245         fill_value=fill_value,
   4246         order=order_parsed,
   4247         filters=filters_parsed,
   4248         compressor=compressor_parsed,
   4249         attributes=attributes,
   4250     )
   4251 else:
   4252     array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
   4253         compressors=compressors,
   4254         filters=filters,
   4255         serializer=serializer,
   4256         dtype=dtype_wrapped,
   4257     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:783, in AsyncArray._create_metadata_v2(shape, dtype, chunks, order, dimension_separator, fill_value, filters, compressor, attributes)
    781 if fill_value is None:
    782     fill_value = dtype.default_value()  # type: ignore[assignment]
--> 783 return ArrayV2Metadata(
    784     shape=shape,
    785     dtype=dtype,
    786     chunks=chunks,
    787     order=order,
    788     dimension_separator=dimension_separator,
    789     fill_value=fill_value,
    790     compressor=compressor,
    791     filters=filters,
    792     attributes=attributes,
    793 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/metadata/v2.py:89, in ArrayV2Metadata.__init__(self, shape, dtype, chunks, fill_value, order, dimension_separator, compressor, filters, attributes)
     87 fill_value_parsed: TBaseScalar | None
     88 if fill_value is not None:
---> 89     fill_value_parsed = dtype.cast_value(fill_value)
     90 else:
     91     fill_value_parsed = fill_value

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/dtype/wrapper.py:168, in ZDType.cast_value(self, data)
    161     return self._cast_value_unsafe(data)
    162 msg = (
    163     f"The value {data} failed a type check. "
    164     f"It cannot be safely cast to a scalar compatible with {self.dtype_cls}. "
    165     f"Consult the documentation for {self} to determine the possible values that can "
    166     "be cast to scalars of the wrapped data type."
    167 )
--> 168 raise TypeError(msg)

TypeError: The value 0 failed a type check. It cannot be safely cast to a scalar compatible with <class 'numpy.dtypes.StringDType'>. Consult the documentation for VariableLengthString() to determine the possible values that can be cast to scalars of the wrapped data type.

I believe this previously worked via a 0 --> "" conversion, but I need to check. It seems 0 was a valid on-disk fill value for v2 object strings, but I need to dig back through the issues here to confirm this:

My .zarray from zarr 3.0.8:

{
  "shape": [
    10
  ],
  "chunks": [
    10
  ],
  "fill_value": 0,
  "order": "C",
  "filters": [
    {
      "id": "vlen-utf8"
    }
  ],
  "dimension_separator": ".",
  "compressor": {
    "id": "blosc",
    "cname": "lz4",
    "clevel": 5,
    "shuffle": 1,
    "blocksize": 0,
    "typesize": null
  },
  "zarr_format": 2,
  "dtype": "|O"
}
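For context, a lenient reader could normalize this legacy metadata before strict fill-value parsing. This is a hypothetical sketch, not zarr-python's actual API; `normalize_v2_string_fill_value` is an invented name:

```python
def normalize_v2_string_fill_value(zarray: dict) -> dict:
    """Map the legacy fill_value 0 to "" for v2 object-string arrays.

    Hypothetical helper (not part of zarr-python): some older writers
    stored fill_value: 0 for dtype "|O" with a vlen-utf8 filter, which a
    strict string parser would now reject.
    """
    filters = zarray.get("filters") or []
    is_vlen_string = zarray.get("dtype") == "|O" and any(
        f.get("id") == "vlen-utf8" for f in filters
    )
    if is_vlen_string and zarray.get("fill_value") == 0:
        # Return a copy with the fill value rewritten as the empty string.
        zarray = {**zarray, "fill_value": ""}
    return zarray

legacy = {"dtype": "|O", "fill_value": 0, "filters": [{"id": "vlen-utf8"}]}
print(normalize_v2_string_fill_value(legacy)["fill_value"] == "")  # -> True
```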

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

@d-v-b I have a little bug report here on this PR:

import zarr
z = zarr.open("foo.zarr", zarr_format=2)
import numcodecs
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()]) 

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/_compat.py:61, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     51 warnings.warn(
     52     (
     53         f"Pass {formatted_args_msg} as keyword args. From version "
   (...)     58     stacklevel=2,
     59 )
     60 kwargs.update(zip(sig.parameters, args, strict=False))
---> 61 return f(**kwargs)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:2477, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   2382 """Create an array within this group.
   2383 
   2384 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   2471 AsyncArray
   2472 """
   2473 compressors = _parse_deprecated_compressor(
   2474     compressor, compressors, zarr_format=self.metadata.zarr_format
   2475 )
   2476 return Array(
-> 2477     self._sync(
   2478         self._async_group.create_array(
   2479             name=name,
   2480             shape=shape,
   2481             dtype=dtype,
   2482             chunks=chunks,
   2483             shards=shards,
   2484             fill_value=fill_value,
   2485             attributes=attributes,
   2486             chunk_key_encoding=chunk_key_encoding,
   2487             compressors=compressors,
   2488             serializer=serializer,
   2489             dimension_names=dimension_names,
   2490             order=order,
   2491             filters=filters,
   2492             overwrite=overwrite,
   2493             storage_options=storage_options,
   2494             config=config,
   2495         )
   2496     )
   2497 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
    205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    206     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    207     # this should allow us to better type the sync wrapper
--> 208     return sync(
    209         coroutine,
    210         timeout=config.get("async.timeout"),
    211     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1102, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   1007 """Create an array within this group.
   1008 
   1009 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   1097 
   1098 """
   1099 compressors = _parse_deprecated_compressor(
   1100     compressor, compressors, zarr_format=self.metadata.zarr_format
   1101 )
-> 1102 return await create_array(
   1103     store=self.store_path,
   1104     name=name,
   1105     shape=shape,
   1106     dtype=dtype,
   1107     chunks=chunks,
   1108     shards=shards,
   1109     filters=filters,
   1110     compressors=compressors,
   1111     serializer=serializer,
   1112     fill_value=fill_value,
   1113     order=order,
   1114     zarr_format=self.metadata.zarr_format,
   1115     attributes=attributes,
   1116     chunk_key_encoding=chunk_key_encoding,
   1117     dimension_names=dimension_names,
   1118     storage_options=storage_options,
   1119     overwrite=overwrite,
   1120     config=config,
   1121 )

[... remaining frames identical to the traceback in the comment above ...]

TypeError: The value 0 failed a type check. It cannot be safely cast to a scalar compatible with <class 'numpy.dtypes.StringDType'>. Consult the documentation for VariableLengthString() to determine the possible values that can be cast to scalars of the wrapped data type.

I believe this previously worked via a 0-->"" conversion.

Thanks for this report! After this PR, I don't think we want to silently convert 0 to the empty string. IMO users who want to use a string data type should specify a string fill value.

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

This requires changing the default fill value from 0 to None, which instructs zarr-python to use the per-dtype default fill value.
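That dispatch can be sketched as follows. These are illustrative stand-in classes, not the real ZDType implementation in this PR:

```python
class VariableLengthString:
    """Stand-in for a variable-length string dtype (illustrative only)."""

    def default_value(self) -> str:
        return ""

    def cast_value(self, data):
        # Strict check: only genuine strings are accepted as fill values.
        if isinstance(data, str):
            return data
        raise TypeError(f"The value {data!r} cannot be cast to a string scalar.")


def resolve_fill_value(fill_value, dtype):
    # fill_value=None defers to the dtype's own default ...
    if fill_value is None:
        return dtype.default_value()
    # ... while an explicit value must pass the dtype's type check.
    return dtype.cast_value(fill_value)


dt = VariableLengthString()
print(resolve_fill_value(None, dt) == "")  # -> True: per-dtype default
print(resolve_fill_value("x", dt))         # -> x
try:
    resolve_fill_value(0, dt)              # raises, matching the report above
except TypeError as exc:
    print("rejected:", exc)
```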

@ilan-gold
Contributor

@d-v-b I am seeing test failures with some old data we have written out that uses the 0-as-string fill value.

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> str:
        if not check_json_str(data):
>           raise TypeError(f"Invalid type: {data}. Expected a string.")
E           TypeError: Invalid type: 0. Expected a string.
E           Error raised while reading key 'var' of <class 'zarr.core.group.Group'> from /raw

I will push my branch and show you the CI so you have a bit more to look at. I think this is related to #2792 (comment)
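One possible backward-compatible reading path is sketched below, under the assumption that the legacy 0 only ever encoded the empty string; `from_json_value_compat` is an invented name, not part of the PR:

```python
def from_json_value_compat(data):
    """Accept the legacy 0 fill value for object-string arrays (sketch)."""
    if isinstance(data, str):
        return data
    if isinstance(data, int) and data == 0:
        # Legacy zarr-python v2 metadata encoded the empty-string fill
        # value for "|O" string arrays as the number 0.
        return ""
    raise TypeError(f"Invalid type: {data}. Expected a string.")

print(from_json_value_compat("abc"))    # -> abc
print(from_json_value_compat(0) == "")  # -> True
```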

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

@d-v-b I am seeing test failures with some old data we have written out that uses the 0-as-string fill value.

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> str:
        if not check_json_str(data):
>           raise TypeError(f"Invalid type: {data}. Expected a string.")
E           TypeError: Invalid type: 0. Expected a string.
E           Error raised while reading key 'var' of <class 'zarr.core.group.Group'> from /raw

I will push my branch and show you the CI so you have a bit more to look at. I think this is related to #2792 (comment)

🙏 thank you for checking this! It's super helpful to me.
