
refactor v3 data types #2874


Open
wants to merge 129 commits into base: main

Conversation

@d-v-b (Contributor) commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more of them. Accordingly, this PR refactors data types for the zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, confining the v2 / v3 differences to the places where they vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes is probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
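As a rough sketch of the idea (the names here are illustrative, not the PR's actual API): a parametric dtype becomes a class whose instances carry the parameters, and each class owns its default value and JSON conversion:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedLengthUnicode:
    """Hypothetical wrapper for numpy's parametric "U" dtype family."""

    length: int  # the parameter: "U2", "U3", ... are instances of one class

    @property
    def name(self) -> str:
        # the name that would feed into the zarr metadata
        return f"U{self.length}"

    def default_value(self) -> str:
        # each dtype class defines its own default fill value
        return ""

    def to_json_value(self, value: str) -> str:
        # and its own scalar -> JSON conversion
        return str(value)
```

Under this model, `FixedLengthUnicode(3)` plays the role that the bare string "U3" plays in main.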

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Feb 28, 2025

d-v-b commented Feb 28, 2025

Copying a comment @nenb made in this zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc, as instances of the same class. If instead we bind a concrete numpy dtype as class attributes, we need a separate class for U2, U3, etc, which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
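Concretely (plain numpy behavior, not PR code), the character code names the parametric type while each instance carries its own length:

```python
import numpy as np

# "U2" and "U3" are distinct dtype instances that share the character
# code "U"; the code identifies the parametric family, the instance
# carries the length parameter.
u2 = np.dtype("U2")
u3 = np.dtype("U3")

assert u2.char == "U" and u3.char == "U"
assert u2.itemsize == 8   # 2 code points * 4 bytes (UCS-4)
assert u3.itemsize == 12  # 3 code points * 4 bytes
```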

name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]] # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
Contributor Author

Child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (which gets assigned automatically from the generic type parameter), a string kind (we use this for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that matches dtype to default value, but we could potentially do that).

Comment on lines 268 to 283
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
kind = "numeric"

@classmethod
def from_dtype(cls, dtype: TDType) -> Self:
return cls()

def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
return int(data)

def from_json_value(
self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
) -> TScalar:
if check_json_int(data):
return self.to_dtype(endianness=endianness).type(data)
raise TypeError(f"Invalid type: {data}. Expected an integer.")
Contributor Author

I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
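A stand-alone analogue of that pattern (hypothetical names, not the PR's classes): the base holds the shared JSON logic, and each subclass only binds its concrete numpy dtype:

```python
import numpy as np


class IntBase:
    """Shared to/from JSON logic for all integer dtypes."""

    np_dtype: np.dtype  # bound by each subclass

    def to_json_value(self, data) -> int:
        return int(data)

    def from_json_value(self, data):
        # reject bools explicitly: bool is a subclass of int in Python
        if isinstance(data, int) and not isinstance(data, bool):
            return self.np_dtype.type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")


# subclasses differ only in the concrete dtype they bind
class Int8(IntBase):
    np_dtype = np.dtype("int8")


class UInt16(IntBase):
    np_dtype = np.dtype("uint16")
```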


nenb commented Mar 3, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but won't be part of the dtype on disk (as required by the spec).


d-v-b commented Mar 4, 2025


Thanks for the summary! I have implemented the proposed solution.

@d-v-b mentioned this pull request Mar 5, 2025

d-v-b commented May 17, 2025

Have not reviewed everything but commenting early to get this part out:

I think that the float to/from JSON infrastructure for fill values is not complete. As it stands, if xarray were to transition to using from_json_value on the dtype to read fill values on loading, it will not be able to read zarr files it wrote with prior versions of zarr-python, because of how those versions encoded float fill values.

Thanks for checking this! Can you give me instructions for replicating the failing xarray tests?


ianhi commented May 17, 2025

Install this branch of xarray: https://github.com/ianhi/xarray/tree/zarr-dtype-tests where I've swapped the reading of fill values to use the from_json_values with this commit: ianhi/xarray@70a7e8b

I didn't swap the writing of the values yet, to simulate data written with xarray in the past. Then run this test:

pytest "xarray/tests/test_backends.py::TestZarrRegionAuto::test_zarr_region[3]"

which should give:

self = <xarray.tests.test_backends.TestZarrRegionAuto object at 0x10adbb110>

    def test_zarr_region(self):
        with self.create() as (target, ds):
            ds_transposed = ds.transpose("y", "x")
            ds_region = 1 + ds_transposed.isel(x=[0], y=[0])
>           self.save(target, ds_region, region={"x": slice(0, 1), "y": slice(0, 1)})

/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:6428:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:6353: in save
    ds.to_zarr(target, **kwargs)
/Users/ian/Documents/dev/xarray/xarray/core/dataset.py:2278: in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
/Users/ian/Documents/dev/xarray/xarray/backends/api.py:2239: in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
/Users/ian/Documents/dev/xarray/xarray/backends/api.py:1975: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:1027: in store
    k: self.open_store_variable(name=k) for k in existing_variable_names
/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:892: in open_store_variable
    zarr_array.metadata.data_type.from_json_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Float64(endianness='little'), data = 'AAAAAAAA+H8='

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> TFloatScalar_co:
        """
        Read a JSON-serializable value as a numpy float.

        Parameters
        ----------
        data : JSON
            The JSON-serializable value.
        zarr_format : ZarrFormat
            The zarr format version.

        Returns
        -------
        TScalar_co
            The numpy float.
        """
        if check_json_float(data, zarr_format=zarr_format):
            return self._cast_value_unsafe(float_from_json(data, zarr_format=zarr_format))
>       raise TypeError(
            f"Invalid type: {data}. Expected a float or a special string encoding of a float."
        )
E       TypeError: Invalid type: AAAAAAAA+H8=. Expected a float or a special string encoding of a float.


ianhi commented May 17, 2025

I didn't get that quite right. What xarray is doing is storing its own attribute "_FillValue" separately from the zarr fill value and is storing it as a base64 encoded string https://github.com/pydata/xarray/blob/7551a7a4fcad82a27b3701cfa0af85f74404a45c/xarray/backends/zarr.py#L135

So ignore that test of xarray for the sake of this PR. Though it was helpful in leading me to that part of the spec. The complete spec still requires being able to specify fill values as

the byte representation of the floating point number as an unsigned integer. For example, for float32, "NaN" is equivalent to "0x7fc00000". This representation is the only way to specify a NaN value other than the specific NaN value denoted by "NaN".

edit:
The endianness of encoding was discussed here: zarr-developers/zarr-specs#279

The conclusion is that using big-endian as the argument to encode is the correct choice.


What's confusing to me here is that it seems like the spec does not explicitly call out the endianness of the encoded float in the metadata. Though from the example for NaN it seems to implicitly state big-endian encoding is used for the float as fill value.

import struct
import numpy as np
print(struct.pack(">f",np.nan).hex())
print(struct.pack("<f",np.nan).hex())

gives

7fc00000
0000c07f

helpful link: https://docs.python.org/3/library/struct.html#format-characters
format strings for struct:

float16: e
float32: f
float64: d
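A sketch of what the byte-representation form implies, assuming big-endian byte order per the spec discussion above (float32_from_hex is a hypothetical helper, not zarr-python API):

```python
import math
import struct


def float32_from_hex(s: str) -> float:
    """Decode a fill value like "0x7fc00000" into a float32 value."""
    raw = int(s, 16).to_bytes(4, byteorder="big")
    return struct.unpack(">f", raw)[0]


# "NaN" and "0x7fc00000" denote the same float32 fill value:
assert math.isnan(float32_from_hex("0x7fc00000"))

# Round-trip: packing NaN big-endian reproduces the hex digits.
assert struct.pack(">f", math.nan).hex() == "7fc00000"
```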


d-v-b commented May 19, 2025

Thanks for the detective work @ianhi! I added the v3-specific float-as-hex-string functionality in e4a0372

@nenb left a comment

I took another look.

I left a bunch of nits, and some minor comments that could be addressed in follow-up PRs (if you agree that they are worth following up on).

I certainly won't block merging this, though; it has already received a lot of eyes and has had a lot of work put into it. Thank you for all of this work! 🙏

"""


def _collect_entrypoints() -> list[Registry[Any]]:
"""
Collects codecs, pipelines, buffers and ndbuffers from entrypoints.
Collects codecs, pipelines, dtypes, buffers and ndbuffers from entrypoints.

Nit: _collect_entrypoints appears to be designed to return all Registrys. The dtype registry is not included in the returned value though. Could this be confusing? It appears that the return value of _collect_entrypoints is not used anywhere, and so another alternative could be just to return nothing.

Contributor Author

I am at baseline confused by entrypoints. Thanks for flagging this; hopefully I can sort it out.

# Note: this check is numpy-dtype-specific
# For single-byte (e.g., uint8) or 0-byte (e.g., S0) dtypes,
# endianness does not apply.
if array_spec.dtype.to_dtype().itemsize < 2:

I think to be safe, we should also have this on the dtype interface. WDYT?

Again, if you agree, this could be a follow-up PR.

Contributor Author

I think it's a good idea to put this on the dtype interface. We will need to distinguish between data types with a variable size per element, data types with a static size per array element (i.e., numpy's itemsize), container data types with a static length (e.g., np.dtype('U10')), and container data types with a variable length. I don't think this should be too hard.
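On the numpy side, itemsize already separates some of these categories (illustrative numpy behavior, not the proposed interface):

```python
import numpy as np

assert np.dtype("uint8").itemsize == 1    # single byte: endianness is moot
assert np.dtype("float64").itemsize == 8  # static size per element
assert np.dtype("U10").itemsize == 40     # static-length container: 10 UCS-4 code points

# variable-length data sits behind a pointer, so itemsize reports the
# pointer size rather than the payload size:
assert np.dtype("O").itemsize in (4, 8)
```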

Contributor Author

See 8c90d2c

shape = parse_shapelike(shape)

if chunks is not None and chunk_shape is not None:
raise ValueError("Only one of chunk_shape or chunks can be provided.")

if chunks:
_chunks = normalize_chunks(chunks, shape, dtype_parsed.itemsize)
_chunks = normalize_chunks(chunks, shape, dtype_parsed.to_dtype().itemsize)

Same comment as earlier on moving itemsize to dtype interface in future PR.

Contributor Author

I went ahead and did it here: 8c90d2c

dtype = parse_dtype(dtype, zarr_format=2)

# inject VLenUTF8 for str dtype if not already present
if np.issubdtype(dtype, np.str_):

Just want to confirm that it was intentional to drop this logic updating the filters for v2?

Contributor Author

Yes, this was intentional, but thank you for flagging it because it still needs some polish.

# these categories are keys in a configuration dictionary.
# it is not a part of the ZDType class because these categories are more of an implementation detail
# of our config system rather than a useful attribute of any particular data type.
DTypeCategory = Literal["variable-length-string", "default"]

Nit: Should this just be 'variable-length' rather than 'variable-length-string'?

Contributor Author

Variable-length strings get encoded with the vlen-utf8 codec, which is string-specific. I don't think we would want to use that codec for arbitrary variable-length data types.


(Think this is just user-error on my part but):

How are we dealing with variable length bytes (which were previously supported AFAIU)? These were in the old version of the code, I presume that they are still supported somehow?

def get(self, key: str) -> type[ZDType[TBaseDType, TBaseScalar]]:
return self.contents[key]

def match_dtype(self, dtype: TBaseDType) -> ZDType[TBaseDType, TBaseScalar]:
Contributor

Echoing the docstrings comment here. I was trying to register a dtype and ran into an error here and spent a few minutes confused about why it was done this way until I remembered the issue with exhaustively listing dtypes.

One question here: could it be worth trying a key lookup first, then falling back to iterating over items?

Contributor Author

I personally prefer key-value lookup. The big problem here is the (cursed) numpy object dtype, which potentially resolves to multiple zarr dtypes. I opened an issue to talk about this particular problem here.

Contributor

Makes sense. All I'm saying is that, in the absence of a resolution of #3077, adding a comment explaining that will be very helpful to someone in the future.

@@ -136,7 +136,7 @@ def to_dict(self) -> dict[str, JSON]:
}

def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
dtype = array_spec.dtype
dtype = array_spec.dtype.to_dtype()
Contributor

Or maybe as_native_dtype

Comment on lines +154 to +155
class Float16(BaseFloat[np.dtypes.Float16DType, np.float16]):
dtype_cls = np.dtypes.Float16DType
Contributor

@d-v-b how would I specify the dtype_cls for a non-native numpy dtype that has been registered with numpy? I'm thinking specifically of these: https://github.com/jax-ml/ml_dtypes?tab=readme-ov-file#ml_dtypes

For context, I tried to make a very minimal implementation of bfloat16 (https://github.com/zarr-developers/zarr-extensions/tree/main/data-types/bfloat16) to better understand the mechanisms at work here. To keep it minimal I tried to inherit from BaseFloat, but I couldn't figure out a version of dtype_cls that would not throw errors when to_dtype was invoked.

from dataclasses import dataclass

import ml_dtypes
import numpy as np
from ml_dtypes import bfloat16

import zarr
from zarr.core.dtype import data_type_registry
from zarr.core.dtype.npy.float import BaseFloat


@dataclass(frozen=True, kw_only=True, slots=True)
class Zbfloat16(BaseFloat):
    """
    Extremely minimal wrapper for bfloat16
    """

    _zarr_v3_name = "bfloat16"
    dtype_cls = np.dtype("bfloat16")
    
    # Directly using the ml_dtypes class here does not
    # avoid the error
    dtype_cls = bfloat16

    def to_dtype(self) -> bfloat16:
        return np.dtype("bfloat16")


import zarr

store = zarr.storage.MemoryStore()


# surprised that this works before registering?
# registering is only to pick up the string based names?
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)


# register
data_type_registry.register("bfloat16", Zbfloat16)

# string based access breaks
# due to something with `to_dtype` not working within `match_dtype`
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype="bfloat16")

Gives a long error message that ultimately ends with:

--> 134 return data_type_registry.match_dtype(na_dtype)

File [~/Documents/dev/zarr-python/src/zarr/core/dtype/registry.py:46](http://localhost:8888/lab/tree/tmp/~/Documents/dev/zarr-python/src/zarr/core/dtype/registry.py#line=45), in DataTypeRegistry.match_dtype(self, dtype)
     44     except DataTypeValidationError:
     45         pass
---> 46 raise ValueError(f"No data type wrapper found that matches dtype '{dtype}'")

ValueError: No data type wrapper found that matches dtype 'bfloat16'

My approach here is based on looking at the examples of the numpy dtypes, and I think it would be pretty representative of what someone might try when implementing their own. From that, I'm concluding that a short example showing bfloat16 with some explanatory comments would go a really long way towards good documentation of how to write a custom dtype.

Contributor Author
@d-v-b May 21, 2025

Thanks so much for working through this, and I totally agree that it would be a great part of the documentation. I am 99% sure your problems were due to this line:

    dtype_cls = np.dtype("bfloat16")

dtype_cls should be bound to the class that creates the dtype. You are binding it to a concrete dtype. The whole idea of ZDType is to allow zarr-python to abstract over the process of creating variants of the same dtype class (e.g., big-endian float64 or little-endian float64). That means we can't associate a specific dtype instance with a ZDType, we have to use the dtype class instead. So in your case it should be:

    dtype_cls = type(np.dtype("bfloat16"))

This is necessary because (annoyingly) the ml_dtypes package doesn't provide a stand-alone bfloat16 class.

I will double-check that this works later today when I'm not on a train. And in any case, this absolutely needs to be documented better. Clearly the attribute name dtype_cls doesn't convey the idea clearly enough.
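The class-vs-instance distinction can be seen with float64, so the snippet doesn't need ml_dtypes (note: the np.dtypes classes assume a recent numpy, >= 1.25):

```python
import numpy as np

concrete = np.dtype("float64")  # a dtype *instance*
dtype_cls = type(concrete)      # the dtype *class* (np.dtypes.Float64DType)

# zarr-python needs the class so it can construct fresh instances,
# e.g. different byte-order variants of the same logical dtype:
assert dtype_cls() == concrete
```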

addendum:
this line was also problematic:

z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)

You are passing Zbfloat16 in as a class, but it should be an instance (Zbfloat16())

Unfortunately my branch is happy with the invalid input and somehow matched it to the variable length string dtype! so definitely an interesting bug.

Contributor Author
@d-v-b May 21, 2025

here is an example that works:

# /// script
# dependencies = [
#   "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/fixed-length-strings",
#   "ml_dtypes"
# ]
# ///

import ml_dtypes
import numpy as np
from ml_dtypes import bfloat16

import zarr
from zarr.core.dtype import data_type_registry
from zarr.core.dtype.npy.float import BaseFloat
from dataclasses import dataclass

@dataclass(frozen=True, kw_only=True, slots=True)
class Zbfloat16(BaseFloat):
    """
    Extremely minimal wrapper for bfloat16
    """

    _zarr_v3_name = "bfloat16"
    dtype_cls = type(np.dtype("bfloat16"))

    def to_dtype(self) -> bfloat16:
        return np.dtype("bfloat16")


import zarr

store = zarr.storage.MemoryStore()
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16())
z[:] = 1.0
print(z[:])
# [1 1 1 1 1]
print(z[:].dtype)
# bfloat16

# register
data_type_registry.register("bfloat16", Zbfloat16)

z2 = zarr.open_array(store=store, mode='r')
print(z2[:])
# [1 1 1 1 1]
print(z2[:].dtype)
# bfloat16

Contributor Author
@d-v-b May 21, 2025

To address this comment:

surprised that this works before registering?
registering is only to pick up the string based names?

Right, we need to infer a zdtype from a string (or dict) when reading dtype metadata from JSON. In the other direction, we have to infer a zdtype from a python object (like a numpy dtype) that represents an in-memory dtype. At the moment you can bypass the in-memory-object -> zdtype path by providing a zdtype instance explicitly. We could also offer a similar bypass for reading data from disk, but I haven't implemented that yet.

Contributor

Thanks for the example! Very helpful

this line was also problematic:

z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)

You are passing Zbfloat16 in as a class, but it should be an instance (Zbfloat16())

Unfortunately my branch is happy with the invalid input and somehow matched it to the variable length string dtype! so definitely an interesting bug.

On the wrapper, from_dtype is defined as a classmethod, which is why this works.

This is necessary because (annoyingly) the ml_dtypes package doesn't provide a stand-alone bfloat16 class.

Ok I thought that might be the case

idea is to allow zarr-python to abstract over the process of creating variants of the same dtype class (e.g., big-endian float64 or little-endian float64)

Notably, the ml_dtypes are exclusively little-endian. That's why I rewrote to_dtype to avoid the endianness handling in BaseFloat.

I totally agree that it would be a great part of the documentation.

As an easy step in this direction, I suggest taking what you wrote in this thread and putting it into the docs, maybe with a small caveat that the typing is not implemented and that this shouldn't inherit from HasEndianness.


self.lazy_load_list.clear()

def register(self: Self, key: str, cls: type[ZDType[TBaseDType, TBaseScalar]]) -> None:
Contributor

I think it makes sense to add an unregister method as well.


ilan-gold commented May 22, 2025

@d-v-b I have a little bug report here on this PR (which may be intentional):

import zarr
z = zarr.open("foo.zarr", zarr_format=2)
import numcodecs
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()]) 

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/_compat.py:61, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     51 warnings.warn(
     52     (
     53         f"Pass {formatted_args_msg} as keyword args. From version "
   (...)     58     stacklevel=2,
     59 )
     60 kwargs.update(zip(sig.parameters, args, strict=False))
---> 61 return f(**kwargs)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:2477, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   2382 """Create an array within this group.
   2383 
   2384 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   2471 AsyncArray
   2472 """
   2473 compressors = _parse_deprecated_compressor(
   2474     compressor, compressors, zarr_format=self.metadata.zarr_format
   2475 )
   2476 return Array(
-> 2477     self._sync(
   2478         self._async_group.create_array(
   2479             name=name,
   2480             shape=shape,
   2481             dtype=dtype,
   2482             chunks=chunks,
   2483             shards=shards,
   2484             fill_value=fill_value,
   2485             attributes=attributes,
   2486             chunk_key_encoding=chunk_key_encoding,
   2487             compressors=compressors,
   2488             serializer=serializer,
   2489             dimension_names=dimension_names,
   2490             order=order,
   2491             filters=filters,
   2492             overwrite=overwrite,
   2493             storage_options=storage_options,
   2494             config=config,
   2495         )
   2496     )
   2497 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
    205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    206     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    207     # this should allow us to better type the sync wrapper
--> 208     return sync(
    209         coroutine,
    210         timeout=config.get("async.timeout"),
    211     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1102, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   1007 """Create an array within this group.
   1008 
   1009 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   1097 
   1098 """
   1099 compressors = _parse_deprecated_compressor(
   1100     compressor, compressors, zarr_format=self.metadata.zarr_format
   1101 )
-> 1102 return await create_array(
   1103     store=self.store_path,
   1104     name=name,
   1105     shape=shape,
   1106     dtype=dtype,
   1107     chunks=chunks,
   1108     shards=shards,
   1109     filters=filters,
   1110     compressors=compressors,
   1111     serializer=serializer,
   1112     fill_value=fill_value,
   1113     order=order,
   1114     zarr_format=self.metadata.zarr_format,
   1115     attributes=attributes,
   1116     chunk_key_encoding=chunk_key_encoding,
   1117     dimension_names=dimension_names,
   1118     storage_options=storage_options,
   1119     overwrite=overwrite,
   1120     config=config,
   1121 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:4466, in create_array(store, name, shape, dtype, data, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config, write_data)
   4461 mode: Literal["a"] = "a"
   4463 store_path = await make_store_path(
   4464     store, path=name, mode=mode, storage_options=storage_options
   4465 )
-> 4466 return await init_array(
   4467     store_path=store_path,
   4468     shape=shape_parsed,
   4469     dtype=dtype_parsed,
   4470     chunks=chunks,
   4471     shards=shards,
   4472     filters=filters,
   4473     compressors=compressors,
   4474     serializer=serializer,
   4475     fill_value=fill_value,
   4476     order=order,
   4477     zarr_format=zarr_format,
   4478     attributes=attributes,
   4479     chunk_key_encoding=chunk_key_encoding,
   4480     dimension_names=dimension_names,
   4481     overwrite=overwrite,
   4482     config=config,
   4483 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:4240, in init_array(store_path, shape, dtype, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, overwrite, config)
   4237     else:
   4238         order_parsed = order
-> 4240     meta = AsyncArray._create_metadata_v2(
   4241         shape=shape_parsed,
   4242         dtype=dtype_wrapped,
   4243         chunks=chunk_shape_parsed,
   4244         dimension_separator=chunk_key_encoding_parsed.separator,
   4245         fill_value=fill_value,
   4246         order=order_parsed,
   4247         filters=filters_parsed,
   4248         compressor=compressor_parsed,
   4249         attributes=attributes,
   4250     )
   4251 else:
   4252     array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
   4253         compressors=compressors,
   4254         filters=filters,
   4255         serializer=serializer,
   4256         dtype=dtype_wrapped,
   4257     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:783, in AsyncArray._create_metadata_v2(shape, dtype, chunks, order, dimension_separator, fill_value, filters, compressor, attributes)
    781 if fill_value is None:
    782     fill_value = dtype.default_value()  # type: ignore[assignment]
--> 783 return ArrayV2Metadata(
    784     shape=shape,
    785     dtype=dtype,
    786     chunks=chunks,
    787     order=order,
    788     dimension_separator=dimension_separator,
    789     fill_value=fill_value,
    790     compressor=compressor,
    791     filters=filters,
    792     attributes=attributes,
    793 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/metadata/v2.py:89, in ArrayV2Metadata.__init__(self, shape, dtype, chunks, fill_value, order, dimension_separator, compressor, filters, attributes)
     87 fill_value_parsed: TBaseScalar | None
     88 if fill_value is not None:
---> 89     fill_value_parsed = dtype.cast_value(fill_value)
     90 else:
     91     fill_value_parsed = fill_value

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/dtype/wrapper.py:168, in ZDType.cast_value(self, data)
    161     return self._cast_value_unsafe(data)
    162 msg = (
    163     f"The value {data} failed a type check. "
    164     f"It cannot be safely cast to a scalar compatible with {self.dtype_cls}. "
    165     f"Consult the documentation for {self} to determine the possible values that can "
    166     "be cast to scalars of the wrapped data type."
    167 )
--> 168 raise TypeError(msg)

TypeError: The value 0 failed a type check. It cannot be safely cast to a scalar compatible with <class 'numpy.dtypes.StringDType'>. Consult the documentation for VariableLengthString() to determine the possible values that can be cast to scalars of the wrapped data type.

I believe this previously worked via a 0 --> "" conversion, but I need to check. It seems 0 was a valid on-disk fill value for v2 object strings, but I need to dig back through the issues here to confirm this:

My .zarray from zarr 3.0.8:

{
  "shape": [
    10
  ],
  "chunks": [
    10
  ],
  "fill_value": 0,
  "order": "C",
  "filters": [
    {
      "id": "vlen-utf8"
    }
  ],
  "dimension_separator": ".",
  "compressor": {
    "id": "blosc",
    "cname": "lz4",
    "clevel": 5,
    "shuffle": 1,
    "blocksize": 0,
    "typesize": null
  },
  "zarr_format": 2,
  "dtype": "|O"
}
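For context, a lenient reader could normalize this legacy metadata before strict fill-value parsing. This is a hypothetical sketch, not zarr-python's actual API; `normalize_v2_string_fill_value` is an invented name:

```python
def normalize_v2_string_fill_value(zarray: dict) -> dict:
    """Map the legacy fill_value 0 to "" for v2 object-string arrays.

    Hypothetical helper (not part of zarr-python): some older writers
    stored fill_value: 0 for dtype "|O" with a vlen-utf8 filter, which a
    strict string parser would now reject.
    """
    filters = zarray.get("filters") or []
    is_vlen_string = zarray.get("dtype") == "|O" and any(
        f.get("id") == "vlen-utf8" for f in filters
    )
    if is_vlen_string and zarray.get("fill_value") == 0:
        # Return a copy with the fill value rewritten as the empty string.
        zarray = {**zarray, "fill_value": ""}
    return zarray

legacy = {"dtype": "|O", "fill_value": 0, "filters": [{"id": "vlen-utf8"}]}
print(normalize_v2_string_fill_value(legacy)["fill_value"] == "")  # -> True
```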

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

@d-v-b I have a little bug report here on this PR:

import zarr
z = zarr.open("foo.zarr", zarr_format=2)
import numcodecs
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()]) 

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/_compat.py:61, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     51 warnings.warn(
     52     (
     53         f"Pass {formatted_args_msg} as keyword args. From version "
   (...)     58     stacklevel=2,
     59 )
     60 kwargs.update(zip(sig.parameters, args, strict=False))
---> 61 return f(**kwargs)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:2477, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   2382 """Create an array within this group.
   2383 
   2384 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   2471 AsyncArray
   2472 """
   2473 compressors = _parse_deprecated_compressor(
   2474     compressor, compressors, zarr_format=self.metadata.zarr_format
   2475 )
   2476 return Array(
-> 2477     self._sync(
   2478         self._async_group.create_array(
   2479             name=name,
   2480             shape=shape,
   2481             dtype=dtype,
   2482             chunks=chunks,
   2483             shards=shards,
   2484             fill_value=fill_value,
   2485             attributes=attributes,
   2486             chunk_key_encoding=chunk_key_encoding,
   2487             compressors=compressors,
   2488             serializer=serializer,
   2489             dimension_names=dimension_names,
   2490             order=order,
   2491             filters=filters,
   2492             overwrite=overwrite,
   2493             storage_options=storage_options,
   2494             config=config,
   2495         )
   2496     )
   2497 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
    205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    206     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    207     # this should allow us to better type the sync wrapper
--> 208     return sync(
    209         coroutine,
    210         timeout=config.get("async.timeout"),
    211     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1102, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
   1007 """Create an array within this group.
   1008 
   1009 This method lightly wraps :func:`zarr.core.array.create_array`.
   (...)   1097 
   1098 """
   1099 compressors = _parse_deprecated_compressor(
   1100     compressor, compressors, zarr_format=self.metadata.zarr_format
   1101 )
-> 1102 return await create_array(
   1103     store=self.store_path,
   1104     name=name,
   1105     shape=shape,
   1106     dtype=dtype,
   1107     chunks=chunks,
   1108     shards=shards,
   1109     filters=filters,
   1110     compressors=compressors,
   1111     serializer=serializer,
   1112     fill_value=fill_value,
   1113     order=order,
   1114     zarr_format=self.metadata.zarr_format,
   1115     attributes=attributes,
   1116     chunk_key_encoding=chunk_key_encoding,
   1117     dimension_names=dimension_names,
   1118     storage_options=storage_options,
   1119     overwrite=overwrite,
   1120     config=config,
   1121 )

[... remaining frames identical to the traceback in the comment above ...]

TypeError: The value 0 failed a type check. It cannot be safely cast to a scalar compatible with <class 'numpy.dtypes.StringDType'>. Consult the documentation for VariableLengthString() to determine the possible values that can be cast to scalars of the wrapped data type.

I believe this previously worked via a 0-->"" conversion.

Thanks for this report! After this PR, I don't think we want to silently convert 0 to the empty string. IMO users who want to use a string data type should specify a string fill value.

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

This requires changing the default fill value from 0 to None, which instructs zarr-python to use the per-dtype default fill value.
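That dispatch can be sketched as follows. These are illustrative stand-in classes, not the real ZDType implementation in this PR:

```python
class VariableLengthString:
    """Stand-in for a variable-length string dtype (illustrative only)."""

    def default_value(self) -> str:
        return ""

    def cast_value(self, data):
        # Strict check: only genuine strings are accepted as fill values.
        if isinstance(data, str):
            return data
        raise TypeError(f"The value {data!r} cannot be cast to a string scalar.")


def resolve_fill_value(fill_value, dtype):
    # fill_value=None defers to the dtype's own default ...
    if fill_value is None:
        return dtype.default_value()
    # ... while an explicit value must pass the dtype's type check.
    return dtype.cast_value(fill_value)


dt = VariableLengthString()
print(resolve_fill_value(None, dt) == "")  # -> True: per-dtype default
print(resolve_fill_value("x", dt))         # -> x
try:
    resolve_fill_value(0, dt)              # raises, matching the report above
except TypeError as exc:
    print("rejected:", exc)
```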

@ilan-gold
Contributor

@d-v-b I am seeing test failures with some old data we have written out that uses the 0-as-string fill value.

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> str:
        if not check_json_str(data):
>           raise TypeError(f"Invalid type: {data}. Expected a string.")
E           TypeError: Invalid type: 0. Expected a string.
E           Error raised while reading key 'var' of <class 'zarr.core.group.Group'> from /raw

I will push my branch and show you the CI so you have a bit more to look at. I think this is related to #2792 (comment)
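One possible backward-compatible reading path is sketched below, under the assumption that the legacy 0 only ever encoded the empty string; `from_json_value_compat` is an invented name, not part of the PR:

```python
def from_json_value_compat(data):
    """Accept the legacy 0 fill value for object-string arrays (sketch)."""
    if isinstance(data, str):
        return data
    if isinstance(data, int) and data == 0:
        # Legacy zarr-python v2 metadata encoded the empty-string fill
        # value for "|O" string arrays as the number 0.
        return ""
    raise TypeError(f"Invalid type: {data}. Expected a string.")

print(from_json_value_compat("abc"))    # -> abc
print(from_json_value_compat(0) == "")  # -> True
```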

@d-v-b
Contributor Author

d-v-b commented May 22, 2025

@d-v-b I am seeing test failures with some old data we have written out that uses the 0-as-string fill value.

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> str:
        if not check_json_str(data):
>           raise TypeError(f"Invalid type: {data}. Expected a string.")
E           TypeError: Invalid type: 0. Expected a string.
E           Error raised while reading key 'var' of <class 'zarr.core.group.Group'> from /raw

I will push my branch and show you the CI so you have a bit more to look at. I think this is related to #2792 (comment)

🙏 thank you for checking this! It's super helpful to me.
