refactor v3 data types #2874
Conversation
…into feat/fixed-length-strings
copying a comment @nenb made in this zulip discussion:
A feature of the character code is that it provides a way to distinguish parametric types like
src/zarr/core/metadata/dtype.py
Outdated
name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]]  # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
Child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (which gets assigned automatically from the generic type parameter), a string kind (which we use for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that maps each dtype to a default value, but we could potentially do that instead).
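The attribute layout described here can be sketched minimally as follows (the class and its values are illustrative, not the PR's actual code):

```python
from typing import ClassVar

class Int8Wrapper:
    # string name that feeds into the zarr metadata
    name: ClassVar[str] = "int8"
    # coarse classification used internally for scalars
    kind: ClassVar[str] = "numeric"
    # per-dtype default value stored on the class, so no separate
    # dtype -> default-value mapping function is needed
    default_value: ClassVar[int] = 0
```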
src/zarr/core/metadata/dtype.py
Outdated
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
    kind = "numeric"

    @classmethod
    def from_dtype(cls, dtype: TDType) -> Self:
        return cls()

    def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
        return int(data)

    def from_json_value(
        self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
    ) -> TScalar:
        if check_json_int(data):
            return self.to_dtype(endianness=endianness).type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")
I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make
Thanks for the summary! I have implemented the proposed solution.
Thanks for checking this! Can you give me instructions for replicating the failing xarray tests?
Install this branch of xarray: https://github.com/ianhi/xarray/tree/zarr-dtype-tests, where I've swapped the reading of fill values. I didn't swap the writing of the values yet, to simulate data written with xarray in the past. Then run this test:
which should give:
I didn't get that quite right, so ignore that xarray test for the sake of this PR. It was helpful in leading me to that part of the spec, though. The complete spec still requires being able to specify fill values as
edit: the conclusion is that using big endian as the argument to encode is the correct choice. What's confusing to me here is that the spec does not seem to explicitly call out the endianness of the encoded float in the metadata, though from the example for NaN it seems to implicitly state that big-endian encoding is used for the float fill value.
helpful link: https://docs.python.org/3/library/struct.html#format-characters (float16 is format character 'e')
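The big-endian encoding in question can be reproduced with struct. This is a sketch under the assumption, drawn from the discussion above, that the fill value is rendered as the hex string of the float's big-endian bytes:

```python
import struct

def float_fill_to_hex(value: float, fmt: str = ">f") -> str:
    # ">e" = big-endian float16, ">f" = big-endian float32, ">d" = big-endian float64
    return "0x" + struct.pack(fmt, value).hex()

print(float_fill_to_hex(float("nan")))  # 0x7fc00000 (quiet NaN, float32)
```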
I took another look.
I left a bunch of nits, and some minor comments that could be addressed in follow-up PRs (if you agree that they are worth following up on).
I certainly won't be blocking on merging this though, I personally think that it's already received a lot of eyes, and has had a lot of work put into it. Thank you for all of this work! 🙏
 def _collect_entrypoints() -> list[Registry[Any]]:
     """
-    Collects codecs, pipelines, buffers and ndbuffers from entrypoints.
+    Collects codecs, pipelines, dtypes, buffers and ndbuffers from entrypoints.
Nit: _collect_entrypoints appears to be designed to return all Registry instances. The dtype registry is not included in the returned value, though. Could this be confusing? It appears that the return value of _collect_entrypoints is not used anywhere, so another alternative would be to simply return nothing.
i am at baseline confused by entrypoints. thanks for flagging this, hopefully I can sort this out
src/zarr/codecs/bytes.py
Outdated
# Note: this check is numpy-dtype-specific.
# For single-byte (e.g., uint8) or 0-byte (e.g., S0) dtypes,
# endianness does not apply.
if array_spec.dtype.to_dtype().itemsize < 2:
I think to be safe, we should also have this on the dtype interface. WDYT?
Again, if you agree, this could be a follow-up PR.
I think it's a good idea to put this on the dtype interface. We will need to distinguish between data types with a variable size per element, data types with a static size per array element (i.e., numpy's itemsize), container data types with a static length (e.g., np.dtype('U10')), and container data types with a variable length. I don't think this should be too hard.
see 8c90d2c
src/zarr/core/array.py
Outdated
 shape = parse_shapelike(shape)

 if chunks is not None and chunk_shape is not None:
     raise ValueError("Only one of chunk_shape or chunks can be provided.")

 if chunks:
-    _chunks = normalize_chunks(chunks, shape, dtype_parsed.itemsize)
+    _chunks = normalize_chunks(chunks, shape, dtype_parsed.to_dtype().itemsize)
Same comment as earlier on moving itemsize
to dtype interface in future PR.
i went ahead and did it here 8c90d2c
dtype = parse_dtype(dtype, zarr_format=2)

# inject VLenUTF8 for str dtype if not already present
if np.issubdtype(dtype, np.str_):
Just want to confirm that it was intentional to drop this logic updating the filters for v2?
yes, this was intentional but thank you for flagging it because it still needs some polish
# these categories are keys in a configuration dictionary.
# they are not part of the ZDType class because these categories are more of an
# implementation detail of our config system than a useful attribute of any
# particular data type.
DTypeCategory = Literal["variable-length-string", "default"]
Nit: Should this just be 'variable-length' rather than 'variable-length-string'?
Variable-length strings get encoded with the vlen-utf8 codec, which is string-specific. I don't think we would want to use that codec for arbitrary variable-length data types.
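To illustrate why such a codec is string-specific, here is a toy length-prefixed encoding in the spirit of vlen-utf8 (an illustration of the idea, not numcodecs' actual wire format):

```python
import struct

def encode_vlen_utf8(strings: list[str]) -> bytes:
    # each element is stored as a 4-byte little-endian length followed by UTF-8 bytes
    out = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        out += struct.pack("<I", len(data)) + data
    return bytes(out)

def decode_vlen_utf8(buf: bytes) -> list[str]:
    out, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from("<I", buf, i)
        i += 4
        # decoding assumes valid UTF-8, which is exactly why this scheme
        # cannot carry arbitrary variable-length byte payloads
        out.append(buf[i : i + n].decode("utf-8"))
        i += n
    return out
```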
(Think this is just user-error on my part but):
How are we dealing with variable length bytes (which were previously supported AFAIU)? These were in the old version of the code, I presume that they are still supported somehow?
def get(self, key: str) -> type[ZDType[TBaseDType, TBaseScalar]]:
    return self.contents[key]

def match_dtype(self, dtype: TBaseDType) -> ZDType[TBaseDType, TBaseScalar]:
Echoing the docstrings comment here. I was trying to register a dtype, ran into an error here, and spent a few minutes confused about why it was done this way until I remembered the issue with exhaustively listing dtypes.

One question here: could it be worth trying a key lookup first, and only then resorting to iterating over the items?
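That suggestion, sketched (the registry shape and the `aliases` hook are illustrative, not zarr-python's API):

```python
class Bfloat16Stub:
    # hypothetical alternate spellings this wrapper also answers to
    aliases = ("bfloat16",)

def match_dtype(registry: dict[str, type], dtype_name: str) -> type:
    # fast path: exact key lookup
    if dtype_name in registry:
        return registry[dtype_name]
    # slow path: iterate, letting each entry declare alternate matches
    for cls in registry.values():
        if dtype_name in getattr(cls, "aliases", ()):
            return cls
    raise ValueError(f"No data type wrapper found that matches dtype '{dtype_name}'")

registry = {"bf16": Bfloat16Stub}
```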
I personally prefer key-value lookup. The big problem here is the (cursed) numpy object dtype, which potentially resolves to multiple zarr dtypes. I opened an issue to talk about this particular problem here.
Makes sense. All I'm saying is that, in the absence of a resolution of #3077, adding a comment explaining that would be very helpful to someone in the future.
src/zarr/codecs/blosc.py
Outdated
@@ -136,7 +136,7 @@ def to_dict(self) -> dict[str, JSON]:
     }

 def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
-    dtype = array_spec.dtype
+    dtype = array_spec.dtype.to_dtype()
Or maybe as_native_dtype
class Float16(BaseFloat[np.dtypes.Float16DType, np.float16]):
    dtype_cls = np.dtypes.Float16DType
@d-v-b how would I specify the dtype_cls for a non-native numpy dtype that has been registered with numpy? I'm thinking specifically of these: https://github.com/jax-ml/ml_dtypes?tab=readme-ov-file#ml_dtypes

For context, I tried to make a very minimal implementation of bfloat16 (https://github.com/zarr-developers/zarr-extensions/tree/main/data-types/bfloat16) to better understand the mechanisms at work here. To keep it minimal I tried to inherit from BaseFloat, but I couldn't figure out a version of dtype_cls that would not throw errors when to_dtype was invoked.
import ml_dtypes
import numpy as np
from ml_dtypes import bfloat16
from dataclasses import dataclass

import zarr
from zarr.core.dtype import data_type_registry
from zarr.core.dtype.npy.float import BaseFloat


@dataclass(frozen=True, kw_only=True, slots=True)
class Zbfloat16(BaseFloat):
    """
    Extremely minimal wrapper for bfloat16
    """
    _zarr_v3_name = "bfloat16"
    dtype_cls = np.dtype("bfloat16")
    # Directly using the ml_dtypes class here does not
    # avoid the error
    dtype_cls = bfloat16

    def to_dtype(self) -> bfloat16:
        return np.dtype("bfloat16")


store = zarr.storage.MemoryStore()
# surprised that this works before registering?
# registering is only to pick up the string based names?
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)

# register
data_type_registry.register("bfloat16", Zbfloat16)
# string based access breaks
# due to something with `to_dtype` not working within `match_dtype`
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype="bfloat16")
Gives a long error message that ultimately ends with:

--> 134 return data_type_registry.match_dtype(na_dtype)

File ~/Documents/dev/zarr-python/src/zarr/core/dtype/registry.py:46, in DataTypeRegistry.match_dtype(self, dtype)
     44 except DataTypeValidationError:
     45     pass
---> 46 raise ValueError(f"No data type wrapper found that matches dtype '{dtype}'")

ValueError: No data type wrapper found that matches dtype 'bfloat16'
My approach here is based on looking at the examples of the numpy dtypes, and I think it would be pretty representative of what someone might try when implementing their own. From that, I conclude that a short example showing bfloat16 with some explanatory comments would go a really long way towards good documentation of how to write a custom dtype.
thanks so much for working through this, and I totally agree that it would be a great part of the documentation. I am 99% sure your problems were due to this line:
dtype_cls = np.dtype("bfloat16")
dtype_cls should be bound to the class that creates the dtype. You are binding it to a concrete dtype instance. The whole idea of ZDType is to allow zarr-python to abstract over the process of creating variants of the same dtype class (e.g., big-endian float64 or little-endian float64). That means we can't associate a specific dtype instance with a ZDType; we have to use the dtype class instead. So in your case it should be:

dtype_cls = type(np.dtype("bfloat16"))

This is necessary because (annoyingly) the ml_dtypes package doesn't provide a stand-alone bfloat16 class.

I will double check that this works later today when I'm not on a train. And in any case, this absolutely needs to be documented better. Clearly the attribute name dtype_cls doesn't convey the idea clearly enough.
addendum:
this line was also problematic:
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)
You are passing Zbfloat16 in as a class, but it should be an instance (Zbfloat16()).
Unfortunately my branch is happy with the invalid input and somehow matched it to the variable length string dtype! so definitely an interesting bug.
here is an example that works:
# /// script
# dependencies = [
#     "zarr@git+https://github.com/d-v-b/zarr-python.git@feat/fixed-length-strings",
#     "ml_dtypes"
# ]
# ///
import ml_dtypes
import numpy as np
from ml_dtypes import bfloat16
from dataclasses import dataclass

import zarr
from zarr.core.dtype import data_type_registry
from zarr.core.dtype.npy.float import BaseFloat


@dataclass(frozen=True, kw_only=True, slots=True)
class Zbfloat16(BaseFloat):
    """
    Extremely minimal wrapper for bfloat16
    """
    _zarr_v3_name = "bfloat16"
    dtype_cls = type(np.dtype("bfloat16"))

    def to_dtype(self) -> bfloat16:
        return np.dtype("bfloat16")


store = zarr.storage.MemoryStore()
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16())
z[:] = 1.0
print(z[:])        # [1 1 1 1 1]
print(z[:].dtype)  # bfloat16

# register
data_type_registry.register("bfloat16", Zbfloat16)

z2 = zarr.open_array(store=store, mode='r')
print(z2[:])        # [1 1 1 1 1]
print(z2[:].dtype)  # bfloat16
To address this comment:
surprised that this works before registering?
registering is only to pick up the string based names?
Right: we need to infer a zdtype from a string (or dict) when reading dtype metadata from JSON. In the other direction, we have to infer a zdtype from a Python object (like a numpy dtype) that represents an in-memory dtype. At the moment you can bypass the in-memory-object -> zdtype path by providing a zdtype instance explicitly. We could also offer a similar bypass for reading data from disk, but I haven't implemented that yet.
Thanks for the example! Very helpful
this line was also problematic:
z = zarr.create_array(store=store, shape=(5,), chunks=(1,), dtype=Zbfloat16)
You are passing Zbfloat16 in as a class, but it should be an instance (Zbfloat16())
Unfortunately my branch is happy with the invalid input and somehow matched it to the variable length string dtype! so definitely an interesting bug.
On the wrapper, from_dtype is defined as a classmethod, which is why this works.
This is necessary because (annoyingly) the ml_dtypes package doesn't provide a stand-alone bfloat16 class.
Ok I thought that might be the case
idea is to allow zarr-python to abstract over the process of creating variants of the same dtype class (e.g., big-endian float64 or little-endian float64)
Notably, the ml_dtypes are exclusively little endian. That's why I re-wrote to_dtype to avoid the endianness handling in BaseFloat.
I totally agree that it would be a great part of the documentation.
As an easy step in this direction, I suggest taking what you wrote in this thread and putting it into the docs, maybe with a small caveat that the typing is not implemented and that this shouldn't inherit from HasEndianess.
self.lazy_load_list.clear()

def register(self: Self, key: str, cls: type[ZDType[TBaseDType, TBaseScalar]]) -> None:
I think it makes sense to add an unregister method as well.
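A minimal sketch of what that could look like (this registry is simplified relative to zarr-python's DataTypeRegistry):

```python
class MiniRegistry:
    def __init__(self) -> None:
        self.contents: dict[str, type] = {}

    def register(self, key: str, cls: type) -> None:
        self.contents[key] = cls

    def unregister(self, key: str) -> None:
        # raises KeyError for unknown keys, mirroring dict semantics
        del self.contents[key]
```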
…se cast_value where appropriate
@d-v-b I have a little bug report here on this PR (which may be intentional):

import zarr
z = zarr.open("foo.zarr", zarr_format=2)
import numcodecs
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()])

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[4], line 1
----> 1 z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()])
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/_compat.py:61, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
51 warnings.warn(
52 (
53 f"Pass {formatted_args_msg} as keyword args. From version "
(...) 58 stacklevel=2,
59 )
60 kwargs.update(zip(sig.parameters, args, strict=False))
---> 61 return f(**kwargs)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:2477, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
2382 """Create an array within this group.
2383
2384 This method lightly wraps :func:`zarr.core.array.create_array`.
(...) 2471 AsyncArray
2472 """
2473 compressors = _parse_deprecated_compressor(
2474 compressor, compressors, zarr_format=self.metadata.zarr_format
2475 )
2476 return Array(
-> 2477 self._sync(
2478 self._async_group.create_array(
2479 name=name,
2480 shape=shape,
2481 dtype=dtype,
2482 chunks=chunks,
2483 shards=shards,
2484 fill_value=fill_value,
2485 attributes=attributes,
2486 chunk_key_encoding=chunk_key_encoding,
2487 compressors=compressors,
2488 serializer=serializer,
2489 dimension_names=dimension_names,
2490 order=order,
2491 filters=filters,
2492 overwrite=overwrite,
2493 storage_options=storage_options,
2494 config=config,
2495 )
2496 )
2497 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
206 # TODO: refactor this to to take *args and **kwargs and pass those to the method
207 # this should allow us to better type the sync wrapper
--> 208 return sync(
209 coroutine,
210 timeout=config.get("async.timeout"),
211 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
160 return_result = next(iter(finished)).result()
162 if isinstance(return_result, BaseException):
--> 163 raise return_result
164 else:
165 return return_result
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
114 """
115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
116 exception, the exception will be returned.
117 """
118 try:
--> 119 return await coro
120 except Exception as ex:
121 return ex
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1102, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
1007 """Create an array within this group.
1008
1009 This method lightly wraps :func:`zarr.core.array.create_array`.
(...) 1097
1098 """
1099 compressors = _parse_deprecated_compressor(
1100 compressor, compressors, zarr_format=self.metadata.zarr_format
1101 )
-> 1102 return await create_array(
1103 store=self.store_path,
1104 name=name,
1105 shape=shape,
1106 dtype=dtype,
1107 chunks=chunks,
1108 shards=shards,
1109 filters=filters,
1110 compressors=compressors,
1111 serializer=serializer,
1112 fill_value=fill_value,
1113 order=order,
1114 zarr_format=self.metadata.zarr_format,
1115 attributes=attributes,
1116 chunk_key_encoding=chunk_key_encoding,
1117 dimension_names=dimension_names,
1118 storage_options=storage_options,
1119 overwrite=overwrite,
1120 config=config,
1121 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:4466, in create_array(store, name, shape, dtype, data, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config, write_data)
4461 mode: Literal["a"] = "a"
4463 store_path = await make_store_path(
4464 store, path=name, mode=mode, storage_options=storage_options
4465 )
-> 4466 return await init_array(
4467 store_path=store_path,
4468 shape=shape_parsed,
4469 dtype=dtype_parsed,
4470 chunks=chunks,
4471 shards=shards,
4472 filters=filters,
4473 compressors=compressors,
4474 serializer=serializer,
4475 fill_value=fill_value,
4476 order=order,
4477 zarr_format=zarr_format,
4478 attributes=attributes,
4479 chunk_key_encoding=chunk_key_encoding,
4480 dimension_names=dimension_names,
4481 overwrite=overwrite,
4482 config=config,
4483 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:4240, in init_array(store_path, shape, dtype, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, overwrite, config)
4237 else:
4238 order_parsed = order
-> 4240 meta = AsyncArray._create_metadata_v2(
4241 shape=shape_parsed,
4242 dtype=dtype_wrapped,
4243 chunks=chunk_shape_parsed,
4244 dimension_separator=chunk_key_encoding_parsed.separator,
4245 fill_value=fill_value,
4246 order=order_parsed,
4247 filters=filters_parsed,
4248 compressor=compressor_parsed,
4249 attributes=attributes,
4250 )
4251 else:
4252 array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
4253 compressors=compressors,
4254 filters=filters,
4255 serializer=serializer,
4256 dtype=dtype_wrapped,
4257 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/array.py:783, in AsyncArray._create_metadata_v2(shape, dtype, chunks, order, dimension_separator, fill_value, filters, compressor, attributes)
781 if fill_value is None:
782 fill_value = dtype.default_value() # type: ignore[assignment]
--> 783 return ArrayV2Metadata(
784 shape=shape,
785 dtype=dtype,
786 chunks=chunks,
787 order=order,
788 dimension_separator=dimension_separator,
789 fill_value=fill_value,
790 compressor=compressor,
791 filters=filters,
792 attributes=attributes,
793 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/metadata/v2.py:89, in ArrayV2Metadata.__init__(self, shape, dtype, chunks, fill_value, order, dimension_separator, compressor, filters, attributes)
87 fill_value_parsed: TBaseScalar | None
88 if fill_value is not None:
---> 89 fill_value_parsed = dtype.cast_value(fill_value)
90 else:
91 fill_value_parsed = fill_value
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/dtype/wrapper.py:168, in ZDType.cast_value(self, data)
161 return self._cast_value_unsafe(data)
162 msg = (
163 f"The value {data} failed a type check. "
164 f"It cannot be safely cast to a scalar compatible with {self.dtype_cls}. "
165 f"Consult the documentation for {self} to determine the possible values that can "
166 "be cast to scalars of the wrapped data type."
167 )
--> 168 raise TypeError(msg)
TypeError: The value 0 failed a type check. It cannot be safely cast to a scalar compatible with <class 'numpy.dtypes.StringDType'>. Consult the documentation for VariableLengthString() to determine the possible values that can be cast to scalars of the wrapped data type.
My .zarray from zarr 3.0.8:
Thanks for this report! After this PR, I don't think we want to silently convert 0 to the empty string. IMO, users who want to use a string data type should specify a string fill value.
This requires changing the default fill value from 0 to None, which will instruct zarr-python to use the per-dtype default fill value.
…at/fixed-length-strings
@d-v-b I am seeing test failures with some old data we have written out that uses the 0-as-string fill value:

    def from_json_value(self, data: JSON, *, zarr_format: ZarrFormat) -> str:
        if not check_json_str(data):
>           raise TypeError(f"Invalid type: {data}. Expected a string.")
E           TypeError: Invalid type: 0. Expected a string.
E           Error raised while reading key 'var' of <class 'zarr.core.group.Group'> from /raw

I will push my branch and show you the CI so you have a bit more to look at. I think this is related to #2792 (comment)
🙏 thank you for checking this! It's super helpful to me.
…ct dtype (i.e., refuse to match it)
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.
In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.
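To make the parametrization point concrete, a toy fixed-length string dtype class might look like this (the names and JSON layout are illustrative, not the spec's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FixedLengthString:
    # the length parameter is exactly what an enum of strings cannot express
    length: int

    def default_value(self) -> str:
        return ""

    def to_json(self) -> dict:
        return {"name": "fixed_length_string", "configuration": {"length": self.length}}
```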