-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement first-class List type #60629
base: main
Are you sure you want to change the base?
Conversation
Does this assume moving forward with the logical type system PDEP? i.e. List type backed by multiple (in theory) implementations |
PDEP-14 would be nice but I don't think its required here. If we do not revert PDEP-10, then we can assume pyarrow is required and just build off of that. This can fit logically into the extension type system. We may just want to start referring to that as something else besides "numpy_nulllable," but there is an issue already open for that #59032 |
Yah I'd really rather avoid the changes this makes in that part of the code. Will comment in-line and see if we can find alternatives. |
pandas/core/internals/blocks.py
Outdated
try: | ||
return self.values.dtype | ||
except AttributeError: # PyArrow fallback | ||
return self.values.type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense to me. self.values should be the EA, and the EA.dtype should be the right thing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks. I think this is a holdover from an intermediate state and I didn't recognize the requirement here. Reverting this fixes a lot of the other comments you've made here as well - thanks!
pandas/core/internals/blocks.py
Outdated
if dtype: | ||
klass = get_block_type(dtype) | ||
else: | ||
klass = get_block_type(values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, values.dtype should be the ListDtype already. I don't see why passing dtype separately is necessary.
pandas/core/series.py
Outdated
@@ -505,7 +505,7 @@ def __init__( | |||
data = data.copy() | |||
else: | |||
data = sanitize_array(data, index, dtype, copy) | |||
data = SingleBlockManager.from_array(data, index, refs=refs) | |||
data = SingleBlockManager.from_array(data, dtype, index, refs=refs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if dtype
is your ListDtype, then data.dtype should be ListDtype at this point so the new argument should be unnecessary
pandas/io/formats/format.py
Outdated
@@ -1103,7 +1103,11 @@ def format_array( | |||
List[str] | |||
""" | |||
fmt_klass: type[_GenericArrayFormatter] | |||
if lib.is_np_dtype(values.dtype, "M"): | |||
if hasattr(values, "type") and values.type == "null": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do something more explicit than hasattr checks? i.e. isinstance(dtype, ListDtype) or whatever?
|
||
return ListArray(data) | ||
class TestListArray(BaseConstructorsTests): ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we moved away from this pattern to just ExtensionTests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea - I think we can move to that as this gets more production ready. I just wanted to start with something really small while in draft mode
the thing ive always assumed would be a PITA with a ListDtype is
vs
have you given any thought to that? |
For a List data type the first option wouldn't be possible, since those are scalars values. So I think the latter is correct; if you wanted unpacking I think you'd need to provide a list of lists |
ad6fa08
to
e25c0d4
Compare
@@ -460,6 +461,8 @@ def treat_as_nested(data) -> bool: | |||
len(data) > 0 | |||
and is_list_like(data[0]) | |||
and getattr(data[0], "ndim", 1) == 1 | |||
# TODO(wayd): hack so pyarrow list elements don't expand | |||
and not isinstance(data[0], pa.ListScalar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think have is list like return False for pyarrow scalar is less hacky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably true in this particular case, although I'm not sure how it will generalize to all uses of is_list_like
. Will do more research
@@ -494,7 +495,7 @@ def __init__( | |||
if not is_list_like(data): | |||
data = [data] | |||
index = default_index(len(data)) | |||
elif is_list_like(data): | |||
elif is_list_like(data) and not isinstance(dtype, ListDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about nested list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea this is a tough one to handle. I'm not sure if something like:
pd.Series([1, 2, 3], index=range(3), dtype=pd.ListDtype())
should raise or broadcast. I think the tests currently want it to broadcast, but we could override that expectation for this array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good example. also indexing is going to be a pain point. a lot of checking for nested listlikes happens in indexing.py and im wary of stumbling blocks there. also indexing in general with a ListDtype index. i think the solution is to interpret lists as sequences and only treat them as a scalar if explicitly wrapped in pa.ListScalar (or something equivalent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally started down that path but I think the problem with expecting pa.ListScalar
is that when you select from a ListArray, most users probably expect a plain python list back. So to use that same selection as an indexer I think its hard to avoid the built-in type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to use that same selection as an indexer
I think we have to just disallow that. Otherwise we have a long tail of places where we expect labels to be non-listlike and nested objects to mean 2D. e.g. df.set_index(list_column).T.set_index([1, 2, 3])
is ambiguous whether it wants three columns to be set as an index or a single column with label [1, 2, 3]
. I'm particularly worried about indexing.py code where we check for nested data getting way more complicated if the behavior has to depend on dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df.set_index(list_column).T.set_index([1, 2, 3])
is ambiguous whether it wants three columns to be set as an index or a single column with label[1, 2, 3]
Do we normally require index elements to be hashable? I think there's an argument to be made that we shouldn't allow this as a index, given Python lists aren't hashable (even though PyArrow lists are)
If we do want to allow this as an index, the example cited is a single column with label [1, 2, 3]
I'm particularly worried about indexing.py code where we check for nested data getting way more complicated if the behavior has to depend on dtypes.
Definitely a valid concern. Do you know what might not be covered in this area by the extension tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we normally require index elements to be hashable?
We don't do any validation at construction-time for an object dtype Index. I expect a bunch of methods will break with non-hashable elements.
I think there's an argument to be made that we shouldn't allow this as a index, given Python lists aren't hashable (even though PyArrow lists are)
Wouldn't that break a lot e.g. obj.groupby(list_dtype_column).method()
? I expect there are lots of methods/functions that do a obj.set_index(col).something().reset_index(col)
that would become fragile.
The simple solution to all of these problems is to require users to be explicit when they want to treat a list as a scalar.
Definitely a valid concern. Do you know what might not be covered in this area by the extension tests?
I can come up with a few examples if we go down this road, but that part of the code has a lot of paths so it won't be comprehensive. We'll be chasing down corner cases for a long time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's a good point about the groupby stuff. I don't have a strong objection to requiring an explicit scalar argument, though I guess the follow up question is do we want users passing a pyarrow scalar or do we want to create our own scalar type.
Not sure how other libraries like polars handle this, but @MarcoGorelli might have some insight
@@ -428,7 +428,7 @@ def _box_pa_scalar(cls, value, pa_type: pa.DataType | None = None) -> pa.Scalar: | |||
""" | |||
if isinstance(value, pa.Scalar): | |||
pa_scalar = value | |||
elif isna(value): | |||
elif not is_list_like(value) and isna(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment why this is necessary. e.g. off the top of my head I dont know if pa.ListScalar subclasses pa.Scalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case the isna(value) check fails when value is a list, since it doesn't return a boolean back.
The "not is_list_like" was a quick way to prevent this branch from throwing an exception, but open to better ways of expressing that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As to your question, pa.ListScalar does inherit from pa.Scalar (all Scalars in pyarrow do) but that is not the type that is hitting this branch, since it is caught in the one preceding
This should be three items [B, NA, A] with | ||
A < B and NA missing. | ||
""" | ||
pytest.skip("ListArray does not support sorting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not? dont python lists have order semantics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pyarrow doesn't directly support sorting lists, and if it did I'm not sure it would use the same semantics as Python (I don't know how other languages would care for that)
I suppose we could implement something ourselves to cast all elements to Python and use the language's list semantics, but that's likely a big change better left to a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
pandas/_libs/lib.pyx
Outdated
@@ -834,6 +834,9 @@ cpdef ndarray[object] ensure_string_array( | |||
if isinstance(val, bytes): | |||
# GH#49658 discussion of desired behavior here | |||
result[i] = val.decode() | |||
elif isinstance(val, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the cython code we use util.is_array for this check
pandas/io/formats/format.py
Outdated
return fmt_values | ||
|
||
|
||
class _ListFormatter(_GenericArrayFormatter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesnt look like this is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep dead code - thanks!
@@ -1,138 +0,0 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a huge deal but id prefer if we got the tests passing with the implementation still living here, then moved the implementation in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that's not a problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually thinking through this some more - some of the modules like pandas.core end up importing the ListDtype and ListArray. For that, maybe its better to move this out of tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK no big deal if its not easy
b4252c3
to
5a2b113
Compare
name="a", | ||
) | ||
actual = ser.list.len() | ||
expected = Series([3, 2, None], dtype=ArrowDtype(pa.int32()), name="a") | ||
expected = Series([3, 2, None], dtype=ArrowDtype(pa.int64()), name="a") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how we feel about the changes in this test module. I swapped us over to the first class type instead of the ArrowDtype list for all the tests, but we can also parametrize
The tricky thing about the logical type here is that, without other logical types we do get back the implementation detail of a pyarrow type back for things like list.len()
.
Alternatively we could decide on just returning pd.Int64Dtype()
here. That would abstract the implementation details of the ListArray and not lose any data, although the mask of the Int64Array requires more memory than pyarrow. This approach would also be more inline with what PDEP-13 proposes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking this over some more it should probably return pd.Int64Dtype()
for length. That would match what pd.StringDtype.len
returns
Is the idea for this to replace |
Yea I'd prefer this to be the user-facing API for working with list data, but having that fully replace the There's a few places in I/O where this can work its way into the |
@WillAyd |
@Dr-Irv that's a great idea although I'm hesitant to change the API of any existing functions without taking PDEP-13 further |
I don't think it's an API change. It's just a question of the |
That's an interesting idea. My main concern would be the portability of pandas, i.e. running code in one environment without pyarrow installed will yield a potentially different code path than what what you would get in an environment with pyarrow. I think the logical type abstraction would go a long ways to simplify that for end users. On the flip side, we are still somewhat in limbo as to whether or not pyarrow is actually required for pandas 3.0, given the status of PDEP-15 (we've also discussed this on Slack). If we hard commit back to pyarrow, then you could implement this with less friction I think Appreciate the input - I think there's a lot of potential here so just need to decide as a team where we feel comfortable with non-NumPy originated types in the default space |
Quick POC for now. There's a lot to do here but hoping to work in pieces. This currently assumes pyarrow is installed.
The blocks / formatting stuff is not super familiar to me so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach. I think the main problem I am having is the Block seems to want to infer the type from the values contained. That works for NumPy, but doesn't work with PyArrow, for example when you have an array of all
nulls
that is separately paired with a type oflist[string]