Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: disallow specifying the 'str' dtype with storage in [..] in string alias #60661

Conversation

jorisvandenbossche
Copy link
Member

The intention was for the new default "str" dtype to not include the storage in the string alias, and so to also not allow constructing it that way (this is discussed in the PDEP).

This is also implemented this way, as you can see when directly calling the extension dtype API:

# for "string" this is allowed
>>> pd.StringDtype.construct_from_string("string[pyarrow]")
string[pyarrow]

# but for "str" not
>>> pd.StringDtype.construct_from_string("str[pyarrow]")
...
TypeError: Cannot construct a 'StringDtype' from 'str[pyarrow]'

However, when specifying this as a dtype argument in eg constructors (going through pandas_dtype(...), which goes through the extension dtype registry), this "accidentally" kind of works, but gives an unexpected result:

# this raises as expected
>>> pd.Series(["a"], dtype="str[python]")
TypeError: data type 'str[python]' not understood

# but this gives a result
>>> pd.Series(["a"], dtype="str[pyarrow]")
>>>
0    a
dtype: string[pyarrow]
>>> type(pd.Series(["a"], dtype="str[pyarrow]").dtype)
pandas.ArrowDtype

I think it is confusing that it does work in case of the pyarrow storage, but then does give a different dtype than what you would typically expect.
So I would rather just disallow this case (which is what this PR does), although this is a small breaking change for people currently using dtype="str[pyarrow]" to get the ArrowDtype.

@jorisvandenbossche jorisvandenbossche added Strings String extension data type and string data API - Consistency Internal Consistency of API/Behavior labels Jan 5, 2025
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Jan 5, 2025
@rhshadrach
Copy link
Member

I think it is confusing that it does work in case of the pyarrow storage, but then does give a different dtype than what you would typically expect. So I would rather just disallow this case (which is what this PR does), although this is a small breaking change for people currently using dtype="str[pyarrow]" to get the ArrowDtype.

This gives a definitively wrong result and so I think that puts it solidly in the bugfix camp, not a breaking change.

@WillAyd
Copy link
Member

WillAyd commented Jan 5, 2025

If we really cared we could convert that during the pickle read, although I dont think that would be a blocker. Generally using pickle to move from one environment to another is discouraged

@jorisvandenbossche
Copy link
Member Author

If we really cared we could convert that during the pickle read, although I dont think that would be a blocker. Generally using pickle to move from one environment to another is discouraged

@WillAyd that comment about pickle was meant for the discussion in #60639?

@jorisvandenbossche jorisvandenbossche merged commit 7415aca into pandas-dev:main Jan 13, 2025
51 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-disallow-str-pyarrow-alias branch January 13, 2025 09:48
Copy link

lumberbot-app bot commented Jan 13, 2025

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 7415aca37159a99f8f99d93a1908070ddf36178c
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am "Backport PR #60661: String dtype: disallow specifying the 'str' dtype with storage in [..] in string alias"
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60661-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60661 on branch 2.3.x (String dtype: disallow specifying the 'str' dtype with storage in [..] in string alias)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche
Copy link
Member Author

Manual backport -> #60715

@WillAyd
Copy link
Member

WillAyd commented Jan 13, 2025

@WillAyd that comment about pickle was meant for the discussion in #60639?

Ah...thanks for pointing that out

jorisvandenbossche added a commit that referenced this pull request Jan 14, 2025
…th storage in [..] in string alias (#60661) (#60715)

(cherry picked from commit 7415aca)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants