DOC (string dtype): updated 'Working with text data' for str dtype in pandas 3.0 #60535

Uvi-12 · 2024-12-10T18:22:39Z

This PR updates the "Working with Text Data" page in the pandas documentation to reflect the change in pandas 3.0 where "str" dtype is now the default.

… pandas 3.0

Uvi-12 · 2024-12-10T18:26:01Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

Uvi-12 · 2024-12-17T07:57:44Z

Hi @mroeschke, I've updated the 'text.rst' file ('Working with text data') for the str dtype in pandas 3.0.

I noticed that after updating the branch, a commit for expressions.py was added, although I have made no commits in the file. Could you please help me understand the changes and guide me through?

rhshadrach

Thanks for the PR! I think it'd be good to also link to the PDEP for history - perhaps at the top?

https://pandas.pydata.org/pdeps/0014-string-dtype.html

rhshadrach · 2024-12-30T16:00:43Z

doc/source/user_guide/text.rst

@@ -15,8 +15,9 @@ There are two ways to store text data in pandas:

 1. ``object`` -dtype NumPy array.
 2. :class:`StringDtype` extension type.
+3. ``str`` -dtype (default from pandas 3.0).


L14 needs to be updated. Also, should mention string here too I think.

The way this is written, it sounds like str is entirely separate from StringDtype, but that isn't the case.

    import numpy as np
    import pandas as pd

    pd.set_option("future.infer_string", True)
    ser1 = pd.Series(list("xyz"), dtype="str")
    ser2 = pd.Series(list("xyz"), dtype=pd.StringDtype("pyarrow", np.nan))
    print(ser1.dtype == ser2.dtype)  # True

Maybe just mention that "str", when future.infer_string is set to True, is an alias for pd.StringDtype("pyarrow", np.nan) or pd.StringDtype("python", np.nan), depending on whether pyarrow is installed.

rhshadrach · 2024-12-30T16:01:36Z

doc/source/user_guide/text.rst


-We recommend using :class:`StringDtype` to store text data.
+We recommend using the ``str`` dtype or :class:`StringDtype` to store text data.


Or string. Maybe reword to something like

We recommend __not__ using ``object`` dtype to store text data.

rhshadrach · 2024-12-30T16:03:03Z

doc/source/user_guide/text.rst

+Use the nullable :class:`StringDtype` (``"string"``) when handling NA values in your string data. It offers
+additional flexibility for missing values while maintaining compatibility with pandas' nullable types.


str also handles NA values; the difference is np.nan vs pd.NA. I think it'd be good to clarify that here.

rhshadrach · 2024-12-30T16:04:38Z

doc/source/user_guide/text.rst


 .. _text.differences:

 Behavior differences
 ^^^^^^^^^^^^^^^^^^^^

-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype
+These are places where the behavior of ``StringDtype`` or ``str`` objects differ from


Needs a mention of string too.

Actually - can just leave this as StringDtype - no need to mention the aliases.

rhshadrach · 2024-12-30T16:05:39Z

doc/source/user_guide/text.rst

+3. In comparison operations, :class:`arrays.StringArray`, ``Series`` backed
+   by a ``StringArray``, and ``str`` dtype will return an object with :class:`BooleanDtype`,
+   rather than a ``bool`` dtype object. Missing values in these types will propagate
+   in comparison operations, rather than always comparing unequal like :attr:`numpy.nan`.


I don't think this is true for str.

rhshadrach · 2024-12-30T16:07:57Z

doc/source/user_guide/text.rst

    ``Series``.

 .. _text.warn_types:

 .. warning::

-    The type of the Series is inferred and is one among the allowed types (i.e. strings).
+    The type of the Series is inferred as ``str`` or ``string`` depending on the context.


In what context do we infer string?

The string dtype is inferred when the data includes pd.NA or other nullable types, ensuring compatibility with pandas' nullable ecosystem. It is also inferred when explicitly specified by the user with dtype="string". Otherwise, str is typically inferred for text data. Please correct me if I am wrong.

> The string dtype is inferred when the data includes pd.NA or other nullable types, ensuring compatibility with pandas' nullable ecosystem.

pandas will not infer string here. This is consistent with other nullable dtypes.

    import pandas as pd

    pd.set_option("future.infer_string", True)
    ser = pd.Series(["a", pd.NA, "c"])
    print(ser.dtype)  # str
    ser = pd.Series([1, pd.NA, 3])
    print(ser.dtype)  # object

> It is also inferred when explicitly specified by the user with dtype="string".

This is not inference - inference, by definition, is the behavior when it is not explicitly specified.

Thank you for the clarification. Shall I change it back to the original, or update it with an explanation?

I think:

When the option future.infer_string is set to True, the type of the Series is inferred as str.

rhshadrach · 2024-12-30T16:09:19Z

doc/source/user_guide/text.rst

@@ -396,7 +431,7 @@ Missing values on either side will result in missing values in the result as wel
 Concatenating a Series and something array-like into a Series
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the lengths of the calling ``Series`` (or ``Index``).
+The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the length of the calling ``Series`` (or ``Index``).


In "the number or rows", "or" should be "of" instead.

rhshadrach · 2024-12-30T16:10:11Z

pandas/core/computation/expressions.py

+
+    return (
+        _where(cond, left_op, right_op)
+        if use_numexpr
+        else _where_standard(cond, left_op, right_op)
+    )


This unintentional change was due to merging main; can you revert it?

This reverts commit 1f778dc.

Uvi-12 · 2025-01-01T08:10:49Z

I have made changes according to the review, but somehow I can't revert the changes in expressions.py. Can you please help me with it?

rhshadrach · 2025-01-01T13:05:17Z

> I have made changes according to the review, but somehow I can't revert the changes in expressions.py. Can you please help me with it?

You can open the file with an editor and manually modify the file undoing the changes. Locally, you can run

git diff upstream/main pandas/core/computation/expressions.py

and keep making modifications until that reports no differences.

If you'd like, I can push a commit undoing the changes there.

Uvi-12 · 2025-01-01T13:15:27Z

> If you'd like, I can push a commit undoing the changes there.

That would be very helpful, thank you.

I have made the changes as per the review, please let me know if there are any other modifications required in the documentation.

Uvi-12 · 2025-01-09T13:00:05Z

@rhshadrach, is everything alright, or are modifications needed?

rhshadrach

There are several mentions of `pd.StringDtype` or `str` as if these are distinct entities. As mentioned below, they are not. I think we should make it clear that str is an alias, and just mention pd.StringDtype from there on. But open to other approaches too.

rhshadrach · 2025-01-09T20:46:02Z

doc/source/user_guide/text.rst

@@ -15,8 +15,9 @@ There are two ways to store text data in pandas:

 1. ``object`` -dtype NumPy array.
 2. :class:`StringDtype` extension type.
+3. ``str`` -dtype (default from pandas 3.0).


The way this is written, it sounds like str is entirely separate from StringDtype, but that isn't the case.

    import numpy as np
    import pandas as pd

    pd.set_option("future.infer_string", True)
    ser1 = pd.Series(list("xyz"), dtype="str")
    ser2 = pd.Series(list("xyz"), dtype=pd.StringDtype("pyarrow", np.nan))
    print(ser1.dtype == ser2.dtype)  # True

Maybe just mention that "str", when future.infer_string is set to True, is an alias for pd.StringDtype("pyarrow", np.nan) or pd.StringDtype("python", np.nan), depending on whether pyarrow is installed.

rhshadrach · 2025-01-09T20:49:44Z

doc/source/user_guide/text.rst

+when creating new data structures.
+
+Use the nullable :class:`StringDtype` (``"string"``) or ``str`` dtype when handling NA values in your string data.
+Note that ``StringDtype`` uses ``pd.NA`` for missing values, whereas ``str`` dtype uses ``np.NaN``. ``StringDtype``


This is incorrect, since StringDtype accepts an na_value argument to choose between pd.NA and np.nan. np.NaN has been removed as of NumPy 2.0, so it shouldn't be used.

rhshadrach · 2025-01-09T20:50:21Z

doc/source/user_guide/text.rst

+   pd.Series(["a", "b", "c"], dtype="str")
   pd.Series(["a", "b", "c"], dtype="string")
   pd.Series(["a", "b", "c"], dtype=pd.StringDtype())


I think this should also show passing arguments to StringDtype.

rhshadrach · 2025-01-09T20:53:51Z

doc/source/user_guide/text.rst


 .. _text.differences:

 Behavior differences
 ^^^^^^^^^^^^^^^^^^^^

-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype
+These are places where the behavior of ``StringDtype`` or ``str`` objects differ from


Actually - can just leave this as StringDtype - no need to mention the aliases.

rhshadrach · 2025-01-09T20:57:21Z

doc/source/user_guide/text.rst

    ``Series``.

 .. _text.warn_types:

 .. warning::

-    The type of the Series is inferred and is one among the allowed types (i.e. strings).
+    The type of the Series is inferred as ``str`` or ``string`` depending on the context.


I think:

When the option future.infer_string is set to True, the type of the Series is inferred as str.

DOC (string dtype): updated 'Working with text data' for str dtype in…

0660bc2

… pandas 3.0

pre-commit-ci bot and others added 3 commits December 10, 2024 18:28

[pre-commit.ci] auto fixes from pre-commit.com hooks

043c667

for more information, see https://pre-commit.ci

Merge branch 'main' into string

48dc7fd

Update expressions.py

1f778dc

mroeschke requested a review from jorisvandenbossche December 29, 2024 19:51

rhshadrach requested changes Dec 30, 2024


rhshadrach added Docs Strings String extension data type and string data labels Dec 30, 2024

Uvi-12 added 2 commits January 1, 2025 13:37

Updated documentation as per review

7a670d4

Revert "Update expressions.py"

bb777ea

This reverts commit 1f778dc.

Uvi-12 requested a review from rhshadrach January 1, 2025 08:09

Revert changes to expressions.py

2270883

rhshadrach requested changes Jan 9, 2025
