DOC (string dtype): updated 'Working with text data' for str dtype in pandas 3.0 #60535
pre-commit.ci autofix
Hi @mroeschke, I've updated the 'text.rst' file so that 'Working with text data' covers the str dtype in pandas 3.0. I noticed that after updating the branch, a commit for expressions.py was added, although I have not made any commits to that file. Could you please help me understand the changes and guide me through them?
Thanks for the PR! I think it'd be good to also link to the PDEP for history - perhaps at the top?
@@ -15,8 +15,9 @@ There are two ways to store text data in pandas:

 1. ``object`` -dtype NumPy array.
 2. :class:`StringDtype` extension type.
+3. ``str`` -dtype (default from pandas 3.0).
L14 needs to be updated. Also, should mention `string` here too I think.
The way this is written, it sounds like `str` is entirely separate from `StringDtype`, but that isn't the case.
import numpy as np
import pandas as pd

pd.set_option("future.infer_string", True)
ser1 = pd.Series(list("xyz"), dtype="str")
ser2 = pd.Series(list("xyz"), dtype=pd.StringDtype("pyarrow", np.nan))
print(ser1.dtype == ser2.dtype)
# True
Maybe just mention that `"str"`, when `future.infer_string` is set to True, is an alias for `pd.StringDtype("pyarrow", np.nan)` or `pd.StringDtype("python", np.nan)`, depending on whether `pyarrow` is installed.

-We recommend using :class:`StringDtype` to store text data.
+We recommend using the ``str`` dtype or :class:`StringDtype` to store text data.
Or `string`. Maybe reword to something like
We recommend __not__ using ``object`` dtype to store text data.
doc/source/user_guide/text.rst
Outdated
Use the nullable :class:`StringDtype` (``"string"``) when handling NA values in your string data. It offers
additional flexibility for missing values while maintaining compatibility with pandas' nullable types.
`str` also handles NA values; the difference is `np.NaN` vs `pd.NA`. I think it'd be good to clarify that here.
 .. _text.differences:

 Behavior differences
 ^^^^^^^^^^^^^^^^^^^^

-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype
+These are places where the behavior of ``StringDtype`` or ``str`` objects differ from
Needs a mention of `string` too.
Actually - can just leave this as `StringDtype` - no need to mention the aliases.
doc/source/user_guide/text.rst
Outdated
3. In comparison operations, :class:`arrays.StringArray`, ``Series`` backed
by a ``StringArray``, and ``str`` dtype will return an object with :class:`BooleanDtype`,
rather than a ``bool`` dtype object. Missing values in these types will propagate
in comparison operations, rather than always comparing unequal like :attr:`numpy.nan`.
I don't think this is true for `str`.
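For reference, the `BooleanDtype` behaviour described in this hunk can be checked for the nullable `"string"` dtype with a short sketch. It assumes only that pandas is installed; the variable names are illustrative, and it deliberately does not cover `str`, which is the case questioned above.

```python
import pandas as pd

# Same values stored with object dtype and with the nullable "string" dtype.
obj = pd.Series(["a", None, "c"], dtype=object)
nullable = pd.Series(["a", None, "c"], dtype="string")

obj_eq = obj == "a"
nullable_eq = nullable == "a"

print(obj_eq.dtype)                  # bool
print(nullable_eq.dtype)             # boolean
# The missing value propagates instead of comparing unequal.
print(nullable_eq.isna().tolist())   # [False, True, False]
```

With object dtype the missing entry simply compares as `False`, which is the difference the documentation paragraph is trying to describe.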
 ``Series``.

 .. _text.warn_types:

 .. warning::

-The type of the Series is inferred and is one among the allowed types (i.e. strings).
+The type of the Series is inferred as ``str`` or ``string`` depending on the context.
In what context do we infer `string`?
The `string` dtype is inferred when the data includes `pd.NA` or other nullable types, ensuring compatibility with pandas' nullable ecosystem. It is also inferred when explicitly specified by the user with `dtype="string"`. Otherwise, `str` is typically inferred for text data. Please correct me if I am wrong.
> The `string` dtype is inferred when the data includes `pd.NA` or other nullable types, ensuring compatibility with pandas' nullable ecosystem.

pandas will not infer `string` here. This is consistent with other nullable dtypes.
import pandas as pd

pd.set_option("future.infer_string", True)
ser = pd.Series(["a", pd.NA, "c"])
print(ser.dtype)
# str
ser = pd.Series([1, pd.NA, 3])
print(ser.dtype)
# object
> It is also inferred when explicitly specified by the user with `dtype="string"`.

This is not inference - inference, by definition, is the behavior when the dtype is not explicitly specified.
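To make the distinction concrete, here is a minimal sketch contrasting an inferred dtype with an explicitly requested one. It assumes a default pandas installation; the variable names are illustrative, and the inferred result depends on the pandas version and on whether `future.infer_string` is enabled, so no output is claimed for it.

```python
import pandas as pd

# No dtype given: pandas infers one from the data.
inferred = pd.Series(["a", "b", "c"])

# dtype given by the caller: nothing is inferred here.
explicit = pd.Series(["a", "b", "c"], dtype="string")

print(inferred.dtype)  # version- and option-dependent
print(explicit.dtype)  # string
```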
Thank you for the clarification. Shall I change it back to the original or update it with an explanation?
I think:
> When the option `future.infer_string` is set to True, the type of the Series is inferred as `str`.
doc/source/user_guide/text.rst
Outdated
@@ -396,7 +431,7 @@ Missing values on either side will result in missing values in the result as wel
 Concatenating a Series and something array-like into a Series
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the lengths of the calling ``Series`` (or ``Index``).
+The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the length of the calling ``Series`` (or ``Index``).
In `the number or rows`, "or" should be "of" instead.
return (
    _where(cond, left_op, right_op)
    if use_numexpr
    else _where_standard(cond, left_op, right_op)
)
This unintentional change was due to merging main; can you revert it?
This reverts commit 1f778dc.
I have made changes according to the review, but somehow I can't revert the changes in expressions.py. Can you please help me with it?
You can open the file with an editor and manually modify the file undoing the changes. Locally, you can run
and keep making modifications until that reports no differences. If you'd like, I can push a commit undoing the changes there.
That would be very helpful, thank you. I have made the changes as per the review; please let me know if any other modifications are required in the documentation.
@rhshadrach, is everything alright, or are modifications needed?
There are several mentions of `pd.StringDtype` or `str` as if these are distinct entities. As mentioned below, they are not. I think we should make it clear that `str` is an alias, and just mention `pd.StringDtype` from there on. But open to other approaches too.
@@ -15,8 +15,9 @@ There are two ways to store text data in pandas:

 1. ``object`` -dtype NumPy array.
 2. :class:`StringDtype` extension type.
+3. ``str`` -dtype (default from pandas 3.0).
The way this is written, it sounds like `str` is entirely separate from `StringDtype`, but that isn't the case.
import numpy as np
import pandas as pd

pd.set_option("future.infer_string", True)
ser1 = pd.Series(list("xyz"), dtype="str")
ser2 = pd.Series(list("xyz"), dtype=pd.StringDtype("pyarrow", np.nan))
print(ser1.dtype == ser2.dtype)
# True
Maybe just mention that `"str"`, when `future.infer_string` is set to True, is an alias for `pd.StringDtype("pyarrow", np.nan)` or `pd.StringDtype("python", np.nan)`, depending on whether `pyarrow` is installed.
when creating new data structures.

Use the nullable :class:`StringDtype` (``"string"``) or ``str`` dtype when handling NA values in your string data.
Note that ``StringDtype`` uses ``pd.NA`` for missing values, whereas ``str`` dtype uses ``np.NaN``. ``StringDtype``
This is incorrect, since `StringDtype` accepts an `na_value` argument to choose between `pd.NA` and `np.nan`. `np.NaN` has been removed as of NumPy 2.0, so shouldn't be used.
pd.Series(["a", "b", "c"], dtype="str")
pd.Series(["a", "b", "c"], dtype="string")
pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
I think this should also show passing arguments to StringDtype.
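One possible shape for such an example is sketched below. It is not the wording proposed for the docs. It assumes the `"python"` storage backend so that `pyarrow` is not required, and because `na_value` only exists in newer pandas releases, the sketch checks for it before use; the variable names are illustrative.

```python
import inspect

import numpy as np
import pandas as pd

# Explicit storage argument; the default missing-value sentinel is pd.NA.
nullable = pd.Series(["a", None, "c"], dtype=pd.StringDtype("python"))
print(nullable[1] is pd.NA)  # True

# Only pass na_value when the installed pandas accepts it.
has_na_value = "na_value" in inspect.signature(pd.StringDtype.__init__).parameters
if has_na_value:
    nan_based = pd.Series(
        ["a", None, "c"], dtype=pd.StringDtype("python", na_value=np.nan)
    )
    # Here the missing entry is stored as NaN rather than pd.NA.
    print(nan_based.isna().tolist())  # [False, True, False]
```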
Closes #60348
This PR updates the "Working with Text Data" page in the pandas documentation to reflect the change in pandas 3.0 where "str" dtype is now the default.