Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: removed median percentile to be always included in describe GH #60550 #60557

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v2.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,8 @@ Other enhancements
when using ``np.array()`` or ``np.asarray()`` on pandas objects) has been
updated to work correctly with NumPy >= 2 (:issue:`57739`)
- The :meth:`~Series.sum` reduction is now implemented for ``StringDtype`` columns (:issue:`59853`)
-
- Median percentile is only included in :meth:`~Series.describe` when a blank
list is passed (:issue:`60550`).
ZenithClown marked this conversation as resolved.
Show resolved Hide resolved

.. ---------------------------------------------------------------------------
.. _whatsnew_230.notable_bug_fixes:
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -10795,7 +10795,8 @@ def describe(
The percentiles to include in the output. All should
fall between 0 and 1. The default is
``[.25, .5, .75]``, which returns the 25th, 50th, and
75th percentiles.
75th percentiles. If a blank list is passed, then returns
only the 50th percentile value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this should be the behavior. If I pass an empty list, I should not get any percentiles.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to preserve the default percentile behavior. I will change this so that percentiles are omitted if a blank list is passed.

Copy link
Contributor

@asishm asishm Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should remove the 50th percentile without at least a deprecation warning. Reporting the median by default is incredibly helpful (by default)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we should not be changing the default behavior. Recommend setting no_default as the default value and maintaining the default behavior when it is not changed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so we will add an argument no_default : bool = False as the argument and preserve the behavior with control.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach I got confused. the default for percentiles is already None. not sure if it's worth changing to lib.no_default.

@ZenithClown I believe what @rhshadrach means is to change the behavior so that the 50th percentile is removed if a user explicitly passes in percentiles=[] instead of the default None (or the suggestion to change to no_default)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asishm - me too! Agreed leaving the default as None is fine.

include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored
for ``Series``. Here are the options:
Expand Down
11 changes: 6 additions & 5 deletions pandas/core/methods/describe.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,8 @@ def describe_ndframe(
percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1.
The default is ``[.25, .5, .75]``, which returns the 25th, 50th, and
75th percentiles.
75th percentiles.If a blank list is passed, then returns only the
ZenithClown marked this conversation as resolved.
Show resolved Hide resolved
ZenithClown marked this conversation as resolved.
Show resolved Hide resolved
50th percentile value.

Returns
-------
Expand Down Expand Up @@ -351,13 +352,13 @@ def _refine_percentiles(
# explicit conversion of `percentiles` to list
percentiles = list(percentiles)

# median should be included only if blank iterable is passed
if len(percentiles) == 0:
return np.array([0.5])

# get them all to be in [0, 1]
validate_percentile(percentiles)

# median should always be included
if 0.5 not in percentiles:
percentiles.append(0.5)

percentiles = np.asarray(percentiles)

# sort and check for duplicates
Expand Down
25 changes: 25 additions & 0 deletions pandas/tests/reductions/test_describe_ndframe.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# -*- encoding: utf-8 -*-

"""
We test the describe_ndframe function.
"""

import numpy as np
import pytest

from pandas.core.methods.describe import _refine_percentiles

@pytest.mark.parametrize(
"percentiles_, expected", [
(None, np.array([0.25, 0.5, 0.75])),
([], np.array([0.5])),
([0.3, 0.6], np.array([0.3, 0.6])),
]
)
def test_refine_percentiles(percentiles_, expected):
ZenithClown marked this conversation as resolved.
Show resolved Hide resolved
"""
Check the performance of the _refine_percentiles when multiple
values are passed.
"""

assert np.array_equal(_refine_percentiles(percentiles_), expected)
Loading