refactor: Rework Categorical/Enum to use (Frozen)Categories #23016

Merged
merged 97 commits into from
Jul 3, 2025

orlp
Member

@orlp orlp commented May 30, 2025

Fixes #3036.
Fixes #14247.
Fixes #14996.
Fixes #15293.
Fixes #15781.
Fixes #17479.
Fixes #17643.
Fixes #18065.
Fixes #18501.
Fixes #19868.
Fixes #19943.
Fixes #20290.
Fixes #20318.
Fixes #20364.
Fixes #20562.
Fixes #20878.
Fixes #20931.
Fixes #21175.
Fixes #21583.
Fixes #22448.
Fixes #22586.
Fixes #22664.
Fixes #22830.
Fixes #23015.
Fixes #23071.
Fixes #23289.

This PR replaces essentially the entire Categorical/Enum implementation. Unfortunately, some breakage was unavoidable:

  • Physical ordering for Categoricals has been removed; the ordering is now always lexical. The parameter has been deprecated. Passing "physical" as the ordering is not a hard error, it simply no longer has any effect.
  • A new file format for Parquet is introduced. Reading older Parquet files remains backwards-compatible, but new files that contain Enums are read back as Categoricals by older versions of Polars.
  • Casts between Categorical and integer types now always refer to the physical categories. These casts will be deprecated and removed at a later stage, once we have dedicated functions to go to/from categories. The casts to/from String still exist and will remain; all other casts have been removed.
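
To make the difference between physical and lexical ordering concrete, here is a minimal pure-Python sketch. It is a toy model, not Polars' internal representation: the `encode` helper and the first-appearance code assignment are assumptions made for illustration only.

```python
# Toy model of a categorical column: each distinct string is stored once,
# and every row holds an integer code pointing at it.
# Illustration only; `encode` is a hypothetical helper, not a Polars function.

def encode(values):
    """Assign each distinct string a code in order of first appearance."""
    categories = []  # code -> string
    index = {}       # string -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(categories)
            categories.append(v)
        codes.append(index[v])
    return categories, codes


values = ["pear", "apple", "fig", "apple"]
categories, codes = encode(values)

# "Physical" ordering: sort rows by their integer code.
physical_sorted = [categories[c] for c in sorted(codes)]

# "Lexical" ordering: sort rows by the string each code stands for.
lexical_sorted = [categories[c] for c in sorted(codes, key=lambda c: categories[c])]

print(codes)            # integer view of the column (the "physical categories")
print(physical_sorted)  # order depends on how codes happened to be assigned
print(lexical_sorted)   # order depends only on the strings
```

In this model the integer view of the column (`codes`) is what a cast to an integer type would expose, while sorting always goes through the strings.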

The concept of local and global categories is gone. The StringCache still exists in Python, but does nothing anymore, and will be deprecated and removed later.

In a future PR we will expose the capabilities of the new Categories system, which lets you specify in the DataType which columns should share the same categorical mapping.
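
As a rough illustration of what "sharing the same categorical mapping" means, the following pure-Python sketch encodes two columns against one mapping object so that their integer codes are directly comparable. The `SharedCategories` class and `encode_column` function are hypothetical names invented for this sketch; they are not the Polars API, which this PR does not yet expose.

```python
# Hypothetical stand-in for a named category mapping shared by several columns.
# This is NOT the Polars API; it only illustrates the idea of a shared mapping.

class SharedCategories:
    def __init__(self, name):
        self.name = name
        self._code_of = {}   # string -> code
        self._strings = []   # code -> string

    def code(self, s):
        """Return the code for `s`, registering it on first use."""
        if s not in self._code_of:
            self._code_of[s] = len(self._strings)
            self._strings.append(s)
        return self._code_of[s]

    def string(self, code):
        return self._strings[code]


def encode_column(values, cats):
    """Encode a list of strings against the given shared mapping."""
    return [cats.code(v) for v in values]


fruits = SharedCategories("fruits")
left = encode_column(["apple", "fig"], fruits)
right = encode_column(["fig", "kiwi", "apple"], fruits)

# Because both columns use the same mapping, equal codes mean equal strings,
# so rows can be matched on integers without any re-encoding step.
matches = [(i, j) for i, a in enumerate(left) for j, b in enumerate(right) if a == b]
print(left, right, matches)
```

The design point is that the mapping is chosen explicitly per column rather than through a global switch, which is the role the sketch's `fruits` object plays.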

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels May 30, 2025
@orlp orlp force-pushed the cat-rework branch 5 times, most recently from 72307c2 to 863cf09 Compare June 6, 2025 13:37
@orlp orlp force-pushed the cat-rework branch 4 times, most recently from ddb7532 to 9036ef6 Compare July 1, 2025 10:02
@orlp orlp marked this pull request as ready for review July 3, 2025 14:56
codecov bot commented Jul 3, 2025

Codecov Report

Attention: Patch coverage is 78.91129% with 523 lines in your changes missing coverage. Please review.

Project coverage is 80.87%. Comparing base (348a34d) to head (0644413).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-core/src/datatypes/any_value.rs 29.72% 52 Missing ⚠️
crates/polars-row/src/encode.rs 52.04% 47 Missing ⚠️
crates/polars-row/src/variable/utf8.rs 0.00% 47 Missing ⚠️
...s-core/src/chunked_array/comparison/categorical.rs 84.43% 40 Missing ⚠️
...ars-core/src/series/implementations/categorical.rs 86.25% 29 Missing ⚠️
crates/polars-dtype/src/categorical/mod.rs 86.80% 26 Missing ⚠️
...tes/polars-core/src/series/implementations/time.rs 66.17% 23 Missing ⚠️
crates/polars-core/src/frame/column/mod.rs 9.09% 20 Missing ⚠️
...polars-core/src/series/implementations/duration.rs 64.81% 19 Missing ⚠️
...s-core/src/chunked_array/builder/list/anonymous.rs 31.81% 15 Missing ⚠️
... and 48 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #23016      +/-   ##
==========================================
+ Coverage   80.68%   80.87%   +0.18%     
==========================================
  Files        1645     1632      -13     
  Lines      221895   220133    -1762     
  Branches     2783     2782       -1     
==========================================
- Hits       179036   178027    -1009     
+ Misses      42197    41445     -752     
+ Partials      662      661       -1     

@ritchie46
Member

Yeah bu'

@ritchie46 ritchie46 merged commit 5246d17 into pola-rs:main Jul 3, 2025
33 checks passed
@ritchie46 ritchie46 added the highlight Highlight this PR in the changelog label Jul 3, 2025
Collaborator

@coastalwhite coastalwhite left a comment

Some small comments. Super nice to land this! Could you do a doc update maybe?

@orlp
Member Author

orlp commented Jul 4, 2025

@coastalwhite I addressed most of your concerns, please respond to the others.
