list.set_difference is ~2500x slower (quadratic behavior?) for columns with null values. #22751

Open
@avimallu

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from time import perf_counter
import os
import polars as pl

os.environ["POLARS_VERBOSE"] = "1"

for times in [1_000, 10_000, 100_000]:
    df = pl.DataFrame(
        {
            "list_with_null": [[1, 2, None], [4, None, 6], [7, None, 9]] * times,
            "list_without_null": [[1, 2, 8], [4, 3, 6], [7, 5, 9]] * times,
            "list": [[1, 2, 3], [4, 5, 6], [7, 8, 9]] * times,
        }
    )

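    # Time the same operation twice: first on the null-free column,
    # then on the column whose lists contain nulls.
    a = perf_counter()
    df.with_columns(pl.col("list_without_null").list.set_difference(pl.col("list")))
    b = perf_counter()
    df.with_columns(pl.col("list_with_null").list.set_difference(pl.col("list")))
    c = perf_counter()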

    print(f"""
Multiplier: {times:,}
Time for list_without_null: {b - a:.3f}s
Time for list_with_null: {c - b:.3f}s
Slowdown: {(c - b) / (b - a):.1f}x
    """)

Log output

Multiplier: 1,000
Time for list_without_null: 0.002s
Time for list_with_null: 0.007s
Slowdown: 3.1x
    

Multiplier: 10,000
Time for list_without_null: 0.003s
Time for list_with_null: 0.649s
Slowdown: 251.3x
    

Multiplier: 100,000
Time for list_without_null: 0.026s
Time for list_with_null: 64.453s
Slowdown: 2523.4x

Issue description

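list.set_difference becomes extremely slow on large frames when the list column contains null elements. In the example above, going from 30,000 to 300,000 rows raises the runtime from 0.649s to 64.453s for the column with nulls, while the null-free column goes from 0.003s to 0.026s.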
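
As a rough check on the "quadratic behavior?" guess in the title, the timings from the log output above can be turned into an empirical scaling exponent. This is only a sketch: the helper name scaling_exponent is hypothetical, and the millisecond-level timings are noisy, so only the two largest sizes are compared.

```python
from math import log

# (multiplier, seconds) pairs copied from the log output above.
with_null = [(1_000, 0.007), (10_000, 0.649), (100_000, 64.453)]
without_null = [(1_000, 0.002), (10_000, 0.003), (100_000, 0.026)]


def scaling_exponent(small, large):
    """Estimate k in time ~ n**k from two (n, seconds) measurements.

    Hypothetical helper for illustration; not part of Polars.
    """
    (n1, t1), (n2, t2) = small, large
    return log(t2 / t1) / log(n2 / n1)


# Compare the two largest sizes, where timer noise matters least.
k_with_null = scaling_exponent(with_null[1], with_null[2])
k_without_null = scaling_exponent(without_null[1], without_null[2])

print(f"with nulls:    time ~ n^{k_with_null:.2f}")
print(f"without nulls: time ~ n^{k_without_null:.2f}")
```

With these numbers the exponent comes out close to 2 for the column with nulls and close to 1 for the null-free column, which is consistent with quadratic versus linear scaling but does not by itself identify the cause.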

Expected behavior

The two calls should take roughly the same time, whether or not the lists contain nulls.
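
To illustrate why nulls should not need to cost more, here is a pure-Python reference model of a per-row set difference. It is a sketch under stated assumptions, not Polars' implementation: reference_set_difference is a hypothetical helper, it treats None as an ordinary element, and it keeps first-occurrence order, which may differ from the order Polars returns.

```python
def reference_set_difference(left, right):
    """Return the distinct elements of `left` that do not appear in `right`.

    Hypothetical reference model for illustration; not Polars' algorithm.
    """
    exclude = set(right)  # None is hashable, so nulls need no special case
    # dict.fromkeys deduplicates while keeping first-occurrence order.
    return [x for x in dict.fromkeys(left) if x not in exclude]


# The three distinct rows from the reproducible example.
rows_with_null = [[1, 2, None], [4, None, 6], [7, None, 9]]
rows_without_null = [[1, 2, 8], [4, 3, 6], [7, 5, 9]]
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

expected_with_null = [
    reference_set_difference(a, b) for a, b in zip(rows_with_null, rows)
]
expected_without_null = [
    reference_set_difference(a, b) for a, b in zip(rows_without_null, rows)
]

print(expected_with_null)
print(expected_without_null)
```

In this model each row costs time proportional to its own length, so the total work grows linearly with the number of rows for both columns.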

Installed versions

--------Version info---------
Polars:              1.29.0
Index type:          UInt32
Platform:            macOS-15.5-arm64-arm-64bit
Python:              3.12.8 (main, Dec  3 2024, 18:42:41) [Clang 16.0.0 (clang-1600.0.26.4)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
polars_cloud         <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Metadata

Labels: P-medium (Priority: medium), bug (Something isn't working), performance (Performance issues or improvements), python (Related to Python Polars)
