Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
from time import perf_counter
import os

import polars as pl

os.environ["POLARS_VERBOSE"] = "1"

for times in [1_000, 10_000, 100_000]:
    # Three columns with identical shape; only the first contains nulls.
    df = pl.DataFrame(
        {
            "list_with_null": [[1, 2, None], [4, None, 6], [7, None, 9]] * times,
            "list_without_null": [[1, 2, 8], [4, 3, 6], [7, 5, 9]] * times,
            "list": [[1, 2, 3], [4, 5, 6], [7, 8, 9]] * times,
        }
    )
    # Time the same operation on the null-free column, then on the column with nulls.
    a = perf_counter()
    df.with_columns(pl.col("list_without_null").list.set_difference(pl.col("list")))
    b = perf_counter()
    df.with_columns(pl.col("list_with_null").list.set_difference(pl.col("list")))
    c = perf_counter()
    print(f"""
Multiplier: {times:,}
Time for list_without_null: {b - a:.3f}s
Time for list_with_null: {c - b:.3f}s
Slowdown: {(c - b) / (b - a):.1f}x
""")
Log output
Multiplier: 1,000
Time for list_without_null: 0.002s
Time for list_with_null: 0.007s
Slowdown: 3.1x
Multiplier: 10,000
Time for list_without_null: 0.003s
Time for list_with_null: 0.649s
Slowdown: 251.3x
Multiplier: 100,000
Time for list_without_null: 0.026s
Time for list_with_null: 64.453s
Slowdown: 2523.4x
Issue description
list.set_difference is extremely slow on large frames when the lists contain null elements, and the slowdown grows with frame size (3.1x at a multiplier of 1,000, 2523.4x at 100,000).
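As a rough check on how the cost scales, the timings from the log output can be turned into an estimated growth exponent per 10x step in frame size. This is only a sketch based on the three measurements above; the helper name `growth_exponents` is made up for illustration, and the smallest timings are rounded to milliseconds, so the first step is imprecise.

```python
from math import log

# Timings in seconds, copied from the log output (multiplier -> time).
with_null = {1_000: 0.007, 10_000: 0.649, 100_000: 64.453}
without_null = {1_000: 0.002, 10_000: 0.003, 100_000: 0.026}


def growth_exponents(timings):
    """Estimate k in time ~ n**k for each pair of consecutive measurements."""
    points = sorted(timings.items())
    return [
        log(t2 / t1) / log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


print("with nulls:   ", [round(k, 2) for k in growth_exponents(with_null)])
print("without nulls:", [round(k, 2) for k in growth_exponents(without_null)])
```

An exponent near 2 for the null-containing column would be consistent with quadratic behavior in the number of rows, while the null-free column stays at or below roughly linear growth. Three data points cannot confirm the cause; they only indicate where to look.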
Expected behavior
The two calls should take comparable time, whether or not the lists contain nulls.
Installed versions
--------Version info---------
Polars: 1.29.0
Index type: UInt32
Platform: macOS-15.5-arm64-arm-64bit
Python: 3.12.8 (main, Dec 3 2024, 18:42:41) [Clang 16.0.0 (clang-1600.0.26.4)]
LTS CPU: False
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair <not installed>
azure.identity <not installed>
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec <not installed>
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib <not installed>
numpy <not installed>
openpyxl <not installed>
pandas <not installed>
polars_cloud <not installed>
pyarrow <not installed>
pydantic <not installed>
pyiceberg <not installed>
sqlalchemy <not installed>
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>