RFC: DataType expressions #22780

Open
coastalwhite opened this issue May 16, 2025 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@coastalwhite
Collaborator

coastalwhite commented May 16, 2025

Currently, it is impossible to lazily reason about the datatypes used in a query. Given a LazyFrame whose schema is not known ahead of time, it is impossible to cast relative to that schema without either knowing the datatypes beforehand or calling a (potentially expensive) .collect_schema(). This makes it difficult to perform certain operations lazily or to factor queries out into reusable and understandable components.

One example of behavior that is currently impossible:

(
    pl.scan_parquet('path/to/file')
    # dtype_of_column_c is a placeholder: there is no lazy way to express it today
    .with_columns(a = pl.col.b.cast(dtype_of_column_c))
    .collect()
)

Or the following:

lf.with_columns(
	a = pl.col.b.cast(supertype_of_a_and_b),
	b = pl.col.c.cast(supertype_of_a_and_b),
).collect()

My proposal is to add a DataTypeExpr and a pl.dtype_of function with the following signature:

def dtype_of(item: str | Expr) -> DataTypeExpr:
	"""Lazily get the datatype of a column or expression at a given point in the query."""

DataTypeExpr would be the equivalent of Expr, but instead of evaluating to a Series it would evaluate to a DataType.
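To make the analogy concrete, here is a toy model in plain Python. It is not Polars code: the `DataTypeExpr` class, the `dtype_of` function, and the `resolve` method below are hypothetical stand-ins, and the schema is modelled as a plain dict from column name to dtype name. The only point is that the datatype expression stays unevaluated until a schema is supplied.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Toy stand-in for a real schema: column name -> dtype name.
Schema = Mapping[str, str]


@dataclass(frozen=True)
class DataTypeExpr:
    """Hypothetical deferred datatype: nothing is looked up until resolve() is called."""

    resolver: Callable[[Schema], str]

    def resolve(self, schema: Schema) -> str:
        return self.resolver(schema)


def dtype_of(column: str) -> DataTypeExpr:
    """Toy version of the proposed pl.dtype_of, restricted to column names."""
    return DataTypeExpr(lambda schema: schema[column])


# Built without any schema in hand...
target = dtype_of("c")

# ...and only resolved once the schema is known (for example at planning time).
resolved = target.resolve({"b": "Int32", "c": "Float64"})
print(resolved)
```

In this sketch the same `target` object can be resolved against different schemas, which is what would make such expressions reusable across queries.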

Of course, this would have limitations. There are certain expressions which do not have a lazily knowable datatype (e.g. shrink_dtype and reshape). For these, I would propose that we throw an error for now.
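As a rough illustration of that error behavior, the sketch below walks a chain of operations and derives the output dtype from the schema alone, raising as soon as it reaches an operation whose result depends on the data. Everything here is an assumption made for illustration: the rule table, the `resolve_dtype` helper, and the `DTypeNotLazilyKnownError` name are hypothetical and do not correspond to Polars internals.

```python
class DTypeNotLazilyKnownError(Exception):
    """Hypothetical error for operations whose output dtype depends on the data."""


# Toy table: operation name -> rule mapping the input dtype to the output dtype.
# None marks operations whose result dtype cannot be derived from the schema alone.
OUTPUT_DTYPE_RULES = {
    "cast_float64": lambda input_dtype: "Float64",
    "abs": lambda input_dtype: input_dtype,
    "shrink_dtype": None,  # named in the RFC as not lazily knowable
    "reshape": None,  # named in the RFC as not lazily knowable
}


def resolve_dtype(schema, column, operations):
    """Derive the dtype of `column` after applying `operations`, using only the schema."""
    dtype = schema[column]
    for op in operations:
        rule = OUTPUT_DTYPE_RULES[op]
        if rule is None:
            raise DTypeNotLazilyKnownError(
                f"dtype after {op!r} cannot be determined without the data"
            )
        dtype = rule(dtype)
    return dtype


schema = {"a": "Int64"}
print(resolve_dtype(schema, "a", ["abs", "cast_float64"]))

try:
    resolve_dtype(schema, "a", ["shrink_dtype"])
except DTypeNotLazilyKnownError as err:
    print("error:", err)
```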

Some of the basic methods that should probably be available:

  • DataTypeExpr.supertype_with(other: IntoDataTypeExpr, lossless: bool) -> DataTypeExpr / pl.supertype(..datatype_exprs, *, lossless: bool)
  • DataTypeExpr.equals(other: IntoDataTypeExpr) -> Expr[dtype=bool]
  • DataTypeExpr.not_equals -> Expr[dtype=bool]
  • DataTypeExpr.repr -> Expr[dtype=str]

There might be others, but this is probably enough for an MVP.
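A small plain-Python sketch of how `supertype_with` and `equals` could compose while staying deferred. This is a toy model under stated assumptions, not the proposed implementation: the `Deferred` base class and `resolve` method are hypothetical, the schema is a dict of dtype names, and the supertype rule is a four-step integer widening ladder only. The `lossless` flag is left out because the RFC does not spell out its semantics.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Schema = Mapping[str, str]

# Toy widening ladder; real supertype rules cover far more types.
INT_LADDER = ["Int8", "Int16", "Int32", "Int64"]


@dataclass(frozen=True)
class Deferred:
    """Value computed from a schema later (stand-in for both DataTypeExpr and Expr)."""

    resolver: Callable[[Schema], object]

    def resolve(self, schema: Schema) -> object:
        return self.resolver(schema)


class DataTypeExpr(Deferred):
    def supertype_with(self, other: "DataTypeExpr") -> "DataTypeExpr":
        def widen(schema: Schema) -> str:
            left, right = self.resolve(schema), other.resolve(schema)
            # The wider type is the one further along the ladder.
            return max(left, right, key=INT_LADDER.index)

        return DataTypeExpr(widen)

    def equals(self, other: "DataTypeExpr") -> Deferred:
        # Stays deferred, mirroring the Expr[dtype=bool] return type in the list above.
        return Deferred(lambda schema: self.resolve(schema) == other.resolve(schema))


def dtype_of(column: str) -> DataTypeExpr:
    return DataTypeExpr(lambda schema: schema[column])


schema = {"a": "Int16", "b": "Int64"}
common = dtype_of("a").supertype_with(dtype_of("b"))
same = dtype_of("a").equals(dtype_of("b"))
print(common.resolve(schema), same.resolve(schema))
```

Because `supertype_with` returns another deferred datatype, calls can be chained before any schema exists, which is the property the second motivating example relies on.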

Other considerations

  • This will become even more important with RFC: pl.Categories and Streaming Compatible Categoricals #22568, as you might not have the original categories available to return to.
  • This also nicely fits in with all the schema evolution that has been happening lately.
  • This functionality would most likely also be very useful for debugging queries.
  • This would only need to exist at the DSL level as in the IR all data types are resolved.
  • What to do with multi-column expressions (selectors, pl.col('a', 'b'), pl.all())?
  • How to deal with nested types?

@deanm0000
Collaborator

How about a data-type-specific when/then? Something like pl.col("a").dtype_when(pl.String).then(pl.element().str.to_date().dt.year()).when(pl.Date).then(pl.element().dt.year())

Not sure I actually like that syntax, but I'd want that functionality for cases where a regular when wouldn't work because of the parallel evaluation of the thens.
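One way to read this suggestion is that the branch is chosen from the schema while the query is being planned, so only the matching `then` is ever built. The toy model below is plain Python under stated assumptions: `select_branch` is a hypothetical helper, the `dtype_when` chain is modelled as a list of (dtype name, builder) pairs, and builders return strings that stand in for expressions.

```python
def select_branch(schema, column, branches):
    """Pick the single builder whose dtype matches the column's dtype in the schema."""
    dtype = schema[column]
    for branch_dtype, build in branches:
        if dtype == branch_dtype:
            return build(column)
    raise TypeError(f"no branch for dtype {dtype!r} of column {column!r}")


built = []  # records which builders actually ran


def year_from_string(column):
    built.append("String")
    return f"{column}.str.to_date().dt.year()"


def year_from_date(column):
    built.append("Date")
    return f"{column}.dt.year()"


branches = [("String", year_from_string), ("Date", year_from_date)]

# With a Date column, the String branch is never constructed,
# so its string-only operations cannot fail on the wrong dtype.
plan = select_branch({"a": "Date"}, "a", branches)
print(plan, built)
```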

@coastalwhite
Collaborator Author

> How about a data-type-specific when/then? Something like pl.col("a").dtype_when(pl.String).then(pl.element().str.to_date().dt.year()).when(pl.Date).then(pl.element().dt.year())
>
> Not sure I actually like that syntax, but I'd want that functionality for cases where a regular when wouldn't work because of the parallel evaluation of the thens.

I am not sure of the API, but it would be possible. It is also quite similar to LazyFrame.match_to_schema(missing_columns={ ... }) accepting expressions.

@kszlim
Contributor

kszlim commented May 16, 2025

> How about a data-type-specific when/then? Something like pl.col("a").dtype_when(pl.String).then(pl.element().str.to_date().dt.year()).when(pl.Date).then(pl.element().dt.year())
>
> Not sure I actually like that syntax, but I'd want that functionality for cases where a regular when wouldn't work because of the parallel evaluation of the thens.

I can say that this would be important functionality for me too. Maybe it would also be nice to extend when/then/otherwise to take an optional kwarg that disables parallel evaluation of all branches?
