Currently, it is impossible to lazily reason about the datatypes used in a query. Given a LazyFrame whose schema is not known in advance, it is impossible to cast relative to the schema without knowing the datatypes beforehand or calling a (potentially expensive) .collect_schema(). This makes it difficult to perform certain operations lazily or to factor queries out into reusable and understandable components.
One example of behavior that is currently impossible.
My proposal is to add a DataTypeExpr and a pl.dtype_of function with the following signature:
def dtype_of(item: str | Expr) -> DataTypeExpr:
    """Lazily get the datatype of a certain column or expression at a certain time."""
DataTypeExpr would be the equivalent of Expr, but instead of evaluating to a Series it would evaluate to a DataType.
Of course, this would have limitations. There are certain expressions which do not have a lazily knowable datatype (e.g. shrink_dtype and reshape). For these, I would propose that we throw an error for now.
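To make the idea of a deferred datatype concrete, here is a minimal pure-Python sketch. It is not Polars code: ToyDataTypeExpr, toy_dtype_of, toy_unknowable and DTypeNotKnownError are hypothetical names invented for illustration, and a plain dict of dtype names stands in for a real schema. It only shows the shape of the proposal: building the expression does no work, resolution happens once a schema is available, and expressions without a lazily knowable dtype raise an error.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a schema is modelled as column name -> dtype name.
Schema = dict[str, str]


class DTypeNotKnownError(Exception):
    """Raised when a dtype cannot be determined without running the query."""


@dataclass(frozen=True)
class ToyDataTypeExpr:
    """Hypothetical stand-in for the proposed DataTypeExpr."""

    resolve: Callable[[Schema], str]


def toy_dtype_of(column: str) -> ToyDataTypeExpr:
    """Build a deferred dtype lookup; nothing is resolved until a schema is given."""
    return ToyDataTypeExpr(lambda schema: schema[column])


def toy_unknowable(reason: str) -> ToyDataTypeExpr:
    """Model an expression whose dtype depends on the data itself."""

    def _resolve(schema: Schema) -> str:
        raise DTypeNotKnownError(reason)

    return ToyDataTypeExpr(_resolve)


# The same deferred expression resolves differently against different schemas.
dtype_of_a = toy_dtype_of("a")
resolved_int = dtype_of_a.resolve({"a": "Int64"})
resolved_str = dtype_of_a.resolve({"a": "String"})
```

The point of the sketch is that a reusable query component can hold dtype_of_a without ever seeing the schema, which is the property the proposal is after.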
Some of the basic methods that should probably be available.
How about a data-type-specific when/then? Something like pl.col("a").dtype_when(pl.String).then(pl.element().str.to_date().dt.year()).when(pl.Date).then(pl.element().dt.year())
I'm not sure I actually like that syntax, but I would like that functionality for cases where a regular when wouldn't work because of the parallel evaluation of the thens.
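The difference from a regular when/then can be sketched in plain Python. This is only a toy model under stated assumptions: select_branch, Branch and year_branches are hypothetical names, dtypes are plain strings, and columns are Python lists. The branch is chosen from the dtype alone before any value is touched, so the transform written for Date is never run on String data (which is where evaluating every branch would fail).

```python
from datetime import date
from typing import Any, Callable

# Assumption: a branch pairs a dtype name with a per-value transform.
Branch = tuple[str, Callable[[Any], int]]


def select_branch(dtype: str, branches: list[Branch]) -> Callable[[Any], int]:
    """Pick exactly one transform based on the dtype; other branches never run."""
    for wanted, transform in branches:
        if wanted == dtype:
            return transform
    raise TypeError(f"no branch registered for dtype {dtype!r}")


year_branches: list[Branch] = [
    ("String", lambda v: date.fromisoformat(v).year),  # parse first, then take the year
    ("Date", lambda v: v.year),                        # already a date
]

string_column = ["2024-01-05", "1999-12-31"]
date_column = [date(2024, 1, 5), date(1999, 12, 31)]

years_from_strings = [select_branch("String", year_branches)(v) for v in string_column]
years_from_dates = [select_branch("Date", year_branches)(v) for v in date_column]
```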
I am not sure of the API, but it would be possible. It is also quite similar to LazyFrame.match_to_schema(missing_columns={ ... }) accepting expressions.
I can say that this would be important functionality for me too, though maybe it would be nice to extend when/then/otherwise to take an optional kwarg that disables parallel evaluation of all branches?
Some of the basic methods that should probably be available:
DataTypeExpr.supertype_with(other: IntoDataTypeExpr, lossless: bool) -> DataTypeExpr / pl.supertype(..datatype_exprs, *, lossless: bool)
DataTypeExpr.equals(other: IntoDataTypeExpr) -> Expr[dtype=bool]
DataTypeExpr.not_equals -> Expr[dtype=bool]
DataTypeExpr.repr -> Expr[dtype=str]
There might be others, but this is probably enough for an MVP.
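As a rough illustration of how supertype_with and equals could compose while staying deferred, here is a pure-Python sketch. Everything in it is an assumption made for illustration: the names (ToyDataTypeExpr, toy_dtype_of, toy_supertype), the tiny set of dtypes, and in particular the meaning given to lossless, which the proposal does not spell out. Here lossless=True simply rejects promoting Int64 to Float64, since a 64-bit float cannot represent every 64-bit integer exactly.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a schema is modelled as column name -> dtype name.
Schema = dict[str, str]
INT_BITS = {"Int8": 8, "Int16": 16, "Int32": 32, "Int64": 64}


def toy_supertype(left: str, right: str, *, lossless: bool = False) -> str:
    """Toy supertype rules covering a few integer widths and Float64 only."""
    if left == right:
        return left
    if left in INT_BITS and right in INT_BITS:
        return left if INT_BITS[left] >= INT_BITS[right] else right
    pair = {left, right}
    if "Float64" in pair:
        (other,) = pair - {"Float64"}
        if other in INT_BITS:
            # Assumed semantics: Float64 holds integers exactly only up to 53 bits.
            if lossless and INT_BITS[other] > 32:
                raise TypeError(f"no lossless supertype for {left} and {right}")
            return "Float64"
    raise TypeError(f"no toy supertype for {left} and {right}")


@dataclass(frozen=True)
class ToyDataTypeExpr:
    """Hypothetical stand-in for the proposed DataTypeExpr."""

    resolve: Callable[[Schema], str]

    def supertype_with(self, other: "ToyDataTypeExpr", *, lossless: bool = False) -> "ToyDataTypeExpr":
        # Still deferred: both sides are resolved only when a schema arrives.
        return ToyDataTypeExpr(
            lambda s: toy_supertype(self.resolve(s), other.resolve(s), lossless=lossless)
        )

    def equals(self, other: "ToyDataTypeExpr") -> Callable[[Schema], bool]:
        # The proposal returns a boolean Expr; a schema -> bool callable stands in here.
        return lambda s: self.resolve(s) == other.resolve(s)


def toy_dtype_of(column: str) -> ToyDataTypeExpr:
    return ToyDataTypeExpr(lambda s: s[column])


schema = {"a": "Int16", "b": "Int32", "c": "Float64", "d": "Int64"}
widest_int = toy_dtype_of("a").supertype_with(toy_dtype_of("b")).resolve(schema)
int_and_float = toy_dtype_of("b").supertype_with(toy_dtype_of("c"), lossless=True).resolve(schema)
a_equals_b = toy_dtype_of("a").equals(toy_dtype_of("b"))(schema)
```

A reusable component could then cast two columns of unknown type to a common dtype without the caller ever passing the schema in by hand.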
Other considerations
pl.Categories and Streaming Compatible Categoricals #22568. As you might not have the original categories available to return to.
pl.col('a', 'b'), pl.all()
References
Expr.meta function that computes the dtype of an expression #16974