-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add hooks to SchemaAdapter
to add custom column generators
#15261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
let err = adapter.map_schema(&file_schema).unwrap_err().to_string(); | ||
assert_eq!(err, "Error during planning: Column a is missing from the file schema, cannot be generated, and is non-nullable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice that this fails earlier 😄
I've moved the complex bit over to #15263. I'll let that settle first then resume work here. |
Noting that in #15263 (comment) I realized that it might be good to have a system to report stats for columns that will be generated before they are generated (is it all nulls? is it a constant?) to be used with stats pruning. |
Now that #15263 is merged I'll come back here and:
|
896eb8b
to
74b884f
Compare
de02a9a
to
0bdebc8
Compare
Marking as ready for review. The main TODO is an API for transmitting statistics information for generated columns before they get generated, but that can even be a followup PR. |
Looking at how filter pushdown interacts with partition columns I think this could improve that. I feel like this change could help with that... but some details are missing: we somehow need to pipe the partition values into the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a maintainer, but I also have a use case for this and the patch LGTM.
Thank you @adriangb for contributing this!
@@ -98,6 +100,46 @@ pub trait SchemaMapper: Debug + Send + Sync { | |||
fn map_batch(&self, batch: RecordBatch) -> datafusion_common::Result<RecordBatch>; | |||
} | |||
|
|||
pub trait MissingColumnGeneratorFactory: Debug + Send + Sync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had a chance to review this entire PR yet, but I am feeling a little strange that these traits are not used by SchemaMapper
but instead are used by DefaultSchemaMapper
I wonder what is the usecase for these extra two traits, compared to say adding a method on SchemaMapper
Marking as draft until I have time to work on this |
Closes #15220
A lot of the work of this PR is meant to resolve #15220 (comment). I think I'll move that into a standalone PR.