Skip to content

Add hooks to SchemaAdapter to add custom column generators #15261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Mar 16, 2025

Closes #15220

A lot of the work of this PR is meant to resolve #15220 (comment). I think I'll move that into a standalone PR.

@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Mar 16, 2025
@adriangb adriangb marked this pull request as draft March 16, 2025 20:48
Comment on lines 279 to 280
let err = adapter.map_schema(&file_schema).unwrap_err().to_string();
assert_eq!(err, "Error during planning: Column a is missing from the file schema, cannot be generated, and is non-nullable");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that this fails earlier 😄

@adriangb
Copy link
Contributor Author

I've moved the complex bit over to #15263. I'll let that settle first then resume work here.

@adriangb
Copy link
Contributor Author

Noting that in #15263 (comment) I realized that it might be good to have a system to report stats for columns that will be generated before they are generated (is it all nulls? is it a constant?) to be used with stats pruning.

@adriangb
Copy link
Contributor Author

Now that #15263 is merged I'll come back here and:

  1. Resolve conflicts.
  2. Add an API for the SchemaAdapter to declare the stats for potentially generated columns if they are known ahead of time.

@adriangb adriangb force-pushed the defaults-schema-adapter branch from 896eb8b to 74b884f Compare March 20, 2025 17:15
@adriangb adriangb force-pushed the defaults-schema-adapter branch from de02a9a to 0bdebc8 Compare March 20, 2025 17:20
@adriangb adriangb marked this pull request as ready for review March 20, 2025 17:20
@adriangb
Copy link
Contributor Author

Marking as ready for review. The main TODO is an API for transmitting statistics information for generated columns before they get generated, but that can even be a followup PR.

@adriangb
Copy link
Contributor Author

adriangb commented Apr 30, 2025

Looking at how filter pushdown interacts with partition columns I think this could improve that.
Currently the partition values get bound when the FileStream is created which is after the predicate pushdown is applied. The filtering for filters that depend both on the partition values and data happens via a FilterExec.
This means that partition values are not available in predicate pushdown, and instead happens upstream in a FilterExec.

I feel like this change could help with that... but some details are missing: we somehow need to pipe the partition values into the FileSource so that it can in turn pass in the info to generate the partition columns on the fly if needed. Or something like that...

Copy link
Contributor

@leoyvens leoyvens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a maintainer, but I also have a use case for this and the patch LGTM.

Thank you @adriangb for contributing this!

@@ -98,6 +100,46 @@ pub trait SchemaMapper: Debug + Send + Sync {
fn map_batch(&self, batch: RecordBatch) -> datafusion_common::Result<RecordBatch>;
}

pub trait MissingColumnGeneratorFactory: Debug + Send + Sync {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a chance to review this entire PR yet, but I am feeling a little strange that these traits are not used by SchemaMapper but instead are used by DefaultSchemaMapper

I wonder what is the usecase for these extra two traits, compared to say adding a method on SchemaMapper

@adriangb
Copy link
Contributor Author

Marking as draft until I have time to work on this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support default values for columns in SchemaAdapter
3 participants