Skip to content

GH-46462: [C++][Parquet] Expose currently thrown EncodedStatistics when checking is_stats_set #46463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

raulcd
Copy link
Member

@raulcd raulcd commented May 15, 2025

Rationale for this change

We are currently throwing away EncodedStatistics that are built when calling is_stats_set. If we require EncodedStatistics we are doing some spurious conversions that could potentially be optimized. For example generating the TypedStatistics only when necessary.

What changes are included in this PR?

  • cache the generated encoded_statistics on is_stats_set
  • provide accessor to encoded_statistics
  • avoid generating the typed statistics if all we care is about encoded_statistics

Are these changes tested?

Existing tests are successful and adapted tests to validate new accessor and encoded_statistics values.

Are there any user-facing changes?

No

@raulcd
Copy link
Member Author

raulcd commented May 15, 2025

@pitrou do you think I should add more tests?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks generally ok, two comments below.

if (possible_stats_ == nullptr) {
possible_stats_ = MakeColumnStats(*column_metadata_, descr_);
}
if (possible_encoded_stats_ == nullptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must take the lock around this.


std::shared_ptr<EncodedStatistics> enc_stats = column_chunk->encoded_statistics();
EXPECT_EQ(null_count, enc_stats->null_count);
EXPECT_TRUE(expected_stats->HasMinMax());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to check enc_stats->has_min and enc_stats->has_max?

@pitrou
Copy link
Member

pitrou commented May 15, 2025

Also cc @wgtmac .

@@ -143,6 +143,7 @@ class PARQUET_EXPORT ColumnChunkMetaData {
bool is_stats_set() const;
bool is_geo_stats_set() const;
std::shared_ptr<Statistics> statistics() const;
std::shared_ptr<EncodedStatistics> encoded_statistics() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a general question that when encoded_statistics() is useful and worth exposing it to downstream users?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I also wondering this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can think of a couple of cases where this can be useful. For example if I am building a metadata inspection tool where I don't need the TypedStatistics but access to the encoded values, something like the printer improvement on the PR. It could be also useful if I am consolidating several parquet files and I want to get the encoded values directly but don't necessarily need the TypedStatistics.
As an example if I want to check the size of the plain encoded statistics on the metadata with something like the following snippet:

    size_t total_size = 0;
    for (int iter = 0; iter < 1000; ++iter) {
        for (int r = 0; r < metadata->num_row_groups(); r++) {
            const auto row_group = metadata->RowGroup(r);
            for (int c = 0; c < metadata->num_columns(); c++) {
                const auto stats = row_group->ColumnChunk(c)->statistics();
                if (stats->HasMinMax()) {
                    const std::string& min_encoded = stats->EncodeMin();
                    const std::string& max_encoded = stats->EncodeMax();
                    total_size += min_encoded.size() + max_encoded.size();
                }
            }
        }
    }

Having access to the encoded_statistics directly gave me almost an order of magnitude improvement just avoiding having to throw the EncodedStatistics and encode them again. When tested on a 10 column int64, 100 rowgroups:
With current branch: ~2994ms (average of 3043ms, 2984ms, 2955ms)
With my PR and minor changes to use the new API: ~361ms (average of 347ms, 367ms, 369ms)

@pitrou thoughts on this?

}
EncodedStatistics encodedStatistics = possible_stats_->Encode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is unnecessary to call Encode() every time is_stats_set(). Should we take a different approach by caching the result of is_stats_set()?

IMO, many Parquet metadata objects are meant to access once and cache them for follow-up uses. For example, FileMetaData will always create new RowGroupMetaData and ColumnMetaData objects so users are supposed to cache them to avoid creation cost.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants