-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GH-46462: [C++][Parquet] Expose currently thrown EncodedStatistics when checking is_stats_set #46463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reuse FromThrift method for ExtractStatsFromHeader
…rinter to avoid re-encoding
@pitrou do you think I should add more tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks generally ok, two comments below.
if (possible_stats_ == nullptr) { | ||
possible_stats_ = MakeColumnStats(*column_metadata_, descr_); | ||
} | ||
if (possible_encoded_stats_ == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must take the lock around this.
|
||
std::shared_ptr<EncodedStatistics> enc_stats = column_chunk->encoded_statistics(); | ||
EXPECT_EQ(null_count, enc_stats->null_count); | ||
EXPECT_TRUE(expected_stats->HasMinMax()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to check enc_stats->has_min
and enc_stats->has_max
?
Also cc @wgtmac . |
@@ -143,6 +143,7 @@ class PARQUET_EXPORT ColumnChunkMetaData { | |||
bool is_stats_set() const; | |||
bool is_geo_stats_set() const; | |||
std::shared_ptr<Statistics> statistics() const; | |||
std::shared_ptr<EncodedStatistics> encoded_statistics() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just have a general question that when encoded_statistics()
is useful and worth exposing it to downstream users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I also wondering this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can think of a couple of cases where this can be useful. For example if I am building a metadata inspection tool where I don't need the TypedStatistics
but access to the encoded values, something like the printer improvement on the PR. It could be also useful if I am consolidating several parquet files and I want to get the encoded values directly but don't necessarily need the TypedStatistics
.
As an example if I want to check the size of the plain encoded statistics on the metadata with something like the following snippet:
size_t total_size = 0;
for (int iter = 0; iter < 1000; ++iter) {
for (int r = 0; r < metadata->num_row_groups(); r++) {
const auto row_group = metadata->RowGroup(r);
for (int c = 0; c < metadata->num_columns(); c++) {
const auto stats = row_group->ColumnChunk(c)->statistics();
if (stats->HasMinMax()) {
const std::string& min_encoded = stats->EncodeMin();
const std::string& max_encoded = stats->EncodeMax();
total_size += min_encoded.size() + max_encoded.size();
}
}
}
}
Having access to the encoded_statistics
directly gave me almost an order of magnitude improvement just avoiding having to throw the EncodedStatistics
and encode them again. When tested on a 10 column int64, 100 rowgroups:
With current branch: ~2994ms (average of 3043ms, 2984ms, 2955ms)
With my PR and minor changes to use the new API: ~361ms (average of 347ms, 367ms, 369ms)
@pitrou thoughts on this?
} | ||
EncodedStatistics encodedStatistics = possible_stats_->Encode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is unnecessary to call Encode()
every time is_stats_set()
. Should we take a different approach by caching the result of is_stats_set()
?
IMO, many Parquet metadata objects are meant to access once and cache them for follow-up uses. For example, FileMetaData
will always create new RowGroupMetaData
and ColumnMetaData
objects so users are supposed to cache them to avoid creation cost.
Rationale for this change
We are currently throwing away
EncodedStatistics
that are built when callingis_stats_set
. If we requireEncodedStatistics
we are doing some spurious conversions that could potentially be optimized. For example generating theTypedStatistics
only when necessary.What changes are included in this PR?
encoded_statistics
onis_stats_set
encoded_statistics
encoded_statistics
Are these changes tested?
Existing tests are successful and adapted tests to validate new accessor and
encoded_statistics
values.Are there any user-facing changes?
No