
Optimize coalesce kernel for StringView (10-50% faster) #7650


Merged: 29 commits from alamb/faster_coalesce into apache:main on Jun 20, 2025

Conversation

alamb
Contributor

@alamb alamb commented Jun 12, 2025

Which issue does this PR close?

Rationale for this change

Currently the coalesce kernel buffers views / data until there are enough rows and then concats the results together. StringViewArrays can be even worse, as there is a second copy in gc_string_view_batch

This is wasteful because it:

  1. Buffers memory (2x peak usage)
  2. Copies the data twice

We can make it faster and more memory efficient by directly creating the output array

What changes are included in this PR?

  1. Add a specialization for incrementally building StringViewArray without buffering

Note this PR does NOT (yet) add specialized filtering -- instead it focuses on reducing the
overhead of appending views by not copying them (again!) with gc_string_view_batch
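
The core of the specialization is appending views without copying the string bytes. Here is a minimal sketch of the idea (illustrative names, not the PR's actual code): once a source StringViewArray's data buffers have been appended to the output's buffer list, each copied view only needs its buffer index remapped.

use arrow_data::ByteView;

/// Append `src_views` to `out_views`, remapping buffer indices by
/// `starting_buffer`: the position at which the source's data buffers
/// were appended to the output's buffer list.
fn append_views_reindexed(out_views: &mut Vec<u128>, src_views: &[u128], starting_buffer: u32) {
    for &v in src_views {
        let len = v as u32;
        if len <= 12 {
            // Strings of 12 bytes or fewer are inlined in the view: copy as-is
            out_views.push(v);
        } else {
            // Longer strings reference a data buffer: remap the buffer index
            let mut view = ByteView::from(v);
            view.buffer_index += starting_buffer;
            out_views.push(view.as_u128());
        }
    }
}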

Open questions:

  1. There is substantial overlap / duplication with StringViewBuilder -- I wonder if we can / should consolidate them somehow

The differences are:

  1. Block size calculation / management (i.e., looking at the sizes of the incoming buffers)
  2. Finishing the array allocates sufficient space for the views

Are there any user-facing changes?

The kernel is faster; there are no API changes

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 12, 2025
@alamb
Contributor Author

alamb commented Jun 12, 2025

Update here: this PR is quite a bit slower than main when copying many string values. I am looking into why


@@ -123,8 +123,8 @@ pub struct BatchCoalescer {
schema: SchemaRef,
/// output batch size
batch_size: usize,
/// In-progress buffered batches
buffer: Vec<RecordBatch>,
/// In-progress arrays
Contributor Author

The main change of this PR is to introduce per-type "InProgressArrays" that have specializations for each type of array

This PR adds a specialization for StringViewArray (to remove the need for gc_string_view_batch)

In follow-on PRs (maybe tickets) we can add specialized versions for PrimitiveArray, StringArray, etc.
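
A sketch of such a per-type abstraction (the method set here is an assumption, not necessarily the PR's exact API):

use arrow_array::ArrayRef;
use arrow_schema::ArrowError;

/// One instance per output column; each implementation specializes the
/// copy for its array type.
trait InProgressArray {
    /// Set the array that subsequent rows will be copied from
    fn set_source(&mut self, source: Option<ArrayRef>);
    /// Copy `len` rows starting at `offset` from the current source
    fn copy_rows(&mut self, offset: usize, len: usize) -> Result<(), ArrowError>;
    /// Complete and return the accumulated output array
    fn finish(&mut self) -> Result<ArrayRef, ArrowError>;
}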

}

fn finish(&mut self) -> Result<ArrayRef, ArrowError> {
// Concatenate all buffered arrays into a single array, which uses 2x
Contributor Author

The default fallback does the same as the existing coalesce kernel: it buffers all the input arrays until it is time to output and then calls concat

Over time I hope to replace this memory-inefficient algorithm with one that is both more memory efficient and faster
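
A sketch of what that fallback looks like (shape assumed, not the PR's exact code):

use arrow_array::{Array, ArrayRef};
use arrow_schema::ArrowError;
use arrow_select::concat::concat;

/// Buffers whole input arrays; `finish` pays the extra copy.
struct GenericInProgressArray {
    buffered: Vec<ArrayRef>,
}

impl GenericInProgressArray {
    fn finish(&mut self) -> Result<ArrayRef, ArrowError> {
        let refs: Vec<&dyn Array> = self.buffered.iter().map(|a| a.as_ref()).collect();
        let out = concat(&refs)?; // copies everything again: ~2x peak memory
        self.buffered.clear();
        Ok(out)
    }
}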

/// However, after a while (e.g., after `FilterExec` or `HashJoinExec`) the
/// `StringViewArray` may only refer to a small portion of the buffer,
/// significantly increasing memory usage.
fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
Contributor Author

@alamb alamb Jun 12, 2025

The logic of gc_string_view_batch has been incorporated into InProgressStringViewArray, and thus saves a (third!!!) copy of the strings for some StringViewArrays
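
A sketch of the compacting path (the GC equivalent, simplified; names illustrative): when the source buffers are sparse, copy just the referenced bytes into a fresh buffer and rewrite each view to point at it.

use arrow_buffer::Buffer;
use arrow_data::ByteView;

/// Copy only the referenced bytes of each long string into `out_data`
/// (which will become buffer `out_buffer_index` of the output), and
/// rewrite each view to point at the compacted location.
fn append_views_compacting(
    out_views: &mut Vec<u128>,
    src_views: &[u128],
    src_buffers: &[Buffer],
    out_buffer_index: u32,
    out_data: &mut Vec<u8>,
) {
    for &v in src_views {
        let len = v as u32;
        if len <= 12 {
            out_views.push(v); // inline string: nothing to copy
            continue;
        }
        let mut view = ByteView::from(v);
        let start = view.offset as usize;
        let bytes = &src_buffers[view.buffer_index as usize].as_slice()[start..start + len as usize];
        view.buffer_index = out_buffer_index;
        view.offset = out_data.len() as u32;
        out_data.extend_from_slice(bytes);
        out_views.push(view.as_u128());
    }
}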

@@ -158,14 +164,12 @@ impl BatchCoalescer {
/// Push next batch into the Coalescer
///
/// See [`Self::next_completed_batch()`] to retrieve any completed batches.
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
pub fn push_batch(&mut self, mut batch: RecordBatch) -> Result<(), ArrowError> {
if batch.num_rows() == 0 {
Contributor

@Dandandan Dandandan Jun 12, 2025

Maybe consider not trying to coalesce a batch whenever it has more than 1/2 (or some other factor) of batch_size rows?
Probably best to make this configurable, as whether it is a good idea depends on the usage.

Contributor

This logic was once in DataFusion, but is somehow gone 🤔

Contributor Author

Thanks -- that is a good idea -- I need to review what, if anything, assumes perfectly sized target batches...
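
For reference, a minimal sketch of the suggested bypass (hypothetical; not part of this PR):

/// Pass batches that are already "big enough" through unchanged instead
/// of re-copying their rows; `fraction` would be the configurable factor.
fn should_bypass(batch_rows: usize, batch_size: usize, fraction: f64) -> bool {
    (batch_rows as f64) >= (batch_size as f64) * fraction
}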


@alamb
Contributor Author

alamb commented Jun 18, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_coalesce (8302949) to 6227419 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_coalesce
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jun 18, 2025

I removed the use of RecordBatch::slice but I still see a difference locally that I can't explain

I am currently wondering if it is related to caching effects. I am going to try changing the benchmark so it regenerates the inputs each time rather than pre-computing them


@alamb
Contributor Author

alamb commented Jun 18, 2025

Well, I got some of the performance back but I still can't explain why it is 10% slower. The only thing I can come up with is that the coalesce kernel is slower due to allocations or something

group                                                                                alamb_faster_coalesce                  main
-----                                                                                ---------------------                  ----
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.09     96.2±0.63ms        ? ?/sec    1.00     88.1±0.72ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.21      4.6±0.03ms        ? ?/sec    1.00      3.9±0.02ms        ? ?/sec

Maybe I can show that I can make up the difference with some other optimization

@alamb
Contributor Author

alamb commented Jun 18, 2025

Ok, I figured out what is going on and why the mixed_utf8view test is slowing down. The issue is that the new utf8view code triggers garbage collection (string copying) when the old one did not. I added some println!s; on main they show

ideal_buffer_size: 370022, actual_buffer_size: 614032

This is right under the cutoff load factor (0.5) that would force a copy of the strings into new buffers

However, on this branch the GC check happens after the input is sliced, so the observed load factor is smaller, which triggers the GC in some cases:

ideal_buffer_size: 246034, actual_buffer_size: 614032
ideal_buffer_size: 123988, actual_buffer_size: 614032
ideal_buffer_size: 155553, actual_buffer_size: 614032
Need GC
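
For reference, the two numbers above compare the bytes the views actually reference against the bytes the buffers hold; a sketch of the computation (simplified from gc_string_view_batch):

use arrow_buffer::Buffer;

/// GC pays off when the buffers hold far more bytes than the views
/// actually reference.
fn buffer_stats(views: &[u128], buffers: &[Buffer]) -> (usize, usize) {
    // ideal: total bytes of all strings too long to be inlined (> 12 bytes)
    let ideal_buffer_size = views
        .iter()
        .map(|v| {
            let len = (*v as u32) as usize;
            if len > 12 { len } else { 0 }
        })
        .sum::<usize>();
    // actual: total bytes held by the referenced data buffers
    let actual_buffer_size = buffers.iter().map(|b| b.len()).sum::<usize>();
    (ideal_buffer_size, actual_buffer_size)
}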

If I hardcode the gc heuristic to be different:

index 0be8702c1b..5e4695dd7e 100644
--- a/arrow-select/src/coalesce/byte_view.rs
+++ b/arrow-select/src/coalesce/byte_view.rs
@@ -290,7 +290,7 @@ impl<B: ByteViewType> InProgressArray for InProgressByteViewArray<B> {

         // Copying the strings into a buffer can be time-consuming so
         // only do it if the array is sparse
-        if actual_buffer_size > (ideal_buffer_size * 2) {
+        if actual_buffer_size > (ideal_buffer_size * 100) {
             self.append_views_and_copy_strings(s.views(), ideal_buffer_size, buffers);
         } else {
             self.append_views_and_update_buffer_index(s.views(), buffers);

The performance for this benchmark is the same as on main

I am thinking about how best to fix this

@alamb
Contributor Author

alamb commented Jun 18, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_coalesce (d63cac7) to 6227419 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_coalesce
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jun 18, 2025

🤖: Benchmark completed


group                                                                                alamb_faster_coalesce                  main
-----                                                                                ---------------------                  ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.00    309.0±2.48ms        ? ?/sec    1.00    308.3±3.48ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.02      9.2±0.11ms        ? ?/sec    1.00      9.0±0.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.00      4.3±0.08ms        ? ?/sec    1.05      4.5±0.05ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.5±0.03ms        ? ?/sec    1.04      3.6±0.04ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.10    305.0±2.69ms        ? ?/sec    1.00    277.6±2.14ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.00     10.6±0.07ms        ? ?/sec    1.03     10.9±0.28ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.00      4.7±0.13ms        ? ?/sec    1.02      4.8±0.12ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      4.6±0.02ms        ? ?/sec    1.03      4.7±0.03ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.06     74.5±0.84ms        ? ?/sec    1.00     70.3±0.74ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     12.9±0.15ms        ? ?/sec    1.02     13.2±0.18ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.00     10.2±0.29ms        ? ?/sec    1.03     10.5±0.29ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      8.4±0.21ms        ? ?/sec    1.13      9.5±0.20ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.09     97.0±0.99ms        ? ?/sec    1.00     88.8±0.70ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.03     15.4±0.13ms        ? ?/sec    1.00     15.0±0.14ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.00      9.9±0.09ms        ? ?/sec    1.05     10.4±0.28ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00     10.0±0.25ms        ? ?/sec    1.13     11.3±0.34ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.00     59.4±0.26ms        ? ?/sec    1.19     70.6±0.29ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.00      7.6±0.05ms        ? ?/sec    1.20      9.1±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.00      4.8±0.20ms        ? ?/sec    1.11      5.3±0.26ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.00      3.1±0.03ms        ? ?/sec    1.15      3.6±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.00     80.5±0.28ms        ? ?/sec    1.12     90.4±0.38ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00     10.9±0.05ms        ? ?/sec    1.11     12.1±0.08ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.00      5.6±0.29ms        ? ?/sec    1.08      6.0±0.22ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      3.8±0.01ms        ? ?/sec    1.02      3.8±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.00     52.5±0.16ms        ? ?/sec    1.19     62.6±0.55ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.00      6.3±0.03ms        ? ?/sec    1.15      7.2±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.00      2.8±0.19ms        ? ?/sec    1.13      3.1±0.18ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00      2.2±0.01ms        ? ?/sec    1.12      2.5±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.00     71.3±0.24ms        ? ?/sec    1.03     73.2±0.34ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.00      9.6±0.04ms        ? ?/sec    1.10     10.5±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.00      3.5±0.14ms        ? ?/sec    1.09      3.9±0.12ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.00      4.5±0.01ms        ? ?/sec    1.09      4.9±0.04ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.00     66.3±0.20ms        ? ?/sec    1.41     93.6±0.30ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.00      8.3±0.02ms        ? ?/sec    1.47     12.1±0.03ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      4.6±0.08ms        ? ?/sec    1.12      5.2±0.21ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00      3.6±0.02ms        ? ?/sec    1.20      4.3±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.00     90.6±0.19ms        ? ?/sec    1.40    127.0±0.55ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.00     12.4±0.04ms        ? ?/sec    1.39     17.2±0.07ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.00      6.1±0.05ms        ? ?/sec    1.16      7.1±0.10ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      6.2±0.02ms        ? ?/sec    1.13      7.0±0.03ms        ? ?/sec

@alamb
Contributor Author

alamb commented Jun 18, 2025

This is the last remaining regression:

filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.09     97.0±0.99ms        ? ?/sec    1.00     88.8±0.70ms        ? ?/sec

I could reproduce a very small delta locally (I measured 5-7% slower)

I am pretty sure I can make it up when I bring the filtering into the coalesce kernel (which will stop making intermediate Arrays), so I would like to proceed with this PR.

I will polish it up and make sure it is ready for review tomorrow morning.

@alamb alamb changed the title WIP: Optimize coalesce kernel for StringView Optimize coalesce kernel for StringView (10-50% faster) Jun 19, 2025
@alamb
Contributor Author

alamb commented Jun 19, 2025

I think this one is ready for review now

@alamb alamb marked this pull request as ready for review June 19, 2025 11:27
.iter()
.map(|v| {
let len = (*v as u32) as usize;
if len > 12 {
Contributor

We should probably set this as a constant somewhere and use it

Contributor Author

This is a good idea and I will do it in a follow-on PR
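
For context on the magic number: in Arrow's view layout each view is 16 bytes and the low 32 bits hold the string length; strings of 12 bytes or fewer are stored inline in the remaining bytes, while longer strings carry a 4-byte prefix plus a buffer index and offset. A small illustrative sketch of decoding a view:

use arrow_data::ByteView;

fn describe_view(view: u128) {
    let len = view as u32;
    if len <= 12 {
        // The string bytes live directly in the upper 12 bytes of the view
        println!("inline string, {len} bytes");
    } else {
        // The string bytes live in a separate data buffer
        let bv = ByteView::from(view);
        println!("buffer {}, offset {}, len {}", bv.buffer_index, bv.offset, bv.length);
    }
}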

@alamb
Contributor Author

alamb commented Jun 20, 2025

Thanks again for all the help @Dandandan -- I'll try and make a few more PRs to this kernel for primitive arrays as well as optimizing the filtering shortly

@@ -212,32 +221,57 @@ impl BatchCoalescer {
/// assert_eq!(completed_batch, expected_batch);
/// ```
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
if batch.num_rows() == 0 {
// If the batch is empty, we don't need to do anything
let (_schema, arrays, mut num_rows) = batch.into_parts();
Contributor Author

This is the core logic here -- breaking up the RecordBatch and incrementally copying rows into the target output, as sketched below.
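
A sketch of that loop (simplified; uses the InProgressArray shape sketched earlier, and the helper/field names are assumptions, not the PR's exact code):

use arrow_array::ArrayRef;
use arrow_schema::ArrowError;

/// Copy rows from `arrays` into the per-column in-progress arrays,
/// completing an output every time `batch_size` rows have accumulated.
fn push_arrays(
    in_progress: &mut [Box<dyn InProgressArray>],
    arrays: &[ArrayRef],
    mut num_rows: usize,
    batch_size: usize,
    buffered_rows: &mut usize,
) -> Result<(), ArrowError> {
    let mut offset = 0;
    while num_rows > 0 {
        let to_copy = (batch_size - *buffered_rows).min(num_rows);
        for (ip, array) in in_progress.iter_mut().zip(arrays) {
            ip.set_source(Some(array.clone()));
            ip.copy_rows(offset, to_copy)?;
        }
        offset += to_copy;
        num_rows -= to_copy;
        *buffered_rows += to_copy;
        if *buffered_rows == batch_size {
            // each in-progress array would be finished into an output
            // RecordBatch here
            *buffered_rows = 0;
        }
    }
    Ok(())
}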


@alamb alamb merged commit 1bed04c into apache:main Jun 20, 2025
26 checks passed
alamb added a commit that referenced this pull request Jun 25, 2025
# Which issue does this PR close?

As suggested by @Dandandan in
#7650 (comment):

> We probably should set this as constant somewhere and use it

# Rationale for this change

Using a symbolic constant in the code rather than a hard-coded literal
makes it easier to:
1. Understand what the value means
2. Link / attach documentation to the constant to provide context

# What changes are included in this PR?

1. Introduce `MAX_INLINE_VIEW_LEN` constant for string/byte views
2. Update code to use that instead of `12`

# Are there any user-facing changes?

A new constant
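
A sketch of the change (the constant name comes from this description; its exact location and call sites are assumptions):

/// Longest string (in bytes) that can be stored inline in a view
pub const MAX_INLINE_VIEW_LEN: u32 = 12;

// Call sites become self-documenting, e.g.:
// if len > MAX_INLINE_VIEW_LEN as usize { /* string lives in a data buffer */ }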