
Optimize coalesce kernel for StringView (10-50% faster) #7650


Merged: 29 commits from alamb/faster_coalesce into apache:main on Jun 20, 2025

Conversation

alamb
Contributor

@alamb alamb commented Jun 12, 2025

Which issue does this PR close?

Rationale for this change

Currently the coalesce kernel buffers views / data until there are enough rows and then concats the results together. StringViewArrays can be even worse, as there is a second copy in gc_string_view_batch

This is wasteful because it:

  1. Buffers memory (2x peak usage)
  2. Copies the data twice

We can make it faster and more memory efficient by directly creating the output array

What changes are included in this PR?

  1. Add a specialization for incrementally building StringViewArray without buffering

Note this PR does NOT (yet) add specialized filtering -- instead it focuses on reducing the
overhead of appending views by not copying them (again!) with gc_string_view_batch
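
The core of the specialization is appending views without copying the string bytes. Here is a minimal sketch of the idea (illustrative names, not the PR's actual code): once a source StringViewArray's data buffers have been appended to the output's buffer list, each copied view only needs its buffer index remapped.

use arrow_data::ByteView;

/// Append `src_views` to `out_views`, remapping buffer indices by
/// `starting_buffer`: the position at which the source's data buffers
/// were appended to the output's buffer list.
fn append_views_reindexed(out_views: &mut Vec<u128>, src_views: &[u128], starting_buffer: u32) {
    for &v in src_views {
        let len = v as u32;
        if len <= 12 {
            // Strings of 12 bytes or fewer are inlined in the view: copy as-is
            out_views.push(v);
        } else {
            // Longer strings reference a data buffer: remap the buffer index
            let mut view = ByteView::from(v);
            view.buffer_index += starting_buffer;
            out_views.push(view.as_u128());
        }
    }
}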

Open questions:

  1. There is substantial overlap / duplication with StringViewBuilder -- I wonder if we can / should consolidate them somehow

The differences are:

  1. Block size calculation / management (i.e., looking at the sizes of the incoming buffers)
  2. Finishing the array allocates sufficient space for the views

Are there any user-facing changes?

The kernel is faster; there are no API changes

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 12, 2025
@alamb
Contributor Author

alamb commented Jun 12, 2025

Update here: this PR is quite a bit slower than main when copying many string values. I am looking into why


@@ -123,8 +123,8 @@ pub struct BatchCoalescer {
schema: SchemaRef,
/// output batch size
batch_size: usize,
/// In-progress buffered batches
buffer: Vec<RecordBatch>,
/// In-progress arrays
Contributor Author

The main change of this PR is to introduce per-type "InProgressArrays" that have specializations for each type of array

This PR adds a specialization for StringViewArray (to remove the need for gc_string_view_batch)

In follow-on PRs (maybe tickets) we can add specialized versions for PrimitiveArray, StringArray, etc.
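
A sketch of such a per-type abstraction (the method set here is an assumption, not necessarily the PR's exact API):

use arrow_array::ArrayRef;
use arrow_schema::ArrowError;

/// One instance per output column; each implementation specializes the
/// copy for its array type.
trait InProgressArray {
    /// Set the array that subsequent rows will be copied from
    fn set_source(&mut self, source: Option<ArrayRef>);
    /// Copy `len` rows starting at `offset` from the current source
    fn copy_rows(&mut self, offset: usize, len: usize) -> Result<(), ArrowError>;
    /// Complete and return the accumulated output array
    fn finish(&mut self) -> Result<ArrayRef, ArrowError>;
}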

}

fn finish(&mut self) -> Result<ArrayRef, ArrowError> {
// Concatenate all buffered arrays into a single array, which uses 2x
Contributor Author

The default fallback does the same as the existing coalesce kernel: it buffers all the input arrays until it is time to output and then calls concat

Over time I hope to replace this memory-inefficient algorithm with one that is both more memory efficient and faster
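
A sketch of what that fallback looks like (shape assumed, not the PR's exact code):

use arrow_array::{Array, ArrayRef};
use arrow_schema::ArrowError;
use arrow_select::concat::concat;

/// Buffers whole input arrays; `finish` pays the extra copy.
struct GenericInProgressArray {
    buffered: Vec<ArrayRef>,
}

impl GenericInProgressArray {
    fn finish(&mut self) -> Result<ArrayRef, ArrowError> {
        let refs: Vec<&dyn Array> = self.buffered.iter().map(|a| a.as_ref()).collect();
        let out = concat(&refs)?; // copies everything again: ~2x peak memory
        self.buffered.clear();
        Ok(out)
    }
}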

/// However, after a while (e.g., after `FilterExec` or `HashJoinExec`) the
/// `StringViewArray` may only refer to a small portion of the buffer,
/// significantly increasing memory usage.
fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
Contributor Author

@alamb alamb Jun 12, 2025

The logic of gc_string_view_batch has been incorporated into InProgressStringViewArray, and thus saves a (third!!!) copy of the strings for some StringViewArrays
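
A sketch of the compacting path (the GC equivalent, simplified; names illustrative): when the source buffers are sparse, copy just the referenced bytes into a fresh buffer and rewrite each view to point at it.

use arrow_buffer::Buffer;
use arrow_data::ByteView;

/// Copy only the referenced bytes of each long string into `out_data`
/// (which will become buffer `out_buffer_index` of the output), and
/// rewrite each view to point at the compacted location.
fn append_views_compacting(
    out_views: &mut Vec<u128>,
    src_views: &[u128],
    src_buffers: &[Buffer],
    out_buffer_index: u32,
    out_data: &mut Vec<u8>,
) {
    for &v in src_views {
        let len = v as u32;
        if len <= 12 {
            out_views.push(v); // inline string: nothing to copy
            continue;
        }
        let mut view = ByteView::from(v);
        let start = view.offset as usize;
        let bytes = &src_buffers[view.buffer_index as usize].as_slice()[start..start + len as usize];
        view.buffer_index = out_buffer_index;
        view.offset = out_data.len() as u32;
        out_data.extend_from_slice(bytes);
        out_views.push(view.as_u128());
    }
}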

@@ -158,14 +164,12 @@ impl BatchCoalescer {
/// Push next batch into the Coalescer
///
/// See [`Self::next_completed_batch()`] to retrieve any completed batches.
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
pub fn push_batch(&mut self, mut batch: RecordBatch) -> Result<(), ArrowError> {
if batch.num_rows() == 0 {
Contributor

@Dandandan Dandandan Jun 12, 2025

Maybe consider not trying to coalesce a batch whenever it has more than 1/2 (or some other factor) of batch_size rows?
Probably best to make this configurable, as whether it is a good idea depends on the usage.

Contributor

This logic was once in DataFusion, but is somehow gone 🤔

Contributor Author

Thanks -- that is a good idea -- I need to review what, if anything, assumes perfectly sized target batches...
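
For reference, a minimal sketch of the suggested bypass (hypothetical; not part of this PR):

/// Pass batches that are already "big enough" through unchanged instead
/// of re-copying their rows; `fraction` would be the configurable factor.
fn should_bypass(batch_rows: usize, batch_size: usize, fraction: f64) -> bool {
    (batch_rows as f64) >= (batch_size as f64) * fraction
}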


@alamb
Contributor Author

alamb commented Jun 18, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_coalesce (8302949) to 6227419 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_coalesce
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jun 18, 2025

I removed the use of RecordBatch::slice but I still see a difference locally that I can't explain

I am currently wondering if it is related to caching effects. I am going to try changing the benchmark so it regenerates the inputs each time rather than pre-computing them


@alamb
Contributor Author

alamb commented Jun 18, 2025

Well, I got some of the performance back but I still can't explain why it is 10% slower. The only thing I can come up with is that the coalesce kernel is slower due to allocations or something

group                                                                                alamb_faster_coalesce                  main
-----                                                                                ---------------------                  ----
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.09     96.2±0.63ms        ? ?/sec    1.00     88.1±0.72ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.21      4.6±0.03ms        ? ?/sec    1.00      3.9±0.02ms        ? ?/sec

Maybe I can show that I can make up the difference with some other optimization

@alamb
Contributor Author

alamb commented Jun 18, 2025

Ok, I figured out what is going on and why the mixed_utf8view test is slowing down. The issue is that the new utf8view code triggers garbage collection (string copying) when the old one did not. I added some println!s; on main they show

ideal_buffer_size: 370022, actual_buffer_size: 614032

This is right under the cutoff load factor (0.5) that would force a copy of the strings into new buffers

However, on this branch the GC check happens after the input is sliced, so the observed load factor is smaller, which triggers the GC in some cases:

ideal_buffer_size: 246034, actual_buffer_size: 614032
ideal_buffer_size: 123988, actual_buffer_size: 614032
ideal_buffer_size: 155553, actual_buffer_size: 614032
Need GC
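
For reference, the two numbers above compare the bytes the views actually reference against the bytes the buffers hold; a sketch of the computation (simplified from gc_string_view_batch):

use arrow_buffer::Buffer;

/// GC pays off when the buffers hold far more bytes than the views
/// actually reference.
fn buffer_stats(views: &[u128], buffers: &[Buffer]) -> (usize, usize) {
    // ideal: total bytes of all strings too long to be inlined (> 12 bytes)
    let ideal_buffer_size = views
        .iter()
        .map(|v| {
            let len = (*v as u32) as usize;
            if len > 12 { len } else { 0 }
        })
        .sum::<usize>();
    // actual: total bytes held by the referenced data buffers
    let actual_buffer_size = buffers.iter().map(|b| b.len()).sum::<usize>();
    (ideal_buffer_size, actual_buffer_size)
}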

If I hardcode the gc heuristic to be different:

index 0be8702c1b..5e4695dd7e 100644
--- a/arrow-select/src/coalesce/byte_view.rs
+++ b/arrow-select/src/coalesce/byte_view.rs
@@ -290,7 +290,7 @@ impl<B: ByteViewType> InProgressArray for InProgressByteViewArray<B> {

         // Copying the strings into a buffer can be time-consuming so
         // only do it if the array is sparse
-        if actual_buffer_size > (ideal_buffer_size * 2) {
+        if actual_buffer_size > (ideal_buffer_size * 100) {
             self.append_views_and_copy_strings(s.views(), ideal_buffer_size, buffers);
         } else {
             self.append_views_and_update_buffer_index(s.views(), buffers);

The performance for this benchmark is the same as on main

I am thinking about how best to fix this

@alamb
Contributor Author

alamb commented Jun 18, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_coalesce (d63cac7) to 6227419 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_coalesce
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jun 18, 2025

🤖: Benchmark completed


group                                                                                alamb_faster_coalesce                  main
-----                                                                                ---------------------                  ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.00    309.0±2.48ms        ? ?/sec    1.00    308.3±3.48ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.02      9.2±0.11ms        ? ?/sec    1.00      9.0±0.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.00      4.3±0.08ms        ? ?/sec    1.05      4.5±0.05ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.5±0.03ms        ? ?/sec    1.04      3.6±0.04ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.10    305.0±2.69ms        ? ?/sec    1.00    277.6±2.14ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.00     10.6±0.07ms        ? ?/sec    1.03     10.9±0.28ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.00      4.7±0.13ms        ? ?/sec    1.02      4.8±0.12ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      4.6±0.02ms        ? ?/sec    1.03      4.7±0.03ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.06     74.5±0.84ms        ? ?/sec    1.00     70.3±0.74ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     12.9±0.15ms        ? ?/sec    1.02     13.2±0.18ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.00     10.2±0.29ms        ? ?/sec    1.03     10.5±0.29ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      8.4±0.21ms        ? ?/sec    1.13      9.5±0.20ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.09     97.0±0.99ms        ? ?/sec    1.00     88.8±0.70ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.03     15.4±0.13ms        ? ?/sec    1.00     15.0±0.14ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.00      9.9±0.09ms        ? ?/sec    1.05     10.4±0.28ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00     10.0±0.25ms        ? ?/sec    1.13     11.3±0.34ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.00     59.4±0.26ms        ? ?/sec    1.19     70.6±0.29ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.00      7.6±0.05ms        ? ?/sec    1.20      9.1±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.00      4.8±0.20ms        ? ?/sec    1.11      5.3±0.26ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.00      3.1±0.03ms        ? ?/sec    1.15      3.6±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.00     80.5±0.28ms        ? ?/sec    1.12     90.4±0.38ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00     10.9±0.05ms        ? ?/sec    1.11     12.1±0.08ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.00      5.6±0.29ms        ? ?/sec    1.08      6.0±0.22ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      3.8±0.01ms        ? ?/sec    1.02      3.8±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.00     52.5±0.16ms        ? ?/sec    1.19     62.6±0.55ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.00      6.3±0.03ms        ? ?/sec    1.15      7.2±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.00      2.8±0.19ms        ? ?/sec    1.13      3.1±0.18ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00      2.2±0.01ms        ? ?/sec    1.12      2.5±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.00     71.3±0.24ms        ? ?/sec    1.03     73.2±0.34ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.00      9.6±0.04ms        ? ?/sec    1.10     10.5±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.00      3.5±0.14ms        ? ?/sec    1.09      3.9±0.12ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.00      4.5±0.01ms        ? ?/sec    1.09      4.9±0.04ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.00     66.3±0.20ms        ? ?/sec    1.41     93.6±0.30ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.00      8.3±0.02ms        ? ?/sec    1.47     12.1±0.03ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      4.6±0.08ms        ? ?/sec    1.12      5.2±0.21ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00      3.6±0.02ms        ? ?/sec    1.20      4.3±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.00     90.6±0.19ms        ? ?/sec    1.40    127.0±0.55ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.00     12.4±0.04ms        ? ?/sec    1.39     17.2±0.07ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.00      6.1±0.05ms        ? ?/sec    1.16      7.1±0.10ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      6.2±0.02ms        ? ?/sec    1.13      7.0±0.03ms        ? ?/sec

@alamb
Contributor Author

alamb commented Jun 18, 2025

This is the last remaining regression:

filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.09     97.0±0.99ms        ? ?/sec    1.00     88.8±0.70ms        ? ?/sec

I could reproduce a very small delta locally (I measured 5-7% slower)

I am pretty sure I can make it up when I bring the filtering into the coalesce kernel (which will stop making intermediate Arrays), so I would like to proceed with this PR.

I will polish it up and make sure it is ready for review tomorrow morning.

@alamb alamb changed the title WIP: Optimize coalesce kernel for StringView Optimize coalesce kernel for StringView (10-50% faster) Jun 19, 2025
@alamb
Contributor Author

alamb commented Jun 19, 2025

I think this one is ready for review now

@alamb alamb marked this pull request as ready for review June 19, 2025 11:27
.iter()
.map(|v| {
let len = (*v as u32) as usize;
if len > 12 {
Contributor

We should probably set this as a constant somewhere and use it

Contributor Author

This is a good idea and I will do it in a follow-on PR
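
For context on the magic number: in Arrow's view layout each view is 16 bytes and the low 32 bits hold the string length; strings of 12 bytes or fewer are stored inline in the remaining bytes, while longer strings carry a 4-byte prefix plus a buffer index and offset. A small illustrative sketch of decoding a view:

use arrow_data::ByteView;

fn describe_view(view: u128) {
    let len = view as u32;
    if len <= 12 {
        // The string bytes live directly in the upper 12 bytes of the view
        println!("inline string, {len} bytes");
    } else {
        // The string bytes live in a separate data buffer
        let bv = ByteView::from(view);
        println!("buffer {}, offset {}, len {}", bv.buffer_index, bv.offset, bv.length);
    }
}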

@alamb
Contributor Author

alamb commented Jun 20, 2025

Thanks again for all the help @Dandandan -- I'll try and make a few more PRs to this kernel for primitive arrays as well as optimizing the filtering shortly

@@ -212,32 +221,57 @@ impl BatchCoalescer {
/// assert_eq!(completed_batch, expected_batch);
/// ```
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
if batch.num_rows() == 0 {
// If the batch is empty, we don't need to do anything
let (_schema, arrays, mut num_rows) = batch.into_parts();
Contributor Author

This is the core logic here -- breaking up the RecordBatch and incrementally copying rows into the target output, as sketched below.
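
A sketch of that loop (simplified; uses the InProgressArray shape sketched earlier, and the helper/field names are assumptions, not the PR's exact code):

use arrow_array::ArrayRef;
use arrow_schema::ArrowError;

/// Copy rows from `arrays` into the per-column in-progress arrays,
/// completing an output every time `batch_size` rows have accumulated.
fn push_arrays(
    in_progress: &mut [Box<dyn InProgressArray>],
    arrays: &[ArrayRef],
    mut num_rows: usize,
    batch_size: usize,
    buffered_rows: &mut usize,
) -> Result<(), ArrowError> {
    let mut offset = 0;
    while num_rows > 0 {
        let to_copy = (batch_size - *buffered_rows).min(num_rows);
        for (ip, array) in in_progress.iter_mut().zip(arrays) {
            ip.set_source(Some(array.clone()));
            ip.copy_rows(offset, to_copy)?;
        }
        offset += to_copy;
        num_rows -= to_copy;
        *buffered_rows += to_copy;
        if *buffered_rows == batch_size {
            // each in-progress array would be finished into an output
            // RecordBatch here
            *buffered_rows = 0;
        }
    }
    Ok(())
}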


@alamb alamb merged commit 1bed04c into apache:main Jun 20, 2025
26 checks passed
alamb added a commit that referenced this pull request Jun 25, 2025
# Which issue does this PR close?

As suggested by @Dandandan in
#7650 (comment):

> We probably should set this as constant somewhere and use it

# Rationale for this change

Using a symbolic constant in the code rather than a hard-coded literal
makes it easier to:
1. Understand what the value means
2. Link / attach documentation to the constant to provide context

# What changes are included in this PR?

1. Introduce `MAX_INLINE_VIEW_LEN` constant for string/byte views
2. Update code to use that instead of `12`

# Are there any user-facing changes?

A new constant
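
A sketch of the change (the constant name comes from this description; its exact location and call sites are assumptions):

/// Longest string (in bytes) that can be stored inline in a view
pub const MAX_INLINE_VIEW_LEN: u32 = 12;

// Call sites become self-documenting, e.g.:
// if len > MAX_INLINE_VIEW_LEN as usize { /* string lives in a data buffer */ }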