Attempt to reduce surface area for Array implementation #3131

gatesn · 2025-04-25T21:37:44Z

The Array + ArrayImpl traits still don't feel quite right. There's quite a lot of surface area to implement a new encoding, the weird underscore functions to hide ArrayImpl from the public API is a little gross, and it doesn't fit well with handling Arrays from FFI.

Much of this can be fixed by creating an Array struct that wraps some opaque heap-allocated data. This could be something like Array(Arc<dyn ArrayImpl>).

In a former life, we had something more like this:

struct Array {
  encoding: Encoding,
  metadata: Vec<u8>,
  children: Vec<Children>,
  buffers: Vec<Buffer>,
  stats: Stats,
}

But this suffers from forcing us to keep metadata serialized in-memory, and therefore makes creating arrays quite expensive. Although we do pretty much have an ABI-stable Array struct.

We could relax this slightly, and support heap-allocated metadata, i.e.

struct Array {
  encoding: Encoding,
  metadata: Box<dyn Any>,
  children: Vec<Children>,
  buffers: Vec<Buffer>,
  stats: Stats,
}

But there is still a question of whether something lives in metadata, or lives in a buffer. For example, a ConstantArray with a large scalar value should presumably store that value in a buffer, rather than metadata. Therefore we still have to serialize the value when we create the array. And so we end up pushing buffers into the dyn Any too.

We then face the question of whether we should hold children as an opaque Array, or whether it's better to hold a specific child. e.g. FSSTArray holds a VarBin child. And so perhaps we also push children into the dyn Any!

At this point, we may as well keep going, push everything into the opaque heap data, and we end up back with: Array(Arc<dyn ArrayImpl>).

The problem with this solution is what we do with PrimitiveArray::new, should it return Array(Arc<dyn Any>) and be completely type erased? Should it return PrimitiveArray and only expose the functions on ArrayImpl, rather than Array? Both are bad solutions in my opinion.

So one final option we haven't considered is whether we have struct Array<Encoding>(Encoding) and we type alias type PrimitiveArray = Array<PrimitiveEncoding>. This solves many of the problems above in that the encoding can hold typed scalars and children, we don't expose the internal API (the ones without post-validation), and PrimitiveArray::new returns a PrimitiveArray. The slight issue is that we cannot add new functions to PrimitiveArray if it lives in a third-party crate without the slight annoyance of defining a PrimitiveArrayExt trait. The bigger issue is that we can't really pass these arrays around, e.g. into compute functions, since they hold a generic type.

So, perhaps we actually have:

/// Object-safe trait exposing the API of a Vortex array.
trait Array { ... }

/// Type-def to pass around owned arc'd arrays.
type ArrayRef = Arc<dyn Array>;

/// Wrapper struct to dispatch array API functions over an encoding.
struct ArrayImpl<Encoding>(Encoding);

impl<E> Array for ArrayImpl<E> { ... }

The problem with this setup is that when implementing ArrayImpl, self no longer implements Array, so we cannot easily use any of the possibly useful Array functions. Blergh.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Attempt to reduce surface area for Array implementation #3131

Attempt to reduce surface area for Array implementation #3131

gatesn commented Apr 25, 2025 •

edited

Loading

Attempt to reduce surface area for Array implementation #3131

Attempt to reduce surface area for Array implementation #3131

Comments

gatesn commented Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

gatesn commented Apr 25, 2025 •

edited

Loading