Skip to content

Attempt to reduce surface area for Array implementation #3131

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
gatesn opened this issue Apr 25, 2025 · 0 comments
Open

Attempt to reduce surface area for Array implementation #3131

gatesn opened this issue Apr 25, 2025 · 0 comments

Comments

@gatesn
Copy link
Contributor

gatesn commented Apr 25, 2025

The Array + ArrayImpl traits still don't feel quite right. There's quite a lot of surface area to implement a new encoding, the weird underscore functions to hide ArrayImpl from the public API is a little gross, and it doesn't fit well with handling Arrays from FFI.

Much of this can be fixed by creating an Array struct that wraps some opaque heap-allocated data. This could be something like Array(Arc<dyn ArrayImpl>).

In a former life, we had something more like this:

struct Array {
  encoding: Encoding,
  metadata: Vec<u8>,
  children: Vec<Children>,
  buffers: Vec<Buffer>,
  stats: Stats,
}

But this suffers from forcing us to keep metadata serialized in-memory, and therefore makes creating arrays quite expensive. Although we do pretty much have an ABI-stable Array struct.

We could relax this slightly, and support heap-allocated metadata, i.e.

struct Array {
  encoding: Encoding,
  metadata: Box<dyn Any>,
  children: Vec<Children>,
  buffers: Vec<Buffer>,
  stats: Stats,
}

But there is still a question of whether something lives in metadata, or lives in a buffer. For example, a ConstantArray with a large scalar value should presumably store that value in a buffer, rather than metadata. Therefore we still have to serialize the value when we create the array. And so we end up pushing buffers into the dyn Any too.

We then face the question of whether we should hold children as an opaque Array, or whether it's better to hold a specific child. e.g. FSSTArray holds a VarBin child. And so perhaps we also push children into the dyn Any!

At this point, we may as well keep going, push everything into the opaque heap data, and we end up back with: Array(Arc<dyn ArrayImpl>).

The problem with this solution is what we do with PrimitiveArray::new, should it return Array(Arc<dyn Any>) and be completely type erased? Should it return PrimitiveArray and only expose the functions on ArrayImpl, rather than Array? Both are bad solutions in my opinion.

So one final option we haven't considered is whether we have struct Array<Encoding>(Encoding) and we type alias type PrimitiveArray = Array<PrimitiveEncoding>. This solves many of the problems above in that the encoding can hold typed scalars and children, we don't expose the internal API (the ones without post-validation), and PrimitiveArray::new returns a PrimitiveArray. The slight issue is that we cannot add new functions to PrimitiveArray if it lives in a third-party crate without the slight annoyance of defining a PrimitiveArrayExt trait. The bigger issue is that we can't really pass these arrays around, e.g. into compute functions, since they hold a generic type.

So, perhaps we actually have:

/// Object-safe trait exposing the API of a Vortex array.
trait Array { ... }

/// Type-def to pass around owned arc'd arrays.
type ArrayRef = Arc<dyn Array>;

/// Wrapper struct to dispatch array API functions over an encoding.
struct ArrayImpl<Encoding>(Encoding);

impl<E> Array for ArrayImpl<E> { ... }

The problem with this setup is that when implementing ArrayImpl, self no longer implements Array, so we cannot easily use any of the possibly useful Array functions. Blergh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant