
Roadmap for high quality compression? #2984

Open · mwlon opened this issue Apr 12, 2025 · 13 comments

mwlon commented Apr 12, 2025

Thanks for this project. I'm wondering if catering to high compression ratio is on the roadmap, e.g. for long-term storage or network-bandwidth-limited workloads?

If so, from skimming the code, the main blocker I see is the default layout, which currently repartitions to blocks of 8192 rows prior to compression. For high compression ratio, one would rather repartition to something like 1MB or more (say, 2^18 rows), "train" a codec on that data to produce, say, a zstd dictionary or pcodec chunk metadata (call this "the metadata"), write the metadata, and then write the 8192-row blocks with stats after it, leveraging the metadata to compress each block.
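
Concretely, the write path could look something like this (a rough sketch assuming the zstd crate's dict::from_samples and bulk::Compressor::with_dictionary helpers; pcodec chunk metadata would play the same role for numeric data, and the Region type is just illustrative):

use std::io;

// One ~1MB region: a trained dictionary ("the metadata") plus many small blocks
// compressed against it, so random access only pays for one block + the dictionary.
struct Region {
    dictionary: Vec<u8>,
    compressed_blocks: Vec<Vec<u8>>,
}

fn compress_region(blocks: &[Vec<u8>]) -> io::Result<Region> {
    // "Train" on all blocks in the region...
    let dictionary = zstd::dict::from_samples(blocks, 16 * 1024)?;
    // ...then compress each 8192-row block against the shared dictionary.
    let mut compressor = zstd::bulk::Compressor::with_dictionary(19, &dictionary)?;
    let compressed_blocks = blocks
        .iter()
        .map(|block| compressor.compress(block))
        .collect::<io::Result<Vec<_>>>()?;
    Ok(Region { dictionary, compressed_blocks })
}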

Does this make sense, or do I misunderstand the default layout?

gatesn commented Apr 12, 2025

The default layout repartitions to the next multiple of 8K rows above 1MB of uncompressed data. (You should be able to verify this with the 'cargo vx' command line)

But you're right that the default BtrBlocksCompressor tries to strike a balance between decompression speed and size, rather than being optimised for size alone.

Edit: sorry, I see now where you got the 8k rows. Regarding stats, we do first repartition to exactly 8K rows, compute a zone map, then repartition again as per the above 1MB requirement. So the zone stats aren't tied to the physical chunking of the data.
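
In pseudo-code, the repartitioning rule is roughly (a sketch, not the actual Vortex code): pick the smallest multiple of 8192 rows whose uncompressed size reaches ~1MB.

fn rows_per_chunk(bytes_per_row: usize) -> usize {
    const BLOCK_ROWS: usize = 8192;
    const TARGET_BYTES: usize = 1 << 20; // ~1MB of uncompressed data
    let block_bytes = BLOCK_ROWS * bytes_per_row.max(1);
    // ceiling division: number of 8K-row blocks needed to reach the target size
    let blocks = (TARGET_BYTES + block_bytes - 1) / block_bytes;
    blocks.max(1) * BLOCK_ROWS
}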

mwlon commented Apr 13, 2025

Aha, I see that now. In that case, maybe one could slap zstd or pcodec on without any extra metadata business! Are you all planning to add anything like this for cases where the user prefers heavier compression? If not, would the addition be welcome?

gatesn commented Apr 13, 2025

Yep, so we've left open a slot for per-segment encryption and general purpose compression in the footer that we plan to support soon: https://github.com/spiraldb/vortex/blob/develop/vortex-flatbuffers/flatbuffers/vortex-file/footer.fbs

But separately, we're digging into PCodec a little more to 1) find out whether block-level compression would give us satisfactory random access, 2) see whether pcodec's strategy selection can help improve compression speed (we currently spend a lot of time computing unique counts, even though dict encoding isn't very useful for numeric data), and 3) decide whether we should just ship a pcodec encoding.

I wonder if you have any thoughts on the above?

We also hope to have pre-defined use-cases that come with a good set of encodings and layouts. For example, scientific data may benefit from a different set of encodings than analytical data, since we probably care more about compression ratio than we do about random access.

mwlon commented Apr 13, 2025

I can partially answer those:

  1. Pco decompresses at around 1-4 GB/s per CPU core. If that's too slow, I'd recommend an approach like what I described before: write the Pco chunk metadata for the 1MB block somewhere (you could treat this similarly to dictionary encoding's list of unique values), and then use it to decompress smaller Pco pages, so that you only need to decompress, say, 4096 values to get the one you care about. Your wrapping format just needs to supply the page start/end byte ranges for this approach (see the sketch after this list).
  2. I suspect the flow chart for numeric data types should be

match what_user_wants {
    SuperFast => ...,        // use lightweight encodings; probably skip dictionary altogether since it's just numeric?
    GoodCompression => ...,  // use pco
}

One thing that might help lightweight encodings is reusing Pco's automatic mode selection (if Pco were to expose that in the crate somehow). For that step, Pco does sample the data, but it can apply statistical techniques to the sample instead of actually compressing it. This would have implications for your downstream encodings and might improve their compression ratios in some cases. More details in the upcoming paper.
  3. Likely!
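
To make item 1 concrete, the wrapping-format bookkeeping could look roughly like this (decode_page is a placeholder, not a real pco function):

// Hypothetical wrapping-format bookkeeping: the chunk metadata is stored once per
// ~1MB chunk, and each small page records its byte range, so a point lookup only
// decodes one ~4096-value page.
struct PcoChunk {
    chunk_meta: Vec<u8>,              // pco chunk metadata, written once for the whole chunk
    page_ranges: Vec<(usize, usize)>, // (start, end) byte offsets of each page
    page_bytes: Vec<u8>,              // concatenated compressed pages
}

fn read_page(chunk: &PcoChunk, page_idx: usize) -> Vec<i64> {
    let (start, end) = chunk.page_ranges[page_idx];
    let page = &chunk.page_bytes[start..end];
    decode_page(&chunk.chunk_meta, page)
}

// Placeholder for decompressing a single page against the shared chunk metadata
// (in practice this would go through pco's wrapped-format API).
fn decode_page(_chunk_meta: &[u8], _page: &[u8]) -> Vec<i64> {
    unimplemented!()
}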

mwlon commented Apr 30, 2025

@gatesn Would you be open to PRs for pcodec and zstd encodings?

gatesn commented Apr 30, 2025

Absolutely for PCodec. I have a sort of background goal of reducing the surface area required to implement a new encoding; it's a little messy right now, but the existing ones should give you a good jumping-off point.

For ZStd, I'd rather enable compression at the segment level. This is something we're tracking already and should be able to bring forward. We can then configure it for metadata / data only, or for specific columns, etc.

mwlon commented May 1, 2025

With tree-style encodings, why treat general purpose codecs differently? Aren't they effectively just bytes->bytes encodings?

gatesn commented May 1, 2025

Even with a tree of arrays, we need some sort of leaf node - in our case ByteBuffers.

So while e.g. a PrimitiveArray could in theory store its data in a u8-typed ZstdArray, what would ZstdArray store its data in?

Given our premise places a lot of focus on lightweight encodings, it feels reasonable to say that general-purpose compression operates at the buffer level, or at least to make it easy to enable/disable that. That said, there really isn't anything preventing you from writing a ZstdArray; you would just need your own arrays to store bytes in child arrays rather than buffers.
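
As a tiny illustration of that buffer-level view (illustrative types only, not Vortex's actual ByteBuffer):

// Arrays bottom out in byte buffers, and general-purpose compression (zstd etc.)
// applies to the buffer itself, undone transparently when the segment is read,
// rather than appearing as another array encoding in the tree.
struct ByteBufferSketch {
    bytes: Vec<u8>,
    // e.g. None, or Some(("zstd", 3)) recorded in the footer for this segment
    general_purpose: Option<(&'static str, i32)>,
}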

mwlon commented May 1, 2025

I'd think there should be a single terminal/leaf encoding, and that it should be the only one directly containing a buffer. (Equivalently, one could make it a dynamic choice whether each encoding's child is a buffer or another array, but that's probably messier.) That way, any type-compatible tree would be possible, and you wouldn't need a special "compression" feature. To me, this seems simpler and more natural.
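
Roughly what I mean, as a sketch (illustrative types, not Vortex's actual array machinery):

// Every encoding's children are arrays; only the terminal Bytes encoding owns a buffer.
enum ArraySketch {
    // The single leaf encoding that directly contains a buffer.
    Bytes(Vec<u8>),
    // Every other encoding (dict, bit-packing, zstd, pco, ...) wraps child arrays.
    Encoded {
        encoding: &'static str,
        children: Vec<ArraySketch>,
    },
}

// A pco-compressed numeric column and a zstd-compressed string column then look alike:
//   Encoded { encoding: "pco",  children: vec![ArraySketch::Bytes(pco_bytes)] }
//   Encoded { encoding: "zstd", children: vec![ArraySketch::Bytes(zstd_bytes)] }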

But that's just aesthetics, and I respect any choice you make in that department. My practical concern is that this will create a difference in behavior between u8 compressions and compressions that take any other data type (like pco). They both output opaque bytes, often need to be decompressed in memory, and should be treated the same by Vortex. Will that property be possible with the planned design? I think the strongest reasonable compression strategy is very often to apply zstd to string-like data and pco to numeric data, which might be bizarre to configure if they are treated as a compression and an encoding respectively.

gatesn commented May 1, 2025

I think I see where you're going...

Is this because PCodec is largely just a set of transformations that make the data more amenable to general-purpose compression? For example, twiddle some bits and then apply zstd?

Whereas all of our encodings thus far essentially have bit-packed integers as their terminal node, which of course then gets stored in buffers that have no further compression (by default).

We went back and forth on this a little bit, whether we should distinguish between a u8 semantic integer and a u8 as part of a byte array where we wouldn't get very far trying to apply integer compression techniques.

Does that sort of make sense? Might explain the different angles we're approaching this from. Of course, I'm very open to fixing this if something needs to change!

mwlon commented May 1, 2025

No, it's the opposite: pco fully compresses by itself, and applying zstd afterward would almost always be a waste of (de)compression time. Pco only compresses 16-, 32-, and 64-bit data types. Does my concern make sense now? If not, maybe it'll be easiest to set up a video chat about this next week.
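
For example, with pco's standalone convenience API (a rough sketch; check the current docs for exact signatures), compressing a numeric column is a single call with no general-purpose pass afterward:

use pco::standalone::{simpler_compress, simple_decompress};
use pco::DEFAULT_COMPRESSION_LEVEL;

fn roundtrip(nums: &[i64]) -> pco::errors::PcoResult<Vec<i64>> {
    // pco produces a fully compressed byte string by itself...
    let compressed = simpler_compress(nums, DEFAULT_COMPRESSION_LEVEL)?;
    // ...and decompresses straight back to the numbers; no zstd step involved.
    simple_decompress::<i64>(&compressed)
}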

mwlon commented May 17, 2025

From the video call: I think our conclusion was that it makes sense to have encodings for pco and zstd, and compression options for non-array data (e.g. statistics). @gatesn please let me know when your refactor is done and I'll look into adding these encodings.

gatesn commented May 17, 2025

The refactor for arrays is done. You'll see the basic idea is something like:

vtable!(Zstd)

impl VTable for ZstdVTable {
   ...
}

Hopefully there are sufficient docs to help implement the required functions. Let me know if you have any questions and I can try to answer ASAP.

Worth noting two things:

  • the current compressor isn't pluggable, so you'll have to either change it manually or change the layout strategy in VortexWriteOptions in order to use the new encodings when writing files.
  • you may want to just ignore validity in the first pass, or use the Validity helper struct (copying from PrimitiveArray for example)
