Parse bytes from Uint8Array #592

hildjj · 2025-04-08T17:51:19Z

See #591. Allow parsing a Uint8Array directly. Features needed:

Generate different parser from a similar grammar input based on an option.
Only a subset of grammar inputs is valid. In particular, all characters are treated as single-byte Latin-1, and any character that doesn't fit in that range is an error.
match byte by decimal/hex number
$ returns Uint8Array.prototype.subarray slice
Instead of Regexps for classes, use a 256-element bitstring. new BigUint64Array(4)?
location() etc. are all in bytes rather than code units
Error reporting in hexdump-like format (see chex for ideas)
building blocks of some kind for UTF8/16, float16/32/64, (u)int8/16/32/64 (all with big- and little-endian variants), and anything else that is common.
think about parsing bits rather than bytes also. Maybe a rule decoration of some kind.
example grammar for PNG or similar
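
To make the bitstring idea for classes concrete, here is a minimal sketch. The names makeByteClass and classHas are hypothetical, not part of Peggy, and the sketch assumes a class is given as inclusive byte ranges:

```javascript
// Hypothetical helper (not Peggy API): build a 256-bit membership set
// from inclusive [lo, hi] byte ranges.
function makeByteClass(ranges) {
  const bits = new BigUint64Array(4); // 4 x 64 bits = one bit per byte value
  for (const [lo, hi] of ranges) {
    for (let b = lo; b <= hi; b++) {
      // Word index is b / 64, bit index within the word is b % 64.
      bits[b >> 6] |= 1n << BigInt(b & 63);
    }
  }
  return bits;
}

// Test whether a byte value (0..255) is a member of the class.
function classHas(bits, byte) {
  return ((bits[byte >> 6] >> BigInt(byte & 63)) & 1n) === 1n;
}

// Equivalent of the text class [0-9A-Fa-f], expressed as byte ranges.
const hexDigit = makeByteClass([[0x30, 0x39], [0x41, 0x46], [0x61, 0x66]]);
const sampleInput = new Uint8Array([0x41, 0x67, 0xff]);

console.log(classHas(hexDigit, sampleInput[0])); // true  (0x41 is "A")
console.log(classHas(hexDigit, sampleInput[1])); // false (0x67 is "g")
console.log(classHas(hexDigit, sampleInput[2])); // false
```

A Uint32Array(8) or Uint8Array(32) would hold the same 256 bits without BigInt arithmetic, which may be worth measuring before committing to BigUint64Array.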


bf · 2025-04-10T08:34:02Z

> think about parsing bits rather than bytes also. Maybe a rule decoration of some kind.

This is an important detail. A parser for DEFLATE would need to work bit-by-bit.
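
As a rough illustration of what bit-level access could look like, here is a small sketch. BitReader is a hypothetical class, not something from Peggy or PR #591. It assumes fields are packed starting at the least-significant bit of each byte, which is how DEFLATE packs its block header fields:

```javascript
// Hypothetical bit reader over a Uint8Array (illustration only).
class BitReader {
  constructor(bytes) {
    this.bytes = bytes;
    this.bitPos = 0; // absolute bit offset from the start of the input
  }

  // Read `count` bits (1..32), least-significant bit first.
  readBits(count) {
    let value = 0;
    for (let i = 0; i < count; i++) {
      const byteIndex = this.bitPos >> 3;
      if (byteIndex >= this.bytes.length) {
        throw new RangeError("unexpected end of input");
      }
      const bit = (this.bytes[byteIndex] >> (this.bitPos & 7)) & 1;
      value |= bit << i;
      this.bitPos++;
    }
    return value >>> 0; // keep the result unsigned when bit 31 is set
  }
}

// A DEFLATE block starts with BFINAL (1 bit) followed by BTYPE (2 bits).
const reader = new BitReader(new Uint8Array([0b00000101]));
const bfinal = reader.readBits(1);
const btype = reader.readBits(2);
console.log(bfinal, btype); // 1 2
```

With something like this, location() for a bit-level rule would need to report a bit offset (here bitPos) rather than a byte offset.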

Ideally the grammar files would always be parsed as UTF-8, but defining different data types (UTF-8 string, bytes, bits) should be possible. All data would be represented internally as some sort of raw data instead of JavaScript's built-in String.

If you look at https://v8docs.nodesource.com/node-5.12/da/d3d/classv8_1_1_array_buffer_view.html you can see that ArrayBufferView is the base class for TypedArray (such as Uint8Array) and DataView; both are views over an ArrayBuffer. Uint8Array could be used to save memory, as it uses one byte per element, the smallest footprint of the TypedArray variants.

However, in my testing I noticed that DataView might be the more clever choice because it comes with support for little-endian conversion right out of the gate and it is a very nice way to think about data: On one hand you have the raw data in ArrayBuffer and on the other hand you have a DataView which interprets this data in a very specific way (hey - this is exactly what a parser should do! 😄 ).
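
A short sketch of that split between raw data and interpretation, using only standard DataView and Uint8Array calls. The sample bytes are the start of a PNG IHDR chunk (a big-endian length of 13 followed by the chunk type), chosen here just for illustration:

```javascript
// Raw data: 4-byte length followed by the 4-byte chunk type "IHDR".
const chunkBytes = new Uint8Array([0x00, 0x00, 0x00, 0x0d, 0x49, 0x48, 0x44, 0x52]);

// A DataView over the same memory; byteOffset/byteLength matter when
// chunkBytes is itself a subarray of a larger buffer.
const view = new DataView(chunkBytes.buffer, chunkBytes.byteOffset, chunkBytes.byteLength);

// The second argument selects little-endian; false (the default) is big-endian.
const lengthBE = view.getUint32(0, false);
const lengthLE = view.getUint32(0, true);

// subarray() returns a view onto the same buffer, not a copy.
const typeBytes = chunkBytes.subarray(4, 8);
const chunkType = String.fromCharCode(...typeBytes);

console.log(lengthBE);  // 13
console.log(lengthLE);  // 218103808 (0x0d000000)
console.log(chunkType); // "IHDR"
```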

Ideally, it would be a merged version of Uint8Array (or Uint8ClampedArray) and DataView.
I was wondering why nobody is doing it this way already, and I noticed that a big drawback is that we lose the ability to use RegExp, because it only works on Strings.

In PR #591 I removed all uses of RegExp and replaced them with set-based operations. I noticed that polyfilling and (ab)using existing JavaScript types (e.g. using Number as Byte) makes converting the peggyjs parser a lot easier. After playing around with this for a while, I settled on a class BinaryView extends Uint8ClampedArray, which uses Uint8ClampedArray for storage but reimplements the DataView functionality.

I'm not a compiler expert and have no prior experience building parsers so there might be severe problems with this approach - just wanted to share in case it provides any value.

hildjj · 2025-04-10T15:35:22Z

DataView is a great building block here. Uint8ClampedArray is likely not needed, since its difference from Uint8Array is only when setting bytes, which the parser likely should never be doing.
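
For reference, a quick sketch of that difference: the two types only diverge when an out-of-range value is stored, and reads over the same bytes are identical.

```javascript
const plain = new Uint8Array(2);
const clamped = new Uint8ClampedArray(2);

// Out-of-range writes: Uint8Array wraps modulo 256, Uint8ClampedArray clamps to 0..255.
plain[0] = 300;
clamped[0] = 300;
plain[1] = -1;
clamped[1] = -1;

console.log(Array.from(plain));   // [44, 255]
console.log(Array.from(clamped)); // [255, 0]

// Reading the same underlying bytes through either type gives the same values.
const shared = new Uint8Array([0, 127, 255]);
const asClamped = new Uint8ClampedArray(shared.buffer);
console.log(Array.from(asClamped)); // [0, 127, 255]
```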
