Parse bytes from Uint8Array #592

Open
hildjj opened this issue Apr 8, 2025 · 2 comments

@hildjj
Contributor

hildjj commented Apr 8, 2025

See #591. Allow parsing a Uint8Array directly. Features needed:

  • Generate different parser from a similar grammar input based on an option.
  • Only a subset of grammar inputs is valid. In particular, all characters are treated as single-byte Latin-1, and any character that doesn't fit in that range is an error.
  • match byte by decimal/hex number
  • $ returns Uint8Array.prototype.subarray slice
  • Instead of Regexps for classes, use a 256-element bitstring. new BigUint64Array(4)?
  • location()/etc is all in bytes rather than code units
  • Error reporting in hexdump-like format (see chex for ideas)
  • building blocks of some kind for UTF8/16, float16/32/64, (u)int8/16/32/64 (all with big- and little-endian variants), and anything else that is common.
  • think about parsing bits rather than bytes also. Maybe a rule decoration of some kind.
  • example grammar for PNG or similar
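
Two of the items above, the 256-element bitstring for classes and `$` returning a subarray, can be sketched as follows. This is only an illustration under stated assumptions: `ByteClass`, `hexDigit`, and `matchHexDigits` are hypothetical names, not part of Peggy or its generated parsers.

```javascript
// Hypothetical helper: a 256-bit set stored as four 64-bit words.
class ByteClass {
  constructor() {
    this.words = new BigUint64Array(4);
  }

  // Add every byte in the inclusive range [lo, hi] to the class.
  addRange(lo, hi = lo) {
    for (let b = lo; b <= hi; b++) {
      this.words[b >> 6] |= 1n << BigInt(b & 63);
    }
    return this;
  }

  // True if byte b (0..255) is a member of the class.
  has(b) {
    return ((this.words[b >> 6] >> BigInt(b & 63)) & 1n) === 1n;
  }
}

// Roughly what a class like [0-9A-Fa-f] could compile to in byte mode.
const hexDigit = new ByteClass()
  .addRange(0x30, 0x39)
  .addRange(0x41, 0x46)
  .addRange(0x61, 0x66);

// Match one or more hex digits starting at pos. Like `$`, the result is a
// subarray view of the matched bytes (no copy), or null if nothing matched.
function matchHexDigits(input, pos) {
  let end = pos;
  while (end < input.length && hexDigit.has(input[end])) end++;
  return end > pos ? input.subarray(pos, end) : null;
}

const input = Uint8Array.of(0x31, 0x61, 0x46, 0x20, 0x7a); // "1aF z" in Latin-1
const matched = matchHexDigits(input, 0);
console.log(matched); // the three bytes 0x31, 0x61, 0x46
```

Because `subarray` shares the underlying buffer with `input`, the match costs no copy, at the price that later writes to `input` would show through.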
@bf

bf commented Apr 10, 2025

> think about parsing bits rather than bytes also. Maybe a rule decoration of some kind.

This is an important detail. A parser for DEFLATE would need to work bit-by-bit.
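
To make the bit-by-bit requirement concrete, here is a minimal sketch of a bit cursor over a Uint8Array. It assumes DEFLATE's packing order, where header fields are read starting from the least significant bit of each byte. `BitReader` and `readBits` are hypothetical names, not an existing Peggy feature.

```javascript
// Hypothetical bit cursor: reads bits LSB-first within each byte.
class BitReader {
  constructor(bytes) {
    this.bytes = bytes;
    this.bitPos = 0; // absolute position in bits from the start of input
  }

  // Read n bits (0..32) as an unsigned integer. The first bit read
  // becomes the least significant bit of the result.
  readBits(n) {
    let value = 0;
    for (let i = 0; i < n; i++) {
      const byteIndex = Math.floor(this.bitPos / 8);
      if (byteIndex >= this.bytes.length) {
        throw new RangeError(`read past end of input at bit ${this.bitPos}`);
      }
      const bit = (this.bytes[byteIndex] >> (this.bitPos % 8)) & 1;
      value += bit * 2 ** i; // arithmetic instead of << avoids sign issues at bit 31
      this.bitPos++;
    }
    return value;
  }
}

// A DEFLATE block header is 3 bits: BFINAL (1 bit), then BTYPE (2 bits).
const reader = new BitReader(Uint8Array.of(0b00000101, 0xff));
const bfinal = reader.readBits(1); // 1: this is the last block
const btype = reader.readBits(2);  // 2: dynamic Huffman codes
const next8 = reader.readBits(8);  // spans the byte boundary
console.log({ bfinal, btype, next8, bitPos: reader.bitPos });
```

A byte-oriented parser cannot express the last read at all, since it starts at bit 3 of one byte and ends at bit 2 of the next, which is why location tracking would need to be in bits for such grammars.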

Ideally, grammar files would always be parsed as UTF-8, but it should be possible to define different data types (UTF-8 string, bytes, bits). All data would be represented internally as some sort of raw data instead of JavaScript's built-in String.

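If you look at https://v8docs.nodesource.com/node-5.12/da/d3d/classv8_1_1_array_buffer_view.html you can see that, in V8, ArrayBufferView is the base class for typed arrays such as Uint8Array (and for DataView); each one is a view onto an underlying ArrayBuffer. Uint8Array could be used to save memory: at one byte per element it has the smallest footprint of the TypedArray variants (shared with Int8Array and Uint8ClampedArray).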

However, in my testing I noticed that DataView might be the more clever choice because it comes with support for little-endian conversion right out of the gate and it is a very nice way to think about data: On one hand you have the raw data in ArrayBuffer and on the other hand you have a DataView which interprets this data in a very specific way (hey - this is exactly what a parser should do! 😄 ).
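
As a small illustration of the "raw data plus interpreting view" idea, the sketch below wraps the same bytes in both a Uint8Array and a DataView. The bytes are the start of a PNG file (8-byte signature, then the IHDR chunk's length and type), and the variable names are illustrative only.

```javascript
// First 16 bytes of a PNG file.
const bytes = Uint8Array.of(
  0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, // PNG signature
  0x00, 0x00, 0x00, 0x0d,                         // chunk length: 13, big-endian
  0x49, 0x48, 0x44, 0x52                          // chunk type: "IHDR"
);

// A DataView over exactly the bytes the Uint8Array covers. Passing
// byteOffset and byteLength matters when `bytes` is itself a subarray.
const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

// DataView getters default to big-endian; pass true for little-endian.
const chunkLength = view.getUint32(8);          // 13
const asLittleEndian = view.getUint32(8, true); // 0x0d000000

// The same bytes read through the Uint8Array, decoded as Latin-1.
const chunkType = String.fromCharCode(...bytes.subarray(12, 16)); // "IHDR"

console.log(chunkLength, asLittleEndian.toString(16), chunkType);
```

Both objects share one ArrayBuffer, so nothing is copied; the Uint8Array gives byte-level indexing and the DataView gives multi-byte reads with explicit endianness.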

Ideally, the input type would be a merged version of Uint8Array (or Uint8ClampedArray) and DataView.
I wondered why nobody is doing it this way already, and I noticed that a big drawback is that we'd lose the ability to use RegExp, because regular expressions only work on strings.

In PR #591 I removed all uses of RegExp and replaced them with set-based operations. I noticed that polyfilling and (ab)using existing JavaScript types (e.g. using Number as Byte) makes converting the peggyjs parser a lot easier. After playing around with this for a while I settled on a class BinaryView extends Uint8ClampedArray, which uses Uint8ClampedArray internally but reimplements the DataView functionality.
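
For readers who want a feel for that shape, here is a hypothetical sketch of a typed-array subclass with DataView-style getters. It is not the class from the PR: the name `ByteView` is invented, and the getters delegate to a real DataView rather than reimplementing it.

```javascript
// Hypothetical sketch, not the implementation from the PR.
class ByteView extends Uint8ClampedArray {
  // DataView over exactly the bytes this array covers.
  get view() {
    return new DataView(this.buffer, this.byteOffset, this.byteLength);
  }

  getUint16(offset, littleEndian = false) {
    return this.view.getUint16(offset, littleEndian);
  }

  getFloat32(offset, littleEndian = false) {
    return this.view.getFloat32(offset, littleEndian);
  }
}

const data = ByteView.of(0x12, 0x34, 0x56, 0x78);
console.log(data[0]);                  // indexed like any typed array
console.log(data.getUint16(0));        // 0x1234 (big-endian)
console.log(data.getUint16(0, true));  // 0x3412 (little-endian)

// subarray() on a subclass returns an instance of the subclass, so the
// DataView-style getters remain available on slices.
const tail = data.subarray(2);
console.log(tail instanceof ByteView, tail.getUint16(0)); // true, 0x5678
```

Creating a DataView on every call keeps the sketch short; a real implementation would probably cache it or read the bytes directly.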

I'm not a compiler expert and have no prior experience building parsers so there might be severe problems with this approach - just wanted to share in case it provides any value.

@hildjj
Contributor Author

hildjj commented Apr 10, 2025

DataView is a great building block here. Uint8ClampedArray is likely not needed, since it differs from Uint8Array only when setting bytes (out-of-range values are clamped to 0-255 instead of wrapping), which the parser likely should never be doing.
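
The difference is easy to see with a few out-of-range writes. This is a minimal sketch with illustrative variable names.

```javascript
const plain = new Uint8Array(3);
const clamped = new Uint8ClampedArray(3);

// Write the same out-of-range and fractional values to both arrays.
[300, -5, 1.5].forEach((value, i) => {
  plain[i] = value;
  clamped[i] = value;
});

// Uint8Array truncates toward zero, then wraps modulo 256.
console.log(Array.from(plain));   // [44, 251, 1]

// Uint8ClampedArray rounds to nearest (ties to even), then clamps to 0..255.
console.log(Array.from(clamped)); // [255, 0, 2]
```

Reading is the same for both types: every element comes back as an integer from 0 to 255, so a parser that only reads its input gains nothing from the clamped variant.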
