Updating Unstructured::int_in_range to minimize entropy consumption #222

ejmount · 2025-04-26T19:07:24Z

Changes

int_in_range now uses individual bits from the stored data when the span of the input range allows
- ~~This mitigates but does not entirely eliminate bias~~ Not actually sure which way this changes the bias after further thought - the non-uniformity is now proportional to how far the range is from the next power of 2, not the next power of 256.
- As mentioned in the docs, methods that return segments of the buffer wholesale do not take the partial bytes into account and instead return from the next entirely unused byte
Adds a new public method, int_range_bytes_needed to query how much data from the buffer is needed for an int_in_range call to succeed without using dummy values.
Internal-only APIs that are actually infallible are changed to not return Result in anticipation of List of breaking changes for next breaking release #217
Docs are updated accordingly and new tests have been added to show that entropy bytes are split.

Unresolved Questions

I couldn't understand the purpose of arbitrary_byte_size well enough to determine whether using less entropy affects its calculations, so I have left the logic unchanged. Is this correct?
Some tests fail because they expect int_in_range to return specific answers given specific entropy and now fail because the calculation has changed. (Unlike e.g. int_in_range_covers_unsigned_range which tests that all possible answers are reached by some starting entropy without constraining which inputs map to which.) Should these expected values be updated or possibly the assertions removed?
Parts of the code are more awkward than necessary to avoid adding requirements to Int and potentially breaking compatibility - while the docs say clients shouldn't implement it, it's not enforced, and I wanted this to be a drop-in replacement. Is it worth making simplifications by adding methods to Int?
I realize just as I'm writing this PR that putting a guarantee on data consumed via int_range_bytes_needed may preclude completely fixing bias issues in the future - is this a acceptable, or should int_range_bytes_needed be removed?

- int_in_range now uses bits from the given data individually when the span of the range given to it allows - Adds a new public method, int_range_bytes_needed to query how much data from the buffer is needed for an int_in_range call to succeed without using dummy values - int_in_range_impl is given a slice instead of a iterator so its use of the data is clearer - Adjusts internal (but not public) APIs to reflect certain functions being infallible - Docs and *most* tests are updated accordingly. Tests that check for specific values being produced from arbitrary() calls are left failing.

ejmount · 2025-04-26T20:59:58Z

Just tbc, this is marked as draft because I don't think it's ready for merging until the questions have been resolved, but I don't have any other further edits to make and it is ready for your feedback.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updating Unstructured::int_in_range to minimize entropy consumption #222

Updating Unstructured::int_in_range to minimize entropy consumption #222

ejmount commented Apr 26, 2025 •

edited

Loading

ejmount commented Apr 26, 2025 •

edited

Loading

Updating Unstructured::int_in_range to minimize entropy consumption #222

Are you sure you want to change the base?

Updating Unstructured::int_in_range to minimize entropy consumption #222

Conversation

ejmount commented Apr 26, 2025 • edited Loading

Changes

Unresolved Questions

ejmount commented Apr 26, 2025 • edited Loading

ejmount commented Apr 26, 2025 •

edited

Loading

ejmount commented Apr 26, 2025 •

edited

Loading