Skip to content

Updating Unstructured::int_in_range to minimize entropy consumption #222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ejmount
Copy link

@ejmount ejmount commented Apr 26, 2025

Changes

  • int_in_range now uses individual bits from the stored data when the span of the input range allows
    • This mitigates but does not entirely eliminate bias Not actually sure which way this changes the bias after further thought - the non-uniformity is now proportional to how far the range is from the next power of 2, not the next power of 256.
    • As mentioned in the docs, methods that return segments of the buffer wholesale do not take the partial bytes into account and instead return from the next entirely unused byte
  • Adds a new public method, int_range_bytes_needed to query how much data from the buffer is needed for an int_in_range call to succeed without using dummy values.
  • Internal-only APIs that are actually infallible are changed to not return Result in anticipation of List of breaking changes for next breaking release #217
  • Docs are updated accordingly and new tests have been added to show that entropy bytes are split.

Unresolved Questions

  • I couldn't understand the purpose of arbitrary_byte_size well enough to determine whether using less entropy affects its calculations, so I have left the logic unchanged. Is this correct?
  • Some tests fail because they expect int_in_range to return specific answers given specific entropy and now fail because the calculation has changed. (Unlike e.g. int_in_range_covers_unsigned_range which tests that all possible answers are reached by some starting entropy without constraining which inputs map to which.) Should these expected values be updated or possibly the assertions removed?
  • Parts of the code are more awkward than necessary to avoid adding requirements to Int and potentially breaking compatibility - while the docs say clients shouldn't implement it, it's not enforced, and I wanted this to be a drop-in replacement. Is it worth making simplifications by adding methods to Int?
  • I realize just as I'm writing this PR that putting a guarantee on data consumed via int_range_bytes_needed may preclude completely fixing bias issues in the future - is this a acceptable, or should int_range_bytes_needed be removed?

- int_in_range now uses bits from the given data individually when the span of the range given to it allows
- Adds a new public method, int_range_bytes_needed to query how much data from the buffer is needed for an int_in_range call to succeed without using dummy values
  - int_in_range_impl is given a slice instead of a iterator so its use of the data is clearer
- Adjusts internal (but not public) APIs to reflect certain functions being infallible
- Docs and *most* tests are updated accordingly. Tests that check for specific values being produced from arbitrary() calls are left failing.
@ejmount
Copy link
Author

ejmount commented Apr 26, 2025

Just tbc, this is marked as draft because I don't think it's ready for merging until the questions have been resolved, but I don't have any other further edits to make and it is ready for your feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant