Benchmarking and Accuracy Documentation #7

mdhaber opened this issue Feb 14, 2025 · 2 comments

mdhaber commented Feb 14, 2025

Accurate evaluation of special functions (e.g. distribution functions, moments, entropy) with finite precision arithmetic is hard, and expecting results accurate to machine precision for arbitrary machine-representable inputs is impractical. At the time of writing, many SciPy stats and special functions produce inaccurate results without any warnings in the documentation or during execution; users typically report accuracy problems individually as "bugs", and occasionally we make improvements. Historically, there has not been a long-term, bird's-eye plan. We want to do better by:

  • providing documentation about the relationship between input values and result accuracy for each method of each distribution,
  • returning error estimates where possible,
  • returning NaNs rather than garbage, and
  • tracking the accuracy over time with benchmarks.

Only results that are inconsistent with the documented accuracy will be "bugs"; the rest is opportunity for improvement.

There are two things that need to happen in SciPy before we can completely satisfy these goals.

  • The distribution infrastructure needs a way of returning error estimate information. For default implementations of methods, this could be as simple as returning the error estimate provided by the numerical quadrature/inversion/minimization routine, assuming that the inaccuracy of the underlying functions is small in comparison. (I don't think it's practical to stack errors with interval arithmetic or anything.) However, the idea of adding an error attribute to the returned array has not really caught on, so we just don't have a way of doing that. (A hypothetical sketch of what this could look like follows this list.)
  • The distribution infrastructure needs to be translated to the array API so we can run mparrays through it. The idea is that we can (at least partially) automate the process of generating reference values by evaluating functions with arbitrary precision arithmetic.
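
As a concrete (and purely hypothetical) illustration of the first point, a default method implementation could simply pass along the error estimate that the underlying quadrature routine already computes. The `ResultWithError` container and `default_cdf` helper below are invented for this sketch; nothing like them exists in SciPy today.

```python
# Hypothetical sketch: carry the quadrature error estimate alongside the value
# that a default method implementation computes. `ResultWithError` and
# `default_cdf` are illustrative names, not an existing SciPy API.
from dataclasses import dataclass

import numpy as np
from scipy import integrate


@dataclass
class ResultWithError:
    value: float
    error: float  # absolute error estimate reported by the numerical routine


def default_cdf(pdf, x, a=-np.inf):
    """Default CDF via quadrature, propagating quad's error estimate."""
    value, abserr = integrate.quad(pdf, a, x)
    return ResultWithError(value=value, error=abserr)


# Example: standard normal CDF at x = 1.0
res = default_cdf(lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi), 1.0)
print(res.value, res.error)  # ~0.84134, plus quad's absolute-error estimate
```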

In the meantime, though, I think we can plan for what we want to do with those features when they are available. For instance, we can manually add ReferenceDistributions, and use those to run benchmarks and document method accuracy.
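
For example, here is a minimal sketch (not the actual ReferenceDistribution machinery) of generating a reference value with mpmath and measuring the relative error of the double-precision result from scipy.stats:

```python
# Sketch: use mpmath at high precision to produce a reference value for a
# distribution method, then measure the relative error of the SciPy result.
import mpmath
from scipy import stats

mpmath.mp.dps = 50  # ~50 significant digits for the reference value


def reference_norm_cdf(x):
    # Standard normal CDF: Phi(x) = erfc(-x / sqrt(2)) / 2
    x = mpmath.mpf(x)
    return mpmath.erfc(-x / mpmath.sqrt(2)) / 2


x = -8.5  # far left tail, where naive formulas lose accuracy
ref = reference_norm_cdf(x)
res = stats.norm.cdf(x)
rel_err = float(abs((mpmath.mpf(float(res)) - ref) / ref))
print(f"reference={mpmath.nstr(ref, 17)}  scipy={res:.17g}  rel. error={rel_err:.2e}")
```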

I don't have a complete picture of what those benchmarks and documentation will look like, but I have a vague notion of some aspects:

  • Documentation might contain statements like "Within subdomain $D$ (defined by e.g. a hyperrectangle), method will produce results within relative error $\epsilon_r$ and absolute error $\epsilon_a$ with probability $p$" (where $p$ is some pre-defined reliability, like 99.99%). Presumably, there will be some small, simple-to-define base region with good accuracy (e.g. " $x \in [10^{-10}, 10^{10}]$, $c \in [1, 10^{10}]$ "), and maybe outer subdomains and hot spots with reduced accuracy. (A sketch of checking such a statement empirically follows this list.)
  • Documentation might contain plots of error over the domain like these, which we could produce during benchmark runs. We need some tools to make it easy to generate standard ones.
  • We can make it easy for users to run benchmarks for themselves. It would be super cool to have an interactive, in-browser tool that investigates accuracy within a particular region (e.g. see plots above and imagine generating them interactively). @tupui is that in your wheelhouse?
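
To make the first bullet concrete, here is a rough sketch, not an existing SciPy tool, of how such a statement could be checked empirically: sample the hyperrectangle, compare against an arbitrary-precision reference, and estimate the fraction of points within the candidate tolerance. The function (the regularized lower incomplete gamma), subdomain, tolerance, and sample size are all illustrative choices.

```python
# Sketch of empirically checking "within subdomain D, results meet relative
# error eps_r with probability p". Everything here is illustrative.
import mpmath
import numpy as np
from scipy import special

mpmath.mp.dps = 50
rng = np.random.default_rng(1234)

# Subdomain D: a hyperrectangle in (a, x) space, sampled log-uniformly.
n = 200
a = 10.0 ** rng.uniform(0, 2, n)    # shape parameter a in [1, 100]
x = 10.0 ** rng.uniform(-2, 2, n)   # argument x in [0.01, 100]

eps_r = 1e-13  # candidate relative-error bound to document


def ref_gammainc(a_i, x_i):
    # Regularized lower incomplete gamma function, arbitrary precision.
    return mpmath.gammainc(mpmath.mpf(a_i), 0, mpmath.mpf(x_i), regularized=True)


within = np.empty(n, dtype=bool)
for i in range(n):
    ref = ref_gammainc(float(a[i]), float(x[i]))
    res = special.gammainc(a[i], x[i])
    # Compare in arbitrary precision to avoid overflow/underflow artifacts.
    within[i] = abs((mpmath.mpf(float(res)) - ref) / ref) <= eps_r

print(f"fraction of samples within eps_r={eps_r:g}: {within.mean():.2%}")
```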

For some inspiration, see how Boost presents accuracy information, e.g. here. (However, I'd like to go quite a bit beyond that.) We should also coordinate with scipy/xsf to avoid duplicating effort, since I think we'll be trying to solve similar problems.

In any case, clearly there is a lot we can do while waiting for features in the infrastructure!

steppi commented Feb 25, 2025

I've been thinking of things along these lines too. I'll summarize my ideas for special/xsf because that's what I've been working on, though I imagine the considerations for stats will be similar, even if stats is more complicated to test.

I worked on creating reference implementations for special functions based on mpmath in scipy/xsref#1. I don't have a reference implementation for every function implemented in xsf yet, but I think it's feasible to get to full coverage.
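
For readers unfamiliar with the approach, here is a minimal sketch in the spirit of those mpmath-based reference implementations (not the actual xsref code): evaluate the function at high precision and round once to float64 to get the reference output that would be stored in a table.

```python
# Sketch of an mpmath-based reference implementation for the scaled
# complementary error function erfcx(x) = exp(x**2) * erfc(x).
import mpmath
from scipy import special

mpmath.mp.dps = 50  # work with ~50 significant digits internally


def reference_erfcx(x):
    x = mpmath.mpf(x)
    value = mpmath.exp(x**2) * mpmath.erfc(x)
    return float(value)  # a single rounding to double precision


x = 30.0  # exp(x**2) overflows in double precision, but erfcx stays moderate
print(reference_erfcx(x), special.erfcx(x))
```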

My idea for testing xsf is to have reference tables with a reasonably comprehensive set of inputs and hopefully accurate reference outputs. Rather than having some fixed tolerance for the relative error that must be met to pass, I want to keep separate error tables of the current relative error[1] per platform. To pass a test, the relative error has to be no more than 2x what is in the error table, but the relative error tolerances in the error tables won't creep up if there is an increase, unless we determine this is really needed (e.g. because the reference implementation was inaccurate).
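
A toy version of that pass/fail rule, with a made-up in-memory table layout rather than the actual xsref format, might look like this:

```python
# Sketch of the rule above: a case passes if its observed relative error is at
# most 2x the value recorded in a per-platform error table. The table layout
# and values here are made up for illustration.
from scipy import special


def relative_error(observed, expected):
    # Plain relative error; see footnote 1 for the "extended" version that
    # also handles zeros, infinities, and NaNs.
    return abs(observed - expected) / abs(expected)


# Reference table: inputs and (ideally arbitrary-precision) reference outputs.
reference_table = [
    {"args": (0.5,), "expected": 0.5204998778130465},  # erf(0.5)
    {"args": (3.0,), "expected": 0.9999779095030014},  # erf(3.0)
]

# Error table: currently observed relative error for each case on this platform.
error_table = [4e-16, 4e-16]


def test_erf_against_error_table():
    for case, recorded_err in zip(reference_table, error_table):
        observed = special.erf(*case["args"])
        # Pass if no worse than 2x the recorded error; recorded values are only
        # tightened, never loosened, unless the reference itself was wrong.
        assert relative_error(observed, case["expected"]) <= 2 * recorded_err


test_erf_against_error_table()
```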

So your idea of tracking the accuracy over time with benchmarks is essentially how I envisioned the testing strategy working for xsf. I want to move towards a state where:

  • The rationale behind the test cases chosen for each reference table is documented, with justification that the cases are reasonably comprehensive, but with an understanding of the limits of the test coverage when relevant.
  • Accuracy on cases in the reference tables is documented with plots and tables which update based on the test runs in CI.

I like the idea of considering documented inaccuracies to be opportunities for improvement rather than outright bugs.

I also support the idea of returning NaNs instead of garbage when an answer is known to be highly inaccurate. In scipy.special we could set the error code SF_ERROR_NO_RESULT in this case. As an aside, if special function error reporting becomes more consistent and principled, then it may be a good idea to consider changing the default from "ignore" to "warn", as it currently is for ufuncs in NumPy.
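
For reference, scipy.special already lets users opt into "warn" or "raise" behavior per error category via seterr/errstate; the suggestion above is about changing the default. A small sketch (gammaln(0) signals a "singular" error, if I recall the current behavior correctly):

```python
# scipy.special errors are ignored by default; errstate/seterr can switch
# individual categories (e.g. singular, no_result) to "warn" or "raise".
import warnings

import scipy.special as sc

sc.gammaln(0)  # returns inf silently under the default "ignore" policy

with sc.errstate(all="warn"):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        sc.gammaln(0)  # still inf, but now a warning should be emitted
    print([str(w.message) for w in caught])
```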

Some other comments on specific points above:

  • We can make it easy for users to run benchmarks for themselves. It would be super cool to have an interactive, in-browser tool that investigates accuracy within a particular region (e.g. see plots above and imagine generating them interactively). @tupui is that in your wheelhouse?

That's not in my wheelhouse, but I was thinking about having interactive in-REPL tools in xsref for investigating accuracy of special functions. Someone with web dev skills could probably make a nice interface around what I hope to write in xsref.

  • returning error estimates where possible

I like this idea a lot, but I think it could be very difficult in many cases. We should do it when it's practical, though.

  • However, the idea of adding an error attribute to the returned array has not really caught on, so we just don't have a way of doing that.

This might actually be doable though if we can do it in a way that doesn't require lots of coordination with other projects. Let's discuss how we could actually move this forward.

I'm hoping to finish getting testing set up for xsf quite soon, and hopefully we can learn from that experience and bring some of these ideas to stats too.

Footnotes

  1. Technically, I'm using something I'm calling the extended relative error, which reduces to relative error for unexceptional floating point values but still returns a meaningful float result in [0, inf] when comparing any two floating point numbers, including zeros, infinities, and NaNs.
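
The exact definition isn't spelled out here, but one possible interpretation, purely as a sketch, is below: it agrees with ordinary relative error for finite nonzero expected values and returns a value in [0, inf] for any pair of doubles.

```python
# One possible interpretation of "extended relative error" (not necessarily the
# xsref definition): ordinary relative error for finite nonzero expected values,
# with explicit conventions for zeros, infinities, and NaNs.
import math

import numpy as np


def extended_relative_error(observed, expected):
    if math.isnan(expected) or math.isnan(observed):
        # Agreeing NaNs count as exact; a spurious or missing NaN is maximally wrong.
        return 0.0 if math.isnan(expected) and math.isnan(observed) else math.inf
    if math.isinf(expected) or math.isinf(observed):
        # Infinities must match exactly (including sign).
        return 0.0 if observed == expected else math.inf
    if expected == 0.0:
        # Avoid 0/0: scale by the smallest normal double instead of |expected|.
        return abs(observed - expected) / np.finfo(np.float64).tiny
    return abs(observed - expected) / abs(expected)


print(extended_relative_error(1.0 + 1e-15, 1.0))    # ~1e-15, ordinary case
print(extended_relative_error(0.0, 0.0))            # 0.0
print(extended_relative_error(math.nan, math.nan))  # 0.0
print(extended_relative_error(1.0, math.inf))       # inf
```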

mdhaber commented Feb 26, 2025

This might actually be doable though if we can do it in a way that doesn't require lots of coordination with other projects. Let's discuss how we could actually move this forward.

In the meantime, I went ahead and opened numpy/numpy#28397 to float the idea.
