
Add accelerated option to TransmissionAbsorptionConverter using numba #2036

Open · wants to merge 13 commits into master

Conversation


@hrobarts hrobarts commented Jan 13, 2025

Changes

Add accelerated option to TransmissionAbsorptionConverter using numba

Testing you performed

Please add any demo scripts to https://github.com/TomographicImaging/CIL-Demos/tree/main/misc


Related issues/links

Closes #2032

Checklist

  • I have performed a self-review of my code
  • I have added docstrings in line with the guidance in the developer guide
  • I have updated the relevant documentation
  • I have implemented unit tests that cover any new or modified functionality
  • CHANGELOG.md has been updated with any functionality change
  • Request review from all relevant developers
  • Change pull request label to 'Waiting for review'

Contribution Notes

Please read and adhere to the developer guide and local patterns and conventions.

  • The content of this Pull Request (the Contribution) is intentionally submitted for inclusion in CIL (the Work) under the terms and conditions of the Apache-2.0 License
  • I confirm that the contribution does not violate any intellectual property rights of third parties


@hrobarts hrobarts self-assigned this Jan 13, 2025
@hrobarts commented:

Testing to check the execution time of calculating the log of a large array either serially or with numba, parallelising either per pixel or per projection:


hrobarts commented Jan 13, 2025

Comparison of parallelising the log calculation either by operating on a slice of values in each parallel iteration:

import numba
import numpy as np

@numba.njit(parallel=True)
def numba_loop1(arr_in, num_proj, proj_size, arr_out):
    in_flat = arr_in.ravel()
    out_flat = arr_out.ravel()
    # Each parallel iteration applies -log to one projection-sized slice
    for i in numba.prange(num_proj):
        out_flat[i*proj_size:(i+1)*proj_size] = -np.log(in_flat[i*proj_size:(i+1)*proj_size])

Or using an explicit element-wise loop inside the parallel loop:

@numba.njit(parallel=True)
def numba_loop2(arr_in, num_proj, proj_size, arr_out):
    in_flat = arr_in.ravel()
    out_flat = arr_out.ravel()
    for i in numba.prange(num_proj):
        # Element-wise np.log on scalars inside the parallel loop
        for ij in range(proj_size):
            out_flat[i*proj_size+ij] = -np.log(in_flat[i*proj_size+ij])

import math

@numba.njit(parallel=True)
def numba_loop3(arr_in, num_proj, proj_size, arr_out):
    in_flat = arr_in.ravel()
    out_flat = arr_out.ravel()
    for i in numba.prange(num_proj):
        # Same as numba_loop2 but with math.log instead of np.log
        for ij in range(proj_size):
            out_flat[i*proj_size+ij] = -math.log(in_flat[i*proj_size+ij])

For data.shape = (1800, 1024, 1024)
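The way such timings can be gathered is sketched below in pure Python (a hypothetical stdlib-only harness for illustration, not the numba benchmark itself; `neg_log_serial` and `time_call` are made-up helper names):

```python
import math
import time

def neg_log_serial(values):
    # Serial reference: apply -log(x) element-wise.
    return [-math.log(v) for v in values]

def time_call(fn, *args, repeats=3):
    # Best wall-clock time over a few repeats, to reduce noise.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

data = [math.e] * 100_000
elapsed = time_call(neg_log_serial, data)
```

The same harness can time each numba variant in turn (after one warm-up call, so JIT compilation is excluded from the measurement).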


hrobarts commented Jan 13, 2025

Also comparing varying sizes of the data chunk used in each thread. Here the number of threads was 28, and the dashed line shows the chunk size and number of chunks we would have had if we had used number of chunks = number of projections for this dataset. The serial version on this dataset takes over 50 seconds.

The large increase occurs around chunk_size = 10^7 for both datasets. We suggest setting the default chunk size to 6400.
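The chunk bookkeeping behind these plots can be sketched in plain Python (`chunk_ranges` is a hypothetical helper, assuming the integer-division logic described above):

```python
def chunk_ranges(total_size, chunk_size=6400):
    # Split [0, total_size) into full chunks plus one remainder range,
    # mirroring the num_chunks / remainder bookkeeping in the PR.
    num_chunks = total_size // chunk_size
    remainder = total_size % chunk_size
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(num_chunks)]
    if remainder > 0:
        ranges.append((num_chunks * chunk_size, total_size))
    return ranges

# e.g. a (1800, 1024, 1024) dataset divides exactly into full chunks
ranges = chunk_ranges(1800 * 1024 * 1024)
```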

@hrobarts hrobarts marked this pull request as ready for review January 16, 2025 08:54
@hrobarts hrobarts requested a review from gfardell January 16, 2025 09:20

hrobarts commented Jan 21, 2025

Repeated the measurements on Windows with 20 cores.

Measured the execution time of the following code on a (1800, 1024, 1024) dataset with chunk size 6400:

@numba.njit(parallel=True)
def numba_loop1(arr_in, num_chunks, chunk_size, remainder, arr_out):
    in_flat = arr_in.ravel()
    out_flat = arr_out.ravel()
    # Process the full chunks in parallel
    for i in numba.prange(num_chunks):
        start = i * chunk_size
        end = start + chunk_size
        out_flat[start:end] = -np.log(in_flat[start:end])

    # Process any leftover elements serially
    if remainder > 0:
        start = num_chunks * chunk_size
        end = start + remainder
        out_flat[start:end] = -np.log(in_flat[start:end])
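For reference, the same chunk-plus-remainder logic can be written as a serial pure-Python sketch (`neg_log_chunked` is a hypothetical name; the real code above uses numba and NumPy):

```python
import math

def neg_log_chunked(arr_in, chunk_size=6400):
    # Serial equivalent of the chunked loop: process the full chunks
    # first, then the remainder, writing -log(x) element-wise.
    n = len(arr_in)
    num_chunks = n // chunk_size
    remainder = n % chunk_size
    out = [0.0] * n
    for i in range(num_chunks):
        start = i * chunk_size
        for j in range(start, start + chunk_size):
            out[j] = -math.log(arr_in[j])
    if remainder > 0:
        for j in range(num_chunks * chunk_size, n):
            out[j] = -math.log(arr_in[j])
    return out
```

Every element is visited exactly once, whether or not the array size is a multiple of the chunk size.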

Measure the execution time as a function of the number of chunks

And compare serial versus accelerated execution time for the whole processor, where the CIL default number of threads is 10

Review thread on CHANGELOG.md (outdated, resolved).
chunk_size = 6400
num_chunks = data.size // chunk_size

if (self._accelerated) & (num_chunks > 5):
Reviewer (Member):
Why do we compare to 5 here?

hrobarts (Contributor Author):

Hi Laura, we don't bother making multiple threads if the dataset is very small, since the overhead outweighs the benefit of creating the threads. This will only be the case for very small datasets.

in_flat = arr_in.ravel()
out_flat = arr_out.ravel()
for i in numba.prange(num_chunks):
start = i * chunk_size
Reviewer (Contributor):
would start += chunk_size be faster than reallocating and multiplying at each iteration?

hrobarts (Contributor Author):

Hi Edo, yes, it seems to be quicker. I'll update the code.

Successfully merging this pull request may close these issues.

Accelerated version of TransmissionAbsorption processor
3 participants