-
Hello, I've run into a problem where Thrust uses pageable memory internally, and this is hindering multi-stream parallelism. I came across a related question on Stack Overflow (link) and its answer, but I'm unsure whether that answer still holds, as time has passed and I'm not well-versed in Thrust's code. In short, can Thrust (specifically calls like `thrust::reduce`) be configured to use pinned memory for its internal buffers?
-
While the mentioned buffers can actually be configured to use pinned memory by passing an allocator with a `thrust::cuda::universal_host_pinned_memory_resource` to the execution policy (see e.g. `thrust/examples/cuda/custom_temporary_allocation.cu`), I'm not sure this solves your issue: I think `thrust::reduce` will still copy the result from these buffers to the host stack and synchronize afterwards, because it needs to return by value. I would also expect bad performance from using pinned memory for the device scratch space, as it is not only used for storing the final result. As mentioned on Discord, `cub::DeviceReduce` is the right choice in this situation.
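For reference, a minimal sketch of attaching such an allocator to the execution policy might look like the following. This is an assumption-laden sketch, not code from the thread: the resource typedef and header paths follow the Thrust 2.x layout and may differ between versions.

```cpp
// Sketch (untested, Thrust 2.x names assumed): route Thrust's temporary
// allocations through pinned host memory via a memory resource attached
// to the execution policy, cf. thrust/examples/cuda/custom_temporary_allocation.cu.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <thrust/mr/allocator.h>
#include <thrust/system/cuda/memory_resource.h>

int main()
{
    // Memory resource backing Thrust's internal scratch allocations with pinned memory.
    thrust::system::cuda::universal_host_pinned_memory_resource mr;
    thrust::mr::allocator<char,
        thrust::system::cuda::universal_host_pinned_memory_resource> alloc(&mr);

    thrust::device_vector<int> v(1 << 20, 1);

    // par(alloc): temporaries come from `alloc`. Note the caveat above:
    // the result is still returned by value, so the final host-stack copy
    // and synchronization remain.
    int sum = thrust::reduce(thrust::cuda::par(alloc), v.begin(), v.end());
    return sum == (1 << 20) ? 0 : 1;
}
```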
-
Thank you, and also for your answer on Discord. Following our discussion there, I have also noticed that in Thrust, many return values are stored in stack variables in host memory. These stack variables are not pinned memory, which significantly impacts performance: for instance, functions like `thrust::reduce` return their result by value, which forces a copy into a host stack variable and a synchronization. Since I haven't found an easy way within Thrust to address this performance issue, I've decided to rewrite the relevant code using CUB, tedious as that is. One of the advantages of CUB is that its functions write results through caller-provided output pointers rather than returning them by value, which is more conducive to optimizing performance.
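A minimal sketch of the CUB rewrite under discussion, assuming a CUDA 11.2+ runtime; the helper name `reduce_sum_async` and the use of `cudaMallocAsync` are illustrative choices, not part of the thread, and error checking is omitted.

```cpp
// Sketch (untested): cub::DeviceReduce::Sum on a dedicated stream, writing
// its result straight into pinned host memory, so no host-stack copy or
// implicit synchronization is needed.
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

// h_out must point to pinned (cudaMallocHost) memory; under UVA it is
// device-accessible, so CUB's final kernel can write the result there directly.
void reduce_sum_async(const float* d_in, int n, float* h_out, cudaStream_t stream)
{
    // First call with a null scratch pointer only queries the scratch size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, h_out, n, stream);

    cudaMallocAsync(&d_temp, temp_bytes, stream);  // stream-ordered allocation (CUDA 11.2+)
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, h_out, n, stream);
    cudaFreeAsync(d_temp, stream);
    // The caller synchronizes the stream (e.g. cudaStreamSynchronize) before
    // reading *h_out; until then, nothing blocks on the host.
}
```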