-
Hello, I've run into a problem where Thrust uses pageable memory internally, and this is hindering multi-stream parallelism. I came across a related question on Stack Overflow (link) and its answer, but I'm unsure whether that answer still holds, as time has passed and I'm not well-versed in Thrust's code. In short, can Thrust (specifically calls like `thrust::reduce`) be configured to use pinned memory for its internal buffers?
-
While the mentioned buffers can actually be configured to use pinned memory by passing an allocator with a `thrust::cuda::universal_host_pinned_memory_resource` to the execution policy (see e.g. `thrust/examples/cuda/custom_temporary_allocation.cu`), I'm not sure this solves your issue: I think `thrust::reduce` will still copy the result from these buffers to the host stack and synchronize afterwards, because it needs to return by value. I would also expect bad performance from using pinned memory for the device scratch space, as it is not only used for storing the final result. As mentioned on Discord, `cub::DeviceReduce` is the right choice in this situation.
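For reference, a minimal sketch of attaching such an allocator to the execution policy might look like the following. This is an assumption-laden sketch, not code from the thread: the resource typedef and header paths follow the Thrust 2.x layout and may differ between versions.

```cpp
// Sketch (untested, Thrust 2.x names assumed): route Thrust's temporary
// allocations through pinned host memory via a memory resource attached
// to the execution policy, cf. thrust/examples/cuda/custom_temporary_allocation.cu.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <thrust/mr/allocator.h>
#include <thrust/system/cuda/memory_resource.h>

int main()
{
    // Memory resource backing Thrust's internal scratch allocations with pinned memory.
    thrust::system::cuda::universal_host_pinned_memory_resource mr;
    thrust::mr::allocator<char,
        thrust::system::cuda::universal_host_pinned_memory_resource> alloc(&mr);

    thrust::device_vector<int> v(1 << 20, 1);

    // par(alloc): temporaries come from `alloc`. Note the caveat above:
    // the result is still returned by value, so the final host-stack copy
    // and synchronization remain.
    int sum = thrust::reduce(thrust::cuda::par(alloc), v.begin(), v.end());
    return sum == (1 << 20) ? 0 : 1;
}
```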
-
Thank you, and also for your answer on Discord. Following our discussion there, I have also noticed that in Thrust, many return values are stored in stack variables in host memory. These stack variables are not pinned memory, which significantly impacts performance: for instance, functions like `thrust::reduce` return their result by value, which forces a copy into a host stack variable and a synchronization. Since I haven't found an easy way within Thrust to address this performance issue, I've decided to rewrite the relevant code using CUB, tedious as that is. One of the advantages of CUB is that its functions write results through caller-provided output pointers rather than returning them by value, which is more conducive to optimizing performance.
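A minimal sketch of the CUB rewrite under discussion, assuming a CUDA 11.2+ runtime; the helper name `reduce_sum_async` and the use of `cudaMallocAsync` are illustrative choices, not part of the thread, and error checking is omitted.

```cpp
// Sketch (untested): cub::DeviceReduce::Sum on a dedicated stream, writing
// its result straight into pinned host memory, so no host-stack copy or
// implicit synchronization is needed.
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

// h_out must point to pinned (cudaMallocHost) memory; under UVA it is
// device-accessible, so CUB's final kernel can write the result there directly.
void reduce_sum_async(const float* d_in, int n, float* h_out, cudaStream_t stream)
{
    // First call with a null scratch pointer only queries the scratch size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, h_out, n, stream);

    cudaMallocAsync(&d_temp, temp_bytes, stream);  // stream-ordered allocation (CUDA 11.2+)
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, h_out, n, stream);
    cudaFreeAsync(d_temp, stream);
    // The caller synchronizes the stream (e.g. cudaStreamSynchronize) before
    // reading *h_out; until then, nothing blocks on the host.
}
```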