### Description

### Checklist

- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English; otherwise, the issue will be closed.
### Describe the bug

In `sglang/srt/mem_cache/memory_pool.py:915-920`:

```python
def init_kv_buffer(self):
    return torch.empty(
        (2, self.layer_num, self.size, self.head_num, self.head_dim),
        dtype=self.dtype,
        device=self.device,
        pin_memory=self.pin_memory,
    )
```
When this allocates pinned memory above some size threshold, the following error occurs:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Should sglang provide a server argument `enable_pin_memory` so that `pin_memory` can be toggled dynamically?
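If such a flag were added, the allocator could attempt a pinned allocation and degrade gracefully instead of crashing. A minimal sketch (hypothetical: `alloc_kv_buffer` and the `enable_pin_memory` parameter are illustrative, not existing sglang API):

```python
import torch

def alloc_kv_buffer(shape, dtype=torch.bfloat16, enable_pin_memory=True):
    """Allocate a host-side KV buffer, optionally pinned.

    Falls back to pageable memory when pinning fails, e.g. the
    "CUDA error: invalid argument" seen for very large buffers.
    """
    if enable_pin_memory:
        try:
            return torch.empty(shape, dtype=dtype, device="cpu",
                               pin_memory=True)
        except RuntimeError:
            # Pinning failed (oversized allocation, or no CUDA device);
            # fall back to ordinary pageable host memory.
            pass
    return torch.empty(shape, dtype=dtype, device="cpu")
```

With `enable_pin_memory=False`, or when pinning fails, host-to-device copies lose the bandwidth benefit of pinned memory, but the server keeps running instead of aborting at startup.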
### Reproduction

Using torch 2.5.1 or torch 2.6.0:

```python
import torch

t = torch.empty((2, 32, 10000, 8, 128), dtype=torch.bfloat16, device="cpu", pin_memory=True)
```
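To locate the size at which pinning starts to fail on a given machine, one can bisect over the token dimension. A sketch (the `max_pinned_tokens` helper is hypothetical, and the actual limit depends on the kernel and driver):

```python
import torch

def _try_pin(size):
    """Attempt to allocate one pinned KV buffer of `size` tokens."""
    try:
        torch.empty((2, 32, size, 8, 128), dtype=torch.bfloat16,
                    device="cpu", pin_memory=True)
        return True
    except RuntimeError:
        return False

def max_pinned_tokens(hi=1 << 20, try_alloc=_try_pin):
    """Binary-search the largest size for which try_alloc(size) succeeds."""
    lo = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if try_alloc(mid):
            lo = mid       # mid tokens still pin successfully
        else:
            hi = mid - 1   # pinning failed; shrink the search range
    return lo
```

Reporting the value this returns alongside the environment info would help narrow down whether the threshold is a driver limit or a kernel (`ulimit`/cgroup) limit.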
Related issues:

- deepspeedai/DeepSpeed#7150
### Environment

**Python packages**

- sglang==0.4.6
- torch==2.5.1 or torch==2.6.0

**System**

- Linux kernel: 5.10.112-005.ali5000.al8.x86_64
- GPU: NVIDIA H20
- NVIDIA driver version: 550.144.04
- CUDA version: 12.4