
[Bug] Cuda error: invalid argument when host init_kv_buffer with argument pin_memory=True #6285

beaulian opened this issue May 14, 2025 · 4 comments
beaulian commented May 14, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

In sglang/srt/mem_cache/memory_pool.py:915-920

def init_kv_buffer(self):
    return torch.empty(
        (2, self.layer_num, self.size, self.head_num, self.head_dim),
        dtype=self.dtype,
        device=self.device,
        pin_memory=self.pin_memory,
    )

When allocating pinned memory with a size above some threshold, the following error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Should sglang provide a server argument enable_pin_memory so that pin_memory can be enabled or disabled dynamically?
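
As a sketch of what such a flag could do (a hypothetical helper, not sglang's actual API): attempt the pinned allocation first, and fall back to pageable host memory if the CUDA host-pinning call rejects the request.

```python
import torch

def init_kv_buffer(shape, dtype, pin_memory=True):
    """Hypothetical fallback allocator: try pinned host memory first,
    and fall back to pageable memory if pinning fails (e.g. the
    "CUDA error: invalid argument" seen in this issue)."""
    if pin_memory:
        try:
            return torch.empty(shape, dtype=dtype, device="cpu", pin_memory=True)
        except RuntimeError as e:
            print(f"pinned allocation failed ({e}); falling back to pageable memory")
    return torch.empty(shape, dtype=dtype, device="cpu", pin_memory=False)
```

This keeps the fast path (pinned memory for faster host-device transfers) when it works, while avoiding a hard crash when the driver rejects a large pinned allocation.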

Reproduction

using torch 2.5.1 or torch 2.6.0

import torch
t = torch.empty((2, 32, 10000, 8, 128), dtype=torch.bfloat16, device="cpu", pin_memory=True)
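
For reference, the repro tensor is roughly 1.2 GiB of pinned host memory (a quick sketch of the arithmetic):

```python
import math

shape = (2, 32, 10000, 8, 128)
elements = math.prod(shape)   # 655,360,000 elements
bytes_total = elements * 2    # bfloat16 = 2 bytes per element
print(bytes_total / 2**30)    # ≈ 1.22 GiB of pinned host memory
```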

Related issues:
deepspeedai/DeepSpeed#7150

Environment

Python Packages
sglang==0.4.6
torch==2.5.1 or torch==2.6.0

System
Linux kernel: 5.10.112-005.ali5000.al8.x86_64
GPU: H20
Nvidia driver version: 550.144.04
CUDA version: 12.4

beaulian (Author)

@xiezhq-hermann
xiezhq-hermann (Collaborator)

Hi @beaulian, may I ask what threshold you were using?

beaulian (Author) commented May 15, 2025

> Hi @beaulian, may I ask what threshold you were using?

Hi @xiezhq-hermann, it's very strange. When I retry torch.empty many times, it succeeds for any size. Maybe pin_memory=True is not a stable argument.

beaulian (Author)

@xiezhq-hermann BTW, have you run the benchmark after adding --enable-hierarchical-cache?
