
[Feature] Prevent OOM Crashes in sglang with Large Batches or Image Inputs #6239

Open · yhyang201 opened this issue May 12, 2025 · 1 comment

@yhyang201 (Contributor)

Checklist

Motivation

  1. I tried using the OpenAI batches API, but noticed that when the number of requests becomes very large, it's quite easy to run into out-of-memory (OOM) errors that crash sglang; the script below reproduces this.
  2. I've also seen similar OOM crashes in sglang when using an MLLM and sending requests with large images (see the sketch after the batch script below).

Do you think it's necessary to proactively prevent these cases? If so, what would be a good approach to handle them?

from sglang.utils import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import json
from openai import OpenAI


# Launch an sglang server; launch_server_cmd returns the server process handle
# and the port the server is listening on.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8 --port 8000"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

# OpenAI-compatible client pointed at the local sglang server.
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Build 10,000 identical chat requests in the OpenAI batch-file (JSONL) format.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    } for i in range(10000)
]

# Write the requests to a JSONL file so it can be uploaded as a batch input file.
input_file_path = "batch_requests2.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

# Submit the batch job; with this many requests the server can run out of memory and crash.
batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")

Related resources

No response

@m0g1cian

Could you try --disable-fast-image-processor and --grammar-backend none? That should offload image preprocessing entirely to the CPU and reduce the VRAM footprint, I think.
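For reference, applying the suggested flags to the launch command from the reproduction script above would look roughly like the sketch below. The flag names are taken from this comment; whether they are available in your sglang version should be checked against python3 -m sglang.launch_server --help.

# Sketch: the same launch as in the reproduction script, with the flags suggested
# above appended. Flag names come from this comment and may vary by sglang version.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct "
    "--host 0.0.0.0 --mem-fraction-static 0.8 --port 8000 "
    "--disable-fast-image-processor --grammar-backend none"
)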
