Motivation
I tried using the OpenAI batches API, but noticed that when the number of requests gets very large, it's quite easy to run into OOM (out-of-memory) errors that crash sglang.
I've also seen similar OOM crashes in sglang when serving a multimodal LLM and sending requests with large images.
Do you think it's necessary to proactively prevent these cases? If so, what would be a good approach to handle them? A minimal reproduction is below:
```python
import json
import time

from openai import OpenAI
from sglang.utils import (
    launch_server_cmd,
    print_highlight,
    terminate_process,
    wait_for_server,
)

# Launch a small model; --mem-fraction-static 0.8 leaves some VRAM headroom.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct "
    "--host 0.0.0.0 --mem-fraction-static 0.8 --port 8000"
)
wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# 10,000 identical chat requests in the OpenAI batch JSONL format.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    }
    for i in range(10000)
]

# Write the requests to a JSONL file and upload it as a batch input.
input_file_path = "batch_requests2.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

# Create the batch job; with this many requests the server runs out of memory.
batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print_highlight(f"Batch job created with ID: {batch_response.id}")
```
Related resources
No response
Could you try `--disable-fast-image-processor` and `--grammar-backend none`? That should offload image preprocessing entirely to the CPU and reduce the VRAM footprint, I think.
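For reference, a sketch of the reproduction's launch command with those flags added (the flags are as suggested above, not verified here):

```python
# Sketch: the reproduction's launch command plus the suggested flags.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct "
    "--host 0.0.0.0 --mem-fraction-static 0.8 --port 8000 "
    "--disable-fast-image-processor --grammar-backend none"
)
```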