environment:
vllm: 0.8.4
sglang: 0.4.6.post2
Serving two models:
Model 1 with AWQ, about 20 GB:
python3 -m sglang.launch_server --model /data/qwen3_30b_a3b_awq/cognitivecomputations_Qwen3-30B-A3B-AWQ/ --trust-remote-code --quantization moe_wna16
Model 2, the raw model, about 61 GB:
python3 -m sglang.launch_server --model /data/qwen3_30b_a3b/Qwen_Qwen3-30B-A3B --trust-remote-code
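Before benchmarking, each server can be smoke-tested through its OpenAI-compatible endpoint. A minimal sketch, assuming the default SGLang port of 30000 (the model field here is illustrative; some SGLang versions ignore it or expect the served model path):

```python
import requests

# Minimal smoke test against an SGLang server's OpenAI-compatible API.
# Assumes the default port 30000; pass --port at launch to run both servers side by side.
resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={"model": "default", "prompt": "The capital of France is", "max_tokens": 8},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```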
Benchmark:
python3 benchmark/gsm8k/bench_sglang.py --port 30000 --parallel 1400 --num-questions 1400
Model 1, gsm8k, Qwen3_30B_A3B-AWQ, moe_wna16:
Accuracy: 0.894
Invalid: 0.000
Latency: 77.718 s
Output throughput: 2089.969 token/s
Model 2, gsm8k, Qwen3_30B_A3B:
Accuracy: 0.908
Invalid: 0.000
Latency: 50.131 s
Output throughput: 3084.839 token/s
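As a sanity check on these numbers, output throughput × latency gives the total output tokens per run, which shows both runs produced a similar amount of text, so the gap is not an artifact of differing output lengths:

```python
# Sanity check: total output tokens ≈ output throughput × latency,
# using the figures reported above.
runs = {
    "Qwen3-30B-A3B-AWQ (moe_wna16)": (2089.969, 77.718),
    "Qwen3-30B-A3B (unquantized)": (3084.839, 50.131),
}
for name, (tokens_per_s, latency_s) in runs.items():
    print(f"{name}: ~{tokens_per_s * latency_s:,.0f} output tokens")
# -> ~162,428 vs ~154,646 output tokens
```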
The results show that the accuracy is close, which is good.
However, the throughput is much worse for the quantized version.
Any ideas why this is happening?
@chesterout You need to make sure the kernel is optimized for your quantization method, quant config, and GPU. Take GPTQ/AWQ for example: make sure group_size is 128, since that is what the Machete kernels for H100+ are optimized for; otherwise you fall back to Marlin, which is not optimized for Hopper. Some quantization methods have multiple kernels, and each kernel has specific limitations and optimization targets that you need to be aware of.
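As a starting point for that check, the group size an AWQ checkpoint was exported with can be read from its config. A minimal sketch, assuming the checkpoint ships a standard quantization_config block in config.json (as AutoAWQ-style exports do):

```python
import json

# Inspect the quantization settings shipped with the AWQ checkpoint.
# model_dir is the directory from the launch command above.
model_dir = "/data/qwen3_30b_a3b_awq/cognitivecomputations_Qwen3-30B-A3B-AWQ"
with open(f"{model_dir}/config.json") as f:
    qcfg = json.load(f).get("quantization_config", {})

print("method:", qcfg.get("quant_method"))
print("bits:", qcfg.get("bits"))
print("group_size:", qcfg.get("group_size"))  # 128 is the Machete-friendly value noted above
```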