environment:
vllm: 0.8.4
sglang: 0.4.6.post2
Serving two models:
Model 1 with AWQ, about 20 GB:
python3 -m sglang.launch_server --model /data/qwen3_30b_a3b_awq/cognitivecomputations_Qwen3-30B-A3B-AWQ/ --trust-remote-code --quantization moe_wna16
Model 2, the raw model, about 61 GB:
python3 -m sglang.launch_server --model /data/qwen3_30b_a3b/Qwen_Qwen3-30B-A3B --trust-remote-code
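Before benchmarking, each server can be smoke-tested through its OpenAI-compatible endpoint. A minimal sketch, assuming the default SGLang port of 30000 (the model field here is illustrative; some SGLang versions ignore it or expect the served model path):

```python
import requests

# Minimal smoke test against an SGLang server's OpenAI-compatible API.
# Assumes the default port 30000; pass --port at launch to run both servers side by side.
resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={"model": "default", "prompt": "The capital of France is", "max_tokens": 8},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```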
Benchmark:
python3 benchmark/gsm8k/bench_sglang.py --port 30000 --parallel 1400 --num-questions 1400
Model 1, gsm8k, Qwen3_30B_A3B-AWQ, moe_wna16:
Accuracy: 0.894
Invalid: 0.000
Latency: 77.718 s
Output throughput: 2089.969 token/s
Model 2, gsm8k, Qwen3_30B_A3B:
Accuracy: 0.908
Invalid: 0.000
Latency: 50.131 s
Output throughput: 3084.839 token/s
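As a sanity check on these numbers, output throughput × latency gives the total output tokens per run, which shows both runs produced a similar amount of text, so the gap is not an artifact of differing output lengths:

```python
# Sanity check: total output tokens ≈ output throughput × latency,
# using the figures reported above.
runs = {
    "Qwen3-30B-A3B-AWQ (moe_wna16)": (2089.969, 77.718),
    "Qwen3-30B-A3B (unquantized)": (3084.839, 50.131),
}
for name, (tokens_per_s, latency_s) in runs.items():
    print(f"{name}: ~{tokens_per_s * latency_s:,.0f} output tokens")
# -> ~162,428 vs ~154,646 output tokens
```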
The results show that the accuracy is close, which is good.
However, the throughput is much worse for the quantized version.
Any ideas why this is happening?
@chesterout You need to make sure the kernel is optimized for your quantization method, quant config, and GPU. Take GPTQ/AWQ for example: make sure group_size is 128, since that is what the Machete kernels for H100+ are optimized for; otherwise you fall back to Marlin, which is not optimized for Hopper. Some quantization methods have multiple kernels, and each kernel has specific limitations and optimization targets that you need to be aware of.
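As a starting point for that check, the group size an AWQ checkpoint was exported with can be read from its config. A minimal sketch, assuming the checkpoint ships a standard quantization_config block in config.json (as AutoAWQ-style exports do):

```python
import json

# Inspect the quantization settings shipped with the AWQ checkpoint.
# model_dir is the directory from the launch command above.
model_dir = "/data/qwen3_30b_a3b_awq/cognitivecomputations_Qwen3-30B-A3B-AWQ"
with open(f"{model_dir}/config.json") as f:
    qcfg = json.load(f).get("quantization_config", {})

print("method:", qcfg.get("quant_method"))
print("bits:", qcfg.get("bits"))
print("group_size:", qcfg.get("group_size"))  # 128 is the Machete-friendly value noted above
```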