
Efficiency of quantization #6221


Open
chesterout opened this issue May 12, 2025 · 1 comment

Comments


chesterout commented May 12, 2025

Environment:
vllm: 0.8.4
sglang: 0.4.6.post2

Serving two models:
Model 1 with AWQ, about 20 GB:
python3 -m sglang.launch_server --model /data/qwen3_30b_a3b_awq/cognitivecomputations_Qwen3-30B-A3B-AWQ/ --trust-remote-code --quantization moe_wna16

Model 2, the raw model, about 61 GB:
python3 -m sglang.launch_server --model /data/qwen3_30b_a3b/Qwen_Qwen3-30B-A3B --trust-remote-code

Benchmark:
python3 benchmark/gsm8k/bench_sglang.py --port 30000 --parallel 1400 --num-questions 1400

Model 1, gsm8k, Qwen3_30B_A3B-AWQ, moe_wna16:
Accuracy: 0.894
Invalid: 0.000
Latency: 77.718 s
Output throughput: 2089.969 token/s

Model 2, gsm8k, Qwen3_30B_A3B:
Accuracy: 0.908
Invalid: 0.000
Latency: 50.131 s
Output throughput: 3084.839 token/s

The results show that accuracy is close, which is good.
However, throughput is much lower for the quantized model.

Any ideas why this is happening?

@Qubitium
Contributor

@chesterout You need to make sure the kernel is optimized for your quantization method, quant config, and GPU.

Take GPTQ/AWQ for example: make sure group_size is 128, since the Machete kernels for H100+ are optimized for that; otherwise you fall back to Marlin, which is not optimized for Hopper. Some quantization methods have multiple kernels, and each kernel has specific limitations and optimization targets that you need to be aware of.
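As a quick check, here is a minimal sketch for inspecting which group_size the AWQ checkpoint was quantized with. It assumes the quantization settings are stored under a `quantization_config` key in the checkpoint's `config.json` (as AutoAWQ-exported models typically do); the path is the one from the launch command above.

```python
import json
from pathlib import Path

# Local AWQ checkpoint directory (same path as in the launch_server command).
model_dir = Path("/data/qwen3_30b_a3b_awq/cognitivecomputations_Qwen3-30B-A3B-AWQ")

# Read the quantization settings, if present, from config.json.
config = json.loads((model_dir / "config.json").read_text())
quant_cfg = config.get("quantization_config", {})

print("quant_method:", quant_cfg.get("quant_method"))
print("bits:", quant_cfg.get("bits"))
print("group_size:", quant_cfg.get("group_size"))
```

If group_size is not 128, the fast kernel path may not be available and the server can fall back to a slower kernel, which would match the throughput gap you are seeing.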
