Support int8 kvcache #3034 (Open)
sleepcoo wants to merge 45 commits into sgl-project:main from sleepcoo:support-int8-kvcahe
Conversation
sleepcoo force-pushed the support-int8-kvcahe branch from c5e9426 to 0f30b65 on January 21, 2025 12:31.
sleepcoo force-pushed the support-int8-kvcahe branch from 9f6c235 to 7f3232d on January 22, 2025 05:35.
ispobock requested changes on Jan 23, 2025, with comments on python/sglang/srt/layers/attention/triton_ops/decode_attention.py (now outdated and resolved).
ispobock reviewed on Jan 24, 2025, with further comments on python/sglang/srt/layers/attention/triton_ops/decode_attention.py (now outdated and resolved).
ispobock reviewed on Jan 24, 2025.
sleepcoo force-pushed the support-int8-kvcahe branch from 0dc4d87 to 7067a71 on January 24, 2025 03:27.
sleepcoo force-pushed the support-int8-kvcahe branch from 8b5a9e0 to 4d2840f on January 24, 2025 10:02.
sleepcoo force-pushed the support-int8-kvcahe branch from 4d2840f to 3254082 on January 24, 2025 10:05.
ispobock reviewed on Jan 26, 2025:
```python
class TestInt8KvcacheLlamaMHA(TestInt8vcacheBase):
    model_config = {
        "model_name": DEFAULT_EAGLE_TARGET_MODEL_FOR_TEST,
```

Why use the eagle target model for this test? You can specify the model name here or add a new constant for an MHA model.
ispobock approved these changes on Jan 26, 2025.
Motivation
This PR implements online int8 asymmetric quantization for the KV cache.
In summary, the current int8 KV cache has the following advantages (see the evaluation and performance results below).
I also tested lmdeploy: its TurboMind engine showed performance gains, but the gains with its PyTorch engine are minimal and even negative in some scenarios due to the overhead of attention dequantization. I will later investigate the lmdeploy TurboMind implementation to see how it can be integrated into sglang, and add online KV cache quantization with fp8 as well. cc @zhyncs
Quantization Method
We adopt int8 asymmetric quantization to quantize the KV cache online, computing the scale and zero-point per token and per head.
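The PR implements this inside the Triton decode-attention kernels; the sketch below is only an illustration of the per-token, per-head asymmetric scheme (the function names and exact scale/zero-point layout are assumptions for illustration, not the PR's code):

```python
import torch

def quant_int8_asym(kv: torch.Tensor):
    # kv: [num_tokens, num_heads, head_dim]
    # Reduce over head_dim so each (token, head) pair gets its own scale/zero-point.
    kv_min = kv.amin(dim=-1, keepdim=True)
    kv_max = kv.amax(dim=-1, keepdim=True)
    scale = (kv_max - kv_min).clamp(min=1e-8) / 255.0
    zero_point = kv_min
    # Map values to [0, 255], then shift into the int8 storage range [-128, 127].
    q = torch.round((kv - zero_point) / scale) - 128.0
    return q.clamp(-128, 127).to(torch.int8), scale, zero_point

def dequant_int8_asym(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor):
    # Inverse mapping, applied before the attention dot product.
    return (q.float() + 128.0) * scale + zero_point

# Round-trip sanity check
kv = torch.randn(4, 8, 128)  # [num_tokens, num_heads, head_dim]
q, s, z = quant_int8_asym(kv)
max_err = (dequant_int8_asym(q, s, z) - kv).abs().max()
```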
Usage
Using KV cache quantization in sglang is simple: set `--kv-cache-dtype int8` when launching the server. (Currently only the Triton backend is supported; the FlashInfer backend is coming soon. @yinfan98)
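For example (the model path is a placeholder, and companion flags may vary by version): `python -m sglang.launch_server --model-path <model-path> --kv-cache-dtype int8`.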
Evaluation
We used the benchmark/gsm8k and benchmark/mmlu benchmarks in sglang. The results are shown in the table below:
Test command
Performance
Tested on an A100-40G.
TODO