Support int8 kvcache #3034

Open · wants to merge 45 commits into main from support-int8-kvcahe

Conversation

sleepcoo (Contributor)

Motivation

Implemented online int8 asymmetric quantization for the KV cache.

In Summary

The current int8 KV cache has the following advantages:

  • Online int8 KV cache quantization does not require an offline calibration dataset.
  • Int8 KV quantization is almost lossless in accuracy.
  • In scenarios with high compute and limited GPU memory, the benefits are significant (see the back-of-envelope sketch after this list).
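
To make the memory benefit concrete, here is a hypothetical back-of-envelope calculation (the Llama2-13B shape constants come from its public config; the exact scale/zero-point layout in this PR may differ):

    # Hypothetical back-of-envelope: KV cache bytes per token for Llama2-13B
    # (40 layers, 40 KV heads, head_dim 128; pure MHA, so no GQA sharing).
    layers, kv_heads, head_dim = 40, 40, 128

    fp16_bytes = 2 * layers * kv_heads * head_dim * 2  # K and V, 2 bytes/element
    int8_bytes = 2 * layers * kv_heads * head_dim * 1  # 1 byte/element
    # Per-token, per-head asymmetric quantization also stores one scale and one
    # zero point per (layer, head, K/V) for each token, assumed fp16 here:
    overhead = 2 * layers * kv_heads * 2 * 2

    print(fp16_bytes)             # 819200 B (~800 KiB per token)
    print(int8_bytes + overhead)  # 422400 B (~413 KiB per token, roughly half)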

I also tested lmdeploy: its TurboMind engine showed performance gains, but the gains with its PyTorch engine are minimal and even negative in some scenarios due to the overhead of attention dequantization. I will later investigate the lmdeploy TurboMind implementation to see how it can be integrated into sglang, and will also support online KV cache quantization with fp8. cc @zhyncs

Quantization Method

We adopted asymmetric int8 quantization, quantizing the KV cache online with per-token, per-head scales and zero points.
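
As a minimal PyTorch sketch of the scheme (illustrative only; the actual implementation lives in the Triton attention kernels, and these function names are made up):

    import torch

    def quantize_kv_int8(x: torch.Tensor):
        """Asymmetric int8 quantization of a K or V tensor, per token and per head.

        x: [num_tokens, num_heads, head_dim]. Returns the int8 tensor plus the
        per-(token, head) scale and zero point needed to dequantize.
        """
        x_min = x.amin(dim=-1, keepdim=True)        # [num_tokens, num_heads, 1]
        x_max = x.amax(dim=-1, keepdim=True)
        scale = (x_max - x_min).clamp(min=1e-5) / 255.0
        q = torch.round((x - x_min) / scale) - 128  # map [min, max] -> [-128, 127]
        return q.clamp(-128, 127).to(torch.int8), scale, x_min

    def dequantize_kv_int8(q, scale, x_min):
        # Inverse mapping, applied when the cached K/V are consumed by attention.
        return (q.float() + 128) * scale + x_min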

Usage

Using KV cache quantization for inference with sglang is very simple: just set the parameter --kv-cache-dtype=int8. (Currently only the Triton backend is supported; the Flashinfer backend is coming soon. @yinfan98)
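
For example, assuming the standard sglang launch command (the model path and attention-backend flag here are illustrative; only --kv-cache-dtype=int8 is introduced by this PR):

    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-14B-Instruct \
        --kv-cache-dtype int8 --attention-backend triton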

Evaluation

We used the benchmark/gsm8k and benchmark/mmlu benchmarks in sglang. The results are shown in the table below:

model        dataset  metric    Int8   Fp16
Qwen2.5 14B  gsm8k    accuracy  0.718  0.723
Llama2-13B   gsm8k    accuracy  0.234  0.234
Llama3-8B    gsm8k    accuracy  0.752  0.752

Test command

python3 benchmark/gsm8k/bench_sglang.py  --num-questions 1319

Performance

Tested on an A100-40G.

model        kv type  RPS    vs. kv fp16
Qwen2.5 14B  fp16     10.48  1.00
Qwen2.5 14B  int8     10.41  0.996
Llama2 13B   fp16     5.21   1.00
Llama2 13B   int8     5.97   1.15

TODO

  • Add unit tests
  • Add evaluation results

@yinfan98 (Contributor)

Please help review it again. cc: @ispobock @zhyncs @sleepcoo

class TestInt8KvcacheLlamaMHA(TestInt8vcacheBase):
    model_config = {
        "model_name": DEFAULT_EAGLE_TARGET_MODEL_FOR_TEST,

Collaborator:

Why use the eagle target model for this test? You can specify the model name here or add a new constant for an MHA model.
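
For example (DEFAULT_MHA_MODEL_FOR_TEST is a hypothetical new constant, and the Llama-2 model id is only an illustration of an MHA model):

    # Hypothetical: define an explicit MHA model constant instead of reusing
    # the eagle target model.
    DEFAULT_MHA_MODEL_FOR_TEST = "meta-llama/Llama-2-13b-chat-hf"  # MHA, no GQA

    class TestInt8KvcacheLlamaMHA(TestInt8vcacheBase):
        model_config = {
            "model_name": DEFAULT_MHA_MODEL_FOR_TEST,
        }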
