Support int8 kvcache #3034 (Open)
sleepcoo wants to merge 45 commits into sgl-project:main from sleepcoo:support-int8-kvcahe
Conversation
sleepcoo force-pushed the support-int8-kvcahe branch from c5e9426 to 0f30b65 on January 21, 2025 12:31.
sleepcoo force-pushed the support-int8-kvcahe branch from 9f6c235 to 7f3232d on January 22, 2025 05:35.
ispobock requested changes on Jan 23, 2025, with comments on python/sglang/srt/layers/attention/triton_ops/decode_attention.py (now outdated and resolved).
ispobock reviewed on Jan 24, 2025, with further comments on python/sglang/srt/layers/attention/triton_ops/decode_attention.py (now outdated and resolved).
ispobock reviewed on Jan 24, 2025.
sleepcoo force-pushed the support-int8-kvcahe branch from 0dc4d87 to 7067a71 on January 24, 2025 03:27.
sleepcoo force-pushed the support-int8-kvcahe branch from 8b5a9e0 to 4d2840f on January 24, 2025 10:02.
sleepcoo force-pushed the support-int8-kvcahe branch from 4d2840f to 3254082 on January 24, 2025 10:05.
ispobock reviewed on Jan 26, 2025:
```python
class TestInt8KvcacheLlamaMHA(TestInt8vcacheBase):
    model_config = {
        "model_name": DEFAULT_EAGLE_TARGET_MODEL_FOR_TEST,
```

Why use the eagle target model for this test? You can specify the model name here or add a new constant for an MHA model.
ispobock approved these changes on Jan 26, 2025.
Motivation
This PR implements online int8 asymmetric quantization for the KV cache.
In summary, the current int8 KV cache has the following advantages (see the evaluation and performance results below).
I also tested lmdeploy: its TurboMind engine showed performance gains, but the gains with its PyTorch engine are minimal and even negative in some scenarios due to the overhead of attention dequantization. I will later investigate the lmdeploy TurboMind implementation to see how it can be integrated into sglang, and add online KV cache quantization with fp8 as well. cc @zhyncs
Quantization Method
We adopt int8 asymmetric quantization to quantize the KV cache online, computing the scale and zero-point per token and per head.
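The PR implements this inside the Triton decode-attention kernels; the sketch below is only an illustration of the per-token, per-head asymmetric scheme (the function names and exact scale/zero-point layout are assumptions for illustration, not the PR's code):

```python
import torch

def quant_int8_asym(kv: torch.Tensor):
    # kv: [num_tokens, num_heads, head_dim]
    # Reduce over head_dim so each (token, head) pair gets its own scale/zero-point.
    kv_min = kv.amin(dim=-1, keepdim=True)
    kv_max = kv.amax(dim=-1, keepdim=True)
    scale = (kv_max - kv_min).clamp(min=1e-8) / 255.0
    zero_point = kv_min
    # Map values to [0, 255], then shift into the int8 storage range [-128, 127].
    q = torch.round((kv - zero_point) / scale) - 128.0
    return q.clamp(-128, 127).to(torch.int8), scale, zero_point

def dequant_int8_asym(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor):
    # Inverse mapping, applied before the attention dot product.
    return (q.float() + 128.0) * scale + zero_point

# Round-trip sanity check
kv = torch.randn(4, 8, 128)  # [num_tokens, num_heads, head_dim]
q, s, z = quant_int8_asym(kv)
max_err = (dequant_int8_asym(q, s, z) - kv).abs().max()
```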
Usage
Using KV cache quantization in sglang is simple: set `--kv-cache-dtype int8` when launching the server. (Currently only the Triton backend is supported; the FlashInfer backend is coming soon. @yinfan98)
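For example (the model path is a placeholder, and companion flags may vary by version): `python -m sglang.launch_server --model-path <model-path> --kv-cache-dtype int8`.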
Evaluation
We used the benchmark/gsm8k and benchmark/mmlu benchmarks in sglang. The results are shown in the table below:
Test command
Performance
Tested on an A100-40G.
TODO