
Commit d1492f1

Authored by lvhan028, yao-fengchen, and DoorKickers

bump version to v0.8.0 (#3432)

* bump version to v0.8.0
* update supported models
* update supported models
* cherry-pick #3420
* modify ascend dockerfile to support direct run lmdeploy serve (#3436)
* update supported models on ascend platform
* update supported models

---------

Co-authored-by: yaofengchen <[email protected]>
Co-authored-by: Lantian Zhang <[email protected]>

1 parent 3afd6c0 · commit d1492f1

File tree: 13 files changed (+143, -51 lines)


README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -124,6 +124,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
 <li>Qwen2 (0.5B - 72B)</li>
 <li>Qwen2-MoE (57BA14B)</li>
 <li>Qwen2.5 (0.5B - 32B)</li>
+<li>Qwen3, Qwen3-MoE</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -158,6 +159,7 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
 <li>InternVL-Chat (v1.1-v1.5)</li>
 <li>InternVL2 (1B-76B)</li>
 <li>InternVL2.5(MPO) (1B-78B)</li>
+<li>InternVL3 (1B-78B)</li>
 <li>Mono-InternVL (2B)</li>
 <li>ChemVLM (8B-26B)</li>
 <li>CogVLM-Chat (17B)</li>
```

README_ja.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -122,6 +122,7 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
 <li>Qwen2 (0.5B - 72B)</li>
 <li>Qwen2-MoE (57BA14B)</li>
 <li>Qwen2.5 (0.5B - 32B)</li>
+<li>Qwen3, Qwen3-MoE</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -156,6 +157,7 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
 <li>InternVL-Chat (v1.1-v1.5)</li>
 <li>InternVL2 (1B-76B)</li>
 <li>InternVL2.5(MPO) (1B-78B)</li>
+<li>InternVL3 (1B-78B)</li>
 <li>Mono-InternVL (2B)</li>
 <li>ChemVLM (8B-26B)</li>
 <li>CogVLM-Chat (17B)</li>
```

README_zh-CN.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -126,6 +126,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>Qwen2 (0.5B - 72B)</li>
 <li>Qwen2-MoE (57BA14B)</li>
 <li>Qwen2.5 (0.5B - 32B)</li>
+<li>Qwen3, Qwen3-MoE</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -160,6 +161,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>InternVL-Chat (v1.1-v1.5)</li>
 <li>InternVL2 (1B-76B)</li>
 <li>InternVL2.5(MPO) (1B-78B)</li>
+<li>InternVL3 (1B-78B)</li>
 <li>Mono-InternVL (2B)</li>
 <li>ChemVLM (8B-26B)</li>
 <li>CogVLM-Chat (17B)</li>
```

docker/Dockerfile_aarch64_ascend

Lines changed: 46 additions & 1 deletion
```diff
@@ -110,7 +110,8 @@ RUN echo "source /usr/local/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc && \
 RUN --mount=type=cache,target=/root/.cache/pip \
     pip3 install torch==2.3.1 torchvision==0.18.1 torch-npu==2.3.1 && \
     pip3 install transformers timm && \
-    pip3 install dlinfer-ascend
+    pip3 install dlinfer-ascend && \
+    pip3 install partial_json_parser shortuuid

 # lmdeploy
 FROM build_temp as copy_temp
@@ -122,3 +123,47 @@ WORKDIR /opt/lmdeploy

 RUN --mount=type=cache,target=/root/.cache/pip \
     LMDEPLOY_TARGET_DEVICE=ascend pip3 install -v --no-build-isolation -e .
+
+ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
+ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/$(arch):$LD_LIBRARY_PATH
+ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/tools/aml/lib64:${ASCEND_TOOLKIT_HOME}/tools/aml/lib64/plugin:$LD_LIBRARY_PATH
+ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:$PYTHONPATH
+ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${ASCEND_TOOLKIT_HOME}/tools/ccec_compiler/bin:$PATH
+ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
+ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
+ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
+ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}
+
+ENV ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
+ENV LD_LIBRARY_PATH=${ATB_HOME_PATH}/lib:${ATB_HOME_PATH}/examples:${ATB_HOME_PATH}/tests/atbopstest:$LD_LIBRARY_PATH
+ENV PATH=${ATB_HOME_PATH}/bin:$PATH
+
+ENV ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
+ENV ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
+ENV ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
+ENV ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
+ENV ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
+ENV ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
+ENV ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
+ENV ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
+ENV ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
+ENV ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
+ENV ATB_COMPARE_TILING_EVERY_KERNEL=0
+ENV ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
+ENV ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
+ENV ATB_SHARE_MEMORY_NAME_SUFFIX=""
+ENV ATB_LAUNCH_KERNEL_WITH_TILING=1
+ENV ATB_MATMUL_SHUFFLE_K_ENABLE=1
+ENV ATB_RUNNER_POOL_SIZE=64
+
+ENV ASDOPS_HOME_PATH=${ATB_HOME_PATH}
+ENV ASDOPS_MATMUL_PP_FLAG=1
+ENV ASDOPS_LOG_LEVEL=ERROR
+ENV ASDOPS_LOG_TO_STDOUT=0
+ENV ASDOPS_LOG_TO_FILE=1
+ENV ASDOPS_LOG_TO_FILE_FLUSH=0
+ENV ASDOPS_LOG_TO_BOOST_TYPE=atb
+ENV ASDOPS_LOG_PATH=~
+ENV ASDOPS_TILING_PARSE_CACHE_DISABLE=0
+
+ENV LCCL_DETERMINISTIC=0
```
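
The commit description says these Dockerfile changes let the image run `lmdeploy serve` directly. Below is a minimal sketch of how that might look; the image tag, model path, and device/driver mounts are illustrative placeholders for a typical Ascend host, not part of this commit.

```bash
# Hedged sketch: build the Ascend image and launch the OpenAI-compatible server
# straight from `docker run`, relying on the ENV setup baked into the image above.
# Image tag, model path, and device/driver mounts are illustrative placeholders.
docker build -t lmdeploy-aarch64-ascend:v0.8.0 -f docker/Dockerfile_aarch64_ascend .
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /path/to/models:/models \
    lmdeploy-aarch64-ascend:v0.8.0 \
    lmdeploy serve api_server /models/your-model --backend pytorch --device ascend
```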

docs/en/get_started/ascend/get_started.md

Lines changed: 10 additions & 0 deletions
```diff
@@ -158,6 +158,16 @@ lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

 Please check [supported_models](../../supported_models/supported_models.md) before use this feature.

+### w8a8 SMOOTH_QUANT
+
+Run the following commands to quantize weights on Atlas 800T A2.
+
+```bash
+lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
+```
+
+Please check [supported_models](../../supported_models/supported_models.md) before use this feature.
+
 ### int8 KV-cache Quantization

 Ascend backend has supported offline int8 KV-cache Quantization on eager mode.
```
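
A hedged follow-up, not part of the committed docs: once `smooth_quant` has written the w8a8 model to `$WORK_DIR`, it should be loadable with the PyTorch engine on the NPU, for example:

```bash
# Sketch only: chat with the quantized output from the step above on the Ascend device.
lmdeploy chat $WORK_DIR --backend pytorch --device ascend
```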

docs/en/get_started/installation.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -23,7 +23,7 @@ pip install lmdeploy
 The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy by:

 ```shell
-export LMDEPLOY_VERSION=0.7.2.post1
+export LMDEPLOY_VERSION=0.8.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```
````
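
A quick way to confirm the bumped wheel actually got installed, assuming the package exposes `__version__` as recent releases do:

```bash
# Sanity check after installing the cu118 wheel; expected output: 0.8.0
python -c "import lmdeploy; print(lmdeploy.__version__)"
```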

docs/en/multi_modal/internvl.md

Lines changed: 8 additions & 7 deletions
```diff
@@ -2,13 +2,14 @@

 LMDeploy supports the following InternVL series of models, which are detailed in the table below:

-| Model         | Size       | Supported Inference Engine |
-| :-----------: | :--------: | :------------------------: |
-| InternVL      | 13B-19B    | TurboMind                  |
-| InternVL1.5   | 2B-26B     | TurboMind, PyTorch         |
-| InternVL2     | 1B, 4B     | PyTorch                    |
-| InternVL2     | 2B, 8B-76B | TurboMind, PyTorch         |
-| Mono-InternVL | 2B         | PyTorch                    |
+| Model                 | Size          | Supported Inference Engine |
+| :-------------------: | :-----------: | :------------------------: |
+| InternVL              | 13B-19B       | TurboMind                  |
+| InternVL1.5           | 2B-26B        | TurboMind, PyTorch         |
+| InternVL2             | 4B            | PyTorch                    |
+| InternVL2             | 1B-2B, 8B-76B | TurboMind, PyTorch         |
+| InternVL2.5/2.5-MPO/3 | 1B-78B        | TurboMind, PyTorch         |
+| Mono-InternVL         | 2B            | PyTorch                    |

 The next chapter demonstrates how to deploy an InternVL model using LMDeploy, with [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example.
```
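
The doc's running example is InternVL2-8B; a minimal sketch of serving it under the updated support matrix (the port and model id are arbitrary illustrative choices, not from this commit):

```bash
# Hedged sketch: expose InternVL2-8B through the OpenAI-compatible API server.
lmdeploy serve api_server OpenGVLab/InternVL2-8B --server-port 23333
```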

docs/en/supported_models/supported_models.md

Lines changed: 26 additions & 17 deletions
```diff
@@ -22,6 +22,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | Qwen2<sup>\[2\]</sup> | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
 | Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
 | Qwen2.5<sup>\[2\]</sup> | 0.5B - 72B | LLM | Yes | Yes\* | Yes\* | Yes |
+| Qwen3 | 0.6B-235B | LLM | Yes | Yes | Yes\* | Yes\* |
 | Mistral<sup>\[1\]</sup> | 7B | LLM | Yes | Yes | Yes | No |
 | Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
 | DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
@@ -36,6 +37,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
 | InternVL2<sup>\[2\]</sup> | 1 - 2B, 8B - 76B | MLLM | Yes | Yes\* | Yes\* | Yes |
 | InternVL2.5(MPO)<sup>\[2\]</sup> | 1 - 78B | MLLM | Yes | Yes\* | Yes\* | Yes |
+| InternVL3<sup>\[2\]</sup> | 1 - 78B | MLLM | Yes | Yes\* | Yes\* | Yes |
 | ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
 | MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
 | MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
@@ -76,6 +78,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | QWen1.5-MoE | A2.7B | LLM | Yes | Yes | Yes | No | No |
 | QWen2 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
 | Qwen2.5 | 0.5B - 72B | LLM | Yes | Yes | No | Yes | Yes |
+| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes\* | - | Yes\* |
 | QWen2-VL | 2B, 7B | MLLM | Yes | Yes | No | No | Yes |
 | QWen2.5-VL | 3B - 72B | MLLM | Yes | No | No | No | No |
 | DeepSeek-MoE | 16B | LLM | Yes | No | No | No | No |
@@ -95,6 +98,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | InternVL(v1.5) | 2B-26B | MLLM | Yes | Yes | Yes | No | Yes |
 | InternVL2 | 1B-76B | MLLM | Yes | Yes | Yes | - | - |
 | InternVL2.5(MPO) | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
+| InternVL3 | 1B-78B | MLLM | Yes | Yes | Yes | - | - |
 | Mono-InternVL<sup>\[1\]</sup> | 2B | MLLM | Yes | Yes | Yes | - | - |
 | ChemVLM | 8B-26B | MLLM | Yes | Yes | No | - | - |
 | Gemma2 | 9B-27B | LLM | Yes | Yes | Yes | - | - |
@@ -114,20 +118,25 @@ The following tables detail the models supported by LMDeploy's TurboMind engine

 ## PyTorchEngine on Huawei Ascend Platform

-| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W4A16(eager) |
-| :------------: | :------: | :--: | :--------------: | :--------------: | :----------: |
-| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes |
-| Llama3 | 8B | LLM | Yes | Yes | Yes |
-| Llama3.1 | 8B | LLM | Yes | Yes | Yes |
-| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes |
-| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes |
-| InternLM3 | 8B | LLM | Yes | Yes | Yes |
-| Mixtral | 8x7B | LLM | Yes | Yes | No |
-| QWen1.5-MoE | A2.7B | LLM | Yes | - | No |
-| QWen2(.5) | 7B | LLM | Yes | Yes | No |
-| QWen2-MoE | A14.57B | LLM | Yes | - | No |
-| DeepSeek-V2 | 16B | LLM | No | Yes | No |
-| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes |
-| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes |
-| CogVLM2-chat | 19B | MLLM | Yes | No | - |
-| GLM4V | 9B | MLLM | Yes | No | - |
+| Model | Size | Type | FP16/BF16(eager) | FP16/BF16(graph) | W8A8(graph) | W4A16(eager) |
+| :------------: | :-------: | :--: | :--------------: | :--------------: | :---------: | :----------: |
+| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
+| Llama3 | 8B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.1 | 8B | LLM | Yes | Yes | Yes | Yes |
+| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
+| InternLM2.5 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
+| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes |
+| Mixtral | 8x7B | LLM | Yes | Yes | No | No |
+| QWen1.5-MoE | A2.7B | LLM | Yes | - | No | No |
+| QWen2(.5) | 7B | LLM | Yes | Yes | Yes | Yes |
+| QWen2-VL | 2B, 7B | MLLM | Yes | Yes | - | - |
+| QWen2.5-VL | 3B - 72B | MLLM | Yes | Yes | - | - |
+| QWen2-MoE | A14.57B | LLM | Yes | - | No | No |
+| QWen3 | 0.6B-235B | LLM | Yes | Yes | No | No |
+| DeepSeek-V2 | 16B | LLM | No | Yes | No | No |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes | - | Yes | Yes |
+| InternVL2 | 1B-40B | MLLM | Yes | Yes | Yes | Yes |
+| InternVL2.5 | 1B-78B | MLLM | Yes | Yes | Yes | Yes |
+| InternVL3 | 1B-78B | MLLM | Yes | Yes | Yes | Yes |
+| CogVLM2-chat | 19B | MLLM | Yes | No | - | - |
+| GLM4V | 9B | MLLM | Yes | No | - | - |
```
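
Since Qwen3 now appears in both engine tables, here is a minimal sketch of trying it out; the model id is an illustrative assumption and TurboMind is assumed as the default backend:

```bash
# Hedged sketch: serve a Qwen3 checkpoint with the default (TurboMind) backend.
lmdeploy serve api_server Qwen/Qwen3-8B
```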

docs/zh_cn/get_started/ascend/get_started.md

Lines changed: 10 additions & 0 deletions
```diff
@@ -154,6 +154,16 @@ lmdeploy lite auto_awq $HF_MODEL --work-dir $WORK_DIR --device npu

 支持的模型列表请参考[支持的模型](../../supported_models/supported_models.md)

+### w8a8 SMOOTH_QUANT
+
+运行下面的代码可以在Atlas 800T A2上对权重进行W8A8量化。
+
+```bash
+lmdeploy lite smooth_quant $HF_MODEL --work-dir $WORK_DIR --device npu
+```
+
+支持的模型列表请参考[支持的模型](../../supported_models/supported_models.md)
+
 ### int8 KV-cache 量化

 昇腾后端现在支持了在eager模式下的离线int8 KV-cache量化。
```

docs/zh_cn/get_started/installation.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -23,7 +23,7 @@ pip install lmdeploy
 默认的预构建包是在 **CUDA 12** 上编译的。如果需要 CUDA 11+ (>=11.3),你可以使用以下命令安装 lmdeploy:

 ```shell
-export LMDEPLOY_VERSION=0.7.2.post1
+export LMDEPLOY_VERSION=0.8.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```
````

docs/zh_cn/multi_modal/internvl.md

Lines changed: 8 additions & 7 deletions
```diff
@@ -2,13 +2,14 @@

 LMDeploy 支持 InternVL 系列模型,具体如下:

-| Model         | Size       | Supported Inference Engine |
-| :-----------: | :--------: | :------------------------: |
-| InternVL      | 13B-19B    | TurboMind                  |
-| InternVL1.5   | 2B-26B     | TurboMind, PyTorch         |
-| InternVL2     | 1B, 4B     | PyTorch                    |
-| InternVL2     | 2B, 8B-76B | TurboMind, PyTorch         |
-| Mono-InternVL | 2B         | PyTorch                    |
+| Model                 | Size          | Supported Inference Engine |
+| :-------------------: | :-----------: | :------------------------: |
+| InternVL              | 13B-19B       | TurboMind                  |
+| InternVL1.5           | 2B-26B        | TurboMind, PyTorch         |
+| InternVL2             | 4B            | PyTorch                    |
+| InternVL2             | 1B-2B, 8B-76B | TurboMind, PyTorch         |
+| InternVL2.5/2.5-MPO/3 | 1B-78B        | TurboMind, PyTorch         |
+| Mono-InternVL         | 2B            | PyTorch                    |

 本文将以[InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B)为例,演示使用 LMDeploy 部署 InternVL 系列模型的方法。
```
