
Commit 4cb3854

bump version to v0.5.0 (#1852)
* bump version to v0.5.0
* update news
* update news
* update supported models
* update
* fix lint
* set LMDEPLOY_VERSION 0.5.0
1 parent 5ceb464 commit 4cb3854

10 files changed: +75 additions, -64 deletions


README.md

Lines changed: 4 additions & 1 deletion
@@ -26,6 +26,7 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>
 
+- \[2024/06\] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
 - \[2024/05\] Balance vision model when deploying VLMs with multiple GPUs
 - \[2024/05\] Support 4-bits weight-only quantization and inference on VMLs, such as InternVL v1.5, LLaVa, InternLMXComposer2
 - \[2024/04\] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLMXComposer2.
@@ -112,6 +113,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>QWen (1.8B - 72B)</li>
 <li>QWen1.5 (0.5B - 110B)</li>
 <li>QWen1.5 - MoE (0.5B - 72B)</li>
+<li>QWen2 (0.5B - 72B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -121,6 +123,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>YI (6B-34B)</li>
 <li>Mistral (7B)</li>
 <li>DeepSeek-MoE (16B)</li>
+<li>DeepSeek-V2 (16B, 236B)</li>
 <li>Mixtral (8x7B, 8x22B)</li>
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
@@ -162,7 +165,7 @@ pip install lmdeploy
 Since v0.3.0, The default prebuilt package is compiled on **CUDA 12**. However, if CUDA 11+ is required, you can install lmdeploy by:
 
 ```shell
-export LMDEPLOY_VERSION=0.3.0
+export LMDEPLOY_VERSION=0.5.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```
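
With the bumped value, the templated URL in the snippet above expands to a concrete wheel; a minimal sketch assuming Python 3.8 (`PYTHON_VERSION=38`) on a CUDA 11.8 host:

```shell
# Expansion of the install command with LMDEPLOY_VERSION=0.5.0 and PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v0.5.0/lmdeploy-0.5.0+cu118-cp38-cp38-manylinux2014_x86_64.whl \
    --extra-index-url https://download.pytorch.org/whl/cu118
```

The same expansion applies to the get_started and cogvlm install snippets changed below; only the version value differs.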

README_zh-CN.md

Lines changed: 4 additions & 1 deletion
@@ -26,6 +26,7 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>
 
+- \[2024/06\] The PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
 - \[2024/05\] When deploying VLMs on multiple GPUs, the vision part of the model can be balanced across the cards
 - \[2024/05\] Support 4-bit weight-only quantization and inference for VLMs such as InternVL v1.5, LLaVa, InternLMXComposer2
 - \[2024/04\] Support Llama3 and VLMs such as InternVL v1.1, v1.2, MiniGemini, InternLM-XComposer2
@@ -113,6 +114,7 @@ The LMDeploy TurboMind engine has excellent inference performance across models of various sizes
 <li>QWen (1.8B - 72B)</li>
 <li>QWen1.5 (0.5B - 110B)</li>
 <li>QWen1.5 - MoE (0.5B - 72B)</li>
+<li>QWen2 (0.5B - 72B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -122,6 +124,7 @@ The LMDeploy TurboMind engine has excellent inference performance across models of various sizes
 <li>YI (6B-34B)</li>
 <li>Mistral (7B)</li>
 <li>DeepSeek-MoE (16B)</li>
+<li>DeepSeek-V2 (16B, 236B)</li>
 <li>Mixtral (8x7B, 8x22B)</li>
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
@@ -163,7 +166,7 @@ pip install lmdeploy
 Since v0.3.0, the LMDeploy prebuilt packages are compiled against CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:
 
 ```shell
-export LMDEPLOY_VERSION=0.3.0
+export LMDEPLOY_VERSION=0.5.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/en/get_started.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pip install lmdeploy
 The default prebuilt package is compiled on **CUDA 12**. However, if CUDA 11+ is required, you can install lmdeploy by:
 
 ```shell
-export LMDEPLOY_VERSION=0.4.2
+export LMDEPLOY_VERSION=0.5.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/en/multi_modal/cogvlm.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeplo
 ```shell
 # cuda 11.8
 # to get the latest version, run: pip index versions lmdeploy
-export LMDEPLOY_VERSION=0.4.2
+export LMDEPLOY_VERSION=0.5.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 # cuda 12.1

docs/en/supported_models/supported_models.md

Lines changed: 30 additions & 27 deletions
@@ -13,7 +13,8 @@
 | InternLM-XComposer | 7B | Yes | Yes | Yes | Yes |
 | InternLM-XComposer2 | 7B, 4khd-7B | Yes | Yes | Yes | Yes |
 | QWen | 1.8B - 72B | Yes | Yes | Yes | Yes |
-| QWen1.5 | 1.8B - 72B | Yes | Yes | Yes | Yes |
+| QWen1.5 | 1.8B - 110B | Yes | Yes | Yes | Yes |
+| QWen2 | 1.5B - 72B | Yes | Yes | Yes | Yes |
 | Mistral | 7B | Yes | Yes | Yes | No |
 | QWen-VL | 7B | Yes | Yes | Yes | Yes |
 | DeepSeek-VL | 7B | Yes | Yes | Yes | Yes |
@@ -35,29 +36,31 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 
 ## Models supported by PyTorch
 
-| Model | Size | FP16/BF16 | KV INT8 | W8A8 |
-| :-----------------: | :--------: | :-------: | :-----: | :--: |
-| Llama | 7B - 65B | Yes | No | Yes |
-| Llama2 | 7B - 70B | Yes | No | Yes |
-| Llama3 | 8B, 70B | Yes | No | Yes |
-| InternLM | 7B - 20B | Yes | No | Yes |
-| InternLM2 | 7B - 20B | Yes | No | - |
-| InternLM2.5 | 7B | Yes | No | - |
-| Baichuan2 | 7B - 13B | Yes | No | Yes |
-| ChatGLM2 | 6B | Yes | No | No |
-| Falcon | 7B - 180B | Yes | No | No |
-| YI | 6B - 34B | Yes | No | No |
-| Mistral | 7B | Yes | No | No |
-| Mixtral | 8x7B | Yes | No | No |
-| QWen | 1.8B - 72B | Yes | No | No |
-| QWen1.5 | 0.5B - 72B | Yes | No | No |
-| QWen1.5-MoE | A2.7B | Yes | No | No |
-| DeepSeek-MoE | 16B | Yes | No | No |
-| Gemma | 2B-7B | Yes | No | No |
-| Dbrx | 132B | Yes | No | No |
-| StarCoder2 | 3B-15B | Yes | No | No |
-| Phi-3-mini | 3.8B | Yes | No | No |
-| CogVLM-Chat | 17B | Yes | No | No |
-| CogVLM2-Chat | 19B | Yes | No | No |
-| LLaVA(1.5,1.6) | 7B-34B | Yes | No | No |
-| InternVL-Chat(v1.5) | 2B-26B | Yes | No | No |
+| Model | Size | FP16/BF16 | KV INT8 | W8A8 |
+| :-----------------: | :---------: | :-------: | :-----: | :--: |
+| Llama | 7B - 65B | Yes | No | Yes |
+| Llama2 | 7B - 70B | Yes | No | Yes |
+| Llama3 | 8B, 70B | Yes | No | Yes |
+| InternLM | 7B - 20B | Yes | No | Yes |
+| InternLM2 | 7B - 20B | Yes | No | - |
+| InternLM2.5 | 7B | Yes | No | - |
+| Baichuan2 | 7B - 13B | Yes | No | Yes |
+| ChatGLM2 | 6B | Yes | No | No |
+| Falcon | 7B - 180B | Yes | No | No |
+| YI | 6B - 34B | Yes | No | No |
+| Mistral | 7B | Yes | No | No |
+| Mixtral | 8x7B | Yes | No | No |
+| QWen | 1.8B - 72B | Yes | No | No |
+| QWen1.5 | 0.5B - 110B | Yes | No | No |
+| QWen1.5-MoE | A2.7B | Yes | No | No |
+| QWen2 | 0.5B - 72B | Yes | No | No |
+| DeepSeek-MoE | 16B | Yes | No | No |
+| DeepSeek-V2 | 16B, 236B | Yes | No | No |
+| Gemma | 2B-7B | Yes | No | No |
+| Dbrx | 132B | Yes | No | No |
+| StarCoder2 | 3B-15B | Yes | No | No |
+| Phi-3-mini | 3.8B | Yes | No | No |
+| CogVLM-Chat | 17B | Yes | No | No |
+| CogVLM2-Chat | 19B | Yes | No | No |
+| LLaVA(1.5,1.6) | 7B-34B | Yes | No | No |
+| InternVL-Chat(v1.5) | 2B-26B | Yes | No | No |
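
The newly listed DeepSeek-V2 appears only in the PyTorch-engine table, so serving it means selecting that backend explicitly. A hedged sketch, not part of this commit: the model id is only illustrative, and the flag name is assumed from the current CLI help:

```shell
# DeepSeek-V2 is listed for the PyTorch engine only, so pick that backend explicitly.
lmdeploy serve api_server deepseek-ai/DeepSeek-V2-Lite-Chat --backend pytorch
```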

docs/zh_cn/get_started.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pip install lmdeploy
 The LMDeploy prebuilt packages are compiled against CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:
 
 ```shell
-export LMDEPLOY_VERSION=0.4.2
+export LMDEPLOY_VERSION=0.5.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/zh_cn/multi_modal/cogvlm.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https:
 
 ```shell
 # cuda 11.8
-export LMDEPLOY_VERSION=0.4.2
+export LMDEPLOY_VERSION=0.5.0
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 # cuda 12.1

docs/zh_cn/supported_models/supported_models.md

Lines changed: 30 additions & 27 deletions
@@ -13,7 +13,8 @@
 | InternLM-XComposer | 7B | Yes | Yes | Yes | Yes |
 | InternLM-XComposer2 | 7B, 4khd-7B | Yes | Yes | Yes | Yes |
 | QWen | 1.8B - 72B | Yes | Yes | Yes | Yes |
-| QWen1.5 | 1.8B - 72B | Yes | Yes | Yes | Yes |
+| QWen1.5 | 1.8B - 110B | Yes | Yes | Yes | Yes |
+| QWen2 | 1.5B - 72B | Yes | Yes | Yes | Yes |
 | Mistral | 7B | Yes | Yes | Yes | No |
 | QWen-VL | 7B | Yes | Yes | Yes | Yes |
 | DeepSeek-VL | 7B | Yes | Yes | Yes | Yes |
@@ -35,29 +36,31 @@ The turbomind engine does not support window attention, so for models that use window att
 
 ### Models supported by PyTorch
 
-| Model | Size | FP16/BF16 | KV INT8 | W8A8 |
-| :-----------------: | :--------: | :-------: | :-----: | :--: |
-| Llama | 7B - 65B | Yes | No | Yes |
-| Llama2 | 7B - 70B | Yes | No | Yes |
-| Llama3 | 8B, 70B | Yes | No | Yes |
-| InternLM | 7B - 20B | Yes | No | Yes |
-| InternLM2 | 7B - 20B | Yes | No | - |
-| InternLM2.5 | 7B | Yes | No | - |
-| Baichuan2 | 7B - 13B | Yes | No | Yes |
-| ChatGLM2 | 6B | Yes | No | No |
-| Falcon | 7B - 180B | Yes | No | No |
-| YI | 6B - 34B | Yes | No | No |
-| Mistral | 7B | Yes | No | No |
-| Mixtral | 8x7B | Yes | No | No |
-| QWen | 1.8B - 72B | Yes | No | No |
-| QWen1.5 | 0.5B - 72B | Yes | No | No |
-| QWen1.5-MoE | A2.7B | Yes | No | No |
-| DeepSeek-MoE | 16B | Yes | No | No |
-| Gemma | 2B-7B | Yes | No | No |
-| Dbrx | 132B | Yes | No | No |
-| StarCoder2 | 3B-15B | Yes | No | No |
-| Phi-3-mini | 3.8B | Yes | No | No |
-| CogVLM-Chat | 17B | Yes | No | No |
-| CogVLM2-Chat | 19B | Yes | No | No |
-| LLaVA(1.5,1.6) | 7B-34B | Yes | No | No |
-| InternVL-Chat(v1.5) | 2B-26B | Yes | No | No |
+| Model | Size | FP16/BF16 | KV INT8 | W8A8 |
+| :-----------------: | :---------: | :-------: | :-----: | :--: |
+| Llama | 7B - 65B | Yes | No | Yes |
+| Llama2 | 7B - 70B | Yes | No | Yes |
+| Llama3 | 8B, 70B | Yes | No | Yes |
+| InternLM | 7B - 20B | Yes | No | Yes |
+| InternLM2 | 7B - 20B | Yes | No | - |
+| InternLM2.5 | 7B | Yes | No | - |
+| Baichuan2 | 7B - 13B | Yes | No | Yes |
+| ChatGLM2 | 6B | Yes | No | No |
+| Falcon | 7B - 180B | Yes | No | No |
+| YI | 6B - 34B | Yes | No | No |
+| Mistral | 7B | Yes | No | No |
+| Mixtral | 8x7B | Yes | No | No |
+| QWen | 1.8B - 72B | Yes | No | No |
+| QWen1.5 | 0.5B - 110B | Yes | No | No |
+| QWen2 | 0.5B - 72B | Yes | No | No |
+| QWen1.5-MoE | A2.7B | Yes | No | No |
+| DeepSeek-MoE | 16B | Yes | No | No |
+| DeepSeek-V2 | 16B, 236B | Yes | No | No |
+| Gemma | 2B-7B | Yes | No | No |
+| Dbrx | 132B | Yes | No | No |
+| StarCoder2 | 3B-15B | Yes | No | No |
+| Phi-3-mini | 3.8B | Yes | No | No |
+| CogVLM-Chat | 17B | Yes | No | No |
+| CogVLM2-Chat | 19B | Yes | No | No |
+| LLaVA(1.5,1.6) | 7B-34B | Yes | No | No |
+| InternVL-Chat(v1.5) | 2B-26B | Yes | No | No |

lmdeploy/cli/utils.py

Lines changed: 2 additions & 3 deletions
@@ -379,9 +379,8 @@ def cache_max_entry_count(parser):
 '--cache-max-entry-count',
 type=float,
 default=0.8,
-help=
-'The percentage of free gpu memory occupied by the k/v cache, excluding weights'
-)
+help='The percentage of free gpu memory occupied by the k/v '
+'cache, excluding weights ')
 
 @staticmethod
 def adapters(parser):
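
The replacement above only splits the help text into two adjacent string literals, which Python concatenates at parse time, so the flag's behaviour is unchanged. As a usage sketch (the model id is illustrative and not part of this commit), the option is passed on the command line like this:

```shell
# Limit the k/v cache to 50% of free GPU memory instead of the 0.8 default.
# "internlm/internlm2-chat-7b" is only an illustrative model id.
lmdeploy serve api_server internlm/internlm2-chat-7b --cache-max-entry-count 0.5
```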

lmdeploy/version.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 from typing import Tuple
 
-__version__ = '0.4.2'
+__version__ = '0.5.0'
 short_version = __version__
 
 
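
A quick post-install sanity check that an environment actually picked up the bump; this simply reads the `__version__` field changed above:

```shell
python -c "import lmdeploy; print(lmdeploy.__version__)"  # expected: 0.5.0
```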
