
Commit 7199b4e

bump version to v0.5.2 (#2143)
* bump version v0.5.2
* update the guide about llama3.1 tools calling
* update version
* update
1 parent b99909c commit 7199b4e

11 files changed: +218 −138 lines changed


README.md

Lines changed: 7 additions & 7 deletions
@@ -26,7 +26,7 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>

-- \[2024/07\] Support Llama3.1
+- \[2024/07\] 🎉🎉 Support Llama3.1 8B, 70B and its TOOLS CALLING
 - \[2024/07\] Support [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e) full-series models, [InternLM-XComposer2.5](docs/en/multi_modal/xcomposer2d5.md) and [function call](docs/en/serving/api_server_tools.md) of InternLM2.5
 - \[2024/06\] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
 - \[2024/05\] Balance vision model when deploying VLMs with multiple GPUs
@@ -115,10 +115,10 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
-<li>QWen (1.8B - 72B)</li>
-<li>QWen1.5 (0.5B - 110B)</li>
-<li>QWen1.5 - MoE (0.5B - 72B)</li>
-<li>QWen2 (0.5B - 72B)</li>
+<li>Qwen (1.8B - 72B)</li>
+<li>Qwen1.5 (0.5B - 110B)</li>
+<li>Qwen1.5 - MoE (0.5B - 72B)</li>
+<li>Qwen2 (0.5B - 72B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -145,7 +145,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>QWen-VL (7B)</li>
 <li>DeepSeek-VL (7B)</li>
 <li>InternVL-Chat (v1.1-v1.5)</li>
-<li>InternVL2 (1B-40B)</li>
+<li>InternVL2 (1B-76B)</li>
 <li>MiniGeminiLlama (7B)</li>
 <li>CogVLM-Chat (17B)</li>
 <li>CogVLM2-Chat (19B)</li>
@@ -175,7 +175,7 @@ pip install lmdeploy
 Since v0.3.0, The default prebuilt package is compiled on **CUDA 12**. However, if CUDA 11+ is required, you can install lmdeploy by:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

README_zh-CN.md

Lines changed: 7 additions & 7 deletions
@@ -26,7 +26,7 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>

-- \[2024/07\] Support Llama3.1
+- \[2024/07\] 🎉🎉 Support the Llama3.1 8B and 70B models, as well as tool calling
 - \[2024/07\] Support the full [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e) series, the [InternLM-XComposer2.5](docs/zh_cn/multi_modal/xcomposer2d5.md) model, and the [function call feature](docs/zh_cn/serving/api_server_tools.md) of InternLM2.5
 - \[2024/06\] The PyTorch engine supports inference for DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
 - \[2024/05\] When deploying VLMs on multiple GPUs, the vision part of the model can be balanced evenly across the cards
@@ -116,10 +116,10 @@ The LMDeploy TurboMind engine has excellent inference capability; across models
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
-<li>QWen (1.8B - 72B)</li>
-<li>QWen1.5 (0.5B - 110B)</li>
-<li>QWen1.5 - MoE (0.5B - 72B)</li>
-<li>QWen2 (0.5B - 72B)</li>
+<li>Qwen (1.8B - 72B)</li>
+<li>Qwen1.5 (0.5B - 110B)</li>
+<li>Qwen1.5 - MoE (0.5B - 72B)</li>
+<li>Qwen2 (0.5B - 72B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -146,7 +146,7 @@ The LMDeploy TurboMind engine has excellent inference capability; across models
 <li>QWen-VL (7B)</li>
 <li>DeepSeek-VL (7B)</li>
 <li>InternVL-Chat (v1.1-v1.5)</li>
-<li>InternVL2 (1B-40B)</li>
+<li>InternVL2 (1B-76B)</li>
 <li>MiniGeminiLlama (7B)</li>
 <li>CogVLM-Chat (17B)</li>
 <li>CogVLM2-Chat (19B)</li>
@@ -176,7 +176,7 @@ pip install lmdeploy
 Since v0.3.0, LMDeploy's prebuilt packages are compiled on CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/en/get_started.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pip install lmdeploy
 The default prebuilt package is compiled on **CUDA 12**. However, if CUDA 11+ is required, you can install lmdeploy by:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```
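
The install snippet above pins a specific wheel, so a quick way to confirm which version actually landed in the environment is to query the package metadata. This is a minimal sketch, not part of the commit, and it only assumes lmdeploy was installed with pip:

```python
# Minimal post-install check: confirm the installed lmdeploy version matches
# the LMDEPLOY_VERSION exported in the snippet above (0.5.2 after this commit).
from importlib.metadata import version

print(version("lmdeploy"))
```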

docs/en/multi_modal/cogvlm.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeplo
 ```shell
 # cuda 11.8
 # to get the latest version, run: pip index versions lmdeploy
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 # cuda 12.1

docs/en/serving/api_server_tools.md

Lines changed: 89 additions & 49 deletions
@@ -1,8 +1,10 @@
 # Tools

+LMDeploy supports tools for InternLM2, InternLM2.5 and llama3.1 models.
+
 ## Single Round Invocation

-Currently, LMDeploy supports tools only for InternLM2, InternLM2.5 and llama3.1 models. Please start the service of models before running the following example.
+Please start the service of models before running the following example.

 ```python
 from openai import OpenAI
@@ -43,7 +45,7 @@ print(response)

 ## Multiple Round Invocation

-### InternLM demo
+### InternLM

 A complete toolchain invocation process can be demonstrated through the following example.

@@ -149,58 +151,96 @@ ChatCompletion(id='2', choices=[Choice(finish_reason='tool_calls', index=0, logp
 16
 ```

-### Llama3.1 demo
+### Llama 3.1

-```python
-from openai import OpenAI
+Meta announces in [Llama3's official user guide](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1) that,

-tools = [
-    {
-        "type": "function",
-        "function": {
-            "name": "get_current_weather",
-            "description": "Get the current weather in a given location",
-            "parameters": {
-                "type": "object",
-                "properties": {
-                    "location": {
-                        "type": "string",
-                        "description": "The city and state, e.g. San Francisco, CA",
-                    },
-                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
-                },
-                "required": ["location"],
-            },
-        }
-    }
-]
-messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
+```{text}
+There are three built-in tools (brave_search, wolfram_alpha, and code interpreter) can be turned on using the system prompt:

-client = OpenAI(api_key='YOUR_API_KEY',base_url='http://0.0.0.0:23333/v1')
-model_name = client.models.list().data[0].id
-response = client.chat.completions.create(
-    model=model_name,
-    messages=messages,
-    temperature=0.8,
-    top_p=0.8,
-    stream=False,
-    tools=tools)
-print(response)
-messages += [{"role": "assistant", "content": response.choices[0].message.content}]
-messages += [{"role": "ipython", "content": "Clouds giving way to sun Hi: 76° Tonight: Mainly clear early, then areas of low clouds forming Lo: 56°"}]
-response = client.chat.completions.create(
-    model=model_name,
-    messages=messages,
-    temperature=0.8,
-    top_p=0.8,
-    stream=False,
-    tools=tools)
-print(response)
+1. Brave Search: Tool call to perform web searches.
+2. Wolfram Alpha: Tool call to perform complex mathematical calculations.
+3. Code Interpreter: Enables the model to output python code.
 ```

-And the outputs would be:
+Additionally, it cautions: "**Note:** We recommend using Llama 70B-instruct or Llama 405B-instruct for applications that combine conversation and tool calling. Llama 8B-Instruct can not reliably maintain a conversation alongside tool calling definitions. It can be used for zero-shot tool calling, but tool instructions should be removed for regular conversations between the model and the user."

+Therefore, we utilize [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to show how to invoke the tool calling by LMDeploy `api_server`.
+
+On a A100-SXM-80G node, you can start the service as follows:
+
+```shell
+lmdeploy serve api_server /the/path/of/Meta-Llama-3.1-70B-Instruct/model --tp 4
 ```
-ChatCompletion(id='3', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='<function=get_current_weather>{"location": "Boston, MA", "unit": "fahrenheit"}</function>\n\nOutput:\nCurrent Weather in Boston, MA:\nTemperature: 75°F\nHumidity: 60%\nWind Speed: 10 mph\nSky Conditions: Partly Cloudy', role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"location": "Boston, MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function')]))], created=1721815546, model='llama3.1/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=58, prompt_tokens=349, total_tokens=407))
-ChatCompletion(id='4', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The current weather in Boston is mostly sunny with a high of 76°F and a low of 56°F tonight.', role='assistant', function_call=None, tool_calls=None))], created=1721815547, model='llama3.1/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=36, prompt_tokens=446, total_tokens=482))
+
+For an in-depth understanding of the api_server, please refer to the detailed documentation available [here](./api_server.md).
+
+The following code snippet demonstrates how to utilize the 'Wolfram Alpha' tool. It is assumed that you have already registered on the [Wolfram Alpha](https://www.wolframalpha.com) website and obtained an API key. Please ensure that you have a valid API key to access the services provided by Wolfram Alpha
+
+```python
+from openai import OpenAI
+import requests
+
+
+def request_llama3_1_service(messages):
+    client = OpenAI(api_key='YOUR_API_KEY',
+                    base_url='http://0.0.0.0:23333/v1')
+    model_name = client.models.list().data[0].id
+    response = client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        temperature=0.8,
+        top_p=0.8,
+        stream=False)
+    return response.choices[0].message.content
+
+
+# The role of "system" MUST be specified, including the required tools
+messages = [
+    {
+        "role": "system",
+        "content": "Environment: ipython\nTools: wolfram_alpha\n\n Cutting Knowledge Date: December 2023\nToday Date: 23 Jul 2024\n\nYou are a helpful Assistant."  # noqa
+    },
+    {
+        "role": "user",
+        "content": "Can you help me solve this equation: x^3 - 4x^2 + 6x - 24 = 0"  # noqa
+    }
+]
+
+# send request to the api_server of llama3.1-70b and get the response
+# the "assistant_response" is supposed to be:
+# <|python_tag|>wolfram_alpha.call(query="solve x^3 - 4x^2 + 6x - 24 = 0")
+assistant_response = request_llama3_1_service(messages)
+print(assistant_response)
+
+# Call the API of Wolfram Alpha with the query generated by the model
+app_id = 'YOUR-Wolfram-Alpha-API-KEY'
+params = {
+    "input": assistant_response,
+    "appid": app_id,
+    "format": "plaintext",
+    "output": "json",
+}
+
+wolframalpha_response = requests.get(
+    "https://api.wolframalpha.com/v2/query",
+    params=params
+)
+wolframalpha_response = wolframalpha_response.json()
+
+# Append the contents obtained by the model and the wolframalpha's API
+# to "messages", and send it again to the api_server
+messages += [
+    {
+        "role": "assistant",
+        "content": assistant_response
+    },
+    {
+        "role": "ipython",
+        "content": wolframalpha_response
+    }
+]
+
+assistant_response = request_llama3_1_service(messages)
+print(assistant_response)
 ```
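
Only the opening lines of the single-round example in api_server_tools.md are visible above, because the rest is unchanged context in this commit. For orientation, the sketch below reconstructs that flow against an OpenAI-compatible LMDeploy api_server on port 23333; it reuses the `get_current_weather` schema from the removed Llama3.1 demo, and the API key placeholder and weather tool are illustrative rather than part of this commit.

```python
# Sketch of the "Single Round Invocation" flow against a running LMDeploy
# api_server (OpenAI-compatible endpoint). The get_current_weather tool is a
# hypothetical example mirrored from the old demo in this diff.
from openai import OpenAI

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]

client = OpenAI(api_key="YOUR_API_KEY", base_url="http://0.0.0.0:23333/v1")
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools,
)

# For models whose chat template supports tools, the parsed call is returned
# in message.tool_calls, as in the ChatCompletion outputs quoted in the diff.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```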

docs/en/supported_models/supported_models.md

Lines changed: 11 additions & 11 deletions
@@ -14,19 +14,19 @@
 | InternLM-XComposer | 7B | Yes | Yes | Yes | Yes |
 | InternLM-XComposer2 | 7B, 4khd-7B | Yes | Yes | Yes | Yes |
 | InternLM-XComposer2.5 | 7B | Yes | Yes | Yes | Yes |
-| QWen | 1.8B - 72B | Yes | Yes | Yes | Yes |
-| QWen1.5 | 1.8B - 110B | Yes | Yes | Yes | Yes |
-| QWen2 | 1.5B - 72B | Yes | Yes | Yes | Yes |
+| Qwen | 1.8B - 72B | Yes | Yes | Yes | Yes |
+| Qwen1.5 | 1.8B - 110B | Yes | Yes | Yes | Yes |
+| Qwen2 | 1.5B - 72B | Yes | Yes | Yes | Yes |
 | Mistral | 7B | Yes | Yes | Yes | No |
-| QWen-VL | 7B | Yes | Yes | Yes | Yes |
+| Qwen-VL | 7B | Yes | Yes | Yes | Yes |
 | DeepSeek-VL | 7B | Yes | Yes | Yes | Yes |
 | Baichuan | 7B | Yes | Yes | Yes | Yes |
 | Baichuan2 | 7B | Yes | Yes | Yes | Yes |
 | Code Llama | 7B - 34B | Yes | Yes | Yes | No |
 | YI | 6B - 34B | Yes | Yes | Yes | No |
 | LLaVA(1.5,1.6) | 7B - 34B | Yes | Yes | Yes | Yes |
 | InternVL-Chat | v1.1- v1.5 | Yes | Yes | Yes | Yes |
-| InternVL2 | 2B-40B | Yes | Yes | Yes | Yes |
+| InternVL2 | 2B-76B | Yes | Yes | Yes | Yes |
 | MiniCPM | Llama3-V-2_5 | Yes | Yes | Yes | Yes |
 | MiniGeminiLlama | 7B | Yes | No | No | Yes |
 | GLM4 | 9B | Yes | Yes | Yes | No |
@@ -35,7 +35,7 @@
 "-" means not verified yet.

 ```{note}
-The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, QWen1.5 and etc., please choose the PyTorch engine for inference.
+The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, Qwen1.5 and etc., please choose the PyTorch engine for inference.
 ```

 ## Models supported by PyTorch
@@ -55,10 +55,10 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | YI | 6B - 34B | Yes | No | No |
 | Mistral | 7B | Yes | No | No |
 | Mixtral | 8x7B | Yes | No | No |
-| QWen | 1.8B - 72B | Yes | No | No |
-| QWen1.5 | 0.5B - 110B | Yes | No | No |
-| QWen1.5-MoE | A2.7B | Yes | No | No |
-| QWen2 | 0.5B - 72B | Yes | No | No |
+| Qwen | 1.8B - 72B | Yes | No | No |
+| Qwen1.5 | 0.5B - 110B | Yes | No | No |
+| Qwen1.5-MoE | A2.7B | Yes | No | No |
+| Qwen2 | 0.5B - 72B | Yes | No | No |
 | DeepSeek-MoE | 16B | Yes | No | No |
 | DeepSeek-V2 | 16B, 236B | Yes | No | No |
 | Gemma | 2B-7B | Yes | No | No |
@@ -70,7 +70,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | CogVLM2-Chat | 19B | Yes | No | No |
 | LLaVA(1.5,1.6) | 7B-34B | Yes | No | No |
 | InternVL-Chat(v1.5) | 2B-26B | Yes | No | No |
-| InternVL2 | 1B-40B | Yes | No | No |
+| InternVL2 | 1B-76B | Yes | No | No |
 | Gemma2 | 9B-27B | Yes | No | No |
 | GLM4 | 9B | Yes | No | No |
 | CodeGeeX4 | 9B | Yes | No | No |
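
The note above steers sliding-window models such as Mistral and Qwen1.5 toward the PyTorch engine. Assuming the standard `lmdeploy.pipeline` entry point and `PytorchEngineConfig` (neither is touched by this commit), a minimal sketch of selecting that backend looks like this; the model path is illustrative:

```python
# Minimal sketch: run a sliding-window model on the PyTorch engine instead of
# TurboMind, per the note in supported_models.md. The model path is an example.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline("mistralai/Mistral-7B-Instruct-v0.2",
                backend_config=PytorchEngineConfig(tp=1))
print(pipe(["Hello, who are you?"]))
```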

docs/zh_cn/get_started.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pip install lmdeploy
 LMDeploy's prebuilt packages are compiled on CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/zh_cn/multi_modal/cogvlm.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https:

 ```shell
 # cuda 11.8
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 # cuda 12.1
