
Commit 7199b4e

bump version to v0.5.2 (#2143)
* bump version v0.5.2
* update the guide about llama3.1 tools calling
* update version
* update
1 parent b99909c commit 7199b4e

11 files changed: +218 −138 lines changed


README.md

Lines changed: 7 additions & 7 deletions
@@ -26,7 +26,7 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>

-- \[2024/07\] Support Llama3.1
+- \[2024/07\] 🎉🎉 Support Llama3.1 8B, 70B and its TOOLS CALLING
 - \[2024/07\] Support [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e) full-series models, [InternLM-XComposer2.5](docs/en/multi_modal/xcomposer2d5.md) and [function call](docs/en/serving/api_server_tools.md) of InternLM2.5
 - \[2024/06\] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
 - \[2024/05\] Balance vision model when deploying VLMs with multiple GPUs
@@ -115,10 +115,10 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
-<li>QWen (1.8B - 72B)</li>
-<li>QWen1.5 (0.5B - 110B)</li>
-<li>QWen1.5 - MoE (0.5B - 72B)</li>
-<li>QWen2 (0.5B - 72B)</li>
+<li>Qwen (1.8B - 72B)</li>
+<li>Qwen1.5 (0.5B - 110B)</li>
+<li>Qwen1.5 - MoE (0.5B - 72B)</li>
+<li>Qwen2 (0.5B - 72B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -145,7 +145,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
 <li>QWen-VL (7B)</li>
 <li>DeepSeek-VL (7B)</li>
 <li>InternVL-Chat (v1.1-v1.5)</li>
-<li>InternVL2 (1B-40B)</li>
+<li>InternVL2 (1B-76B)</li>
 <li>MiniGeminiLlama (7B)</li>
 <li>CogVLM-Chat (17B)</li>
 <li>CogVLM2-Chat (19B)</li>
@@ -175,7 +175,7 @@ pip install lmdeploy
 Since v0.3.0, The default prebuilt package is compiled on **CUDA 12**. However, if CUDA 11+ is required, you can install lmdeploy by:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

README_zh-CN.md

Lines changed: 7 additions & 7 deletions
@@ -26,7 +26,7 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>

-- \[2024/07\] Support Llama3.1
+- \[2024/07\] 🎉🎉 Support the Llama3.1 8B and 70B models, as well as tool calling
 - \[2024/07\] Support the full [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e) series, the [InternLM-XComposer2.5](docs/zh_cn/multi_modal/xcomposer2d5.md) model, and the [function call feature](docs/zh_cn/serving/api_server_tools.md) of InternLM2.5
 - \[2024/06\] The PyTorch engine supports inference for DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
 - \[2024/05\] When deploying VLMs on multiple GPUs, the vision part of the model can be balanced evenly across the cards
@@ -116,10 +116,10 @@ The LMDeploy TurboMind engine has excellent inference capability; across models
 <li>InternLM (7B - 20B)</li>
 <li>InternLM2 (7B - 20B)</li>
 <li>InternLM2.5 (7B)</li>
-<li>QWen (1.8B - 72B)</li>
-<li>QWen1.5 (0.5B - 110B)</li>
-<li>QWen1.5 - MoE (0.5B - 72B)</li>
-<li>QWen2 (0.5B - 72B)</li>
+<li>Qwen (1.8B - 72B)</li>
+<li>Qwen1.5 (0.5B - 110B)</li>
+<li>Qwen1.5 - MoE (0.5B - 72B)</li>
+<li>Qwen2 (0.5B - 72B)</li>
 <li>Baichuan (7B)</li>
 <li>Baichuan2 (7B-13B)</li>
 <li>Code Llama (7B - 34B)</li>
@@ -146,7 +146,7 @@ The LMDeploy TurboMind engine has excellent inference capability; across models
 <li>QWen-VL (7B)</li>
 <li>DeepSeek-VL (7B)</li>
 <li>InternVL-Chat (v1.1-v1.5)</li>
-<li>InternVL2 (1B-40B)</li>
+<li>InternVL2 (1B-76B)</li>
 <li>MiniGeminiLlama (7B)</li>
 <li>CogVLM-Chat (17B)</li>
 <li>CogVLM2-Chat (19B)</li>
@@ -176,7 +176,7 @@ pip install lmdeploy
 Since v0.3.0, LMDeploy's prebuilt packages are compiled on CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/en/get_started.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pip install lmdeploy
 The default prebuilt package is compiled on **CUDA 12**. However, if CUDA 11+ is required, you can install lmdeploy by:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```
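
The install snippet above pins a specific wheel, so a quick way to confirm which version actually landed in the environment is to query the package metadata. This is a minimal sketch, not part of the commit, and it only assumes lmdeploy was installed with pip:

```python
# Minimal post-install check: confirm the installed lmdeploy version matches
# the LMDEPLOY_VERSION exported in the snippet above (0.5.2 after this commit).
from importlib.metadata import version

print(version("lmdeploy"))
```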

docs/en/multi_modal/cogvlm.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeplo
 ```shell
 # cuda 11.8
 # to get the latest version, run: pip index versions lmdeploy
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 # cuda 12.1

docs/en/serving/api_server_tools.md

Lines changed: 89 additions & 49 deletions
@@ -1,8 +1,10 @@
 # Tools

+LMDeploy supports tools for InternLM2, InternLM2.5 and llama3.1 models.
+
 ## Single Round Invocation

-Currently, LMDeploy supports tools only for InternLM2, InternLM2.5 and llama3.1 models. Please start the service of models before running the following example.
+Please start the service of models before running the following example.

 ```python
 from openai import OpenAI
@@ -43,7 +45,7 @@ print(response)

 ## Multiple Round Invocation

-### InternLM demo
+### InternLM

 A complete toolchain invocation process can be demonstrated through the following example.

@@ -149,58 +151,96 @@ ChatCompletion(id='2', choices=[Choice(finish_reason='tool_calls', index=0, logp
 16
 ```

-### Llama3.1 demo
+### Llama 3.1

-```python
-from openai import OpenAI
+Meta announces in [Llama3's official user guide](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1) that,

-tools = [
-    {
-        "type": "function",
-        "function": {
-            "name": "get_current_weather",
-            "description": "Get the current weather in a given location",
-            "parameters": {
-                "type": "object",
-                "properties": {
-                    "location": {
-                        "type": "string",
-                        "description": "The city and state, e.g. San Francisco, CA",
-                    },
-                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
-                },
-                "required": ["location"],
-            },
-        }
-    }
-]
-messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
+```{text}
+There are three built-in tools (brave_search, wolfram_alpha, and code interpreter) can be turned on using the system prompt:

-client = OpenAI(api_key='YOUR_API_KEY',base_url='http://0.0.0.0:23333/v1')
-model_name = client.models.list().data[0].id
-response = client.chat.completions.create(
-    model=model_name,
-    messages=messages,
-    temperature=0.8,
-    top_p=0.8,
-    stream=False,
-    tools=tools)
-print(response)
-messages += [{"role": "assistant", "content": response.choices[0].message.content}]
-messages += [{"role": "ipython", "content": "Clouds giving way to sun Hi: 76° Tonight: Mainly clear early, then areas of low clouds forming Lo: 56°"}]
-response = client.chat.completions.create(
-    model=model_name,
-    messages=messages,
-    temperature=0.8,
-    top_p=0.8,
-    stream=False,
-    tools=tools)
-print(response)
+1. Brave Search: Tool call to perform web searches.
+2. Wolfram Alpha: Tool call to perform complex mathematical calculations.
+3. Code Interpreter: Enables the model to output python code.
 ```

-And the outputs would be:
+Additionally, it cautions: "**Note:** We recommend using Llama 70B-instruct or Llama 405B-instruct for applications that combine conversation and tool calling. Llama 8B-Instruct can not reliably maintain a conversation alongside tool calling definitions. It can be used for zero-shot tool calling, but tool instructions should be removed for regular conversations between the model and the user."

+Therefore, we utilize [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to show how to invoke the tool calling by LMDeploy `api_server`.
+
+On a A100-SXM-80G node, you can start the service as follows:
+
+```shell
+lmdeploy serve api_server /the/path/of/Meta-Llama-3.1-70B-Instruct/model --tp 4
 ```
-ChatCompletion(id='3', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='<function=get_current_weather>{"location": "Boston, MA", "unit": "fahrenheit"}</function>\n\nOutput:\nCurrent Weather in Boston, MA:\nTemperature: 75°F\nHumidity: 60%\nWind Speed: 10 mph\nSky Conditions: Partly Cloudy', role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"location": "Boston, MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function')]))], created=1721815546, model='llama3.1/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=58, prompt_tokens=349, total_tokens=407))
-ChatCompletion(id='4', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The current weather in Boston is mostly sunny with a high of 76°F and a low of 56°F tonight.', role='assistant', function_call=None, tool_calls=None))], created=1721815547, model='llama3.1/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=36, prompt_tokens=446, total_tokens=482))
+
+For an in-depth understanding of the api_server, please refer to the detailed documentation available [here](./api_server.md).
+
+The following code snippet demonstrates how to utilize the 'Wolfram Alpha' tool. It is assumed that you have already registered on the [Wolfram Alpha](https://www.wolframalpha.com) website and obtained an API key. Please ensure that you have a valid API key to access the services provided by Wolfram Alpha
+
+```python
+from openai import OpenAI
+import requests
+
+
+def request_llama3_1_service(messages):
+    client = OpenAI(api_key='YOUR_API_KEY',
+                    base_url='http://0.0.0.0:23333/v1')
+    model_name = client.models.list().data[0].id
+    response = client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        temperature=0.8,
+        top_p=0.8,
+        stream=False)
+    return response.choices[0].message.content
+
+
+# The role of "system" MUST be specified, including the required tools
+messages = [
+    {
+        "role": "system",
+        "content": "Environment: ipython\nTools: wolfram_alpha\n\n Cutting Knowledge Date: December 2023\nToday Date: 23 Jul 2024\n\nYou are a helpful Assistant."  # noqa
+    },
+    {
+        "role": "user",
+        "content": "Can you help me solve this equation: x^3 - 4x^2 + 6x - 24 = 0"  # noqa
+    }
+]
+
+# send request to the api_server of llama3.1-70b and get the response
+# the "assistant_response" is supposed to be:
+# <|python_tag|>wolfram_alpha.call(query="solve x^3 - 4x^2 + 6x - 24 = 0")
+assistant_response = request_llama3_1_service(messages)
+print(assistant_response)
+
+# Call the API of Wolfram Alpha with the query generated by the model
+app_id = 'YOUR-Wolfram-Alpha-API-KEY'
+params = {
+    "input": assistant_response,
+    "appid": app_id,
+    "format": "plaintext",
+    "output": "json",
+}
+
+wolframalpha_response = requests.get(
+    "https://api.wolframalpha.com/v2/query",
+    params=params
+)
+wolframalpha_response = wolframalpha_response.json()
+
+# Append the contents obtained by the model and the wolframalpha's API
+# to "messages", and send it again to the api_server
+messages += [
+    {
+        "role": "assistant",
+        "content": assistant_response
+    },
+    {
+        "role": "ipython",
+        "content": wolframalpha_response
+    }
+]
+
+assistant_response = request_llama3_1_service(messages)
+print(assistant_response)
 ```
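
Only the opening lines of the single-round example in api_server_tools.md are visible above, because the rest is unchanged context in this commit. For orientation, the sketch below reconstructs that flow against an OpenAI-compatible LMDeploy api_server on port 23333; it reuses the `get_current_weather` schema from the removed Llama3.1 demo, and the API key placeholder and weather tool are illustrative rather than part of this commit.

```python
# Sketch of the "Single Round Invocation" flow against a running LMDeploy
# api_server (OpenAI-compatible endpoint). The get_current_weather tool is a
# hypothetical example mirrored from the old demo in this diff.
from openai import OpenAI

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]

client = OpenAI(api_key="YOUR_API_KEY", base_url="http://0.0.0.0:23333/v1")
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools,
)

# For models whose chat template supports tools, the parsed call is returned
# in message.tool_calls, as in the ChatCompletion outputs quoted in the diff.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```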

docs/en/supported_models/supported_models.md

Lines changed: 11 additions & 11 deletions
@@ -14,19 +14,19 @@
 | InternLM-XComposer | 7B | Yes | Yes | Yes | Yes |
 | InternLM-XComposer2 | 7B, 4khd-7B | Yes | Yes | Yes | Yes |
 | InternLM-XComposer2.5 | 7B | Yes | Yes | Yes | Yes |
-| QWen | 1.8B - 72B | Yes | Yes | Yes | Yes |
-| QWen1.5 | 1.8B - 110B | Yes | Yes | Yes | Yes |
-| QWen2 | 1.5B - 72B | Yes | Yes | Yes | Yes |
+| Qwen | 1.8B - 72B | Yes | Yes | Yes | Yes |
+| Qwen1.5 | 1.8B - 110B | Yes | Yes | Yes | Yes |
+| Qwen2 | 1.5B - 72B | Yes | Yes | Yes | Yes |
 | Mistral | 7B | Yes | Yes | Yes | No |
-| QWen-VL | 7B | Yes | Yes | Yes | Yes |
+| Qwen-VL | 7B | Yes | Yes | Yes | Yes |
 | DeepSeek-VL | 7B | Yes | Yes | Yes | Yes |
 | Baichuan | 7B | Yes | Yes | Yes | Yes |
 | Baichuan2 | 7B | Yes | Yes | Yes | Yes |
 | Code Llama | 7B - 34B | Yes | Yes | Yes | No |
 | YI | 6B - 34B | Yes | Yes | Yes | No |
 | LLaVA(1.5,1.6) | 7B - 34B | Yes | Yes | Yes | Yes |
 | InternVL-Chat | v1.1- v1.5 | Yes | Yes | Yes | Yes |
-| InternVL2 | 2B-40B | Yes | Yes | Yes | Yes |
+| InternVL2 | 2B-76B | Yes | Yes | Yes | Yes |
 | MiniCPM | Llama3-V-2_5 | Yes | Yes | Yes | Yes |
 | MiniGeminiLlama | 7B | Yes | No | No | Yes |
 | GLM4 | 9B | Yes | Yes | Yes | No |
@@ -35,7 +35,7 @@
 "-" means not verified yet.

 ```{note}
-The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, QWen1.5 and etc., please choose the PyTorch engine for inference.
+The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, Qwen1.5 and etc., please choose the PyTorch engine for inference.
 ```

 ## Models supported by PyTorch
@@ -55,10 +55,10 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | YI | 6B - 34B | Yes | No | No |
 | Mistral | 7B | Yes | No | No |
 | Mixtral | 8x7B | Yes | No | No |
-| QWen | 1.8B - 72B | Yes | No | No |
-| QWen1.5 | 0.5B - 110B | Yes | No | No |
-| QWen1.5-MoE | A2.7B | Yes | No | No |
-| QWen2 | 0.5B - 72B | Yes | No | No |
+| Qwen | 1.8B - 72B | Yes | No | No |
+| Qwen1.5 | 0.5B - 110B | Yes | No | No |
+| Qwen1.5-MoE | A2.7B | Yes | No | No |
+| Qwen2 | 0.5B - 72B | Yes | No | No |
 | DeepSeek-MoE | 16B | Yes | No | No |
 | DeepSeek-V2 | 16B, 236B | Yes | No | No |
 | Gemma | 2B-7B | Yes | No | No |
@@ -70,7 +70,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
 | CogVLM2-Chat | 19B | Yes | No | No |
 | LLaVA(1.5,1.6) | 7B-34B | Yes | No | No |
 | InternVL-Chat(v1.5) | 2B-26B | Yes | No | No |
-| InternVL2 | 1B-40B | Yes | No | No |
+| InternVL2 | 1B-76B | Yes | No | No |
 | Gemma2 | 9B-27B | Yes | No | No |
 | GLM4 | 9B | Yes | No | No |
 | CodeGeeX4 | 9B | Yes | No | No |
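
The note above steers sliding-window models such as Mistral and Qwen1.5 toward the PyTorch engine. Assuming the standard `lmdeploy.pipeline` entry point and `PytorchEngineConfig` (neither is touched by this commit), a minimal sketch of selecting that backend looks like this; the model path is illustrative:

```python
# Minimal sketch: run a sliding-window model on the PyTorch engine instead of
# TurboMind, per the note in supported_models.md. The model path is an example.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline("mistralai/Mistral-7B-Instruct-v0.2",
                backend_config=PytorchEngineConfig(tp=1))
print(pipe(["Hello, who are you?"]))
```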

docs/zh_cn/get_started.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pip install lmdeploy
 LMDeploy's prebuilt packages are compiled on CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:

 ```shell
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 ```

docs/zh_cn/multi_modal/cogvlm.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https:

 ```shell
 # cuda 11.8
-export LMDEPLOY_VERSION=0.5.1
+export LMDEPLOY_VERSION=0.5.2
 export PYTHON_VERSION=38
 pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
 # cuda 12.1
