Description
System Info
Docker Command:

```shell
docker run --gpus all --shm-size 1g -p 80:80 -d \
  -v /root/data:/data \
  -e HUGGING_FACE_HUB_TOKEN='hf_###' \
  -e MODEL_ID='${model_name}' \
  -e TRUST_REMOTE_CODE='true' \
  ghcr.io/predibase/lorax:main
```
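As a quick sanity check that the container itself came up (a sketch, assuming LoRAX exposes the same `/health` route as TGI, with the port mapping from the command above):

```shell
# Expect HTTP 200 once the model server is ready to accept requests
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:80/health
```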
Hardware:
AWS g6.xlarge
```text
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   35C    P0             26W /   72W |   17449MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4284      C   /opt/conda/bin/python3.10                   17440MiB |
+-----------------------------------------------------------------------------------------+
```
OS:
Amazon Linux (`/etc/os-release`):

```text
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.6.20241111"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
```
Model Used:

```json
{
  "model_id": "Qwen/Qwen2.5-0.5B-Instruct",
  "model_sha": "7ae557604adf67be50417f59c2c2f167def9a775",
  "model_dtype": "torch.bfloat16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 4095,
  "max_total_tokens": 4096,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 1327744,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "eager_prefill": false,
  "version": "0.1.0",
  "sha": null,
  "docker_label": null,
  "request_logger_url": null
}
```
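For reference, this payload appears to match what LoRAX's `/info` endpoint reports, so it can be re-fetched from the running container (port mapping as in the `docker run` above):

```shell
# Dump the server's runtime configuration
curl -s http://localhost:80/info
```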
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Running the model directly (no adapters) works fine. However, when I use any adapter (my own as well as several public adapters from Hugging Face, listed below), the request fails with:

```text
Request failed during generation: Server error: No suitable kernel. h_in=896 h_out=64 dtype=BFloat16
```
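A minimal request that triggers this looks like the following; a sketch assuming the OpenAI-compatible endpoint, where the `model` field selects the adapter (the adapter ID and prompt here are illustrative, taken from the list below):

```shell
# Fails with "No suitable kernel" when `model` is an adapter ID;
# the same request with the base model ID instead completes normally.
curl -s http://localhost:80/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Sephfox/eudaimonic-qwen-0.5b",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```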
When I use streaming, I can see that the first token is generated before the request fails:

```text
data: {"id":"null","object":"chat.completion.chunk","created":0,"model":"null","choices":[{"index":0,"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}
data: {"error":"Request failed during generation: Server error: No suitable kernel. h_in=896 h_out=64 dtype=BFloat16","error_type":"generation"}
```
Some adapters I tried:
- https://huggingface.co/dimasik2987/7a2c287f-1ebe-405a-8274-6ba9675e1375
- https://huggingface.co/sn56m2/089c89ea-06be-4fbc-863e-8d7c867ef8d3
- https://huggingface.co/Sephfox/eudaimonic-qwen-0.5b
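For what it's worth, h_in=896 in the error matches Qwen2.5-0.5B's hidden size, and h_out presumably reflects the adapter's LoRA projection dimension; each adapter's rank and target modules can be checked from its `adapter_config.json`, e.g.:

```shell
# Inspect a public adapter's LoRA settings (r, lora_alpha, target_modules)
curl -sL https://huggingface.co/Sephfox/eudaimonic-qwen-0.5b/resolve/main/adapter_config.json
```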
Expected behavior
The model should generate properly when a LoRA adapter is applied, just as it does without one.