Description
System Info
Docker Command:

```shell
docker run --gpus all --shm-size 1g -p 80:80 -d \
  -v /root/data:/data \
  -e HUGGING_FACE_HUB_TOKEN='hf_###' \
  -e MODEL_ID='${model_name}' \
  -e TRUST_REMOTE_CODE='true' \
  ghcr.io/predibase/lorax:main
```
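As a quick sanity check that the container itself came up (a sketch, assuming LoRAX exposes the same `/health` route as TGI, with the port mapping from the command above):

```shell
# Expect HTTP 200 once the model server is ready to accept requests
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:80/health
```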
Hardware:
AWS g6.xlarge
```text
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   35C    P0             26W /   72W |   17449MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4284      C   /opt/conda/bin/python3.10                   17440MiB |
+-----------------------------------------------------------------------------------------+
```
OS:
Amazon Linux (`/etc/os-release`):

```text
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.6.20241111"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
```
Model Used:

```json
{
  "model_id": "Qwen/Qwen2.5-0.5B-Instruct",
  "model_sha": "7ae557604adf67be50417f59c2c2f167def9a775",
  "model_dtype": "torch.bfloat16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 4095,
  "max_total_tokens": 4096,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 1327744,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "eager_prefill": false,
  "version": "0.1.0",
  "sha": null,
  "docker_label": null,
  "request_logger_url": null
}
```
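For reference, this payload appears to match what LoRAX's `/info` endpoint reports, so it can be re-fetched from the running container (port mapping as in the `docker run` above):

```shell
# Dump the server's runtime configuration
curl -s http://localhost:80/info
```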
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Running the model directly (no adapters) works fine. However, when I use any adapter (my own as well as several public adapters from Hugging Face, listed below), the request fails with:

```text
Request failed during generation: Server error: No suitable kernel. h_in=896 h_out=64 dtype=BFloat16
```
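A minimal request that triggers this looks like the following; a sketch assuming the OpenAI-compatible endpoint, where the `model` field selects the adapter (the adapter ID and prompt here are illustrative, taken from the list below):

```shell
# Fails with "No suitable kernel" when `model` is an adapter ID;
# the same request with the base model ID instead completes normally.
curl -s http://localhost:80/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Sephfox/eudaimonic-qwen-0.5b",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```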
When I use streaming, I can see that the first token is generated before the request fails:

```text
data: {"id":"null","object":"chat.completion.chunk","created":0,"model":"null","choices":[{"index":0,"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}
data: {"error":"Request failed during generation: Server error: No suitable kernel. h_in=896 h_out=64 dtype=BFloat16","error_type":"generation"}
```
Some adapters I tried:
- https://huggingface.co/dimasik2987/7a2c287f-1ebe-405a-8274-6ba9675e1375
- https://huggingface.co/sn56m2/089c89ea-06be-4fbc-863e-8d7c867ef8d3
- https://huggingface.co/Sephfox/eudaimonic-qwen-0.5b
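For what it's worth, h_in=896 in the error matches Qwen2.5-0.5B's hidden size, and h_out presumably reflects the adapter's LoRA projection dimension; each adapter's rank and target modules can be checked from its `adapter_config.json`, e.g.:

```shell
# Inspect a public adapter's LoRA settings (r, lora_alpha, target_modules)
curl -sL https://huggingface.co/Sephfox/eudaimonic-qwen-0.5b/resolve/main/adapter_config.json
```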
Expected behavior
The model should generate properly when a LoRA adapter is applied, just as it does without one.