Cannot run a FP8 quantized model with LoraX #671
Comments
You need to add the following flag to your LoRAX launch command |
So, |
Running model neuralmagic/Meta-Llama-3.1-8B-FP8. This is the docker-compose file:
I ran into this error: |
What GPU are you running?
…On Thu 6 Mar 2025 at 08:04, K M Asifur Rahman ***@***.***> wrote:
Running model neuralmagic/Meta-Llama-3.1-8B-FP8.
This is the docker-compose file:
services:
  lorax:
    image: ghcr.io/predibase/lorax:main
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: '1g'
    ports:
      - "8001:80"
    volumes:
      - ./data/llama3.1-fp8/hub:/data
      # - ./data/Q_llama32_vision:/model
    environment:
      - SOURCE=hub
      - QUANTIZE=fp8
      - MODEL_ID=neuralmagic/Meta-Llama-3.1-8B-FP8
      - MAX_INPUT_LENGTH=1024
      - MAX_BATCH_PREFILL_TOKENS=2048
      - MAX_TOTAL_TOKENS=2048
      - MAX_BATCH_TOTAL_TOKENS=2464
I ran into this error:
raise ValueError("FP8 quantization is only supported on hardware that supports FP8")\n\nValueError: FP8 quantization is only supported on hardware that supports FP8\n"},"target":"lorax_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
|
You should probably remove the FP8 quantization. FP8 is only supported on
GPUs with FP8 hardware (Ada Lovelace, Hopper, and later), or when the library
dequantizes dynamically. LoRAX is based on TGI, and I don't think it supports that.
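The hardware rule in this reply can be sketched as a compute-capability check. This is a minimal illustration, not LoRAX's actual check: the 8.9 threshold reflects the first NVIDIA generation with FP8 tensor cores (Ada Lovelace), and the function name `supports_fp8` is hypothetical.

```python
# Minimal sketch (assumption: FP8 needs CUDA compute capability >= 8.9,
# i.e. Ada Lovelace at 8.9 or Hopper at 9.0; this is not LoRAX's own code).
FP8_MIN_CAPABILITY = (8, 9)


def supports_fp8(major: int, minor: int) -> bool:
    """Return True if the given compute capability has FP8 tensor cores."""
    return (major, minor) >= FP8_MIN_CAPABILITY


if __name__ == "__main__":
    # Example capabilities: RTX A5000 is Ampere (8.6), L4 is Ada (8.9), H100 is Hopper (9.0).
    examples = {"RTX A5000": (8, 6), "L4": (8, 9), "H100": (9, 0)}
    for name, (major, minor) in examples.items():
        print(f"{name}: sm_{major}{minor} -> FP8 hardware: {supports_fp8(major, minor)}")
```

On a machine with a recent driver, the capability itself can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.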
…On Thu, Mar 6, 2025 at 8:26 AM K M Asifur Rahman ***@***.***> wrote:
What GPU are you running?
==============NVSMI LOG==============
Timestamp : Thu Mar 6 14:25:21 2025
Driver Version : 535.183.01
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA RTX A5000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1325123060665
GPU UUID : GPU-73d9272b-ca2b-5b93-9165-396c7102606a
Minor Number : 0
VBIOS Version : 94.02.6D.00.0D
MultiGPU Board : No
Board ID : 0x100
Board Part Number : 900-5G132-2500-000
GPU Part Number : 2231-850-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G132.0500.00.01
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x223110DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x147E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 30 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 24564 MiB
Reserved : 324 MiB
Used : 441 MiB
Free : 23798 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 16 MiB
Free : 32752 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 37 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 90 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 21.12 W
Current Power Limit : 230.00 W
Requested Power Limit : 230.00 W
Default Power Limit : 230.00 W
Min Power Limit : 100.00 W
Max Power Limit : 230.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1695 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1695 MHz
Memory : 8001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 668.750 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 567123
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 35 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 615933
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 81 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 616538
Type : G
Name : /snap/snapd-desktop-integration/253/usr/bin/snapd-desktop-integration
Used GPU Memory : 7 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3868791
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 64 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3868861
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 6 MiB
—
Reply to this email directly, view it on GitHub
<#671 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ASVG6CTXEC7LLV56UOV5UXD2TABENAVCNFSM6AAAAABRSAXNP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOMBTGE2DCOBRGY>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
[image: Asif-droid]*Asif-droid* left a comment (predibase/lorax#671)
<#671 (comment)>
What GPU are you running
… <#m_-2939667637380089419_>
==============NVSMI LOG==============
Timestamp : Thu Mar 6 14:25:21 2025
Driver Version : 535.183.01
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA RTX A5000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1325123060665
GPU UUID : GPU-73d9272b-ca2b-5b93-9165-396c7102606a
Minor Number : 0
VBIOS Version : 94.02.6D.00.0D
MultiGPU Board : No
Board ID : 0x100
Board Part Number : 900-5G132-2500-000
GPU Part Number : 2231-850-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G132.0500.00.01
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x223110DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x147E10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 30 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 24564 MiB
Reserved : 324 MiB
Used : 441 MiB
Free : 23798 MiB
BAR1 Memory Usage
Total : 32768 MiB
Used : 16 MiB
Free : 32752 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 37 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 90 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 21.12 W
Current Power Limit : 230.00 W
Requested Power Limit : 230.00 W
Default Power Limit : 230.00 W
Min Power Limit : 100.00 W
Max Power Limit : 230.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1695 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1695 MHz
Memory : 8001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 668.750 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 567123
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 35 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 615933
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 81 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 616538
Type : G
Name : /snap/snapd-desktop-integration/253/usr/bin/snapd-desktop-integration
Used GPU Memory : 7 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3868791
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 64 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3868861
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 6 MiB
—
Reply to this email directly, view it on GitHub
<#671 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ASVG6CTXEC7LLV56UOV5UXD2TABENAVCNFSM6AAAAABRSAXNP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOMBTGE2DCOBRGY>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I ran into errors with GPTQ, bnb, and AWQ quantization as well. Which quantization can be used? |
System Info
Lorax version:
Name: lorax-client
Version: 0.6.3
Summary: LoRAX Python Client
Home-page: https://github.com/predibase/lorax
Author: Travis Addair
Author-email: [email protected]
License: Apache-2.0
Location: /mnt/share/ai_studio/.venv/lib/python3.11/site-packages
Requires: aiohttp, certifi, huggingface-hub, pydantic
Required-by:
Platform:
linux, x86_64
nvidia-smi output:
Information
Tasks
Reproduction
Expected behavior
I am using the official script for running LoRAX via Docker from the official LoRAX page (section Launch LoRAX Server). The only modification is the model ID: I'm using an FP8-quantized Llama-3.1-8B.
However, it seems that LoRAX's backend does not support FP8 models, as I'm getting an FP8-related error:
Could you please investigate?