Error when serving with the Hugging Face inference tutorial #16
Hi Arctic team, great work! I followed the Huggingface Inference Tutorial to do inference, but I met the following error:

Can you help me resolve this? Thanks a lot!
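For context, the tutorial-style setup being followed looks roughly like this; the model ID and keyword arguments are reconstructed from the rest of the thread and may differ from the actual tutorial:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepspeed.linear.config import QuantizationConfig

# FP8 quantization config (q_bits=8 is the default in DeepSpeed's config).
quant_config = QuantizationConfig(q_bits=8)

tokenizer = AutoTokenizer.from_pretrained(
    "Snowflake/snowflake-arctic-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",                    # accelerate shards across GPUs
    ds_quantization_config=quant_config,  # picked up by Arctic's custom code
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("Hello from Arctic!", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```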
Hi @JF-D! Thanks for trying this out. Can you tell me a bit more about your setup? Specifically, which GPUs you are running on and which versions of transformers and DeepSpeed you have installed.
BTW, loading the checkpoints takes ~30 min on my server, it's soooo long.
Excellent, one quick follow-up question before diving into the other details: are these 40GB or 80GB A100s? W.r.t. slow load times, we are working on uploading pre-quantized checkpoints to HF. Hopefully that will help reduce the load times a bit.
They are 80GB A100s. I think with the quantization config, I should be able to run a simple example.
Gotcha, yeah I think 8xA100-80GB should work here. We have not tested this exact setup since I don't have immediate access to this hardware. I have seen that error message previously when some tensors were moved to CPU by accelerate's automatic placement.
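If it helps, a quick way to check for that (a sketch; `hf_device_map` is the attribute transformers sets on the model when `device_map` is used):

```python
# Any "cpu" or "disk" entries mean accelerate offloaded part of the model,
# which would explain the error above.
offloaded = {name: dev for name, dev in model.hf_device_map.items()
             if dev in ("cpu", "disk")}
print(offloaded if offloaded else "all modules placed on GPU")
```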
Also, if you haven't already, can you try changing q_bits=6 in the quant config?
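A minimal sketch of that change, assuming the DeepSpeed `QuantizationConfig` used in the tutorial:

```python
from deepspeed.linear.config import QuantizationConfig

# 6-bit quantization instead of the default 8-bit (FP8).
quant_config = QuantizationConfig(q_bits=6)
```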
Ok! Let me have a try and then get back to you. |
Unfortunately, setting q_bits=6 did not fix the error.
Thanks! @sfc-gh-reyazda I tried the PR you mentioned, and met the following error:
This is very strange! It means that the quantizer with which you are trying to dequantize the weight does not have the quantization state it expects, i.e. the weight was never actually quantized.
I checked the version of transformers; my latest commit is the same as the one you tried (6b1fe691bf8c34318f1beb5124db1162d93f047e).
I found the error. When trying to quantize the weights, DeepSpeed found the tensor on the meta device instead of on a GPU, so the tensor was not quantized (here). But I think I should be able to run the Arctic model with FP8 quantization on 8x80GB A100s. It's quite strange. Maybe something is wrong with Hugging Face accelerate?
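For anyone hitting the same thing, a hypothetical way to confirm this symptom from Python (assumes a model loaded with `device_map="auto"`):

```python
# Parameters still on the meta device were never materialized on a GPU, so
# DeepSpeed skips quantizing them and dequantization fails later.
meta_params = [name for name, p in model.named_parameters() if p.is_meta]
print(f"{len(meta_params)} parameters left on the meta device")
print(meta_params[:5])
```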
I guess I found the reason: transformers is not aware of the DeepSpeed quantization config, so it produces a wrong auto placement with accelerate (here).
How about explicitly specifying it:
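The snippet itself didn't survive the export; judging from the resolution below, the suggestion was presumably an explicit `max_memory` budget so that the auto device map keeps every layer on GPU. A sketch, with values assumed for 8x80GB A100s:

```python
import torch
from transformers import AutoModelForCausalLM
from deepspeed.linear.config import QuantizationConfig

quant_config = QuantizationConfig(q_bits=8)
model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
    ds_quantization_config=quant_config,
    # Over-provisioned on purpose: accelerate budgets for unquantized bf16
    # weights, so a generous per-GPU limit keeps layers off CPU/meta.
    max_memory={i: "150GiB" for i in range(8)},
    torch_dtype=torch.bfloat16,
)
```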
I have set the config as suggested, but transformers cannot pick up the quantization config set by DeepSpeed.
Ohh yes, you have to set the `max_memory` argument explicitly. We are actively working on adding DeepSpeed quantization support into HfQuantizer instead of this current approach. This should smooth out this path once it's live.
Yes! I can run it successfully after setting `max_memory`.
Excellent, glad to hear :) I'll close this for now then, please re-open if there are remaining issues though. |