
BUG: RTX 50XX nan returned by _fused.mean_scale_fuse_quant_cuda and _fused.scale_fuse_quant_cuda #164


Open
deepbeepmeep opened this issue Apr 30, 2025 · 0 comments


Hello

Sage attention works very well on most GPUs. However, I recently tried it on an RTX 5090, and occasionally one v token contains only NaN values (in my case, 512 NaNs) after being fp8-quantized. This NaN then propagates through the rest of the attention output, which becomes entirely NaN.

I have tracked the issue to the call to `_fused.mean_scale_fuse_quant_cuda` (and likewise `_fused.scale_fuse_quant_cuda`) in the `per_channel_fp8` function. I don't know whether this is related, but the v token that was entirely turned into NaN contained 512 identical values (here: 0.0010), which was not the case for the surrounding v tokens (those had 511 identical values and one different value, as these are the null-context tokens of a CFG pass).

I am sure this is a Sage bug, since the NaN problem goes away if I replace the call with sdpa. I see this problem on Windows; I don't know whether Linux is also affected.
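For context, here is a minimal NumPy sketch of what a per-channel fp8 (E4M3) quantization step typically looks like, and of one way NaNs can appear in such a step (a zero per-channel scale making the dequantization divide 0 by 0). The function names, the `eps` guard, and all details are my own assumptions for illustration, not SageAttention's actual kernel code.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in fp8 E4M3

def per_channel_fp8_quantize(v, eps=1e-12):
    """Simulate per-channel fp8 quantization of v (tokens x channels).

    Returns the simulated quantized values and per-channel scales.
    The eps guard keeps the scale nonzero even for an all-zero channel;
    without it, quantizing computes 0/0 = NaN for that channel.
    """
    amax = np.abs(v).max(axis=0, keepdims=True)    # per-channel abs-max
    scale = np.maximum(amax, eps) / FP8_E4M3_MAX   # guarded scale
    q = np.clip(np.round(v / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize(q, scale):
    return q * scale

# A token whose 512 values are all identical (0.0010), as in the report:
v = np.full((4, 512), 0.0010, dtype=np.float32)
q, scale = per_channel_fp8_quantize(v)
assert not np.isnan(dequantize(q, scale)).any()
```

This is only a hypothesis about where the NaN could come from; the actual fused CUDA kernel may fail for a different reason (e.g. an architecture-specific issue on sm_120).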
