Image quality fixed, but where is the speed?
So, I got the updated NVFP4 model from today, and it definitely smoothed out the strange luminance blotching that was happening on the previous version. But after a bunch of tinkering and trying different variables, I couldn't reproduce the speed increase that NVFP4 models are supposed to deliver. I did notice a slight reduction in VRAM usage, but the render times were only a hair faster than the FP8 model when not using sage attention. With sage attention, the FP8 model was actually 77 seconds faster than the NVFP4 model. For clarity: no sage attention was used on the NVFP4 runs, only on the FP8 model. Is this normal? I was under the impression that NVFP4 models were obscenely fast.
Speed difference for me vs FP8 on my 5090, 361 frames.
FP8 = 1.97s/it
NVFP4 = 1.39s/it
Overall, the full prompt execution is only about 7 seconds faster for 361 frames, which isn't really a big difference.
VRAM usage is the big difference though: FP8 = 29.8 GB, NVFP4 = 22 GB.
So on a 5080 it would probably make a massive difference. On a 5090 it isn't overflowing VRAM and needing to swap in and out, so there isn't much gain.
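For anyone wondering how a 0.58 s/it gap only nets ~7 seconds: the total saving depends on the sampler step count, not the frame count. A quick back-of-envelope sketch (the 12-step count is my assumption, chosen only because it makes the reported numbers line up; it was not stated in the post):

```python
# Reported per-iteration times on the 5090 (from the post above)
fp8_s_per_it = 1.97
nvfp4_s_per_it = 1.39

# Assumed sampler step count -- NOT stated in the post; ~12 steps is
# simply what makes a 0.58 s/it gap add up to roughly 7 seconds.
steps = 12

saving = (fp8_s_per_it - nvfp4_s_per_it) * steps
print(f"Saved per run: {saving:.2f} s")  # roughly 7 s
```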
If it's slower, then you're missing something, like CUDA 13, since for it to be slower it's probably type-casting itself back into FP8, and you then have the overhead from that.
I've been struggling to get sage attention to work on it too.
So, I've heard that you don't want to use sage attention on the NVFP4 models, that it actually slows things down worse than using the FP8 model. I am, however, always worried about whether Comfy is suffering some dependency issue that's crippling performance. I just recently installed the Desktop version (it was late, I was tired), and I noticed it's using Python 3.12 where my previous Portable version was on 3.13, but as far as I can tell everything is running smooth. Sage works when I turn it on. It's easy to install on Desktop once you find the right wheel file for it; it's a little more of a pain on the Portable version, but it's fine for now. I guess I was expecting some crazy speed increase just based on the ease of processing 4-bit over 16-bit. I mean, theoretically it should be faster, but it was almost identical to FP8 times with a noticeable drop in quality...
As far as CUDA, I'm on 13.1, running a 5090 and a Pro 6000. I could see it making obvious massive gains in low-VRAM circumstances, but that isn't really either of us. Guess I was just hoping for some incredible compute sorcery that gave us 4-bit speed with 8-bit-ish quality.
Have you played with overclocking? There are some good gains in speed there on the 5090. I'd imagine the 6000 Pro would see even better gains, since it's even more power-restricted.
I run +300 core, +3000 memory, with the curve capped to 1.0 V. It ends up faster than it is at stock letting it hit 1.1 V.
Oh no... I just don't do that, especially with Blackwell. You're gonna laugh, but I power limit both of them to 400 W and restrict their clocks to 2700 MHz. The Pro 6000 stays under 60°C and the 5090 has never been over 63°C. I have overclocked my other cards, but the gains weren't really worth cranking so much more power through the silicon. And with the Blackwell cards, you've gotta watch out for their transient voltage spikes. But yeah, I either undervolt or power limit just to keep everything quiet, cool, and consistent. And they still crush everything you throw at 'em.
So the complete opposite of me lol. I buy every gen and have always pushed the hardware to the limit, ever since I got my first high-end card with an 8800 Ultra. Never had anything die on me, but I've always been sensible about not pushing voltage hard and keeping temps low. Overclocking nowadays is more like undervolting, since it runs the same voltage curve; you're just telling it to add an extra clock offset at that same voltage.
My card doesn't see over 55°C being water-cooled. I do have the 800 W Matrix BIOS on it, which is a 70% inference speedup, but I'm too scared of frying the plug, so it stays around 580-680 W with the 1.0 V limit on the curve, which is about 3250-3350 MHz.
Please can you link that fixed version? It's the same for me when downloading from here.
For everyone not seeing a speed boost: you MUST have a Blackwell GPU, and you MUST use CUDA 13 or 13.1.
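A quick way to sanity-check both requirements from Python. The helper function and the 12.x compute-capability cutoff for consumer Blackwell are my own sketch, not an official API; in a live ComfyUI environment you'd feed it `torch.cuda.get_device_capability()` and `torch.version.cuda`:

```python
def meets_nvfp4_requirements(capability, cuda_version):
    """Rough NVFP4 prerequisite check (a sketch, not official).

    capability   -- (major, minor) tuple, e.g. torch.cuda.get_device_capability()
    cuda_version -- string like "13.1", e.g. torch.version.cuda (may be None)

    Consumer Blackwell cards (5090 etc.) report compute capability 12.x.
    """
    major, _minor = capability
    cuda_major = int(cuda_version.split(".")[0]) if cuda_version else 0
    return major >= 12 and cuda_major >= 13

# A 5090 (12, 0) on CUDA 13.1 passes; a 4090 (8, 9) on CUDA 12.8 doesn't.
print(meets_nvfp4_requirements((12, 0), "13.1"))  # True
print(meets_nvfp4_requirements((8, 9), "12.8"))   # False
```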
Lol, yeah, complete opposite... I was already cringing at the thought of a 500 W card, let alone 580 W. Then all of the melting 12VHPWR issues were just absurd. I didn't know about this Matrix BIOS; a 70% increase is pretty damn substantial, I'll check that out. 55°C water-cooled, nice! Mine are both the RTX FE design. And yeah, undervolting or offsetting the curve to a lower voltage is cool, but in monitoring, the cards do show some pretty insane transient spikes in voltage that I wouldn't put any silicon through. I don't like thinking about that happening while I'm cranking out a render, or ever, so I just limited the clocks. Can they go faster? Absolutely. Do I need them to? Not really. I am curious about that Matrix BIOS inference speedup, though. I'll check it out.
I prefer using the "Patch Sage Attention KJ node" (auto/true) over the --use-sage-attention flag.
Here are my results (8 steps, 121 frames, LCM sampler, ~1000x1000):
[00:56, 7.08s/it] FP8
[00:47, 5.95s/it] FP8 + sage attention
[00:40, 5.06s/it] NVFP4
[00:30, 3.83s/it] NVFP4 + sage attention
Total VRAM 24463 MB, total RAM 97644 MB
pytorch version: 2.10.0+cu130
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 5090 Laptop GPU : cudaMallocAsync
Python version: 3.13.11 (tags/v3.13.11:6278944, Dec 5 2025, 16:26:58) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.17.0
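For comparison, here are the relative speedups implied by those s/it numbers, taking plain FP8 as the baseline (just arithmetic on the figures quoted above):

```python
# Per-iteration times from the benchmark above (seconds/iteration)
fp8, fp8_sage = 7.08, 5.95
nvfp4, nvfp4_sage = 5.06, 3.83

# Speedup of each config relative to plain FP8
for name, t in [("FP8+sage", fp8_sage), ("NVFP4", nvfp4), ("NVFP4+sage", nvfp4_sage)]:
    print(f"{name}: {fp8 / t:.2f}x")
# FP8+sage: 1.19x, NVFP4: 1.40x, NVFP4+sage: 1.85x
```

So on this laptop 5090, NVFP4 plus sage attention was nearly twice as fast as plain FP8.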
@Yauhen2 Oh cool, so you used sage attention with the NVFP4 model and it actually helped! I was under the impression that it wasn't recommended. I also use the KJ node method; it's more convenient. I also noticed your Python version is 3.13.11. Is that system Python or Comfy Python? Are you on the Portable or Desktop version of Comfy? I think Desktop, right?
Can you share your ComfyUI workflow on pastebin?
Comfy python, portable version. I had to manually compile the sageattention-2.2.0 wheel myself.
@Yauhen2 Yeah, I found a good wheel a while back that worked with both my Portable and Desktop versions, but I blew up my Portable the other day, so I just re-installed Desktop real quick. About a month ago, I was going through hell trying to sort out the right CUDA, PyTorch, Python, etc. with Comfy between 3.12 and 3.13, but either seems to work now. Desktop still installs with 3.12; I thought maybe you updated Desktop to 3.13.
Matrix BIOS won't work on an FE card. The 70% gain is because it has an 800 W power limit, and the 5090 die is basically choked at 600 W. You get linear performance gains up until around 750 W, which is what the power limit should really have been, but they didn't want to include two plugs on the card to support that, so they gimped it to 600 W.
I've never had an issue with how much power something draws; my 4090 was on the 666 W BIOS, and my 3090 before that was shunt-modded to 540 W.
As long as you can cool it, it's not going to hurt it. I sold the 4090 to my brother, and the 3090 is still going strong running my Discord AI bot 24/7 lol.
You're confusing transient current spikes with voltage. There are no voltage spikes; the VRM is very well over-engineered even on the FE cards, and Nvidia capped the voltage at 1.1 V even though the TSMC 4N process node is rated for 1.2 V. Transient current spikes have been a big issue since the 3090, since transistor counts started getting crazy for the power limit the cards were restricted to, so the card can blow past the power limit for a few milliseconds.
For example, the 5090 has 92.2 billion transistors. If they were all to switch at the same time at 1 V (to make the math easier), that would be around 1200 W it wants, which it will spike towards for a millisecond before the power limit can throttle the voltage and clocks to get it under control.
https://www.3dmark.com/sn/10120443
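The ~1200 W figure can be sketched with the standard dynamic-power formula P = α·N·C·V²·f. The per-transistor switched capacitance, activity factor, and clock below are illustrative guesses on my part, not measured values; they're just one plausible combination that lands in that ballpark:

```python
# Back-of-envelope dynamic power: P = alpha * N * C * V^2 * f
N = 92.2e9       # transistors on the 5090 die (from the post)
C = 0.04e-15     # assumed switched capacitance per transistor, farads (guess)
V = 1.0          # volts, per the post's simplification
f = 3.2e9        # assumed clock, Hz (guess)
alpha = 0.1      # assumed activity factor, fraction switching per cycle (guess)

P = alpha * N * C * V**2 * f
print(f"{P:.0f} W")  # on the order of 1200 W
```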
@bmgjet Yeah, I mean, like I said, I know they could both handle more; I just prefer keeping things comfortably within spec these days. Really, most of my paranoia stemmed from the 12VHPWR connector drama, so I just eliminated it and never looked back. Maybe that's a little "geezerish" of me, but if so, I am a happy geezer, with comfy transistors and cool cables.