djoldman 5 days ago | next |

This is one in a long line of posts saying "we took a model and made it smaller" and now it can run with different requirements.

It is important to keep in mind that modifying a model changes the performance of the resulting model, where performance is "correctness" or "quality" of output.

Just because the base model is very performant does not mean the smaller model is.

This means that another model that is the same size as the new quantized model may outperform the quantized model.

Suppose there are equal-sized big models A and B with their smaller quantized variants a and b. A being more performant than B does not guarantee that a is more performant than b.

ttul 5 days ago | root | parent | next |

While I agree that there are many posts here on Hacker News announcing new model compression techniques, your characterization above understates the technical innovations and practical impact described in this MIT paper.

Unlike traditional model compression work that simply applies existing techniques, SVDQuant synthesizes several ideas in a comprehensive new approach to model quantization:

- Developing an outlier absorption mechanism based on low-rank decomposition — this aspect alone seems quite novel, although the math is admittedly way beyond my level

- Combining SVD with smoothing in a way that specifically addresses the unique challenges of diffusion models (a rough sketch of the smoothing step appears right after this list)

- Creating an innovative kernel fusion technique (they call it “Nunchaku”) that makes the theoretical benefits practically realizable, because without this, the extra computation required to implement the above steps would simply slow the model back down to baseline
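For anyone curious what the smoothing half of that looks like in isolation, here is a rough numpy sketch of SmoothQuant-style scale migration (my own illustration with hypothetical shapes, not the paper's code): per-channel scales move the activation outliers into the weights, where the high-precision low-rank branch can then absorb them.

  import numpy as np

  def smooth(X, W, alpha=0.5, eps=1e-8):
      # X: (tokens, in_features) activations, W: (in_features, out_features) weights.
      # X @ W == (X / s) @ (s[:, None] * W), so the rescaled activations become far
      # easier to quantize; the rescaled weights get harder, which is exactly what
      # the low-rank branch is there to soak up.
      s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha) + eps)
      return X / s, W * s[:, None]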

This isn't just incremental improvement - the paper achieves several breakthrough results:

- First successful 4-bit quantization of both weights AND activations for diffusion models

- 3.5x memory reduction for 12B parameter models while maintaining image quality

- 3.0x speedup over existing 4-bit weight-only quantization approaches

- Enables running 12B parameter models on consumer GPUs that previously couldn't handle them

And, I’ll add, as someone who has been following the diffusion space quite actively for the last two years, the amount of creativity that can be unleashed when models are accessible to people with consumer GPUs is nothing short of astonishing.

The authors took pains to validate their approach by testing it on three models (Flux, PixArt-Sigma, and SDXL) and along several quality-comparison axes (FID score, ImageReward, LPIPS, and PSNR). They also did a proper ablation study to measure each component's contribution to image quality.

What particularly excites me about this paper is not the ability to run a model that eats 22GB of VRAM in just 7GB. The exciting thing is the prospect of running a 60GB model in 20GB of VRAM. I’m not sure whether anyone has or is planning to train such a monster, but I suspect that Midjourney, OpenAI, and Google all have significantly larger models running in their infrastructure than what can be run on consumer hardware. The more dimensions you can throw at image and video generation, the better things get.

djoldman 5 days ago | root | parent |

I definitely agree that there may be some interesting advancements here.

I am trying to call attention to the models used for evaluation comparison. There are 3 factors: inference speed/latency, model size in total loaded VRAM, and model performance in terms of output.

Comparisons should address all of these considerations, otherwise it's easy to hide deficiencies.

Jackson__ 5 days ago | root | parent |

The site literally has a quick visual comparison near the top, which shows that theirs is the closest to the 16-bit output of all the methods compared. I don't get what more you'd want.

https://cdn.prod.website-files.com/64f4e81394e25710d22d042e/...

djoldman 5 days ago | root | parent |

These are comparisons to other quantizing methods. That is fine.

What I want to see is comparisons to NON-quantized models all with around the same VRAM along with associated inference latencies.

Also, we would want to see the same quantization schemes applied to other base models, because perhaps the paper's proposed scheme only beats the others on a particular base model.

snovv_crash 5 days ago | root | parent | next |

They tested the quantisation on 3 different models.

They also show it has little to no effect relative to fp16 on these models.

IMO that's enough. Comparison against smaller models is much less useful because you can't use the same random seeds. So you end up with a very subjective "this is worse" based purely on the aesthetic preferences of one person vs. another. You already see this with Flux Schnell vs. the larger Flux models.

djoldman 5 days ago | root | parent |

I disagree.

They report that their method shrinks Flux from 22.7 GB to 6.5 GB. Why wouldn't you want to know how their 6.5 GB model compares to other 6.5 GB models?

Regarding aesthetic preferences: what the appropriate metric for GenAI is remains an open problem... The LLM arena is widely regarded as a good way to measure LLMs, and that's based on user preferences.

In any case, the authors report LPIPS etc. They could do the same for other small models.

snovv_crash 2 days ago | root | parent |

LPIPS and similar metrics don't work if the scene is different, as happens when the random seeds don't match. This is why they can use them to compare the quantised network against the original, but not against networks with a reduced number of weights.

refulgentis 5 days ago | root | parent | prev | next |

I'm really confused; this looks like concern trolling, because there's a live demo for exactly this A/B testing that, IIRC, was near the top of the article, close enough that it was the first link I clicked.

But you're quite persistent that they need to address this, so it seems much more likely that they silently added it after your original post, or that you didn't click through; concern trolling would stay vaguer than this.

Dylan16807 5 days ago | root | parent |

The demo is not what they're asking for. It compares original versus quantized. They want quantized versus a different model of similar size in GB.

boulos 5 days ago | root | parent | prev | next |

As others have replied, this is reasonable general feedback, but in this specific case the work was done carefully. Table 1 from the linked paper (https://arxiv.org/pdf/2411.05007) includes a variety of metrics, while an entire appendix is dedicated to quality comparisons.

By showing their work side by side with other quantization schemes, you can also see a great example of the range of results you can get from these slight tweaks (e.g., ViDiT INT8), and that their quantization does a much better job of reproducing the "original" (Figure 15).

In this application it isn't strictly necessary to reproduce the same results, but this work does a pretty good job of it anyway.

djoldman 5 days ago | root | parent |

Agreed.

Once a model has been trained, I believe the main metrics people care about are

1. inference speed

2. memory requirements

3. quality of output.

There are usually tradeoffs here. Generally you get a lower memory requirement (a good thing), sometimes faster inference (a good thing), but usually a lower quality of output.

I don't think reproduction of original output is the typical goal.

lostmsu 4 days ago | root | parent | prev | next |

This is a very real concern. I've seen quantized LLMs output complete garbage. In most cases it definitely felt like a smaller unquantized model would do better. Such models should be included in every comparison.

E.g. compare quantized LLaMA 70B to unquantized LLaMA 8B.

Even better if the tested model family has a smaller version with a byte size similar to the quantized larger one.

superkuh 5 days ago | root | parent | prev | next |

Not really. Here they also quantized the activations, with their inference engine, which decreased compute as well as RAM usage (and required bandwidth). That's a big step.

tbalsam 5 days ago | root | parent | prev |

Did you... did you read the technical details? This is almost all they talk about; it's exactly what the method was created to get around.

Take a look, it's good stuff! Basically a LoRA-style low-rank branch to reconstruct the outliers lost to quantization, helping preserve the performance of the original model.

mesmertech 5 days ago | prev | next |

Demo on actual 4090 with flux schnell for next few hours: https://5jkdpo3rnipsem-3000.proxy.runpod.net/

It's basically H100 speeds on a 4090: 4.80 it/s, i.e. 1.1 seconds for Flux Schnell (4 steps) and 5.5 seconds for Flux Dev (25 steps). Compare that to normal speeds (ComfyUI fp8 with the "--fast" optimization), which are 3 seconds for Schnell and 11.5 seconds for Dev.

qeternity 5 days ago | root | parent | prev | next |

The compute differential between an H100 and a 4090 is not huge. The main single GPU benefits are larger memory (and thus memory bandwidth) and native fp8. But these matter less for diffusion models.

mesmertech 5 days ago | root | parent |

That's what I thought as well, but FP8 is much faster on the H100, like 2x-3x. You can check it/s here: https://github.com/aredden/flux-fp8-api

It's why fal, Replicate, and pretty much all the big diffusion API providers use H100s.

tl;dr: the 4090 maxes out at 3.51 it/s even with all the current optimizations; the H100 does 11.5 it/s with all optimizations, and 6.1 it/s even without them.

boroboro4 3 days ago | root | parent |

Providers use H100s because using 4090s in data centers is a grey area; Nvidia doesn't permit it.

The paper discussed here uses 4-bit compute, which is 4x the bf16 compute rate on a 4090, while the H100 doesn't have this at all (i.e. the best you can get there is 2x with fp8). So this paper evens out the difference between the two to some extent. Judging by theoretical numbers, the H100 has 1979 TFLOPS of fp8 compute and the 4090 has 1321 TOPS, which puts the 4090 at roughly two-thirds of the H100's performance. Given its ~$2K price compared to the H100's ~$30K, that seems like a very good deal.

But again, no 4090s in data centers.

yakorevivan 5 days ago | root | parent | prev |

Hey, can you share the inference code please? Thanks..

oneshtein 5 days ago | root | parent |

Cannot compile it locally on Fedora 40:

  nunchaku/third_party/spdlog/include/spdlog/common.h(144): error: namespace "std" has no member "function"
  using err_handler = std::function<void(const std::string &err_msg)>;
                                   ^
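(If it helps: that error just means <functional> isn't in scope where std::function is used; adding #include <functional> near the top of nunchaku/third_party/spdlog/include/spdlog/common.h, or updating the bundled spdlog, is a plausible local workaround.)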

mesmertech 5 days ago | root | parent |

Yeah, it's a pain. I'm trying to make an API endpoint for a website I own and am working on a Docker image. This is what I have for now that "just" works:

The conda always-yes setting makes sure you can just paste the script and it all works, instead of having to press "y" for each install. Also, if you don't feel like installing a wheel from a random person on the internet, replace that step with "pip install -e ." as the repo suggests. I compiled that wheel with CUDA 12.4 because compilation is the part that takes the most time and most often seems to break.

Also, I'm not sure whether this will work on Fedora. I tried it on a RunPod machine with a 4090 (apparently it only works on a few cards: 3090, 4090, A100, etc.) with CUDA 12.4 on the host machine and "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04" as the base image.

EDIT: using pastebin instead as HN doesn't seem to jive with code blocks: https://pastebin.com/zK1z0UdM

oneshtein 5 days ago | root | parent |

Almost working:

  [2024-11-09 19:33:55.214] [info] Initializing QuantizedFluxModel
  [2024-11-09 19:33:55.359] [info] Loading weights from ~/.cache/huggingface/hub/models--mit-han-lab--svdquant-models/snapshots/d2a46e82a378ec70e3329a2219ac4331a444a999/svdq-int4-flux.1-schnell.safetensors
  [2024-11-09 19:34:01.432] [warning] Unable to pin memory: invalid argument
  [2024-11-09 19:34:02.143] [info] Done.
  terminate called after throwing an instance of 'CUDAError'
    what():  CUDA error: pointer does not correspond to a registered memory region (at /nunchaku/src/Serialization.cpp:32)

mesmertech 5 days ago | root | parent |

Probably make sure your host machine's CUDA is also 12.4, and if not, update the CUDA versions in the pastebin to the one you have. I don't think it works with CUDA 11.8 though; I remember trying that once.

But yeah, I can't help you outside of RunPod; I haven't even tried this on my home PCs yet. For my use case of a serverless API, it seems to work.

gyrovagueGeist 3 days ago | prev | next |

This problem seems like it would be very similar to the Low-Rank + Sparse decompositions that used to be popular in audio-visual filtering.

notarealllama 5 days ago | prev | next |

I'm convinced the path to ubiquity (such as embedded in smartphones) is quantization.

I had to quantize a Llama model to int4 to get it to run properly on my 3060.

I'm curious, how much resolution / significant digits do we actually need for most genAI work? If you can draw a circle with 3.14, maybe it's good enough for fast and ubiquitous usage.

sigmoid10 5 days ago | root | parent | next |

Earlier this year there was a paper from Microsoft where they trained a 1.58-bit (every parameter being ternary) LLM that matched the performance of 16-bit models. There's also other research showing that you can prune up to 50% of layers with minimal loss of performance. Our current training methods are just incredibly crude, and we will probably look back on them in the future and wonder how this ever worked at all.

llm_trw 5 days ago | root | parent |

None of those papers actually use quantized training, they are all about quantized inference.

Which is rather unfortunate as it means that the difference between what you can train locally and what you can run locally is growing ever larger.

danielEM 5 days ago | root | parent | next |

Indeed. I think the "AI gold rush" sucks anyone with any skills in this area into it with relatively good pay, so there are no, or almost no, people outside of big tech and startups to counterbalance the direction it moves in. And as a side note, big tech is, and always was, putting its agenda first when developing any tech or standard, and that usually means milking investments as long as possible, not necessarily moving things forward.

llm_trw 5 days ago | root | parent |

There's more to it than that.

If you could train models faster, you’d be able to build larger, more powerful models that outperform the competition.

The fact that Llama 3 is trained significantly beyond what was considered ideal even three years ago shows there's a strong appetite for efficient training. The lack of progress isn't due to a lack of effort: no one has managed to do this yet because no one has figured out how.

I built 1-trit quantized models as a side project nearly a decade ago. Back then, no one cared, because models weren't yet using all available memory, and on devices where memory was fully utilized, compute power was the limiting factor. I spent much longer trying to figure out how to get 1-trit training to work, and I never could. Of all the papers and people in the field I've talked to, no one else has either.

p1esk 5 days ago | root | parent | next |

People did care back then. This paper jumpstarted the whole model compression field (which had also been a hot area of research in the early '90s): https://arxiv.org/abs/1511.00363

Before that, in 2012, AlexNet had to be partially split into two submodels running on two GPUs (using a form of interlayer grouped convolutions) because it could not fit in the 3GB of a single 580 card.

Ternary networks appeared in 2016. Unless you mean you actually tried to train in ternary precision, which is clearly not possible with any gradient-based optimization method.

sixfiveotwo 5 days ago | root | parent | prev |

> I spent much longer trying to figure out how to get 1-trit training to work, and I never could.

What did you try? What were the research directions at the time?

llm_trw 5 days ago | root | parent |

This is a big question that needs a research paper's worth of explanation. Feel free to email me if you care enough to have a more in-depth discussion.

sixfiveotwo 5 days ago | root | parent |

Sorry, I understand it was a bit intrusively direct. For context: I toyed a little with neural networks a few years ago and wondered about this topic of training a so-called quantized network myself (I wanted to write a small multilayer-perceptron library parameterized by the coefficient type, floating point or integers of various precisions), but I never implemented it. Since you mentioned your own work in that area, it piqued my interest, but I don't want to waste your time unnecessarily.

sigmoid10 5 days ago | root | parent | prev |

That's wrong. I don't know where you got that information from, because it is literally the opposite of what is shown in the Microsoft paper mentioned above. They explicitly introduced this extreme quantization during training from scratch and show how it can be made stable.

llm_trw 5 days ago | root | parent |

I got it from section 2.2

> The number of model parameters is slightly higher in the BitLinear setting, as we both have 1.58-bit weights as well as the 16-bit shadow weights. However, this fact does not change the number of trainable/optimized parameters in practice.

https://arxiv.org/html/2407.09527v1

buildbot 5 days ago | root | parent | next |

Exactly what XNOR-Net was doing way back in 2016: shadow 32-bit weights, quantized to 1 bit during the forward pass.

https://arxiv.org/abs/1603.05279

I personally have a pretty negative opinion of the bitnet paper.

llm_trw 5 days ago | root | parent |

Thanks for the citation, I did my work in the area around 2014 and never looked back. That's a very good summary of the state of the field as I remember it.

sigmoid10 5 days ago | root | parent | prev |

What? That's the wrong paper. It is not even from Microsoft. This is it: https://www.microsoft.com/en-us/research/publication/bitnet-...

>we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch

llm_trw 4 days ago | root | parent |

Section 2.2 from your paper, with less clarity and more obfuscation:

>While the weights and the activations are quantized to low precision, the gradients and the optimizer states are stored in high precision to ensure training stability and accuracy. Following the previous work [ LSL+21 ], we maintain a latent weight in a high-precision format for the learnable parameters to accumulate the parameter updates. The latent weights are binarized on the fly during the forward pass and never used for the inference process.

https://arxiv.org/pdf/2310.11453

The other paper had a much nicer and clearer introduction to BitLinear than the original Microsoft paper, which is why I used it. Put uncharitably: they, at least, aren't burying the lede ten paragraphs in.
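For readers following along, here is a minimal PyTorch-style sketch of the latent-weight trick both quotes describe (my own illustration, not code from either paper): the optimizer updates full-precision shadow weights, and the forward pass ternarizes them on the fly with a straight-through estimator.

  import torch
  import torch.nn as nn

  class TernaryLinear(nn.Module):
      def __init__(self, in_features, out_features):
          super().__init__()
          # full-precision latent ("shadow") weights: these are what the optimizer updates
          self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features))

      def forward(self, x):
          w = self.weight
          scale = w.abs().mean() + 1e-8                    # absmean scaling, one common choice
          w_q = (w / scale).round().clamp(-1, 1) * scale   # on-the-fly ternary {-1, 0, +1}
          # straight-through estimator: the forward pass uses w_q, but gradients
          # flow to the latent full-precision weights as if no quantization happened
          w_ste = w + (w_q - w).detach()
          return x @ w_ste.t()

At inference time only the ternary values (plus one scale) are needed, but during training the full-precision latent weights and optimizer state are still resident, which is the gap between quantized inference and quantized training being discussed here.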

sigmoid10 4 days ago | root | parent |

They are not hiding anything, because this is standard behaviour for all current optimisers. You still get a massive memory improvement from lower bit model weights during training.

halJordan 4 days ago | root | parent | prev |

Do you want a cookie for joining the overwhelming majority?

Necessary precision depends on, unsurprisingly, what you're truncating. Flux drops off around q6. Text generation around q4.

The LLMs Apple is putting in iPhones are q4 3B models.

atlex2 5 days ago | prev | next |

Seriously, nobody thought to use SVD on these weight matrices before?

liuliu 5 days ago | root | parent |

I did try, but in the wrong way (applying SVD to the quantization error to recover quality, i.e. SVD(W - Q(W))). The lightbulb moment in this paper is to do SVD on W first and then quantize the remainder.
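To make the difference concrete, here is a rough numpy sketch (my own illustration with made-up sizes and a naive 4-bit quantizer, not the paper's code): pull the top singular directions out of W into a high-precision low-rank branch, then quantize only the residual, whose dynamic range is much smaller.

  import numpy as np

  def quantize_int4(x):
      # naive symmetric per-tensor 4-bit quantizer, for illustration only
      scale = np.abs(x).max() / 7.0 + 1e-12
      return np.round(x / scale).clip(-8, 7) * scale

  def svd_then_quantize(W, rank=32):
      U, S, Vt = np.linalg.svd(W, full_matrices=False)
      L = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # low-rank branch, kept in high precision
      return L + quantize_int4(W - L)            # only the residual is quantized

  rng = np.random.default_rng(0)
  # synthetic weight with a few dominant singular directions plus a small dense remainder
  W = 2.0 * rng.normal(size=(512, 8)) @ rng.normal(size=(8, 512)) + 0.1 * rng.normal(size=(512, 512))

  print(np.linalg.norm(W - quantize_int4(W)))        # direct 4-bit quantization
  print(np.linalg.norm(W - svd_then_quantize(W)))    # low-rank branch + 4-bit residual

On a matrix with a fast-decaying spectrum like this one, the second error comes out far smaller than the first.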

xrd 5 days ago | prev | next |

Can someone explain this sentence from the article:

  Diffusion models, however, are computationally bound, even for single batches, so quantizing weights alone yields limited gains.

llm_trw 5 days ago | root | parent | next |

Diffusion requires a lot more computation to get results compared to transformers. Naively, when I'm running a transformer locally I get about 30% GPU utilization; when I'm running a diffusion model I get 100%.

This means that the only speed saving you get for a diffusion model is being able to do more effective FLOPS because the numbers are smaller, e.g. instead of doing one 32-bit multiplication you do eight 4-bit ones.

By comparison, for transformers you gain not only the FLOPS increase but also the improvement in memory shuffling, e.g. it also takes 8 times less time to load the weights from VRAM into on-chip memory.

The above is a vast oversimplification and in practice has more asterisks than you can shake a stick at.

flutetornado 5 days ago | root | parent | prev | next |

GPU workloads are either compute bound (floating point operations) or memory bound (bytes being transferred across memory hierarchy.)

Quantizing in general helps with the memory bottleneck but does not help in reducing computational costs, so it’s not as useful for improving performance of diffusion models, that’s what it’s saying.

pkAbstract 5 days ago | root | parent |

Exactly. The smaller bit widths from quantization might marginally decrease the compute required for each operation, but they do not reduce the overall volume of operations. So, the effect of quantization is generally more impactful on memory use than compute.
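A back-of-the-envelope way to see the distinction (the layer sizes and peak figures below are my own rough, 4090-class assumptions, not numbers from the article): compare a matmul's arithmetic intensity to the GPU's FLOPs-per-byte balance point.

  def regime(flops, bytes_moved, peak_tflops=165.0, peak_bw_gbs=1000.0):
      intensity = flops / bytes_moved                              # useful FLOPs per byte moved
      machine_balance = peak_tflops * 1e12 / (peak_bw_gbs * 1e9)   # FLOPs the GPU can do per byte it can fetch
      return "compute-bound" if intensity > machine_balance else "memory-bound"

  m, k = 4096, 4096                                    # hypothetical layer
  for name, n in [("LLM decode, 1 token", 1), ("diffusion step, 4096 tokens", 4096)]:
      flops = 2 * m * k * n                            # one matmul
      bytes_moved = 2 * (m * k + k * n + m * n)        # fp16 weights and activations, touched once
      print(name, "->", regime(flops, bytes_moved))

A batch-of-one decode step reads every weight for a single token, so it sits far below the balance point (memory-bound) and shrinking the weights helps directly; a diffusion step reuses each weight across thousands of image tokens, so it sits far above it (compute-bound) and only faster arithmetic helps.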

superkuh 5 days ago | root | parent |

Except in this case they quantized both the parameters and the activations leading to decreased compute time too.

boulos 5 days ago | root | parent | prev |

The next sentence there:

> To achieve measured speedups, both weights and activations must be quantized to the same bit width; otherwise, the lower precision is upcast during computation, negating any performance benefits.

tries to explain that.

What it means, though, is that if you only store the inputs in lower precision but still upcast to, say, bf16 or fp32 to perform the operation, you're not getting any computational speedup. In fact, you're paying extra for upconverting and then downconverting afterwards.
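A tiny sketch of the two cases (hypothetical pseudo-PyTorch, emulated in float so it stays runnable; not the paper's Nunchaku kernels):

  import torch

  def w4a16_linear(x, w_int4, w_scale):
      # weight-only quantization: the int4 weights are dequantized (upcast) before the
      # matmul, so the multiply itself still runs at 16/32-bit speed
      w = w_int4.float() * w_scale
      return x @ w.t()

  def w4a4_linear(x_int4, x_scale, w_int4, w_scale):
      # weight-and-activation quantization: both operands stay low precision, so a real
      # kernel can use the int4 tensor cores and only widen the accumulator afterwards
      acc = x_int4.float() @ w_int4.float().t()        # stands in for an int4 matmul with int32 accumulation
      return acc * (x_scale * w_scale)

In the first case you still save memory and bandwidth, but the arithmetic runs at the higher precision; in the second, the low-precision compute path applies, which is where the measured speedup comes from.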

scottmas 5 days ago | prev | next |

Possible to run this in ComfyUI?

vergessenmir 5 days ago | root | parent | next |

The repo has sample code and it is fairly easy to create a node that will do it.

You won't, however, have access to the usual sampler, latent image, and LoRA nodes to do anything beyond basic t2i.

DeathArrow 5 days ago | prev |

But doesn't quantization give worse results? Don't you trade quality for memory footprint?

timnetworks 5 days ago | root | parent |

They're saying this method essentially does not, even when mixed with low-rank models on top: "Notably, while the original BF16 model requires per-layer CPU offloading on the 16GB laptop 4090, our INT4 model fits entirely in GPU memory, resulting in a 10.1× speedup by avoiding offloading."

This is the whole magic: the rest of the workflow doesn't need to unload and flush memory, which causes big delays between jobs.