[Warning] Merge lora module to 4-bit linear may get different generations
#2321
Comments
There is no way to avoid this. When merging weights in such a low precision regime, rounding errors are unavoidable. We give this warning purely to make users aware of that fact, not because they did anything wrong. What you can try for your use case is to load the base model without quantization, merge the LoRA weights into the unquantized model, and then quantize the merged model to 4 bit. Please verify afterwards whether this gives better results or not. But the overall issue with low precision will still remain.
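A minimal sketch of that recipe (not taken from the issue; the model name, adapter path, and output directory are placeholders):

```python
# Sketch: merge the LoRA adapter into an unquantized base model first,
# then save the merged model so it can be re-quantized afterwards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"   # placeholder base model
adapter_dir = "path/to/lora-adapter"   # placeholder adapter path

# 1. Load the base model in higher precision (no 4-bit quantization).
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# 2. Attach the LoRA adapter and merge it into the full-precision weights.
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()  # no 4-bit rounding involved here

# 3. Save the merged model so it can be re-loaded with quantization applied.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```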
Hey @BenjaminBossan, do you mean this?:
[quoted list of steps; step 7 loads the merged model with the intended quantization]
Exactly, just make sure in step 7 that you load the merged model with the intended quantization applied. Some users have reported that this yields better results for them than merging into the quantized weights. But please verify that this is true for your use case (and report back please!).
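For illustration, "step 7" could look roughly like this (a sketch, assuming the merged checkpoint was saved to the placeholder directory `merged-model` from the snippet above):

```python
# Sketch: load the merged, full-precision checkpoint with the intended
# 4-bit bitsandbytes quantization applied at load time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized_merged = AutoModelForCausalLM.from_pretrained(
    "merged-model",                    # placeholder path to the merged model
    quantization_config=bnb_config,
    device_map="auto",
)
```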
@BenjaminBossan Thanks for the note. I will try it and report back here.
Step 7 can significantly degrade the results compared to just loading the LoRA adapter on top of the quantized model. That's because the merged LoRA weights are quantized.
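The alternative described here, keeping the adapter separate instead of merging, might look like this (a sketch with placeholder model name and adapter path):

```python
# Sketch: load the 4-bit quantized base model and attach the LoRA adapter
# on top of it, without calling merge_and_unload(). The LoRA weights stay
# in full precision, at the cost of some inference speed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # no merge
```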
Thanks for sharing your findings @benjamin-marie. It is true that merging will degrade precision, but it improves runtime performance, so it's a trade-off.
Do you mean without step 7 (merging) or do you mean that AWQ and GPTQ are better when merging the LoRA weights?
I agree that all these steps are correct and yield a model that should perform the same as the adapter obtained at the end of SFT. However, quantizing the merged model with GPTQ or AWQ instead of bnb usually yields better results (perplexity) much closer to the unquantized merged model.
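For the GPTQ variant, a sketch using the transformers integration could look like this (requires the optimum and auto-gptq/gptqmodel packages; the merged-model path, output directory, and calibration dataset are placeholders):

```python
# Sketch: quantize the merged full-precision model with GPTQ instead of bnb.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("merged-model")
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading, using the calibration dataset above.
gptq_model = AutoModelForCausalLM.from_pretrained(
    "merged-model",
    quantization_config=gptq_config,
    device_map="auto",
)
gptq_model.save_pretrained("merged-model-gptq")
```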
Thanks for explaining further. This is probably a topic we should explore further, as it has come up a few times in the past. Ideally, we can collect some best practices and share them in the docs. I'm very interested in running some experiments with different steps and quantization techniques. If you have any code to share (or checkpoints), please feel free to do so.
My experiments are almost one year old. I'll rerun some experiments with the updated packages and reevaluate everything. And I'll share the results and a notebook.
Fantastic, thanks a lot!
System Info
peft 0.14.0
transformers 4.48.0
bitsandbytes 0.45.0
Who can help?
@BenjaminBossan @sayakpaul
Information
Tasks
An officially supported task in the examples folder
Reproduction
code:
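The original reproduction snippet is not shown; the kind of code that triggers this warning is roughly the following (a sketch with placeholder model name and adapter path):

```python
# Sketch: merging a LoRA adapter into a bnb 4-bit quantized base model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Merging into the 4-bit weights emits the warning quoted in the title.
merged = model.merge_and_unload()
```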
Warning:
Merge lora module to 4-bit linear may get different generations due to rounding errors.
Expected behavior
merge_and_unload() should merge the LoRA weights correctly and without emitting this warning.