orthogonal lora layer init #2389
base: main
Conversation
Thanks a lot for adding an option for orthogonal initialization of LoRA weights.
Note that the OLoRA initialization is also aimed at orthogonal initialization. Maybe it would be worth it to compare the two. A disadvantage of OLoRA is, however, that the base weights are also modified, which requires users to take some extra steps if they want to load the model with other LoRA adapters, for instance. Pinging @tokenizer-decode just in case they wanna check this PR.
Before merging, we would also need some more additions to this PR:
- Update the docstring of `LoraConfig`, similar to the help text. How about also adding a link to the blog post (AFAICT there is no paper?).
- Add a unit test. Check out the tests in this test class (a rough sketch follows this list).
- Let's run `make style` to satisfy the linter.
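As a starting point for the requested unit test, here is a rough, hedged sketch. It assumes the PR exposes the option as `init_lora_weights="orthogonal"` on `LoraConfig` and that an odd rank raises a `ValueError` (as suggested further down in the review); the model name and `target_modules` are only illustrative.

```python
# Hedged test sketch, not the PR's actual tests. Assumes the PR adds
# init_lora_weights="orthogonal" and an even-rank check that raises ValueError.
import pytest
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model


def test_lora_orthogonal_init_even_rank():
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    config = LoraConfig(r=8, init_lora_weights="orthogonal", target_modules=["c_attn"])
    # Should build without error when the rank is even.
    peft_model = get_peft_model(model, config)
    assert peft_model is not None


def test_lora_orthogonal_init_odd_rank_raises():
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    config = LoraConfig(r=7, init_lora_weights="orthogonal", target_modules=["c_attn"])
    # Assumed behavior per the review request: odd ranks are rejected.
    with pytest.raises(ValueError):
        get_peft_model(model, config)
```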
X = torch.randn(rank, rank)
Q, _ = torch.linalg.qr(X)
set1 = Q[0::2, :]  # rows 0, 2, 4, ... (odd-numbered rows when counting from 1)
set2 = Q[1::2, :]  # rows 1, 3, 5, ... (even-numbered rows when counting from 1)
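For context, a self-contained check (plain PyTorch, independent of the PR) of the property this snippet relies on: the rows of Q are orthonormal, so the two interleaved halves are orthonormal internally and orthogonal to each other.

```python
import torch

rank = 8  # must be even for the interleaved split below
X = torch.randn(rank, rank)
Q, _ = torch.linalg.qr(X)  # Q is orthogonal (almost surely, since X is full rank)
set1 = Q[0::2, :]          # shape (rank // 2, rank)
set2 = Q[1::2, :]          # shape (rank // 2, rank)

# Each half has orthonormal rows, and the halves are mutually orthogonal.
assert torch.allclose(set1 @ set1.T, torch.eye(rank // 2), atol=1e-5)
assert torch.allclose(set2 @ set2.T, torch.eye(rank // 2), atol=1e-5)
assert torch.allclose(set1 @ set2.T, torch.zeros(rank // 2, rank // 2), atol=1e-5)
```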
`r` needs to be even for this to work, right? Let's check it and raise an error with a helpful message if it's not.
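A hedged sketch of the kind of check being asked for; the helper name is hypothetical, and the exact wording and placement are up to the PR author.

```python
def check_orthogonal_init_rank(r: int) -> None:
    # Hypothetical guard: the interleaved split of Q only works for even ranks,
    # so fail early with a clear message instead of producing a shape error later.
    if r % 2 != 0:
        raise ValueError(
            f"Orthogonal initialization of LoRA weights requires an even rank, but got r={r}. "
            "Please use an even `r` or choose a different `init_lora_weights` option."
        )
```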
This is just OLoRA, but starting from random weights. How can starting from random weights, rather than getting that information from the pretrained weights, converge faster? Did you actually run tests? Our research, and every subsequent study, showed that OLoRA and other derivatives such as PiSSA perform better than any random initialization. For a list of studies, see.
@@ -369,6 +369,7 @@ class LoraConfig(PeftConfig):
     "nonnegative integer. "
     "Passing `'corda'` results in CorDA initialization. "
     "Pass `'loftq'` to use LoftQ initialization."
+    "Pass `'orthogonal'` to use orthogonal initialization."
I think this is confusing to the user.
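For readers following along, a hedged usage sketch of how the proposed option would be selected; the string value follows the docstring above, while the model choice, rank, and target modules are purely illustrative.

```python
# Illustrative only: assumes the PR adds "orthogonal" as an accepted value
# for init_lora_weights. The rank should be even (see the review comment above).
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights="orthogonal",
)
```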
@tokenizer-decode Thanks for commenting. It would indeed be nice to see a comparison with OLoRA or PiSSA, which the linked blog post didn't test. I could see an argument for the proposed initialization method being easier to use, as the base weights are unchanged, so even if it's not as good, there could be some value. WDYT?
I honestly don't see the performance benefit. But if you think there is an ease-of-use benefit, there could be some value. This goes for every other decomposition method, e.g. SVD. If the value is in not updating the base weights, we could always let the user opt into that via a parameter. But I might add, for future readers who are confused: updating the base weights is generally where you get the performance.
Thanks for running the tests 🎉 Is the script publicly available so that we can check what might be going on with OLoRA?
In general, we would like to avoid this, even though it could be practical. The reason is that we wouldn't be able to serialize it.

In sum, I think we can still proceed with the orthogonal weight initialization method. As I mentioned, even if it did not outperform OLoRA or similar methods, it could still be valuable as a more user-friendly option.
see: https://datta0.github.io/posts/rethink-lora-init/