orthogonal lora layer init #2389
base: main
Conversation
Thanks a lot for adding an option for orthogonal initialization of LoRA weights.
Note that the OLoRA initialization is also aimed at orthogonal initialization. Maybe it would be worth it to compare the two. A disadvantage of OLoRA is, however, that the base weights are also modified, which requires users to take some extra steps if they want to load the model with other LoRA adapters, for instance. Pinging @tokenizer-decode just in case they wanna check this PR.
Before merging, we would also need some more additions to this PR:
- Update the docstring of `LoraConfig`, similar to the help text. How about also adding a link to the blog post (AFAICT there is no paper?).
- Add a unit test. Check out the tests in this test class (a rough sketch follows this list).
- Let's run `make style` to satisfy the linter.
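As a starting point for the requested unit test, here is a rough, hedged sketch. It assumes the PR exposes the option as `init_lora_weights="orthogonal"` on `LoraConfig` and that an odd rank raises a `ValueError` (as suggested further down in the review); the model name and `target_modules` are only illustrative.

```python
# Hedged test sketch, not the PR's actual tests. Assumes the PR adds
# init_lora_weights="orthogonal" and an even-rank check that raises ValueError.
import pytest
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model


def test_lora_orthogonal_init_even_rank():
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    config = LoraConfig(r=8, init_lora_weights="orthogonal", target_modules=["c_attn"])
    # Should build without error when the rank is even.
    peft_model = get_peft_model(model, config)
    assert peft_model is not None


def test_lora_orthogonal_init_odd_rank_raises():
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
    config = LoraConfig(r=7, init_lora_weights="orthogonal", target_modules=["c_attn"])
    # Assumed behavior per the review request: odd ranks are rejected.
    with pytest.raises(ValueError):
        get_peft_model(model, config)
```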
X = torch.randn(rank, rank)
Q, _ = torch.linalg.qr(X)
set1 = Q[0::2, :]  # rows 0, 2, 4, ... (odd-numbered rows when counting from 1)
set2 = Q[1::2, :]  # rows 1, 3, 5, ... (even-numbered rows when counting from 1)
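For context, a self-contained check (plain PyTorch, independent of the PR) of the property this snippet relies on: the rows of Q are orthonormal, so the two interleaved halves are orthonormal internally and orthogonal to each other.

```python
import torch

rank = 8  # must be even for the interleaved split below
X = torch.randn(rank, rank)
Q, _ = torch.linalg.qr(X)  # Q is orthogonal (almost surely, since X is full rank)
set1 = Q[0::2, :]          # shape (rank // 2, rank)
set2 = Q[1::2, :]          # shape (rank // 2, rank)

# Each half has orthonormal rows, and the halves are mutually orthogonal.
assert torch.allclose(set1 @ set1.T, torch.eye(rank // 2), atol=1e-5)
assert torch.allclose(set2 @ set2.T, torch.eye(rank // 2), atol=1e-5)
assert torch.allclose(set1 @ set2.T, torch.zeros(rank // 2, rank // 2), atol=1e-5)
```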
`r` needs to be even for this to work, right? Let's check it and raise an error with a helpful message if it's not.
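A hedged sketch of the kind of check being asked for; the helper name is hypothetical, and the exact wording and placement are up to the PR author.

```python
def check_orthogonal_init_rank(r: int) -> None:
    # Hypothetical guard: the interleaved split of Q only works for even ranks,
    # so fail early with a clear message instead of producing a shape error later.
    if r % 2 != 0:
        raise ValueError(
            f"Orthogonal initialization of LoRA weights requires an even rank, but got r={r}. "
            "Please use an even `r` or choose a different `init_lora_weights` option."
        )
```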
This is just OLoRA, but starting from random weights. How can starting from random weights, rather than getting that information from the pretrained weights, converge faster? Did you actually run tests? Our research, and every subsequent study, showed that OLoRA and other derivatives such as PiSSA perform better than any random initialization. For a list of studies, see.
@@ -369,6 +369,7 @@ class LoraConfig(PeftConfig):
     "nonnegative integer. "
     "Passing `'corda'` results in CorDA initialization. "
     "Pass `'loftq'` to use LoftQ initialization."
+    "Pass `'orthogonal'` to use orthogonal initialization."
I think this is confusing to the user.
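For readers following along, a hedged usage sketch of how the proposed option would be selected; the string value follows the docstring above, while the model choice, rank, and target modules are purely illustrative.

```python
# Illustrative only: assumes the PR adds "orthogonal" as an accepted value
# for init_lora_weights. The rank should be even (see the review comment above).
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights="orthogonal",
)
```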
@tokenizer-decode Thanks for commenting. It would indeed be nice to see a comparison with OLoRA or PiSSA, which the linked blog post didn't test. I could see an argument for the proposed initialization method being easier to use, as the base weights are unchanged, so even if it's not as good, there could be some value. WDYT?
I honestly don't see the performance benefit. But if you think there is an ease-of-use benefit, there could be some value. This goes for every other decomposition method, e.g. SVD. If the value is in not updating the base weights, we could always let the user opt into that via a parameter. But I might add, for future readers who are confused: updating the base weights is generally where you get the performance.
Thanks for running the tests 🎉 Is the script publicly available so that we can check what might be going on with OLoRA?
In general, we would like to avoid this, even though it could be practical. The reason is that we wouldn't be able to serialize it.

In sum, I think we can still proceed with the orthogonal weight initialization method. As I mentioned, even if it did not outperform OLoRA or similar methods, it could still be valuable as a more user-friendly option.
see: https://datta0.github.io/posts/rethink-lora-init/