Add epsilon #47

Merged
merged 3 commits into from
Feb 12, 2025

SimJeg
Collaborator

@SimJeg SimJeg commented Feb 11, 2025

Inspired by experiments on CriticalKVPress (#46), I noticed that the most important parameter is epsilon, which appears to be key to a large performance boost. In this PR I propose a very simple change to ExpectedAttentionPress to include this epsilon. I get even better performance using ||WoV|| instead of ||V|| (see branch simon/update-vnorm).
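
To make the role of epsilon concrete, here is a minimal sketch in plain Python. It assumes the score of each KV pair is `(attention + epsilon) * ||V||`; that formula, the function names `score_kv_pairs` and `keep_top_k`, and the toy numbers are illustrative assumptions, not the actual ExpectedAttentionPress code.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain Python list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def score_kv_pairs(attention, values, epsilon=0.0):
    """Assumed scoring rule: (attention_i + epsilon) * ||v_i||.

    `attention` holds one (expected) attention weight per KV pair,
    `values` holds the matching value vectors.
    """
    return [(a + epsilon) * l2_norm(v) for a, v in zip(attention, values)]


def keep_top_k(scores, k):
    """Return the sorted indices of the k highest-scoring KV pairs."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: the last two KV pairs get almost no attention but have large values.
attention = [0.70, 0.29, 0.01, 0.00]
values = [[1.0, 0.0], [0.5, 0.5], [3.0, 4.0], [6.0, 8.0]]

kept_without_eps = keep_top_k(score_kv_pairs(attention, values, epsilon=0.0), k=2)
kept_with_eps = keep_top_k(score_kv_pairs(attention, values, epsilon=0.1), k=2)

print(kept_without_eps)
print(kept_with_eps)
```

In this toy setup, epsilon keeps a pair with near-zero attention from being scored near zero regardless of its value norm, so the value norm has more influence on which pairs are kept. Whether this is what drives the gains reported here is not established in this thread.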

@SimJeg SimJeg requested a review from maxjeblick February 12, 2025 08:11
@SimJeg
Collaborator Author

SimJeg commented Feb 12, 2025

Below are additional results for x3 and x4 compression. I'm wondering if I should add CriticalKVPress.vwl1norm in this PR too 🤔 cc. @FFY0

[image: results for x3 and x4 compression]

Collaborator

@maxjeblick maxjeblick left a comment

LGTM thanks!

@FFY0
Contributor

FFY0 commented Feb 12, 2025

> I'm wondering if I should add CriticalKVPress.vwl1norm in this PR too 🤔 cc. @FFY0

In my opinion, it’s worth adding.😀

This operation provides clear benefits while introducing little overhead in real-world deployment. My co-author, Junlin Lv, and I have developed a Triton kernel that optimizes this computation through kernel fusion, reducing memory usage significantly while maintaining high computational efficiency. We plan to open-source this kernel soon. Even with a naive implementation, I believe its overhead in inference remains negligible.

Moreover, to the best of my knowledge, this is the first attempt to leverage pre-trained model parameters to identify critical KV cache entries, making it a promising new research direction. I believe this direction is worth exploring further, as incorporating additional pre-trained parameter information could drive meaningful advancements in the future.

@SimJeg
Collaborator Author

SimJeg commented Feb 12, 2025

I will merge it as is and we'll investigate later! Using ||WoV|| instead of ||V|| makes a lot of sense, but the main contribution of this PR is the addition of epsilon, which is far more important. It still needs to be investigated why this epsilon works so well.
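
For readers unfamiliar with the ||WoV|| idea, the sketch below contrasts the norm of a raw value vector with the norm of the same vector after an output projection. It is a toy illustration under stated assumptions: the matrix `w_o`, the helper names, the use of the Euclidean norm, and the single-head layout are all hypothetical and not taken from the PR or from CriticalKVPress.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def l2_norm(vec):
    """Euclidean norm; which norm the real code uses is an assumption here."""
    return math.sqrt(sum(x * x for x in vec))


# Hypothetical output projection that discards the second value dimension.
w_o = [
    [1.0, 0.0],
    [0.0, 0.0],
]

v_a = [1.0, 0.0]  # small value, fully preserved by w_o
v_b = [0.0, 5.0]  # large value, fully discarded by w_o

raw_norms = [l2_norm(v_a), l2_norm(v_b)]
projected_norms = [l2_norm(matvec(w_o, v_a)), l2_norm(matvec(w_o, v_b))]

print(raw_norms)
print(projected_norms)
```

Here ||V|| ranks `v_b` above `v_a`, while ||WoV|| ranks them the other way round, because the projection removes everything `v_b` carries. That is the intuition for why a norm taken after the projection can be a better proxy for how much a value contributes to the layer output.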

@SimJeg SimJeg merged commit cc4bf60 into main Feb 12, 2025
2 checks passed
@SimJeg SimJeg deleted the simon/update-vnorm-2 branch February 12, 2025 13:36
3 participants