Support token level embeddings #64

Open
enjalot opened this issue Sep 20, 2024 · 1 comment
Labels: help wanted, python
Milestone: 2.0

Comments
@enjalot (Owner)

enjalot commented Sep 20, 2024

Our current approach embeds datasets with Sentence Transformers, which gives us one embedding per "chunk" of text (whether we pass in 500 tokens or 100 tokens, we always get one embedding). Sentence Transformers "pools" the token embeddings into a single one, usually by averaging them or by taking just the first or last token's embedding.
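
For reference, a minimal sketch (not our actual pipeline code) of what that mean-pooling step looks like with Hugging Face transformers; the model name and input text are placeholders:

```python
# Mean pooling: N token embeddings averaged into one chunk embedding.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("an example chunk of text", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state   # (1, num_tokens, dim)

# Mask out padding, then average over the token dimension -> one vector per chunk.
mask = inputs["attention_mask"].unsqueeze(-1)               # (1, num_tokens, 1)
chunk_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)   # (1, dim)
```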

There is another technique gaining popularity called ColBERT that instead gives you an embedding for each token. A recent model is jina-colbert-v2.

One could also imagine just getting back the hidden states from something like Llama-3.1-8B and working with those token-level embeddings.
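
A hedged sketch of pulling per-token hidden states with Hugging Face transformers; the model id is an assumption (it requires gated access), and keeping only the last layer is just one reasonable choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tokenizer("an example chunk of text", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each shaped
# (batch, num_tokens, 4096); the last entry gives one embedding per token.
token_embeddings = out.hidden_states[-1]
```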

When you don't pool, you don't throw away information, but of course this explodes the file size of the stored embeddings. It may still be worth it, and there are some things we could do to support it.

One thing to try would be using RAGatouille to handle the nearest neighbor search.
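
Roughly, and assuming RAGatouille's high-level index/search interface, that might look like the sketch below; the documents and index name are made up:

```python
# Index documents with a ColBERT model, then run late-interaction
# nearest neighbor search over the token-level embeddings.
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(
    collection=["first chunk of text", "second chunk of text"],
    index_name="latent-scope-token-level",   # hypothetical index name
)
results = RAG.search(query="what is in the first chunk?", k=2)
```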

Another thing to try is to store the top SAE features of each token rather than its full embedding vector. In theory, if an SAE is "good" it will reconstruct the embedding pretty well, and we could cut 4096-dimensional Llama embeddings down to e.g. 128 stored values per token (64 indices and 64 activations for a top-64 SAE).
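
A minimal sketch of that storage scheme; the encoder weights below are random placeholders standing in for a trained top-64 SAE:

```python
import torch

d_model, d_sae, k = 4096, 65536, 64
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5   # placeholder encoder
b_enc = torch.zeros(d_sae)

token_embedding = torch.randn(d_model)                 # one token's embedding

acts = torch.relu(token_embedding @ W_enc + b_enc)     # SAE feature activations
top_vals, top_idx = acts.topk(k)                       # keep only the top 64

# Store 64 int32 indices + 64 float activations (128 values) per token
# instead of 4096 floats -- a 32x reduction in stored values.
compressed = (top_idx.to(torch.int32), top_vals)
```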

@enjalot enjalot added help wanted Extra attention is needed python labels Sep 20, 2024
@enjalot enjalot added this to the 2.0 milestone Sep 20, 2024
@enjalot (Owner, Author)

enjalot commented Sep 24, 2024

This is an interesting technique for reducing the storage footprint of token-level embeddings: https://arxiv.org/html/2409.14683v1
