Support token level embeddings #64

Open
enjalot opened this issue Sep 20, 2024 · 1 comment
Labels: help wanted, python
Milestone: 2.0

Comments
@enjalot (Owner)

enjalot commented Sep 20, 2024

Our current approach embeds datasets with Sentence Transformers, which gives us one embedding per "chunk" of text (whether we pass in 500 tokens or 100 tokens, we always get one embedding). Sentence Transformers "pools" the token embeddings into a single one, usually by averaging them or by taking just the first or last token's embedding.
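
For reference, a minimal sketch (not our actual pipeline code) of what that mean-pooling step looks like with Hugging Face transformers; the model name and input text are placeholders:

```python
# Mean pooling: N token embeddings averaged into one chunk embedding.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("an example chunk of text", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state   # (1, num_tokens, dim)

# Mask out padding, then average over the token dimension -> one vector per chunk.
mask = inputs["attention_mask"].unsqueeze(-1)               # (1, num_tokens, 1)
chunk_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)   # (1, dim)
```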

There is another technique gaining popularity called ColBERT that instead gives you an embedding for each token. A recent model is jina-colbert-v2.

One could also imagine just getting back the hidden states from something like Llama-3.1-8B and working with those token-level embeddings.
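
A hedged sketch of pulling per-token hidden states with Hugging Face transformers; the model id is an assumption (it requires gated access), and keeping only the last layer is just one reasonable choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tokenizer("an example chunk of text", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each shaped
# (batch, num_tokens, 4096); the last entry gives one embedding per token.
token_embeddings = out.hidden_states[-1]
```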

When you don't pool, you don't throw away information, but of course this explodes the file size of the stored embeddings. It may still be worth it, and there are some things we could do to support it.

One thing to try would be using RAGatouille to handle the nearest neighbor search.
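
Roughly, and assuming RAGatouille's high-level index/search interface, that might look like the sketch below; the documents and index name are made up:

```python
# Index documents with a ColBERT model, then run late-interaction
# nearest neighbor search over the token-level embeddings.
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(
    collection=["first chunk of text", "second chunk of text"],
    index_name="latent-scope-token-level",   # hypothetical index name
)
results = RAG.search(query="what is in the first chunk?", k=2)
```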

Another thing to try is to store the top SAE features of each token rather than its full embedding vector. In theory, if an SAE is "good" it will reconstruct the embedding pretty well, and we could cut 4096-dimensional Llama embeddings down to e.g. 128 stored values per token (64 indices and 64 activations for a top-64 SAE).
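
A minimal sketch of that storage scheme; the encoder weights below are random placeholders standing in for a trained top-64 SAE:

```python
import torch

d_model, d_sae, k = 4096, 65536, 64
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5   # placeholder encoder
b_enc = torch.zeros(d_sae)

token_embedding = torch.randn(d_model)                 # one token's embedding

acts = torch.relu(token_embedding @ W_enc + b_enc)     # SAE feature activations
top_vals, top_idx = acts.topk(k)                       # keep only the top 64

# Store 64 int32 indices + 64 float activations (128 values) per token
# instead of 4096 floats -- a 32x reduction in stored values.
compressed = (top_idx.to(torch.int32), top_vals)
```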

@enjalot enjalot added help wanted Extra attention is needed python labels Sep 20, 2024
@enjalot enjalot added this to the 2.0 milestone Sep 20, 2024
@enjalot (Owner, Author)

enjalot commented Sep 24, 2024

This is an interesting technique for reducing the storage footprint of token-level embeddings: https://arxiv.org/html/2409.14683v1
