I wanted to experiment with how quickly exact RAG lookups can be performed over a binary vector space.
Why binary?
It turns out the accuracy is very similar to a full 32-bit vector. But we save a lot in terms of server costs, and it makes in-memory retrieval more feasible.
A database that was once 1TB is now ~32GB, which can easily fit in RAM on a much cheaper setup.
Assuming each row is a 1024-dim float32 vector, that's 4096 bytes per row, or ~244 million rows in 1TB.
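As a quick sanity check, here is that arithmetic worked out (the 1TB figure is the assumption):

bytes_per_row = 1024 * 4          # 1024 float32 values per row -> 4096 bytes
rows = 10^12 ÷ bytes_per_row      # ≈ 244 million rows fit in 1TB
rows * (1024 ÷ 8) / 10^9          # ≈ 31 GB once each row is a 1024-bit vector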
For some added context, Wikipedia, which is arguably the largest publicly available dataset you might want to RAG over, might only be around 15 million rows.
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
- 6,818,523 articles
- average 668 words per article (~4.55 billion words total)
- According to OpenAI, 750 words is roughly 1,000 tokens, so let's say ~6B tokens.
- if we use the mixedbread model we can represent 512 tokens per vector; assuming overlapping chunks, let's say 400 new tokens per vector with 112 used for overlap
6B / 400 = ~15,000,000 vectors
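The same estimate worked through in code (every input here is a rough assumption from the list above):

words   = 6_818_523 * 668     # articles × average words ≈ 4.55B
tokens  = words * 1000 ÷ 750  # ~1,000 tokens per 750 words ≈ 6.07B
vectors = tokens ÷ 400        # ~400 new tokens per vector ≈ 15.2M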
Chances are internal RAGs are going to be significantly smaller than this.
And so this opens up the question:
"At what point is the dataset too big that exact brute search over binary vectors is not entirely feasible?"
If our dataset is smaller than this, then we can just do the dumb thing in a very small amount of code and be completely fine.
We might not need a fancy vector DB or the newest HNSW library (HNSW is approximate search anyway, we're doing exact).
I'm doing this in Julia because:
- I like it. :-)
- For this kind of workload it should get performance comparable to C, C++, or Rust, so we're not missing out on potential gains.
julia> versioninfo()
Julia Version 1.11.0-beta1
Commit 08e1fc0abb9 (2024-04-10 08:40 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 4 default, 0 interactive, 2 GC (on 4 virtual cores)
Environment:
JULIA_STACKTRACE_MINIMAL = true
DYLD_LIBRARY_PATH = /Users/lunaticd/.wasmedge/lib
JULIA_EDITOR = nvim
Ok, let's begin!
Each vector is going to be n bits, where n is the size of the embedding. This will vary depending on the embedder used; there are several proprietary and open-source options. I'll be assuming we're using the mixedbread model: mixedbread-ai/mxbai-embed-large-v1
This model returns a 1024-element float32 vector. Through binarization this is converted into a 1024-bit vector, which can be represented as a 128-element int8 vector, i.e. 128 bytes. This can be reduced further to 64 bytes, which is our final representation: a 512-bit vector represented as 64 int8 or uint8 elements.
julia> x = rand(Int8, 64)
64-element Vector{Int8}:
26
123
70
-2
⋮
Keep in mind that although the representation is in bytes we're operating at the bit level.
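For completeness, here's a minimal sketch of how a float32 embedding could be binarized by sign and packed into such a byte vector. The helper name and the "positive means 1" rule are my own assumptions for illustration, not taken from the mixedbread documentation:

# Hypothetical helper: binarize an embedding by sign and pack the bits into bytes.
function binarize(v::AbstractVector{Float32})::Vector{UInt8}
    @assert length(v) % 8 == 0
    out = zeros(UInt8, length(v) ÷ 8)
    for i in eachindex(v)
        if v[i] > 0  # positive component -> bit set to 1
            out[(i - 1) ÷ 8 + 1] |= UInt8(1) << ((i - 1) % 8)
        end
    end
    return out
end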
Comparing two bit vectors will require the Hamming distance. Luckily the definition is very straightforward: each time a bit in the two vectors doesn't match, add 1 to the distance. So two bit vectors of length 512 that differ in every position have a Hamming distance of 512, and two identical vectors have a distance of 0.
An easy way to do this is to turn the integer representation into a bitstring and then compute the distance on that.
julia> bitstring(Int8(125))
"01111101"
julia> bitstring(Int8(33))
"00100001"
julia> function hamming_distance(s1::AbstractString, s2::AbstractString)::Int
           s = 0
           for (c1, c2) in zip(s1, s2)
               if c1 != c2
                   s += 1
               end
           end
           s
       end
julia> hamming_distance(bitstring(Int8(125)), bitstring(Int8(33))) # should be 4
4
Julia has a lovely benchmarking package called Chairmarks. Let's see how fast this implementation is.
julia> using Chairmarks
julia> @be hamming_distance(bitstring(Int8(125)), bitstring(Int8(33)))
Benchmark: 6053 samples with 317 evaluations
min 36.672 ns (4 allocs: 128 bytes)
median 38.514 ns (4 allocs: 128 bytes)
mean 46.525 ns (4 allocs: 128 bytes, 0.13% gc time)
max 8.737 μs (4 allocs: 128 bytes, 98.30% gc time)
This is honestly pretty good. For comparison let's do a simple addition:
julia> @be 125 + 33
Benchmark: 7274 samples with 8775 evaluations
min 1.239 ns
median 1.339 ns
mean 1.453 ns
max 6.990 ns
We're not going to get much lower than this for a single operation; the nanosecond is basically the atomic unit of time when working with CPUs.
However, this does not mean each operation has to take ~1ns. There are various optimizations the hardware can do such that several operations complete each clock cycle. Each CPU core has a number of arithmetic units that process elementary operations (addition, subtraction, xor, or, and, etc.), and these units are kept busy through pipelining. The point is that there is more than one arithmetic unit, so we can conceivably do several additions or bit operations in the same clock cycle as long as they do not depend on each other. Let's rewrite hamming_distance using only elementary operations so that we might take advantage of this.
Let's consider comparing two bits:
- 0, 0 -> 0
- 1, 0 -> 1
- 0, 1 -> 1
- 1, 1 -> 0
This is the XOR operation. We want to XOR all the bits and keep track of which ones are 1, and return the sum.
To do this over all the bits, we can XOR the two integers once and then shift the result right by each bit position, masking out the lowest bit; for an 8-bit number that means checking positions 0 through 7.
Here's what this looks like:
const BYTE = Union{Int8,UInt8}

@inline function hamming_distance(x1::T, x2::T)::Int where {T<:BYTE}
    r = x1 ⊻ x2  # xor
    # count how many of the xored bits are 1
    c = 0
    for i in 0:7
        c += (r >> i) & 1
    end
    return Int(c)
end
# benchmark again
julia> @be hamming_distance(Int8(33), Int8(125))
Benchmark: 4797 samples with 8966 evaluations
min 2.175 ns
median 2.189 ns
mean 2.205 ns
max 5.279 ns
This is a huge improvement, almost as fast as a single addition!
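As an aside, Julia's built-in count_ones computes the same bit count (it should lower to a hardware popcount instruction), so the per-byte distance can also be written as a one-liner. The rest of the post keeps using the explicit loop version:

# Equivalent one-liner: xor the bytes, then popcount the result.
@inline hamming_distance_popcount(x1::T, x2::T) where {T<:BYTE} = count_ones(x1 ⊻ x2)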
Now we need to do this for an entire vector:
@inline function hamming_distance1(x1::AbstractArray{T}, x2::AbstractArray{T})::Int where {T<:BYTE}
    s = 0
    for i in 1:length(x1)
        s += hamming_distance(x1[i], x2[i])
    end
    s
end
We expect this to take around 128ns (~2ns × 64 elements).
julia> @be hamming_distance1(q1, q2)
Benchmark: 4469 samples with 256 evaluations
min 69.500 ns
median 75.035 ns
mean 80.240 ns
max 189.941 ns
There are a few more things we can do to further optimize this loop. The compiler sometimes does these automatically, but we can annotate the loop manually to be certain.
@inline function hamming_distance(x1::AbstractArray{T}, x2::AbstractArray{T})::Int where {T<:BYTE}
    s = 0
    @inbounds @simd for i in eachindex(x1, x2)
        s += hamming_distance(x1[i], x2[i])
    end
    s
end
- @inbounds removes bounds checking.
- @simd tells the compiler to use SIMD instructions if possible.
julia> @be hamming_distance(q1, q2)
Benchmark: 4441 samples with 323 evaluations
min 52.759 ns
median 56.245 ns
mean 61.721 ns
max 163.827 ns
That's a decent improvement. This benchmark comes in at 20-30ns on a fresh REPL session; I think it depends on current background load and how well SIMD can be utilized at the time.
julia> q1 = rand(Int8, 64);
julia> q2 = rand(Int8, 64);
julia> hamming_distance(q1, q2);
julia> @be hamming_distance(q1, q2)
Benchmark: 3546 samples with 1000 evaluations
min 23.333 ns
median 23.459 ns
mean 24.461 ns
max 59.542 ns
julia> @be hamming_distance(q1, q2)
Benchmark: 4653 samples with 749 evaluations
min 23.308 ns
median 25.088 ns
mean 26.595 ns
max 62.473 ns
It's good to know that in a perfect environment, such as an isolated machine, you could make really good use of the hardware.
Let's say each comparison takes 50ns; then we can search 20M rows per second on one core, or 80M rows using 4 cores.
1M rows should take ~12ms over 4 cores, and Wikipedia at 15M rows roughly 190ms.
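Worked out from the assumed ~50ns per comparison (estimates, not measurements):

per_cmp = 50e-9                   # seconds per vector comparison (assumed)
1 / per_cmp                       # = 2e7 rows/s on a single core
1_000_000  * per_cmp / 4          # ≈ 0.0125 s for 1M rows on 4 cores
15_000_000 * per_cmp / 4          # ≈ 0.19 s for Wikipedia-sized data on 4 cores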
In practice it turns out it's faster than this:
mutable struct MaxHeap
    const data::Vector{Pair{Int,Int}}  # distance => row index
    current_idx::Int                   # add pairs until current_idx > length(data)
    const k::Int
    function MaxHeap(k::Int)
        new(fill((typemax(Int) => -1), k), 1, k)
    end
end

function insert!(heap::MaxHeap, value::Pair{Int,Int})
    if heap.current_idx <= heap.k
        # heap not full yet: just append, then heapify once it fills up
        heap.data[heap.current_idx] = value
        heap.current_idx += 1
        if heap.current_idx > heap.k
            makeheap!(heap)
        end
    elseif value.first < heap.data[1].first
        # closer than the current worst of the k: replace the root and sift down
        heap.data[1] = value
        heapify!(heap, 1)
    end
end

function makeheap!(heap::MaxHeap)
    for i in div(heap.k, 2):-1:1
        heapify!(heap, i)
    end
end

function heapify!(heap::MaxHeap, i::Int)
    left = 2 * i
    right = 2 * i + 1
    largest = i
    if left <= length(heap.data) && heap.data[left].first > heap.data[largest].first
        largest = left
    end
    if right <= length(heap.data) && heap.data[right].first > heap.data[largest].first
        largest = right
    end
    if largest != i
        heap.data[i], heap.data[largest] = heap.data[largest], heap.data[i]
        heapify!(heap, largest)
    end
end
function _k_closest(
    db::AbstractVector{V},
    query::AbstractVector{T},
    k::Int;
    startind::Int = 1,
) where {T<:Union{Int8,UInt8},V<:AbstractVector{T}}
    heap = MaxHeap(k)
    @inbounds for i in eachindex(db)
        d = hamming_distance(db[i], query)
        insert!(heap, d => startind + i - 1)
    end
    # the heap's backing array: the k smallest (distance => index) pairs, unsorted
    return heap.data
end

function k_closest(
    db::AbstractVector{V},
    query::AbstractVector{T},
    k::Int;
    startind::Int = 1,
) where {T<:Union{Int8,UInt8},V<:AbstractVector{T}}
    data = _k_closest(db, query, k; startind=startind)
    return sort!(data, by = x -> x[1])
end

using Base.Threads  # for nthreads() and Threads.@spawn

function k_closest_parallel(
    db::AbstractArray{V},
    query::AbstractVector{T},
    k::Int,
) where {T<:Union{Int8,UInt8},V<:AbstractVector{T}}
    n = length(db)
    t = nthreads()
    # split the database into contiguous chunks, one task per chunk
    task_ranges = [(i:min(i + n ÷ t - 1, n)) for i = 1:n÷t:n]
    tasks = map(task_ranges) do r
        Threads.@spawn _k_closest(view(db, r), query, k; startind=r[1])
    end
    results = fetch.(tasks)
    # merge the per-chunk candidates and keep the overall k closest
    sort!(vcat(results...), by = x -> x[1])[1:k]
end
The heap structure can be ignored; it's just a max heap with a fixed capacity. Each section of the database being searched gets its own heap, and at the end we sort the combined candidates and pick the top k results.
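One practical note: k_closest_parallel only helps if Julia was started with more than one thread; otherwise nthreads() is 1 and it effectively runs serially. For example:

$ julia -t 4          # or set JULIA_NUM_THREADS=4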
1M rows benchmark:
julia> X1 = [rand(Int8, 64) for _ in 1:(10^6)];
julia> k_closest(X1, q1, 1)
1-element Vector{Pair{Int64, Int64}}:
200 => 770015
julia> @be k_closest(X1, q1, 1)
Benchmark: 7 samples with 1 evaluation
min 14.767 ms (3 allocs: 112 bytes)
median 14.811 ms (3 allocs: 112 bytes)
mean 15.209 ms (3 allocs: 112 bytes)
max 16.196 ms (3 allocs: 112 bytes)
julia> k_closest_parallel(X1, q1, 1)
1-element Vector{Pair{Int64, Int64}}:
200 => 770015
julia> @be k_closest_parallel(X1, q1, 1)
Benchmark: 24 samples with 1 evaluation
min 4.061 ms (54 allocs: 3.422 KiB)
median 4.097 ms (54 allocs: 3.422 KiB)
mean 4.187 ms (54 allocs: 3.422 KiB)
max 5.060 ms (54 allocs: 3.422 KiB)
In the single-threaded run each vector comparison effectively takes ~15ns rather than ~50ns (≈15ms over 10^6 rows).
But wait, there's more! We can do even better using StaticArrays:
using StaticArrays
julia> q1 = SVector{64, Int8}(rand(Int8, 64));
julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(10^6)];
julia> @be k_closest(X1, q1, 1)
Benchmark: 9 samples with 1 evaluation
min 11.551 ms (3 allocs: 112 bytes)
median 11.555 ms (3 allocs: 112 bytes)
mean 11.669 ms (3 allocs: 112 bytes)
max 12.272 ms (3 allocs: 112 bytes)
julia> @be k_closest_parallel(X1, q1, 1)
Benchmark: 31 samples with 1 evaluation
min 3.132 ms (54 allocs: 3.672 KiB)
median 3.136 ms (54 allocs: 3.672 KiB)
mean 3.161 ms (54 allocs: 3.672 KiB)
max 3.385 ms (54 allocs: 3.672 KiB)
Now each vector comparison is taking ~11ns!
However, we don't want just the closest vector, but the k closest ones, so let's set k to something more realistic:
julia> @be k_closest(X1, q1, 20)
Benchmark: 8 samples with 1 evaluation
min 11.559 ms (5 allocs: 832 bytes)
median 11.759 ms (5 allocs: 832 bytes)
mean 12.243 ms (5 allocs: 832 bytes)
max 14.431 ms (5 allocs: 832 bytes)
julia> @be k_closest_parallel(X1, q1, 20)
Benchmark: 31 samples with 1 evaluation
min 3.139 ms (64 allocs: 9.391 KiB)
median 3.149 ms (64 allocs: 9.391 KiB)
mean 3.165 ms (64 allocs: 9.391 KiB)
max 3.228 ms (64 allocs: 9.391 KiB)
julia> @be k_closest(X1, q1, 100)
Benchmark: 9 samples with 1 evaluation
min 11.598 ms (5 allocs: 3.281 KiB)
median 11.604 ms (5 allocs: 3.281 KiB)
mean 11.644 ms (5 allocs: 3.281 KiB)
max 11.831 ms (5 allocs: 3.281 KiB)
julia> @be k_closest_parallel(X1, q1, 100)
Benchmark: 31 samples with 1 evaluation
min 3.174 ms (66 allocs: 30.406 KiB)
median 3.190 ms (66 allocs: 30.406 KiB)
mean 3.206 ms (66 allocs: 30.406 KiB)
max 3.295 ms (66 allocs: 30.406 KiB)
Similar timings since we're using a heap to order the entries.
Just for fun let's search over 15M rows (Wikipedia).
julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(15 * 10^6)];
julia> @be k_closest_parallel(X1, q1, 10)
Benchmark: 3 samples with 1 evaluation
46.547 ms (56 allocs: 5.625 KiB)
46.563 ms (56 allocs: 5.625 KiB)
47.186 ms (56 allocs: 5.625 KiB)
julia> @be k_closest_parallel(X1, q1, 50)
Benchmark: 3 samples with 1 evaluation
46.547 ms (58 allocs: 14.062 KiB)
46.621 ms (58 allocs: 14.062 KiB)
46.624 ms (58 allocs: 14.062 KiB)
And over 100,000 rows:
julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(10^5)];
julia> @be k_closest_parallel(X1, q1, 30)
Benchmark: 281 samples with 1 evaluation
min 330.792 μs (56 allocs: 10.000 KiB)
median 331.625 μs (56 allocs: 10.000 KiB)
mean 340.101 μs (56 allocs: 10.000 KiB)
max 729.500 μs (56 allocs: 10.000 KiB)
Under 1ms!
usearch appears to be the state of the art for in-memory vector similarity search, so let's compare against its exact-search mode.
In [8]: from usearch.index import search, MetricKind
...: import numpy as np
In [12]: X = np.random.randint(-128, 128, size=(10**6, 64), dtype=np.int8)
In [13]: q = np.random.randint(-128, 128, 64, dtype=np.int8)
In [14]: %timeit search(X, q, 1, MetricKind.Hamming, exact=True)
62.4 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit search(X, q, 1, MetricKind.Hamming, exact=True, threads=4)
32.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The Julia implementation is ~6x faster single threaded and ~10x faster when using 4 cores.
Maybe there's a way to make usearch faster, and in all fairness exact search is likely not their priority to make as fast as possible.
That being said, this is around 100 lines of Julia code and we're not doing anything fancy. Even using StaticArrays only gave around a 25-33% improvement, nothing mind-blowing.
With ~100 lines of Julia code we're able to achieve state-of-the-art results in exact similarity search for binary vector spaces.
I don't know about you but I think this is pretty damn awesome!
I realized I never answered the original question (my bad).
"At what point is the dataset too big that exact brute search over binary vectors is not entirely feasible?"
Well it depends, how many RPS do you need to handle? What is the upper limit on how long the query should take?
Right now we're at ~3ms for 1M rows, so 10M rows would be ~30ms.
julia> X1 = [SVector{64, Int8}(rand(Int8, 64)) for _ in 1:(20 * 10^6)];
julia> @be k_closest_parallel(X1, q1, 100)
Benchmark: 2 samples with 1 evaluation
62.155 ms (58 allocs: 23.906 KiB)
63.807 ms (58 allocs: 23.906 KiB)
Even at 20M rows this query is unlikely to be the bottleneck of the application. 20M rows would be ~76GB of the original float32 vectors (1024 elements × 4 bytes each). That's a big dataset. Do you have that dataset? Probably not.
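And the storage arithmetic behind that last claim, comparing the original float32 vectors with the binary ones:

rows = 20 * 10^6
rows * 1024 * 4 / 2^30    # ≈ 76 GiB as 1024-dim float32 vectors
rows * 64 / 2^30          # ≈ 1.2 GiB as 512-bit binary vectors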