This is an unconventional mathematical overview of the PinSketch algorithm without references to coding theory[1].
A sketch, for the purpose of this description, can be seen as a "set checksum" with two peculiar properties:
- Sketches have a predetermined capacity, and when the number of elements in the set is not higher than the capacity, minisketch will always recover the entire set from the sketch. A sketch of b-bit elements with capacity c can be stored in bc bits.
- The sketches of two sets can be combined by adding them (XOR) to obtain a sketch of the symmetric difference between the two sets (i.e., all elements that occur in one but not both input sets).
This overview explains how sets can be converted into a sketch and how a set can be recovered from a sketch.
Data entries as field elements
Every integer in the range [1...2b-1] (the acceptable data elements for a Minisketch sketch with field size b) can be mapped to a nonzero field element of GF(2b). In this finite field, we can add and multiply elements together, with many of the expected properties for those operations. Addition (and subtraction!) of field elements corresponds to bitwise XOR of the integers they correspond to, though multiplication is more involved.
Sets as power series
We define a function S which maps field elements m to the following formal power series (similar to a polynomial, except there can be an infinite number of terms, and we don't care about concepts like convergence as we're never going to actually evaluate it for a specific value of x):
- S(m) = 1 + mx + m2x2 + m3x3 + ....
We then extend this function to operate on sets of field elements, by adding together the images of every set element. If M = {m1, m2, ... }:
- S(M) = S({m1,m2,...}) = S(m1) + S(m2) + ... = (1 + 1 + ...) + (m1 + m2 + ...)x + (m12 + m22 + ...)x2 + (m13 + ...
Because in our field addition corresponds to XOR of integers, it holds for every a that a + a = 0. This carries over to the S function, meaning that S(a) + S(a) = 0 for every a. This means that the coefficients of these power series have the second of the properties we desire from a sketch, namely that an efficient operation exists to combine two sketches such that the result is a sketch of the symmetric difference of the sets. It holds that S({m1,m2}) + S({m2,m3}) = S(m1) + (S(m2) + S(m2)) + S(m3) = S(m1) + S(m3) = S({m1,m3}). The question is whether we can also efficiently recover the elements from their power series' coefficients.
An infinity of coefficients is hard
To make reasoning about these power series easier, notice that the series for a single element is in fact a geometric series. If we were working over real numbers rather than a finite field and |mx| < 1, it would converge to (1 - mx)-1. Convergence has no meaning in formal power series, however it is still the case that:
- (1 - mx) S(m) = 1
You can verify this by seeing that every coefficient except the constant one gets cancelled out by the multiplication. This can be generalized to the series for multiple set elements. For two elements we have:
- (1 - m1x) (1 - m2x) S({m1,m2}) = (1 - m1x) (1 - m2x) (S(m1) + S(m2)) = (1 - m2x) + (1 - m1x)
And for three:
- (1 - m1x) (1 - m2x) (1 - m3x) S({m1,m2,m3}) = (1 - m1x) (1 - m2x) (1 - m3x) (S(m1) + S(m2) + S(m3)) = (1 - m2x)(1 - m3x) + (1 - m1x)(1 - m3x) + (1 - m1x)(1 - m2x)
In each case, we notice that multiplying S(M) with (1 - mix) for each element mi ∈ M results in a polynomial of degree n-1.
Solving for the set elements
The above insight lets us build a solver that extracts the set elements from the coefficients of a power series. If we can find a polynomial L that is the product of n different (1 - mix) factors for various values of mi, such that P = S(M)L is an n-1 degree polynomial, then those values mi are the elements of M.
The coefficients of P are nontrivial expressions of the set elements themselves. However, we can just focus on the coefficients of degree n and higher in P, as those are all 0. Let si be the coefficients of S(M), and li the coefficients of L. In other words, S(M) = s0 + s1x + s2x2 + s3x3 + ... and L = l0 + l1x + l2x2 + l3x3 + ... + lnxn. Note that l0 = 1, as it is the product of all the 1 terms in the (1 - mix) factors.
Here are the equations for the coefficients of S(M)L of degree n+1 through 2n:
- sn+1 + sn+0l1 + sn-1l2 + sn-2l3 + ... + s1ln = 0
- sn+2 + sn+1l1 + sn+0l2 + sn-1l3 + ... + s2ln = 0
- sn+3 + sn+2l1 + sn+1l2 + sn+0l3 + ... + s3ln = 0
- ...
- s2n + s2n-1l1 + s2n-2l2 + s2n-3l3 + ... + snln = 0
These are n linear equations with n unknowns (the li values, for i=1..n), which can be solved using Gaussian elimination. After doing so, we have the coefficients of L, which can then be factored into first degree factors of the form (1 - mix). The resulting m values are our set elements.
Putting it all together
Interestingly, only 2n coefficients of S(M) were needed for solving the set of equations above. This means we have our answer: the coefficients 1 through 2n of S(M), or the list [m1 + m2 + ..., m12 + m22 + ..., ..., m12n + m22n + ...] functions as a sketch, satisfying the two properties we want:
- Sketches can be combined to form the sketch of their symmetric difference, by simply pairwise adding the list elements together.
- With 2n list elements we can efficiently recover n elements from a sketch.
Capacity and difference
The approach above only works when the number of elements n in the sketch is known. Of course we want to support cases where only an upper bound on the number of elements in the sketch is known, the capacity c. Given that we can reconstruct a set of size c from a sketch with 2c terms, we should be able to reconstruct a set of size n too as long as n ≤ c. This is simply a matter of trying to solve the above set of equations assuming values of n that count down from c until a solution is found for one. This is known as the Peterson-Gorenstein-Zierler algorithm.
Halving the sketch size
We can in fact only include the odd terms in the sketch, and reconstruct the even ones before solving the equation to find L. This means the size of a sketch becomes just c field elements, the same size as would be needed to send its contents naively.
To see how this is possible, we need the Frobenius endomorphism, which in short states that in fields where x + x = 0 it holds that (x + y)2 = x2 + y2 for every x and y (the dream of every high school math student!). This means that:
- s2 = m12 + m22 + ... = (m1 + m2 + ...)2 = s12.
- s4 = m14 + m24 + ... = (m12 + m22 + ...)2 = s22.
- s6 = m16 + m26 + ... = (m13 + m23 + ...)2 = s32.
- ...
In other words, we only need to send s1, s3, s5, ..., s2n-1 to recover all 2n si values, and proceed with reconstruction.
Quadratic performance rather than cubic
Using Gaussian elimination to solve the set of equations above for the li values requires O(n3) field operations. However, due to the special structure in the equations (look at the repeated si values), it can be solved in O(n2) time using a number of techniques, including the Berlekamp-Massey algorithm (BM).
Roots instead of factorization
As explained above, the polynomial L can be factored into (1 - mix) factors, where the values mi are the set elements. However, since we know that a decodable sketch must result in a polynomial that is fully factorizable into degree-1 factors, we can instead use a more efficient root-finding algorithm rather than a factorization algorithm. As the root of each (1 - mix) factor is mi-1, we conclude that the set elements are in fact the inverses of the roots of L.
Avoiding inversions
As inversions are a relatively expensive operation, it would be useful to avoid them.
Say that we're trying to find the inverses of the roots of L = 1 + l1x + l2x2 + ... + lnxn, then we're really interested in the solutions y for 1 + l1y-1 + l2y-2 + ... + lny-n = 0. By multiplying both sides in the equations with yn, we find ln + ln-1y + ln-2y2 + ... + yn = 0.
In other words, we can find the inverses of the roots of L by instead factoring the polynomial with the coefficients of L in reverse order.
- [1] For those familiar with coding theory: PinSketch communicates a set difference by encoding the set members as errors in a binary BCH codeword 2bits in size and sends the syndromes. The linearity of the syndromes provides all the properties needed for a sketch. Sketch decoding is simply finding the error locations. Decode is much faster than an ordinary BCH decoder for such a large codeword because the need to take a discrete log is avoided by storing the set in the roots directly instead of in an exponent (logically permuting the bits of the codeword).