Skip to content

Efficient Pairwise Cosine Similarity Computation

License

Notifications You must be signed in to change notification settings

ngshya/effcossim

Repository files navigation

Efficient Pairwise Cosine Similarity Computation

The (i, j)-entry of the output matrix is the cosine distance between the i-th row of A and the j-th row of B. This function is only a wrapper, it uses the implementation of cosine_similarity from scikit-learn and the implementation of awesome_cossim_topn from sparse_dot_topn. For more details, please check:

To install this package:

pip install effcossim

Sample code:

from numpy import array
from effcossim.pcs import pairwise_cosine_similarity, pp_pcs

A = array([
    [1, 2, 3], 
    [0, 1, 2],
    [5, 1, 1]
])

B = array([
    [1, 1, 2], 
    [0, 1, 2],
    [5, 0, 1], 
    [0, 0, 4]
])

# scikit-learn implementation
M1 = pairwise_cosine_similarity(
    A=A, B=B, 
    efficient=False, 
    dense_output=True
)

# sparse_dot_topn implementation
M2 = pairwise_cosine_similarity(
    A=A, B=B, 
    efficient=True, 
    n_top=4, 
    lower_bound=0.5, 
    n_jobs=2, 
    dense_output=True
)

When efficient=True, in each row of the output matrix only the top n_top entries above lower_bound are retained (lower memory impacts). Furthermore, if n_jobs is larger than 1, parallel computations are applied (higher speed).

If multiple comparisons are required, the parallel implementation can be used.

l1 = [random(m=10000, n=1000, density=0.3,) for _ in range(6)]
l2 = [random(m=10000, n=1000, density=0.3,) for _ in range(6)]

L = pp_pcs(
    l1=l1, 
    l2=l2, 
    n_workers=2, 
    efficient=True, 
    n_top=10, 
    lower_bound=0.3, 
    n_jobs=2, 
    dense_output=False
)

The output is a list where the k-th element is the output of

pairwise_cosine_similarity(l1[k], l2[k])

For further examples, check the notebook.