A micro-package for quickly computing the cosine similarity of the columns of a dense matrix. If you can do this faster, I'd love to know how.
Performance should scale fairly well, and the package will use multiple threads (if your compiler supports OpenMP) when the matrix has more than 2500 columns. You will see the biggest performance improvements, in decreasing order of value, by using:
- A good BLAS library.
- A compiler supporting OpenMP (preferably version 4 or better).
If you're on Linux, you can very easily use OpenBLAS with R. Users on other platforms might consider using Revolution R Open, which ships with Intel MKL.
Given an `m`-by-`n` matrix `x` (input) and an `n`-by-`n` matrix `cos` (preallocated output), the algorithm proceeds as follows (a pure-R sketch is given after the list):
1. Compute the upper triangle of the crossproduct `cos = t(x) %*% x` using a symmetric rank-k update (the `_syrk` BLAS function).
2. Iterate over the upper triangle of `cos`:
    - Divide its off-diagonal values by the square root of the product of its `i`'th and `j`'th diagonal entries.
    - Replace its diagonal values with 1.
3. Copy the upper triangle of `cos` onto its lower triangle.
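For reference, here is a minimal pure-R sketch of those steps; the package itself does this in compiled code, and `cosine_ref` is a hypothetical name used only for illustration:

```r
# Pure-R sketch of the matrix algorithm (illustration only).
cosine_ref <- function(x) {
  cp <- crossprod(x)         # cos = t(x) %*% x; R dispatches this to BLAS
  d  <- sqrt(diag(cp))       # square roots of the diagonal entries
  cp <- cp / tcrossprod(d)   # divide entry (i,j) by sqrt(cp[i,i] * cp[j,j])
  diag(cp) <- 1              # replace the diagonal with exactly 1
  cp                         # already symmetric; no triangle copy needed in R
}
```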
The total number of floating point operations is:
- `m*n*(n+1)` for the symmetric rank-k update.
- `3/2*(n+1)*n` for the rescaling operation.

The algorithmic complexity is `O(mn^2)`, and is dominated by the symmetric rank-k update.
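To make that concrete with the `m=2000`, `n=200` matrix used in the benchmarks below: the rank-k update costs `2000*200*201` = 80.4 million flops, while the rescaling costs only `3/2*201*200` = 60,300 flops, a difference of more than three orders of magnitude.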
Given two `n`-length vectors `x` and `y` (inputs), the algorithm is as follows (a pure-R sketch follows the list):
1. Compute `crossprod = t(x) %*% y` (using the `_gemm` BLAS function).
2. Compute the squared Euclidean norms of `x` and `y` (using the `_syrk` BLAS function).
3. Divide the crossproduct from step 1 by the square root of the product of the norms from step 2.
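A minimal pure-R sketch of these three steps (again, `cosine_vec_ref` is a hypothetical name for illustration; the package performs them in compiled code):

```r
# Pure-R sketch of the vector algorithm (illustration only).
cosine_vec_ref <- function(x, y) {
  cp  <- sum(x * y)            # step 1: crossproduct t(x) %*% y
  nx2 <- sum(x * x)            # step 2: squared Euclidean norm of x
  ny2 <- sum(y * y)            #         squared Euclidean norm of y
  cp / sqrt(nx2 * ny2)         # step 3: rescale the crossproduct
}
```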
The total number of floating point operations is:
- `2n-1` for the crossproduct.
- `4n-2` for the two (squared) norms.
- `3` for the division and square root/product.

The algorithmic complexity is `O(n)`.
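Summing the three terms above gives exactly `6n` flops in total.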
All benchmarks were performed using:
- R 3.2.2
- OpenBLAS
- gcc 5.2.1
- 4 cores of a Core i5-2500K CPU @ 3.30GHz
- Linux kernel 4.2.0-16
Compared to the version in the `lsa` package (as of 27-Oct-2015), this implementation performs quite well:
```r
library(rbenchmark)
reps <- 100
cols <- c("test", "replications", "elapsed", "relative")

m <- 2000
n <- 200
x <- matrix(rnorm(m*n), m, n)
benchmark(fastcosim::cosine(x), lsa::cosine(x), columns=cols, replications=reps)
##                   test replications elapsed relative
## 1 fastcosim::cosine(x)          100   0.177    1.000
## 2       lsa::cosine(x)          100 113.543  641.486
```
For vector inputs, the two implementations perform essentially identically:
```r
library(rbenchmark)
reps <- 100
cols <- c("test", "replications", "elapsed", "relative")

n <- 1000000
x <- rnorm(n)
y <- rnorm(n)
benchmark(fastcosim::cosine(x, y), lsa::cosine(x, y), columns=cols, replications=reps)
##                      test replications elapsed relative
## 1 fastcosim::cosine(x, y)          100   0.757    1.000
## 2       lsa::cosine(x, y)          100   0.768    1.015
```
You can install the package from GitHub using the devtools package:

```r
devtools::install_github("wrathematics/fastcosim")
```