This repository contains the code and supplementary material for the paper: A Simple Algorithm For Scaling Up Kernel Methods.
Conventional wisdom suggests kernel methods are unsuitable for large samples due to their computational complexity and memory requirements. We introduce a novel random feature regression algorithm that allows us (when necessary) to scale to virtually infinite numbers of random features. We illustrate the performance of our method on the CIFAR-10 dataset.
We've run all the experiments using Python 3.7+.
Before starting, please create a virtual environment, then:
```bash
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements
python3 setup.py develop
```
To reproduce all the results shown in the paper:
```bash
python3 experiments/sklearn_vs_spectral_speed.py
# FABR
./train_all.sh
# Tables
./tables.sh
# Plots
./voc_curves.sh
```
In this section, we show FABR's training and prediction time with respect to the number of features.
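The exact benchmark is implemented in `experiments/sklearn_vs_spectral_speed.py`. For a feel of what is being measured, the following self-contained sketch times the fit and predict steps of a plain random-feature ridge regression as the number of random features grows; the sample size, feature map, and shrinkage value are illustrative assumptions, not the settings used in the paper.

```python
# Illustrative timing sketch (not the repository's benchmark script):
# measure wall-clock fit/predict time of a plain random-feature ridge
# regression as the number of random features p grows.
import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 32                               # samples, input dimension
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

for p in (1_000, 2_000, 4_000, 8_000):         # number of random features
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Z = np.maximum(X @ W, 0.0)                 # random ReLU features

    t0 = time.perf_counter()
    # ridge solve: (Z'Z + z * I) beta = Z'y, with shrinkage z = 1
    beta = np.linalg.solve(Z.T @ Z + np.eye(p), Z.T @ y)
    t_fit = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = Z @ beta                               # prediction step
    t_pred = time.perf_counter() - t0
    print(f"p={p:>5}: fit {t_fit:.2f}s, predict {t_pred:.3f}s")
```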
Note that if you're an AMD user, you might see inconsistent results. This is because NumPy relies heavily on C linear algebra libraries such as the Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK). On Intel machines, NumPy typically relies on the Math Kernel Library (MKL), while on AMD machines it relies on OpenBLAS. Hence, if you're an AMD user and you run into inconsistent results, please follow this thread:
Why is Numpy with Ryzen Threadripper so much slower than Xeon?.
To check which C linear algebra backend your NumPy build uses, simply run:
```python
import numpy
numpy.show_config()
```
Update: we've also run the experiments on an Apple M1 Pro chip. These new Apple chips use OpenBLAS, and the results hold.
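If you want to double-check at runtime which BLAS/LAPACK library NumPy has actually loaded (and with how many threads), the `threadpoolctl` package gives a convenient view. Note that `threadpoolctl` is not listed in this repository's requirements, so treat this as an optional extra:

```python
# Optional: inspect the BLAS/LAPACK pools numpy has loaded at runtime.
# Requires `pip install threadpoolctl` (not part of this repo's requirements).
import numpy as np                    # import numpy first so its BLAS is loaded
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # 'internal_api' is e.g. 'mkl' on Intel builds, 'openblas' on AMD/Apple silicon
    print(pool["internal_api"], "-", pool["num_threads"], "threads")
```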
For further evaluation, we assess FABR's performance in both the small and large dataset regimes. For all experiments, we perform a random features kernel ridge regression with respect to demeaned one-hot labels and solve the optimization problem using FABR, FABR-$\nu$, and FABR with mini-batches.
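To make the setup concrete, here is a minimal plain-NumPy sketch of a random features ridge regression on demeaned one-hot labels. It is not the FABR solver shipped in this repository (which is built to handle very large numbers of random features and several shrinkage values $z$); the feature map, sizes, and shrinkage value below are illustrative assumptions.

```python
# Minimal sketch of the base problem: random-feature ridge regression on
# demeaned one-hot labels. Illustration only; not the FABR implementation.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, p = 1_000, 64, 10, 2_000      # samples, inputs, classes, random features
X = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)

Y = np.eye(k)[labels]                  # one-hot labels, shape (n, k)
Y = Y - Y.mean(axis=0)                 # demean each class column

W = rng.standard_normal((d, p)) / np.sqrt(d)
Z = np.maximum(X @ W, 0.0)             # random ReLU features, shape (n, p)

z = 100.0                              # ridge shrinkage
beta = np.linalg.solve(Z.T @ Z + z * np.eye(p), Z.T @ Y)   # shape (p, k)

pred = (Z @ beta).argmax(axis=1)       # predicted class = argmax over columns
print("in-sample accuracy:", (pred == labels).mean())
```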
We compare FABR's performance on the subsampled CIFAR-10 datasets with ResNet-34 and CNTK, as shown in (Arora et al., 2019). We are well aware of the work of (Shankar et al., 2020). However, this paper aims to offer a highly scalable ridge regression scheme based on random features. Computing the random features underlying the Neural Kernels would require developing non-trivial numerical algorithms based on the recursive iteration of non-linear functions. We leave this as an important direction for future research.
The table below shows FABR outperforming both ResNet-34 and CNTKs on the subsampled CIFAR-10 datasets. For further experiment details, see Section 4 of the paper.
n | ResNet-34 | 14-layer CNTK | z=1 | z=100 | z=10000 | z=100000
---|---|---|---|---|---|---
10 | 14.59% | 15.33% | 18.50% | 18.50% | 18.42% | 18.13%
20 | 17.50% | 18.79% | 20.84% | 20.85% | 20.78% | 20.13%
40 | 19.52% | 21.34% | 25.09% | 25.10% | 25.14% | 24.41%
80 | 23.32% | 25.48% | 29.61% | 29.60% | 29.62% | 28.63%
160 | 28.30% | 30.48% | 34.86% | 34.87% | 35.02% | 33.54%
320 | 33.15% | 36.57% | 40.46% | 40.47% | 40.66% | 39.34%
640 | 41.66% | 42.63% | 45.68% | 45.68% | 46.17% | 44.91%
1280 | 49.14% | 48.86% | 50.30% | 50.32% | 51.05% | 49.74%
Figures (one panel for each of $n = 10, 20, 40, 80, 160, 320, 640, 1280$): these figures show that FABR's test accuracy increases with the model's complexity.
We run the same experiment on the expanded dataset for both FABR-$\nu$ and mini-batch FABR. We observe that the double descent phenomenon naturally appears for both FABR-$\nu$ and the mini-batch FABR but only when
Figure: FABR-$\nu$ trained on the subsampled ($n = 2560$) and full-sample ($n = 50000$) CIFAR-10 datasets, respectively.
Figure: Mini-batch FABR trained on the subsampled ($n = 2560$) and full-sample ($n = 50000$) CIFAR-10 datasets, respectively.