# RANGE-NET: A HIGH PRECISION NEURAL SVD
**Anonymous authors**
Paper under double-blind review
ABSTRACT
For Big Data applications, computing a rank-r Singular Value Decomposition
(SVD) is restrictive due to the main memory requirements. Recently introduced
streaming Randomized SVD schemes work under the restrictive assumption that
the singular value spectrum of the data has an exponential decay. This is seldom
true for any practical data. Further, the approximation errors in the singular vectors
and values are high due to the randomized projection. We present Range-Net as a low-memory
alternative to rank-r SVD that satisfies the lower bound on tail energy given by the Eckart-Young-Mirsky
(EYM) theorem at machine precision. Range-Net is a deterministic two-stage neural optimization approach
with random initialization, where the memory requirement depends explicitly on the feature dimension and
desired rank, independent of the sample dimension. The data samples are read in a streaming manner,
with the network minimization problem converging to the desired rank-r approximation. Range-Net is
fully interpretable, with all network outputs and weights having a specific meaning. We provide theoretical
guarantees that the Range-Net extracted SVD factors satisfy the EYM tail-energy lower bound, along with
numerical experiments on real datasets at various scales that confirm this bound. A comparison against
the state-of-the-art streaming Randomized SVD shows that Range-Net is six orders of magnitude more
accurate in terms of tail energy while correctly extracting the singular values and vectors.
1 INTRODUCTION
Singular Value Decomposition (SVD) is pivotal to exploratory data analysis in identifying an invariant
structure under a minimalistic representation (assumptions on the structure) containing the span
of resolvable information in the dataset. Finding a low rank structure is a fundamental task in
applications including Image Compression (de Souza et al., 2015), Image Recovery (Brand, 2002),
Background Removal (Wang et al., 2018), Recommendation Systems (Zhang et al., 2005) and as a
pre-processing step for Clustering (Drineas et al., 2004) and Classification (Jing et al., 2017). With
the advent of digital sensors and modern day data acquisition technologies, the sheer amount of data
now requires that we revisit the solution scheme with reduced memory consumption as the target. In
this work, we reformulate SVD with special emphasis on the main memory requirement, which currently
precludes its use for big data applications, with no loss in accuracy.
It is well known that natural data matrices have a decaying spectrum wherein saving the data in
memory in its original form is either redundant or not required from an application point of view.
However, any assumption on the decay rate can only be validated if the singular value decomposition
is known a priori, which is seldom the case in exploratory data analysis (see Fig. 3). Visually
assessing a rank-r approximation for image processing applications might seem qualitatively correct
(see Fig. 4), but it is still prone to large errors due to limited human visual acuity (see Fig. 6). This is
further exacerbated when the application at hand is associated with scientific computations wherein
the anomalies or unaccounted phenomena are still being explored from large scale datasets. The reader
is preemptively referred to Fig. 19 where the high frequency features related to turbulence cannot
be disregarded. Furthermore, for classification and clustering problems where feature dimension
reduction is desirable it is imperative that a low-rank approximation of a dataset contains most
(≥ 90%) of the original information content without altering the subspace information. In this case,
an over-sampled rank can exceed the feature dimension of the data itself (see Section 4.2).
1.1 PROBLEM STATEMENT
Let us denote the raw data matrix as $X \in \mathbb{R}^{m \times n}$ of rank $f \le \min(m, n)$ and its approximation as
$X_r \in \mathbb{R}^{m \times n}$. The SVD of $X$ is $X = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times f} = [u_1, \cdots, u_f]$ and $V \in \mathbb{R}^{n \times f} = [v_1, \cdots, v_f]$ are its left and right singular vectors respectively, and $\Sigma \in \mathbb{R}^{f \times f} = \mathrm{diag}(\sigma_1, \cdots, \sigma_f)$
holds its largest non-zero singular values. The rank-$r$ ($r \le f$) truncation of $X$ is then $X_r = U_r \Sigma_r V_r^T$,
where $\Sigma_r = \Sigma[1\!:\!r]$ contains the top $r$ singular values, and $U_r = U[1\!:\!r]$ and $V_r = V[1\!:\!r]$ are the corresponding
left and right singular vectors. In other words, $X = U \Sigma V^T = U_r \Sigma_r V_r^T + U_{f \setminus r} \Sigma_{f \setminus r} V_{f \setminus r}^T = X_r + X_{f \setminus r}$.
Here, $U_{f \setminus r}, V_{f \setminus r}$ are the trailing $f - r$ left and right singular vectors.
**Theorem 1 (Eckart-Young-Mirsky Theorem, Eckart & Young, 1936; Mirsky, 1960).** Let $X \in \mathbb{R}^{m \times n}$
be a real, rank-$f$ matrix with $m \ge n$ and singular value decomposition $X = U \Sigma V^T$, where
the orthonormal matrices $U, V$ contain the left and right singular vectors of $X$ and $\Sigma$ is a diagonal
matrix of singular values. Then for an arbitrary rank-$r$ ($r \le f$) matrix $B_r \in \mathbb{R}^{m \times n}$,
$$\|X - B_r\|_F \ge \|X - X_r\|_F$$
where $X_r = U_r \Sigma_r V_r^T$, $\Sigma_r$ is the diagonal matrix of the largest $r$ singular values, and $U_r, V_r$ are
the corresponding left and right singular vector matrices.
The problem statement is then: Given $X \in \mathbb{R}^{m \times n}$, find $\hat{X}$ such that
$$\hat{X}_* = \arg\min_{\hat{X} \in \mathbb{R}^{m \times n},\ \mathrm{rank}(\hat{X}) \le r} \|X - \hat{X}\|_F \qquad (1)$$
In effect, the minimizer $\hat{X}_*$ of the above problem gives us the rank-$r$ approximation of $X$ such that
$X_r = \hat{X}_*$. In this work we utilize the minimizer of the above problem to extract the top rank-$r$ SVD
factors of $X$ without loading the entire data matrix into the main memory. Note that the minimizer
naturally attains the lower bound on this tail energy in addition to being a rank-$r$ approximation.
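This bound is easy to check numerically. The following is an illustrative sketch (the matrix sizes and the random rank-$r$ competitor are arbitrary choices, not part of the paper's method): the rank-$r$ SVD truncation attains the tail energy, and any other rank-$r$ projection can only do worse.

```python
import numpy as np

# Illustrative check of Eq. 1 and Theorem 1 on a small random matrix (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
r = 10

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]        # rank-r truncation X_r
tail = np.linalg.norm(X - X_r)                     # ||X - X_r||_F
print(tail, np.sqrt(np.sum(s[r:] ** 2)))           # identical: sqrt of the trailing sigma_i^2

# An arbitrary rank-r competitor, here a random orthonormal projection, cannot beat it:
Q, _ = np.linalg.qr(rng.standard_normal((50, r)))  # random n x r orthonormal basis
B_r = X @ Q @ Q.T                                  # a rank-r matrix
assert np.linalg.norm(X - B_r) >= tail - 1e-10
```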
1.2 MAIN CONTRIBUTIONS
**Data and Representation Driven Neural SVD:** The representation driven network loss terms ensure
that the data matrix $X$ is decomposed into the desired SVD factors such that $X = U\Sigma V^T$. In
the absence of the representation enforcing loss term, the minimizer of Eq. 1 results in an arbitrary
decomposition $X = ABC$ different from the SVD factors.
**A Deterministic Approach with GPU Bit-precision Results: The network can be initialized with**
weights drawn from a random distribution where the iterative minimization is deterministic. The
streaming order of the samples is of no consequence and the user is free to choose the order in which
the samples are processed in a batch-wise manner (indexed or randomized).
**First Streaming Architecture with Exact Low Memory Cost:** Range-Net requires an exact memory
allocation based upon the desired rank $r$ and the data dimensions $X \in \mathbb{R}^{m \times n}$, given by $r(n + r)$,
independent of the sample dimension $m$. This is the first streaming algorithm that does not require the
user to wait until the streaming step is complete, contrary to randomized streaming algorithms.
**Layer-wise Fully Interpretable:** Range-Net is a low-weight, fully interpretable, dense neural network
where all the network weights and outputs have a precise definition. The network weights are
placeholders for the right (or left) orthonormal vectors upon convergence of the network minimization
problems (see Appendix D). The user can explicitly plug in a ground truth solution to verify the network
design and directly arrive at the tail energy bound.
2 RELATED WORKS
The core idea behind randomized matrix decomposition is to make one or more passes over the data
and compute efficient sketches. They can be broadly categorized into four branches: 1) Sampling
based methods (Subset Selection (Boutsidis et al., 2014) and Randomized CUR (Drineas et al.,
2006)); 2) Random Projection based QR (Halko et al., 2011); 3) Randomized SVD (Halko et al.,
2011); and 4) Power iteration methods (Musco & Musco, 2015). The sketches can represent any
combination of row space, column space or the space generated by the intersection of rows and
columns (core space). However, all of these methods require loading the entire data in memory.
Readers are referred to Kishore Kumar & Schneider (2017); Ye et al. (2019) for an expanded survey.
Conventional SVD, although deterministic and accurate, becomes expensive as the data size
increases, and requires $r$ passes over the data (see Table 1). The two branches of interest to us are
randomized SVD and power iteration methods for extracting SVD factors. Randomized SVD
algorithms (Halko et al., 2011) are generally a two-stage process: 1) Randomized Sketching uses
random sampling to obtain a reduced matrix (or matrices) covering any combination of the row, column
and core space of the data; and 2) Deterministic Post-processing performs conventional SVD on
the reduced system from the Randomized Sketching stage. These approaches make only one pass over
the data, assuming that the singular value spectrum decays rapidly.
Power iteration based approaches (Musco & Musco, 2015) require multiple passes over the data
and are used when the singular value spectrum decays slowly. This class of algorithms constructs a
Krylov matrix inspired by Block Lanczos (Golub & Underwood, 1977) to obtain a polynomial series
expansion of the sketch. Although these algorithms achieve lower tail-energy errors, they cannot be
used in big-data applications when $X$ itself is too large to be retained in the main memory. Here,
constructing a Krylov matrix with higher order terms such as $AA^T$ or $A^T A$ is not feasible[1].
Table 1: Current best randomized SVD methods. k, l, s are overestimated sketch sizes for a rank-r estimate s.t.
k, l, s > r. Note that Range-Net has an exact memory requirement (as conventional SVD), unlike order-bounded
randomized methods. Here, deterministic implies that the solution obtained upon convergence is deterministic.

| Method | Halko et al. (2011) | Upadhyay (2016) | Tropp et al. (2017b) | Tropp et al. (2019) | Range-Net | Conventional SVD |
|---|---|---|---|---|---|---|
| Space Complexity | O(k(m + n)) | O(k(m + n) + s²) | O(km + nl) | O(k(m + n) + s²) | r(n + r) | n(m + 2n) |
| # Passes | 1 | 1 | 1 | 1 | ≤ 5 | r |
| Type | Randomized | Randomized | Randomized | Randomized | Deterministic | Deterministic |
Due to main-memory restrictions on remote compute machines, streaming (Clarkson & Woodruff,
2009; Liberty, 2013) algorithms became popular. For low-rank SVD approximations these involve
streaming the data and updating low-memory sketches covering the row, column and core spaces.
Existing randomized SVD capable of streaming include Halko et al. (2011); Upadhyay (2016); Tropp
et al. (2017a; 2019), each with different sketch sizes and upper bounds on approximation errors
(Table 1).
SketchySVD (Tropp et al., 2019) is the state-of-the-art streaming randomized SVD, with sketch sizes
comparable to its predecessors, tighter upper bounds on the tail energy, and lower errors. As a
two-stage approach, SketchySVD (Alg. 1) constructs an overestimated rank-(k, s) sketch of the data
based on row, column and core projections. A QR decomposition on the row and column sketches
gives an estimate of the rank-k subspace. This is followed by a conventional SVD on the core matrix
to extract its singular values and vectors. Finally, the singular vectors are returned after projecting
them back to the original row and column space. The time cost of SketchySVD is $O(k^2(m + n))$
with memory cost $O(k(m + n) + s^2)$ and oversampling parameters $k = 4r + 1$ and $s = 2k + 1$.
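For reference, the generic sketch-then-post-process pipeline underlying these methods can be written in a few lines. The sketch below follows the basic randomized range finder in the spirit of Halko et al. (2011); it is not SketchySVD's three-sketch algorithm (Alg. 1), and the oversampling value is only illustrative.

```python
import numpy as np

def randomized_svd_sketch(X, r, oversample=10, seed=0):
    """Minimal randomized SVD in the spirit of Halko et al. (2011) -- an illustrative sketch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = r + oversample                       # oversampled sketch size
    Omega = rng.standard_normal((n, k))      # random test matrix
    Q, _ = np.linalg.qr(X @ Omega)           # orthonormal basis for an approximate range of X
    B = Q.T @ X                              # small (k x n) projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :r], s[:r], Vt[:r]    # approximate top-r SVD factors
```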
2.1 LIMITATIONS OF RANDOMIZED SVD APPROACHES
We discuss a few limitations that led us to reformulate the problem in the spirit of EYM theorem.
The reader is referred to Appendix A for a detailed discussion and supporting numerical examples to
further enunciate these limitations.
**Tall and Skinny Matrices:** For a rank-$r$ approximation of $X \in \mathbb{R}^{m \times n}$ of rank $f$, Randomized SVD
methods rely upon rank-$k$ ($k > r$) sketches of $X$. However, these methods are useful only when
$k \ge f$, and for practical datasets $f$ is close to $\min(m, n)$, so the memory requirement can still be
overbearing. The reader is referred to Section 4.2 for a practical example.
**Exponential Decay of Singular Values:** Assuming exponential decay implies that the rank of $X$ itself
is such that $f \ll \min(m, n)$. For real-world applications, the data matrices are almost full rank
($f \approx \min(m, n)$), and a rank-$r$ truncation is chosen such that the desired dominant features are
accounted for. Appendix A shows a synthetic case with a non-exponential decay of singular values
where sketching accrues substantial errors. Further, it is difficult to assume that the decay rate will
follow a strict functional form: it may mix linear, exponential and other decays (see Fig. 3).
**Upper Bound on Tail Energy:** The problem statement in Eq. 1 entails finding a minimum, with
the minimizer providing the lower bound on the tail energy. Even if the solution scheme is upper
bounded (Halko et al., 2011; Upadhyay, 2016; Tropp et al., 2017a; 2019), finding the minimizer $\hat{X}_*$ of Eq. 1,
or equivalently attaining the lower bound, is still necessary.
**Approximation Errors: A low-rank SVD solver that does not iteratively compute the projection**
(left or right) while solving Eq. 1 cannot extract SVD factors with low errors even with multiple
passes over the data matrix or multiple runs. As shown later in Theorem 2, any subspace projection
(left or right) of the original data matrix that does not correspond to the minimizer in Eq. 1 increases
the tail energy and therefore results in incorrect low-rank SVD factors (singular values and vectors).
**Memory Requirement:** Randomized SVD requires an optimal choice of hyper-parameters (sketch
sizes etc.) that are specific to the dataset being processed. In a practical, limited-memory scenario,
this entails tuning the hyper-parameters for an optimal trade-off between memory requirement, compute
time, and an approximation error that does not violate the upper bound on the tail energy.
**Remark.** A low relative error in tail energy does not imply that the extracted singular values and vectors
will have similarly low relative errors at scale. This issue was raised by Musco & Musco (2015):
merely upper bounding the tail energy in the Frobenius or spectral norm does not bound the
approximation errors in the extracted singular values or vectors. Therefore, for a fair comparison we
report error metrics on the extracted singular factors for all our numerical experiments.

[1] Readers are referred to Fig. 5 (c) and Fig. 8 for a performance comparison between power iteration schemes
and Range-Net for a mid-sized real and a synthetic dataset that can be loaded on our compute machine.
3 RANGE-NET: A 2-STAGE SVD SOLVER
We present Range-Net that explicitly relies upon solving the minimization problem in Eq. 1 to
achieve the lower bound on the tail-energy for a desired rank-r approximation of a data matrix X
under a streaming setting. The readers are referred to Appendix B for the preliminaries followed by
proofs of theorems and lemmas associated with each of the two stages.
[Figure 1 schematic: the data $X$ (image/graph/matrix) is streamed in batches; Stage 1 learns the projector $\tilde{V}$, yielding $XV_*$ and the rank-$r$ approximation $XV_*V_*^T$, and Stage 2 learns the rotation $\Theta$, yielding $XV_*\Theta$ and $\Sigma_r$.]
Figure 1: An overview of the low-memory, two-stage Range-Net SVD for Big Data Applications. Stage 1
identifies the span of the desired rank-r approximation. Stage 2 rotates this span to align with the singular
vectors while extracting the singular values of the data. The input data can be streamed from either a server or
secondary memory. If the target is just a rank-r compression then Stage-2 can be discarded without any loss in
accuracy. Stage-2 only orders the rank-r features based upon their respective tail energies.
3.1 NETWORK ARCHITECTURE
The proposed network architecture is divided into two stages: (1) Projection and (2) Rotation, each
containing only one dense layer of neurons with linear activation functions and no biases. Fig. 2
shows an outline of this two-stage network architecture, where all the weights and outputs have a
specific meaning enforced using representation and data driven loss terms. Contrary to randomized
SVD algorithms, the subspace projection (Stage 1) is not specified preemptively (consequently, no
assumptions are made) but is computed by solving an iterative minimization problem following the EYM theorem,
corresponding to Eq. 1. The rotation stage (Stage 2) then reuses the EYM theorem in a modified
form to extract the singular vectors and values.
**Stage 1: Rank-r Sub-space Identification:** The projection stage constructs an orthonormal basis
that spans the $r$-dimensional sub-space of a data matrix $X \in \mathbb{R}^{m \times n}$ of an unknown rank $f \le \min(m, n)$. This orthonormal basis ($\tilde{V}$) is extracted as the Stage-1 network weights once the network
minimization problem converges to a fixed point. The representation loss $\|\tilde{V}^T \tilde{V} - I_r\|_F$ in Stage 1
enforces the orthonormality requirement on the projection space (even when $r > f$), while the data-driven
loss $\|X - X\tilde{V}\tilde{V}^T\|_F$ minimizes the tail energy. Although the minimization problem is
non-convex, the tail energy is guaranteed to converge to the minimum at machine precision. The
reader is referred to Appendix C for a discussion of the minimization problem (a bi-quadratic loss
function with $2^r$ global minima) and details regarding the convergence behavior.
[Figure 2 schematic: a linear Projection layer $\tilde{V} \in \mathbb{R}^{n \times r}$ followed by a linear Rotation layer $\Theta \in \mathbb{R}^{r \times r}$, producing $Y = X\tilde{V}\Theta$. Stage-1 loss: $\|\tilde{V}^T\tilde{V} - I_r\|_F^2$ (representation) $+ \|X\tilde{V}\tilde{V}^T - X\|_F^2$ (data). Stage-2 loss: $\|Y\Theta^T - X\tilde{V}\|_F^2$ (data) $+ \|Y^T Y - \mathrm{diag}(Y^T Y)\|_F^2$ (representation).]
Figure 2: Network Architecture: Projection (Stage-1) and Rotation (Stage-2) for a 2-stage SVD.

**Theorem 2.** For any $r, f \in \mathbb{Z}^+$, $0 < r \le f$: if the tail energy of a rank-$f$ matrix $X \in \mathbb{R}^{m \times n}$, $f \le \min(m, n)$, with respect to an arbitrary rank-$r$ matrix $B_r = X\tilde{V}_r\tilde{V}_r^T$ is bounded below by the tail
energy of $X$ with respect to its rank-$r$ approximation $X_r = XV_rV_r^T$ as $\|X - B_r\|_F \ge \|X - X_r\|_F$,
where $V_r = \mathrm{span}\{v_1, v_2, \cdots, v_r\}$ and the $v_i$ are the right singular vectors corresponding to the largest
$r$ singular values, then the minimizer of $\arg\min_{\tilde{V} \in \mathbb{R}^{n \times r}} \|X - X\tilde{V}\tilde{V}^T\|_F$ is $V_*$ such that $V_*V_*^T = V_rV_r^T$.
As per Theorem 2, the equality holds true only when $\mathrm{span}\{B_r\} = \mathrm{span}\{X_r\}$. Further, if we define
$B_r$ as $B_r = X\tilde{V}\tilde{V}^T$, then $V_*V_*^T = V_rV_r^T$ and $V_*^TV_* = I_r$, where $V_r$ is a rank-$r$ matrix whose column
vectors are the top-$r$ right singular vectors of $X$. The minimization problem then reads
$$\min_{\tilde{V}} \|X - X\tilde{V}\tilde{V}^T\|_F \quad \text{s.t.} \quad \tilde{V}^T\tilde{V} = I_r \qquad (2)$$
with a minimum at the fixed point $V_* = \mathrm{span}\{v_1, \ldots, v_r\}$, where $v_{i=1,\ldots,r}$ are the right singular
vectors of $X_r$. This minimization problem describes the Stage 1 loss function of our network
architecture. Upon convergence, the minimizer $\tilde{V}_* = V_*$ is such that $V_*V_*^T = V_rV_r^T$ following Theorem 2,
where $V_r$ is the matrix whose columns are the right singular vectors of $X$ corresponding to the largest $r$
singular values of $X$.

**Lemma 2.1.** If $V_r^T V_r = I_r$ and $V_rV_r^T = V_*V_*^T$, then $V_*^TV_* = I_r$.

**Lemma 2.2.** If $X \in \mathbb{R}^{m \times n}$ is a rank-$f$ matrix, then for any rank $r > f$, where $\{r, f\} \le \min(m, n)$,
if $V_*^TV_* = I_r$ and $V_*V_*^T = V_rV_r^T$, then $V_r^TV_r = I_r$.

**Remark.** Note that for $r \le f$, the orthonormality constraint is trivially satisfied, as shown in Lemma
2.1. However, for $r > f$, the orthonormality constraint ensures that the column vectors in $V_*$ are
orthonormal (see Lemma 2.2), allowing us to extract orthonormal right singular column vectors of
$V_r$ from the Stage 2 minimization problem.
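The paper's exact training code is not reproduced here; the following is a minimal sketch of how the Stage-1 minimization (Eq. 2) could be run over streamed batches, assuming a TensorFlow 2.x implementation with the Adam optimizer (both Keras and Adam are referenced by the paper). The batch-generator factory `stream_batches` and all hyper-parameter values are illustrative assumptions.

```python
import tensorflow as tf

def train_stage1(stream_batches, n, r, epochs=5, lr=1e-3):
    """Hedged sketch of Stage 1: learn V (n x r) minimizing Eq. 2 over streamed row-batches of X."""
    V = tf.Variable(tf.random.normal([n, r], stddev=0.01))      # projection-layer weights
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(epochs):                                     # one epoch = one pass over the data
        for X_batch in stream_batches():                        # (batch, n) rows of X
            with tf.GradientTape() as tape:
                XV = tf.matmul(X_batch, V)                      # X V
                recon = tf.matmul(XV, V, transpose_b=True)      # X V V^T
                data = tf.reduce_sum(tf.square(X_batch - recon))     # ||X - X V V^T||_F^2
                gram = tf.matmul(V, V, transpose_a=True)              # V^T V
                rep = tf.reduce_sum(tf.square(gram - tf.eye(r)))      # ||V^T V - I_r||_F^2
                loss = data + rep
            grads = tape.gradient(loss, [V])
            opt.apply_gradients(zip(grads, [V]))
    return V                                # converged V_* spans the top rank-r right subspace
```

Each full pass over the streamed data corresponds to one epoch here; the "≤ 5 passes" of Table 1 would then correspond to a handful of such epochs.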
**Stage 2: Singular Value and Vector Extraction:** The rotation stage then extracts the singular values
by rotating the orthonormal vectors ($V_*$) to align with the right singular vectors ($V_r = V_*\Theta_r$). From
the fixed point of the Stage-1 minimization problem Eq. 2 we have $V_*V_*^T = V_rV_r^T$. According to
the EYM theorem, the tail energy of the rank-$r$ matrix $XV_*C_r$, where $C_r$ is an arbitrary rank-$r$, real-valued,
square matrix, with respect to $XV_*$ is now bounded below by 0, i.e. $\|XV_* - XV_*C_r\|_F \ge 0$.

**Theorem 3.** Given a rank-$r$ matrix $XV_* \in \mathbb{R}^{m \times r}$ and an arbitrary rank-$r$ matrix $C \in \mathbb{R}^{r \times r}$, following
Theorem 1 the tail energy of $XV_*$ with respect to $XV_*C$ is bounded as $\|XV_* - XV_*C\|_F \ge 0$,
where the equality holds true if and only if $C = I_r$.

**Lemma 3.1.** If $C = \Theta_r\Theta_r^T$, where $\Theta_r \in \mathbb{R}^{r \times r}$ is a rank-$r$ matrix such that $C = I_r$, then $\Theta_r$ is a
real-valued unitary matrix in an $r$-dimensional Euclidean space.

**Theorem 4.** Given a rank-$r$ matrix $XV_* \in \mathbb{R}^{m \times r}$, such that $V_*V_*^T = V_rV_r^T$ where $V_r$ is the matrix
whose column vectors are the top-$r$ right singular vectors of $X$, and a real-valued unitary matrix
$\Theta_r \in \mathbb{R}^{r \times r}$, then $(XV_*\Theta_r)^T(XV_*\Theta_r)$ is a diagonal matrix $\Sigma_r^2 = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \cdots, \sigma_r^2)$, where the
$\sigma_i$ are the top-$r$ singular values of $X$, if and only if $V_*\Theta_r = V_r$.

From Theorem 3 and Lemma 3.1 we know that $C_r = \Theta_r\Theta_r^T$, where $\Theta_r$ is a rank-$r$, unitary matrix
in an $r$-dimensional Euclidean space. Further, from Theorem 4 we have that $(XV_*\Theta_r)^T(XV_*\Theta_r)$ is
a diagonal matrix $\Sigma_r^2 = \mathrm{diag}(\sigma_1^2, \cdots, \sigma_r^2)$, where the $\sigma_i$ are the top-$r$ singular values of $X$, if and only
if $V_*\Theta_r = V_r$. Setting $Y = XV_*\Theta_r$, the minimization problem now reads:
$$\min_{\Theta_r} \|Y\Theta_r^T - XV_*\|_F \quad \text{s.t.} \quad Y^T Y - \mathrm{diag}(Y^T Y) = 0 \qquad (3)$$
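The Stage-2 objective can be sketched in the same style. In the sketch below the orthogonality constraint of Eq. 3 is folded into the loss as a penalty term, mirroring the representation term shown in Figure 2; the penalty weighting is an illustrative assumption.

```python
import tensorflow as tf

def stage2_loss(XV, Theta, penalty=1.0):
    """Hedged sketch of Eq. 3: XV is X V_* (m x r), Theta is the r x r rotation being learned."""
    Y = tf.matmul(XV, Theta)                                              # Y = X V_* Theta
    data = tf.reduce_sum(tf.square(tf.matmul(Y, Theta, transpose_b=True) - XV))  # ||Y Theta^T - X V_*||_F^2
    gram = tf.matmul(Y, Y, transpose_a=True)                              # Y^T Y
    off_diag = gram - tf.linalg.diag(tf.linalg.diag_part(gram))           # off-diagonal part of Y^T Y
    rep = tf.reduce_sum(tf.square(off_diag))                              # ||Y^T Y - diag(Y^T Y)||_F^2
    return data + penalty * rep
```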
**Remark.** Note that Stage 1 can be verified numerically, independently of Stage 2, by checking
whether the orthonormality condition is met in addition to the minimization problem converging to
the tail-energy bound. Similarly, the Stage 2 minimization problem will return a rotation matrix $\Theta_r$
($\Theta_r^T\Theta_r = \Theta_r\Theta_r^T = I_r$, $\det(\Theta_r) = \pm 1$) upon convergence that can again be verified numerically.
As discussed previously, this choice of loss terms equipped with a Frobenius norm ensures a rank-$r$
approximation in accord with the Eckart-Young-Mirsky (EYM) theorem. We therefore state that
the expected value of the Stage-1 loss term at the minimum must correspond to the rank-$(n - r)$
tail energy. Further, the Stage-2 loss is expected to reach a machine-precision zero at the minimum.
Once the network minimization problems converge, the singular values are extracted from the Stage 2
network weights $\Theta_r$ as $\Sigma_r^2 = (XV_*\Theta_r)^T(XV_*\Theta_r)$. The right singular vectors can now be extracted
using the Stage 2 layer weights as $V_r = V_*\Theta_r$. Once $V_r$ and $\Sigma_r$ are known, the left singular vectors are
$U_r = XV_*\Theta_r\Sigma_r^{-1}$. Please note that for $r > f$, $r - f$ singular values are zero and therefore $\Sigma_r^{-1}$
implies inverting only the non-zero singular values using a threshold of $10^{-8}$.
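Once both stages have converged, the factors follow directly from the trained weights. A minimal NumPy sketch is given below (with $X$ held in memory purely for illustration; in the streaming setting $XV_*\Theta_r$ would be accumulated batch-wise).

```python
import numpy as np

def extract_factors(X, V_star, Theta, tol=1e-8):
    """Hedged sketch: recover (U_r, Sigma_r, V_r) from converged weights V_star (n x r), Theta (r x r)."""
    Y = X @ V_star @ Theta                                  # X V_* Theta, shape (m, r)
    sigma = np.sqrt(np.clip(np.diag(Y.T @ Y), 0.0, None))   # Sigma_r from the (diagonal) Gram matrix
    Vr = V_star @ Theta                                     # right singular vectors
    inv_sigma = np.where(sigma > tol, 1.0 / np.maximum(sigma, tol), 0.0)  # threshold tiny values
    Ur = Y * inv_sigma                                      # U_r = X V_* Theta Sigma_r^{-1} (column scaling)
    return Ur, sigma, Vr
```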
**Implementation Details:** A detailed discussion and justification of our implementation choices are
given in Appendix E. These include: a) Activation: all activation functions are linear with no
biases, since SVD requires linearly separable orthogonal features; b) Data Split: we do not perform
any split of the training data, since the SVD factors are unique to the full data; c) Data Streaming:
we stream data from HDD into main memory to avoid loading all data samples at once; d) Training and Setup; e)
Error Metrics; and f) Loss Profiles. Note that Range-Net has no hyper-parameters and therefore does
not require any post-hoc tuning or adjustments.
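As one concrete possibility for item c) above, batches can be read off disk with a memory map so that only the current batch occupies RAM; the file layout and batch size below are illustrative assumptions, not the paper's exact I/O path.

```python
import numpy as np

def stream_batches(path, m, n, batch_size=256, dtype=np.float32):
    """Hedged sketch of a disk-backed batch loader for an m x n row-major binary file."""
    X = np.memmap(path, dtype=dtype, mode="r", shape=(m, n))
    for start in range(0, m, batch_size):
        yield np.asarray(X[start:start + batch_size])    # copy only this batch into main memory
```

Combined with the Stage-1 sketch above, this could be passed as `lambda: stream_batches(path, m, n)` so that a fresh pass over the data starts at each epoch.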
4 RESULTS
We present numerical experiments for three datasets: (a) Parrot image, (b) MNIST, and (c) Hurricane
Sandy. The reader is referred to Appendix E.5 for the definitions of the error metrics used. Note that,
these error metrics rely upon conventional SVD as the baseline for a fair comparison. For additional
numerical experiments on sparse graph datasets and other low rank approximations see Appendix F.
SketchySVD’s algorithmic implementation (Alg. 1) can be found in Appendix G.
[Figure 3 panels: log-log singular value spectra for (a) Parrot, (b) MNIST, (c) Sandy.]
Figure 3: Singular value spectrum for the three practical datasets considered in this work. One can visually
assess that the decay rate of the singular values is non-exponential.
4.1 IMAGE COMPRESSION: PARROTS (SVD)
As an example of SVD on natural images, we use the well-known Parrots image from the image
processing domain. The original image is in RGB format, converted to gray scale for demonstration purposes,
followed by normalization to [0, 1]. This 1024 × 1536 data matrix is then used
to compute a rank r = 20 approximation for comparison and numerical analysis. Fig. 4 shows the
result of the low-rank reconstruction for SVD, SketchySVD and Range-Net. Visually one can verify
that Fig. 4 (c), (d) are similar while Fig. 4 (b) is different. To make the approximation error clearer,
we plot the absolute difference of SketchySVD and our network from the rank-truncated image. Fig.
4 (e), (f) show the corresponding plots with heatmaps imposed for clarity. Notice that while
the reconstruction error for our network ($\approx 10^{-7}$) is close to the GPU precision, SketchySVD has a
significantly higher error scale ($\approx 10^{-1}$), validating the artifacts in the approximated image.
The singular value spectrum does not decay exponentially (Fig. 3) and the data matrix is near-full
rank (f ≈ 1024). Fig. 5 (a, b) shows the scree error as the absolute difference between the predicted
and the true singular values. For SketchySVD, the error not only fluctuates across the top r = 20 values,
but the scale of the fluctuations is also around 1. In comparison, Range-Net incurs significantly lower errors
in the singular values (scale of $10^{-4}$). Fig. 5 (c) shows the reconstruction errors in the Frobenius norm for
SketchySVD (Tropp et al., 2019) (red line), Block Lanczos with Power Iteration (Musco & Musco,
2015) (black line), Sklearn's randomized SVD (skr) implementation (Halko et al., 2011) with (solid
cyan line) and without power iteration (dashed blue line), and Range-Net (green line) over 1000 runs
on the Parrot data. This shows that in order to obtain lower reconstruction errors a power iteration is
necessary, which quickly becomes intractable in a big-data setting. Further, note that the expected error
(upper bound) over multiple runs of Randomized SVD algorithms does not contract (reduce).

[Figure 4 panels: (a) True Image, (b) SketchySVD r = 20, (c) Range-Net r = 20, (d) $X_r$ for r = 20, (e) $|\tilde{X}^{sketchy} - X_r|$, (f) $|\tilde{X}^{net} - X_r|$.]
Figure 4: (a) True image, rank-20 reconstruction using (b) SketchySVD (oversampled rank k = 81), (c)
Range-Net (5 passes), (d) conventional SVD. Note that SketchySVD's reconstruction error ($10^{-1}$) is 6 orders of
magnitude apart from Range-Net's reconstruction error ($10^{-7}$) w.r.t. the true $X_r$ from conventional SVD.
[Figure 5 panels: (a) SketchySVD scree error, (b) Range-Net scree error, (c) reconstruction error vs. run.]
Figure 5: Scree error in the extracted singular values from (a) SketchySVD (≈ 1) and (b) Range-Net (≈ $10^{-4}$).
Notice the scale of errors. (c) Reconstruction errors (rank r = 20) for Range-Net and randomized SVD schemes
(with and without power iterations) for the Parrot image over 1000 runs.
Fig. 6 shows the cross-correlation between the extracted right singular vectors from SketchySVD (left) and
Range-Net (right) against conventional SVD for a rank-20 approximation. SketchySVD's oversampled
rank is k = 81, and still the extracted right singular vectors deviate substantially. This implies that
the extracted vectors from SketchySVD do not span the top rank-20 subspace of X, as opposed to Range-Net,
where Stage 1 explicitly ensures this span without any oversampling. The higher the vector index,
the higher the spread, owing to the random projections.
Range-Net has a near-perfect cross-correlation with
the true vectors, indicated by the solid diagonal and
zero off-diagonal (near GPU-precision) entries.
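For completeness, the cross-correlation matrices of Fig. 6 can be formed as below (a hedged sketch; `V_est` and `V_true` denote the n x r matrices of estimated and conventional-SVD right singular vectors).

```python
import numpy as np

def cross_correlation(V_est, V_true):
    """Entry (i, j) = |<v_est_i, v_true_j>|; close to the identity matrix for a perfect recovery."""
    return np.abs(V_est.T @ V_true)
```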
[Figure 6 panels: (a) SketchySVD, (b) Range-Net; cross-correlation heatmaps against the conventional SVD vectors (V).]
Figure 6: Cross-correlation between true (conventional SVD) and extracted right singular vectors
from (a) SketchySVD and (b) Range-Net for a rank-r = 20 approximation of the Parrot image.
We tabulate error metrics (see Appendix E for definitions) for SketchySVD and Range-Net at
various ranks in Table 2. Note that all errors are reported w.r.t. the true SVD, where errfr and errsp
denote the Frobenius and spectral errors respectively. One can easily see that, as the rank increases,
SketchySVD's performance keeps deteriorating on the χ²err metric (a measure of correlation
between true and estimated vectors), as is also evident from Fig. 6. Range-Net, on the other hand, has
consistently lower errors on all metrics while simultaneously being memory efficient.
Table 2: Metric Performance and Memory of SketchySVD vs. Range-Net, for Parrot image

| Rank | SketchySVD errfr | errsp | χ²err | Mem (MB) | Time (s) | Range-Net errfr | errsp | χ²err | Mem (MB) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r = 10 | 27.904 | 18.071 | 0.492 | 10.6 | 10 | 0.0 | 0.0 | 0.018 | 0.33 | 21 |
| r = 20 | 11.974 | 2.201 | 0.662 | 21.69 | 14 | 0 | 0 | 0.023 | 0.69 | 26 |
| r = 50 | 2.772 | 0.181 | 0.762 | 59.87 | 30 | 1.91e-7 | 0 | 0.027 | 1.72 | 43 |
| r = 100 | 0.614 | 2.14e-2 | 0.923 | 139.9 | 63 | 2.32e-7 | 1.08e-7 | 0.033 | 3.6 | 72 |
4.2 DIMENSION REDUCTION: MNIST (EIGEN / PCA)
Principal Component Analysis (PCA) is a special variant of Eigen decomposition, where the samples
are mean-corrected before constructing a feature covariance matrix, followed by an Eigen decomposition.
Note that Range-Net does not require construction of the feature covariance matrix and can directly
extract the eigenvectors and eigenvalues without any modification. This is because, for any data matrix
$X \in \mathbb{R}^{m \times n}$, the eigenvalues of the feature covariance matrix are the squares of the singular values of $X$,
and its eigenvectors are exactly the right singular vectors of $X$.
MNIST has 60k images of size 28 × 28 in the training set. We reshape each image into a 784-dimensional
vector to obtain the data matrix $X \in \mathbb{R}^{60000 \times 784}$, a tall and skinny matrix. In a streaming setting, the
mean feature vector computation requires one pass over the data matrix. This mean can subsequently be
used during network training (Stage 1) to mean-correct the streamed input vectors.
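A one-pass streaming mean of this kind is straightforward to implement; in the sketch below, the batch generator is a hypothetical placeholder.

```python
import numpy as np

def streaming_mean(batches, n_features=784):
    """Hedged sketch: accumulate the feature-wise mean in a single pass over streamed batches."""
    total = np.zeros(n_features)
    count = 0
    for B in batches:              # B: (batch, n_features) array streamed from disk
        total += B.sum(axis=0)
        count += B.shape[0]
    return total / count

# During Stage-1 training, each streamed batch is then centered on the fly: X_batch - mean.
```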
Table 3: Metric Performance and Memory of SketchySVD vs. Range-Net, for MNIST

| Rank | SketchySVD errfr | errsp | χ²err | Mem (GB) | Time (s) | Range-Net errfr | errsp | χ²err | Mem (MB) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r = 20 | 1.09e3 | 8.29e2 | 1.34 | 0.47 | 216 | 0 | 0 | 0.012 | 0.51 | 416 |
| r = 50 | 1.03e3 | 8.25e2 | 2.14 | 1.18 | 384 | 0 | 0 | 0.025 | 1.33 | 552 |
| r = 100 | 1.05e3 | 8.47e2 | 2.03 | 2.38 | 702 | 1.12e-7 | 1.01e-7 | 0.052 | 2.83 | 776 |
| r = 200 | 1.14e3 | 8.62e2 | 1.0 | 4.84 | 1452 | 2.36e-7 | 2.52e-7 | 0.071 | 6.29 | 1256 |
For this dataset, it is well known that r = 200 captures ≥ 90% of the variance. For
SketchySVD, this results in projection matrices of ranks k = 4r + 1 = 801 and s = 2k + 1 = 1603.
Since MNIST only has n = 784 features, SketchySVD (Alg. 1) is almost as, if not more, memory
intensive as conventional SVD for such tall and skinny matrices. Table 3 shows the error metrics
under different rank settings, where even with oversampling, SketchySVD's errors are high. Range-Net
on the other hand with an exact memory requirement (r(n + r)) can handle much larger full rank tall
and skinny matrices without incurring extraneous memory load. As discussed before, since this data
matrix is tall and skinny (60k × 784) we already know that for SketchySVD any rank-r s.t. (r ≥ 196)
will result in the oversampling parameters k ≥ 784 and s ≥ 1569. SketchySVD will now extract
lower-error SVD factors since the oversampled rank redundantly exceeds the feature dimension.
4.3 SCIENTIFIC COMPUTING: SANDY BIG DATA (SVD)
Satellite data gathered by NASA for Hurricane Sandy over the Atlantic Ocean represents the big-data
counterpart for scientific computations. The dataset is openly available[2] and comprises RGB
snapshots captured at approximately one-minute intervals. The full dataset consists of 896 × 719
pixel images at 1208 time instances and is 24 GB in size. We chose this particular dataset so that
a conventional SVD can still be performed on our machine (16 GB RAM) for benchmarking. Please
note that this restriction is imposed by the conventional SVD method due to its high main memory
requirement. In contrast, our neural SVD solver can handle datasets that are orders of magnitude
larger in size with the same hardware specification.
Range-Net can not only handle larger datasets than SketchySVD, but also ensures lower errors in
approximating the SVD factors. Tab. 4 shows the error metrics for Range-Net with a comparison of
peak main-memory load between SketchySVD and Range-Net for ranks r = [10, 50, 100]. Appendix
**F.3 shows a comparison of dynamic mode reconstructions obtained from SketchySVD, conventional**
full rank SVD, and Range-Net and the associated scree errors in the computed singular values.
Table 4: Metric Performance and Memory of SketchySVD vs. Range-Net, for Sandy

| Rank | SketchySVD errfr | errsp | χ²err | Mem (GB) | Time (s) | Range-Net errfr | errsp | χ²err | Mem (MB) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r = 10 | 1.43e3 | 7.72e2 | 0.47 | 2.56 | 325 | 0 | 0 | 0.011 | 0.39 | 371 |
| r = 50 | 7.18e2 | 1.81e2 | 2.04 | 12.48 | 507 | 0 | 0 | 0.018 | 2.01 | 607 |
| r = 100 | 4.32e2 | 6.68e1 | 3.022 | 24.91 | 792 | 1.12e-7 | 1.24e-7 | 0.021 | 4.19 | 779 |
Randomized SVD extracted factors deviate quite substantially when the user-specified rank r is such
that the oversampled rank k is much lower than the unknown rank f of a given data matrix. The
reader is also referred to the additional experiment in Appendix F.4, where a low-rank (r = 10)
approximation is extracted. Here, SketchySVD deviates quite substantially after rank 3 while Range-Net
still remains in excellent agreement with the baseline singular vectors and values.
2https://www.nasa.gov/mission_pages/hurricanes/archives/2012/h2012_Sandy.html
4.4 STORAGE COMPLEXITY ANALYSIS
To estimate the memory efficiency of Range-Net, let us consider the peak main memory (RAM)
requirement for the computation of the SVD factors. Range-Net has two layers in succession, one
corresponding to the low-rank projector $\tilde{V} \in \mathbb{R}^{r \times n}$ and the other to the rotation matrix $\Theta \in \mathbb{R}^{r \times r}$. For SketchySVD, the
peak memory load occurs during the construction of the core matrix $C \in \mathbb{R}^{s \times s}$ (see Alg. 1). This requires
that the two projection matrices $\Phi \in \mathbb{R}^{m \times s}$, $\Psi \in \mathbb{R}^{n \times s}$, one projected data matrix $Z \in \mathbb{R}^{s \times s}$, the two rank-$k$
decompositions $Q \in \mathbb{R}^{m \times k}$, $P \in \mathbb{R}^{n \times k}$, and the core matrix $C \in \mathbb{R}^{s \times s}$ be present in memory simultaneously.
The memory efficiency factor ($s_{eff}$) for a rank-$r$ approximation with $k = 4r + 1$, $s = 2k + 1$ is:
$$s_{eff} = \frac{\text{SketchySVD}}{\text{Range-Net}} = \frac{ns + ms + 2s^2 + mk + nk}{rn + r^2} = \frac{(m + n)(k + s) + 2s^2}{r(n + r)} \approx \frac{12(m + n) + 128r}{n + r}$$
$$\approx 7.67 \times 10^2 \ \text{for MNIST} \ (m = 60\text{k},\ n = 784,\ r = 200)$$
To validate this ratio, we constructed a synthetic dataset of $m = 50$k rows, and the number of columns was varied starting
at $n = 10$k in increments of 10k. The expected rank was held at $r = 200$. Fig. 7 shows the memory allocation (in megabytes
(MB)) of SketchySVD vs. Range-Net while $n$ varies between [10k, 150k]. When $m = 50$k, $n = 150$k and $r = 200$,
SketchySVD has a peak memory consumption of 14 GB due to oversampling parameters of $k = 801$, $s = 1603$, while
Range-Net only requires 916 MB. Since Range-Net has an exact memory requirement of $r(n + r)$ for a
rank-$r$ SVD, it always occupies two orders of magnitude less main memory than the oversampled SketchySVD (or any other
randomized method), and scales to larger datasets.

Figure 7: Peak memory load of Range-Net vs. SketchySVD on synthetic data.
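The memory counts behind $s_{eff}$ are easy to reproduce; the sketch below simply counts matrix entries held in memory under the SketchySVD oversampling rule stated above.

```python
def s_eff(m, n, r):
    """Hedged sketch of the memory-efficiency factor derived above (counts of matrix entries)."""
    k = 4 * r + 1                               # SketchySVD oversampled rank
    s = 2 * k + 1                               # SketchySVD core sketch size
    sketchy = (m + n) * (k + s) + 2 * s * s     # Phi, Psi, Q, P plus Z and C held simultaneously
    range_net = r * (n + r)                     # projection (r x n) and rotation (r x r) weights
    return sketchy / range_net

print(s_eff(60_000, 784, 200))                  # MNIST: ~7.7e2, in line with the estimate above
```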
5 DISCUSSION
From the numerical experiments it is evident that Range-Net performs on par with conventional SVD,
with GPU bit-precision results, while being extremely memory efficient. Range-Net constructs the
rank-r projector $V_*V_*^T$ iteratively to reach the tail energy lower bound given by EYM. Equivalently,
the rank-r minimizer $V_*$ of Eq. 1 gives the correct rank-r right projector $V_*V_*^T$ of X. Any arbitrary
(random) projection of X onto an oversampled rank-k subspace (k > r), rather than the tail energy
minimizing subspace $V_*V_*^T$, can inadvertently annihilate the desired top-ranking singular features,
resulting in irreducible approximation errors.
This issue is especially important in exploratory data analysis for scientific computing, where one is
interested not only in the top-most singular values of the dataset but also in the dominant phenomena
(singular vectors). Furthermore, with increasing digital sensor resolution the focus now is to isolate
and study the lower-energy (high-frequency, small spatial scale) features, as in the case of Hurricane
Sandy, where turbulence manifests as low-rank features (see Figs. 19, 24 in the Appendix).
We again point out that accuracy is gained by achieving (or finding) this minimizer, i.e., the lower bound of
the rank-r tail energy of X, and therefore the upper bound on this tail energy is of no consequence
to Range-Net. As pointed out by Musco & Musco (2015), upper bounding the tail energy does not
ensure that the approximation errors in the extracted singular values or vectors will reduce. Therefore,
small relative tail energy errors can mislead the reader into believing that the singular factors are equivalently
accurate, when the absolute errors can still be substantially large.
6 CONCLUSION
We present Range-Net as a low-weight, high-precision, fully interpretable neural SVD solver for
big data applications that is independently verifiable without performing a full SVD. We show that
our solution approach achieves lower error metrics for the extracted singular vectors and values
compared to Randomized SVD methods. A discussion is also provided on the limiting assumptions
and practical consequences of using Randomized SVD schemes for big data applications. We also
verify that our network minimization problems converge to the EYM tail energy bound in the Frobenius
norm at machine precision. A number of practical problems where SVD or Eigen decompositions
are required are considered, demonstrating the applicability of Range-Net to large-scale datasets.
A fair comparison is provided against the state-of-the-art randomized, streaming SVD algorithm, with the
conventional SVD solution as the baseline for benchmarking and verification.
REFERENCES
Gephi sample data sets. http://wiki.gephi.org/index.php/Datasets.
Lev Muchnik's data sets web page. http://www.levmuchnik.net/Content/Networks/NetworkData.html.
sklearn.utils.extmath.randomized_svd. https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html.
James Baglama and Lothar Reichel. Augmented implicitly restarted lanczos bidiagonalization
methods. SIAM Journal on Scientific Computing, 27(1):19–42, 2005.
Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix
reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014.
Matthew Brand. Incremental singular value decomposition of uncertain data with missing values. In
_European Conference on Computer Vision, pp. 707–720. Springer, 2002._
François Chollet. Keras. https://github.com/fchollet/keras, 2015.
Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In
_Proceedings of the forty-first annual ACM symposium on Theory of computing, pp. 205–214, 2009._
Julio Cesar Stacchini de Souza, Tatiana Mariano Lessa Assis, and Bikash Chandra Pal. Data
compression in smart distribution systems via singular value decomposition. IEEE Transactions
_on Smart Grid, 8(1):275–284, 2015._
Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and V Vinay. Clustering large graphs
via the singular value decomposition. Machine learning, 56(1-3):9–33, 2004.
Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices ii:
Computing a low-rank approximation to a matrix. SIAM Journal on computing, 36(1):158–183,
2006.
Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
Gene H Golub and Richard Underwood. The block lanczos method for computing eigenvalues. In
_Mathematical software, pp. 361–377. Elsevier, 1977._
Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):
217–288, 2011.
Liping Jing, Chenyang Shen, Liu Yang, Jian Yu, and Michael K Ng. Multi-label classification by
semi-supervised singular value decomposition. IEEE Transactions on Image Processing, 26(10):
4612–4625, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
_arXiv:1412.6980, 2014._
N Kishore Kumar and Jan Schneider. Literature survey on low rank approximation of matrices.
_Linear and Multilinear Algebra, 65(11):2212–2244, 2017._
Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD
_international conference on Knowledge discovery and data mining, pp. 581–588, 2013._
Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The quarterly journal of
_mathematics, 11(1):50–59, 1960._
Cameron Musco and Christopher Musco. Stronger approximate singular value decomposition via the
block lanczos and power methods. arXiv preprint arXiv:1504.05477, 16:27, 2015.
Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997.
Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a
positive-semidefinite matrix from streaming data. In Advances in Neural Information Processing
_Systems, pp. 1225–1234, 2017a._
Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms
for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):
1454–1485, 2017b.
Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Streaming low-rank matrix
approximation with an application to scientific simulation. SIAM Journal on Scientific Computing,
41(4):A2430–A2463, 2019.
Jalaj Upadhyay. Fast and space-optimal low-rank factorization in the streaming model with application
in differential privacy. arXiv preprint arXiv:1604.01429, 2016.
Shuqin Wang, Yongli Wang, Yongyong Chen, Peng Pan, Zhipeng Sun, and Guoping He. Robust pca
using matrix factorization for background/foreground separation. IEEE Access, 6:18945–18953,
2018.
Haishan Ye, Shusen Wang, Zhihua Zhang, and Tong Zhang. Fast generalized matrix regression with
applications in machine learning. arXiv preprint arXiv:1912.12008, 2019.
Sheng Zhang, Weihong Wang, James Ford, Fillia Makedon, and Justin Pearlman. Using singular
value decomposition approximation for collaborative filtering. In Seventh IEEE International
_Conference on E-Commerce Technology (CEC’05), pp. 257–264. IEEE, 2005._
A NEED FOR RANGE-NET
To illustrate the limitations of current streaming randomized SVD approaches, we consider a synthetic
data matrix $X$ with a slowly decaying singular value spectrum. The results section presents a
number of these singular value spectra for different practical datasets (Fig. 3) to demonstrate that the
decay rates are specific to the problem at hand.
$$X = \mathrm{diag}(\underbrace{450, 449, \cdots, 2, 1}_{f = 450}, \underbrace{0, \cdots, 0}_{n - f})$$
Here $X \in \mathbb{R}^{m \times n}$ is a strictly diagonal matrix with $m = n = 500$ and rank $f = 450$, where the
singular value spectrum decays linearly. Fig. 8 shows a comparison of reconstruction errors (see
Metrics in Appendix E.5 for the definition) for SketchySVD (Tropp et al., 2019) (red line), Block
Lanczos with Power Iteration (Musco & Musco, 2015) (black line), Sklearn's randomized SVD (skr)
implementation (Halko et al., 2011) with (solid cyan line) and without power iteration (dashed blue
line), and Range-Net (green line) over 1000 runs for this synthetic dataset.

Figure 8: Reconstruction errors (r = 20) for Range-Net and randomized SVD schemes (with and without
power iterations) for the non-exponentially decaying singular value spectrum over 1000 runs.
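This synthetic matrix, and the exact tail energies used as ground truth, can be generated in a few lines (a hedged sketch; the rank r shown is arbitrary).

```python
import numpy as np

n, f = 500, 450
# Diagonal matrix with linearly decaying singular values 450, 449, ..., 1 and 50 trailing zeros.
X = np.diag(np.concatenate([np.arange(f, 0, -1, dtype=float), np.zeros(n - f)]))

r = 20
sigma = np.sort(np.abs(np.diag(X)))[::-1]       # singular values of a diagonal matrix
tail_exact = np.sqrt(np.sum(sigma[r:] ** 2))    # exact rank-r tail energy ||X - X_r||_F
print(tail_exact)
```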
Please note that although power iteration improves the reconstruction error for both Block Lanczos
(Musco & Musco, 2015) and Sklearn's RandSVD (Halko et al., 2011), power iteration itself requires
a persistent presence of the data matrix X in main memory. For a practical big data scenario,
power iteration is therefore not a feasible alternative when the data matrix X or its sketch is itself too
big to be loaded into main memory. Note that the expected error (upper bound) over multiple
runs of Randomized SVD algorithms does not reduce. We further identify the following requirements
for SketchySVD to return SVD factors with relatively low approximation errors:
1. The decay rate of the singular values of the dataset must be exponential.
2. For a rank-f matrix, the desired rank r must be chosen such that the oversampled rank k is at least
f (k ≥ f) to achieve lower errors at scale compared to other runs.
We suggest that the reader also attempt the case where all the diagonal entries are strictly ones and
zeros under a high rank setting.
[Figure 9 panels: (a) Estimated Singular Values, (b) Tail Energy, (c) Relative Tail Energy (Frobenius), each plotted against rank for oversampled ranks k = 41, 81, 121, 161, 201 and the ground truth.]
Figure 9: SketchySVD approximation errors for a synthetic dataset with linear decay in the singular value spectrum,
corresponding to r = [10, 20, 30, 40, 50] with corresponding oversampled ranks k = [41, 81, 121, 161, 201].
Since the decay is non-exponential, SketchySVD accrues large approximation errors, and is hence impractical for real
datasets with similar behavior.
Fig. 9 shows the singular values extracted by SketchySVD for a linearly decaying spectrum with
corresponding errors in absolute and relative tail energies. The reader is referred to Appendix
B.1 for the definitions of tail energy and relative tail energy. Note that the synthetic data is a
diagonal matrix chosen specifically so that the exact tail energies can be computed in the Frobenius
norm as $\big(\sum_{i=r+1}^{k} x_{ii}^2\big)^{1/2}$. For a rank-r approximation, SketchySVD suggests oversampling by a
factor of k = 4r + 1 to extract the rank-r factors correctly. Hence, for oversampled ranks
k = [41, 81, 121, 161, 201] the corresponding top rank r = [10, 20, 30, 40, 50] extracted singular
values and vectors should have the lowest approximation errors. However, as shown in Fig. 9 (a), the
extracted singular values are an order of magnitude off w.r.t. the ground truth. Consequently,
Fig. 9 (b) shows that the absolute tail energies of the extracted features deviate quite substantially
from the true tail energy. Furthermore, we also notice that the deviations remain large as long as the
oversampled rank k is such that k < f. For a practical dataset, f is either unknown, or almost full
rank (f ≈ min(m, n)), or both, and can only be detected by performing a full SVD on the dataset. This
poses a serious restriction on SketchySVD's reliability for a realistic big data application, due to the
exponential decay assumption.
We also notice that for smaller values of r, the accrued error in both the extracted singular values and