# RANGE-NET: A HIGH PRECISION NEURAL SVD
**Anonymous authors**
Paper under double-blind review
ABSTRACT
For Big Data applications, computing a rank-r Singular Value Decomposition
(SVD) is restrictive due to the main memory requirements. Recently introduced
streaming Randomized SVD schemes work under the restrictive assumption that
the singular value spectrum of the data has an exponential decay. This is seldom
true for any practical data. Further, the approximation errors in the singular vectors
and values are high due to the randomized projection. We present Range-Net as a low-memory
alternative to rank-r SVD that satisfies the lower bound on tail energy given by the Eckart-Young-Mirsky
(EYM) theorem at machine precision. Range-Net is a deterministic two-stage neural optimization approach
with random initialization, where the memory requirement depends explicitly on the feature dimension and
desired rank, independent of the sample dimension. The data samples are read in a streaming manner,
with the network minimization problem converging to the desired rank-r approximation. Range-Net is
fully interpretable, with all network outputs and weights having a specific meaning. We provide theoretical
guarantees that the Range-Net extracted SVD factors satisfy the EYM tail-energy lower bound, along with
numerical experiments on real datasets at various scales that confirm this bound. A comparison against
the state-of-the-art streaming Randomized SVD shows that Range-Net is six orders of magnitude more
accurate in terms of tail energy while correctly extracting the singular values and vectors.
1 INTRODUCTION
Singular Value Decomposition (SVD) is pivotal to exploratory data analysis in identifying an invariant
structure under a minimalistic representation (assumptions on the structure) containing the span
of resolvable information in the dataset. Finding a low rank structure is a fundamental task in
applications including Image Compression (de Souza et al., 2015), Image Recovery (Brand, 2002),
Background Removal (Wang et al., 2018), Recommendation Systems (Zhang et al., 2005) and as a
pre-processing step for Clustering (Drineas et al., 2004) and Classification (Jing et al., 2017). With
the advent of digital sensors and modern day data acquisition technologies, the sheer amount of data
now requires that we revisit the solution scheme with reduced memory consumption as the target. In
this work, we reformulate SVD with special emphasis on the main memory requirement, which currently
precludes its use for big data applications, with no loss in accuracy.
It is well known that natural data matrices have a decaying spectrum wherein saving the data in
memory in its original form is either redundant or not required from an application point of view.
However, any assumption on the decay rate can only be validated if the singular value decomposition
is known a priori, which is seldom the case in exploratory data analysis (see Fig. 3). Visually
assessing a rank-r approximation for image processing applications might seem qualitatively correct
(see Fig. 4), but it is still prone to large errors due to limited human visual acuity (see Fig. 6). This is
further exacerbated when the application at hand is associated with scientific computations wherein
the anomalies or unaccounted phenomena are still being explored from large scale datasets. The reader
is preemptively referred to Fig. 19 where the high frequency features related to turbulence cannot
be disregarded. Furthermore, for classification and clustering problems where feature dimension
reduction is desirable it is imperative that a low-rank approximation of a dataset contains most
(≥ 90%) of the original information content without altering the subspace information. In this case,
an over-sampled rank can exceed the feature dimension of the data itself (see Section 4.2).
1.1 PROBLEM STATEMENT
Let us denote the raw data matrix as $X \in \mathbb{R}^{m \times n}$ of rank $f \le \min(m, n)$ and its approximation as
$X_r \in \mathbb{R}^{m \times n}$. The SVD of $X$ is $X = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times f} = [u_1, \cdots, u_f]$ and $V \in \mathbb{R}^{n \times f} = [v_1, \cdots, v_f]$ are its left and right singular vectors respectively, and $\Sigma \in \mathbb{R}^{f \times f} = \mathrm{diag}(\sigma_1, \cdots, \sigma_f)$
holds its largest non-zero singular values. The rank-$r$ ($r \le f$) truncation of $X$ is then $X_r = U_r \Sigma_r V_r^T$,
where $\Sigma_r = \Sigma[1\!:\!r]$ contains the top $r$ singular values, and $U_r = U[1\!:\!r]$ and $V_r = V[1\!:\!r]$ are the corresponding
left and right singular vectors. In other words, $X = U \Sigma V^T = U_r \Sigma_r V_r^T + U_{f \setminus r} \Sigma_{f \setminus r} V_{f \setminus r}^T = X_r + X_{f \setminus r}$.
Here, $U_{f \setminus r}, V_{f \setminus r}$ are the trailing $f - r$ left and right singular vectors.
**Theorem 1 (Eckart-Young-Mirsky Theorem, Eckart & Young, 1936; Mirsky, 1960).** Let $X \in \mathbb{R}^{m \times n}$
be a real, rank-$f$ matrix with $m \ge n$ and singular value decomposition $X = U \Sigma V^T$, where
the orthonormal matrices $U, V$ contain the left and right singular vectors of $X$ and $\Sigma$ is a diagonal
matrix of singular values. Then for an arbitrary rank-$r$ ($r \le f$) matrix $B_r \in \mathbb{R}^{m \times n}$,
$$\|X - B_r\|_F \ge \|X - X_r\|_F$$
where $X_r = U_r \Sigma_r V_r^T$, $\Sigma_r$ is the diagonal matrix of the largest $r$ singular values, and $U_r, V_r$ are
the corresponding left and right singular vector matrices.
The problem statement is then: Given $X \in \mathbb{R}^{m \times n}$, find $\hat{X}$ such that
$$\hat{X}_* = \arg\min_{\hat{X} \in \mathbb{R}^{m \times n},\ \mathrm{rank}(\hat{X}) \le r} \|X - \hat{X}\|_F \qquad (1)$$
In effect, the minimizer $\hat{X}_*$ of the above problem gives us the rank-$r$ approximation of $X$ such that
$X_r = \hat{X}_*$. In this work we utilize the minimizer of the above problem to extract the top rank-$r$ SVD
factors of $X$ without loading the entire data matrix into the main memory. Note that the minimizer
naturally attains the lower bound on this tail energy in addition to being a rank-$r$ approximation.
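This bound is easy to check numerically. The following is an illustrative sketch (the matrix sizes and the random rank-$r$ competitor are arbitrary choices, not part of the paper's method): the rank-$r$ SVD truncation attains the tail energy, and any other rank-$r$ projection can only do worse.

```python
import numpy as np

# Illustrative check of Eq. 1 and Theorem 1 on a small random matrix (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
r = 10

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]        # rank-r truncation X_r
tail = np.linalg.norm(X - X_r)                     # ||X - X_r||_F
print(tail, np.sqrt(np.sum(s[r:] ** 2)))           # identical: sqrt of the trailing sigma_i^2

# An arbitrary rank-r competitor, here a random orthonormal projection, cannot beat it:
Q, _ = np.linalg.qr(rng.standard_normal((50, r)))  # random n x r orthonormal basis
B_r = X @ Q @ Q.T                                  # a rank-r matrix
assert np.linalg.norm(X - B_r) >= tail - 1e-10
```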
1.2 MAIN CONTRIBUTIONS
**Data and Representation Driven Neural SVD:** The representation driven network loss terms ensure
that the data matrix $X$ is decomposed into the desired SVD factors such that $X = U\Sigma V^T$. In
the absence of the representation enforcing loss term, the minimizer of Eq. 1 results in an arbitrary
decomposition $X = ABC$ different from the SVD factors.
**A Deterministic Approach with GPU Bit-precision Results: The network can be initialized with**
weights drawn from a random distribution where the iterative minimization is deterministic. The
streaming order of the samples is of no consequence and the user is free to choose the order in which
the samples are processed in a batch-wise manner (indexed or randomized).
**First Streaming Architecture with Exact Low Memory Cost:** Range-Net requires an exact memory
allocation based upon the desired rank $r$ and the data dimensions $X \in \mathbb{R}^{m \times n}$, given by $r(n + r)$,
independent of the sample dimension $m$. This is the first streaming algorithm that does not require the
user to wait until the streaming step is complete, contrary to randomized streaming algorithms.
**Layer-wise Fully Interpretable:** Range-Net is a low-weight, fully interpretable, dense neural network
where all the network weights and outputs have a precise definition. The network weights are
placeholders for the right (or left) orthonormal vectors upon convergence of the network minimization
problems (see Appendix D). The user can explicitly plug in a ground truth solution to verify the network
design and directly arrive at the tail energy bound.
2 RELATED WORKS
The core idea behind randomized matrix decomposition is to make one or more passes over the data
and compute efficient sketches. They can be broadly categorized into four branches: 1) Sampling
based methods (Subset Selection (Boutsidis et al., 2014) and Randomized CUR (Drineas et al.,
2006)); 2) Random Projection based QR (Halko et al., 2011); 3) Randomized SVD (Halko et al.,
2011); and 4) Power iteration methods (Musco & Musco, 2015). The sketches can represent any
combination of row space, column space or the space generated by the intersection of rows and
columns (core space). However, all of these methods require loading the entire data in memory.
Readers are referred to Kishore Kumar & Schneider (2017); Ye et al. (2019) for an expanded survey.
Conventional SVD, although deterministic and accurate, becomes expensive as the data size
increases, and requires $r$ passes over the data (see Table 1). The two branches of interest to us are
randomized SVD and power iteration methods for extracting SVD factors. Randomized SVD
algorithms (Halko et al., 2011) are generally a two-stage process: 1) Randomized Sketching uses
random sampling to obtain a reduced matrix (or matrices) covering any combination of the row, column
and core space of the data; and 2) Deterministic Post-processing performs conventional SVD on
the reduced system from the Randomized Sketching stage. These approaches make only one pass over
the data, assuming that the singular value spectrum decays rapidly.
Power iteration based approaches (Musco & Musco, 2015) require multiple passes over the data
and are used when the singular value spectrum decays slowly. This class of algorithms constructs a
Krylov matrix inspired by Block Lanczos (Golub & Underwood, 1977) to obtain a polynomial series
expansion of the sketch. Although these algorithms achieve lower tail-energy errors, they cannot be
used in big-data applications when $X$ itself is too large to be retained in the main memory. Here,
constructing a Krylov matrix with higher order terms such as $AA^T$ or $A^T A$ is not feasible[1].
Table 1: Current best randomized SVD methods. k, l, s are overestimated sketch sizes for a rank-r estimate s.t.
k, l, s > r. Note that Range-Net has an exact memory requirement (as conventional SVD), unlike order-bounded
randomized methods. Here, deterministic implies that the solution obtained upon convergence is deterministic.

| Method | Halko et al. (2011) | Upadhyay (2016) | Tropp et al. (2017b) | Tropp et al. (2019) | Range-Net | Conventional SVD |
|---|---|---|---|---|---|---|
| Space Complexity | O(k(m + n)) | O(k(m + n) + s²) | O(km + nl) | O(k(m + n) + s²) | r(n + r) | n(m + 2n) |
| # Passes | 1 | 1 | 1 | 1 | ≤ 5 | r |
| Type | Randomized | Randomized | Randomized | Randomized | Deterministic | Deterministic |
Due to main-memory restrictions on remote compute machines, streaming (Clarkson & Woodruff,
2009; Liberty, 2013) algorithms became popular. For low-rank SVD approximations these involve
streaming the data and updating low-memory sketches covering the row, column and core spaces.
Existing randomized SVD capable of streaming include Halko et al. (2011); Upadhyay (2016); Tropp
et al. (2017a; 2019), each with different sketch sizes and upper bounds on approximation errors
(Table 1).
SketchySVD (Tropp et al., 2019) is the state-of-the-art streaming randomized SVD, with sketch sizes
comparable to its predecessors, tighter upper bounds on the tail energy, and lower errors. As a
two-stage approach, SketchySVD (Alg. 1) constructs an overestimated rank-(k, s) sketch of the data
based on row, column and core projections. A QR decomposition on the row and column sketches
gives an estimate of the rank-k subspace. This is followed by a conventional SVD on the core matrix
to extract its singular values and vectors. Finally, the singular vectors are returned after projecting
them back to the original row and column space. The time cost of SketchySVD is $O(k^2(m + n))$
with memory cost $O(k(m + n) + s^2)$ and oversampling parameters $k = 4r + 1$ and $s = 2k + 1$.
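For reference, the generic sketch-then-post-process pipeline underlying these methods can be written in a few lines. The sketch below follows the basic randomized range finder in the spirit of Halko et al. (2011); it is not SketchySVD's three-sketch algorithm (Alg. 1), and the oversampling value is only illustrative.

```python
import numpy as np

def randomized_svd_sketch(X, r, oversample=10, seed=0):
    """Minimal randomized SVD in the spirit of Halko et al. (2011) -- an illustrative sketch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = r + oversample                       # oversampled sketch size
    Omega = rng.standard_normal((n, k))      # random test matrix
    Q, _ = np.linalg.qr(X @ Omega)           # orthonormal basis for an approximate range of X
    B = Q.T @ X                              # small (k x n) projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :r], s[:r], Vt[:r]    # approximate top-r SVD factors
```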
2.1 LIMITATIONS OF RANDOMIZED SVD APPROACHES
We discuss a few limitations that led us to reformulate the problem in the spirit of EYM theorem.
The reader is referred to Appendix A for a detailed discussion and supporting numerical examples to
further enunciate these limitations.
**Tall and Skinny Matrices:** For a rank-$r$ approximation of $X \in \mathbb{R}^{m \times n}$ of rank $f$, Randomized SVD
methods rely upon rank-$k$ ($k > r$) sketches of $X$. However, these methods are useful only when
$k \ge f$, and for practical datasets $f$ is close to $\min(m, n)$, so the memory requirement can still be
overbearing. The reader is referred to Section 4.2 for a practical example.
**Exponential Decay of Singular Values:** Assuming exponential decay implies that the rank of $X$ itself
is such that $f \ll \min(m, n)$. For real-world applications, the data matrices are almost full rank
($f \approx \min(m, n)$), and a rank-$r$ truncation is chosen such that the desired dominant features are
accounted for. Appendix A shows a synthetic case with a non-exponential decay of singular values
where sketching accrues substantial errors. Further, it is difficult to assume that the decay rate will
follow a strict functional form: it may mix linear, exponential and other decays (see Fig. 3).
**Upper Bound on Tail Energy:** The problem statement in Eq. 1 entails finding a minimum, with
the minimizer providing the lower bound on the tail energy. Even if the solution scheme is upper
bounded (Halko et al., 2011; Upadhyay, 2016; Tropp et al., 2017a; 2019), finding the minimizer $\hat{X}_*$ of Eq. 1,
or equivalently attaining the lower bound, is still necessary.
**Approximation Errors: A low-rank SVD solver that does not iteratively compute the projection**
(left or right) while solving Eq. 1 cannot extract SVD factors with low errors even with multiple
passes over the data matrix or multiple runs. As shown later in Theorem 2, any subspace projection
(left or right) of the original data matrix that does not correspond to the minimizer in Eq. 1 increases
the tail energy and therefore results in incorrect low-rank SVD factors (singular values and vectors).
**Memory Requirement:** Randomized SVD requires an optimal choice of hyper-parameters (sketch
sizes etc.) that are specific to the dataset being processed. In a practical, limited-memory scenario,
this entails tuning the hyper-parameters for an optimal trade-off between memory requirement, compute
time, and an approximation error that does not violate the upper bound on the tail energy.
**Remark.** A low relative error in tail energy does not imply that the extracted singular values and vectors
will have similarly low relative errors at scale. This issue was raised by Musco & Musco (2015):
merely upper bounding the tail energy in the Frobenius or spectral norm does not bound the
approximation errors in the extracted singular values or vectors. Therefore, for a fair comparison we
report error metrics on the extracted singular factors for all our numerical experiments.

[1] Readers are referred to Fig. 5 (c) and Fig. 8 for a performance comparison between power iteration schemes
and Range-Net for a mid-sized real and a synthetic dataset that can be loaded on our compute machine.
3 RANGE-NET: A 2-STAGE SVD SOLVER
We present Range-Net that explicitly relies upon solving the minimization problem in Eq. 1 to
achieve the lower bound on the tail-energy for a desired rank-r approximation of a data matrix X
under a streaming setting. The readers are referred to Appendix B for the preliminaries followed by
proofs of theorems and lemmas associated with each of the two stages.
[Figure 1 schematic: the data $X$ (image/graph/matrix) is streamed in batches; Stage 1 learns the projector $\tilde{V}$, yielding $XV_*$ and the rank-$r$ approximation $XV_*V_*^T$, and Stage 2 learns the rotation $\Theta$, yielding $XV_*\Theta$ and $\Sigma_r$.]
Figure 1: An overview of the low-memory, two-stage Range-Net SVD for Big Data Applications. Stage 1
identifies the span of the desired rank-r approximation. Stage 2 rotates this span to align with the singular
vectors while extracting the singular values of the data. The input data can be streamed from either a server or
secondary memory. If the target is just a rank-r compression then Stage-2 can be discarded without any loss in
accuracy. Stage-2 only orders the rank-r features based upon their respective tail energies.
3.1 NETWORK ARCHITECTURE
The proposed network architecture is divided into two stages: (1) Projection and (2) Rotation, each
containing only one dense layer of neurons with linear activation functions and no biases. Fig. 2
shows an outline of this two-stage network architecture, where all the weights and outputs have a
specific meaning enforced using representation and data driven loss terms. Contrary to randomized
SVD algorithms, the subspace projection (Stage 1) is not specified preemptively (consequently, no
assumptions are made) but is computed by solving an iterative minimization problem following the EYM theorem,
corresponding to Eq. 1. The rotation stage (Stage 2) then reuses the EYM theorem in a modified
form to extract the singular vectors and values.
**Stage 1: Rank-r Sub-space Identification:** The projection stage constructs an orthonormal basis
that spans the $r$-dimensional sub-space of a data matrix $X \in \mathbb{R}^{m \times n}$ of an unknown rank $f \le \min(m, n)$. This orthonormal basis ($\tilde{V}$) is extracted as the Stage-1 network weights once the network
minimization problem converges to a fixed point. The representation loss $\|\tilde{V}^T \tilde{V} - I_r\|_F$ in Stage 1
enforces the orthonormality requirement on the projection space (even when $r > f$), while the data-driven
loss $\|X - X\tilde{V}\tilde{V}^T\|_F$ minimizes the tail energy. Although the minimization problem is
non-convex, the tail energy is guaranteed to converge to the minimum at machine precision. The
reader is referred to Appendix C for a discussion of the minimization problem (a bi-quadratic loss
function with $2^r$ global minima) and details regarding the convergence behavior.
[Figure 2 schematic: a linear Projection layer $\tilde{V} \in \mathbb{R}^{n \times r}$ followed by a linear Rotation layer $\Theta \in \mathbb{R}^{r \times r}$, producing $Y = X\tilde{V}\Theta$. Stage-1 loss: $\|\tilde{V}^T\tilde{V} - I_r\|_F^2$ (representation) $+ \|X\tilde{V}\tilde{V}^T - X\|_F^2$ (data). Stage-2 loss: $\|Y\Theta^T - X\tilde{V}\|_F^2$ (data) $+ \|Y^T Y - \mathrm{diag}(Y^T Y)\|_F^2$ (representation).]
Figure 2: Network Architecture: Projection (Stage-1) and Rotation (Stage-2) for a 2-stage SVD.

**Theorem 2.** For any $r, f \in \mathbb{Z}^+$, $0 < r \le f$: if the tail energy of a rank-$f$ matrix $X \in \mathbb{R}^{m \times n}$, $f \le \min(m, n)$, with respect to an arbitrary rank-$r$ matrix $B_r = X\tilde{V}_r\tilde{V}_r^T$ is bounded below by the tail
energy of $X$ with respect to its rank-$r$ approximation $X_r = XV_rV_r^T$ as $\|X - B_r\|_F \ge \|X - X_r\|_F$,
where $V_r = \mathrm{span}\{v_1, v_2, \cdots, v_r\}$ and the $v_i$ are the right singular vectors corresponding to the largest
$r$ singular values, then the minimizer of $\arg\min_{\tilde{V} \in \mathbb{R}^{n \times r}} \|X - X\tilde{V}\tilde{V}^T\|_F$ is $V_*$ such that $V_*V_*^T = V_rV_r^T$.
As per Theorem 2, the equality holds true only when $\mathrm{span}\{B_r\} = \mathrm{span}\{X_r\}$. Further, if we define
$B_r$ as $B_r = X\tilde{V}\tilde{V}^T$, then $V_*V_*^T = V_rV_r^T$ and $V_*^TV_* = I_r$, where $V_r$ is a rank-$r$ matrix whose column
vectors are the top-$r$ right singular vectors of $X$. The minimization problem then reads
$$\min_{\tilde{V}} \|X - X\tilde{V}\tilde{V}^T\|_F \quad \text{s.t.} \quad \tilde{V}^T\tilde{V} = I_r \qquad (2)$$
with a minimum at the fixed point $V_* = \mathrm{span}\{v_1, \ldots, v_r\}$, where $v_{i=1,\ldots,r}$ are the right singular
vectors of $X_r$. This minimization problem describes the Stage 1 loss function of our network
architecture. Upon convergence, the minimizer $\tilde{V}_* = V_*$ is such that $V_*V_*^T = V_rV_r^T$ following Theorem 2,
where $V_r$ is the matrix whose columns are the right singular vectors of $X$ corresponding to the largest $r$
singular values of $X$.

**Lemma 2.1.** If $V_r^T V_r = I_r$ and $V_rV_r^T = V_*V_*^T$, then $V_*^TV_* = I_r$.

**Lemma 2.2.** If $X \in \mathbb{R}^{m \times n}$ is a rank-$f$ matrix, then for any rank $r > f$, where $\{r, f\} \le \min(m, n)$,
if $V_*^TV_* = I_r$ and $V_*V_*^T = V_rV_r^T$, then $V_r^TV_r = I_r$.

**Remark.** Note that for $r \le f$, the orthonormality constraint is trivially satisfied, as shown in Lemma
2.1. However, for $r > f$, the orthonormality constraint ensures that the column vectors in $V_*$ are
orthonormal (see Lemma 2.2), allowing us to extract orthonormal right singular column vectors of
$V_r$ from the Stage 2 minimization problem.
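The paper's exact training code is not reproduced here; the following is a minimal sketch of how the Stage-1 minimization (Eq. 2) could be run over streamed batches, assuming a TensorFlow 2.x implementation with the Adam optimizer (both Keras and Adam are referenced by the paper). The batch-generator factory `stream_batches` and all hyper-parameter values are illustrative assumptions.

```python
import tensorflow as tf

def train_stage1(stream_batches, n, r, epochs=5, lr=1e-3):
    """Hedged sketch of Stage 1: learn V (n x r) minimizing Eq. 2 over streamed row-batches of X."""
    V = tf.Variable(tf.random.normal([n, r], stddev=0.01))      # projection-layer weights
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(epochs):                                     # one epoch = one pass over the data
        for X_batch in stream_batches():                        # (batch, n) rows of X
            with tf.GradientTape() as tape:
                XV = tf.matmul(X_batch, V)                      # X V
                recon = tf.matmul(XV, V, transpose_b=True)      # X V V^T
                data = tf.reduce_sum(tf.square(X_batch - recon))     # ||X - X V V^T||_F^2
                gram = tf.matmul(V, V, transpose_a=True)              # V^T V
                rep = tf.reduce_sum(tf.square(gram - tf.eye(r)))      # ||V^T V - I_r||_F^2
                loss = data + rep
            grads = tape.gradient(loss, [V])
            opt.apply_gradients(zip(grads, [V]))
    return V                                # converged V_* spans the top rank-r right subspace
```

Each full pass over the streamed data corresponds to one epoch here; the "≤ 5 passes" of Table 1 would then correspond to a handful of such epochs.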
**Stage 2: Singular Value and Vector Extraction:** The rotation stage then extracts the singular values
by rotating the orthonormal vectors ($V_*$) to align with the right singular vectors ($V_r = V_*\Theta_r$). From
the fixed point of the Stage-1 minimization problem Eq. 2 we have $V_*V_*^T = V_rV_r^T$. According to
the EYM theorem, the tail energy of the rank-$r$ matrix $XV_*C_r$, where $C_r$ is an arbitrary rank-$r$, real-valued,
square matrix, with respect to $XV_*$ is now bounded below by 0, i.e. $\|XV_* - XV_*C_r\|_F \ge 0$.

**Theorem 3.** Given a rank-$r$ matrix $XV_* \in \mathbb{R}^{m \times r}$ and an arbitrary rank-$r$ matrix $C \in \mathbb{R}^{r \times r}$, following
Theorem 1 the tail energy of $XV_*$ with respect to $XV_*C$ is bounded as $\|XV_* - XV_*C\|_F \ge 0$,
where the equality holds true if and only if $C = I_r$.

**Lemma 3.1.** If $C = \Theta_r\Theta_r^T$, where $\Theta_r \in \mathbb{R}^{r \times r}$ is a rank-$r$ matrix such that $C = I_r$, then $\Theta_r$ is a
real-valued unitary matrix in an $r$-dimensional Euclidean space.

**Theorem 4.** Given a rank-$r$ matrix $XV_* \in \mathbb{R}^{m \times r}$, such that $V_*V_*^T = V_rV_r^T$ where $V_r$ is the matrix
whose column vectors are the top-$r$ right singular vectors of $X$, and a real-valued unitary matrix
$\Theta_r \in \mathbb{R}^{r \times r}$, then $(XV_*\Theta_r)^T(XV_*\Theta_r)$ is a diagonal matrix $\Sigma_r^2 = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \cdots, \sigma_r^2)$, where the
$\sigma_i$ are the top-$r$ singular values of $X$, if and only if $V_*\Theta_r = V_r$.

From Theorem 3 and Lemma 3.1 we know that $C_r = \Theta_r\Theta_r^T$, where $\Theta_r$ is a rank-$r$, unitary matrix
in an $r$-dimensional Euclidean space. Further, from Theorem 4 we have that $(XV_*\Theta_r)^T(XV_*\Theta_r)$ is
a diagonal matrix $\Sigma_r^2 = \mathrm{diag}(\sigma_1^2, \cdots, \sigma_r^2)$, where the $\sigma_i$ are the top-$r$ singular values of $X$, if and only
if $V_*\Theta_r = V_r$. Setting $Y = XV_*\Theta_r$, the minimization problem now reads:
$$\min_{\Theta_r} \|Y\Theta_r^T - XV_*\|_F \quad \text{s.t.} \quad Y^T Y - \mathrm{diag}(Y^T Y) = 0 \qquad (3)$$
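The Stage-2 objective can be sketched in the same style. In the sketch below the orthogonality constraint of Eq. 3 is folded into the loss as a penalty term, mirroring the representation term shown in Figure 2; the penalty weighting is an illustrative assumption.

```python
import tensorflow as tf

def stage2_loss(XV, Theta, penalty=1.0):
    """Hedged sketch of Eq. 3: XV is X V_* (m x r), Theta is the r x r rotation being learned."""
    Y = tf.matmul(XV, Theta)                                              # Y = X V_* Theta
    data = tf.reduce_sum(tf.square(tf.matmul(Y, Theta, transpose_b=True) - XV))  # ||Y Theta^T - X V_*||_F^2
    gram = tf.matmul(Y, Y, transpose_a=True)                              # Y^T Y
    off_diag = gram - tf.linalg.diag(tf.linalg.diag_part(gram))           # off-diagonal part of Y^T Y
    rep = tf.reduce_sum(tf.square(off_diag))                              # ||Y^T Y - diag(Y^T Y)||_F^2
    return data + penalty * rep
```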
**Remark.** Note that Stage 1 can be verified numerically, independently of Stage 2, by checking
whether the orthonormality condition is met in addition to the minimization problem converging to
the tail-energy bound. Similarly, the Stage 2 minimization problem will return a rotation matrix $\Theta_r$
($\Theta_r^T\Theta_r = \Theta_r\Theta_r^T = I_r$, $\det(\Theta_r) = \pm 1$) upon convergence that can again be verified numerically.
As discussed previously, this choice of loss terms equipped with a Frobenius norm ensures a rank-$r$
approximation in accord with the Eckart-Young-Mirsky (EYM) theorem. We therefore state that
the expected value of the Stage-1 loss term at the minimum must correspond to the rank-$(n - r)$
tail energy. Further, the Stage-2 loss is expected to reach a machine-precision zero at the minimum.
Once the network minimization problems converge, the singular values are extracted from the Stage 2
network weights $\Theta_r$ as $\Sigma_r^2 = (XV_*\Theta_r)^T(XV_*\Theta_r)$. The right singular vectors can now be extracted
using the Stage 2 layer weights as $V_r = V_*\Theta_r$. Once $V_r$ and $\Sigma_r$ are known, the left singular vectors are
$U_r = XV_*\Theta_r\Sigma_r^{-1}$. Please note that for $r > f$, $r - f$ singular values are zero and therefore $\Sigma_r^{-1}$
implies inverting only the non-zero singular values using a threshold of $10^{-8}$.
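Once both stages have converged, the factors follow directly from the trained weights. A minimal NumPy sketch is given below (with $X$ held in memory purely for illustration; in the streaming setting $XV_*\Theta_r$ would be accumulated batch-wise).

```python
import numpy as np

def extract_factors(X, V_star, Theta, tol=1e-8):
    """Hedged sketch: recover (U_r, Sigma_r, V_r) from converged weights V_star (n x r), Theta (r x r)."""
    Y = X @ V_star @ Theta                                  # X V_* Theta, shape (m, r)
    sigma = np.sqrt(np.clip(np.diag(Y.T @ Y), 0.0, None))   # Sigma_r from the (diagonal) Gram matrix
    Vr = V_star @ Theta                                     # right singular vectors
    inv_sigma = np.where(sigma > tol, 1.0 / np.maximum(sigma, tol), 0.0)  # threshold tiny values
    Ur = Y * inv_sigma                                      # U_r = X V_* Theta Sigma_r^{-1} (column scaling)
    return Ur, sigma, Vr
```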
**Implementation Details:** A detailed discussion and justification of our implementation choices are
given in Appendix E. These include: a) Activation: all activation functions are linear with no
biases, since SVD requires linearly separable orthogonal features; b) Data Split: we do not perform
any split of the training data, since the SVD factors are unique to the full data; c) Data Streaming:
we stream data from HDD into main memory to avoid loading all data samples at once; d) Training and Setup; e)
Error Metrics; and f) Loss Profiles. Note that Range-Net has no hyper-parameters and therefore does
not require any post-hoc tuning or adjustments.
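As one concrete possibility for item c) above, batches can be read off disk with a memory map so that only the current batch occupies RAM; the file layout and batch size below are illustrative assumptions, not the paper's exact I/O path.

```python
import numpy as np

def stream_batches(path, m, n, batch_size=256, dtype=np.float32):
    """Hedged sketch of a disk-backed batch loader for an m x n row-major binary file."""
    X = np.memmap(path, dtype=dtype, mode="r", shape=(m, n))
    for start in range(0, m, batch_size):
        yield np.asarray(X[start:start + batch_size])    # copy only this batch into main memory
```

Combined with the Stage-1 sketch above, this could be passed as `lambda: stream_batches(path, m, n)` so that a fresh pass over the data starts at each epoch.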
4 RESULTS
We present numerical experiments for three datasets: (a) Parrot image, (b) MNIST, and (c) Hurricane
Sandy. The reader is referred to Appendix E.5 for the definitions of the error metrics used. Note that,
these error metrics rely upon conventional SVD as the baseline for a fair comparison. For additional
numerical experiments on sparse graph datasets and other low rank approximations see Appendix F.
SketchySVD’s algorithmic implementation (Alg. 1) can be found in Appendix G.
[Figure 3 panels: log-log singular value spectra for (a) Parrot, (b) MNIST, (c) Sandy.]
Figure 3: Singular value spectrum for the three practical datasets considered in this work. One can visually
assess that the decay rate of the singular values is non-exponential.
4.1 IMAGE COMPRESSION: PARROTS (SVD)
As an example of SVD on natural images, we use the well-known Parrots image from the image
processing domain. The original image is in RGB format, converted to gray scale for demonstration purposes,
followed by normalization to [0, 1]. This 1024 × 1536 data matrix is then used
to compute a rank r = 20 approximation for comparison and numerical analysis. Fig. 4 shows the
result of the low-rank reconstruction for SVD, SketchySVD and Range-Net. Visually one can verify
that Fig. 4 (c), (d) are similar while Fig. 4 (b) is different. To make the approximation error clearer,
we plot the absolute difference of SketchySVD and our network from the rank-truncated image. Fig.
4 (e), (f) show the corresponding plots with heatmaps imposed for clarity. Notice that while
the reconstruction error for our network ($\approx 10^{-7}$) is close to the GPU precision, SketchySVD has a
significantly higher error scale ($\approx 10^{-1}$), validating the artifacts in the approximated image.
The singular value spectrum does not decay exponentially (Fig. 3) and the data matrix is near-full
rank (f ≈ 1024). Fig. 5 (a, b) shows the scree error as the absolute difference between the predicted
and the true singular values. For SketchySVD, the error not only fluctuates across the top r = 20 values,
but the scale of the fluctuations is also around 1. In comparison, Range-Net incurs significantly lower errors
in the singular values (scale of $10^{-4}$). Fig. 5 (c) shows the reconstruction errors in the Frobenius norm for
SketchySVD (Tropp et al., 2019) (red line), Block Lanczos with Power Iteration (Musco & Musco,
2015) (black line), Sklearn's randomized SVD (skr) implementation (Halko et al., 2011) with (solid
cyan line) and without power iteration (dashed blue line), and Range-Net (green line) over 1000 runs
on the Parrot data. This shows that in order to obtain lower reconstruction errors a power iteration is
necessary, which quickly becomes intractable in a big-data setting. Further, note that the expected error
(upper bound) over multiple runs of Randomized SVD algorithms does not contract (reduce).

[Figure 4 panels: (a) True Image, (b) SketchySVD r = 20, (c) Range-Net r = 20, (d) $X_r$ for r = 20, (e) $|\tilde{X}^{sketchy} - X_r|$, (f) $|\tilde{X}^{net} - X_r|$.]
Figure 4: (a) True image, rank-20 reconstruction using (b) SketchySVD (oversampled rank k = 81), (c)
Range-Net (5 passes), (d) conventional SVD. Note that SketchySVD's reconstruction error ($10^{-1}$) is 6 orders of
magnitude apart from Range-Net's reconstruction error ($10^{-7}$) w.r.t. the true $X_r$ from conventional SVD.
[Figure 5 panels: (a) SketchySVD scree error, (b) Range-Net scree error, (c) reconstruction error vs. run.]
Figure 5: Scree error in the extracted singular values from (a) SketchySVD (≈ 1) and (b) Range-Net (≈ $10^{-4}$).
Notice the scale of errors. (c) Reconstruction errors (rank r = 20) for Range-Net and randomized SVD schemes
(with and without power iterations) for the Parrot image over 1000 runs.
Fig. 6 shows the cross-correlation between the extracted right singular vectors from SketchySVD (left) and
Range-Net (right) against conventional SVD for a rank-20 approximation. SketchySVD's oversampled
rank is k = 81, and still the extracted right singular vectors deviate substantially. This implies that
the extracted vectors from SketchySVD do not span the top rank-20 subspace of X, as opposed to Range-Net,
where Stage 1 explicitly ensures this span without any oversampling. The higher the vector index,
the higher the spread, owing to the random projections.
Range-Net has a near-perfect cross-correlation with
the true vectors, indicated by the solid diagonal and
zero off-diagonal (near GPU-precision) entries.
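For completeness, the cross-correlation matrices of Fig. 6 can be formed as below (a hedged sketch; `V_est` and `V_true` denote the n x r matrices of estimated and conventional-SVD right singular vectors).

```python
import numpy as np

def cross_correlation(V_est, V_true):
    """Entry (i, j) = |<v_est_i, v_true_j>|; close to the identity matrix for a perfect recovery."""
    return np.abs(V_est.T @ V_true)
```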
[Figure 6 panels: (a) SketchySVD, (b) Range-Net; cross-correlation heatmaps against the conventional SVD vectors (V).]
Figure 6: Cross-correlation between true (conventional SVD) and extracted right singular vectors
from (a) SketchySVD and (b) Range-Net for a rank-r = 20 approximation of the Parrot image.
We tabulate error metrics (see Appendix E for definitions) for SketchySVD and Range-Net at
various ranks in Table 2. Note that all errors are reported w.r.t. the true SVD, where errfr and errsp
denote the Frobenius and spectral errors respectively. One can easily see that, as the rank increases,
SketchySVD's performance keeps deteriorating on the χ²err metric (a measure of correlation
between true and estimated vectors), as is also evident from Fig. 6. Range-Net, on the other hand, has
consistently lower errors on all metrics while simultaneously being memory efficient.
Table 2: Metric Performance and Memory of SketchySVD vs. Range-Net, for Parrot image

| Rank | SketchySVD errfr | errsp | χ²err | Mem (MB) | Time (s) | Range-Net errfr | errsp | χ²err | Mem (MB) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r = 10 | 27.904 | 18.071 | 0.492 | 10.6 | 10 | 0.0 | 0.0 | 0.018 | 0.33 | 21 |
| r = 20 | 11.974 | 2.201 | 0.662 | 21.69 | 14 | 0 | 0 | 0.023 | 0.69 | 26 |
| r = 50 | 2.772 | 0.181 | 0.762 | 59.87 | 30 | 1.91e-7 | 0 | 0.027 | 1.72 | 43 |
| r = 100 | 0.614 | 2.14e-2 | 0.923 | 139.9 | 63 | 2.32e-7 | 1.08e-7 | 0.033 | 3.6 | 72 |
4.2 DIMENSION REDUCTION: MNIST (EIGEN / PCA)
Principal Component Analysis (PCA) is a special variant of Eigen decomposition, where the samples
are mean-corrected before constructing a feature covariance matrix, followed by an Eigen decomposition.
Note that Range-Net does not require construction of the feature covariance matrix and can directly
extract the eigenvectors and eigenvalues without any modification. This is because, for any data matrix
$X \in \mathbb{R}^{m \times n}$, the eigenvalues of the feature covariance matrix are the squares of the singular values of $X$,
and its eigenvectors are exactly the right singular vectors of $X$.
MNIST has 60k images of size 28 × 28 in the training set. We reshape each image into a 784-dimensional
vector to obtain the data matrix $X \in \mathbb{R}^{60000 \times 784}$, a tall and skinny matrix. In a streaming setting, the
mean feature vector computation requires one pass over the data matrix. This mean can subsequently be
used during network training (Stage 1) to mean-correct the streamed input vectors.
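A one-pass streaming mean of this kind is straightforward to implement; in the sketch below, the batch generator is a hypothetical placeholder.

```python
import numpy as np

def streaming_mean(batches, n_features=784):
    """Hedged sketch: accumulate the feature-wise mean in a single pass over streamed batches."""
    total = np.zeros(n_features)
    count = 0
    for B in batches:              # B: (batch, n_features) array streamed from disk
        total += B.sum(axis=0)
        count += B.shape[0]
    return total / count

# During Stage-1 training, each streamed batch is then centered on the fly: X_batch - mean.
```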
Table 3: Metric Performance and Memory of SketchySVD vs. Range-Net, for MNIST

| Rank | SketchySVD errfr | errsp | χ²err | Mem (GB) | Time (s) | Range-Net errfr | errsp | χ²err | Mem (MB) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r = 20 | 1.09e3 | 8.29e2 | 1.34 | 0.47 | 216 | 0 | 0 | 0.012 | 0.51 | 416 |
| r = 50 | 1.03e3 | 8.25e2 | 2.14 | 1.18 | 384 | 0 | 0 | 0.025 | 1.33 | 552 |
| r = 100 | 1.05e3 | 8.47e2 | 2.03 | 2.38 | 702 | 1.12e-7 | 1.01e-7 | 0.052 | 2.83 | 776 |
| r = 200 | 1.14e3 | 8.62e2 | 1.0 | 4.84 | 1452 | 2.36e-7 | 2.52e-7 | 0.071 | 6.29 | 1256 |
For this dataset, it is well known that r = 200 captures ≥ 90% of the variance. For
SketchySVD, this results in projection matrices of ranks k = 4r + 1 = 801 and s = 2k + 1 = 1603.
Since MNIST only has n = 784 features, SketchySVD (Alg. 1) is almost as, if not more, memory
intensive as conventional SVD for such tall and skinny matrices. Table 3 shows the error metrics
under different rank settings, where even with oversampling, SketchySVD's errors are high. Range-Net
on the other hand with an exact memory requirement (r(n + r)) can handle much larger full rank tall
and skinny matrices without incurring extraneous memory load. As discussed before, since this data
matrix is tall and skinny (60k × 784) we already know that for SketchySVD any rank-r s.t. (r ≥ 196)
will result in the oversampling parameters k ≥ 784 and s ≥ 1569. SketchySVD will now extract
lower-error SVD factors since the oversampled rank redundantly exceeds the feature dimension.
4.3 SCIENTIFIC COMPUTING: SANDY BIG DATA (SVD)
Satellite data gathered by NASA for Hurricane Sandy over the Atlantic Ocean represents the big-data
counterpart for scientific computations. The dataset is openly available[2] and comprises RGB
snapshots captured at approximately one-minute intervals. The full dataset consists of 896 × 719
pixel images at 1208 time instances and is 24 GB in size. We chose this particular dataset so that
a conventional SVD can still be performed on our machine (16 GB RAM) for benchmarking. Please
note that this restriction is imposed by the conventional SVD method due to its high main memory
requirement. In contrast, our neural SVD solver can handle datasets that are orders of magnitude
larger in size with the same hardware specification.
Range-Net can not only handle larger datasets than SketchySVD, but also ensures lower errors in
approximating the SVD factors. Tab. 4 shows the error metrics for Range-Net with a comparison of
peak main-memory load between SketchySVD and Range-Net for ranks r = [10, 50, 100]. Appendix
**F.3 shows a comparison of dynamic mode reconstructions obtained from SketchySVD, conventional**
full rank SVD, and Range-Net and the associated scree errors in the computed singular values.
Table 4: Metric Performance and Memory of SketchySVD vs. Range-Net, for Sandy

| Rank | SketchySVD errfr | errsp | χ²err | Mem (GB) | Time (s) | Range-Net errfr | errsp | χ²err | Mem (MB) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r = 10 | 1.43e3 | 7.72e2 | 0.47 | 2.56 | 325 | 0 | 0 | 0.011 | 0.39 | 371 |
| r = 50 | 7.18e2 | 1.81e2 | 2.04 | 12.48 | 507 | 0 | 0 | 0.018 | 2.01 | 607 |
| r = 100 | 4.32e2 | 6.68e1 | 3.022 | 24.91 | 792 | 1.12e-7 | 1.24e-7 | 0.021 | 4.19 | 779 |
Randomized SVD extracted factors deviate quite substantially when the user-specified rank r is such
that the oversampled rank k is much lower than the unknown rank f of a given data matrix. The
reader is also referred to the additional experiment in Appendix F.4, where a low-rank (r = 10)
approximation is extracted. Here, SketchySVD deviates quite substantially after rank 3 while Range-Net
still remains in excellent agreement with the baseline singular vectors and values.
2https://www.nasa.gov/mission_pages/hurricanes/archives/2012/h2012_Sandy.html
4.4 STORAGE COMPLEXITY ANALYSIS
To estimate the memory efficiency of Range-Net, let us consider the peak main memory (RAM)
requirement for the computation of the SVD factors. Range-Net has two layers in succession, one
corresponding to the low-rank projector $\tilde{V} \in \mathbb{R}^{r \times n}$ and the other to the rotation matrix $\Theta \in \mathbb{R}^{r \times r}$. For SketchySVD, the
peak memory load occurs during the construction of the core matrix $C \in \mathbb{R}^{s \times s}$ (see Alg. 1). This requires
that the two projection matrices $\Phi \in \mathbb{R}^{m \times s}$, $\Psi \in \mathbb{R}^{n \times s}$, one projected data matrix $Z \in \mathbb{R}^{s \times s}$, the two rank-$k$
decompositions $Q \in \mathbb{R}^{m \times k}$, $P \in \mathbb{R}^{n \times k}$, and the core matrix $C \in \mathbb{R}^{s \times s}$ be present in memory simultaneously.
The memory efficiency factor ($s_{eff}$) for a rank-$r$ approximation with $k = 4r + 1$, $s = 2k + 1$ is:
$$s_{eff} = \frac{\text{SketchySVD}}{\text{Range-Net}} = \frac{ns + ms + 2s^2 + mk + nk}{rn + r^2} = \frac{(m + n)(k + s) + 2s^2}{r(n + r)} \approx \frac{12(m + n) + 128r}{n + r}$$
$$\approx 7.67 \times 10^2 \ \text{for MNIST} \ (m = 60\text{k},\ n = 784,\ r = 200)$$
To validate this ratio, we constructed a synthetic dataset of $m = 50$k rows, and the number of columns was varied starting
at $n = 10$k in increments of 10k. The expected rank was held at $r = 200$. Fig. 7 shows the memory allocation (in megabytes
(MB)) of SketchySVD vs. Range-Net while $n$ varies between [10k, 150k]. When $m = 50$k, $n = 150$k and $r = 200$,
SketchySVD has a peak memory consumption of 14 GB due to oversampling parameters of $k = 801$, $s = 1603$, while
Range-Net only requires 916 MB. Since Range-Net has an exact memory requirement of $r(n + r)$ for a
rank-$r$ SVD, it always occupies two orders of magnitude less main memory than the oversampled SketchySVD (or any other
randomized method), and scales to larger datasets.

Figure 7: Peak memory load of Range-Net vs. SketchySVD on synthetic data.
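The memory counts behind $s_{eff}$ are easy to reproduce; the sketch below simply counts matrix entries held in memory under the SketchySVD oversampling rule stated above.

```python
def s_eff(m, n, r):
    """Hedged sketch of the memory-efficiency factor derived above (counts of matrix entries)."""
    k = 4 * r + 1                               # SketchySVD oversampled rank
    s = 2 * k + 1                               # SketchySVD core sketch size
    sketchy = (m + n) * (k + s) + 2 * s * s     # Phi, Psi, Q, P plus Z and C held simultaneously
    range_net = r * (n + r)                     # projection (r x n) and rotation (r x r) weights
    return sketchy / range_net

print(s_eff(60_000, 784, 200))                  # MNIST: ~7.7e2, in line with the estimate above
```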
5 DISCUSSION
From the numerical experiments it is evident that Range-Net performs on par with conventional SVD,
with GPU bit-precision results, while being extremely memory efficient. Range-Net constructs the
rank-r projector $V_*V_*^T$ iteratively to reach the tail energy lower bound given by EYM. Equivalently,
the rank-r minimizer $V_*$ of Eq. 1 gives the correct rank-r right projector $V_*V_*^T$ of X. Any arbitrary
(random) projection of X onto an oversampled rank-k subspace (k > r), rather than the tail energy
minimizing subspace $V_*V_*^T$, can inadvertently annihilate the desired top-ranking singular features,
resulting in irreducible approximation errors.
This issue is especially important in exploratory data analysis for scientific computing, where one is
interested not only in the top-most singular values of the dataset but also in the dominant phenomena
(singular vectors). Furthermore, with increasing digital sensor resolution the focus now is to isolate
and study the lower-energy (high-frequency, small spatial scale) features, as in the case of Hurricane
Sandy, where turbulence manifests as low-rank features (see Figs. 19, 24 in the Appendix).
We again point out that accuracy is gained by achieving (or finding) this minimizer, i.e., the lower bound of
the rank-r tail energy of X, and therefore the upper bound on this tail energy is of no consequence
to Range-Net. As pointed out by Musco & Musco (2015), upper bounding the tail energy does not
ensure that the approximation errors in the extracted singular values or vectors will reduce. Therefore,
small relative tail energy errors can mislead the reader into believing that the singular factors are equivalently
accurate, when the absolute errors can still be substantially large.
6 CONCLUSION
We present Range-Net as a low-weight, high-precision, fully interpretable neural SVD solver for
big data applications that is independently verifiable without performing a full SVD. We show that
our solution approach achieves lower error metrics for the extracted singular vectors and values
compared to Randomized SVD methods. A discussion is also provided on the limiting assumptions
and practical consequences of using Randomized SVD schemes for big data applications. We also
verify that our network minimization problems converge to the EYM tail energy bound in the Frobenius
norm at machine precision. A number of practical problems where SVD or Eigen decompositions
are required are considered, demonstrating the applicability of Range-Net to large-scale datasets.
A fair comparison is provided against the state-of-the-art randomized, streaming SVD algorithm, with the
conventional SVD solution as the baseline for benchmarking and verification.
REFERENCES
Gephi sample data sets. http://wiki.gephi.org/index.php/Datasets.
Lev Muchnik's data sets web page. http://www.levmuchnik.net/Content/Networks/NetworkData.html.
sklearn.utils.extmath.randomized_svd. https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html.
James Baglama and Lothar Reichel. Augmented implicitly restarted lanczos bidiagonalization
methods. SIAM Journal on Scientific Computing, 27(1):19–42, 2005.
Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix
reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014.
Matthew Brand. Incremental singular value decomposition of uncertain data with missing values. In
_European Conference on Computer Vision, pp. 707–720. Springer, 2002._
François Chollet. Keras. https://github.com/fchollet/keras, 2015.
Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In
_Proceedings of the forty-first annual ACM symposium on Theory of computing, pp. 205–214, 2009._
Julio Cesar Stacchini de Souza, Tatiana Mariano Lessa Assis, and Bikash Chandra Pal. Data
compression in smart distribution systems via singular value decomposition. IEEE Transactions
_on Smart Grid, 8(1):275–284, 2015._
Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and V Vinay. Clustering large graphs
via the singular value decomposition. Machine learning, 56(1-3):9–33, 2004.
Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices ii:
Computing a low-rank approximation to a matrix. SIAM Journal on computing, 36(1):158–183,
2006.
Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
Gene H Golub and Richard Underwood. The block lanczos method for computing eigenvalues. In
_Mathematical software, pp. 361–377. Elsevier, 1977._
Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):
217–288, 2011.
Liping Jing, Chenyang Shen, Liu Yang, Jian Yu, and Michael K Ng. Multi-label classification by
semi-supervised singular value decomposition. IEEE Transactions on Image Processing, 26(10):
4612–4625, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
_arXiv:1412.6980, 2014._
N Kishore Kumar and Jan Schneider. Literature survey on low rank approximation of matrices.
_Linear and Multilinear Algebra, 65(11):2212–2244, 2017._
Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD
_international conference on Knowledge discovery and data mining, pp. 581–588, 2013._
Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The quarterly journal of
_mathematics, 11(1):50–59, 1960._
Cameron Musco and Christopher Musco. Stronger approximate singular value decomposition via the
block lanczos and power methods. arXiv preprint arXiv:1504.05477, 16:27, 2015.
Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997.
Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a
positive-semidefinite matrix from streaming data. In Advances in Neural Information Processing
_Systems, pp. 1225–1234, 2017a._
Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms
for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):
1454–1485, 2017b.
Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Streaming low-rank matrix
approximation with an application to scientific simulation. SIAM Journal on Scientific Computing,
41(4):A2430–A2463, 2019.
Jalaj Upadhyay. Fast and space-optimal low-rank factorization in the streaming model with application
in differential privacy. arXiv preprint arXiv:1604.01429, 2016.
Shuqin Wang, Yongli Wang, Yongyong Chen, Peng Pan, Zhipeng Sun, and Guoping He. Robust pca
using matrix factorization for background/foreground separation. IEEE Access, 6:18945–18953,
2018.
Haishan Ye, Shusen Wang, Zhihua Zhang, and Tong Zhang. Fast generalized matrix regression with
applications in machine learning. arXiv preprint arXiv:1912.12008, 2019.
Sheng Zhang, Weihong Wang, James Ford, Fillia Makedon, and Justin Pearlman. Using singular
value decomposition approximation for collaborative filtering. In Seventh IEEE International
_Conference on E-Commerce Technology (CEC’05), pp. 257–264. IEEE, 2005._
A NEED FOR RANGE-NET
To illustrate the limitations of current streaming randomized SVD approaches, we consider a synthetic
data matrix $X$ with a slowly decaying singular value spectrum. The results section presents a
number of these singular value spectra for different practical datasets (Fig. 3) to demonstrate that the
decay rates are specific to the problem at hand.
$$X = \mathrm{diag}(\underbrace{450, 449, \cdots, 2, 1}_{f = 450}, \underbrace{0, \cdots, 0}_{n - f})$$
Here $X \in \mathbb{R}^{m \times n}$ is a strictly diagonal matrix with $m = n = 500$ and rank $f = 450$, where the
singular value spectrum decays linearly. Fig. 8 shows a comparison of reconstruction errors (see
Metrics in Appendix E.5 for the definition) for SketchySVD (Tropp et al., 2019) (red line), Block
Lanczos with Power Iteration (Musco & Musco, 2015) (black line), Sklearn's randomized SVD (skr)
implementation (Halko et al., 2011) with (solid cyan line) and without power iteration (dashed blue
line), and Range-Net (green line) over 1000 runs for this synthetic dataset.

Figure 8: Reconstruction errors (r = 20) for Range-Net and randomized SVD schemes (with and without
power iterations) for the non-exponentially decaying singular value spectrum over 1000 runs.
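This synthetic matrix, and the exact tail energies used as ground truth, can be generated in a few lines (a hedged sketch; the rank r shown is arbitrary).

```python
import numpy as np

n, f = 500, 450
# Diagonal matrix with linearly decaying singular values 450, 449, ..., 1 and 50 trailing zeros.
X = np.diag(np.concatenate([np.arange(f, 0, -1, dtype=float), np.zeros(n - f)]))

r = 20
sigma = np.sort(np.abs(np.diag(X)))[::-1]       # singular values of a diagonal matrix
tail_exact = np.sqrt(np.sum(sigma[r:] ** 2))    # exact rank-r tail energy ||X - X_r||_F
print(tail_exact)
```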
Please note that although power iteration improves the reconstruction error for both Block Lanczos
(Musco & Musco, 2015) and Sklearn's RandSVD (Halko et al., 2011), power iteration itself requires
a persistent presence of the data matrix X in main memory. For a practical big data scenario,
power iteration is therefore not a feasible alternative when the data matrix X or its sketch is itself too
big to be loaded into main memory. Note that the expected error (upper bound) over multiple
runs of Randomized SVD algorithms does not reduce. We further identify the following requirements
for SketchySVD to return SVD factors with relatively low approximation errors:
1. The decay rate of the singular values of the dataset must be exponential.
2. For a rank-f matrix, the desired rank r must be chosen such that the oversampled rank k is at least
f (k ≥ f) to achieve lower errors at scale compared to other runs.
We suggest that the reader also attempt the case where all the diagonal entries are strictly ones and
zeros under a high rank setting.
[Figure 9 panels: (a) Estimated Singular Values, (b) Tail Energy, (c) Relative Tail Energy (Frobenius), each plotted against rank for oversampled ranks k = 41, 81, 121, 161, 201 and the ground truth.]
Figure 9: SketchySVD approximation errors for a synthetic dataset with linear decay in the singular value spectrum,
corresponding to r = [10, 20, 30, 40, 50] with corresponding oversampled ranks k = [41, 81, 121, 161, 201].
Since the decay is non-exponential, SketchySVD accrues large approximation errors, and is hence impractical for real
datasets with similar behavior.
Fig. 9 shows the singular values extracted by SketchySVD for a linearly decaying spectrum with
corresponding errors in absolute and relative tail energies. The reader is referred to Appendix
B.1 for the definitions of tail energy and relative tail energy. Note that the synthetic data is a
diagonal matrix chosen specifically so that the exact tail energies can be computed in the Frobenius
norm as $\big(\sum_{i=r+1}^{k} x_{ii}^2\big)^{1/2}$. For a rank-r approximation, SketchySVD suggests oversampling by a
factor of k = 4r + 1 to extract the rank-r factors correctly. Hence, for oversampled ranks
k = [41, 81, 121, 161, 201] the corresponding top rank r = [10, 20, 30, 40, 50] extracted singular
values and vectors should have the lowest approximation errors. However, as shown in Fig. 9 (a), the
extracted singular values are an order of magnitude off w.r.t. the ground truth. Consequently,
Fig. 9 (b) shows that the absolute tail energies of the extracted features deviate quite substantially
from the true tail energy. Furthermore, we also notice that the deviations remain large as long as the
oversampled rank k is such that k < f. For a practical dataset, f is either unknown, or almost full
rank (f ≈ min(m, n)), or both, and can only be detected by performing a full SVD on the dataset. This
poses a serious restriction on SketchySVD's reliability for a realistic big data application, due to the
exponential decay assumption.
We also notice that for smaller values of r, the accrued error in both the extracted singular values and