Under review as a conference paper at ICLR 2019
THE IMPORTANCE OF NORM REGULARIZATION IN LINEAR GRAPH EMBEDDING: THEORETICAL ANALYSIS AND EMPIRICAL DEMONSTRATION
Anonymous authors Paper under double-blind review
ABSTRACT
Learning distributed representations for nodes in graphs is a crucial primitive in network analysis with a wide spectrum of applications. Linear graph embedding methods learn such representations by optimizing the likelihood of both positive and negative edges while constraining the dimension of the embedding vectors. We argue that the generalization performance of these methods is not due to the dimensionality constraint as commonly believed, but rather to the small norm of the embedding vectors. We provide both theoretical and empirical evidence to support this argument: (a) we prove that the generalization error of these methods can be bounded by limiting the norm of the vectors, regardless of the embedding dimension; (b) we show that the generalization performance of linear graph embedding methods is correlated with the norm of the embedding vectors, which is small due to the early stopping of SGD and the vanishing gradients. We perform extensive experiments to validate our analysis and showcase the importance of proper norm regularization in practice.
1 INTRODUCTION
Graphs have long been considered as one of the most fundamental structures that can naturally represent interactions between numerous real-life objects (e.g., the Web, social networks, protein-protein interaction networks). Graph embedding, whose goal is to learn distributed representations for nodes while preserving the structure of the given graph, is a fundamental problem in network analysis that underpins many applications. A handful of graph embedding techniques have been proposed in recent years (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016), along with impressive results in applications like link prediction, text classification (Tang et al., 2015a), and gene function prediction (Wang et al., 2015).
Linear graph embedding methods preserve graph structures by converting the inner products of the node embeddings into probability distributions with a softmax function (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016). Since the exact softmax objective is computationally expensive to optimize, the negative sampling technique (Mikolov et al., 2013) is often used in these methods: instead of optimizing the softmax objective function, we try to maximize the probability of positive instances while minimizing the probability of some randomly sampled negative instances. It has been shown that by using this negative sampling technique, these graph embedding methods are essentially computing a factorization of the adjacency (or proximity) matrix of the graph (Levy & Goldberg, 2014). Hence, it is commonly believed that the key to the generalization performance of these methods is the dimensionality constraint.
However, in this paper we argue that the key factor behind the good generalization of these embedding methods is not the dimensionality constraint, but rather the small norm of the embedding vectors. We provide both theoretical and empirical evidence to support this argument:
· Theoretically, we analyze the generalization error of two linear graph embedding hypothesis spaces (restricting the embedding dimension / norm), and show that only the norm-restricted hypothesis class can theoretically guarantee good generalization in typical parameter settings.
· Empirically, we show that the success of existing linear graph embedding methods (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016) is due to the early stopping of stochastic gradient descent (SGD), which implicitly restricts the norm of the embedding vectors. Furthermore, with prolonged SGD execution and no proper norm regularization, the embedding vectors can severely overfit the training data.
PAPER OUTLINE
The rest of this paper is organized as follows. In Section 2, we review the definition of the graph embedding problem and the general framework of linear graph embedding. In Section 3, we present both theoretical and empirical evidence to support our argument that the generalization of embedding vectors is determined by their norm. In Section 4, we present additional experimental results for a hinge-loss linear graph embedding variant, which further support our argument. In Section 5, we discuss the new insights gained from these results. Finally, in Section 6, we conclude the paper. Details of the experiment settings, algorithm pseudo-code, theorem proofs, and the discussion of other related work can all be found in the appendix.
2 PRELIMINARIES
2.1 THE GRAPH EMBEDDING PROBLEM
We consider a graph G = (V, E), where V is the set of nodes in G, and E is the set of edges between the nodes in V. For any two nodes u, v ∈ V, an edge (u, v) ∈ E if u and v are connected, and we assume all edges are unweighted and undirected for simplicity1. The task of graph embedding is to learn a D-dimensional vector representation x_u for each node u ∈ V such that the structure of G can be maximally preserved. These embedding vectors can then be used as features for subsequent applications (e.g., node label classification or link prediction).
2.2 THE LINEAR GRAPH EMBEDDING FRAMEWORK
Linear graph embedding (Tang et al., 2015b; Grover & Leskovec, 2016) is one of the two major approaches for computing graph embeddings2. These methods use the inner products of embedding vectors to capture the likelihood of edge existence, and are appealing to practitioners due to their simplicity and good empirical performance. Formally, given a node u and its neighborhood N_+(u)3, the probability of observing node v being a neighbor of u is defined as:

p(v|u) = exp(x_u^T x_v) / Σ_{k∈V} exp(x_u^T x_k)
By minimizing the KL-divergence between the embedding-based distribution and the actual neighborhood distribution, the overall objective function is equivalent to:

L = - Σ_{u∈V} Σ_{v∈N_+(u)} log p(v|u)
Unfortunately, it is quite problematic to optimize this objective function directly, as the softmax term involves normalizing over all vertices. To address this issue, the negative sampling technique (Mikolov et al., 2013) is used to avoid computing gradients over the full softmax function. Intuitively, negative sampling can be viewed as randomly selecting, for each node u, a set of nodes N_-(u) that are not connected to u as its negative neighbors. The embedding vectors are then learned by minimizing the following objective function instead:

L = - Σ_u Σ_{v∈N_+(u)} log σ(x_u^T x_v) - Σ_u Σ_{v∈N_-(u)} (|N_+(u)| / |N_-(u)|) log σ(-x_u^T x_v)    (1)

1 All linear graph embedding methods discussed in this paper can be generalized to the weighted case by multiplying the weight into the corresponding loss term of each edge. The directed case is usually handled by associating each node with two embedding vectors, for incoming and outgoing edges respectively, which is equivalent to learning an embedding on a transformed undirected bipartite graph.
2 The other major approach is to use a deep neural network to compute the embedding vectors; see the discussion of other related work in the appendix for details.
3 Note that N_+(u) can be either the set of direct neighbors in the original graph G (Tang et al., 2015b), or an expanded neighborhood based on measures like random walks (Grover & Leskovec, 2016).
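To make Eqn (1) concrete, the following minimal Python sketch (not part of the original paper) evaluates the per-node negative-sampling loss on a toy graph; the dictionary-based embedding store, the helper name node_loss, and the toy neighborhoods are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_loss(x, u, pos_neighbors, neg_neighbors):
    # Negative-sampling loss contributed by node u, as in Eqn (1).
    xu = x[u]
    loss = 0.0
    for v in pos_neighbors:                       # observed (positive) edges
        loss -= np.log(sigmoid(xu @ x[v]))
    weight = len(pos_neighbors) / max(len(neg_neighbors), 1)
    for v in neg_neighbors:                       # sampled negative neighbors
        loss -= weight * np.log(sigmoid(-xu @ x[v]))
    return loss

# toy usage: 5 nodes with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = {i: 0.1 * rng.standard_normal(8) for i in range(5)}
print(node_loss(x, 0, pos_neighbors=[1, 2], neg_neighbors=[3, 4]))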
2.3 THE MATRIX FACTORIZATION INTERPRETATION
Although the embedding vectors learned through negative sampling do have good empirical performance, there is very little theoretical analysis of this technique that explains why. The most well-known analysis of negative sampling was done by Levy & Goldberg (2014), who claim that the embedding vectors are approximating a low-rank factorization of the PMI (Pointwise Mutual Information) matrix.
More specifically, the key discovery of Levy & Goldberg (2014) is that when the embedding dimension is large enough, the optimal solution to Eqn (1) recovers exactly the PMI matrix (up to a constant shift, assuming the asymptotic case where N_-(u) = V for all u ∈ V):

∀u, v:  x_u^T x_v = log( |E| · 1[(u,v)∈E] / (|N_+(u)| |N_+(v)|) ) - const.
Based on this result, Levy & Goldberg (2014) suggest that optimizing Eqn (1) under the dimensionality constraint is equivalent to computing a low-rank factorization of the shifted PMI matrix. This is currently the mainstream opinion regarding the intuition behind negative sampling. Although Levy and Goldberg only analyzed negative sampling in the context of word embedding, it is commonly believed that the same conclusion also holds for graph embedding (Qiu et al., 2018).
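As a concrete illustration of the matrix being factorized, here is a minimal Python sketch (our illustration, not from the paper) that builds the shifted PMI matrix of a toy graph using the formula above; the toy edge list and the value of the constant shift are illustrative assumptions.

import numpy as np
from collections import Counter

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]           # toy undirected graph
deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

n, shift = 4, np.log(5.0)                          # shift stands in for the constant offset
pmi = np.full((n, n), -np.inf)                     # log(0) for non-edges
for u, v in edges:
    val = np.log(len(edges) / (deg[u] * deg[v])) - shift
    pmi[u, v] = pmi[v, u] = val
print(pmi)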
3 THE IMPORTANCE OF NORM REGULARIZATION
As explained in Section 2.3, it is commonly believed that linear graph embedding methods approximate a low-rank factorization of PMI matrices. As such, the dimensionality constraint of the embedding vectors is often deemed the key factor behind good generalization (Tang et al., 2015b; Grover & Leskovec, 2016). However, due to the sparsity of real-world networks, the explanation of Levy & Goldberg is actually rather counter-intuitive in the graph embedding setting: the average node degree usually ranges only from 10 to 100, which is much less than the typical embedding dimension (usually in the range of 100–400). Essentially, this means that in the context of graph embedding, the total number of free parameters is larger than the total number of training data points, which makes it intuitively very unlikely that the negative sampling model (i.e., Eqn (1)) inherently guarantees the generalization of embedding vectors in such a scenario; it is much more plausible that the observed good empirical performance is due to some other reason.
In this paper, we provide a different explanation for the good empirical performance of linear graph embedding methods: we argue that the good generalization of linear graph embedding vectors is due to their small norm, which is in turn caused by the vanishing gradients during the stochastic gradient descent (SGD) optimization procedure. We provide the following evidence to support this argument:
· In Section 3.1, we theoretically analyze the generalization error of two linear graph embedding variants: one has the standard dimensionality constraint, while the other restricts the vector norms. Our analysis shows that:
  – The embedding vectors can generalize well to unseen data if their average squared l2 norm is small, and this is always true regardless of the embedding dimension choice.
  – Without norm regularization, the embedding vectors can severely overfit the training data if the embedding dimension is larger than the average node degree.
· In Section 3.2, we provide empirical evidence that the generalization of linear graph embedding is determined by the vector norm instead of the embedding dimension. We show that:
  – In practice, the average norm of the embedding vectors is small due to the early stopping of SGD and the vanishing gradients.
  – The generalization performance of the embedding vectors starts to drop when the average norm of the embedding vectors gets large.
  – The dimensionality constraint is only helpful when the embedding dimension is very small (around 5–10) and there is no norm regularization.
3.1 GENERALIZATION ANALYSIS OF TWO LINEAR GRAPH EMBEDDING VARIANTS
In this section, we present a generalization error analysis of linear graph embedding based on
the uniform convergence framework (Bartlett & Mendelson, 2002), which bounds the maximum
difference between the training and generalization error over the entire hypothesis space. We assume
the following statistical model for graph generation: there exists an unknown probability distribution Q over the Cartesian product V × U of two vertex sets V and U. Each sample (a, b) from Q denotes an edge connecting a ∈ V and b ∈ U. The set of (positive) training edges E_+ consists of the first m i.i.d. samples from the distribution Q, and the negative edge set E_- consists of m' i.i.d. samples from the uniform distribution over V × U. The goal is to use these samples to learn a model that generalizes well to the underlying distribution Q. We allow either V = U for homogeneous graphs or V ∩ U = ∅ for bipartite graphs.
Denote E_± = {(a, b, +1) : (a, b) ∈ E_+} ∪ {(a, b, -1) : (a, b) ∈ E_-} as the collection of all training data, and assume that the data points in E_± are sampled from a combined distribution P over V × U × {±1} that generates both positive and negative edges. Using the above notation, the training error L_t(x) and the generalization error L_g(x) of an embedding x : (U ∪ V) → R^D are defined as follows:

L_t(x) = (1/|E_±|) Σ_{(a,b,y)∈E_±} - log σ(y x_a^T x_b)

L_g(x) = - E_{(a,b,y)~P} log σ(y x_a^T x_b)
In the uniform convergence framework, we try to prove the following statement:

Pr( sup_{x∈H} (L_g(x) - L_t(x)) ≤ ε ) ≥ 1 - δ

which bounds the maximum difference between L_t(x) and L_g(x) over all possible embeddings x in the hypothesis space H. If the above uniform convergence statement is true, then minimizing the training error L_t(x) naturally leads to a small generalization error L_g(x) with high probability.
Now we present our first technical result, which follows the above framework and bounds the generalization error of linear graph embedding methods with norm constraints:
Theorem 1. [Generalization of Linear Graph Embedding with Norm Constraints] Let E_± = {(a_1, b_1, y_1), (a_2, b_2, y_2), ..., (a_{m+m'}, b_{m+m'}, y_{m+m'})} be i.i.d. samples from a distribution P over V × U × {±1}, and let x : (U ∪ V) → R^D be the embedding for the nodes in the graph. Then for any bounded 1-Lipschitz loss function l : R → [0, B] and any C_U, C_V > 0, with probability 1 - δ (over the sampling of E_±), the following inequality holds

E_{(a,b,y)~P} l(y x_a^T x_b) ≤ (1/(m+m')) Σ_{i=1}^{m+m'} l(y_i x_{a_i}^T x_{b_i}) + (2 √(C_U C_V) / (m+m')) E||A||_2 + 4B √( 2 ln(4/δ) / (m+m') )

for all embeddings x satisfying

Σ_{u∈U} ||x_u||_2^2 ≤ C_U,   Σ_{v∈V} ||x_v||_2^2 ≤ C_V

where ||A||_2 is the spectral norm of the randomized adjacency matrix A defined as follows:

A(i, j) = σ_ij y  if (u_i, v_j, y) ∈ E_±,   and 0 otherwise,

in which the σ_ij are i.i.d. Rademacher random variables.
The proof can be found in the appendix. Intuitively, this theorem states that the size of the training dataset (i.e., the total number of edges |E_±| = m + m') required for learning a norm-constrained graph embedding is about O(C ||A||_2), where C is the sum of the squared norms of all embedding vectors and E||A||_2 is the expected spectral norm of the randomized adjacency matrix A (defined in the theorem). Assuming that the average squared norm of the embedding vectors is O(1), C scales as O(|V| + |U|), and a rough estimate places E||A||_2 around O( √( |E_±| ln(|U| + |V|) / (|U| + |V|) ) )4.
4 This estimation is computed on an Erdős–Rényi-style random graph; details can be found in the appendix.
Based on these estimates, we see that the generalization error gap of norm-constrained embeddings scales as O( d^{-1/2} (ln n)^{1/2} )5, where d is the average node degree and n is the total number of vertices.
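The spectral-norm quantity E||A||_2 can also be checked numerically. Below is a hedged Python sketch (assuming numpy and scipy are available; not part of the paper) that samples Rademacher signs on a random set of vertex pairs and compares the resulting spectral norm with the O(√(m ln n / n)) estimate; the graph size and the number of Monte-Carlo samples are arbitrary illustrative choices.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
n, m = 2000, 20000                                  # vertices, sampled edge pairs
rows = rng.integers(0, n, size=m)
cols = rng.integers(0, n, size=m)

norms = []
for _ in range(5):                                  # Monte-Carlo over Rademacher signs
    signs = rng.choice([-1.0, 1.0], size=m)
    A = coo_matrix((signs, (rows, cols)), shape=(n, n)).tocsr()
    norms.append(svds(A, k=1, return_singular_vectors=False)[0])

print("estimated E||A||_2      ~", np.mean(norms))
print("O(sqrt(m ln n / n)) scale ~", np.sqrt(m * np.log(n) / n))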
On the other hand, if we restrict only the embedding dimension (i.e., no norm regularization on embedding vectors), and the embedding dimension is larger than the average degree of the graph, then it is possible for the embedding vectors to severely overfit the training data. The following example demonstrates this possibility on a d-regular graph, in which the embedding vectors can always achieve zero training error even when the edge labels are randomly placed:
Claim 1. Let G = (V, E) be a d-regular graph with n vertices and m = nd/2 labeled edges (with labels y_i ∈ {±1}):

V = {v_1, ..., v_n}   E = {(a_1, b_1, y_1), ..., (a_m, b_m, y_m)}

Then for each of the 2^m possible combinations of labels, there always exists a d-dimensional embedding x : V → R^d that achieves perfect classification accuracy on the randomized training dataset:

∀(y_1, ..., y_m) ∈ {±1}^m,  ∃ x : V → R^d  s.t.  ∀i ∈ {1, ..., m},  y_i x_{a_i}^T x_{b_i} > 1
The proof can be found in the appendix. In other words, without norm regularization, the number of training samples required for learning D-dimensional embedding vectors is at least Ω(nD). Considering the fact that many large-scale graphs are sparse (with average degree < 20) and the default embedding dimension commonly ranges from 100 to 400, it is highly unlikely that the dimensionality constraint by itself could lead to good generalization performance.
3.2 EMPIRICAL EVIDENCE ON THE CAUSE OF GENERALIZATION
In this section, we present several sets of experimental results for standard linear graph embedding, which collectively suggest that the generalization of these methods is actually determined by the vector norm instead of the embedding dimension.
Experiment Setting: We use stochastic gradient descent (SGD) to minimize the following objective:

L = - λ_+ Σ_{(u,v)∈E_+} log σ(x_u^T x_v) - λ_- Σ_{(u,v)∈E_-} log σ(-x_u^T x_v) + r Σ_{v∈V} ||x_v||_2^2

Here E_+ is the set of edges in the training graph, and E_- is the set of negative edges with both ends sampled uniformly from all vertices. The SGD learning rate follows the standard schedule η_t = (t + c)^{-1/2}. Three different datasets are used in the experiments: Tweet, BlogCatalog, and YouTube; their
details can be found in the appendix.
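For concreteness, here is a minimal Python sketch (our illustration, not the paper's implementation) of the SGD procedure described above, including the l2 regularization term and the (t + c)^{-1/2} learning rate. The toy 4-cycle graph, the embedding dimension, and the hyper-parameter values are illustrative stand-ins rather than the paper's settings; the per-epoch average l2 norm it prints is the quantity tracked in Figure 1.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(pos_nbrs, n_nodes, dim=16, r=0.1, lam_pos=1.0, lam_neg=0.03,
          n_neg=4, epochs=5, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.1, 0.1, size=(n_nodes, dim))    # init as in the appendix
    t = 0
    for epoch in range(epochs):
        for u in range(n_nodes):
            t += 1
            eta = (t + c) ** -0.5                       # learning rate (t + c)^(-1/2)
            grad = 2.0 * r * x[u]                       # l2 regularization term
            for v in pos_nbrs[u]:                       # positive neighbors
                grad -= lam_pos * sigmoid(-x[u] @ x[v]) * x[v]
            for k in rng.integers(0, n_nodes, size=n_neg):   # uniformly sampled negatives
                grad += lam_neg * sigmoid(x[u] @ x[k]) * x[k]
            x[u] -= eta * grad
        print(epoch, "avg ||x||_2 =", round(np.linalg.norm(x, axis=1).mean(), 4))
    return x

# toy usage: a 4-cycle
train(pos_nbrs={0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}, n_nodes=4)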
Figure 1: Average l2 norm of the embedding vectors during SGD optimization; panels: (a) Tweet, (b) BlogCatalog, (c) YouTube.
SGD Optimization Results in Small Vector Norm: Figure 1 shows the average l2 norm of the embedding vectors during the first 50 SGD epochs (with varying values of r). As we can see, the average norm of the embedding vectors increases consistently after each epoch, but the rate of increase slows as time progresses. In practice, the SGD procedure is often stopped after 10–50 epochs (especially for large-scale graphs with millions of vertices6), and this relatively early stopping naturally results in a small vector norm.
5 We suspect that a more accurate estimation could remove the (ln n)^{1/2} factor.
6 Each epoch of SGD has time complexity O(|E| D), where D is the embedding dimensionality (usually around 100). Therefore, in large-scale graphs even a single epoch would require billions of floating-point operations.
Figure 2: Average norm of the stochastic gradients during SGD optimization; panels: (a) Tweet, (b) BlogCatalog, (c) YouTube.
The Vanishing Gradients: Figure 2 shows the average l2 norm of the stochastic gradients ∂L/∂x_u during the first 50 SGD epochs:

∂L/∂x_u = - λ_+ Σ_{v∈N_+(u)} σ(-x_u^T x_v) x_v + λ_- Σ_{v∈N_-(u)} σ(x_u^T x_v) x_v + 2 r x_u    (2)
From the figure, we can see that the stochastic gradients become smaller during the later stage of
SGD, which is consistent with our earlier observation in Figure 1. This phenomenon can be intuitively
explained as follows: after a few SGD epochs, most of the training data points have already been well fitted by the embedding vectors, which means that most of the coefficients σ(±x_u^T x_v) in Eqn (2) will be close to 0 afterwards; as a result, the stochastic gradients will be small in the following epochs.
Figure 3: Generalization performance during SGD; panels: (a) Tweet (AP), (b) BlogCatalog (AP), (c) YouTube (AP), (d) BlogCatalog (F1).
Regularization and Early Stopping: Figure 3 shows the generalization performance of embedding
vectors during the first 50 SGD epochs, in which we depict the resulting average precision (AP) score7 for link prediction and the F1 score for node label classification. As we can see, the generalization performance of the embedding vectors starts to drop after 5–20 epochs when r is small, indicating that they are overfitting the training dataset afterwards. The generalization performance is worst near
the end of SGD execution when r = 0, which coincides with the fact that embedding vectors in that case also have the largest norm among all settings. Thus, Figure 3 and Figure 1 collectively suggest
that the generalization of linear graph embedding is determined by vector norm.
Impact of Embedding Dimension Choice: Figure 4 shows the generalization AP score on the Tweet dataset with varying values of r and embedding dimension D. As we can see in Figure 4, without any norm regularization (r = 0), the embedding vectors will overfit the training dataset for any D greater than 10, which is consistent with our analysis in Claim 1. On the other hand, with larger r, the impact of the embedding dimension choice is significantly less noticeable, indicating that the primary
factor for generalization is the vector norm in such scenarios.
7 Average Precision (AP) evaluates performance on ranking problems: we first compute the precision and recall at every position in the ranked sequence, and then view the precision p(r) as a function of the recall r. The average precision is then computed as AveP = ∫_0^1 p(r) dr.
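For reference, here is a minimal Python sketch (our illustration, not from the paper) of the AP metric in its usual discrete form, which approximates the integral above by averaging precision over the positions of the positive examples; the toy scores and labels are illustrative.

import numpy as np

def average_precision(scores, labels):
    order = np.argsort(-np.asarray(scores))         # rank by decreasing score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    # average the precision at the ranks where positives occur
    return float(precision_at_k[labels == 1].mean())

print(average_precision(scores=[0.9, 0.8, 0.4, 0.2], labels=[1, 0, 1, 0]))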
Figure 4: Impact of the embedding dimension, for (a) r = 0, (b) r = 0.2, (c) r = 0.6, (d) r = 1.
4 DEMONSTRATING THE IMPORTANCE OF NORM REGULARIZATION VIA HINGE-LOSS LINEAR GRAPH EMBEDDING
In this section, we present experimental results for a non-standard linear graph embedding formulation, which optimizes the following objective:

L = λ_+ Σ_{(u,v)∈E_+} h(x_u^T x_v) + λ_- Σ_{(u,v)∈E_-} h(-x_u^T x_v) + (r/2) Σ_{v∈V} ||x_v||_2^2    (3)

where h(t) = max(1 - t, 0) denotes the hinge loss.
By replacing the logistic loss with the hinge loss, it is now possible to apply the dual coordinate descent (DCD) method (Hsieh et al., 2008) for optimization, which circumvents the issue of vanishing gradients in SGD and allows us to directly observe the impact of norm regularization. More specifically, consider all terms in Eqn (3) that are relevant to a particular vertex u:

L^(u) = Σ_{(x_i, y_i)∈D} (λ_{y_i} / r) max(1 - y_i x_u^T x_i, 0) + (1/2) ||x_u||_2^2    (4)

in which we define D = {(x_v, +1) : v ∈ N_+(u)} ∪ {(x_k, -1) : k ∈ N_-(u)}. Since Eqn (4) takes the same form as a soft-margin linear SVM objective, with x_u being the linear coefficients and (x_i, y_i) being the training data, we can use any SVM solver to optimize Eqn (4) and then apply it asynchronously over the graph vertices to update their embeddings. The pseudo-code of the optimization procedure using DCD can be found in the appendix.
Figure 5: Generalization average precision with varying r (D = 100), on (a) Tweet, (b) BlogCatalog, (c) YouTube.
Impact of Regularization Coefficient: Figure 5 shows the generalization performance of the embedding vectors obtained from the DCD procedure (about 20 epochs). As we can see, the quality of the embedding vectors is very poor when r ≈ 0, indicating that proper norm regularization is necessary for generalization. The value of r also affects the gap between training and testing performance, which is consistent with our analysis that r controls the model capacity of linear graph embedding.
Impact of Embedding Dimension Choice: The choice of the embedding dimension D, on the other hand, is not very impactful, as demonstrated in Figure 6: as long as D is reasonably large (≥ 30), the exact choice has very little effect on the generalization performance, even with an extremely large embedding dimension (D = 1600). These results are consistent with our theory that the generalization of linear graph embedding is primarily determined by the norm constraints.
Figure 6: Generalization average precision with varying D (r = 3), on (a) Tweet, (b) BlogCatalog, (c) YouTube.
5 DISCUSSION
So far, we have seen several pieces of evidence supporting our argument that the generalization of embedding vectors in linear graph embedding is determined by the vector norm. Intuitively, this means that these embedding methods are trying to embed the vertices onto a small sphere centered at the origin. The radius of the sphere controls the model capacity, and choosing a proper embedding dimension allows us to control the trade-off between the expressive power of the model and the computational efficiency.
Note that the connection between norm regularization and generalization performance is actually very intuitive. To see this, let us consider the semantic meaning of the embedding vectors: the probability of any particular edge (u, v) being positive is equal to

Pr(y = 1 | u, v) = σ(x_u^T x_v) = σ( (x_u^T x_v / (||x_u||_2 ||x_v||_2)) · ||x_u||_2 ||x_v||_2 )
As we can see, this probability value is determined by three factors:
· xuT xv/(||xu||2||xv||2), the cosine similarity between xu and xv, evaluates the degree of agreement between the directions of xu and xv.
· ||x_u||_2 and ||x_v||_2, on the other hand, reflect the degree of confidence we have regarding the embedding vectors of u and v.
Therefore, by restricting the norm of embedding vectors, we are limiting the confidence level that we have regarding the embedding vectors, which is indeed intuitively helpful for preventing overfitting.
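A tiny Python sketch (our illustration, not from the paper) of the decomposition above; the helper name and the toy vectors are illustrative.

import numpy as np

def edge_probability(xu, xv):
    cos = (xu @ xv) / (np.linalg.norm(xu) * np.linalg.norm(xv))   # direction agreement
    scale = np.linalg.norm(xu) * np.linalg.norm(xv)               # confidence term
    return 1.0 / (1.0 + np.exp(-cos * scale))                     # sigma(cos * ||xu|| * ||xv||)

print(edge_probability(np.array([1.0, 0.0]), np.array([2.0, 0.0])))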
It is worth noting that our results in this paper do not invalidate the analysis of Levy & Goldberg (2014), but rather clarify some key points: as pointed out by Levy & Goldberg (2014), linear graph embedding methods are indeed approximating the factorization of PMI matrices. However, as we have seen in this paper, the embedding vectors are primarily constrained by their norm instead of the embedding dimension, which implies that the resulting factorization is not really a standard low-rank one, but rather a low-norm factorization:

x_u^T x_v ≈ PMI(u, v)   s.t.   Σ_u ||x_u||_2^2 ≤ C
The low-norm factorization represents an interesting alternative to the standard low-rank factorization, and our current understanding of it is still very limited. Given the empirical success of linear graph embedding methods, a more in-depth analysis of such factorization would be very helpful for deepening our understanding and potentially inspiring new algorithms.
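One way to make the low-norm constraint explicit, rather than relying on early stopping or an l2 penalty, would be to project the embeddings back onto the norm ball after each update. The following minimal Python sketch is our illustration of such a projection, not an algorithm from the paper; the constant C and the matrix shape are arbitrary.

import numpy as np

def project_to_norm_ball(x, C):
    # Rescale embeddings so that the total squared l2 norm is at most C.
    total = np.sum(x ** 2)
    if total > C:
        x *= np.sqrt(C / total)
    return x

x = np.random.default_rng(0).normal(size=(100, 16))
x = project_to_norm_ball(x, C=100.0)
print(np.sum(x ** 2))                                # <= 100 after projection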
6 CONCLUSION
We have shown that the generalization of linear graph embedding methods is determined not by the dimensionality constraint but rather by the norm of the embedding vectors. We proved that limiting the norm of the embedding vectors leads to good generalization, and showed that the generalization of existing linear graph embedding methods is due to the early stopping of SGD and vanishing gradients. We experimentally investigated the impact of the embedding dimension choice, and demonstrated that this choice only matters when there is no norm regularization. In most cases, the best generalization performance is obtained by choosing the optimal value of the norm regularization coefficient, in which case the impact of the embedding dimension is negligible. Our findings, combined with the analysis of Levy & Goldberg (2014), suggest that linear graph embedding methods are probably computing a low-norm factorization of the PMI matrix, which is an interesting alternative to the standard low-rank factorization and calls for further study.
REFERENCES
Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pp. 585–591, 2001.
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pp. 855–864, 2016.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415, 2008.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Joseph B Kruskal and Myron Wish. Multidimensional scaling, volume 11. Sage, 1978.
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pp. 2177–2185, 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC'07), San Diego, CA, October 2007.
Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD, pp. 701–710, 2014.
Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 459–467. ACM, 2018.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pp. 1329–1336, 2005.
Jian Tang, Meng Qu, and Qiaozhu Mei. Pte: Predictive text embedding through large-scale heterogeneous text networks. In KDD, pp. 1165–1174, 2015a.
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW, pp. 1067–1077, 2015b.
Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In KDD, pp. 1225–1234. ACM, 2016.
Sheng Wang, Hyunghoon Cho, ChengXiang Zhai, Bonnie Berger, and Jian Peng. Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics, 31(12):i357–i364, 2015.
R. Zafarani and H. Liu. Social computing data repository at ASU, 2009. URL http://socialcomputing.asu.edu.
APPENDIX
DATASETS AND EXPERIMENTAL PROTOCOLS
We use the following three datasets in our experiments:
· Tweet is an undirected graph that encodes keyword co-occurrence relationships in Twitter data: we collected 1.1 million English tweets using Twitter's Streaming API during August 2014, and then extracted the most frequent 10,000 keywords as graph nodes and their co-occurrences as edges. All nodes with more than 2,000 neighbors are removed as stop words. There are 9,913 nodes and 681,188 edges in total.
· BlogCatalog (Zafarani & Liu, 2009) is an undirected graph that contains the social relationships between BlogCatalog users. It consists of 10,312 nodes and 333,983 undirected edges, and each node belongs to one of the 39 groups.
· YouTube (Mislove et al., 2007) is a social network among YouTube users. It includes 500,000 nodes and 3,319,221 undirected edges8.
For each positive edge in the training and testing datasets, we randomly sampled 4 negative edges, which are used for learning the embedding vectors (in the training dataset) and evaluating average precision (in the testing dataset). In all experiments, λ_+ = 1 and λ_- = 0.03, which achieves the optimal generalization performance according to cross-validation. All initial coordinates of the embedding vectors are uniformly sampled from [-0.1, 0.1].
OTHER RELATED WORKS
In the early days of graph embedding research, graphs were used only as an intermediate data model for visualization (Kruskal & Wish, 1978) or non-linear dimension reduction (Tenenbaum et al., 2000; Belkin & Niyogi, 2001). Typically, the first step is to construct an affinity graph from the features of the data points, and then the low-dimensional embeddings of the graph vertices are computed by finding the eigenvectors of the affinity matrix.
For more recent graph embedding techniques, apart from the linear graph embedding methods discussed in this paper, there are also methods (Wang et al., 2016; Kipf & Welling, 2016; Hamilton et al., 2017) that explore the option of using deep neural network structures to compute the embedding vectors. These methods typically try to learn a deep neural network model that takes the raw features of graph vertices to compute their low-dimensional embedding vectors: SDNE (Wang et al., 2016) uses the adjacency list of vertices as input to predict their Laplacian Eigenmaps; GCN (Kipf & Welling, 2016) aggregates the outputs of neighboring vertices in the previous layer to serve as input to the current layer (hence the name "graph convolutional network"); GraphSage (Hamilton et al., 2017) extends GCN by allowing other forms of aggregators (i.e., in addition to the mean aggregator in GCN). Interestingly though, all these methods use only 2 or 3 neural network layers in their experiments, and there is also evidence suggesting that using a larger number of layers results in worse generalization performance (Kipf & Welling, 2016). Therefore, it remains unclear to us whether the deep neural network structure is really helpful for the task of graph embedding.
Prior to our work, some existing research already suggested that norm-constrained graph embedding could generalize well. Srebro et al. (2005) studied the problem of computing norm-constrained matrix factorization, and reported superior performance compared to standard low-rank matrix factorization on several tasks. Given the connection between matrix factorization and linear graph embedding (Levy & Goldberg, 2014), the results in our paper are not really that surprising.
8 Available at http://socialnetworks.mpi-sws.org/data-imc2007.html. We only used the subgraph induced by the first 500,000 nodes, since our machine does not have sufficient memory for training on the whole graph. The original graph is directed, but we treat it as an undirected graph, as in Tang et al. (2015b).
PROOF OF THEOREM 1
Since E_± consists of i.i.d. samples from P, by the uniform convergence theorem (Bartlett & Mendelson, 2002; Shalev-Shwartz & Ben-David, 2014), with probability 1 - δ, for all x such that Σ_{u∈U} ||x_u||_2^2 ≤ C_U and Σ_{v∈V} ||x_v||_2^2 ≤ C_V:

E_{(a,b,y)~P} l(y x_a^T x_b) ≤ (1/(m+m')) Σ_{i=1}^{m+m'} l(y_i x_{a_i}^T x_{b_i}) + 2 R(H_{C_U,C_V}) + 4B √( 2 ln(4/δ) / (m+m') )
where H_{C_U,C_V} = {x : Σ_{u∈U} ||x_u||_2^2 ≤ C_U, Σ_{v∈V} ||x_v||_2^2 ≤ C_V} is the hypothesis set, and R(H_{C_U,C_V}) is the empirical Rademacher complexity of H_{C_U,C_V}, which has the following explicit form:

R(H_{C_U,C_V}) = (1/(m+m')) E_σ sup_{x∈H_{C_U,C_V}} Σ_i σ_{a_i,b_i} l(y_i x_{a_i}^T x_{b_i})
Here the σ_{a,b} are i.i.d. Rademacher random variables: Pr(σ_{a,b} = 1) = Pr(σ_{a,b} = -1) = 0.5. Since l is 1-Lipschitz, by the Contraction Lemma (Shalev-Shwartz & Ben-David, 2014), we have:

R(H_{C_U,C_V}) ≤ (1/(m+m')) E_σ sup_{x∈H_{C_U,C_V}} Σ_i σ_{a_i,b_i} y_i x_{a_i}^T x_{b_i}
             = (1/(m+m')) E_σ sup_{x∈H_{C_U,C_V}} Σ_i σ_{a_i,b_i} x_{a_i}^T x_{b_i}
Let X_U denote the |U|·D-dimensional vector obtained by concatenating all vectors x_u, and X_V the |V|·D-dimensional vector obtained by concatenating all vectors x_v:

X_U = (x_{u_1}, x_{u_2}, ..., x_{u_{|U|}}),   X_V = (x_{v_1}, x_{v_2}, ..., x_{v_{|V|}})

Then we have ||X_U||_2^2 ≤ C_U and ||X_V||_2^2 ≤ C_V.
The next step is to rewrite the term Σ_i σ_{a_i,b_i} x_{a_i}^T x_{b_i} in matrix form:

sup_{x∈H_{C_U,C_V}} Σ_i σ_{a_i,b_i} x_{a_i}^T x_{b_i} = sup_{||X_U||_2^2 ≤ C_U, ||X_V||_2^2 ≤ C_V} X_U^T [A ⊗ I_D] X_V = √(C_U C_V) ||A ⊗ I_D||_2

where A ⊗ B denotes the Kronecker product of A and B, and ||A||_2 denotes the spectral norm of A (i.e., the largest singular value of A). Finally, since ||A ⊗ I_D||_2 = ||A||_2, we get the desired result in Theorem 1.
PROOF SKETCH OF CLAIM 1
We provide the sketch of a constructive proof here.
Firstly, we randomly initialize all embedding vectors. Then for each v ∈ V, consider all the constraints relevant to x_v:

C_v = {(a, b, y) ∈ E : a = v or b = v}

Since G is d-regular, |C_v| = d. Therefore, there always exists a vector b ∈ R^d satisfying the following |C_v| constraints (interpreting x_v as b), as long as all the referenced embedding vectors are linearly independent:

∀(a, b, y) ∈ C_v:  y x_a^T x_b = 1 + ε

Choose any vector b' in a small neighborhood of b that is not a linear combination of any other d - 1 embedding vectors (this is always possible, since the viable set is a d-dimensional sphere minus a finite number of (d-1)-dimensional subspaces), and set x_v ← b'.

Once we have repeated the above procedure for every node in V, it is easy to see that all the constraints y x_a^T x_b > 1, (a, b, y) ∈ E, are satisfied.
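The construction can also be carried out numerically. Below is a hedged Python sketch (our illustration, not from the paper) that implements it on a random d-regular graph, assuming numpy and networkx are available; it relies on the per-vertex linear systems staying non-singular, which the proof guarantees via the perturbation step but which the sketch simply assumes for random initializations.

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
d, n, eps = 6, 30, 0.1
G = nx.random_regular_graph(d, n, seed=0)
labels = {tuple(sorted(e)): rng.choice([-1.0, 1.0]) for e in G.edges()}   # random edge labels

x = {v: rng.standard_normal(d) for v in G.nodes()}    # random initialization
for v in G.nodes():
    nbrs = list(G.neighbors(v))                       # exactly d neighbors (d-regular)
    M = np.stack([labels[tuple(sorted((v, w)))] * x[w] for w in nbrs])
    x[v] = np.linalg.solve(M, np.full(d, 1.0 + eps))  # enforce y * x_w^T x_v = 1 + eps

# verify: every (random) label is fit with margin > 1
margins = [labels[tuple(sorted((a, b)))] * (x[a] @ x[b]) for a, b in G.edges()]
print(min(margins) > 1.0)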
ROUGH ESTIMATION OF ||A||_2 ON ERDŐS–RÉNYI GRAPHS

By the definition of the spectral norm, ||A||_2 is equal to:

||A||_2 = sup_{||x||_2 = ||y||_2 = 1, x,y ∈ R^n} y^T A x

Note that

y^T A x = Σ_{(i,j)∈E} σ_ij y_i x_j

Now let us assume that the graph G is generated from an Erdős–Rényi model (i.e., the probability of any pair u, v being directly connected is independent); then we have:

y^T A x = Σ_{ij} σ_ij e_ij y_i x_j

where e_ij is the Boolean random variable indicating whether (i, j) ∈ E.

By the Central Limit Theorem,

Σ_{ij} σ_ij e_ij y_i x_j ~ N(0, m/n^2)

where m is the expected number of edges and n is the total number of vertices. Then we have

Pr(y^T A x ≥ t) ≤ O( exp(-t^2 n^2 / (2m)) )

for all ||x||_2 = ||y||_2 = 1.

Now let S be an ε-net of the unit sphere in n-dimensional Euclidean space, which has roughly O(ε^{-n}) points in total. Consider any unit vectors x, y ∈ R^n, and let x_S, y_S be the closest points to x, y in S; then:

y^T A x = (y_S + y - y_S)^T A (x_S + x - x_S)
        = y_S^T A x_S + (y - y_S)^T A x_S + y_S^T A (x - x_S) + (y - y_S)^T A (x - x_S)
        ≤ y_S^T A x_S + 2εn + ε^2 n

since ||A||_2 ≤ n is always true.

By the union bound, the probability that at least one pair x_S, y_S ∈ S satisfies y_S^T A x_S ≥ t is at most:

Pr(∃ x_S, y_S ∈ S : y_S^T A x_S ≥ t) ≤ O( ε^{-2n} exp(-t^2 n^2 / (2m)) )

Let ε = 1/n and t = √(8 m ln n / n); then the above inequality becomes:

Pr(∃ x_S, y_S ∈ S : y_S^T A x_S ≥ t) ≤ O( e^{-n ln n} )

Since y_S^T A x_S < t for all x_S, y_S ∈ S implies that

sup_{||x||_2 = ||y||_2 = 1, x,y ∈ R^n} y^T A x < t + 2εn + ε^2 n,

we estimate ||A||_2 to be of order O( √( m ln n / n ) ).
PSEUDOCODE OF DUAL COORDINATE DESCENT ALGORITHM
Algorithm 1 shows the full pseudo-code of the DCD method for optimizing the hinge-loss variant of linear graph embedding learning.
Algorithm 1 DCD Method for Hinge-Loss Linear Graph Embedding
function DCDUpdate(u, N_+(u), N_-(u))
    D ← {(v, +1) : v ∈ N_+(u)} ∪ {(v, -1) : v ∈ N_-(u)}
    w ← Σ_{(v,s)∈D} α_{uv} s x_v                        ▷ dual representation x_u = Σ_i α_i y_i x_i
    for (v, s) ∈ D do
        G ← s w^T x_v - 1                               ▷ gradient of the dual objective
        U ← λ_s / r                                     ▷ upper bound on the dual variable
        PG ← min(G, 0) if α_{uv} = 0;  max(G, 0) if α_{uv} = U;  G otherwise
        if PG ≠ 0 then
            Q ← x_v^T x_v
            ᾱ_{uv} ← α_{uv}
            α_{uv} ← min(max(α_{uv} - G/Q, 0), U)
            w ← w + (α_{uv} - ᾱ_{uv}) s x_v
        end if
    end for
    x_u ← w
end function

function Main(V, E_+, E_-, λ_+, λ_-, r)
    Randomly initialize x_v for all v ∈ V
    Initialize α_{uv} ← 0 for all (u, v) ∈ E_+ ∪ E_-
    for t ← 1, ..., T do
        for u ∈ V do
            N_+(u) ← {v ∈ V : (u, v) ∈ E_+}
            N_-(u) ← {v ∈ V : (u, v) ∈ E_-}
            DCDUpdate(u, N_+(u), N_-(u))
        end for
    end for
end function