# TIGHT LOWER BOUNDS FOR DIFFERENTIALLY PRIVATE ERM
**Anonymous authors**
Paper under double-blind review
ABSTRACT
We consider lower bounds for differentially private ERM with general convex loss functions. For approximate-DP, the well-known upper bound for DP-ERM is O(√(p log(1/δ))/(ϵn)), which is believed to be tight. However, current lower bounds are off by logarithmic terms, namely Ω(√p/(ϵn)) for the constrained case and Ω(√p/(ϵn log p)) for the unconstrained case. We achieve tight Ω(√(p log(1/δ))/(ϵn)) lower bounds for both cases by introducing a novel biased mean property for fingerprinting codes.
As for pure-DP, we utilize a novel ℓ2 loss function instead of the linear functions considered by previous papers, and achieve the first (tight) Ω(p/(ϵn)) lower bound. We also introduce an auxiliary dimension to simplify the computation brought by the ℓ2 loss.
Our results close a gap in our understanding of DP-ERM by presenting its fundamental limits. Our techniques may be of independent interest: they enrich the toolkit for private lower bounds so that it readily applies to problems that are not (easily) reducible from one-way marginals.
1 INTRODUCTION
Since the seminal work of Dwork et al. (2006), differential privacy (DP) has become the standard
and rigorous notion of privacy guarantee for machine learning algorithms, among which many fundamental ones are based on empirical risk minimization (ERM). Motivated by this, private ERM has become one of the most well-studied problems in the DP literature, e.g. Chaudhuri and Monteleoni
(2008); Rubinstein et al. (2009); Chaudhuri et al. (2011); Kifer et al. (2012); Song et al. (2013); Jain
and Thakurta (2014); Bassily et al. (2014); Talwar et al. (2015); Kasiviswanathan and Jin (2016);
Fukuchi et al. (2017); Wu et al. (2017); Zhang et al. (2017); Wang et al. (2017); Iyengar et al.
(2019); Bassily et al. (2020); Kulkarni et al. (2021); Asi et al. (2021); Bassily et al. (2021); Wang
et al. (2021).
Roughly speaking, in the ERM setting we are given a family of convex functions defined on a convex set C ⊆ R^p and a sample set D = {d_1, · · ·, d_n} drawn i.i.d. from some unknown distribution P, with the objective to minimize the loss function
L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i),
and the value L(θ; D) − min_{θ'∈C} L(θ'; D) is called the excess empirical loss with respect to the solution θ, measuring how it compares with the best solution in C.
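As a purely illustrative sketch (not from the paper), the snippet below evaluates L(θ; D) and the excess empirical loss for the ℓ1 loss ℓ(θ; d) = ∥θ − d∥1 that is used later in Section 3; the toy dataset and the use of the coordinate-wise median as the exact minimizer of this particular loss are assumptions made only for this example.

```python
import numpy as np

def empirical_loss(theta, D):
    """L(theta; D) = (1/n) * sum_i ||theta - d_i||_1 (the l1 loss of Section 3)."""
    return np.mean(np.abs(D - theta).sum(axis=1))

# Hypothetical toy dataset: n = 5 points in {0, 1}^p with p = 3.
rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(5, 3)).astype(float)

theta = np.full(3, 0.5)              # some candidate solution
theta_star = np.median(D, axis=0)    # a coordinate-wise median minimizes this l1 objective

excess = empirical_loss(theta, D) - empirical_loss(theta_star, D)
print(f"excess empirical loss of theta: {excess:.3f}")
```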
Private ERM in the constrained case was studied first, and most of the previous literature belongs to this case. More specifically, the constrained case considers convex loss functions defined on a bounded convex set C ⊊ R^p. Assuming the functions are 1-Lipschitz over a convex set of diameter 1, an Ω(√p/(ϵn)) lower bound for private ERM is given by Bassily et al. (2014), even for the (special and simpler) generalized linear model (GLM).
However, there are still several aspects that existing works do not cover. First, existing lower bounds fall short of the best known upper bounds by at least a logarithmic term in log(1/δ). For example, Wang et al. (2017); Bassily et al. (2019) give upper bounds like O(L‖θ_0 − θ*‖_2 √(p log(1/δ))/(ϵn)), which are believed to be tight, while Bassily et al. (2014) present a lower bound of Ω(√p/(nϵ)) by reducing linear losses to the one-way marginal results in Hardt and Talwar (2010); Bun et al. (2018). Steinke and Ullman (2015) achieve the tight lower bound for answering one-way marginals with respect to the ℓ1 norm, but it does not directly imply tight lower bounds for general loss functions via the methods of Bassily et al. (2014).
Another aspect is DP-ERM in the unconstrained case, which was neglected before and has gathered attention recently; see Jain and Thakurta (2014); Song et al. (2021). The unconstrained case is interesting in that we can no longer use the linear loss, which lies at the heart of the construction in the constrained case. Moreover, previous algorithms in the constrained case suffer from the curse of dimensionality when p is large. For example, when p = Ω(n²), the lower bound is Ω(1) and no private algorithm can obtain meaningful bounds on the excess empirical loss. Song et al. (2021) prove a dimension-independent O(√rank/(ϵn)) upper bound in the unconstrained setting for the special case of GLMs, where rank denotes the rank of the feature matrix of the GLM. However, it is unknown whether similar results can be achieved for general loss functions. As for the lower bound, Asi et al. (2021) give an Ω(√p/(nϵ log p)) lower bound by considering the ℓ1 loss as the objective function and again reducing from one-way marginals.
1.1 OUR CONTRIBUTIONS
In this paper, we close both gaps by proving a tight Ω(min(1, √(p log(1/δ))/(nϵ))) lower bound for the excess risk of unconstrained 1-Lipschitz convex loss functions under approximate differential privacy. This bound automatically applies in the constrained case, improving previous results and achieving a tight lower bound for both the constrained and unconstrained cases. We summarize our main results as follows:
- We prove a tight Ω(min(1, √(p log(1/δ))/(nϵ))) lower bound for the excess risk of unconstrained 1-Lipschitz convex loss functions for any approximate differentially private algorithm. This bound improves Asi et al. (2021) by a log(p)√log(1/δ) factor in the unconstrained case and matches the upper bound in Kairouz et al. (2020).
- We also prove an Ω(min(1, p/(nϵ))) lower bound for the excess risk of unconstrained 1-Lipschitz convex loss functions for any pure differentially private algorithm.
Note that our main results for the unconstrained case extend to the constrained case directly; thus our lower bound for approximate-DP algorithms is a multiplicative √log(1/δ) factor better than the well-known bound in Bassily et al. (2014), with the help of the group privacy technique in Steinke and Ullman (2015).
A key contribution of this paper is a set of novel tools for private lower bound techniques. For most problems, accuracy lower bounds in the private setting are established via reduction from one-way marginals, so the available tools for lower bounds are quite limited. We contribute to the refinement of such tools: in particular, we propose modifications such as the additional "biased means" property for fingerprinting codes, which is the key lower bound technique. Such modifications enrich the toolkit so that it readily applies to problems which are not (easily) reducible from one-way marginals.
1.2 OUR TECHNIQUES
In general, the direct technical challenge of the unconstrained case lies in the choice of loss function
and the difficulties caused by the new loss function. The loss function is required to be convex and
Lipschitz-continuous at the same time. In the constrained case, the linear loss function is obviously
a good choice for constructing lower bounds, because it is easy to analyze and easily reducible from
one-way marginals. However, in the unconstrained case any non-trivial linear loss function can take the value −∞ and is thus no longer applicable. We further observe that convexity plus Lipschitz continuity means 'asymptotically linear': the sub-gradient along any direction must converge. This observation guides us in choosing new loss functions for unconstrained DP-ERM (which can be
extended to constrained case directly). We briefly introduce the new problems caused by the new
loss functions and our method to overcome them for approximate-DP and pure-DP separately.
1.2.1 APPROXIMATE-DP
The construction of our lower bound for approximate-DP is based on fingerprinting codes, which were first studied by Wagner (1983) and developed by Boneh and Shaw (1998); Tardos (2008). From a technical perspective, we replace the previously used linear loss with an ℓ1 norm function ℓ(θ; d) = ∥θ − d∥₁. The ℓ1 loss has been used in a concurrent work, Asi et al. (2021), which proves an Ω(√p/(ϵn log p)) lower bound for approximate-DP in the constrained case by reducing from one-way marginals; their argument extends to the unconstrained case directly. We improve this bound by logarithmic terms and achieve optimality by utilizing the group privacy technique from Steinke and Ullman (2015). We observe a novel biased mean property in the fingerprinting code that lets us combine and adjust these techniques to fit the ℓ1 loss.
We first briefly describe the proof based on the group privacy technique. For group privacy, we take a hard instance data-set D^k of size n_k := ⌊n/k⌋ from the construction of fingerprinting codes, copy it k times, and append n − kn_k data points to get a final data-set D of size n. Fix any (ϵ, δ)-differentially private algorithm A for D. If we remove one element i* from D^k, we obtain D^k_{−i*} and the corresponding D_{−i*}, where D_{−i*} and D differ in at most k elements. Running A on D and D_{−i*} respectively thus gives a (kϵ, δ')-differentially private algorithm for D^k and D^k_{−i*}. Setting k appropriately, if A achieves small error on the DP-ERM, it can serve as an adversary that contradicts the properties of the fingerprinting codes. Intuitively, differential privacy means it is hard to find the removed element i*, but fingerprinting codes say the removed element is traceable as long as D^k satisfies the required properties and A achieves small excess empirical loss with respect to L(θ; D^k).
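To make the assembly of the data-set in this group-privacy argument concrete, here is a minimal sketch (our own illustration, with hypothetical sizes n = 10, n_k = 3 and dummy value 1/2): copying D^k k times means that removing one row of D^k changes at most k rows of the final D.

```python
import numpy as np

def build_dataset(Dk, n, dummy=0.5):
    """Copy Dk (shape (n_k, p)) k = n // n_k times, then pad with dummy rows up to size n."""
    n_k, p = Dk.shape
    k = n // n_k
    return np.vstack([np.tile(Dk, (k, 1)), np.full((n - k * n_k, p), dummy)])

rng = np.random.default_rng(0)
Dk = rng.integers(0, 2, size=(3, 4)).astype(float)   # n_k = 3 rows, p = 4
D = build_dataset(Dk, n=10)                          # k = 3 copies plus one dummy row

# Removing row i* = 0 from Dk only touches its k copies inside D (rows 0, 3 and 6 here),
# so D and D_{-i*} differ in at most k = 3 elements, as group privacy requires.
D_minus = D.copy()
D_minus[[0, 3, 6]] = 0.5
print(D.shape, int((D != D_minus).any(axis=1).sum()), "rows changed")
```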
The direct use of the biased mean property is in appending the n − kn_k points. As nearly all previous lower bounds in DP convex optimization are based on results for one-way marginals, we try to present the proof in the language of one-way marginals. For linear functions, large one-way marginal errors lead directly to large excess empirical loss, which means lower bounds for private one-way marginals apply to DP-ERM. But without additional assumptions, large one-way marginal errors need not imply large excess empirical loss for the ℓ1 loss functions. Consider the toy case when p = 1 and D^k = {d_i}_{i=1}^{⌊n/k⌋} where d_i ∈ {0, 1}. Denote the mean of these ⌊n/k⌋ points by D̄^k, and similarly let D̄ be the mean of the data-set D constructed from D^k by the method above. For example, if D̄^k = 1/2 and we only append points with value 1/2, then D̄ = 1/2 and, whatever the one-way marginals are, the excess empirical loss (of the n ℓ1 loss functions) can be 0, as L(θ; D) = (1/n) ∑_{i=1}^n ∥θ − d_i∥₁ is a constant function over [0, 1]. So we need the mean D̄^k to be biased: more specifically, we need |D̄^k − 1/2| to be larger than some value depending on k, and then we can append the n − kn_k dummy points safely. For general p, we need the biased mean property to hold for a large fraction of coordinates. For any single dimension, the biased mean property serves to ensure that the prediction of the column (dimension) is unchanged by the group privacy mechanism, in which a number of dummy points are appended that could otherwise change the prediction of an unbiased column. This novel property sets more stringent conditions. Fortunately, we observe that the previous construction of fingerprinting codes in Bun et al. (2018) satisfies it.
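The p = 1 toy case can be checked numerically. In the sketch below (our own illustration, not taken from the paper), a perfectly balanced column makes L(·; D^k) constant on [0, 1], so even an output with a huge one-way-marginal error pays no excess empirical loss, whereas for a biased column only estimates near the majority value are cheap.

```python
import numpy as np

def l1_empirical_loss(theta, column):
    """L(theta; D) = (1/n) * sum_i |theta - d_i| for a single coordinate."""
    return np.mean(np.abs(column - theta))

thetas = np.linspace(0.0, 1.0, 5)

# Balanced column (mean exactly 1/2): the loss is the same for every theta in [0, 1].
balanced = np.array([0.0, 1.0, 0.0, 1.0])
print([round(l1_empirical_loss(t, balanced), 3) for t in thetas])   # all equal

# Biased column (mean 3/4): the loss strictly decreases towards the majority value 1,
# so a bad estimate of this coordinate is forced to pay in the excess loss.
biased = np.array([1.0, 1.0, 1.0, 0.0])
print([round(l1_empirical_loss(t, biased), 3) for t in thetas])
```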
1.2.2 PURE-DP
The square loss seems tempting: it intuitively reduces the unconstrained case to the constrained one, because the loss value grows quickly outside a bounded region, and it also makes the computation simple. Unfortunately, it is non-Lipschitz in the unconstrained case and thus not directly applicable.
We use the novel ℓ2-norm loss as a natural substitute, which is both convex and Lipschitz-continuous. Unlike the constrained case, the ℓ2-norm loss brings the drawback that the minimizer of the ERM problem no longer has a closed-form solution or any nice property for computation. Roughly speaking, the analysis of Bassily et al. (2014) has an 'adding dummy points' procedure which perturbs the minimizer. The linear loss has the nice property that after adding these dummy points the minimizer can only move along its own direction, while for the ℓ2 loss the minimizer might be intractable. To overcome this problem, we define the Fermat point and introduce an auxiliary dimension to simplify the messy calculation brought by the ℓ2-norm loss in our analysis. The dummy points have support only in this auxiliary dimension and guarantee that the perturbation of the minimizer is still along its own direction, reducing the computation in any high dimension to a two-dimensional subspace spanned by the minimizer and the auxiliary dimension.
1.3 CONSTRAINED AND UNCONSTRAINED
In this subsection, we briefly discuss the relationships and differences between constrained case and
unconstrained case, and compare our bounds with previous bounds.
Previous studies on DP-ERM mostly focus on the constrained setting, and the unconstrained case has recently attracted interest because Jain and Thakurta (2014); Song et al. (2021) found that an O(√rank/(ϵn)) upper bound can be achieved for minimizing the excess risk of GLMs, which evades the curse of dimensionality.
It is known that the unconstrained condition is necessary for dimension independence, as pointed out by Bassily et al. (2014), who prove an Ω(√p/(nϵ)) lower bound even for minimizing constrained GLMs in the case when rank ≤ n ≪ p.
We are interested in the necessity of the unconstrained condition for getting a rank-dependent bound. The unconstrained GLM can be viewed as a rank-dimensional problem, as noise added in the null space of the feature matrix will not affect the excess empirical loss. However, this does not hold in the constrained case. Take the dimension-independent algorithm in Song et al. (2021), which is based on SGD, as an example. The pitfall for the dimension-independent algorithm lies in the projection step if SGD is modified to projected SGD for the constrained case: running SGD in the constrained setting requires a projection, which might "increase the rank". We can see there is some fundamental difference between the constrained and unconstrained cases, and analyzing the unconstrained case is also an interesting and important direction.
Classic methods like Bassily et al. (2014) usually connect the linear loss to one-way marginals (Bun et al. (2018)), and then use lower bounds for one-way marginals to imply lower bounds for the linear loss. As Bassily et al. (2014) use the one-way marginal results from Hardt and Talwar (2010); Bun et al. (2018), and Steinke and Ullman (2015) achieve a tight Ω(√(p log(1/δ))/(αϵ)) bound by using the novel group privacy technique, one may ask whether combining Steinke and Ullman (2015) and Bassily et al. (2014) trivially achieves the tight lower bound in the constrained case. Because Steinke and Ullman (2015) consider the ℓ1 distance but Bassily et al. (2014) consider the ℓ2 distance, there is a √p gap between them and one cannot directly combine them. Though with some effort, one may get tighter bounds in the constrained case by modifying the results in Steinke and Ullman (2015) from the ℓ1 norm to the ℓ2 norm, then applying the analysis in Bassily et al. (2014).
As shown in Asi et al. (2021), proving a nearly tight lower bound in the unconstrained case is direct by utilizing one-way marginals and choosing the right objective functions, but getting rid of the extra logarithmic terms in the unconstrained case is nontrivial, as one-way marginals do not work directly with group privacy. To the best of our knowledge, our result is the first to achieve this improved tight lower bound for a general loss function class in both cases. See Table 1 for detailed comparisons between previous bounds and ours.
1.4 RELATED WORK
The existing lower bounds on the excess empirical loss, i.e. the constrained case in Bassily et al. (2014) and the unconstrained case in Song et al. (2021), all use GLM functions. The objective function used in Bassily et al. (2014) is ℓ(θ; d) = ⟨θ, d⟩, which cannot be applied in the unconstrained case, otherwise the loss value would be infinite. Considering this limitation, Song et al. (2021) adopt ℓ(θ; d) = |⟨θ, x⟩ − y|. They transfer the problem of minimizing a GLM to estimating one-way marginals, and then get the lower bound from the properties in the definition of fingerprinting codes. As mentioned before, our lower bounds are based on ℓ1 norms, so we cannot transfer to one-way marginals directly. Merely using the properties in the definition of fingerprinting codes is not enough for a good lower bound; instead, we need to make full use of the concrete structure of the codes.
|Article|Constrained?|Loss Function|Pure DP|Approximate DP|
|---|---|---|---|---|
|Bassily et al. (2014)|constrained|GLM|Ω(p/(nϵ))|Ω(√p/(nϵ))|
|Song et al. (2021)|unconstrained|GLM|N/A|Ω(√rank/(nϵ))|
|Asi et al. (2021)|both|general|N/A|Ω(√p/(nϵ log p))|
|Ours|both|general|Ω(p/(nϵ))|Ω(√(p log(1/δ))/(nϵ))|
Table 1: Comparison of lower bounds for private convex ERM. Our lower bounds can be extended to the constrained case easily. The lower bound of Song et al. (2021) is weaker than ours in the important p ≫ n setting.
As for the upper bounds, private ERM (Wang et al. (2017)) and private Stochastic Convex Optimization (SCO) (Feldman et al. (2020)) for convex and smooth functions are extensively studied, where the objective in SCO is to minimize the population loss E_{d∼P}[ℓ(θ; d)], and only (nearly) linearly many gradient queries are needed to get optimal excess loss. But for convex functions without any smoothness assumption, the current best algorithms (Kulkarni et al. (2021); Asi et al. (2021)) need more queries (n^{1.375} in the worst case). Besides, most of the previous works consider problems in the ℓ2 norm, and there are some recent results (Bassily et al. (2021); Asi et al. (2021)) studying general ℓp norms.
1.5 ROADMAP
In Section 2 we introduce the background knowledge needed in the rest of the paper. In Section 3 we prove the main result of this paper, an Ω(min(1, √(p log(1/δ))/(nϵ))) lower bound for approximate-DP ERM in the unconstrained case. In Section 4 we discuss an Ω(min(1, p/(nϵ))) lower bound for the excess risk of pure DP algorithms for minimizing any unconstrained 1-Lipschitz convex loss function.
Section 5 concludes this paper. All missing (technical) proofs can be found in the appendix.
2 PRELIMINARY
We consider minimizing the excess risk of unconstrained Lipschitz convex functions with DP algorithms in this paper, where we let n denote the sample size and p the dimension of a sample. In this section, we introduce the main background knowledge required in the rest of the paper. Additional background, such as the definition of GLM, can be found in the appendix.
**Definition 2.1 (Differential privacy).** A randomized mechanism M is (ϵ, δ)-differentially private if for any event O ∈ Range(M) and for any neighboring databases D and D′ that differ in a single data element, one has
Pr[M(D) ∈ O] ≤ exp(ϵ) Pr[M(D′) ∈ O] + δ.
When δ > 0, we refer to the above condition as approximate differential privacy. The special case
when δ = 0 is called pure differential privacy.
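As a minimal numeric illustration of this definition (not part of the paper), the sketch below verifies the privacy inequality for one-bit randomized response, a textbook pure-DP mechanism; the flip probability is an arbitrary choice for this example.

```python
import numpy as np

def randomized_response_dist(bit, flip_prob):
    """Return (Pr[output = 0], Pr[output = 1]) for randomized response on one bit."""
    keep = 1.0 - flip_prob
    return (keep, flip_prob) if bit == 0 else (flip_prob, keep)

flip_prob = 0.25
eps = np.log((1 - flip_prob) / flip_prob)   # randomized response is eps-DP with this eps

p0 = randomized_response_dist(0, flip_prob)  # neighboring databases: the bit is 0 vs 1
p1 = randomized_response_dist(1, flip_prob)

# Check Pr[M(D) = o] <= exp(eps) * Pr[M(D') = o] for every output o (here delta = 0).
for o in (0, 1):
    assert p0[o] <= np.exp(eps) * p1[o] + 1e-12
    assert p1[o] <= np.exp(eps) * p0[o] + 1e-12
print(f"randomized response with flip_prob={flip_prob} satisfies ({eps:.3f}, 0)-DP")
```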
**Definition 2.2 (Empirical Risk Minimization).** Given a family of convex loss functions {ℓ(θ, d)}_{d∈D} of θ over K ⊆ R^p and a set of samples D = {d_1, · · ·, d_n} over the universe D, the objective of Empirical Risk Minimization (ERM) is to minimize
L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i).
The excess empirical loss with respect to a solution θ is defined by
L(θ; D) − L(θ*; D)
where θ* ∈ arg min_{θ∈K} L(θ; D), measuring the performance of the solution θ compared with the best solution in K.
**Definition 2.3 (G-Lipschitz Continuity).** A function f : R^p → R is G-Lipschitz continuous with respect to the ℓ2 norm if the following holds for all θ, θ′ ∈ R^p:
|f(θ) − f(θ′)| ≤ G ∥θ − θ′∥_2    (1)
The Chernoff Bound will serve to prove the fingerprinting code constructed in Bun et al. (2018)
satisfies our modified definition of fingerprinting code as well.
**Proposition 2.4 (The Chernoff Bound).** Let X = ∑_{i=1}^n X_i where X_i = 1 with probability p_i and X_i = 0 with probability 1 − p_i. Assume all X_i are independent random variables. Let u = ∑_{i=1}^n p_i. Then
P(|X − u| ≥ δu) ≤ 2 exp(−uδ²/2).    (2)
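Proposition 2.4 is easy to sanity-check by simulation. The sketch below (our own illustration, with arbitrarily chosen p_i and δ) estimates Pr(|X − u| ≥ δu) empirically and prints it next to the stated bound.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.8, size=200)   # arbitrary success probabilities p_i
u = p.sum()
delta = 0.3

trials = 20000
X = (rng.random((trials, p.size)) < p).sum(axis=1)   # independent draws of X = sum_i X_i
empirical = np.mean(np.abs(X - u) >= delta * u)

bound = 2 * np.exp(-u * delta**2 / 2)   # the bound as stated in Proposition 2.4
print(f"empirical tail probability: {empirical:.5f}   stated bound: {bound:.5f}")
```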
3 APPROXIMATE DP
In this section, we consider the lower bound for approximate differential privacy where 2^{−O(n)} < δ < o(1/n). Such an assumption on δ is common in the literature, for example in Steinke and Ullman (2015). We briefly introduce the (classic) fingerprinting codes first.
3.1 FINGERPRINTING CODES
**Definition 3.1 (Fingerprinting codes).** We are given n, p ∈ N and ξ ∈ (0, 1]. A pair of (random) algorithms (Gen, Trace) is called an (n, p)-fingerprinting code with security ξ ∈ (0, 1] if Gen outputs a code-book C ∈ {0, 1}^{n×p} and, for any (possibly randomized) adversary A_FP and any subset S ⊆ [n], if we set c ←_R A_FP(C_S), then
- Pr[c ∈ F(C_S) ∧ Trace(C, c) = ⊥] ≤ ξ
- Pr[Trace(C, c) ∈ [n] \ S] ≤ ξ
where F(C_S) = {c ∈ {0, 1}^p | ∀j ∈ [p], ∃i ∈ S, c_j = c_ij}, and the probability is taken over the coins of Gen, Trace and A_FP.
There is a very good motivation behind fingerprinting codes. For example, a software distributor adds a fingerprint to each copy of her software to protect the IP. A coalition of malicious users can compare their copies and find the digits that differ, which must belong to the fingerprint. For the other locations they cannot decide and will not change them; this is called the marking condition, and it is the reason we require c ∈ F(C_S).
The two properties of fingerprinting codes demonstrate that one can identify at least one malicious user with high probability. Bun et al. (2018) extend the definition so that the codes can tolerate a small fraction of errors in the marking condition. We further modify this definition, requiring the codes to have biased means; see below.
3.2 OUR RESULT
We modify the definition of fingerprinting code instead for our analysis.
**Definition 3.2 (Error Robust Biased Mean Fingerprinting Codes).** Given n, p ∈ N and ξ, β, α1, α2, α3 ∈ (0, 1], we say a pair of (random) algorithms (Gen, Trace) is an (n, p)-fingerprinting code with security ξ and (α1, α2, α3)-biased mean, robust to a β fraction of errors, if Gen outputs a code-book C ∈ {0, 1}^{n×p} and, for any (possibly randomized) adversary A_FP and any coalition S ⊆ [n], if we set c ←_R A_FP(C_S), then
- Pr[c ∈ F_β(C_S) ∧ Trace(C, c) = ⊥] ≤ ξ
- Pr[Trace(C, c) ∈ [n] \ S] ≤ ξ
- Pr[G_{α1}(C) ≥ (1 − α2)] ≤ α3
where F_β(C_S) = {c ∈ {0, 1}^p | Pr_{j←_R[p]}[∃i ∈ S, c_j = c_ij] ≥ 1 − β}, G_α(C_S) = |{j : |∑_{i∈S} c_ij/|S| − 1/2| ≤ α}| is the number of slightly biased columns in C_S, and the probability is taken over the coins of Gen, Trace and A_FP.
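To make the quantities in Definition 3.2 concrete, the sketch below (our own illustration with a hypothetical code-book) computes the fraction of columns on which a combined codeword respects the marking condition, which governs membership in F_β(C_S), and the count G_α(C_S) of columns whose mean is within α of 1/2.

```python
import numpy as np

def marking_fraction(C_S, c):
    """Fraction of columns j where some colluding user i has C_S[i, j] == c[j]."""
    return np.mean((C_S == c).any(axis=0))

def G_alpha(C_S, alpha):
    """Number of columns of C_S whose mean is within alpha of 1/2 (Definition 3.2)."""
    return int(np.sum(np.abs(C_S.mean(axis=0) - 0.5) <= alpha))

# Hypothetical code-book restricted to a coalition S of 4 users, with p = 8 columns.
rng = np.random.default_rng(0)
C_S = rng.integers(0, 2, size=(4, 8))
c = rng.integers(0, 2, size=8)        # a candidate combined codeword

print("fraction of columns respecting the marking condition:", marking_fraction(C_S, c))
print("G_0.01(C_S) =", G_alpha(C_S, alpha=0.01))
```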
We use the fingerprinting code of Bun et al. (2018) for the construction of our lower bound; see Algorithm 1 in the appendix. We utilize an ℓ1 loss and use the fingerprinting code of Bun et al. (2018) as our 'hard case'. To proceed, we first introduce a few lemmas which will be of use later. Similar to Bun et al. (2018), we have the following standard lemma which allows us to reduce any
_ϵ < 1 to ϵ = 1 case without loss of generality, using the well-known ’secrecy of the sample’ lemma_
from Kasiviswanathan et al. (2011).
**Lemma 3.1.** A condition Q has sample complexity n^* for algorithms with (1, o(1/n))-differential privacy (n^* is the smallest sample size such that there exists a (1, o(1/n))-differentially private algorithm A which satisfies Q), if and only if it also has sample complexity Θ(n^*/ϵ) for algorithms with (ϵ, o(1/n))-differential privacy.
Notice that Lemma 3.1 discusses the sample complexity of the algorithm and is therefore independent of the (α1, α2, α3)-biased mean appearing in the above definition, which only concerns the construction of the fingerprinting code. The following lemma verifies that the fingerprinting code of Algorithm 1 indeed has biased mean as in Definition 3.2. The proof is straightforward, using the Chernoff bound multiple times.
**Lemma 3.2.** Algorithm 1 (the fingerprinting code) has (1/100, 999/1000, exp(−Ω(p)))-biased mean.
Directly combining Lemma 3.2 and Theorem 3.4 from Bun et al. (2018), we have the following lemma, which states that for the fingerprinting code of Algorithm 1 (which we will use in proving our main theorem) to satisfy the error robust biased mean property in Definition 3.2, one needs roughly Ω̃(√p) samples.
**Lemma 3.3.** For every p ∈ N and ξ ∈ (0, 1], there exists an (n, p)-fingerprinting code (Algorithm 1) with security ξ and (1/100, 999/1000, exp(−Ω(p)))-biased mean, robust to a 1/75 fraction of error, for
n = n(p, ξ) = Ω̃(√(p/log(1/ξ))).
We are ready to prove the main result of this section by using Lemma 3.3 to reach a contradiction.
Consider the following ℓ1 norm loss function. Define
ℓ(θ; d) = ||θ − d||_1,  θ, d ∈ R^p    (3)
For any data-set D = {d_1, ..., d_n}, we define L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i).
**Theorem 3.4 (Lower bound for (ϵ, δ)-differentially private algorithms).** Let n, p be large enough and 1 ≥ ϵ > 0, 2^{−O(n)} < δ < o(1/n). For every (ϵ, δ)-differentially private algorithm with output θ^priv ∈ R^p, there is a data-set D = {d_1, ..., d_n} ⊂ {0, 1}^p ∪ {1/2}^p such that
E[L(θ^priv; D) − L(θ⋆; D)] = Ω(min(1, √(p log(1/δ))/(nϵ)) · GC)    (4)
where ℓ is G-Lipschitz, θ⋆ is a minimizer of L(θ; D), and C is the diameter of the set {arg min_θ L(θ; D) | D ⊂ {0, 1}^{n×p}}, which contains all possible true minimizers.
Due to the space limit, we leave the proof of Theorem 3.4 to the appendix.
The dependence on the diameter C makes sense, as one could instead minimize a substitute loss function ℓ′(x) = ℓ(ax) where a ∈ (0, 1) is a constant, which decreases the Lipschitz constant G but increases the diameter C. Note also that C > 0 whenever the possible data-sets D do not all share the same minimizer of L, which is usually the case. This bound improves over Bassily et al. (2014) by a logarithmic factor, by combining the group privacy technique in Steinke and Ullman (2015) and our modified definition of fingerprinting codes.
We leave several remarks discussing slight generalizations of Theorem 3.4.
**Remark 3.5.** Our lower bound can be directly extended to the constrained setting, by setting the constrained domain to be [0, 1]^p, which contains the convex hull of all possible minimizers {arg min_θ L(θ; D) | D ⊂ {0, 1}^{n×p}}.
**Remark 3.6.** Similarly, we can derive an Ω(min(1, √(rank log(1/δ))/(nϵ))) lower bound when we additionally assume the rank of the gradient subspace. The analysis remains the same, except that we first apply an orthogonal transformation and then set the complement of the gradient subspace to be all 0's in D.
**Remark 3.7.** The third property of Definition 3.2 serves the group privacy analysis, further improving a log(1/δ) term over Bassily et al. (2014). One can simplify the proof by setting k = 1 and borrowing the lower bound for one-way marginals from Bun et al. (2018), at the cost of losing this log(1/δ) term. See the appendix for details.
4 PURE DP
In this section, we give a lower bound for ϵ-(pure) differentially private algorithms minimizing an unconstrained convex Lipschitz loss function L(θ; D). In the construction of lower bounds for constrained DP-ERM, Bassily et al. (2014) chose the linear function ℓ(θ; d) = ⟨θ, d⟩ as their objective, which is not applicable in the unconstrained setting because it could decrease to negative infinity. Instead, we use a novel ℓ2 norm loss function to overcome this problem:
ℓ(θ; d) = ||θ − d||_2,  θ, d ∈ R^p    (5)
For any dataset D = {d_1, ..., d_n}, we define L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i). Clearly, both ℓ and L are convex and 1-Lipschitz. The structure of the proof is similar to that in Bassily et al. (2014), while the technical details are quite different, as we need to handle a non-linear objective function. Different from the simple average of points in Bassily et al. (2014), we need to consider the Fermat point instead, which is a minimizer of the ℓ2 norm loss function.
4.1 FERMAT POINT
**Definition 4.1 (Fermat point).** The set of Fermat points P(D) of a dataset D = {d_1, ..., d_n} contains the points minimizing the total ℓ2 distance to all points in D:
P(D) = arg min_{x ∈ R^p} { ∑_{i=1}^n ||x − d_i||_2 }    (6)
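For intuition only (this is not used in the paper's proofs), a Fermat point of a finite dataset can be approximated numerically with Weiszfeld-style iterations for the geometric median. The sketch below is a minimal version under simplifying assumptions: a hypothetical dataset, a fixed iteration count, and no special handling of iterates that coincide with data points beyond a small epsilon.

```python
import numpy as np

def fermat_point(D, iters=200, eps=1e-9):
    """Approximate arg min_x sum_i ||x - d_i||_2 by Weiszfeld iterations."""
    x = D.mean(axis=0)                      # start from the centroid
    for _ in range(iters):
        dists = np.maximum(np.linalg.norm(D - x, axis=1), eps)
        w = 1.0 / dists                     # inverse-distance weights
        x = (w[:, None] * D).sum(axis=0) / w.sum()
    return x

D = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]])
x = fermat_point(D)
print("approximate Fermat point:", np.round(x, 4))
print("objective value:", round(float(np.linalg.norm(D - x, axis=1).sum()), 4))
```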
One obstacle to using the ℓ2 norm as our loss is that Fermat points are not unique in the worst case. Given a (finite) dataset D, we can easily see that P(D) is a compact subset of the convex hull of D, which encourages us to define a unique "maximum" element in P(D). To do so, we introduce the following well-order on R^p.
**Definition 4.2 (Coordinate dictionary order).** A point x is said to be larger than y in coordinate dictionary order if and only if there exists an index i ∈ [p] such that x_i > y_i, and for any j < i we have x_j = y_j.
It is straightforward to verify that CDO (coordinate dictionary order) is a well-order. Next we use CDO to select a unique member from the set P(D) of all Fermat points.
**Definition 4.3 (Ordered Fermat point).** The Ordered Fermat point q(D) of a dataset D = {d_1, ..., d_n} is defined as
q(D) = arg max_{x ∈ P(D)} x  (with respect to CDO).    (7)
Such a q(D) must exist for a finite dataset as long as P(D) is compact and non-empty, because there cannot be an ordered infinite sequence with its limit outside of P(D), which would contradict compactness. The technical proof of the following proposition is deferred to the appendix.
**Proposition 4.1.** q(D) always exists for a finite dataset D.
Note that q(D) is unique by definition and is always a minimizer of L(θ; D) over R^p. In the following subsection we show that no pure DP algorithm can estimate q(D) with good accuracy, and then prove that a large error in estimating q(D) leads to a large excess risk for the ℓ2 norm loss as well, establishing the main lower bound of this section.
4.2 LOWER BOUND
In this subsection, we prove a lower bound on the excess risk incurred by any ϵ-differentially private algorithm whose output is denoted by θ^priv ∈ R^p. We first introduce the following lemma showing that it is impossible to find the location of the ordered Fermat point q(D) with good accuracy using a pure DP algorithm.
The proof follows the spirit of Bassily et al. (2014), constructing datasets 'far away' from each other such that the events of estimating the Fermat point of each dataset accurately are mutually disjoint. Then, by differential privacy, as long as one can estimate one dataset accurately, one can estimate any other one with a certain probability as well. The sum of all these probabilities is no more than 1 due to the disjointness, which leads to the desired bound.
We denote e_1 ≜ (1, 0, ..., 0)^⊤ and let ⊕ denote the direct sum of vectors, i.e. α ⊕ β = (α, β) where α ∈ R^a and β ∈ R^b are both vectors. For a vector α and a set S, we denote α ⊕ S = {(α, β) : β ∈ S}.
**Lemma 4.4.** Let n, p ≥ 2 and ϵ > 0. There is a number M = Ω(min(n, p/ϵ)) such that for any ϵ-differentially private algorithm A, there is a dataset D = {d_1, ..., d_n} ⊂ (0 ⊕ {−1/√(p−1), 1/√(p−1)}^{p−1}) ∪ {e_1, −e_1, 0} with ||∑_{i=1}^n d_i||_2 ≤ M such that, with probability at least 1/2 (taken over the algorithm's random coins), we have
||A(D) − q(D)||_2 = Ω(min(1, p/(nϵ)))    (8)
The classic analysis of Bassily et al. (2014) contains an 'adding dummy points' step which perturbs the location of the minimizer. In the constrained case, such a perturbation does not change the direction of the minimizer (seen as a vector), but in the unconstrained case non-linear loss functions no longer enjoy such good properties. To overcome this issue, we introduce the auxiliary dimension, and the dummy points we add have support only in this dimension. The benefit of doing so is that the Fermat point q(D) then also changes only along its own direction after we add dummy points, which simplifies the computation.
Lemma 4.4 implies that it is impossible to estimate the ordered Fermat point with good accuracy using a pure DP algorithm. In the following theorem we show that a bad estimate of the ordered Fermat point leads to a higher ℓ2 norm loss. As the Fermat point is a minimizer of the ℓ2 norm loss, we can naturally translate the discrepancy in estimating q(D) into excess risk.
**Theorem 4.5 (Lower bound for ϵ-differentially private algorithms).** Let n, p ≥ 2 and ϵ > 0. For every ϵ-differentially private algorithm with output θ^priv ∈ R^p, there is a dataset D = {d_1, ..., d_n} ⊂ (0 ⊕ {−1/√(p−1), 1/√(p−1)}^{p−1}) ∪ {e_1, −e_1, 0} such that, with probability at least 1/2 (over the algorithm's random coins), we must have
L(θ^priv; D) − min_θ L(θ; D) = Ω(min(1, p/(nϵ)))    (9)
The proof is based on a calculation in the two-dimensional subspace spanned by q(D) and the auxiliary dimension. By observing that q(D) is perpendicular to the auxiliary dimension, we can parameterize θ^priv by these two unit vectors and write down the expression of L(θ^priv; D) − min_θ L(θ; D) explicitly. Then by elementary inequality scaling we get the desired result.
**Remark 4.6. In fact the lower bound in Theorem 4.5 also holds for the case p = 1. The only**
_difference is that the case p = 1 doesn’t need the auxiliary dimension because the perturbation_
_of the minimizer is always along its direction. We can simply use dummy points {1, −1, 0} and a_
_similar analysis to Bassily et al. (2014) to achieve this result._
5 CONCLUSION
In this paper, we study differentially private convex ERM in the unconstrained case and give the first tight lower bounds for approximate-DP ERM for general loss functions. Our results also directly imply the same lower bound for the constrained case, improving the classic lower bound in Bassily et al. (2014) by a log(1/δ) factor. We also give an Ω(p/(nϵ)) lower bound for unconstrained pure-DP ERM, which recovers the result in the constrained case. Our techniques enrich the quite limited tools in
constructing lower bounds in the private setting and we hope they can find future use, especially
for those problems which are not (easily) reducible from one-way marginals. Designing better
algorithms for general (un)constrained DP-ERM based on our insights would also be an interesting
and meaningful direction, which we leave as future work.
REFERENCES
Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization:
Optimal rates in ℓ1 geometry. arXiv preprint arXiv:2103.01516, 2021.
Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient
algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of
_Computer Science, pages 464–473. IEEE, 2014._
Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic
convex optimization with optimal rates. In Advances in Neural Information Processing Systems,
pages 11282–11291, 2019.
Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient
descent on nonsmooth convex losses. arXiv preprint arXiv:2006.06914, 2020.
Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. arXiv preprint arXiv:2103.01278, 2021.
Dan Boneh and James Shaw. Collusion-secure fingerprinting for digital data. IEEE Transactions on
_Information Theory, 44(5):1897–1905, 1998._
Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate
differential privacy. SIAM Journal on Computing, 47(5):1888–1938, 2018.
Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In NIPS, volume 8, pages 289–296. Citeseer, 2008.
Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical
risk minimization. Journal of Machine Learning Research, 12(3), 2011.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity
in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations
_and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014._
Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal
rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of
_Computing, pages 439–449, 2020._
Kazuto Fukuchi, Quang Khai Tran, and Jun Sakuma. Differentially private empirical risk minimization with input perturbation. In International Conference on Discovery Science, pages 82–90.
Springer, 2017.
Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the
_forty-second ACM symposium on Theory of computing, pages 705–714, 2010._
Roger Iyengar, Joseph P Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security
_and Privacy (SP), pages 299–316. IEEE, 2019._
Prateek Jain and Abhradeep Guha Thakurta. (near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pages 476–484. PMLR,
2014.
Peter Kairouz, Mónica Ribero, Keith Rush, and Abhradeep Thakurta. Dimension independence in
unconstrained private erm via adaptive preconditioning. arXiv preprint arXiv:2008.06570, 2020.
Shiva Prasad Kasiviswanathan and Hongxia Jin. Efficient private empirical risk minimization for
high-dimensional learning. In International Conference on Machine Learning, pages 488–497.
PMLR, 2016.
Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam
Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization
and high-dimensional regression. In Conference on Learning Theory, pages 25–1. JMLR Workshop and Conference Proceedings, 2012.
Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization
and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
Benjamin IP Rubinstein, Peter L Bartlett, Ling Huang, and Nina Taft. Learning in a large function
space: Privacy-preserving mechanisms for svm learning. arXiv preprint arXiv:0911.5708, 2009.
Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing,
pages 245–248. IEEE, 2013.
Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms. In International Conference on Artificial Intelligence and
_Statistics, pages 2638–2646. PMLR, 2021._
Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. arXiv
_preprint arXiv:1501.06095, 2015._
Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private lasso. In Proceedings
_of the 28th International Conference on Neural Information Processing Systems-Volume 2, pages_
3025–3033, 2015.
Gábor Tardos. Optimal probabilistic fingerprint codes. Journal of the ACM (JACM), 55(2):1–24,
2008.
Neal R Wagner. Fingerprinting. In 1983 IEEE Symposium on Security and Privacy, pages 18–18.
IEEE, 1983.
Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited:
Faster and more general. In Advances in Neural Information Processing Systems, pages 2722–
2731, 2017.
Puyu Wang, Yunwen Lei, Yiming Ying, and Hai Zhang. Differentially private sgd with non-smooth
loss. arXiv preprint arXiv:2101.08925, 2021.
Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on
differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the
_2017 ACM International Conference on Management of Data, pages 1307–1322, 2017._
Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. arXiv preprint arXiv:1703.09947, 2017.
A ADDITIONAL BACKGROUND KNOWLEDGE
A.1 GENERALIZED LINEAR MODEL (GLM)
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that
allows for response variables that have error distribution models other than a normal distribution. To
be specific,
**Definition A.1 (Generalized linear model (GLM)). The generalized linear model (GLM) is a special**
_class of ERM problems where the loss function ℓ(θ, d) takes the following inner-product form:_
_ℓ(θ; d) = ℓ(⟨θ, x⟩; y)_ (10)
_for d = (x, y). Here, x ∈_ R[p] _is usually called the feature vector and y ∈_ R is called the response.
A.2 PROPERTIES OF DIFFERENTIAL PRIVACY
In this subsection we introduce several very basic properties of differential privacy without proving
them (refer Dwork et al. (2014) for details). Readers familiar with the field of differential privacy
can feel free to skip this section.
**Proposition A.1 (Group privacy).** If M : X^n → Y is an (ϵ, δ)-differentially private mechanism, then for all pairs of datasets x, x′ ∈ X^n, M(x) and M(x′) are (kϵ, kδe^{kϵ})-indistinguishable when x, x′ differ in exactly k locations.
**Proposition A.2 (Post processing).** If M : X^n → Y is (ϵ, δ)-differentially private and A : Y → Z is any randomized function, then A ◦ M : X^n → Z is also (ϵ, δ)-differentially private.
**Proposition A.3 (Composition).** Let M_i be an (ϵ_i, δ_i)-differentially private mechanism for all i ∈ [k]. If M^[k] is defined to be
M^[k](x) = (M_1(x), ..., M_k(x))    (11)
then M^[k] is (∑_{i=1}^k ϵ_i, ∑_{i=1}^k δ_i)-differentially private.
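As a small numeric illustration of Propositions A.1 and A.3 (our own example, with hypothetical privacy parameters), the sketch below computes the group-privacy guarantee for datasets differing in k elements and the basic composition of several mechanisms.

```python
import numpy as np

def group_privacy(eps, delta, k):
    """Proposition A.1: datasets differing in k elements are (k*eps, k*delta*e^{k*eps})-indistinguishable."""
    return k * eps, k * delta * np.exp(k * eps)

def compose(budgets):
    """Proposition A.3: the (eps_i, delta_i) budgets of composed mechanisms simply add up."""
    return sum(e for e, _ in budgets), sum(d for _, d in budgets)

print(group_privacy(eps=1.0, delta=1e-6, k=3))
print(compose([(0.5, 1e-6), (0.3, 1e-6), (0.2, 0.0)]))
```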
B FINGERPRINTING CODE
In this section we briefly introduce the mechanism of the fingerprinting code, Algorithm 1. The sub-procedure part is the original fingerprinting code of Tardos (2008), given by a pair of randomized algorithms (Gen, Trace). The code generator Gen outputs a codebook C ∈ {0, 1}^{n×p}. The ith row of C is the codeword of user i. The parameter p is called the length of the fingerprinting code.
The security property of fingerprinting codes asserts that any codeword can be "traced" to a user i. Moreover, we require that the fingerprinting code can find one of the malicious users even when they get together and combine their codewords in any way that respects the marking condition. That is, there is a tracing algorithm Trace that takes as input the codebook C and the combined codeword c′ and outputs one of the malicious users with high probability.
The sub-procedure Gen′ first uses a sin²(x)-like distribution to generate a parameter p_j (the mean) for each column j independently, then generates C randomly by setting each element to be 1 with probability p_j according to its location. The sub-procedure Trace′ computes a threshold value Z and a 'score function' S_i(c′) for each user i, then reports i when its score is higher than the threshold.
The main procedure was introduced in Bun et al. (2018), where Gen adds dummy columns to the original fingerprinting code and applies a random permutation. Trace can first 'undo' the permutation and remove the dummy columns, then use Trace′ as a black box. This procedure makes the fingerprinting code more robust in that it tolerates a small fraction of errors in the marking condition.
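For readers who want to experiment, here is a minimal Python sketch (our own, purely illustrative rendering, not the authors' code) of the sub-procedures Gen′ and Trace′ of Algorithm 1, which is reproduced below; the tiny parameters and the simple max-combination used as the coalition's codeword are assumptions made only to keep the example runnable.

```python
import numpy as np

def gen_prime(n, xi, rng):
    """Gen': sample column biases p_j = sin^2(r_j) and a code-book C in {0,1}^{n x d}."""
    d = int(100 * n**2 * np.log(n / xi))
    t = 1.0 / (300 * n)
    t_prime = np.arcsin(np.sqrt(t))
    r = rng.uniform(t_prime, np.pi / 2 - t_prime, size=d)
    p = np.sin(r) ** 2
    C = (rng.random((n, d)) < p).astype(int)
    return C, p

def trace_prime(C, p, c_combined, xi):
    """Trace': accuse every user i whose score S_i(c') reaches the threshold Z/2."""
    n = C.shape[0]
    Z = 20 * n * np.log(n / xi)
    q = np.sqrt((1 - p) / p)
    U = np.where(C == 1, q, -1.0 / q)
    scores = (c_combined * U).sum(axis=1)        # S_i(c') = sum_j c'_j U_ij
    accused = np.flatnonzero(scores >= Z / 2)
    return accused if accused.size else None     # None plays the role of ⊥

rng = np.random.default_rng(0)
n, xi = 3, 0.1
C, p = gen_prime(n, xi, rng)
coalition = [0, 1]
c_combined = C[coalition].max(axis=0)            # respects the marking condition
print("accused users:", trace_prime(C, p, c_combined, xi))
```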
**Algorithm 1 The Fingerprinting Code (Gen, Trace)**
1: Sub-procedure Gen′:
2: Let d = 100n² log(n/ξ) be the length of the code.
3: Let t = 1/(300n) be a parameter and let t′ be such that sin²(t′) = t.
4: for j = 1, ..., d: do
5: Choose r_j uniformly at random from [t′, π/2 − t′] and let p_j = sin²(r_j). Note that p_j ∈ [t, 1 − t].
6: For each i = 1, ..., n, set C_ij = 1 with probability p_j independently.
7: end for
8: return C
9: Sub-procedure Trace′(C, c′):
10: Let Z = 20n log(n/ξ) be a parameter.
11: For each j = 1, ..., d, let q_j = √((1 − p_j)/p_j).
12: For each j = 1, ..., d and each i = 1, ..., n, let U_ij = q_j if C_ij = 1 and U_ij = −1/q_j otherwise.
13: for each i = 1, ..., n: do
14: Let S_i(c′) = ∑_{j=1}^d c′_j U_ij
15: Output i if S_i(c′) ≥ Z/2.
16: Output ⊥ if S_i(c′) < Z/2 for every i = 1, ..., n.
17: end for
18: Main-procedure Gen:
19: Let C be the (random) output of Gen′, C ∈ {0, 1}^{n×d}
20: Append 2d 0-marked columns and 2d 1-marked columns to C.
21: Apply a random permutation π to the columns of the augmented codebook.
22: Let the new codebook be C′ ∈ {0, 1}^{n×5d}.
23: return C′
24: Main-procedure Trace(C, c′):
25: Obtain C′ from the shared state with Gen.
26: Obtain C by applying π⁻¹ to the columns of C′ and removing the dummy columns.
27: Obtain c by applying π⁻¹ to c′ and removing the symbols corresponding to fake columns.
28: return i randomly from Trace′(C, c).
C OMITTED PROOFS
C.1 PROOF OF LEMMA 3.1
_Proof._ The proof uses a black-box reduction and therefore does not depend on Q. The direction that O(n^*/ϵ) samples are sufficient is equivalent to proving the assertion that, given a (1, o(1/n))-
differentially private algorithm A, we can get a new algorithm A′ with (ϵ, o(1/n))-differential privacy at the cost of shrinking the size of the dataset by a factor of ϵ.
Given input ϵ and a dataset X, we construct A′ to first generate a new dataset T by selecting each element of X with probability ϵ independently, and then feed T to A. Fix an event S and two neighboring datasets X_1, X_2 that differ in a single element i. Consider running A on X_1. If i is not included in the sample T, then the output is distributed the same as in a run on X_2. On the other hand, if i is included in the sample T, then the behavior of A on T is only a factor of e off from the behavior of A on T \ {i}. Again, because of independence, the distribution of T \ {i} is the same as the distribution of T conditioned on the omission of i.
For a set X, let p_X denote the distribution of A(X). We have that for any event S,
p_{X_1}(S) = (1 − ϵ) p_{X_1}(S | i ∉ T) + ϵ p_{X_1}(S | i ∈ T)
≤ (1 − ϵ) p_{X_2}(S) + ϵ (e · p_{X_2}(S) + δ)
≤ exp(2ϵ) p_{X_2}(S) + ϵδ
A lower bound p_{X_1}(S) ≥ exp(−ϵ) p_{X_2}(S) − ϵδ/e can be obtained similarly. To conclude,
since ϵδ = o(1/n) as the sample size n decreases by a factor of ϵ, A′ has (2ϵ, o(1/n))-differential privacy. The size of X is roughly 1/ϵ times larger than T; combined with the fact that A has sample complexity n^* and T is fed to A, A′ has sample complexity at least Θ(n^*/ϵ).
For the other direction, simply using the composability of differential privacy yields the desired result. In particular, by the k-fold adaptive composition theorem in Dwork et al. (2006), we can combine 1/ϵ independent copies of (ϵ, δ)-differentially private algorithms to get a (1, δ/ϵ) one, and notice that if δ = o(1/n), then δ/ϵ = o(1/n) as well, because the sample size n is scaled by a factor of ϵ at the same time, offsetting the increase in δ.
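The subsampling step of this reduction is easy to express in code. The sketch below (our own illustration) keeps each record independently with probability ϵ and feeds the subsample to A; the Laplace-noised sum is a hypothetical stand-in for the (1, o(1/n))-differentially private algorithm A.

```python
import numpy as np

def subsample_then_run(A, X, eps, rng):
    """The reduction above: keep each record with probability eps, then run A on the subsample."""
    keep = rng.random(len(X)) < eps
    T = [x for x, kept in zip(X, keep) if kept]
    return A(T)

def noisy_sum(T, rng=np.random.default_rng(0)):
    """Hypothetical stand-in for A: a sensitivity-1 sum with Laplace noise (1-DP on its own)."""
    return sum(T) + rng.laplace(scale=1.0)

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=100).tolist()
print(subsample_then_run(noisy_sum, X, eps=0.1, rng=rng))
```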
C.2 PROOF OF LEMMA 3.2
_Proof._ In line 5 of Algorithm 1, every column j is assigned a probability p_j independently, where
Pr[|p_j − 1/2| < 0.002] < 1/400    (12)
by straightforward calculation. By the Chernoff bound (with u < p/400, δ = 1), with probability at least
1 − 2 exp(−p/800)    (13)
at least a 1 − 1/200 fraction of the columns have |p_j − 1/2| ≥ 0.002. Denote by m_j the mean of the entries of column j; then by using the Chernoff bound again (with δ = 0.001), we have that with probability at least
1 − 2 exp(−n/8000000)    (14)
such a column j actually satisfies |m_j − 1/2| ≥ 0.001. Again by the Chernoff bound (with u ≤ 2 exp(−n/8000000) p and uδ = 0.01p), together with the union bound, at least a 0.99 fraction of all columns have |m_j − 1/2| ≥ 0.001 with probability at least
1 − 2 exp(−p/800) − 2 exp(−p e^{n/8000000}/40000) = 1 − O(e^{−Ω(p)})    (15)
C.3 PROOF OF THEOREM 3.4
_Proof._ Let (α_1, α_2, α_3) = (1/100, 999/1000, exp(−Ω(p))) be the parameters in the statement of Lemma 3.3. Let k = Θ(log(1/δ)) be a parameter to be determined later and n_k = ⌊n/k⌋.
Consider first the case when p ≥ p_{n_k}, where p_{n_k} = O(ϵ²n_k² log(1/δ)). Without loss of generality, we assume ϵ = 1 first, and p_{n_k} = O(n_k² log(1/δ)) corresponds to the number in Lemma 3.3 where we set ξ = δ. We will use contradiction to prove that for any (ϵ, δ)-differentially private mechanism M, there exists some D ∈ {0, 1}^{n×p} with G_{α_1−1/k}(D) ≤ 1 − α_2 such that
E[L(M(D); D) − L(θ⋆; D)] ≥ Ω(p)    (16)
Assume for contradiction that M : {0, 1}^{n×p} → [0, 1]^p is a (randomized) (ϵ, δ)-differentially private mechanism such that
E[L(M(D); D) − L(θ⋆; D)] < α_1α_2p/1000
for all D ∈ {0, 1}^{n×p} with G_{α_1−1/k}(D) ≤ (1 − α_2). We then construct a mechanism M_k on {0, 1}^{n_k×p} with respect to M as follows: with input D^k ∈ {0, 1}^{n_k×p}, M_k copies D^k k times and appends enough 0's to get a dataset D ∈ {0, 1}^{n×p}. The output is M_k(D^k) = M(D). M_k is (k, ((e^k − 1)/(e − 1))δ)-differentially private by group privacy. According to the construction above, we know that if G_{α_1}(D^k) < 1 − α_2, then G_{α_1−1/k}(D) < 1 − α_2 as well.
We take the algorithm A_FP to be the adversarial algorithm in the fingerprinting codes, which rounds the output M_k(D^k) to a binary vector, i.e. rounding those coordinates with values no less than 1/2 to 1 and the remaining ones to 0, and let c = A_FP(M(D)) be the vector after rounding. As M_k is (k, ((e^k − 1)/(e − 1))δ)-differentially private, A_FP is also (k, ((e^k − 1)/(e − 1))δ)-differentially private.
Suppose that for some D^k ∈ {0, 1}^{n_k×p} with G_{α_1}(D^k) ≤ 1 − α_2, the dataset D (constructed from D^k as above) further satisfies
E[L(M(D); D) − L(θ⋆; D)] < α_1α_2p/1000.
As we are considering the ℓ1 loss, we can consider the loss caused by each coordinate independently. Recall that M_k(D^k) = M(D). The fraction of nearly unbiased columns (those whose mean is close to 1/2) is at most 1 − α_2, and we charge the worst-case error for them. For the other 'α_1-biased' columns (coordinates), which make up at least an α_2 fraction of all columns, if M(D) predicts correctly, then