# TIGHT LOWER BOUNDS FOR DIFFERENTIALLY PRIVATE ERM
**Anonymous authors**
Paper under double-blind review
ABSTRACT
We consider lower bounds for differentially private ERM with general convex loss functions. For approximate-DP, the well-known upper bound for DP-ERM is O(√(p log(1/δ))/(ϵn)), which is believed to be tight. However, current lower bounds are off by logarithmic terms, namely Ω(√p/(ϵn)) for the constrained case and Ω(√p/(ϵn log p)) for the unconstrained case. We achieve tight Ω(√(p log(1/δ))/(ϵn)) lower bounds for both cases by introducing a novel biased mean property for fingerprinting codes.
As for pure-DP, we utilize a novel ℓ2 loss function instead of the linear functions considered by previous papers, and achieve the first (tight) Ω(p/(ϵn)) lower bound. We also introduce an auxiliary dimension to simplify the computation brought by the ℓ2 loss.
Our results close a gap in our understanding of DP-ERM by presenting its fundamental limits. Our techniques may be of independent interest: they enrich the toolkit for private lower bounds so that it readily applies to problems that are not (easily) reducible from one-way marginals.
1 INTRODUCTION
Since the seminal work of Dwork et al. (2006), differential privacy (DP) has become the standard
and rigorous notion of privacy guarantee for machine learning algorithms, among which many fundamental ones are based on empirical risk minimization (ERM). Motivated by this, private ERM has become one of the most well-studied problems in the DP literature, e.g. Chaudhuri and Monteleoni
(2008); Rubinstein et al. (2009); Chaudhuri et al. (2011); Kifer et al. (2012); Song et al. (2013); Jain
and Thakurta (2014); Bassily et al. (2014); Talwar et al. (2015); Kasiviswanathan and Jin (2016);
Fukuchi et al. (2017); Wu et al. (2017); Zhang et al. (2017); Wang et al. (2017); Iyengar et al.
(2019); Bassily et al. (2020); Kulkarni et al. (2021); Asi et al. (2021); Bassily et al. (2021); Wang
et al. (2021).
Roughly speaking, in the ERM setting we are given a family of convex functions defined on a convex set C ⊆ R^p and a sample set D = {d_1, · · ·, d_n} drawn i.i.d. from some unknown distribution P, with the objective to minimize the loss function
L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i),
and the value L(θ; D) − min_{θ'∈C} L(θ'; D) is called the excess empirical loss with respect to the solution θ, measuring how it compares with the best solution in C.
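As a purely illustrative sketch (not from the paper), the snippet below evaluates L(θ; D) and the excess empirical loss for the ℓ1 loss ℓ(θ; d) = ∥θ − d∥1 that is used later in Section 3; the toy dataset and the use of the coordinate-wise median as the exact minimizer of this particular loss are assumptions made only for this example.

```python
import numpy as np

def empirical_loss(theta, D):
    """L(theta; D) = (1/n) * sum_i ||theta - d_i||_1 (the l1 loss of Section 3)."""
    return np.mean(np.abs(D - theta).sum(axis=1))

# Hypothetical toy dataset: n = 5 points in {0, 1}^p with p = 3.
rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(5, 3)).astype(float)

theta = np.full(3, 0.5)              # some candidate solution
theta_star = np.median(D, axis=0)    # a coordinate-wise median minimizes this l1 objective

excess = empirical_loss(theta, D) - empirical_loss(theta_star, D)
print(f"excess empirical loss of theta: {excess:.3f}")
```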
Private ERM in the constrained case was studied first, and most of the previous literature belongs to this case. More specifically, the constrained case considers convex loss functions defined on a bounded convex set C ⊊ R^p. Assuming the functions are 1-Lipschitz over a convex set of diameter 1, an Ω(√p/(ϵn)) lower bound for private ERM is given by Bassily et al. (2014), even for the (special and simpler) generalized linear model (GLM).
However, there are still several aspects that existing works do not cover. First, existing lower bounds fall short of the best known upper bounds by at least a logarithmic term in log(1/δ). For example, Wang et al. (2017); Bassily et al. (2019) give upper bounds like O(L‖θ_0 − θ*‖_2 √(p log(1/δ))/(ϵn)), which are believed to be tight, while Bassily et al. (2014) present a lower bound of Ω(√p/(nϵ)) by reducing linear losses to the one-way marginal results in Hardt and Talwar (2010); Bun et al. (2018). Steinke and Ullman (2015) achieve the tight lower bound for answering one-way marginals with respect to the ℓ1 norm, but it does not directly imply tight lower bounds for general loss functions via the methods of Bassily et al. (2014).
Another aspect is DP-ERM in the unconstrained case, which was neglected before and has gathered attention recently; see Jain and Thakurta (2014); Song et al. (2021). The unconstrained case is interesting in that we can no longer use the linear loss, which lies at the heart of the construction in the constrained case. Moreover, previous algorithms in the constrained case suffer from the curse of dimensionality when p is large. For example, when p = Ω(n²), the lower bound is Ω(1) and no private algorithm can obtain meaningful bounds on the excess empirical loss. Song et al. (2021) prove a dimension-independent O(√rank/(ϵn)) upper bound in the unconstrained setting for the special case of GLMs, where rank denotes the rank of the feature matrix of the GLM. However, it is unknown whether similar results can be achieved for general loss functions. As for the lower bound, Asi et al. (2021) give an Ω(√p/(nϵ log p)) lower bound by considering the ℓ1 loss as the objective function and again reducing from one-way marginals.
1.1 OUR CONTRIBUTIONS
In this paper, we close both gaps by proving a tight Ω(min(1, √(p log(1/δ))/(nϵ))) lower bound for the excess risk of unconstrained 1-Lipschitz convex loss functions under approximate differential privacy. This bound automatically applies in the constrained case, improving previous results and achieving a tight lower bound for both the constrained and unconstrained cases. We summarize our main results as follows:
- We prove a tight Ω(min(1, √(p log(1/δ))/(nϵ))) lower bound for the excess risk of unconstrained 1-Lipschitz convex loss functions for any approximate differentially private algorithm. This bound improves Asi et al. (2021) by a log(p)√log(1/δ) factor in the unconstrained case and matches the upper bound in Kairouz et al. (2020).
- We also prove an Ω(min(1, p/(nϵ))) lower bound for the excess risk of unconstrained 1-Lipschitz convex loss functions for any pure differentially private algorithm.
Note that our main results for the unconstrained case extend to the constrained case directly; thus our lower bound for approximate-DP algorithms is a multiplicative √log(1/δ) factor better than the well-known bound in Bassily et al. (2014), with the help of the group privacy technique in Steinke and Ullman (2015).
A key contribution of this paper is a set of novel tools for private lower bound techniques. For most problems, accuracy lower bounds in the private setting are established via reduction from one-way marginals, so the available tools for lower bounds are quite limited. We contribute to the refinement of such tools: in particular, we propose modifications such as the additional "biased means" property for fingerprinting codes, which is the key lower bound technique. Such modifications enrich the toolkit so that it readily applies to problems which are not (easily) reducible from one-way marginals.
1.2 OUR TECHNIQUES
In general, the direct technical challenge of the unconstrained case lies in the choice of loss function
and the difficulties caused by the new loss function. The loss function is required to be convex and
Lipschitz-continuous at the same time. In the constrained case, the linear loss function is obviously
a good choice for constructing lower bounds, because it is easy to analyze and easily reducible from
one-way marginals. However, in the unconstrained case any non-trivial linear loss function can take the value −∞ and is thus no longer applicable. We further observe that convexity plus Lipschitz continuity means 'asymptotically linear': the sub-gradient along any direction must converge. This observation guides us in choosing new loss functions for unconstrained DP-ERM (which can be
extended to constrained case directly). We briefly introduce the new problems caused by the new
loss functions and our method to overcome them for approximate-DP and pure-DP separately.
1.2.1 APPROXIMATE-DP
The construction of our lower bound for approximate-DP is based on fingerprinting codes, which were first studied by Wagner (1983) and developed by Boneh and Shaw (1998); Tardos (2008). From a technical perspective, we replace the previously used linear loss with an ℓ1 norm function ℓ(θ; d) = ∥θ − d∥₁. The ℓ1 loss has been used in a concurrent work, Asi et al. (2021), which proves an Ω(√p/(ϵn log p)) lower bound for approximate-DP in the constrained case by reducing from one-way marginals; their argument extends to the unconstrained case directly. We improve this bound by logarithmic terms and achieve optimality by utilizing the group privacy technique from Steinke and Ullman (2015). We observe a novel biased mean property in the fingerprinting code that lets us combine and adjust these techniques to fit the ℓ1 loss.
We first briefly describe the proof based on the group privacy technique. For group privacy, we take a hard instance data-set D^k of size n_k := ⌊n/k⌋ from the construction of fingerprinting codes, copy it k times, and append n − kn_k data points to get a final data-set D of size n. Fix any (ϵ, δ)-differentially private algorithm A for D. If we remove one element i* from D^k, we obtain D^k_{−i*} and the corresponding D_{−i*}, where D_{−i*} and D differ in at most k elements. Running A on D and D_{−i*} respectively thus gives a (kϵ, δ')-differentially private algorithm for D^k and D^k_{−i*}. Setting k appropriately, if A achieves small error on the DP-ERM, it can serve as an adversary that contradicts the properties of the fingerprinting codes. Intuitively, differential privacy means it is hard to find the removed element i*, but fingerprinting codes say the removed element is traceable as long as D^k satisfies the required properties and A achieves small excess empirical loss with respect to L(θ; D^k).
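To make the assembly of the data-set in this group-privacy argument concrete, here is a minimal sketch (our own illustration, with hypothetical sizes n = 10, n_k = 3 and dummy value 1/2): copying D^k k times means that removing one row of D^k changes at most k rows of the final D.

```python
import numpy as np

def build_dataset(Dk, n, dummy=0.5):
    """Copy Dk (shape (n_k, p)) k = n // n_k times, then pad with dummy rows up to size n."""
    n_k, p = Dk.shape
    k = n // n_k
    return np.vstack([np.tile(Dk, (k, 1)), np.full((n - k * n_k, p), dummy)])

rng = np.random.default_rng(0)
Dk = rng.integers(0, 2, size=(3, 4)).astype(float)   # n_k = 3 rows, p = 4
D = build_dataset(Dk, n=10)                          # k = 3 copies plus one dummy row

# Removing row i* = 0 from Dk only touches its k copies inside D (rows 0, 3 and 6 here),
# so D and D_{-i*} differ in at most k = 3 elements, as group privacy requires.
D_minus = D.copy()
D_minus[[0, 3, 6]] = 0.5
print(D.shape, int((D != D_minus).any(axis=1).sum()), "rows changed")
```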
The direct use of the biased mean property is in appending the n − kn_k points. As nearly all previous lower bounds in DP convex optimization are based on results for one-way marginals, we try to present the proof in the language of one-way marginals. For linear functions, large one-way marginal errors lead directly to large excess empirical loss, which means lower bounds for private one-way marginals apply to DP-ERM. But without additional assumptions, large one-way marginal errors need not imply large excess empirical loss for the ℓ1 loss functions. Consider the toy case when p = 1 and D^k = {d_i}_{i=1}^{⌊n/k⌋} where d_i ∈ {0, 1}. Denote the mean of these ⌊n/k⌋ points by D̄^k, and similarly let D̄ be the mean of the data-set D constructed from D^k by the method above. For example, if D̄^k = 1/2 and we only append points with value 1/2, then D̄ = 1/2 and, whatever the one-way marginals are, the excess empirical loss (of the n ℓ1 loss functions) can be 0, as L(θ; D) = (1/n) ∑_{i=1}^n ∥θ − d_i∥₁ is a constant function over [0, 1]. So we need the mean D̄^k to be biased: more specifically, we need |D̄^k − 1/2| to be larger than some value depending on k, and then we can append the n − kn_k dummy points safely. For general p, we need the biased mean property to hold for a large fraction of coordinates. For any single dimension, the biased mean property serves to ensure that the prediction of the column (dimension) is unchanged by the group privacy mechanism, in which a number of dummy points are appended that could otherwise change the prediction of an unbiased column. This novel property sets more stringent conditions. Fortunately, we observe that the previous construction of fingerprinting codes in Bun et al. (2018) satisfies it.
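The p = 1 toy case can be checked numerically. In the sketch below (our own illustration, not taken from the paper), a perfectly balanced column makes L(·; D^k) constant on [0, 1], so even an output with a huge one-way-marginal error pays no excess empirical loss, whereas for a biased column only estimates near the majority value are cheap.

```python
import numpy as np

def l1_empirical_loss(theta, column):
    """L(theta; D) = (1/n) * sum_i |theta - d_i| for a single coordinate."""
    return np.mean(np.abs(column - theta))

thetas = np.linspace(0.0, 1.0, 5)

# Balanced column (mean exactly 1/2): the loss is the same for every theta in [0, 1].
balanced = np.array([0.0, 1.0, 0.0, 1.0])
print([round(l1_empirical_loss(t, balanced), 3) for t in thetas])   # all equal

# Biased column (mean 3/4): the loss strictly decreases towards the majority value 1,
# so a bad estimate of this coordinate is forced to pay in the excess loss.
biased = np.array([1.0, 1.0, 1.0, 0.0])
print([round(l1_empirical_loss(t, biased), 3) for t in thetas])
```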
1.2.2 PURE-DP
The square loss seems tempting: it intuitively reduces the unconstrained case to the constrained one, because the loss value grows quickly outside a bounded region, and it also makes the computation simple. Unfortunately, it is non-Lipschitz in the unconstrained case and thus not directly applicable.
We use the novel ℓ2-norm loss as a natural substitute, which is both convex and Lipschitz-continuous. Unlike the constrained case, the ℓ2-norm loss brings the drawback that the minimizer of the ERM problem no longer has a closed-form solution or any nice property for computation. Roughly speaking, the analysis of Bassily et al. (2014) has an 'adding dummy points' procedure which perturbs the minimizer. The linear loss has the nice property that after adding these dummy points the minimizer can only move along its own direction, while for the ℓ2 loss the minimizer might be intractable. To overcome this problem, we define the Fermat point and introduce an auxiliary dimension to simplify the messy calculation brought by the ℓ2-norm loss in our analysis. The dummy points have support only in this auxiliary dimension and guarantee that the perturbation of the minimizer is still along its own direction, reducing the computation in any high dimension to a two-dimensional subspace spanned by the minimizer and the auxiliary dimension.
1.3 CONSTRAINED AND UNCONSTRAINED
In this subsection, we briefly discuss the relationships and differences between constrained case and
unconstrained case, and compare our bounds with previous bounds.
Previous studies on DP-ERM mostly focus on the constrained setting, and the unconstrained case has recently attracted interest because Jain and Thakurta (2014); Song et al. (2021) found that an O(√rank/(ϵn)) upper bound can be achieved for minimizing the excess risk of GLMs, which evades the curse of dimensionality.
It is known that the unconstrained condition is necessary for dimension independence, as pointed out by Bassily et al. (2014), who prove an Ω(√p/(nϵ)) lower bound even for minimizing constrained GLMs in the case when rank ≤ n ≪ p.
We are interested in the necessity of the unconstrained condition for getting a rank-dependent bound. The unconstrained GLM can be viewed as a rank-dimensional problem, as noise added in the null space of the feature matrix will not affect the excess empirical loss. However, this does not hold in the constrained case. Take the dimension-independent algorithm in Song et al. (2021), which is based on SGD, as an example. The pitfall for the dimension-independent algorithm lies in the projection step if SGD is modified to projected SGD for the constrained case: running SGD in the constrained setting requires a projection, which might "increase the rank". We can see there is some fundamental difference between the constrained and unconstrained cases, and analyzing the unconstrained case is also an interesting and important direction.
Classic methods like Bassily et al. (2014) usually connect the linear loss to one-way marginals (Bun et al. (2018)), and then use lower bounds for one-way marginals to imply lower bounds for the linear loss. As Bassily et al. (2014) use the one-way marginal results from Hardt and Talwar (2010); Bun et al. (2018), and Steinke and Ullman (2015) achieve a tight Ω(√(p log(1/δ))/(αϵ)) bound by using the novel group privacy technique, one may ask whether combining Steinke and Ullman (2015) and Bassily et al. (2014) trivially achieves the tight lower bound in the constrained case. Because Steinke and Ullman (2015) consider the ℓ1 distance but Bassily et al. (2014) consider the ℓ2 distance, there is a √p gap between them and one cannot directly combine them. Though with some effort, one may get tighter bounds in the constrained case by modifying the results in Steinke and Ullman (2015) from the ℓ1 norm to the ℓ2 norm, then applying the analysis in Bassily et al. (2014).
As shown in Asi et al. (2021), proving a nearly tight lower bound in the unconstrained case is direct by utilizing one-way marginals and choosing the right objective functions, but getting rid of the extra logarithmic terms in the unconstrained case is nontrivial, as one-way marginals do not work directly with group privacy. To the best of our knowledge, our result is the first to achieve this improved tight lower bound for a general loss function class in both cases. See Table 1 for detailed comparisons between previous bounds and ours.
1.4 RELATED WORK
The existing lower bounds on the excess empirical loss, i.e. the constrained case in Bassily et al. (2014) and the unconstrained case in Song et al. (2021), all use GLM functions. The objective function used in Bassily et al. (2014) is ℓ(θ; d) = ⟨θ, d⟩, which cannot be applied in the unconstrained case, otherwise the loss value would be infinite. Considering this limitation, Song et al. (2021) adopt ℓ(θ; d) = |⟨θ, x⟩ − y|. They transfer the problem of minimizing a GLM to estimating one-way marginals, and then get the lower bound from the properties in the definition of fingerprinting codes. As mentioned before, our lower bounds are based on ℓ1 norms, so we cannot transfer to one-way marginals directly. Merely using the properties in the definition of fingerprinting codes is not enough for a good lower bound; instead, we need to make full use of the concrete structure of the codes.
|Article|Constrained?|Loss Function|Pure DP|Approximate DP|
|---|---|---|---|---|
|Bassily et al. (2014)|constrained|GLM|Ω(p/(nϵ))|Ω(√p/(nϵ))|
|Song et al. (2021)|unconstrained|GLM|N/A|Ω(√rank/(nϵ))|
|Asi et al. (2021)|both|general|N/A|Ω(√p/(nϵ log p))|
|Ours|both|general|Ω(p/(nϵ))|Ω(√(p log(1/δ))/(nϵ))|
Table 1: Comparison of lower bounds for private convex ERM. Our lower bounds can be extended to the constrained case easily. The lower bound of Song et al. (2021) is weaker than ours in the important p ≫ n setting.
As for the upper bounds, private ERM (Wang et al. (2017)) and private Stochastic Convex Optimization (SCO) (Feldman et al. (2020)) for convex and smooth functions are extensively studied, where the objective in SCO is to minimize the population loss E_{d∼P}[ℓ(θ; d)], and only (nearly) linearly many gradient queries are needed to get optimal excess loss. But for convex functions without any smoothness assumption, the current best algorithms (Kulkarni et al. (2021); Asi et al. (2021)) need more queries (n^{1.375} in the worst case). Besides, most of the previous works consider problems in the ℓ2 norm, and there are some recent results (Bassily et al. (2021); Asi et al. (2021)) studying general ℓp norms.
1.5 ROADMAP
In Section 2 we introduce the background knowledge needed in the rest of the paper. In Section 3 we prove the main result of this paper, an Ω(min(1, √(p log(1/δ))/(nϵ))) lower bound for approximate-DP ERM in the unconstrained case. In Section 4 we discuss an Ω(min(1, p/(nϵ))) lower bound for the excess risk of pure DP algorithms for minimizing any unconstrained 1-Lipschitz convex loss function.
Section 5 concludes this paper. All missing (technical) proofs can be found in the appendix.
2 PRELIMINARY
We consider minimizing the excess risk of unconstrained Lipschitz convex functions with DP algorithms in this paper, where we let n denote the sample size and p the dimension of a sample. In this section, we introduce the main background knowledge required in the rest of the paper. Additional background, such as the definition of GLM, can be found in the appendix.
**Definition 2.1 (Differential privacy).** A randomized mechanism M is (ϵ, δ)-differentially private if for any event O ∈ Range(M) and for any neighboring databases D and D′ that differ in a single data element, one has
Pr[M(D) ∈ O] ≤ exp(ϵ) Pr[M(D′) ∈ O] + δ.
When δ > 0, we refer to the above condition as approximate differential privacy. The special case
when δ = 0 is called pure differential privacy.
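As a minimal numeric illustration of this definition (not part of the paper), the sketch below verifies the privacy inequality for one-bit randomized response, a textbook pure-DP mechanism; the flip probability is an arbitrary choice for this example.

```python
import numpy as np

def randomized_response_dist(bit, flip_prob):
    """Return (Pr[output = 0], Pr[output = 1]) for randomized response on one bit."""
    keep = 1.0 - flip_prob
    return (keep, flip_prob) if bit == 0 else (flip_prob, keep)

flip_prob = 0.25
eps = np.log((1 - flip_prob) / flip_prob)   # randomized response is eps-DP with this eps

p0 = randomized_response_dist(0, flip_prob)  # neighboring databases: the bit is 0 vs 1
p1 = randomized_response_dist(1, flip_prob)

# Check Pr[M(D) = o] <= exp(eps) * Pr[M(D') = o] for every output o (here delta = 0).
for o in (0, 1):
    assert p0[o] <= np.exp(eps) * p1[o] + 1e-12
    assert p1[o] <= np.exp(eps) * p0[o] + 1e-12
print(f"randomized response with flip_prob={flip_prob} satisfies ({eps:.3f}, 0)-DP")
```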
**Definition 2.2 (Empirical Risk Minimization).** Given a family of convex loss functions {ℓ(θ, d)}_{d∈D} of θ over K ⊆ R^p and a set of samples D = {d_1, · · ·, d_n} over the universe D, the objective of Empirical Risk Minimization (ERM) is to minimize
L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i).
The excess empirical loss with respect to a solution θ is defined by
L(θ; D) − L(θ*; D)
where θ* ∈ arg min_{θ∈K} L(θ; D), measuring the performance of the solution θ compared with the best solution in K.
**Definition 2.3 (G-Lipschitz Continuity).** A function f : R^p → R is G-Lipschitz continuous with respect to the ℓ2 norm if the following holds for all θ, θ′ ∈ R^p:
|f(θ) − f(θ′)| ≤ G ∥θ − θ′∥_2    (1)
The Chernoff Bound will serve to prove the fingerprinting code constructed in Bun et al. (2018)
satisfies our modified definition of fingerprinting code as well.
**Proposition 2.4 (The Chernoff Bound).** Let X = ∑_{i=1}^n X_i where X_i = 1 with probability p_i and X_i = 0 with probability 1 − p_i. Assume all X_i are independent random variables. Let u = ∑_{i=1}^n p_i. Then
P(|X − u| ≥ δu) ≤ 2 exp(−uδ²/2).    (2)
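Proposition 2.4 is easy to sanity-check by simulation. The sketch below (our own illustration, with arbitrarily chosen p_i and δ) estimates Pr(|X − u| ≥ δu) empirically and prints it next to the stated bound.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.8, size=200)   # arbitrary success probabilities p_i
u = p.sum()
delta = 0.3

trials = 20000
X = (rng.random((trials, p.size)) < p).sum(axis=1)   # independent draws of X = sum_i X_i
empirical = np.mean(np.abs(X - u) >= delta * u)

bound = 2 * np.exp(-u * delta**2 / 2)   # the bound as stated in Proposition 2.4
print(f"empirical tail probability: {empirical:.5f}   stated bound: {bound:.5f}")
```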
3 APPROXIMATE DP
In this section, we consider the lower bound for approximate differential privacy where 2^{−O(n)} < δ < o(1/n). Such an assumption on δ is common in the literature, for example in Steinke and Ullman (2015). We briefly introduce the (classic) fingerprinting codes first.
3.1 FINGERPRINTING CODES
**Definition 3.1 (Fingerprinting codes).** We are given n, p ∈ N and ξ ∈ (0, 1]. A pair of (random) algorithms (Gen, Trace) is called an (n, p)-fingerprinting code with security ξ ∈ (0, 1] if Gen outputs a code-book C ∈ {0, 1}^{n×p} and, for any (possibly randomized) adversary A_FP and any subset S ⊆ [n], if we set c ←_R A_FP(C_S), then
- Pr[c ∈ F(C_S) ∧ Trace(C, c) = ⊥] ≤ ξ
- Pr[Trace(C, c) ∈ [n] \ S] ≤ ξ
where F(C_S) = {c ∈ {0, 1}^p | ∀j ∈ [p], ∃i ∈ S, c_j = c_ij}, and the probability is taken over the coins of Gen, Trace and A_FP.
There is a very good motivation behind fingerprinting codes. For example, a software distributor adds a fingerprint to each copy of her software to protect the IP. A coalition of malicious users can compare their copies and find the digits that differ, which must belong to the fingerprint. For the other locations they cannot decide and will not change them; this is called the marking condition, and it is the reason we require c ∈ F(C_S).
The two properties of fingerprinting codes demonstrate that one can identify at least one malicious user with high probability. Bun et al. (2018) extend the definition so that the codes can tolerate a small fraction of errors in the marking condition. We further modify this definition, requiring the codes to have biased means; see below.
3.2 OUR RESULT
We modify the definition of fingerprinting code instead for our analysis.
**Definition 3.2 (Error Robust Biased Mean Fingerprinting Codes).** Given n, p ∈ N and ξ, β, α1, α2, α3 ∈ (0, 1], we say a pair of (random) algorithms (Gen, Trace) is an (n, p)-fingerprinting code with security ξ and (α1, α2, α3)-biased mean, robust to a β fraction of errors, if Gen outputs a code-book C ∈ {0, 1}^{n×p} and, for any (possibly randomized) adversary A_FP and any coalition S ⊆ [n], if we set c ←_R A_FP(C_S), then
- Pr[c ∈ F_β(C_S) ∧ Trace(C, c) = ⊥] ≤ ξ
- Pr[Trace(C, c) ∈ [n] \ S] ≤ ξ
- Pr[G_{α1}(C) ≥ (1 − α2)] ≤ α3
where F_β(C_S) = {c ∈ {0, 1}^p | Pr_{j←_R[p]}[∃i ∈ S, c_j = c_ij] ≥ 1 − β}, G_α(C_S) = |{j : |∑_{i∈S} c_ij/|S| − 1/2| ≤ α}| is the number of slightly biased columns in C_S, and the probability is taken over the coins of Gen, Trace and A_FP.
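To make the quantities in Definition 3.2 concrete, the sketch below (our own illustration with a hypothetical code-book) computes the fraction of columns on which a combined codeword respects the marking condition, which governs membership in F_β(C_S), and the count G_α(C_S) of columns whose mean is within α of 1/2.

```python
import numpy as np

def marking_fraction(C_S, c):
    """Fraction of columns j where some colluding user i has C_S[i, j] == c[j]."""
    return np.mean((C_S == c).any(axis=0))

def G_alpha(C_S, alpha):
    """Number of columns of C_S whose mean is within alpha of 1/2 (Definition 3.2)."""
    return int(np.sum(np.abs(C_S.mean(axis=0) - 0.5) <= alpha))

# Hypothetical code-book restricted to a coalition S of 4 users, with p = 8 columns.
rng = np.random.default_rng(0)
C_S = rng.integers(0, 2, size=(4, 8))
c = rng.integers(0, 2, size=8)        # a candidate combined codeword

print("fraction of columns respecting the marking condition:", marking_fraction(C_S, c))
print("G_0.01(C_S) =", G_alpha(C_S, alpha=0.01))
```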
We use the fingerprinting code of Bun et al. (2018) for the construction of our lower bound; see Algorithm 1 in the appendix. We utilize an ℓ1 loss and use the fingerprinting code of Bun et al. (2018) as our 'hard case'. To proceed, we first introduce a few lemmas which will be of use later. Similar to Bun et al. (2018), we have the following standard lemma which allows us to reduce any
_ϵ < 1 to ϵ = 1 case without loss of generality, using the well-known ’secrecy of the sample’ lemma_
from Kasiviswanathan et al. (2011).
**Lemma 3.1.** A condition Q has sample complexity n^* for algorithms with (1, o(1/n))-differential privacy (n^* is the smallest sample size such that there exists a (1, o(1/n))-differentially private algorithm A which satisfies Q), if and only if it also has sample complexity Θ(n^*/ϵ) for algorithms with (ϵ, o(1/n))-differential privacy.
Notice that Lemma 3.1 discusses the sample complexity of the algorithm and is therefore independent of the (α1, α2, α3)-biased mean appearing in the above definition, which only concerns the construction of the fingerprinting code. The following lemma verifies that the fingerprinting code of Algorithm 1 indeed has biased mean as in Definition 3.2. The proof is straightforward, using the Chernoff bound multiple times.
**Lemma 3.2.** Algorithm 1 (the fingerprinting code) has (1/100, 999/1000, exp(−Ω(p)))-biased mean.
Directly combining Lemma 3.2 and Theorem 3.4 from Bun et al. (2018), we have the following lemma, which states that for the fingerprinting code of Algorithm 1 (which we will use in proving our main theorem) to satisfy the error robust biased mean property in Definition 3.2, one needs roughly Ω̃(√p) samples.
**Lemma 3.3.** For every p ∈ N and ξ ∈ (0, 1], there exists an (n, p)-fingerprinting code (Algorithm 1) with security ξ and (1/100, 999/1000, exp(−Ω(p)))-biased mean, robust to a 1/75 fraction of error, for
n = n(p, ξ) = Ω̃(√(p/log(1/ξ))).
We are ready to prove the main result of this section by using Lemma 3.3 to reach a contradiction.
Consider the following ℓ1 norm loss function. Define
ℓ(θ; d) = ||θ − d||_1,  θ, d ∈ R^p    (3)
For any data-set D = {d_1, ..., d_n}, we define L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i).
**Theorem 3.4 (Lower bound for (ϵ, δ)-differentially private algorithms).** Let n, p be large enough and 1 ≥ ϵ > 0, 2^{−O(n)} < δ < o(1/n). For every (ϵ, δ)-differentially private algorithm with output θ^priv ∈ R^p, there is a data-set D = {d_1, ..., d_n} ⊂ {0, 1}^p ∪ {1/2}^p such that
E[L(θ^priv; D) − L(θ⋆; D)] = Ω(min(1, √(p log(1/δ))/(nϵ)) · GC)    (4)
where ℓ is G-Lipschitz, θ⋆ is a minimizer of L(θ; D), and C is the diameter of the set {arg min_θ L(θ; D) | D ⊂ {0, 1}^{n×p}}, which contains all possible true minimizers.
Due to the space limit, we leave the proof of Theorem 3.4 to the appendix.
The dependence on the diameter C makes sense, as one could instead minimize a substitute loss function ℓ′(x) = ℓ(ax) where a ∈ (0, 1) is a constant, which decreases the Lipschitz constant G but increases the diameter C. Note also that C > 0 whenever the possible data-sets D do not all share the same minimizer of L, which is usually the case. This bound improves over Bassily et al. (2014) by a logarithmic factor, by combining the group privacy technique in Steinke and Ullman (2015) and our modified definition of fingerprinting codes.
We leave several remarks discussing slight generalizations of Theorem 3.4.
**Remark 3.5.** Our lower bound can be directly extended to the constrained setting, by setting the constrained domain to be [0, 1]^p, which contains the convex hull of all possible minimizers {arg min_θ L(θ; D) | D ⊂ {0, 1}^{n×p}}.
**Remark 3.6.** Similarly, we can derive an Ω(min(1, √(rank log(1/δ))/(nϵ))) lower bound when we additionally assume the rank of the gradient subspace. The analysis remains the same, except that we first apply an orthogonal transformation and then set the complement of the gradient subspace to be all 0's in D.
**Remark 3.7.** The third property of Definition 3.2 serves the group privacy analysis, further improving a log(1/δ) term over Bassily et al. (2014). One can simplify the proof by setting k = 1 and borrowing the lower bound for one-way marginals from Bun et al. (2018), at the cost of losing this log(1/δ) term. See the appendix for details.
4 PURE DP
In this section, we give a lower bound for ϵ-(pure) differentially private algorithms minimizing an unconstrained convex Lipschitz loss function L(θ; D). In the construction of lower bounds for constrained DP-ERM, Bassily et al. (2014) chose the linear function ℓ(θ; d) = ⟨θ, d⟩ as their objective, which is not applicable in the unconstrained setting because it could decrease to negative infinity. Instead, we use a novel ℓ2 norm loss function to overcome this problem:
ℓ(θ; d) = ||θ − d||_2,  θ, d ∈ R^p    (5)
For any dataset D = {d_1, ..., d_n}, we define L(θ; D) = (1/n) ∑_{i=1}^n ℓ(θ; d_i). Clearly, both ℓ and L are convex and 1-Lipschitz. The structure of the proof is similar to that in Bassily et al. (2014), while the technical details are quite different, as we need to handle a non-linear objective function. Different from the simple average of points in Bassily et al. (2014), we need to consider the Fermat point instead, which is a minimizer of the ℓ2 norm loss function.
4.1 FERMAT POINT
**Definition 4.1 (Fermat point).** The set of Fermat points P(D) of a dataset D = {d_1, ..., d_n} contains the points minimizing the total ℓ2 distance to all points in D:
P(D) = arg min_{x ∈ R^p} { ∑_{i=1}^n ||x − d_i||_2 }    (6)
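For intuition only (this is not used in the paper's proofs), a Fermat point of a finite dataset can be approximated numerically with Weiszfeld-style iterations for the geometric median. The sketch below is a minimal version under simplifying assumptions: a hypothetical dataset, a fixed iteration count, and no special handling of iterates that coincide with data points beyond a small epsilon.

```python
import numpy as np

def fermat_point(D, iters=200, eps=1e-9):
    """Approximate arg min_x sum_i ||x - d_i||_2 by Weiszfeld iterations."""
    x = D.mean(axis=0)                      # start from the centroid
    for _ in range(iters):
        dists = np.maximum(np.linalg.norm(D - x, axis=1), eps)
        w = 1.0 / dists                     # inverse-distance weights
        x = (w[:, None] * D).sum(axis=0) / w.sum()
    return x

D = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]])
x = fermat_point(D)
print("approximate Fermat point:", np.round(x, 4))
print("objective value:", round(float(np.linalg.norm(D - x, axis=1).sum()), 4))
```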
One obstacle to using the ℓ2 norm as our loss is that Fermat points are not unique in the worst case. Given a (finite) dataset D, we can easily see that P(D) is a compact subset of the convex hull of D, which encourages us to define a unique "maximum" element in P(D). To do so, we introduce the following well-order on R^p.
**Definition 4.2 (Coordinate dictionary order).** A point x is said to be larger than y in coordinate dictionary order if and only if there exists an index i ∈ [p] such that x_i > y_i, and for any j < i we have x_j = y_j.
It is straightforward to verify that CDO (coordinate dictionary order) is a well-order. Next we use CDO to select a unique member from the set P(D) of all Fermat points.
**Definition 4.3 (Ordered Fermat point).** The Ordered Fermat point q(D) of a dataset D = {d_1, ..., d_n} is defined as
q(D) = arg max_{x ∈ P(D)} x  (with respect to CDO).    (7)
Such a q(D) must exist for a finite dataset as long as P(D) is compact and non-empty, because there cannot be an ordered infinite sequence with its limit outside of P(D), which would contradict compactness. The technical proof of the following proposition is deferred to the appendix.
**Proposition 4.1.** q(D) always exists for a finite dataset D.
Note that q(D) is unique by definition and is always a minimizer of L(θ; D) over R^p. In the following subsection we show that no pure DP algorithm can estimate q(D) with good accuracy, and then prove that a large error in estimating q(D) leads to a large excess risk for the ℓ2 norm loss as well, establishing the main lower bound of this section.
4.2 LOWER BOUND
In this subsection, we prove a lower bound on the excess risk incurred by any ϵ-differentially private algorithm whose output is denoted by θ^priv ∈ R^p. We first introduce the following lemma showing that it is impossible to find the location of the ordered Fermat point q(D) with good accuracy using a pure DP algorithm.
The proof follows the spirit of Bassily et al. (2014), constructing datasets 'far away' from each other such that the events of estimating the Fermat point of each dataset accurately are mutually disjoint. Then, by differential privacy, as long as one can estimate one dataset accurately, one can estimate any other one with a certain probability as well. The sum of all these probabilities is no more than 1 due to the disjointness, which leads to the desired bound.
We denote e_1 ≜ (1, 0, ..., 0)^⊤ and let ⊕ denote the direct sum of vectors, i.e. α ⊕ β = (α, β) where α ∈ R^a and β ∈ R^b are both vectors. For a vector α and a set S, we denote α ⊕ S = {(α, β) : β ∈ S}.
**Lemma 4.4.** Let n, p ≥ 2 and ϵ > 0. There is a number M = Ω(min(n, p/ϵ)) such that for any ϵ-differentially private algorithm A, there is a dataset D = {d_1, ..., d_n} ⊂ (0 ⊕ {−1/√(p−1), 1/√(p−1)}^{p−1}) ∪ {e_1, −e_1, 0} with ||∑_{i=1}^n d_i||_2 ≤ M such that, with probability at least 1/2 (taken over the algorithm's random coins), we have
||A(D) − q(D)||_2 = Ω(min(1, p/(nϵ)))    (8)
The classic analysis of Bassily et al. (2014) contains an 'adding dummy points' step which perturbs the location of the minimizer. In the constrained case, such a perturbation does not change the direction of the minimizer (seen as a vector), but in the unconstrained case non-linear loss functions no longer enjoy such good properties. To overcome this issue, we introduce the auxiliary dimension, and the dummy points we add have support only in this dimension. The benefit of doing so is that the Fermat point q(D) then also changes only along its own direction after we add dummy points, which simplifies the computation.
Lemma 4.4 implies that it is impossible to estimate the ordered Fermat point with good accuracy using a pure DP algorithm. In the following theorem we show that a bad estimate of the ordered Fermat point leads to a higher ℓ2 norm loss. As the Fermat point is a minimizer of the ℓ2 norm loss, we can naturally translate the discrepancy in estimating q(D) into excess risk.
**Theorem 4.5 (Lower bound for ϵ-differentially private algorithms).** Let n, p ≥ 2 and ϵ > 0. For every ϵ-differentially private algorithm with output θ^priv ∈ R^p, there is a dataset D = {d_1, ..., d_n} ⊂ (0 ⊕ {−1/√(p−1), 1/√(p−1)}^{p−1}) ∪ {e_1, −e_1, 0} such that, with probability at least 1/2 (over the algorithm's random coins), we must have
L(θ^priv; D) − min_θ L(θ; D) = Ω(min(1, p/(nϵ)))    (9)
The proof is based on a calculation in the two-dimensional subspace spanned by q(D) and the auxiliary dimension. By observing that q(D) is perpendicular to the auxiliary dimension, we can parameterize θ^priv by these two unit vectors and write down the expression of L(θ^priv; D) − min_θ L(θ; D) explicitly. Then by elementary inequality scaling we get the desired result.
**Remark 4.6. In fact the lower bound in Theorem 4.5 also holds for the case p = 1. The only**
_difference is that the case p = 1 doesn’t need the auxiliary dimension because the perturbation_
_of the minimizer is always along its direction. We can simply use dummy points {1, −1, 0} and a_
_similar analysis to Bassily et al. (2014) to achieve this result._
5 CONCLUSION
In this paper, we study differentially private convex ERM in the unconstrained case and give the first tight lower bounds for approximate-DP ERM for general loss functions. Our results also directly imply the same lower bound for the constrained case, improving the classic lower bound in Bassily et al. (2014) by a log(1/δ) factor. We also give an Ω(p/(nϵ)) lower bound for unconstrained pure-DP ERM, which recovers the result in the constrained case. Our techniques enrich the quite limited tools in
constructing lower bounds in the private setting and we hope they can find future use, especially
for those problems which are not (easily) reducible from one-way marginals. Designing better
algorithms for general (un)constrained DP-ERM based on our insights would also be an interesting
and meaningful direction, which we leave as future work.
REFERENCES
Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization:
Optimal rates in ℓ1 geometry. arXiv preprint arXiv:2103.01516, 2021.
Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient
algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of
_Computer Science, pages 464–473. IEEE, 2014._
Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic
convex optimization with optimal rates. In Advances in Neural Information Processing Systems,
pages 11282–11291, 2019.
Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient
descent on nonsmooth convex losses. arXiv preprint arXiv:2006.06914, 2020.
Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. arXiv preprint arXiv:2103.01278, 2021.
Dan Boneh and James Shaw. Collusion-secure fingerprinting for digital data. IEEE Transactions on
_Information Theory, 44(5):1897–1905, 1998._
Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate
differential privacy. SIAM Journal on Computing, 47(5):1888–1938, 2018.
Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In NIPS, volume 8, pages 289–296. Citeseer, 2008.
Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical
risk minimization. Journal of Machine Learning Research, 12(3), 2011.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity
in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations
_and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014._
Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal
rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of
_Computing, pages 439–449, 2020._
Kazuto Fukuchi, Quang Khai Tran, and Jun Sakuma. Differentially private empirical risk minimization with input perturbation. In International Conference on Discovery Science, pages 82–90.
Springer, 2017.
Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the
_forty-second ACM symposium on Theory of computing, pages 705–714, 2010._
Roger Iyengar, Joseph P Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security
_and Privacy (SP), pages 299–316. IEEE, 2019._
Prateek Jain and Abhradeep Guha Thakurta. (near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pages 476–484. PMLR,
2014.
Peter Kairouz, Mónica Ribero, Keith Rush, and Abhradeep Thakurta. Dimension independence in
unconstrained private erm via adaptive preconditioning. arXiv preprint arXiv:2008.06570, 2020.
Shiva Prasad Kasiviswanathan and Hongxia Jin. Efficient private empirical risk minimization for
high-dimensional learning. In International Conference on Machine Learning, pages 488–497.
PMLR, 2016.
Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam
Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization
and high-dimensional regression. In Conference on Learning Theory, pages 25–1. JMLR Workshop and Conference Proceedings, 2012.
Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization
and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
Benjamin IP Rubinstein, Peter L Bartlett, Ling Huang, and Nina Taft. Learning in a large function
space: Privacy-preserving mechanisms for svm learning. arXiv preprint arXiv:0911.5708, 2009.
Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing,
pages 245–248. IEEE, 2013.
Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private glms. In International Conference on Artificial Intelligence and
_Statistics, pages 2638–2646. PMLR, 2021._
Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. arXiv
_preprint arXiv:1501.06095, 2015._
Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private lasso. In Proceedings
_of the 28th International Conference on Neural Information Processing Systems-Volume 2, pages_
3025–3033, 2015.
Gábor Tardos. Optimal probabilistic fingerprint codes. Journal of the ACM (JACM), 55(2):1–24,
2008.
Neal R Wagner. Fingerprinting. In 1983 IEEE Symposium on Security and Privacy, pages 18–18.
IEEE, 1983.
Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited:
Faster and more general. In Advances in Neural Information Processing Systems, pages 2722–
2731, 2017.
Puyu Wang, Yunwen Lei, Yiming Ying, and Hai Zhang. Differentially private sgd with non-smooth
loss. arXiv preprint arXiv:2101.08925, 2021.
Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on
differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the
_2017 ACM International Conference on Management of Data, pages 1307–1322, 2017._
Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private erm for smooth objectives. arXiv preprint arXiv:1703.09947, 2017.
A ADDITIONAL BACKGROUND KNOWLEDGE
A.1 GENERALIZED LINEAR MODEL (GLM)
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that
allows for response variables that have error distribution models other than a normal distribution. To
be specific,
**Definition A.1 (Generalized linear model (GLM)). The generalized linear model (GLM) is a special**
_class of ERM problems where the loss function ℓ(θ, d) takes the following inner-product form:_
_ℓ(θ; d) = ℓ(⟨θ, x⟩; y)_ (10)
_for d = (x, y). Here, x ∈_ R[p] _is usually called the feature vector and y ∈_ R is called the response.
A.2 PROPERTIES OF DIFFERENTIAL PRIVACY
In this subsection we introduce several very basic properties of differential privacy without proving
them (refer Dwork et al. (2014) for details). Readers familiar with the field of differential privacy
can feel free to skip this section.
**Proposition A.1 (Group privacy).** If M : X^n → Y is an (ϵ, δ)-differentially private mechanism, then for all pairs of datasets x, x′ ∈ X^n, M(x) and M(x′) are (kϵ, kδe^{kϵ})-indistinguishable when x, x′ differ in exactly k locations.
**Proposition A.2 (Post processing).** If M : X^n → Y is (ϵ, δ)-differentially private and A : Y → Z is any randomized function, then A ◦ M : X^n → Z is also (ϵ, δ)-differentially private.
**Proposition A.3 (Composition).** Let M_i be an (ϵ_i, δ_i)-differentially private mechanism for all i ∈ [k]. If M^[k] is defined to be
M^[k](x) = (M_1(x), ..., M_k(x))    (11)
then M^[k] is (∑_{i=1}^k ϵ_i, ∑_{i=1}^k δ_i)-differentially private.
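As a small numeric illustration of Propositions A.1 and A.3 (our own example, with hypothetical privacy parameters), the sketch below computes the group-privacy guarantee for datasets differing in k elements and the basic composition of several mechanisms.

```python
import numpy as np

def group_privacy(eps, delta, k):
    """Proposition A.1: datasets differing in k elements are (k*eps, k*delta*e^{k*eps})-indistinguishable."""
    return k * eps, k * delta * np.exp(k * eps)

def compose(budgets):
    """Proposition A.3: the (eps_i, delta_i) budgets of composed mechanisms simply add up."""
    return sum(e for e, _ in budgets), sum(d for _, d in budgets)

print(group_privacy(eps=1.0, delta=1e-6, k=3))
print(compose([(0.5, 1e-6), (0.3, 1e-6), (0.2, 0.0)]))
```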
B FINGERPRINTING CODE
In this section we briefly introduce the mechanism of the fingerprinting code, Algorithm 1. The sub-procedure part is the original fingerprinting code of Tardos (2008), given by a pair of randomized algorithms (Gen, Trace). The code generator Gen outputs a codebook C ∈ {0, 1}^{n×p}. The ith row of C is the codeword of user i. The parameter p is called the length of the fingerprinting code.
The security property of fingerprinting codes asserts that any codeword can be "traced" to a user i. Moreover, we require that the fingerprinting code can find one of the malicious users even when they get together and combine their codewords in any way that respects the marking condition. That is, there is a tracing algorithm Trace that takes as input the codebook C and the combined codeword c′ and outputs one of the malicious users with high probability.
The sub-procedure Gen′ first uses a sin²(x)-like distribution to generate a parameter p_j (the mean) for each column j independently, then generates C randomly by setting each element to be 1 with probability p_j according to its location. The sub-procedure Trace′ computes a threshold value Z and a 'score function' S_i(c′) for each user i, then reports i when its score is higher than the threshold.
The main procedure was introduced in Bun et al. (2018), where Gen adds dummy columns to the original fingerprinting code and applies a random permutation. Trace can first 'undo' the permutation and remove the dummy columns, then use Trace′ as a black box. This procedure makes the fingerprinting code more robust in that it tolerates a small fraction of errors in the marking condition.
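For readers who want to experiment, here is a minimal Python sketch (our own, purely illustrative rendering, not the authors' code) of the sub-procedures Gen′ and Trace′ of Algorithm 1, which is reproduced below; the tiny parameters and the simple max-combination used as the coalition's codeword are assumptions made only to keep the example runnable.

```python
import numpy as np

def gen_prime(n, xi, rng):
    """Gen': sample column biases p_j = sin^2(r_j) and a code-book C in {0,1}^{n x d}."""
    d = int(100 * n**2 * np.log(n / xi))
    t = 1.0 / (300 * n)
    t_prime = np.arcsin(np.sqrt(t))
    r = rng.uniform(t_prime, np.pi / 2 - t_prime, size=d)
    p = np.sin(r) ** 2
    C = (rng.random((n, d)) < p).astype(int)
    return C, p

def trace_prime(C, p, c_combined, xi):
    """Trace': accuse every user i whose score S_i(c') reaches the threshold Z/2."""
    n = C.shape[0]
    Z = 20 * n * np.log(n / xi)
    q = np.sqrt((1 - p) / p)
    U = np.where(C == 1, q, -1.0 / q)
    scores = (c_combined * U).sum(axis=1)        # S_i(c') = sum_j c'_j U_ij
    accused = np.flatnonzero(scores >= Z / 2)
    return accused if accused.size else None     # None plays the role of ⊥

rng = np.random.default_rng(0)
n, xi = 3, 0.1
C, p = gen_prime(n, xi, rng)
coalition = [0, 1]
c_combined = C[coalition].max(axis=0)            # respects the marking condition
print("accused users:", trace_prime(C, p, c_combined, xi))
```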
**Algorithm 1 The Fingerprinting Code (Gen, Trace)**
1: Sub-procedure Gen′:
2: Let d = 100n² log(n/ξ) be the length of the code.
3: Let t = 1/(300n) be a parameter and let t′ be such that sin²(t′) = t.
4: for j = 1, ..., d: do
5: Choose r_j uniformly at random from [t′, π/2 − t′] and let p_j = sin²(r_j). Note that p_j ∈ [t, 1 − t].
6: For each i = 1, ..., n, set C_ij = 1 with probability p_j independently.
7: end for
8: return C
9: Sub-procedure Trace′(C, c′):
10: Let Z = 20n log(n/ξ) be a parameter.
11: For each j = 1, ..., d, let q_j = √((1 − p_j)/p_j).
12: For each j = 1, ..., d and each i = 1, ..., n, let U_ij = q_j if C_ij = 1 and U_ij = −1/q_j otherwise.
13: for each i = 1, ..., n: do
14: Let S_i(c′) = ∑_{j=1}^d c′_j U_ij
15: Output i if S_i(c′) ≥ Z/2.
16: Output ⊥ if S_i(c′) < Z/2 for every i = 1, ..., n.
17: end for
18: Main-procedure Gen:
19: Let C be the (random) output of Gen′, C ∈ {0, 1}^{n×d}
20: Append 2d 0-marked columns and 2d 1-marked columns to C.
21: Apply a random permutation π to the columns of the augmented codebook.
22: Let the new codebook be C′ ∈ {0, 1}^{n×5d}.
23: return C′
24: Main-procedure Trace(C, c′):
25: Obtain C′ from the shared state with Gen.
26: Obtain C by applying π⁻¹ to the columns of C′ and removing the dummy columns.
27: Obtain c by applying π⁻¹ to c′ and removing the symbols corresponding to fake columns.
28: return i randomly from Trace′(C, c).
C OMITTED PROOFS
C.1 PROOF OF LEMMA 3.1
_Proof._ The proof uses a black-box reduction and therefore does not depend on Q. The direction that O(n^*/ϵ) samples are sufficient is equivalent to proving the assertion that, given a (1, o(1/n))-
differentially private algorithm A, we can get a new algorithm A′ with (ϵ, o(1/n))-differential privacy at the cost of shrinking the size of the dataset by a factor of ϵ.
Given input ϵ and a dataset X, we construct A′ to first generate a new dataset T by selecting each element of X with probability ϵ independently, and then feed T to A. Fix an event S and two neighboring datasets X_1, X_2 that differ in a single element i. Consider running A on X_1. If i is not included in the sample T, then the output is distributed the same as in a run on X_2. On the other hand, if i is included in the sample T, then the behavior of A on T is only a factor of e off from the behavior of A on T \ {i}. Again, because of independence, the distribution of T \ {i} is the same as the distribution of T conditioned on the omission of i.
For a set X, let p_X denote the distribution of A(X). We have that for any event S,
p_{X_1}(S) = (1 − ϵ) p_{X_1}(S | i ∉ T) + ϵ p_{X_1}(S | i ∈ T)
≤ (1 − ϵ) p_{X_2}(S) + ϵ (e · p_{X_2}(S) + δ)
≤ exp(2ϵ) p_{X_2}(S) + ϵδ
A lower bound p_{X_1}(S) ≥ exp(−ϵ) p_{X_2}(S) − ϵδ/e can be obtained similarly. To conclude,
since ϵδ = o(1/n) as the sample size n decreases by a factor of ϵ, A′ has (2ϵ, o(1/n))-differential privacy. The size of X is roughly 1/ϵ times larger than T; combined with the fact that A has sample complexity n^* and T is fed to A, A′ has sample complexity at least Θ(n^*/ϵ).
For the other direction, simply using the composability of differential privacy yields the desired result. In particular, by the k-fold adaptive composition theorem in Dwork et al. (2006), we can combine 1/ϵ independent copies of (ϵ, δ)-differentially private algorithms to get a (1, δ/ϵ) one, and notice that if δ = o(1/n), then δ/ϵ = o(1/n) as well, because the sample size n is scaled by a factor of ϵ at the same time, offsetting the increase in δ.
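The subsampling step of this reduction is easy to express in code. The sketch below (our own illustration) keeps each record independently with probability ϵ and feeds the subsample to A; the Laplace-noised sum is a hypothetical stand-in for the (1, o(1/n))-differentially private algorithm A.

```python
import numpy as np

def subsample_then_run(A, X, eps, rng):
    """The reduction above: keep each record with probability eps, then run A on the subsample."""
    keep = rng.random(len(X)) < eps
    T = [x for x, kept in zip(X, keep) if kept]
    return A(T)

def noisy_sum(T, rng=np.random.default_rng(0)):
    """Hypothetical stand-in for A: a sensitivity-1 sum with Laplace noise (1-DP on its own)."""
    return sum(T) + rng.laplace(scale=1.0)

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=100).tolist()
print(subsample_then_run(noisy_sum, X, eps=0.1, rng=rng))
```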
C.2 PROOF OF LEMMA 3.2
_Proof._ In line 5 of Algorithm 1, every column j is assigned a probability p_j independently, where
Pr[|p_j − 1/2| < 0.002] < 1/400    (12)
by straightforward calculation. By the Chernoff bound (with u < p/400, δ = 1), with probability at least
1 − 2 exp(−p/800)    (13)
at least a 1 − 1/200 fraction of the columns have |p_j − 1/2| ≥ 0.002. Denote by m_j the mean of the entries of column j; then by using the Chernoff bound again (with δ = 0.001), we have that with probability at least
1 − 2 exp(−n/8000000)    (14)
such a column j actually satisfies |m_j − 1/2| ≥ 0.001. Again by the Chernoff bound (with u ≤ 2 exp(−n/8000000) p and uδ = 0.01p), together with the union bound, at least a 0.99 fraction of all columns have |m_j − 1/2| ≥ 0.001 with probability at least
1 − 2 exp(−p/800) − 2 exp(−p e^{n/8000000}/40000) = 1 − O(e^{−Ω(p)})    (15)
C.3 PROOF OF THEOREM 3.4
_Proof._ Let (α_1, α_2, α_3) = (1/100, 999/1000, exp(−Ω(p))) be the parameters in the statement of Lemma 3.3. Let k = Θ(log(1/δ)) be a parameter to be determined later and n_k = ⌊n/k⌋.
Consider first the case when p ≥ p_{n_k}, where p_{n_k} = O(ϵ²n_k² log(1/δ)). Without loss of generality, we assume ϵ = 1 first, and p_{n_k} = O(n_k² log(1/δ)) corresponds to the number in Lemma 3.3 where we set ξ = δ. We will use contradiction to prove that for any (ϵ, δ)-differentially private mechanism M, there exists some D ∈ {0, 1}^{n×p} with G_{α_1−1/k}(D) ≤ 1 − α_2 such that
E[L(M(D); D) − L(θ⋆; D)] ≥ Ω(p)    (16)
Assume for contradiction that M : {0, 1}^{n×p} → [0, 1]^p is a (randomized) (ϵ, δ)-differentially private mechanism such that
E[L(M(D); D) − L(θ⋆; D)] < α_1α_2p/1000
for all D ∈ {0, 1}^{n×p} with G_{α_1−1/k}(D) ≤ (1 − α_2). We then construct a mechanism M_k on {0, 1}^{n_k×p} with respect to M as follows: with input D^k ∈ {0, 1}^{n_k×p}, M_k copies D^k k times and appends enough 0's to get a dataset D ∈ {0, 1}^{n×p}. The output is M_k(D^k) = M(D). M_k is (k, ((e^k − 1)/(e − 1))δ)-differentially private by group privacy. According to the construction above, we know that if G_{α_1}(D^k) < 1 − α_2, then G_{α_1−1/k}(D) < 1 − α_2 as well.
We take the algorithm A_FP to be the adversarial algorithm in the fingerprinting codes, which rounds the output M_k(D^k) to a binary vector, i.e. rounding those coordinates with values no less than 1/2 to 1 and the remaining ones to 0, and let c = A_FP(M(D)) be the vector after rounding. As M_k is (k, ((e^k − 1)/(e − 1))δ)-differentially private, A_FP is also (k, ((e^k − 1)/(e − 1))δ)-differentially private.
Suppose that for some D^k ∈ {0, 1}^{n_k×p} with G_{α_1}(D^k) ≤ 1 − α_2, the dataset D (constructed from D^k as above) further satisfies
E[L(M(D); D) − L(θ⋆; D)] < α_1α_2p/1000.
As we are considering the ℓ1 loss, we can consider the loss caused by each coordinate independently. Recall that M_k(D^k) = M(D). The fraction of nearly unbiased columns (those whose mean is close to 1/2) is at most 1 − α_2, and we charge the worst-case error for them. For the other 'α_1-biased' columns (coordinates), which make up at least an α_2 fraction of all columns, if M(D) predicts correctly, then