
Locally Private Mean Estimation: Z-test and Tight Confidence Intervals

Marco Gaboardi, Ryan Rogers, Or Sheffet

Mean estimation (ME)

• Setting: We have n samples drawn from a Gaussian,

X_1, ..., X_n ∼ i.i.d. N(µ, σ²),

such that µ ∈ [−R, R] for some known bound R, and σ is either provided as an input (known variance case) or left unspecified (unknown variance case).

• Goal: Determine an estimate of µ useful for a Z-test and for releasing confidence intervals:

P_{X ∼ i.i.d. N(µ,σ²), M(X)} [µ ∈ M(X)] ≥ 1 − β

The Need for Privacy

• Data may contain sensitive information.

• Releasing the result of the analysis may itself leak information about individuals.

Modified Goal: Determine an estimate of µ that preserves the privacy of those in the study and is useful for a Z-test and for releasing confidence intervals.

Local Differential Privacy [4] (LDP)

• Central Model: Data is submitted in the clear to a trusted curator, and the output of a statistic on the data is privatized.

• Local Model: No trusted curator: data is privatized and then collected.

• An algorithm M : X → O is ε-differentially private if, for all inputs x, x′ and outcome sets S ⊆ O,

P[M(x) ∈ S] ≤ e^ε · P[M(x′) ∈ S].

• The local model of differential privacy is the one used in practice.
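For instance (a worked check, not from the poster), the randomized response mechanism RR_ε described in the next panel satisfies this definition with equality on a single bit: P[RR_ε(0) = 0] / P[RR_ε(1) = 0] = (e^ε/(1+e^ε)) / (1/(1+e^ε)) = e^ε, and symmetrically for the outcome 1.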

LDP randomizer properties

We use the following mechanisms:

• Gaussian Noise [2]: Suppose each datum is sampled from an interval I of length ℓ. Then we add independent noise

N(0, 2ℓ² ln(2/δ)/ε²)

to each datum, guaranteeing (ε, δ)-differential privacy.

• Randomized Response [5]: Suppose each datum is a bit in {0, 1} and on each datum we operate independently, applying RR_ε : {0, 1} → {0, 1} where

RR_ε(b) = b with probability e^ε/(1 + e^ε), and 1 − b otherwise.

• Bit Flipping algorithm [1]: Suppose each datum x_i is a d-dimensional vector indicating its type using a standard basis vector. The Bit Flipping mechanism runs d independent randomized response mechanisms, one per coordinate, each with parameter ε/2:

BF(x_i^1, ..., x_i^d) = (RR_{ε/2}(x_i^1), ..., RR_{ε/2}(x_i^d))
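As a concrete illustration, a minimal NumPy sketch of the three randomizers above (our own code, not the authors'; the function names and the rng argument are ours):

```python
import numpy as np

def gaussian_noise(x, interval_len, eps, delta, rng=np.random.default_rng()):
    # Add noise N(0, 2*l^2*ln(2/delta)/eps^2) to a datum known to lie in an
    # interval of length l, giving (eps, delta)-LDP per [2].
    sigma = interval_len * np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    return x + rng.normal(0.0, sigma)

def randomized_response(b, eps, rng=np.random.default_rng()):
    # Report the true bit w.p. e^eps/(1+e^eps), the flipped bit otherwise [5].
    keep_prob = np.exp(eps) / (1.0 + np.exp(eps))
    return b if rng.random() < keep_prob else 1 - b

def bit_flipping(x_onehot, eps, rng=np.random.default_rng()):
    # x_onehot is a standard basis vector in {0,1}^d; run randomized response
    # with parameter eps/2 independently on each coordinate [1].
    return np.array([randomized_response(int(b), eps / 2.0, rng) for b in x_onehot])
```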

LDP ME - Known Variance

Our approach is inspired by the work of Karwa and Vadhan [3]. We adapt it to the local model.

KnownVar(X; σ, β, ε, n, R) (sketch)

1. Find a bin of length σ most likely to hold µ.

2. Construct an interval of length 4σ + 2σ√(2 log(8n/β)) centered at this bin.

3. Project all remaining points onto this interval and add independent Gaussian noise.
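A hedged sketch of these three steps, reusing the randomizers sketched above. The even split of users between the histogram vote and the noisy reports, and the helper names, are our own assumptions, not the poster's specification:

```python
import numpy as np

def known_var_sketch(x, sigma, beta, eps, delta, R, rng=np.random.default_rng()):
    n = len(x)
    n1 = n // 2                                   # first half votes on a bin (assumption)
    edges = np.arange(-R, R + sigma, sigma)       # bins of length sigma covering [-R, R]
    d = len(edges) - 1
    # Step 1: each voter reports a bit-flipped one-hot indicator of their bin.
    votes = np.zeros(d)
    for xi in x[:n1]:
        onehot = np.zeros(d)
        onehot[np.clip(np.digitize(xi, edges) - 1, 0, d - 1)] = 1
        votes += bit_flipping(onehot, eps, rng)
    center = (edges[votes.argmax()] + edges[votes.argmax() + 1]) / 2.0
    # Step 2: interval of length 4*sigma + 2*sigma*sqrt(2*log(8n/beta)).
    half_len = 2.0 * sigma + sigma * np.sqrt(2.0 * np.log(8.0 * n / beta))
    lo, hi = center - half_len, center + half_len
    # Step 3: remaining users project onto [lo, hi] and report with Gaussian noise.
    noisy = [gaussian_noise(np.clip(xi, lo, hi), hi - lo, eps, delta, rng)
             for xi in x[n1:]]
    return float(np.mean(noisy))
```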

KnownVar properties

• Privacy: KnownVar is (ε, δ)-LDP.

• Confidence Interval: If

n ≥ 1600 · ((e^{ε/2} + 1)/(e^{ε/2} − 1))² · log(8d/β),

then KnownVar returns an interval I such that P_{X, KnownVar}[µ ∈ I] ≥ 1 − β, whose size is

|I| = O( σ · √(log(n/β) · log(1/β) · log(1/δ)) / (ε√n) ).

Locally Private Z-test

• For any interval I on the reals we can associate a likelihood p_I := P_{X∼P}[X ∈ I], and we know that w.p. p_I ± β it indeed holds that µ ∈ I.

• This mimics the power of a Z-test: in particular we can now compare two intervals as to which one is more likely to hold µ, compare populations, etc. (a sketch of such a comparison follows the figure below).

• Below are results showing the empirical p-values and power averaged over 100 trials for various privacy parameters.

[Figure: four panels titled "p-Values for Z-Test", showing p-values (y-axis, 0.0–0.5) against alternative hypotheses for the mean (x-axis, 0.1–0.6), for privacy parameters ε = 0.5, 0.8, 1, 1.5.]
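The poster does not spell out how the interval likelihoods are used, so the following is a speculative sketch of the comparison described above, assuming the reference distribution P is approximated by a normal with a privately estimated mean and the known σ; the function names are ours:

```python
from scipy.stats import norm

def interval_likelihood(interval, mu_hat, sigma):
    # p_I = P_{X ~ N(mu_hat, sigma^2)}[X in I] for an interval I = (lo, hi).
    lo, hi = interval
    return norm.cdf(hi, loc=mu_hat, scale=sigma) - norm.cdf(lo, loc=mu_hat, scale=sigma)

def more_likely_interval(I1, I2, mu_hat, sigma, beta):
    # Declare I1 the more likely home of mu only if its pessimistic bound
    # p_{I1} - beta beats I2's optimistic bound p_{I2} + beta (and vice versa).
    p1 = interval_likelihood(I1, mu_hat, sigma)
    p2 = interval_likelihood(I2, mu_hat, sigma)
    if p1 - beta > p2 + beta:
        return I1
    if p2 - beta > p1 + beta:
        return I2
    return None  # inconclusive at this confidence level
```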

Lower Bounds

• Main Lemma: Let M be a one-shot (each individual is presented with a single query) local ε-differentially private mechanism. Let P and Q be two distributions, with Δ := d_TV(P, Q). Fix any 0 < δ < e^{−1} and set

ε* = 8εΔ√n · ( √((1/2) ln(2/δ)) + 16εΔ√n ).

Then, for any set S of outputs,

Pr_{X ∼ i.i.d. P; M}[M(X) ∈ S] ≤ e^{ε*} · Pr_{X ∼ i.i.d. Q; M}[M(X) ∈ S] + δ.

• Lower bound: Any one-shot local differentially private algorithm must return an interval of length

Ω( σ√(log(1/β)) / (ε√n) ).

• Lower bound: Let M be an ε-LDP mechanism which is (α_dist, α_quant, β)-useful for the p-quantile problem over P, given that the true p-quantile lies in the interval [−R, R]. Then, for any β < 1/6, it must hold that

n ≥ Ω( (1/(α_quant² ε²)) · ln(R/(α_dist β)) ).

LDP ME - Unknown Variance

Our approach mimics Algorithm KnownVar, but without knowledge of the variance.

• Goal: Find a suitably large yet sufficiently tight interval [s_1, s_2].

• Problem: This cannot be done using the off-the-shelf Bit Flipping mechanism, as that requires knowing the granularity of each bin in advance.

• Solution: We abandon the idea of finding a histogram on the data. Instead, we propose finding a good approximation for σ using quantile estimation based on a binary search, via the following algorithm.

Algorithm BinQuant
Require: Data {x_1, ..., x_N}, target quantile p*; ε, [Q_min, Q_max], λ, T.
Initialize j = 0, n = N/T, s_1 = Q_min, s_2 = Q_max.
for j = 1, ..., T do
  Select users U^(j) = {j·n + 1, j·n + 2, ..., (j + 1)·n}
  Set t^(j) ← (s_1 + s_2)/2
  Denote φ^(j)(x) = 1{x < t^(j)}.
  Run randomized response on U^(j) and obtain Z^(j) = (1/n)·θ̂_RR(n, φ^(j)).
  if Z^(j) > p* + λ/2 then
    s_2 ← t^(j)
  else if Z^(j) < p* − λ/2 then
    s_1 ← t^(j)
  else
    break
Ensure: t^(j)
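A minimal Python sketch of BinQuant under our reading of the pseudocode: θ̂_RR is taken to be the debiased randomized-response count of users with x_i < t, and the user groups are indexed from zero. These choices, and the function name, are our own:

```python
import numpy as np

def bin_quant(x, p_star, eps, q_min, q_max, lam, T, rng=np.random.default_rng()):
    N = len(x)
    n = N // T
    s1, s2 = q_min, q_max
    keep_prob = np.exp(eps) / (1.0 + np.exp(eps))
    t = (s1 + s2) / 2.0
    for j in range(T):
        group = x[j * n:(j + 1) * n]             # a fresh group of n users per round
        t = (s1 + s2) / 2.0                      # current binary-search threshold
        bits = (group < t).astype(int)           # phi^(j)(x) = 1{x < t}
        flipped = np.array([b if rng.random() < keep_prob else 1 - b for b in bits])
        # Debias: E[flipped bit] = (2*keep_prob - 1)*b + (1 - keep_prob).
        z = (flipped.mean() - (1.0 - keep_prob)) / (2.0 * keep_prob - 1.0)
        if z > p_star + lam / 2.0:
            s2 = t                               # estimated quantile lies below t
        elif z < p_star - lam / 2.0:
            s1 = t                               # estimated quantile lies above t
        else:
            break                                # close enough: stop early
    return t
```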

Our algorithm UnkVar uses the quantile estimation twice: once for p* = 1/2, where t* = µ, and once for p* = Φ(1) ≈ 0.8413, for which the corresponding threshold is t* = µ + σ. Using these two values we obtain estimates for µ and σ, and we apply a similar approach to Algorithm KnownVar.
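Sketched with the bin_quant function above, the two calls and the way their outputs would combine into estimates of µ and σ might look as follows; the even split of users across the two calls is our own assumption:

```python
from scipy.stats import norm

def unk_var_estimates(x, eps, R, lam, T):
    half = len(x) // 2                                   # split users across the two calls (assumption)
    t_half = bin_quant(x[:half], 0.5, eps, -R, R, lam, T)            # p* = 1/2    -> t* = mu
    t_phi1 = bin_quant(x[half:], norm.cdf(1.0), eps, -R, R, lam, T)  # p* = Phi(1) -> t* = mu + sigma
    mu_hat = t_half
    sigma_hat = t_phi1 - t_half
    return mu_hat, sigma_hat
```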

UnkVar properties

• Privacy: UnkVar is (ε, δ)-LDP.

• Confidence Interval: Let X ∼ N(µ, σ²) i.i.d. Fix parameters ε, β ∈ (0, 1/2). Given that µ ∈ [−R, R] and that σ_min ≤ σ ≤ σ_max ≤ 2R, if

n ≥ 1500 · log₂(16R/σ_min) · ((e^ε + 1)/(e^ε − 1))² · ln(16 log₂(16R/σ_min)/β),

then the interval Î returned by Algorithm UnkVar satisfies P_{X, UnkVar}[Î ∋ µ] ≥ 1 − β, and moreover

|Î| = O( σ · √(log(n/β) · log(1/β) · log(1/δ)) / (ε√n) ).

• Very large variance case: If σ > R, we give a different algorithm, based on matching quantiles. We estimate p_− = Pr[X < −R] and p_+ = Pr[X < R], and then fit the Gaussian by matching these probabilities to the corresponding quantiles of N(0, 1).
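The poster only names the matching-quantiles idea; here is a hedged sketch of the final matching step, assuming p_− and p_+ have already been estimated privately (e.g., by randomized response on the bits 1{x < −R} and 1{x < R}); the function name is ours:

```python
from scipy.stats import norm

def match_quantiles(p_minus, p_plus, R):
    # Solve Phi((-R - mu)/sigma) = p_minus and Phi((R - mu)/sigma) = p_plus
    # for (mu, sigma) using the standard normal quantile function.
    z_minus, z_plus = norm.ppf(p_minus), norm.ppf(p_plus)
    sigma_hat = 2.0 * R / (z_plus - z_minus)
    mu_hat = -R - sigma_hat * z_minus
    return mu_hat, sigma_hat
```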

References

[1] R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In STOC 2015.

[2] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT 2006.

[3] V. Karwa and S. P. Vadhan. Finite sample differentially private confidence intervals. In ITCS 2018.

[4] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. D. Smith. What can we learn privately? In FOCS 2008.

[5] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60:63–69, 1965.

AISTATS 2019, Naha, Japan