
  • STAT 200C: High-dimensional Statistics

    Arash A. Amini

    June 4, 2019


  • • Classical case: n ≫ d.
    • Asymptotic assumption: d is fixed and n → ∞.
    • Basic tools: LLN and CLT.

    • High-dimensional setting:
      • n ∼ d, e.g. n/d → γ, or
      • d ≫ n, e.g. d ∼ eⁿ,
      • e.g. 10⁴ genes with only 50 samples.
    • Classical methods fail. E.g., linear regression y = Xβ + ε, where ε ∼ N(0, σ²Iₙ):

      β̂OLS = argmin_{β ∈ Rᵈ} ‖y − Xβ‖₂²

    • We have MSE(β̂OLS) = O(σ²d/n), which does not vanish unless d/n → 0.

    • Solution: Assume some underlying low-dimensional structure (e.g. sparsity).
    • Basic tools: Concentration inequalities.
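    • A quick simulation of the O(σ²d/n) scaling (my own sketch, not part of the notes; the Gaussian design and the specific n, d values are arbitrary choices):

      import numpy as np

      rng = np.random.default_rng(0)
      sigma = 1.0

      def ols_mse(n, d, reps=100):
          # Average ||beta_hat - beta||_2^2 over random Gaussian designs.
          errs = []
          for _ in range(reps):
              X = rng.standard_normal((n, d))
              beta = rng.standard_normal(d)
              y = X @ beta + sigma * rng.standard_normal(n)
              beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
              errs.append(np.sum((beta_hat - beta) ** 2))
          return np.mean(errs)

      for n, d in [(1000, 10), (1000, 100), (1000, 300)]:
          # Empirical MSE should track sigma^2 * d / n up to a constant.
          print(n, d, round(ols_mse(n, d), 4), sigma**2 * d / n)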


  • Table of Contents

    1 Concentration inequalities
        Sub-Gaussian concentration (Hoeffding inequality)
        Sub-exponential concentration (Bernstein inequality)
        Applications of Bernstein inequality
            χ² concentration
            Johnson–Lindenstrauss embedding
            ℓ2 norm concentration
            ℓ∞ norm
        Bounded difference inequality (Azuma–Hoeffding)
            Detour: Martingales
            Azuma–Hoeffding
            Bounded difference inequality
            Concentration of (bounded) U-statistics
            Concentration of clique numbers
        Gaussian concentration
            χ² concentration revisited
            Order statistics and singular values
        Gaussian chaos (Hanson–Wright inequality)

    2 Sparse linear models
        Basis Pursuit
        Restricted null space property (RNS)
        Sufficient conditions for RNS
            Pairwise incoherence
            Restricted isometry property (RIP)
        Noisy sparse regression
            Restricted eigenvalue condition
            Deviation bounds under RE
            RE for anisotropic design
            Oracle inequality (and ℓq sparsity)

    3 Metric entropy
        Packing and covering
        Volume ratio estimates

    4 Random matrices and covariance estimation
        Op-norm concentration: sub-Gaussian matrices, independent entries
        Op-norm concentration: Gaussian case
        Sub-Gaussian random vectors
        Op-norm concentration: sample covariance

    5 Structured covariance matrices
        Hard thresholding estimator
        Approximate sparsity (ℓq balls)

    6 Matrix concentration inequalities

  • Concentration inequalities

    • Main tools in dealing with high-dimensional randomness.
    • Non-asymptotic versions of the CLT.
    • General form: P(|X − EX| > t) < something small.
    • Classical examples: Markov and Chebyshev inequalities.
    • Markov: Assume X ≥ 0. Then

      P(X ≥ t) ≤ EX / t.

    • Chebyshev: Assume EX² < ∞. Then

      P(|X − EX| ≥ t) ≤ Var(X) / t².
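    • A quick numeric comparison of the two bounds with an exact tail (my own sketch; the Exp(1) example is an arbitrary choice):

      import math

      # X ~ Exp(1): EX = 1, Var(X) = 1, and P(X >= t) = exp(-t).
      for t in [2.0, 4.0, 8.0]:
          markov = 1.0 / t                    # Markov: EX / t
          chebyshev = 1.0 / (t - 1.0) ** 2    # Chebyshev applied to |X - EX| >= t - 1
          exact = math.exp(-t)
          print(f"t={t}: Markov={markov:.3g}, Chebyshev={chebyshev:.3g}, exact={exact:.3g}")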

  • Concentration inequalities

    Example 1

    • X₁, . . . , Xₙ ∼ Ber(1/2) and Sₙ = ∑ᵢ₌₁ⁿ Xᵢ. Then, by the CLT,

      Zₙ := (Sₙ − n/2) / √(n/4) →d N(0, 1).

    • Letting g ∼ N(0, 1),

      P(Sₙ ≥ n/2 + √(n/4) · t) ≈ P(g ≥ t) ≤ (1/2) exp(−t²/2).

    • Letting t = α√n,

      P(Sₙ ≥ (n/2)(1 + α)) ⪅ (1/2) exp(−nα²/2).

    • Problem: The approximation is not tight in general.
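    • A minimal Monte Carlo check of this tail approximation (my own sketch; n, α, and the repetition count are arbitrary choices):

      import math
      import numpy as np

      rng = np.random.default_rng(0)
      n, alpha, reps = 100, 0.3, 200_000

      # S_n = number of heads in n fair coin flips.
      S = rng.binomial(n, 0.5, size=reps)
      empirical = np.mean(S >= (n / 2) * (1 + alpha))
      approx_bound = 0.5 * math.exp(-n * alpha**2 / 2)
      print(empirical, approx_bound)   # empirical tail sits below the bound here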


  • Theorem 1 (Berry–Esseen CLT)

    Under the assumptions of the CLT, with ρ = E|X₁ − µ|³/σ³,

      |P(Zₙ ≥ t) − P(g ≥ t)| ≤ ρ/√n.

    • The bound is tight, since P(Sₙ = n/2) = 2⁻ⁿ (n choose n/2) ∼ 1/√n for the Bernoulli example.
    • Conclusion: the approximation error is O(1/√n), which is a lot larger than the exponential bound O(exp(−nα²/2)) that we want to establish.
    • Solution: Directly obtain the concentration inequalities,
    • often using the Chernoff bounding technique: for any λ > 0,

      P(Zₙ ≥ t) = P(exp(λZₙ) ≥ exp(λt)) ≤ E exp(λZₙ) / exp(λt), t ∈ R.

    • This leads to the study of the MGF of random variables.


  • Sub-Gaussian concentration

    Definition 1
    A zero-mean random variable X is sub-Gaussian if, for some σ > 0,

      E exp(λX) ≤ exp(σ²λ²/2), for all λ ∈ R. (1)

    A general random variable is sub-Gaussian if X − EX is sub-Gaussian.

    • X ∼ N(0, σ²) satisfies (1) with equality.
    • A Rademacher variable, also called symmetric Bernoulli, P(X = ±1) = 1/2, is sub-Gaussian:

      E exp(λX) = cosh(λ) ≤ exp(λ²/2).

    • Any bounded RV is sub-Gaussian: if X ∈ [a, b] a.s., then (1) holds with σ = (b − a)/2.
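    • The cosh bound follows by comparing Taylor coefficients; a short derivation (standard, spelled out here for completeness):

      cosh(λ) = ∑ₖ λ²ᵏ/(2k)! ≤ ∑ₖ (λ²/2)ᵏ/k! = exp(λ²/2),

      using (2k)! ≥ 2ᵏ k! for all k ≥ 0.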


  • Proposition 1

    Assume that X is zero-mean sub-Gaussian satisfying (1). Then,

      P(X ≥ t) ≤ exp(−t²/(2σ²)), for all t ≥ 0.

    The same bound holds with X replaced by −X.

    • Proof: Chernoff bound

      P(X ≥ t) ≤ inf_{λ>0} [exp(−λt) E exp(λX)] ≤ inf_{λ>0} exp(−λt + λ²σ²/2),

      and the infimum is attained at λ = t/σ², giving exp(−t²/(2σ²)).

    • Union bound gives the two-sided bound: P(|X| ≥ t) ≤ 2 exp(−t²/(2σ²)).
    • What if µ := EX ≠ 0? Apply the result to X − µ:

      P(|X − µ| ≥ t) ≤ 2 exp(−t²/(2σ²)).


  • Proposition 2

    Assume that {Xᵢ} are
    • independent, zero-mean sub-Gaussian with parameters {σᵢ}.

    Then, Sₙ = ∑ᵢ Xᵢ is sub-Gaussian with parameter σ := √(∑ᵢ σᵢ²).

    • The sub-Gaussian parameter squared behaves like the variance.
    • Proof: E exp(λSₙ) = ∏ᵢ E exp(λXᵢ) ≤ ∏ᵢ exp(λ²σᵢ²/2) = exp(λ²σ²/2).


  • Theorem 2 (Hoeffding)

    Assume that {Xᵢ} are
    • independent, zero-mean sub-Gaussian with parameters {σᵢ}.

    Then, letting σ² := ∑ᵢ σᵢ²,

      P(∑ᵢ Xᵢ ≥ t) ≤ exp(−t²/(2σ²)), t ≥ 0.

    The same bound holds with Xᵢ replaced by −Xᵢ.

    • Alternative form: assume there are n variables, and
    • let σ̄² := (1/n) ∑ᵢ₌₁ⁿ σᵢ² and X̄ₙ := (1/n) ∑ᵢ₌₁ⁿ Xᵢ. Then,

      P(X̄ₙ ≥ t) ≤ exp(−nt²/(2σ̄²)), t ≥ 0.

    • Example: Xᵢ iid ∼ Rad, so that σ̄ = σᵢ = 1 and P(X̄ₙ ≥ t) ≤ exp(−nt²/2); see the simulation below.
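    • A minimal simulation of this Rademacher example (my own sketch; n, t, and the repetition count are arbitrary choices):

      import math
      import numpy as np

      rng = np.random.default_rng(0)
      n, t, reps = 200, 0.2, 20_000

      # Rademacher variables: +/-1 with probability 1/2 each.
      X = rng.choice([-1.0, 1.0], size=(reps, n))
      empirical = np.mean(X.mean(axis=1) >= t)
      hoeffding = math.exp(-n * t**2 / 2)
      print(empirical, hoeffding)   # empirical tail should sit below the bound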


  • Equivalent characterizations of sub-Gaussianity

    For a RV X, the following are equivalent (HDP, Prop. 2.5.2):

    1. The tails of X satisfy

       P(|X| ≥ t) ≤ 2 exp(−t²/K₁²), for all t ≥ 0.

    2. The moments of X satisfy

       ‖X‖ₚ = (E|X|ᵖ)^{1/p} ≤ K₂ √p, for all p ≥ 1.

    3. The MGF of X² satisfies

       E exp(λ²X²) ≤ exp(K₃²λ²), for all |λ| ≤ 1/K₃.

    4. The MGF of X² is bounded at some point,

       E exp(X²/K₄²) ≤ 2.

    Assuming EX = 0, the above are equivalent to:

    5. The MGF of X satisfies

       E exp(λX) ≤ exp(K₅²λ²), for all λ ∈ R.


  • Sub-Gaussian norm

    • The sub-Gaussian norm is the smallest K₄ in property 4, i.e.,

      ‖X‖ψ₂ = inf{ t > 0 : E exp(X²/t²) ≤ 2 }.

    • X is sub-Gaussian iff ‖X‖ψ₂ < ∞.
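    • The infimum can be approximated numerically by bisection on t (my own sketch; sub_gaussian_norm is a hypothetical helper, not from the notes, with the expectation replaced by a sample average):

      import numpy as np

      def sub_gaussian_norm(samples, tol=1e-4):
          # Estimate ||X||_{psi_2} = inf{t > 0 : E exp(X^2/t^2) <= 2} by bisection.
          x2 = np.asarray(samples) ** 2
          lo, hi = 1e-6, 10.0 * np.sqrt(x2.max())   # at t = hi, E exp(X^2/t^2) <= e^{0.01} < 2
          while hi - lo > tol:
              mid = (lo + hi) / 2
              # Clip the exponent to avoid overflow; such values exceed 2 anyway.
              if np.mean(np.exp(np.minimum(x2 / mid**2, 700.0))) <= 2:
                  hi = mid
              else:
                  lo = mid
          return hi

      # Rademacher X has X^2 = 1, so the exact value is 1/sqrt(log 2) ~ 1.2011.
      print(sub_gaussian_norm(np.array([-1.0, 1.0])))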


  • Some consequences

    • Recall what a universal/numerical/absolute constant means.
    • The sub-Gaussian norm is within a constant factor of the sub-Gaussian parameter σ: for numerical constants c₁, c₂ > 0,

      c₁ ‖X‖ψ₂ ≤ σ(X) ≤ c₂ ‖X‖ψ₂.

    • It is easy to see that ‖X‖ψ₂ ≲ ‖X‖∞. (Bounded variables are sub-Gaussian.)
    • a ≲ b means a ≤ Cb for some universal constant C.

    Lemma 1 (Centering)
    If X is sub-Gaussian, then X − EX is sub-Gaussian too and

      ‖X − EX‖ψ₂ ≤ C ‖X‖ψ₂,

    where C is a universal constant.

    • Proof: ‖EX‖ψ₂ ≲ |EX| ≤ E|X| = ‖X‖₁ ≲ ‖X‖ψ₂; now apply the triangle inequality to X − EX.
    • Note: ‖X − EX‖ψ₂ could be much smaller than ‖X‖ψ₂.


  • Alternative forms

    • Alternative form of Proposition 2:

    Proposition 3 (HDP 2.6.1)
    Assume that {Xᵢ} are
    • independent, zero-mean sub-Gaussian RVs.

    Then ∑ᵢ Xᵢ is also sub-Gaussian and

      ‖∑ᵢ Xᵢ‖ψ₂² ≤ C ∑ᵢ ‖Xᵢ‖ψ₂²,

    where C is an absolute constant.


  • • Alternative form of Theorem 2:

    Theorem 3 (Hoeffding)
    Assume that {Xᵢ} are independent, zero-mean sub-Gaussian RVs. Then,

      P(|∑ᵢ Xᵢ| ≥ t) ≤ 2 exp(−c t² / ∑ᵢ ‖Xᵢ‖ψ₂²), t ≥ 0,

    • where c > 0 is some universal constant.


  • Sub-exponential concentration

    Definition 2
    A zero-mean random variable X is sub-exponential if, for some ν, α > 0,

      E exp(λX) ≤ exp(ν²λ²/2), for all |λ| < 1/α. (2)

    A general random variable is sub-exponential if X − EX is sub-exponential.

    • If Z ∼ N(0, 1), then Z² is sub-exponential:

      E exp(λ(Z² − 1)) = exp(−λ)/√(1 − 2λ) for λ < 1/2, and = ∞ for λ ≥ 1/2.

    • We have E exp(λ(Z² − 1)) ≤ exp(4λ²/2) for |λ| < 1/4,
    • hence Z² − 1 is sub-exponential with parameters (ν, α) = (2, 4).
    • The tails of Z² − 1 are heavier than Gaussian tails.
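    • The MGF formula is a Gaussian integral; a short derivation (standard, spelled out here for completeness): for λ < 1/2,

      E exp(λZ²) = (2π)^{−1/2} ∫ exp(λz² − z²/2) dz = (2π)^{−1/2} ∫ exp(−(1 − 2λ)z²/2) dz = (1 − 2λ)^{−1/2},

      so E exp(λ(Z² − 1)) = exp(−λ)/√(1 − 2λ); for λ ≥ 1/2 the integral diverges.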


  • Proposition 4

    Assume that X is zero-mean sub-exponential satisfying (2). Then,

      P(X ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}), for all t ≥ 0.

    The same bound holds with X replaced by −X.

    • Proof: Chernoff bound

      P(X ≥ t) ≤ inf_{λ≥0} [exp(−λt) E exp(λX)] ≤ inf_{0≤λ<1/α} exp(−λt + λ²ν²/2).

    • Let f(λ) = −λt + λ²ν²/2.
    • The minimizer of f over R is λ = t/ν².


  • • Hence the minimizer of f over [0, 1/α] is

      λ∗ = t/ν² if t/ν² < 1/α, and λ∗ = 1/α otherwise.