
Nonparametric Estimation via Convex Programming

    Anatoli Juditsky Arkadi Nemirovski

    Abstract

In the paper, we focus primarily on the problem of recovering a linear form $g^Tx$ of an unknown signal $x$ known to belong to a given convex compact set $X\subset\mathbf{R}^n$ from $N$ independent realizations of a random variable taking values in a finite set, the distribution $p$ of this variable being affinely parameterized by $x$: $p = Ax + b$. With no additional assumptions on $X$ and $A$, we develop an estimation routine which is minimax optimal, within an absolute constant factor, and computationally efficient. We then apply this routine to recovering $x$ itself in the Euclidean norm.

    1 Introduction

    In the sequel, we mainly focus on the estimation problem as follows:

Problem I: We observe $N$ independent realizations $i_1,...,i_N$ of a random variable $\omega$ taking values in a finite set, say, the set $I=\{1,...,M\}$. The distribution of $\omega$ (which is identified with a vector $p$ from the standard simplex $P_M=\{y\in\mathbf{R}^M: y\geq 0,\ \sum_i y_i=1\}$ by setting $p_i=\mathrm{Prob}\{\omega=i\}$, $1\leq i\leq M$) is affinely parameterized by an $n$-dimensional signal vector of unknown parameters $x$ known to belong to a given convex compact set $X\subset\mathbf{R}^n$: $p=A(x)=[A_1(x);...;A_M(x)]$, where $A(\cdot)$ is a given affine mapping with $A(X)\subset P_M$. Our goal is to infer from the observations certain information on $x$, primarily, to estimate a given linear form $g^Tz$ of $z\in\mathbf{R}^n$ at the point $x$ underlying our observations.
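For concreteness, here is a minimal simulation of the observation scheme of Problem I (Python; the set $X$, the mapping $A(\cdot)$, the form $g$ and all numbers are toy choices of ours, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance with n = 2, M = 3 and X = [0, 1/2]^2.
# p = A(x) = A0 @ x + b; the columns of A0 sum to 0, so p sums to 1 for every x,
# and for x in X all entries stay nonnegative, i.e. A(X) is contained in P_M.
A0 = np.array([[ 0.2,  0.1],
               [ 0.1,  0.3],
               [-0.3, -0.4]])
b  = np.array([0.3, 0.3, 0.4])
g  = np.array([1.0, -1.0])              # the linear form g^T x to be recovered

x = np.array([0.5, 0.25])               # the (unknown to the statistician) signal, x in X
p = A0 @ x + b                          # distribution of the observed random variable
assert np.all(p >= 0) and abs(p.sum() - 1.0) < 1e-12

N   = 1000
obs = rng.choice(len(p), size=N, p=p)   # realizations i_1, ..., i_N with values in I

# The estimates studied in the paper are affine functions of the empirical distribution:
p_hat = np.bincount(obs, minlength=len(p)) / N
print("empirical distribution:", p_hat, "   target g^T x:", g @ x)
```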

While the unknown $x$ is assumed to be finite-dimensional, we allow the dimension to be arbitrarily large, thus addressing, essentially, a nonparametric estimation problem. In Nonparametric Statistics, there exists an immense literature on various versions of Problem I, including numerous papers on estimating linear forms, see, e.g., [2-4, 6-8, 10-13, 15-20, 22, 24-26, 28, 30, 32-39] and references therein. To the best of our knowledge, the majority of papers on the subject focus on concrete domains $X$ (e.g., distributions on fine grids obtained by discretization of densities from Sobolev balls), and investigate lower and/or upper bounds on the worst-case, w.r.t. $x\in X$, quality to which the problem of interest can be solved. These bounds depend on the number of observations $N$, and the question of primary interest is the behaviour of the bounds as $N\to\infty$. When the lower and the upper bounds coincide within a constant factor (or, ideally, within factor $(1+o(1))$) as $N\to\infty$, the estimation problem is considered as being essentially solved, and the estimation methods underlying the upper bounds are treated as optimal.

LJK, Universite J. Fourier, Grenoble, France, Anatoli.Juditsky@imag.fr
School of ISyE, Georgia Institute of Technology, Atlanta, USA, nemirovs@isye.gatech.edu
Research of the second author was partly supported by the NSF grant # 0619977


The approach we take in this paper is of a different spirit. Except for the concluding Section 4, we make no structural assumptions on $X$, aside from the assumptions of convexity and compactness, which are crucial for us, and we make no assumptions on the affine mapping $A(x)$. Clearly, with no structural assumptions on $X$ and $A(\cdot)$, explicit bounds on the risks of our estimates, as well as bounds on the minimax optimal risk, are impossible. What is possible (and this is our major goal in what follows) is to demonstrate that when estimating linear forms, the worst-case risk of the estimate we develop is within an absolute constant factor of the ideal (i.e., the minimax optimal) risk. It should be added that while the optimal, within an absolute constant factor, worst-case risk of our estimates is not available in a closed analytical form, it is available algorithmically: it can be efficiently computed, provided that $X$ is computationally tractable¹.

While we are not aware of general results of the outlined spirit for Problem I, results of this type do exist for the regression counterpart of Problem I, namely, for

    Problem II: Given indirect noisy observations

$$y = Ax + \sigma\xi \qquad (1)$$

of an unknown signal $x$ known to belong to a given convex compact set $X\subset\mathbf{R}^n$ ($A$ is a given $m\times n$ matrix, $\xi\sim\mathcal{N}(0,I_m)$, $\sigma>0$ is given), we want to estimate the value $g^Tx$ of a given linear form $g^Tz$ of $z\in\mathbf{R}^n$ at the point $x$ underlying our observations.
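Analogously, a minimal sketch of the observation scheme (1), again with toy data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

m, n  = 5, 3
A     = rng.standard_normal((m, n))   # the given m x n matrix
sigma = 0.1                           # the given noise level
x     = np.array([0.2, -0.1, 0.4])    # unknown signal, imagined to lie in X
g     = np.array([1.0, 0.0, -2.0])    # linear form g^T z of interest

xi = rng.standard_normal(m)           # xi ~ N(0, I_m)
y  = A @ x + sigma * xi               # observation model (1)

# An affine-in-y estimate of g^T x has the form w @ y + kappa for some fixed
# w in R^m and kappa in R; Donoho's result (discussed next) says the best such
# estimate is minimax optimal within an absolute constant factor.
print(y)
```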

As shown by D. Donoho [8], for all commonly used loss functions, the minimax optimal affine in $y$ estimate in Problem II (this estimate can be easily built, provided that $X$ is computationally tractable) is minimax optimal, within an absolute constant factor, among all possible estimates. In a sense, our results establish a similar fact for estimating linear forms of the signal in the context of Problem I, since our estimates also are affine: they are affine functions of the empirical distribution of the discrete random variable induced by our observations.

The rest of this paper is organized as follows. In Section 2 we consider the Hypotheses Testing version of Problem I, where one, given two convex subsets $X_1$, $X_2$ of $X$, is interested in testing the hypothesis $x\in X_1$ vs. the alternative $x\in X_2$. The central Section 3 focuses on the version of Problem I where the goal is to estimate via the observations a given linear form of $x$. In the concluding Section 4, we discuss briefly how our results from Section 3 (related to Problem I) and the aforementioned results of Donoho [8] (related to Problem II) can be used in order to recover the entire signal $x$ underlying our observations, the model of observations being either that of Problem I or that of Problem II; as a loss function, we use the standard Euclidean norm on $\mathbf{R}^n$. When passing from recovering linear forms of the unknown signal to recovering the signal itself, we do impose structural assumptions on $X$, but still make no structural assumptions on the affine mapping $A(x)$ (Problem I) and matrix $A$ (Problem II), and our optimality results become weaker: instead of optimality within an absolute constant factor, we end up with statements like "the worst-case risk of such-and-such estimate is in-between the minimax optimal risk and the latter risk raised to a power which depends on the geometry of $X$ (and is close to 1 when this geometry is good enough)". The appendix contains an alternative proof (in our opinion, much simpler than the original one) of the aforementioned Donoho's theorem on minimax "almost-optimality" of affine estimates in the context of Problem II.

¹For details on computational tractability and complexity issues in Convex Optimization, see, e.g., [1, Chapter 4]. A reader not familiar with this area will not lose much when interpreting a computationally tractable convex set as a set given by a finite system of inequalities $p_i(x)\leq 0$, $i=1,...,m$, where $p_i(x)$ are convex polynomials.


2 Problem I: Hypotheses testing

    In this Section, we focus on the case of Problem I as follows:

Hypotheses Testing (HT): In the situation of Problem I, given two closed convex subsets $X_i$, $i=1,2$, of $X$, test the hypothesis $x\in X_1$ vs. the alternative $x\in X_2$.

    2.1 The test

Let $Y_1$, $Y_2$ be two closed convex subsets in $P_M$ (recall that we identify vectors from $P_M$ with probability distributions on the $M$-element index set $I=\{1,...,M\}$). Assume that we are given $N$ independent realizations $i^N=[i_1;...;i_N]$ of a random variable distributed according to $y\in P_M$ and want to distinguish between two hypotheses, 1 and 2, stating, respectively, that $y\in Y_1$ and that $y\in Y_2$. A candidate decision rule in this problem is a function $\psi(i^N)$ taking values 1 and 2; for such a decision rule, its error probabilities $\epsilon_\chi(\psi)$, $\chi=1,2$, are defined as

$$\epsilon_\chi(\psi)=\sup_{y\in Y_\chi}\mathrm{Prob}_{i^N\sim y\times\cdots\times y}\big\{\psi(i^N)\neq\chi\big\}.$$
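For a fixed $y$, the probability inside this definition can be approximated by straightforward simulation; a minimal sketch (Python; the decision rule, the distribution $y$ and all numeric values are our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def error_prob(decide, y, chi, N, trials=10_000):
    # Monte Carlo estimate of Prob_{i^N ~ y x ... x y}{ decide(i^N) != chi };
    # to obtain eps_chi(decide) one would still have to take the sup over y in Y_chi.
    samples = rng.choice(len(y), size=(trials, N), p=y)
    return np.mean([decide(s) != chi for s in samples])

# A toy decision rule on M = 3 outcomes: accept hypothesis 1 iff outcome 0 shows up
# in more than 40% of the observations (purely for illustration).
decide = lambda iN: 1 if np.mean(iN == 0) > 0.4 else 2

y = np.array([0.3, 0.4, 0.3])   # some distribution, imagined to belong to Y_2
print(error_prob(decide, y, chi=2, N=20))
```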

    The test we intend to use is as follows.

Test $\psi_{\phi,c}$: We choose a weight vector $\phi\in\mathbf{R}^M$ and a threshold $c\in\mathbf{R}$ and accept hypothesis 1 when

$$\sum_{t=1}^N \phi_{i_t}\geq c,$$

otherwise we accept hypothesis 2.
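Computationally, the test amounts to thresholding a single linear statistic of the observations; a minimal sketch ($\phi$, $c$ and the observations below are arbitrary toy values of ours; the principled choice of the test parameters is the subject of what follows):

```python
import numpy as np

def accept(obs, phi, c):
    # obs holds i_1, ..., i_N (0-based indices into I); accept hypothesis 1
    # iff sum_{t=1}^N phi_{i_t} >= c, otherwise accept hypothesis 2.
    return 1 if phi[obs].sum() >= c else 2

phi = np.array([1.0, 0.2, -1.5])     # weight vector in R^M, M = 3 (toy values)
c   = 0.5                            # threshold (toy value)
obs = np.array([0, 2, 1, 0, 1])      # observed realizations i_1, ..., i_5
print(accept(obs, phi, c))           # -> 1, since 1.0 - 1.5 + 0.2 + 1.0 + 0.2 = 0.9 >= 0.5
```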

We start with a construction of the test parameters $(\phi,c)$ based on Bernstein approximation [31]. Let us fix $\nu\in Y_2$. With $i^N\sim\nu^N=\nu\times\cdots\times\nu$, the probability for a $(\phi,c)$-test to accept hypothesis 1 is $\mathrm{Prob}_{i^N\sim\nu^N}\big\{\sum_t\phi_{i_t}\geq c\big\}$. For every $\beta>0$, this probability does not exceed the quantity

$$\mathbf{E}_{i^N\sim\nu^N}\Big\{\exp\Big\{\sum_t\beta^{-1}\phi_{i_t}\Big\}\Big\}\exp\{-\beta^{-1}c\}=\exp\{-\beta^{-1}c\}\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)^N.$$
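This is a Chernoff-type bound and can be sanity-checked numerically; a small sketch with arbitrary toy values of $\nu$, $\phi$, $\beta$, $c$ and $N$ (all ours):

```python
import numpy as np

rng = np.random.default_rng(3)

nu  = np.array([0.2, 0.5, 0.3])     # a fixed distribution (playing the role of nu)
phi = np.array([1.0, -0.5, 0.3])    # weight vector
beta, c, N = 2.0, 6.0, 25           # beta > 0, threshold c, sample size N

# Right-hand side: exp(-c/beta) * ( sum_i nu_i * exp(phi_i/beta) )^N
bound = np.exp(-c / beta) * (nu @ np.exp(phi / beta)) ** N

# Left-hand side, estimated by Monte Carlo: Prob_{i^N ~ nu^N}{ sum_t phi_{i_t} >= c }
samples = rng.choice(len(nu), size=(100_000, N), p=nu)
prob = np.mean(phi[samples].sum(axis=1) >= c)

print(f"Prob(accept hypothesis 1) ~= {prob:.3f} <= bound {bound:.3f}")
```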

We conclude that if $\epsilon\in(0,1)$, then the condition

$$\exists\beta>0:\quad N\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)-\beta^{-1}c\leq\ln(\epsilon)$$

is a sufficient condition for the $\nu^N$-probability to accept hypothesis 1 with the test $\psi_{\phi,c}$ to be $\leq\epsilon$. We rewrite this sufficient condition equivalently as

$$\exists\beta>0:\quad \underbrace{N\beta\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)-c+\beta\ln(1/\epsilon)}_{\Phi(\phi,\beta,c;\nu)}\leq 0,\qquad(2)$$

the benefit being the fact that the function $\Phi(\phi,\beta,c;\nu)$ is convex in $(\phi,\beta,c)$ in the domain $\beta>0$ and is concave in $\nu\in P_M$. Indeed, the concavity in $\nu$ is evident; to verify the convexity, note that the function $H(\phi,c;\nu)=N\ln\big(\sum_{i\in I}\nu_i\exp\{\phi_i\}\big)-c+\ln(1/\epsilon)$ clearly is convex in $(\phi,c)$, and the projective transformation $F(u)\mapsto\beta F(\beta^{-1}u)$ is known to convert a convex function of $u$ into a function of $(u,\beta)$ which is convex in the domain $\beta>0$.
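Spelling this out for our case (a one-line expansion, using the definitions of $H$ and $\Phi$ above; the algebra is ours):

$$\beta\,H(\beta^{-1}\phi,\beta^{-1}c;\nu)=\beta\Big[N\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)-\beta^{-1}c+\ln(1/\epsilon)\Big]=N\beta\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)-c+\beta\ln(1/\epsilon)=\Phi(\phi,\beta,c;\nu),$$

so $\Phi$ is exactly the projective (perspective) transformation of $H$ in $(\phi,c)$, whence the claimed convexity in $(\phi,\beta,c)$ on the domain $\beta>0$.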

By a similar argument, the condition

$$\exists\alpha>0:\quad N\alpha\ln\Big(\sum_{i\in I}\nu_i\exp\{-\phi_i/\alpha\}\Big)+c+\alpha\ln(1/\epsilon)\leq 0\qquad(3)$$

guarantees that the $\nu^N$-probability to accept hypothesis 2 with the test $\psi_{\phi,c}$ is $\leq\epsilon$. We have arrived at the following

Proposition 2.1 Assume that $\phi\in\mathbf{R}^M$ and $\alpha,\beta$ are such that

$$N\alpha\max_{\nu\in Y_1}\ln\Big(\sum_{i\in I}\nu_i\exp\{-\phi_i/\alpha\}\Big)+N\beta\max_{\nu\in Y_2}\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)+(\alpha+\beta)\ln(1/\epsilon)\leq 0,\quad\alpha>0,\ \beta>0.\qquad(4)$$

Setting

$$c=\frac{1}{2}\Big[N\beta\max_{\nu\in Y_2}\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)-N\alpha\max_{\nu\in Y_1}\ln\Big(\sum_{i\in I}\nu_i\exp\{-\phi_i/\alpha\}\Big)+(\beta-\alpha)\ln(1/\epsilon)\Big],\qquad(5)$$

we ensure that $\epsilon_1(\psi_{\phi,c})\leq\epsilon$ and $\epsilon_2(\psi_{\phi,c})\leq\epsilon$.

Proof. We have

$$N\alpha\max_{\nu\in Y_1}\ln\Big(\sum_{i\in I}\nu_i\exp\{-\phi_i/\alpha\}\Big)+c+\alpha\ln(1/\epsilon)=\frac{1}{2}\Big[N\alpha\max_{\nu\in Y_1}\ln\Big(\sum_{i\in I}\nu_i\exp\{-\phi_i/\alpha\}\Big)+N\beta\max_{\nu\in Y_2}\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)+(\alpha+\beta)\ln(1/\epsilon)\Big]\leq 0$$

and similarly

$$N\beta\max_{\nu\in Y_2}\ln\Big(\sum_{i\in I}\nu_i\exp\{\phi_i/\beta\}\Big)-c+\beta\ln(1/\epsilon)=\frac{1}{2}\Big[N\alpha\max_{\nu\in Y_1}\ln\Big(\sum_{i\in I}\nu_i\exp\{-\phi_i/\alpha\}\Big)+N\beta\max_{\nu\in Y_2}\ln\Big(\sum_{i\in I}\nu_i$$