CDA chapter 7

  • CDA chapter 7

  • Probit models

    Let Φ denote the standard normal cdf. The probit model is

    Φ−1[π(x)] = α + βx

    It corresponds to a latent/threshold variable model. Let Y∗ be an unobserved continuous response such that

    y = 0 if y∗ ≤ 0,   y = 1 if y∗ > 0

    Suppose that Y∗ = α + βx + ε, with ε ∼ N(0, 1).

    Therefore,

    P(Y = 1) = P(Y∗ > 0) = P(ε > −(α + βx)) = Φ(α + βx)

    which is a probit model.
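
    As a minimal illustration of fitting such a model (not part of the original notes; the data are simulated and statsmodels is assumed to be available):

      import numpy as np
      import statsmodels.api as sm

      # Simulate from the latent-variable formulation: Y* = alpha + beta*x + eps, eps ~ N(0, 1)
      rng = np.random.default_rng(0)
      x = rng.uniform(0, 2, 500)
      alpha, beta = -1.0, 1.5                        # assumed true values
      y = (alpha + beta * x + rng.normal(size=500) > 0).astype(int)

      # Fit the probit model Phi^{-1}[pi(x)] = alpha + beta*x
      fit = sm.Probit(y, sm.add_constant(x)).fit(disp=False)
      print(fit.params)                              # estimates of (alpha, beta)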

  • Probit models: interpreting effects

    I The response curve for π(x) is a normal cdf with mean µ = −α/β and standard deviation σ = 1/|β|. The logistic regression curve for π(x) is a logistic cdf with the same mean and standard deviation π/(|β|√3).

    I Since 68% of the normal density falls within a standard deviation of the mean, 1/|β| is the distance between the x values where π(x) = 0.16 or 0.84 and where π(x) = 0.50.

    I The instantaneous rate of change in π(x) is ∂π(x)/∂x = βφ(α + βx), where φ is the standard normal density.

    I When π(x) = 1/2, i.e., at x = −α/β, the rate of change in π(x) is 0.40β. It is 0.25β for logistic regression.

    I Therefore, when both models fit well, parameter estimates in logistic regression are about 1.6 (= 0.40/0.25) to 1.8 (= π/√3) times those in probit models (see the numerical check after this list).

    I Although probit model parameters are on a different scale than logistic model parameter estimates, the probability summaries of effects are similar.
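
    A quick numerical check of these constants (a sketch, not from the notes): the probit rate at π(x) = 1/2 is βφ(0) with φ(0) = 1/√(2π) ≈ 0.40, while the logistic density at its median is 1/4.

      import numpy as np
      from scipy.stats import norm

      probit_rate = norm.pdf(0)            # phi(0) = 1/sqrt(2*pi) ~ 0.399
      logistic_rate = 0.25                 # logistic density at its median is 1/4
      print(probit_rate / logistic_rate)   # ~ 1.6
      print(np.pi / np.sqrt(3))            # ~ 1.81 (ratio of standard deviations)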

  • Example: Modeling flour beetle mortality

    Table 7.1 reports the number of adult flour beetles killed after 5 hours of exposure to gaseous carbon disulfide at various concentrations.

    The ML fit of the probit model is

    Φ−1[π̂(x)] = −34.94 + 19.73x

    I π̂(x) = 0.5 at x = −α̂/β̂ = 34.94/19.73 = 1.77

    I The fit corresponds to a normal cdf with mean 1.77 and standard deviation 1/19.73 = 0.05

    I As x increases from 1.69 to 1.88, the estimated probability of death increases from 0.057 to 0.987 (reproduced in the sketch after this list)

    I For a 0.10-unit increase in x, the latent continuous dose-tolerance variable y∗ shifts up by 0.10 × 19.73 ≈ 2.0, roughly 2 standard deviations.

    I The deviance G2 is 11.23 for the logit model and 10.12 for the probit model, with df = 6.
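
    The fitted values quoted above can be reproduced directly from the reported coefficients (a sketch assuming scipy is available):

      from scipy.stats import norm

      alpha_hat, beta_hat = -34.94, 19.73          # reported probit fit

      def pi_hat(x):
          # Estimated probability of death: Phi(alpha + beta*x)
          return norm.cdf(alpha_hat + beta_hat * x)

      print(pi_hat(1.69), pi_hat(1.88))            # ~0.06 and ~0.98
      print(-alpha_hat / beta_hat)                 # dose where pi(x) = 0.5, ~1.77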

  • Complementary log-log models

    For logit and probit models, the response curve π(x) is symmetric about π(x) = 0.5, with π(x) approaching 0 at the same rate that it approaches 1.

    Complementary log-log models

    π(x) = 1− exp[− exp(α + βx)]

    or equivalently

    log[− log(1− π(x))] = α + βx

    are asymmetric, with π(x) approaching 0 fairly slowly but approaching 1 quite sharply.

  • log-log models

    π(x) = exp[− exp(α + βx)]

    or equivalently

    log[− log(π(x))] = α + βx

    I Log-log models have π(x) approaching 0 sharply but approaching 1 slowly.

    I When the log-log model holds for the probability of a success, the complementary log-log model holds for the probability of a failure.

    I Log-log models correspond to latent continuous variables following a Gumbel distribution, whose cdf is

    exp[− exp[−(x − a)/b]]

    It is highly skewed to the right.

  • Example: Beetle Mortality revisited

    Interpretation of complementary log-log models:

    1 − π(x2) = [1 − π(x1)]^exp[β(x2 − x1)]

    For x2 − x1 = 1, the complement probability at x2 equals the complement probability at x1 raised to the exp(β) power.

    For the data, the complementary log-log model has ML estimate

    log(− log(1− π(x))) = −39.57 + 22.04x

    I The probability of survival at dosage 1.7 is 0.885, whereas at x = 1.8 it is 0.33, and at x = 1.9 it is 4 × 10−5. The probability of survival at x + 0.1 equals the probability at dosage x raised to the exp(22.04 × 0.1) = 9.06 power (checked in the sketch after this list).

    I The G2 = 3.45 with df = 6 suggests adequate fit.

    I The AICs are 41.3 for the logit link, 40.2 for the probit link, 33.7 for the complementary log-log link, and 57.8 for the log-log link.

  • Bayesian inference for binary regression

    Choices of priors for the coefficient β:

    I Flat prior

    I The posterior distribution is then a scaling of the likelihood function so that it integrates to 1

    I The mode of the posterior distribution is then the ML estimate. The mean of the posterior distribution is the Bayesian estimate. When the posterior distribution is approximately normal, the two are similar.

    I Some flat priors are improper, not integrating to 1, especially when the parameters can take values over the entire real line. In this case, the posterior distribution can also be improper for some models.

  • Bayesian inference for binary regression

    I Diffuse priors: a normal density with large standard deviation

    I Subjective Bayesian approach, whereby prior distributions represent prior beliefs about β. For example, take µ and σ in the normal prior such that µ ± 3σ contains all plausible values.

    I Construct a prior distribution on the probability scale rather than on the logit scale

    I Jeffreys prior

  • Example: Risk factors for endometrial cancer grade

    Consider the main-effects model

    logit[P(Y = 1)] = α + β1x1 + β2x2 + β3x3

    I x1 is coded as -0.5 and 0.5 so that the induced prior distribution on the logit is the same for each group of x1.

    I x2 and x3 are standardized.

  • Example: Risk factors for endometrial cancer grade

    Consider the hypothesis β1 ≤ 0 vs β1 > 0

    I The frequentist likelihood-ratio test gives p-value 0.001

    I The Bayesian p-value (posterior probability that β1 ≤ 0) is 0.002 with prior σ = 10 and 0.007 with prior σ = 1.

  • Example: Modeling the probability a Trauma patient survives

    Probability-based prior specifications require selecting prior distributions for at least as many probability values as there are parameters in the model.

    I When x1 = 25, x2 = 7.84, x3 = 60, x4 = 0, representing a person with normal vital signs who was not badly hurt, assume a beta(1.1, 8.5) prior for the probability of death. This is equivalent to augmenting the data with 9.6 observations, of which 1.1 are deaths, for this predictor setting.

    I When x1 = 41, x2 = 3.34, x3 = 60, x4 = 1, representing a person with a considerably more severe injury score and poor trauma score, assume a beta(5.9, 1.7) prior for the probability of death. This is equivalent to augmenting the data with 7.6 observations, of which 5.9 are deaths, for this predictor setting.

  • Example: Modeling the probability a Trauma patient survives

    With six prior distributions for six probabilities, they induce prior distributions for the logistic model parameters. With data augmentation priors, the posterior distribution has the shape of the likelihood for the augmented data set. So we can use standard frequentist software to find the posterior mode by finding the ML estimate for the augmented data.
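
    A minimal sketch of the data-augmentation idea with statsmodels (the observed data are simulated here, and only the two prior settings described above are included rather than all six):

      import numpy as np
      import statsmodels.api as sm

      # Hypothetical observed trauma data: intercept plus x1-x4, with a binary death indicator
      rng = np.random.default_rng(1)
      n = 200
      X = np.column_stack([np.ones(n), rng.uniform(0, 50, n), rng.uniform(0, 8, n),
                           rng.uniform(20, 80, n), rng.integers(0, 2, n)])
      y = rng.integers(0, 2, n)
      endog_obs = np.column_stack([y, 1 - y])            # (deaths, survivals) per subject

      # Prior pseudo-data: a beta(a, b) prior at a setting = a deaths and b survivals there
      X_prior = np.array([[1, 25, 7.84, 60, 0],          # beta(1.1, 8.5)
                          [1, 41, 3.34, 60, 1]])         # beta(5.9, 1.7)
      endog_prior = np.array([[1.1, 8.5],
                              [5.9, 1.7]])

      # ML fit on the augmented data approximates the posterior mode
      endog = np.vstack([endog_obs, endog_prior])
      exog = np.vstack([X, X_prior])
      fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
      print(fit.params)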

  • Example: Posterior probability estimate

  • Kernel smoothing

    Kernel estimation is a smoothing method that is not model based. To estimate a mean (such as a cell probability) at a particular point, it smooths the data by using not only the data at that point (such as a sample proportion) but also the data at other points.

    At any value x, the kernel-smoothed estimate of P(Y = 1|X = x) is

    π̃(x) = { ∑i yi φ[(x − xi)/λ] } / { ∑i φ[(x − xi)/λ] }

    where

    I φ(·) is a symmetric unimodal kernel function

    I If φ is a constant, the estimate is the overall sample proportion of successes

    I If φ(u) = 1 when u = 0 and 0 otherwise, the estimate at an observed point xi is the sample proportion of successes at xi. Then there is no smoothing.

  • Kernel smoothing

    I λ is the smoothing parameter. Consider φ(u) = exp(−u²/2):

    I For very small λ, only points close to x have much weight. Then the estimate at x uses mainly very local data. There is little bias but high variance.

    I As λ increases, more weight is given to points farther away; the smoother is more like the overall sample proportion, being more highly biased but with smaller variance.

    I λ controls the bias-variance tradeoff. Copas recommends selecting λ by plotting the resulting function for several values of λ, varying around a value equal to 10 times the average spacing of the x values. Other ways of selecting λ include cross-validation and asymptotic theory.
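
    A minimal sketch of the kernel-smoothed estimator π̃(x) above with the Gaussian kernel φ(u) = exp(−u²/2) (the data and bandwidth here are made up):

      import numpy as np

      def kernel_smooth(x0, x, y, lam):
          """Kernel-smoothed estimate of P(Y = 1 | X = x0)."""
          w = np.exp(-0.5 * ((x0 - x) / lam) ** 2)     # phi((x0 - xi)/lambda)
          return np.sum(w * y) / np.sum(w)

      rng = np.random.default_rng(0)
      x = np.sort(rng.uniform(0, 10, 200))
      y = rng.binomial(1, 1 / (1 + np.exp(-(x - 5))))   # simulated binary responses

      lam = 10 * np.mean(np.diff(x))                    # Copas's suggestion: ~10x average spacing
      print([round(kernel_smooth(g, x, y, lam), 2) for g in np.linspace(0, 10, 6)])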

  • Example: Smoothing to portray probability of Kyphosis

  • Nearest Neighbors smoothing

    Let sij be a measure of similarity between subjects i and j, such as the inverse Euclidean distance between the values xi and xj of the explanatory variables for the two subjects. Then the nearest-neighbor estimate of P(Yi = 1|Xi = xi) is

    π̂i = { ∑j∈N(i) sij yj } / { ∑j∈N(i) sij }

    where N(i) is the set of k subjects who are the nearest neighbors of subject i.

    I k is the smoothing parameter, which can be determined by cross-validation

    I The method is simple to implement; however, it depends on the similarity measure, and its decision boundary can be highly irregular.
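
    A minimal sketch of this nearest-neighbor estimator for a single predictor, using inverse-distance similarity and excluding subject i itself from N(i) (the data and k are made up):

      import numpy as np

      def knn_smooth(i, x, y, k):
          """Nearest-neighbor estimate of P(Y_i = 1) with s_ij = 1/|x_i - x_j|."""
          d = np.abs(x - x[i])
          d[i] = np.inf                        # exclude subject i itself
          nbrs = np.argsort(d)[:k]             # the k nearest neighbors N(i)
          s = 1.0 / d[nbrs]                    # similarity weights
          return np.sum(s * y[nbrs]) / np.sum(s)

      rng = np.random.default_rng(0)
      x = rng.uniform(0, 10, 100)
      y = rng.binomial(1, 1 / (1 + np.exp(-(x - 5))))
      print(knn_smooth(0, x, y, k=10))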

  • Smoothing using penalized likelihood estimation

    The penalized likelihood estimator maximizes

    L∗(β) = L(β) − λ(β)

    Examples of roughness penalty terms are:

    I Two-way table:

    λ(π) = λ ∑i ∑j [log(πij πi+1,j+1 / (πi+1,j πi,j+1))]²

    I L2 penalty in logistic regression: λ(β) = λ ∑j βj²  (see the sketch after this list)

    I L1 penalty in logistic regression: λ(β) = λ ∑j |βj|

    I L0 penalty in logistic regression: λ(β) = λ ∑j I(βj ≠ 0)
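
    A minimal sketch of the L2-penalized fit using scikit-learn, whose LogisticRegression applies an L2 penalty by default with C playing the role of 1/λ (the data are simulated):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 5))
      beta_true = np.array([1.0, -0.5, 0.0, 0.0, 2.0])
      y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

      for C in [0.01, 1.0, 100.0]:           # small C = large lambda = heavy shrinkage
          fit = LogisticRegression(penalty="l2", C=C).fit(X, y)
          print(C, np.round(fit.coef_[0], 2))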

  • Smoothing using penalized likelihood estimation

    λ is the smoothing parameter controlling the tradeoff between variance and bias; it can be estimated with cross-validation or other methods.

  • Example: complete separation but finite logistic estimates

    For the data in Table 7.2 on risk factors for the histological grade of endometrial cancer, we considered the model

    logit[P(Y = 1)] = α + β1x1 + β2x2 + β3x3

    I The ML estimate is β̂1 = ∞; the 95% profile likelihood interval is (1.28, ∞)

    I The Firth penalized likelihood estimate is β̂1 = 2.93 (SE = 1.55); the 95% profile penalized likelihood confidence interval for β1 is (0.61, 7.85), a considerable shrinkage toward 0.

    I The Firth penalized likelihood for logistic regression uses the penalty

    λ(β) = −(1/2) log |J|

    where J is the information matrix; equivalently, it maximizes L(β) + (1/2) log |J|.

  • Generalized additive models (GAMs)

    GAMs generalize GLMs in the systematic component. The linear predictor in GLMs, g(µi) = ∑j βj xij, is replaced by

    g(µi) = ∑j sj(xij)

    where sj(·) is an unspecified smooth function of predictor j. A useful smooth function is the cubic spline. It consists of separate cubic polynomials over disjoint intervals, joined together smoothly at the boundaries of those intervals.

    Examples include the horseshoe crab data with a Poisson response and with a binary response.
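
    A minimal sketch of the cubic-spline idea for one predictor (not a full penalized GAM): build a truncated-power cubic spline basis and fit a binomial GLM on it. The data, knots, and function name are made up.

      import numpy as np
      import statsmodels.api as sm

      def cubic_spline_basis(x, knots):
          """Truncated-power basis: 1, x, x^2, x^3, and (x - k)_+^3 for each knot k."""
          cols = [np.ones_like(x), x, x**2, x**3]
          cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
          return np.column_stack(cols)

      rng = np.random.default_rng(0)
      x = rng.uniform(0, 10, 300)
      y = rng.binomial(1, 1 / (1 + np.exp(-np.sin(x))))   # nonlinear true logit

      B = cubic_spline_basis(x, knots=[2.5, 5.0, 7.5])
      fit = sm.GLM(y, B, family=sm.families.Binomial()).fit()
      print(np.round(fit.params, 2))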