Biostatistics-Lecture 10 Regression

Ruibin Xi

Peking University

School of Mathematical Sciences

Analysis of Variance (ANOVA)

• Consider the Iris data again

• Want to see if the average sepal widths of the three species are the same

– μ1 , μ2, μ3 : the mean sepal width of Setosa, Versicolor, Virginica

– Hypothesis:

H0: μ1 = μ2= μ3

H1: at least one mean is different

Analysis of Variance (ANOVA)

• Used to compare ≥ 2 means

• Definitions

– Response variable (dependent)—the outcome of interest, must be continuous

– Factors (independent)—variables by which the groups are formed and whose effect on response is of interest, must be categorical

– Factor levels—possible values the factors can take

Sources of Variation in One-Way ANOVA

• Partition the total variability of the outcome into components—source of variation

– the sepal width of the jth plant from the ith species (group)

jiy , jnjki 1,1

)()( yyyyyy iiijij

Grand mean The ith group mean

Sources of Variation in One-Way ANOVA

• SST: sum of squares total

• SSB: sum of squares between

• SSW (SSE): sum of squares within (error)

yySSWSSBSST

i ii yynSSB

F-test in one-way ANOVA

• The test statistic is called F-statistic

Under the null hypothesis, follows an F-distribution with

(df1,df2) = (k-1,n-k)

• For the Iris data – SSB=11.34, MSB = 5.67, SSE=16.96, MSE=0.12 – f = 49.16, df1=2,df2=147 – Critical value 3.06 at α=0.05, reject the null – Pvalue = P(F>f)=4.49e-17

One-way ANOVA

• ANOVA table

One-way ANOVA

• ANOVA table

ANOVA model

• The statistical model

Yij = μ + αi + eij

The ith response in the jth group

grand mean

The effect of group j

ANOVA assumptions

• Normality

• Homogeneity

• Independence

Regression—an example

• Cystic fibrosis (囊胞性纤维症) lung function data

– PEmax (maximal static expiratory pressure) is the response variable

– Potential explanatory variables

• age, sex, height, weight,

• BMP (body mass as a percentage of the age‐specific median)

• FEV1 (forced expiratory volume in 1 second)

• RV (residual volume)

• FRC(funcAonal residual capacity)

• TLC (total lung capacity)

• Let’s first concentrate on the age variable

• The model

• Plot PEmax vs age

• Let’s first concentrate on the age variable

• The model

• Plot PEmax vs age

Simple Linear regression

Assumptions

• Normality

– Given x, the distribution of y is normal with mean α+βx with standard deviation σ

• Homogeneity

– σ does not depend on x

• Independence

Residuals

Fitting the model

Goodness of Fit

Inference about β

Inference about β: the CF data

Plotting the regression line

Residual plot

• The CF patients data

Linear Regression

Summary: simple linear regression

Multiple regression

• See blackboard

GENERALIZED LINEAR MODELS Regression

Generalized liner models

• GLM allow for response distributions other than normal

– Basic structure

• The distribution of Y is usually assumed to be independent and

Generalized Linear model-an example

• An example

– A study investigated the roadkills of amphibian

• Response variable: the total number of amphibian fatalities per segment (500m)

• Explanatory variables

• An example

• For now, we are only interested in Distance to Natural Park

• An example

• For now, we are only interested in Distance to Natural Park

• An over-simplified model

• For Poisson we have

GLM-exponential families

• Exponential families

– The density

– Normal distributions is an exponential family

• Consider the log-likelihood of a general exponential families

• Differentiating the likelihood one more time

using the equation

We often assume

Define

Fitting the GLM

• In a GLM

• The joint likelihood is

• The log likelihood

Fitting the GLM

• Assuming ( is known)

• Differentiating the log likelihood and setting it to zero

Fitting the GLM

• By the chain rule

• Since

• We have

)( ii b

Canonical Link Function

• The Canonical Link Function is such that

• Remember that

• So

)( ii b

Iteratively Reweighted Least Squares (IRLS)

• We first consider fitting the nonlinear model

by minimizing

where f is a nonlinear function

• Given a good guess

by using the Taylor expansion

Define the pseudodata

• Note that in GLM, we are trying to solve

• If are known, this is equivalent to minimizing

• We are inspired to use the following algorithm

– At the kth iteration, define

• update as in the nonlinear model case

• set k to be k+1

– But the second step also involves iteration, we may perform one step iteration here to obtain

Deviance

• The deviance is defined as

where is the log likelihood with the saturated model: the model with one parameter per data point

Also note that deviance is defined to be independent of

Deviance

• GLM does not have R2

• The closest one is the explained deviance

• The over-dispersion parameter may be estimated by

or by the Pearson statistic

Model comparison

• Scaled deviance

• For the hypothesis testing problem

under H0

Residuals

• Pearson Residuals

– Approximately zero mean and variance

• Deviance Residuals

Negative binomial

• Density

Biostatistics-Lecture 10 Regression

Documents

Transcript of Biostatistics-Lecture 10 Regression

UVA CS 4774: Machine Learning Lecture 5: Linear Regression ...

Lecture 5 Hypothesis Testing in Multiple Linear Regression · Lecture 5 Hypothesis Testing in Multiple Linear Regression BIOST 515 January 20, 2004

PUBH 8403: Research Skills in Biostatistics [10pt ...brad/8400/Wolfson_simulationstudy.pdf · PUBH 8403: Research Skills in Biostatistics Simulation Studies & Reproducibility Julian

Lecture 6: Perceptron & Logistic Regression

Biostatistics Case Studies

Simple Linear Regression II Lecture 2

Econometrics: Multiple Linear Regression - — OCW - …ocw.uc3m.es/economia/econometrics/lecture-notes-1/Topic3...The Multiple Linear Regression Model I Many economic problems involve

Lecture 13. Inference for regression

Lecture 19 Multiple (Linear) Regression - Statistical Science · Lecture 19 Multiple (Linear) Regression Thais Paiva STA 111 - Summer 2013 Term II August 1, 2013 1/30 Thais Paiva

Karl Bang Christensen Dept. of Biostatistics Univ. of ...

Lecture 7 Logistic Regression with Random Intercept

Lecture 7: Assessing Regression Studies

Stat 101: Lecture 6fei/Teaching/stat101/Lect/handout6.pdf · Stat 101: Lecture 6 Summer 2006. Outline Review and Questions Example for regression Transformations, Extrapolations,

Logistic Regression - legacydirs.umiacs.umd.edulegacydirs.umiacs.umd.edu/~jbg/teaching/CSCI_3022/10c.pdf · Logistic Regression: Regression for outputting Probabilities Intuitions

Econometrics - Lecture 2 Introduction to Linear Regression – Part 2.

Lecture 4: Multivariate Regression Model in Matrix Formyamanota/Lecture Note 4 to 7 OLS.pdf · 1 Takashi Yamano Lecture Notes on Advanced Econometrics Lecture 4: Multivariate Regression

Lecture 4: Regression ctd and multiple classesaz/lectures/ml/lect4.pdf · Lecture 4: Regression ctd and multiple classes C19 Machine Learning Hilary 2015 A. Zisserman • Regression

Lecture 7 Logistic Regression with Random · PDF file• In logistic regression the error is assumed to have a logistic cumulative density function given x, εi Pr( ) ... Logistic

Lecture 10 Introduction to Linear Regression and Correlation Analysis.

Bayesian Biostatistics Using BUGSjbn/courses/bugs2/... · Bayesian Biostatistics Using BUGS (3) 3.13 Bayesian Biostatistics Using BUGS 5.4. ΠΑΡΑ∆ΕΙΓΜΑΤΑ ΣΤΟ BUGS Department