Stat 101: Lecture 6

Summer 2006

Outline

Review and Questions

Example for regression

Transformations, Extrapolation, and Residuals

Review

Mathematical model for regression

I Each point (Xi , Yi) in the scatterplot satisfies:

Yi = a + bXi + εi

I εi ∼ N(0, sd = σ), where σ is usually unknown. The ε’s have nothing to do with one another (they are independent); e.g., a big εi does not imply a big εj.

I We know the Xi’s exactly. This implies that all error occurs in the vertical direction.
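To make the model concrete, here is a minimal simulation sketch. The values of a, b, and σ and the X values are hypothetical, chosen only for illustration; the function name `simulate_regression` is likewise an assumption.

```python
import random

def simulate_regression(xs, a, b, sigma, seed=0):
    """Generate Yi = a + b*Xi + eps_i with independent eps_i ~ N(0, sd = sigma).

    The Xi are treated as known exactly, so all randomness is vertical.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [a + b * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical inputs, not taken from the handout
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate_regression(xs, a=2.0, b=0.5, sigma=1.0)

# The errors are the vertical deviations from the true line
errors = [y - (2.0 + 0.5 * x) for x, y in zip(xs, ys)]
```

Setting `sigma=0.0` puts every point exactly on the line, which is a quick way to see that ε is the only source of scatter in this model.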

Estimating the regression line

ei = Yi − (a + bXi) is called the residual. It measures the vertical distance from a point to the regression line. One estimates a and b by minimizing

$$ f(a, b) = \sum_{i=1}^{n} \bigl(Y_i - (a + bX_i)\bigr)^2 $$

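Taking the derivatives of f(a, b) with respect to a and b and setting them to 0, we get

$$ a = \bar{Y} - b\bar{X}, \qquad b = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\,\bar{Y}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2} $$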
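The intermediate steps are not spelled out on the slide. One way to fill them in, using the same symbols, is:

```latex
\begin{align*}
\frac{\partial f}{\partial a} &= -2\sum_{i=1}^{n}\bigl(Y_i - a - bX_i\bigr) = 0
  \;\Longrightarrow\; a = \bar{Y} - b\bar{X}, \\
\frac{\partial f}{\partial b} &= -2\sum_{i=1}^{n} X_i\bigl(Y_i - a - bX_i\bigr) = 0
  \;\Longrightarrow\; \frac{1}{n}\sum_{i=1}^{n} X_iY_i - a\bar{X} - b\,\frac{1}{n}\sum_{i=1}^{n} X_i^2 = 0 .
\end{align*}
% Substitute a = \bar{Y} - b\bar{X} into the second equation and collect the terms in b:
\[
\frac{1}{n}\sum_{i=1}^{n} X_iY_i - \bar{X}\,\bar{Y}
  = b\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2\right).
\]
```

Dividing both sides of the last line by the bracketed term gives the expression for b.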

f(a, b) is also referred to as the Sum of Squared Errors (SSE).
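The estimators can be computed directly from these formulas. The sketch below assumes the data arrive as two plain Python lists; the function names `fit_line` and `sse` and the toy data are illustrative, not part of the lecture.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b from the closed-form formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n   # (1/n) * sum of Xi*Yi
    mean_xx = sum(x * x for x in xs) / n               # (1/n) * sum of Xi^2
    b = (mean_xy - x_bar * y_bar) / (mean_xx - x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

def sse(xs, ys, a, b):
    """Sum of Squared Errors f(a, b) for a candidate line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Toy data lying exactly on Y = 1 + 2X (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
```

For data that lie exactly on a line, the fitted line reproduces it and the SSE is zero; any other choice of a and b gives a larger SSE.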

An example

A biologist wants to predict brain weight from body weight, based on a sample of 62 mammals. A scatterplot of the data is shown on the original slide. Ecological correlation?

I The regression equation is,

Y = 90.996 + 0.966X

I The correlation is 0.9344, but it is heavily influenced by a few outliers.

I The sd of the residuals is 334.721. This stands for the typical distance of a point to the regression line in the vertical direction.

I Under the “Parameter Estimates” portion of the printout, the last column (the p-values) tells whether the intercept and slope are significantly different from 0. Small numbers indicate significant differences; values less than 0.05 are usually taken to indicate real differences from zero, as opposed to chance errors.

I The root mean square error (RMSE) is the standard deviation of the vertical distances between each point and the estimated line. It is an estimate of the standard deviation of the vertical distances between the observations and the true line. Formally,

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - (a + bX_i)\bigr)^2} $$
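A small sketch of the RMSE calculation follows. It divides by n, exactly as in the formula above; regression software commonly divides by n − 2 instead, so printed values can differ slightly. The function name and toy data are assumptions for illustration.

```python
import math

def rmse(xs, ys, a, b):
    """Root mean square error of the line a + b*x, dividing by n."""
    n = len(xs)
    squared = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / n)

# Hypothetical data and line: residuals are 0, 0, -1, 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]
value = rmse(xs, ys, a=1.0, b=2.0)   # sqrt((0 + 0 + 1 + 1) / 4)
```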

I Note that a + bXi is the mean of the Y values at Xi.

I The regression line predicts the average value for the Y values at a given X.

I In practice, one wants to predict the individual value for a particular value of X; e.g., if my weight is 50 kg, then how much would my brain weigh?

The prediction (in g), using the untransformed regression equation, is

Y = a + bX = 90.996 + 0.966 × 50 ≈ 139.3

I But this is just the average for all mammals who weigh as much as I do.

I The individual value is less exact than the average value. To predict the average value, the only source of uncertainty is the exact location of the regression line (i.e., a and b are estimates of the true intercept and slope).

I In order to predict my brain weight, the uncertainty about my deviation from the average is added to the uncertainty about the location of the line.

I For example, if I weigh 50 kg, then my brain should weigh 139.3 g + ε. Assuming the regression model is correct, ε has a normal distribution with mean zero and standard deviation 334.721.

I Note: with this model, my brain could easily have “negative” weight. This could make us question the regression assumptions.
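How easily could the weight be “negative”? The sketch below plugs the fitted line Y = 90.996 + 0.966X and the residual sd 334.721 from this example into a normal distribution. It ignores the uncertainty in a and b, so it is only a rough illustration; the function names are assumptions.

```python
from statistics import NormalDist

A, B = 90.996, 0.966      # fitted intercept and slope from the example
RESIDUAL_SD = 334.721     # sd of the residuals from the example

def predicted_mean(body_kg):
    """Average brain weight (g) the line predicts at this body weight (kg)."""
    return A + B * body_kg

def prob_negative(body_kg):
    """P(Y < 0) if Y ~ N(predicted mean, sd = RESIDUAL_SD)."""
    return NormalDist(mu=predicted_mean(body_kg), sigma=RESIDUAL_SD).cdf(0.0)

mean_50 = predicted_mean(50)
p_neg = prob_negative(50)
```

Under these assumptions the model gives roughly a one-in-three chance of a negative brain weight at 50 kg, which is a strong hint that the untransformed model fits poorly.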

Transformations

I The scatterplot of brain weight against body weight showed that the line was probably controlled by a few large values (high-leverage points). Even worse, the scatterplot did not resemble the football-shaped point cloud that supports the regression assumptions listed before.

I In cases like this, one can consider transforming the response variable, the explanatory variable, or both. For these data, consider taking the logarithm (base 10) of the brain weight and the body weight.

I The scatterplot is much better.

I Taking the log shows that the outliers are not surprising. The regression equation is now:

log Y = 0.908 + 0.763 log X

I Now 91.23% of the variation in log brain weight is explained by log body weight. Both the intercept and the slope are highly significant. The estimated sd of ε is 0.317 (on the log10 scale). This is the typical vertical distance between a point and the line.

I Making transformations is an art. Here the analysis suggests that, since 10^0.908 ≈ 8.1,

Y = 8.1 × X^0.763

I So there is a power-law relationship between brain mass and body mass.
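The step from the log-scale line to the power law can be checked numerically. This sketch assumes, as in the earlier example, that X is body weight in kg and Y is brain weight in g; the function name is illustrative.

```python
import math

LOG_A, LOG_B = 0.908, 0.763   # intercept and slope of the log10-log10 fit

def predict_brain_g(body_kg):
    """Back-transform the fitted line: log10 Y = LOG_A + LOG_B * log10 X."""
    log_y = LOG_A + LOG_B * math.log10(body_kg)
    return 10 ** log_y

# Undoing the log turns the intercept into a multiplier and the slope into an exponent
multiplier = 10 ** LOG_A      # rounds to the 8.1 used in the power law
direct = multiplier * 50 ** LOG_B
via_logs = predict_brain_g(50)
```

Both routes give the same prediction, and the prediction is always positive, which avoids the negative brain weights that the untransformed model allowed.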

Extrapolation

I Predicting Y values for X values outside the range of X values observed in the data is called extrapolation.

I This is risky, because you have no evidence that the linear relationship you have seen in the scatterplot continues to hold in the new X region. Extrapolated values can be entirely wrong.

I It is unreliable to predict the brain weight of a blue whale or the hog-nosed bat.
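One simple safeguard is to record the range of X values in the data and refuse to predict outside it. The sketch below is an illustration only: the function name, the line, and the range are hypothetical and are not taken from the mammal data.

```python
def predict_in_range(x, a, b, x_min, x_max):
    """Return a + b*x, but refuse to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "the fitted line is not supported by data there"
        )
    return a + b * x

# Hypothetical line and observed range
inside = predict_in_range(2.0, a=1.0, b=2.0, x_min=0.0, x_max=10.0)
```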

Residuals

I Estimate the regression line (using JMP software or by calculating a and b by hand).

I Then find the difference between each observed Yi and the predicted value Ŷi from the fitted line. These differences are called the residuals.

I Plot each difference against the corresponding Xi value. This plot is called a residual plot.

I If the assumptions for linear regression hold, what should one see in the residual plot?

I If the pattern of the residuals around the horizontal line at zero is:

    I curved, then the assumption of linearity is violated.
    I fan-shaped, then the assumption of constant sd is violated.
    I filled with many outliers, then again the assumption of constant sd is violated.
    I patterned (e.g. positive, negative, positive, negative), then the assumption of independent errors is violated.

I When the residuals have a histogram that looks normal and when the residual plot shows no pattern, then we can use the normal distribution to make inferences about individuals.
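The residual calculation described above can be sketched in a few lines. The toy data below are hypothetical (they follow Y = X², so a straight line is the wrong model), and the line Y = −7 + 6X is the least-squares fit for these five points.

```python
def residuals(xs, ys, a, b):
    """Observed Y minus the value predicted by the line a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical curved data: Y = X^2
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]

res = residuals(xs, ys, a=-7.0, b=6.0)

# Pairs (Xi, ei) are what one would plot in a residual plot
plot_points = list(zip(xs, res))
signs = ["+" if e > 0 else "-" for e in res]
```

Here the residuals are positive at both ends and negative in the middle, the curved pattern that signals a violation of linearity.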

I Suppose we do not make the log transformation. What percentage of 20-kilogram mammals have brains that weigh more than 180 grams?

I The regression equation says that the mean brain weight for 20-kilogram animals is 90.996 + 0.966 × 20 ≈ 110.3 g. The sd of the residuals is 334.721. Under the regression assumptions, the 20-kilogram mammals have brain weights that are normally distributed with mean 110.3 and sd 334.721.

I The z-transformation is (180 − 110.3) / 334.721 ≈ 0.208. From the table (using z ≈ 0.20, for which the area between −0.20 and 0.20 is 15.85%), the area under the curve to the right is (100 − 15.85) / 2 = 42.075%, or about 42%.
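The same calculation can be done without a table. This sketch uses Python's standard-library normal distribution together with the numbers from the example; the variable names are illustrative.

```python
from statistics import NormalDist

A, B = 90.996, 0.966      # untransformed regression line from the example
RESIDUAL_SD = 334.721     # sd of the residuals

mean_20 = A + B * 20                   # mean brain weight (g) at 20 kg
z = (180 - mean_20) / RESIDUAL_SD      # z-transformation of 180 g
tail = 1 - NormalDist().cdf(z)         # area to the right of z
```

The exact tail area comes out close to 42%, a little under the table-based figure because the table lookup rounds z down to 0.20.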

Midterm I Instruction

I We will have Midterm I on Thursday, July 13th.

I The exam is 12:30pm - 2:30pm. Do not be late!

I Office hour: 10:00am - 12:00pm, Wednesday, July 12th, 211 Old Chem.

I The exam will cover all the material we have discussed so far.

I The exam is open book and open lecture notes. You can use a laptop if you wish. If you choose to type, you should manage to send your answer as an attachment to my [email protected] by 2:30pm. Otherwise, the answer is not acceptable.

I The questions are expected to be similar to the exercises / review exercises / quiz 1. You should be able to finish the exam in 2 hours. When time is up, put your pens / pencils down while I am collecting the answers. Otherwise, you will get a score of 0.

Designed Experiments and Observational Studies

I Double-blind, randomized, controlled study versus observational studies.

I Drug-placebo study. Lung cancer and smoking.

I Association does not imply causation.

I Confounding factors.

I Subgroup study or weighted average can help to understand the confounding factors.

Descriptive Statistics

I Central tendency: mean, median (quantile, percentile), mode.

I Dispersion: standard deviation, range, IQR.

I Histograms, boxplots, and scatterplots.
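As a quick refresher, the summaries above can be computed with Python's standard library. The data are made up for illustration. The sketch uses the population sd (dividing by n), and quartile conventions vary between textbooks and software, so the IQR may differ slightly elsewhere.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10]   # hypothetical sample

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
sd = statistics.pstdev(data)                    # population sd (divides by n)
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)    # quartiles, default method
iqr = q3 - q1
```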

Normal Distributions

I Use of the normal table.

I For a normal distribution, the probability that you observe a value within 1 sd of the mean is 68%, within 2 sd is 95%, and within 3 sd is 99.7%.

I Use of the z-transformation.

I Always draw pictures.
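The 68% / 95% / 99.7% figures can be checked against the standard normal distribution. A minimal sketch using the standard library follows; the function name is illustrative.

```python
from statistics import NormalDist

def prob_within(k):
    """Probability that a normal value falls within k sd of its mean."""
    std_normal = NormalDist()   # mean 0, sd 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1, within_2, within_3 = prob_within(1), prob_within(2), prob_within(3)
```

The same function answers table-style questions: the area to the right of z is `(1 - prob_within(z)) / 2` for any z > 0.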

Correlation

I Correlation r measures the linear association between two variables.

I Calculate the correlation by z-transformation.

I r^2 is the coefficient of determination. It is the proportion of the variation in Y that is explained by X.

I No linear association does not imply no association. And association is not causation.

I Ecological correlation may be misleading.
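Calculating r by z-transformation means converting both variables to standard units and averaging the products. The sketch below assumes the sd divides by n; the function name and the toy data are illustrative.

```python
import math

def correlation(xs, ys):
    """r = average of the products of the z-scores of X and Y."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    zx = [(x - x_bar) / sd_x for x in xs]
    zy = [(y - y_bar) / sd_y for y in ys]
    return sum(u * v for u, v in zip(zx, zy)) / n

# A perfect line gives r = 1, so r^2 = 1 (all variation in Y explained by X)
r_line = correlation([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Y = X^2 on a symmetric range: a strong association, but no linear association
r_curve = correlation([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
```

The second example has r = 0 even though Y is completely determined by X, which illustrates that no linear association does not imply no association.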

Regression

I Fit the “best” line to the data.

I Regression effect in test-retest example.

I The formula for regression is,

Yi = a + bXi + εi

We are assuming εi ∼ N(0, sd = σ), and the ε’s are independent.

I Residuals: ei = Yi − (a + bXi).

I Find the regression line by minimizing the Sum of Squared Errors (SSE).

$$ f(a, b) = \sum_{i=1}^{n} \bigl(Y_i - (a + bX_i)\bigr)^2 $$

I The Least Squares Estimators (LSE) are,

$$ a = \bar{Y} - b\bar{X}, \qquad b = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\,\bar{Y}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2} $$

And the estimated residuals are ei = Yi − (a + bXi).

I Data transformation.

I Extrapolation is risky.
