
Psych 5510/6510

Chapter Nine: Outliers and Data Having Undue Influence

Spring, 2009


Effect of Outliers

Set 1: 1, 3, 5, 9, 14
Sample mean = est. μ = 6.4, MSE = S² = 26.8
Confidence interval for the mean: 0 ≤ μ ≤ 12.8

Set 2: 1, 3, 5, 9, 140
Sample mean = est. μ = 31.6, MSE = S² = 3680.8
Confidence interval for the mean: -43.7 ≤ μ ≤ 106.9

1. The parameter estimate is greatly affected.
2. The confidence interval based on the second data set is much wider, making it less likely that we reject the null hypothesis.
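As a quick check of these numbers, here is a small sketch (not part of the original slides) that reproduces both confidence intervals with scipy; the data sets and the 95% level are taken from the slide above.

```python
# Sketch: mean, MSE (S^2), and 95% confidence interval for each data set.
import numpy as np
from scipy import stats

for data in ([1, 3, 5, 9, 14], [1, 3, 5, 9, 140]):
    y = np.array(data, dtype=float)
    n = len(y)
    mean = y.mean()
    mse = y.var(ddof=1)                      # S^2, the unbiased sample variance
    se = np.sqrt(mse / n)                    # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)    # two-tailed 95% critical value
    print(f"{data}: mean = {mean:.1f}, MSE = {mse:.1f}, "
          f"95% CI = [{mean - t_crit * se:.1f}, {mean + t_crit * se:.1f}]")
# These agree with the slide's intervals (up to rounding).
```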


Causes of Outliers

1. Error in measurement or inputting the score, i.e. the score does not measure the variable. Could be a data inputting error, or the subject not understanding the instructions.

2. The sample is essentially being drawn from two distinct populations, with one population having a greater frequency. For example, sampling from a gym class consisting mainly of average students plus a few members of the track team.

3. Sampling from an error distribution with ‘thick tails’ (one that has a greater than normal chance of producing a score far from the mean).


1) Error in Measurement or Inputting the Score

This is less likely to be caught when entering a series of data for the computer to analyze than when computing the analysis yourself with a calculator.


2) Outliers due to non-homogeneous set

Discovering that more than one kind of thing is being measured (e.g. that more than one population is appearing in the group) can be very interesting.

Rather than simply getting rid of such outliers, we need to identify them so they can be examined and, possibly, so that the existence of two different populations can be incorporated into the model.


3) Outliers due to thick tails

Thick tails lead to more frequent extreme scores and greater error variance than sampling from a normal distribution.

(The taller curve with thinner tails is normally distributed)


Example of Accidentally Reversing Scores

          Non-Reversed Data        Reversed Data
Student   SAT    HSRANK            SATR   HSRANKR
1         42     90                42     90
2         48     87                48     87
3         58     85                58     85
4         45     79                45     79
5         45     90                45     90
6         48     86                86     48
7         51     83                51     83
8         56     99                56     99
9         51     81                51     81
10        58     94                58     94
11        42     86                42     86
12        55     99                55     99
13        61     99                61     99

          Non-Reversed Data        Reversed Data
b0        5.95                     96.55
b1        0.50                     -0.50
PRE       0.29                     0.33
F*1,11    4.53                     5.35

Note the error in data entry for student #6. PRE and F* don’t change much, but there is a huge difference in the parameter estimates, including a reversal in the direction of the slope.
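As a sketch (not from the slides; the variable names are my own), fitting the simple regression to the non-reversed and the reversed data with scipy reproduces the values above:

```python
# Sketch: regress SAT on HSRANK for both versions of the data and report
# b0, b1, PRE, and F*(1, 11); observation 6 carries the data-entry error.
import numpy as np
from scipy import stats

sat    = np.array([42, 48, 58, 45, 45, 48, 51, 56, 51, 58, 42, 55, 61], dtype=float)
hsrank = np.array([90, 87, 85, 79, 90, 86, 83, 99, 81, 94, 86, 99, 99], dtype=float)
sat_r, hsrank_r = sat.copy(), hsrank.copy()
sat_r[5], hsrank_r[5] = 86, 48            # the reversed scores for student #6

for label, x, y in [("non-reversed", hsrank, sat), ("reversed", hsrank_r, sat_r)]:
    fit = stats.linregress(x, y)
    pre = fit.rvalue ** 2                 # PRE for one predictor equals r^2
    f_star = pre / ((1 - pre) / 11)       # F*(1, 11) = PRE / ((1 - PRE) / (n - 2))
    print(f"{label}: b0 = {fit.intercept:.2f}, b1 = {fit.slope:.2f}, "
          f"PRE = {pre:.2f}, F* = {f_star:.2f}")
# The slope flips from about +.5 to -.5 while PRE and F* barely change.
```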


Original Data

[Scatter plot: SAT (y-axis, 40 to 90) vs. HSRANK (x-axis, 70 to 100), with the fitted regression line.]


With Outlier

[Scatter plot of the reversed data: SAT_R (y-axis, 40 to 90) vs. HSRANK_R (x-axis, 40 to 100), with the fitted regression line.]

Note the reversal of the slope and the big change in the intercept, yet the model is still statistically significant; without looking at the graph you might think that you have a pretty good model.


Identifying Outliers

1. Is X unusual?

Leverage

2. Is Yi unusual?

Discrepancy

3. Would omission of the observation produce a dramatic change in the model (i.e., in the values of the parameter estimates b0, b1, b2, ...)? This happens when both X and Y are unusual.

Influence


Identifying Outliers

1. Using graphs to look at your data should be your first and last resort. This gets increasingly complicated as the number of predictor variables increases, and it is nice to have some criteria for when to worry, so...

2. There are many, many approaches for using statistical analyses to flag potential outliers (which you can then look at with graphs to see what is going on).


Leverage

Leverage involves determining whether any particular observation has unusual values for X.

The approach we will use involves looking at the ‘lever’ that goes with each observation.


Levers

Buried within the regression equation is the fact that all of the X and all of the Y scores in the data set go into computing the values of the b’s. This means that for any one observation, the X scores of all of the observations influence its predicted value of Y.

Ŷi = b0 + b1X1i + b2X2i + … + bpXpi


Levers

In a much more complicated but equivalent version of the regression equation you plug in all of the X scores for all of the observations to predict a particular value of Y.

Alternative regression equation:

Ŷi= a complicated formula that includes all of the X scores in the data set, not just the X score for observation ‘i’.


The Basis of Using Levers

If an observation has unusual X scores then its predicted value of Y is not very strongly influenced by the X scores of the other observations, instead its prediction is influenced mainly by its own X scores.

If, on the other hand, an observation has X scores similar to those of the other observations, then its predicted value of Y is influenced not only by its own X scores but also by those of the other observations.


Levers

This gives us a way of determining whether an observation has unusual X scores. If its predicted value of Y is heavily influenced by its own X scores then those X scores must have been unusual. If its predicted value of Y is influenced by the X scores of other observations then its X scores must have been similar to those of the other observations.


Levers

A ‘lever’ (symbolized as hii) is a measure of how much an observation’s own X scores influence its predicted value of Y (it measures how much leverage its own X scores had in the prediction). If the observation has unusual X scores (compared to other observations) then its lever is a large value; if the observation has X scores similar to other observations then its lever is a smaller value.


Interpreting levers

0 ≤ hii ≤ 1,  with mean hii = PA/n

Levers will always have a value between 0 and 1. The mean (expected) value of a lever, if its values of X conform to those of the other observations, is PA/n. If an observation has a lever much greater than that, then it is a flag that it has some unusual X values.


Interpreting levers (cont)

The bigger the lever, the more its X scores stood out as different from the rest. How big does a lever have to be to draw our attention?

1. If the value of the lever is 2 or 3 times the mean of the levers (see formula on the previous slide for the mean of the levers) then it deserves special attention, or,

2. The value of 1/hii can be thought of as roughly equivalent to how many observations go into predicting that value of Y. So if hii is the maximum value of 1, then 1/hii = 1, and we can say that the X scores from just one observation went into making that prediction (i.e., itself). If hii = .1, then 1/hii = 10 and we can say that the equivalent of ten observations went into making that prediction. If N is large, then any value of 1/hii ≤ 5 should grab our attention, as that predicted value of Y was based upon the X scores of the equivalent of only 5 or fewer observations.


Levers Table

Student  SAT  HSRANK  Lever  1/Lever

1 42 90 .08 12.5

2 48 87 .08 12.5

3 58 85 .08 12.5

4 45 79 .10 10.0

5 45 90 .08 12.5

6 86 48 .76 1.30

7 51 83 .08 12.5

8 56 99 .15 6.70

9 51 81 .09 11.1

10 58 94 .11 9.10

11 42 86 .08 12.5

12 55 99 .15 6.70

13 61 99 .15 6.70

Expected (mean) value of the levers is PA/n = 2/13 = .15
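To reproduce the Lever column, here is a sketch (my own setup, not the course’s SPSS output): the levers are the diagonal of the hat matrix H = X (X'X)^(-1) X', where X is the design matrix [1, HSRANK] built from the table above (observation 6 carries the reversed value, 48).

```python
# Sketch: levers h_ii as the diagonal of the hat matrix for a one-predictor model.
import numpy as np

hsrank = np.array([90, 87, 85, 79, 90, 48, 83, 99, 81, 94, 86, 99, 99], dtype=float)
X = np.column_stack([np.ones_like(hsrank), hsrank])   # PA = 2 parameters (intercept, slope)
H = X @ np.linalg.inv(X.T @ X) @ X.T                   # hat matrix
levers = np.diag(H)

print(np.round(levers, 2))                     # observation 6 stands out at about .76
print(np.round(1 / levers, 1))                 # the 1/Lever column
print("mean lever:", round(levers.mean(), 2))  # always PA/n = 2/13, about .15
```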


Discrepancy

Discrepancy involves determining whether any particular observation has unusual values for Y.

Q: Is Yi unusual with respect to what? A: With respect to the model.

In other words, look for observations that differ greatly from the regression line (giving them a large error). An examination of error terms is referred to as an ‘analysis of the residuals’.

ei = Yi − Ŷi


Two problems

1) The magnitude of an error depends upon the scale; e.g., an error of 12 is large when predicting the number of children in a household, but it is a small error when predicting the weight of a car in ounces.

2) Outliers grab the model (particularly with small n), greatly influencing the regression line in the model’s attempt to reduce squared error. Thus an unusual Y tends to make the regression line move towards it to reduce error.


Discrepancy

Approach: if an observation is unusual (way off the regression line that would fit all of the other observations), then creating a parameter just to handle it should greatly reduce error (visually, think of the original regression line being freed from the pull of the outlier; look back at the original scatter plots with and without the outlier). We will start by looking at how that works with our outlier.


Discrepancy

Approach:

Model C is the original model, in this case:

Ŷi = b0 + b1X1i

For Model A, add another variable (X2) that has a score of X2 = 0 everywhere except at the outlier, where X2 = 1:

Ŷi = b0 + b1X1i + b2X2i

If PRE is significant, then it was worthwhile to handle that one outlier individually in the model, i.e. it doesn’t belong with the other scores.


Data With Dummy Variable

ID   SAT   HSRANK   dummy
1    42    90       0
2    48    87       0
3    58    85       0
4    45    79       0
5    45    90       0
6    86    48       1
7    51    83       0
8    56    99       0
9    51    81       0
10   58    94       0
11   42    86       0
12   55    99       0
13   61    99       0

The dummy variable is there to handle observation #6.


Studentized Deleted Residual

Model C: SAT = 96.55 - .50(HSRANK)

Model A: SAT = 6.71 + .50(HSRANK) + 55.49(Dummy)

PRE = 0.68, F*1,10 = 21.4, t* = 4.6, p < .01

Thus it was worthwhile to introduce a dummy variable to account for the outlier.

Stat programs will do this for each observation, one at a time, to determine whether or not it is worthwhile to create a dummy variable to handle just that observation. They report this as the Studentized Deleted Residual, which is the square root of the value of F* above.


ID   SAT   HSRANK   PRE    t
1    42    90       0.10   -1.0
2    48    87       0.03   -0.5
3    58    85       0.01    0.4
4    45    79       0.15   -1.3
5    45    90       0.05   -0.7
6    86    48       0.68    4.6
7    51    83       0.02   -0.4
8    56    99       0.08    0.9
9    51    81       0.03   -0.5
10   58    94       0.07    0.9
11   42    86       0.14   -1.3
12   55    99       0.06    0.8
13   61    99       0.20    1.6

Do this for each observation: compare a model with just HSRANK (X1) to one that also has a dummy variable (X2) just to handle that one value of Y, and report the t value for the PRE.
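The same per-observation procedure can be sketched as follows (my own variable names; the course output comes from SPSS): for each observation, fit Model A with a dummy that singles out that observation, and report the t for the dummy coefficient, which is the studentized deleted residual in the table above.

```python
# Sketch: studentized deleted residuals via the one-observation-dummy approach.
import numpy as np
import statsmodels.api as sm

sat_r    = np.array([42, 48, 58, 45, 45, 86, 51, 56, 51, 58, 42, 55, 61], dtype=float)
hsrank_r = np.array([90, 87, 85, 79, 90, 48, 83, 99, 81, 94, 86, 99, 99], dtype=float)

for i in range(len(sat_r)):
    dummy = np.zeros_like(sat_r)
    dummy[i] = 1.0                                   # X2 = 1 only for observation i
    X = sm.add_constant(np.column_stack([hsrank_r, dummy]))
    fit = sm.OLS(sat_r, X).fit()                     # Model A for this observation
    print(f"obs {i + 1}: t = {fit.tvalues[2]:.1f}")  # t for the dummy term
# Observation 6 gives t of about 4.6, as in the table above.
```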


The problem of error rate

Problem: If each t test has a .05 chance of making a Type I error, then the overall error rate is too large with this approach.

Solutions:
1. Use α/n as your alpha (.05/13 = .0038 for this example), but note that p values are not provided by SPSS, so,
2. Use the test only to draw your attention to possible outliers: look for |t| values greater than 2 (worth a look), 3 (careful attention), or 4 (alarm bells).
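For option 1, a small sketch (my own, not from the slides) of the per-test alpha and the corresponding two-tailed critical |t|, assuming the df = 10 error degrees of freedom from the Model A test on the earlier slide:

```python
# Sketch: Bonferroni-style per-test alpha and its critical |t| for this example.
from scipy import stats

n_tests, alpha = 13, 0.05
alpha_per_test = alpha / n_tests                      # .05 / 13, about .0038
t_crit = stats.t.ppf(1 - alpha_per_test / 2, df=10)   # two-tailed critical value
print(round(alpha_per_test, 4), round(t_crit, 2))     # about 0.0038 and roughly 3.7
```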


Influence

The third approach to identifying unusual scores is to see whether dropping the score would dramatically change the model; this is known as influence.

Procedure: compare the estimates of the parameters in the model with the outlier, to the estimates of the parameters in the model without the outlier.


With Outlier

[Scatter plot of the reversed data: SAT_R (y-axis, 40 to 90) vs. HSRANK_R (x-axis, 40 to 100), with the fitted regression line.]


Without Outlier

[Scatter plot of the original data: SAT (y-axis, 40 to 90) vs. HSRANK (x-axis, 70 to 100), with the fitted regression line.]


Looking for Influence

"k''n observatiowithout " means where

XbbY

XbbY

[k]

1i10[k]i,

1i10i

We want to compare the following:

If deleting the k’th observation greatly changes the values of the b’s, then it must have been having a large influenceon the values when it was included, as can be seen in the previous 2 slides, omitting the outlier greatly changes the b’s (notice how the slope and intercept both changed).


Looking for Influence

"k''n observatiowithout " means where

XbbY

XbbY

[k]

1i10[k]i,

1i10i

We want to compare the following:

If the values of the b’s change in the two models then the predictions made by the two models will also change. The easiest way to see if the models differ is by comparing their predictions, specifically looking at (Ŷi - Ŷi,[k]) for each observation. As you might guess, to see the total difference between the two models across all observations we will use:

2

][,ˆˆkii YY


Cook’s D

Cook’s D (distance):

Dk = Σ (Ŷi − Ŷi,[k])² / (PA · MSE)

The size of D is influenced by how much the predictions change when observation k is removed. It ends up that you get the greatest change in predicted values when both: a) the X value is unusual (leverage), and b) the Y value is unusual (discrepancy). So Cook’s D is largest when both occur.
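A sketch (my own setup) that computes Cook’s D directly from this definition for the reversed data set used in the earlier slides; in practice you would read these values off your stat package’s influence diagnostics.

```python
# Sketch: Cook's D from its definition, D_k = sum_i (Yhat_i - Yhat_i[k])^2 / (PA * MSE).
import numpy as np
import statsmodels.api as sm

sat_r    = np.array([42, 48, 58, 45, 45, 86, 51, 56, 51, 58, 42, 55, 61], dtype=float)
hsrank_r = np.array([90, 87, 85, 79, 90, 48, 83, 99, 81, 94, 86, 99, 99], dtype=float)

X = sm.add_constant(hsrank_r)
full = sm.OLS(sat_r, X).fit()
yhat_full = full.predict(X)
pa, mse = 2, full.mse_resid                      # PA parameters and MSE of the full model

for k in range(len(sat_r)):
    keep = np.arange(len(sat_r)) != k
    fit_k = sm.OLS(sat_r[keep], X[keep]).fit()   # model without observation k
    yhat_k = fit_k.predict(X)                    # predictions for ALL observations
    d_k = np.sum((yhat_full - yhat_k) ** 2) / (pa * mse)
    print(f"obs {k + 1}: D = {d_k:.2f}")
# Observation 6 yields D of about 11.87, matching the Cook's D table below.
```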


Evaluating Outliers with Cook’s D

There are only informal guidelines for when Cook’s D is considered large:

1. D > 1 or 2, or,

2. definite gaps between largest Dks

Again, this will be used simply to draw your attention to where to look for outliers.


Values of Cook’s D

SATi   Cook’s D
1      .0542
2      .0148
3      .0158
4      .0945
5      .0245
6      11.8686
7      .0151
8      .0856
9      .0251
10     .0558
11     .0642
12     .0655
13     .2161

The 11.86 really stands out; you may also want to take a look at the .21.


Leverage, Discrepancy, & Influence

Identify what is exemplified by ‘A’, ‘B’, and ‘C’


Usual Effects of Outliers

Index         Parameter Estimates   Increases Error Type
Leverage      OK                    Type I
Discrepancy   OK                    Type II
Influence     Biased                Either

Leverage: leads us to falsely think we have found something interesting. Unusual X scores inflate SSx, which leads to smaller confidence intervals, making it easier to reject H0.

Discrepancy: shooting ourselves in the foot by causing us to miss something interesting. Scores off the regression line add to SSE(A), which reduces SSR, making PRE smaller.

Influence: all bets are off; the model just doesn’t fit the majority of scores.


The Importance of Always Looking at Your Data

The following four scatter plots all have the same model, PRE, and significance!

Ŷi = 3.0 + .5Xi

PRE = .666, F* = 17.95, p < .01
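These values (Ŷ = 3.0 + .5X, PRE = .666, F* = 17.95 across four data sets) match Anscombe’s (1973) quartet, so, as an assumption on my part about the source of the plots, the same demonstration can be sketched with the copy of that data set that ships with seaborn:

```python
# Sketch: four data sets with (nearly) identical fitted models and PRE,
# even though their scatter plots look completely different.
import seaborn as sns
from scipy import stats

anscombe = sns.load_dataset("anscombe")          # columns: dataset, x, y
for name, grp in anscombe.groupby("dataset"):
    fit = stats.linregress(grp["x"], grp["y"])
    print(f"dataset {name}: Yhat = {fit.intercept:.2f} + {fit.slope:.2f}X, "
          f"PRE = {fit.rvalue ** 2:.3f}")
```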


Plot 1: The regression line looks like a good fit.


Plot 2: Clearly it would be good to do something with the outlier so that the regression line would better fit the other scores.


Plot 3: A polynomial model is called for.


Plot 4: The one point on the far right is completely responsible for determining the slope; what the heck is X doing?


Complex Models

Partial regression plots can be of help in visually identifying outliers when there is more than one predictor variable.


Doing Something About Outliers

• The authors argue against the bias that doing nothing about an outlier is somehow better than doing something about it.

• Always mention in your report anything you do to handle outliers.

• If you think others might question what you have done, provide both analyses (with and without your action to handle the outlier).


What To Do

• If it seems reasonable to conclude that the outlier is an error of measurement or recording, or leads to a model that is less accurate than when you omit the outlier, then omit it (explain that in your report). Removing an error is far better than leaving it in.

• If further exploration leads to the conclusion that the outliers represent essentially a second population in the sample, then find an independent way (other than an extreme score) to measure which population the observation came from, include that variable in your model, and test this in further research.


(From text): “We think it is far more honest to omit outliers from the analysis with the explicit admission in the report that there are some observations which we do not understand and to report a good model for those observations which we do understand....To ignore outliers by failing to detect and report them is dishonest and misleading.”