
Unit 1: Introduction to data

Lecture 3: EDA (cont.) and Introduction to statistical inference via simulation

Statistics 101

Nicole Dalzell

May 15, 2015

Spread

Measures of Spread

The population variance, σ², measures how far observations deviate from the mean (roughly, the average squared deviation).

The population standard deviation, σ, is the square root of the variance.

The interquartile range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in box plots.


Statistics 101 ( Nicole Dalzell) U1 - L3: EDA + Inference May 15, 2015 2 / 1


Box Plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week; axis from 10 to 40]


Anatomy of a Box Plot

[Figure: box plot of # of study hours / week (axis from 0 to 40), with labels: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers]


Measures of Location

The 25th percentile is also called the first quartile, Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile, Q3.

summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00

Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = 20 − 10 = 10
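As a minimal sketch of the calculation above, the quartiles and the IQR can be computed in R. The vector `study_hours` below is hypothetical (it is not the class survey data); its values were chosen only so that the quartiles match the summary output on this slide.

```r
# Hypothetical data, NOT the class survey: chosen so that
# Q1 = 10, median = 15, and Q3 = 20, as in the summary above.
study_hours <- c(3, 8, 10, 12, 15, 18, 20, 25, 40, NA)

# na.rm = TRUE drops the missing response before computing.
quartiles <- quantile(study_hours, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
iqr_value <- IQR(study_hours, na.rm = TRUE)  # Q3 - Q1

print(quartiles)
print(iqr_value)
```

R supports several quantile definitions, so other software may report slightly different quartiles for small samples.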


Whiskers and Outliers

Whiskers of a box plot can extend up to 1.5 * IQR away from the quartiles.

max upper whisker reach: Q3 + 1.5 * IQR = 20 + 1.5 * 10 = 35

max lower whisker reach: Q1 − 1.5 * IQR = 10 − 1.5 * 10 = −5

An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
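The whisker-reach rule can be sketched in R as follows. The data are hypothetical (not the class survey), chosen so that Q1 = 10 and Q3 = 20 as on this slide; the variable names are illustrative only.

```r
# Hypothetical data (not the class survey); quartiles are Q1 = 10, Q3 = 20.
study_hours <- c(3, 8, 10, 12, 15, 18, 20, 25, 40)

q1 <- unname(quantile(study_hours, 0.25))
q3 <- unname(quantile(study_hours, 0.75))
iqr_value <- q3 - q1

lower_reach <- q1 - 1.5 * iqr_value  # max lower whisker reach
upper_reach <- q3 + 1.5 * iqr_value  # max upper whisker reach

# Observations beyond either reach are suspected outliers.
outliers <- study_hours[study_hours < lower_reach | study_hours > upper_reach]

print(c(lower = lower_reach, upper = upper_reach))
print(outliers)
```

With these values only the response of 40 hours lies beyond the upper reach of 35, so it is the one suspected outlier.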


Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.


Why visualize?

What does a response of 0 mean in this distribution?

[Figure: distribution of the number of drinks it takes students to get drunk; axis from 0 to 12]


Spread Robust Statistics

Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

[Figure: dot plot of household income ($ thousands); axis from 0 to 1000]


Income Example

[Figure: dot plot of household income ($ thousands); axis from 0 to 1000]

                                   robust          not robust
scenario                        median   IQR       x̄         s
original data                    165K    150K     211K      180K
move largest to $10 million      165K    150K     398K    1,422K
move smallest to $10 million     190K    163K   4,186K    1,424K
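The same comparison can be sketched in R. The `income` vector below is hypothetical (it is not the data behind the table), and `spread_summary` is an illustrative helper name, so the numbers will differ from the slide; only the pattern is meant to carry over.

```r
# Hypothetical household incomes in $ thousands (NOT the slide's data).
income <- c(40, 65, 90, 120, 150, 165, 180, 220, 300, 450, 1000)

# Illustrative helper: the two robust and two non-robust summaries.
spread_summary <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

# Replace the largest, then the smallest, value with $10 million (10000 thousand).
largest_moved  <- replace(income, which.max(income), 10000)
smallest_moved <- replace(income, which.min(income), 10000)

results <- rbind(
  original       = spread_summary(income),
  largest_moved  = spread_summary(largest_moved),
  smallest_moved = spread_summary(smallest_moved)
)
print(round(results, 1))
```

Moving the largest value leaves the median and IQR untouched while the mean and SD jump; moving the smallest value shifts the median and IQR only modestly by comparison.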


Robust statistics

Since the median and IQR are more robust to skewness and outliers than mean and SD:

skewed → median and IQR

symmetric → mean and SD

If you were searching for a car, and you are price conscious, would you be more interested in the mean or median vehicle price when considering a car?


Range and IQR

Range

Range of the entire data.

range = max − min

IQR

Range of the middle 50% of the data.

IQR = Q3 − Q1

Is the range or the IQR more robust to outliers?
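One way to explore this question is to add a single extreme value to a small data set and watch both measures. The values below are hypothetical and `range_width` is an illustrative helper name, not something from the course materials.

```r
# Hypothetical values; with_outlier adds one extreme observation.
values       <- c(2, 4, 5, 6, 7, 8, 9, 11, 12)
with_outlier <- c(values, 100)

# range() returns c(min, max); diff() turns that into max - min.
range_width <- function(x) diff(range(x))

before <- c(range = range_width(values),       IQR = IQR(values))
after  <- c(range = range_width(with_outlier), IQR = IQR(with_outlier))

print(rbind(before, after))
```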


Example: Visualizing

What does our Energy Data look like?

[Figure: "Energy Use Data Boxplot"; vertical axis: Energy Usage, from 0 to 15000]


Who uses the most energy?

   Country.Name             X2011
1  Iceland               17964.44
2  Qatar                 17418.69
3  Trinidad and Tobago   15691.29
4  Kuwait                10408.28
5  Brunei Darussalam      9427.09
6  Oman                   8356.29
7  Luxembourg             8045.90
8  United Arab Emirates   7407.01
9  Bahrain                7353.16
10 Canada                 7333.28
11 North America          7062.22
12 United States          7032.35
13 Saudi Arabia           6738.42
14 Singapore              6452.33
15 Finland                6449.04


Participation question

Which of the following is false about the distribution of average number of hours students study daily?

[Figure: distribution of the average number of hours students study daily; axis from 2 to 10]

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   4.000   3.821   5.000  10.000

(a) There are no students who don’t study at all.
(b) 75% of the students study more than 5 hours daily, on average.
(c) 25% of the students study less than 3 hours, on average.
(d) IQR is 2 hours.


Side-by-side box plot

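How does the average number of times students go out per week vary by involvement? Do the two variables appear to be associated or independent?

[Figure: side-by-side box plots of the average number of times students go out per week (axis from 0 to 5) for three groups: Greek, Independent, SLG]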


Spread Deviation

Deviation

The distance of an observation from the mean is its deviation: xi − x̄.

sort(d$sleep)
 [1] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[30] 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[59] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 7 7 7 7 7 7 7 7 8 9 9 9
mean(d$sleep)
[1] 4.6

x1 − x̄ = 1 − 4.6 = −3.6

x2 − x̄ = 1 − 4.6 = −3.6

x3 − x̄ = 2 − 4.6 = −2.6

...

x86 − x̄ = 9 − 4.6 = 4.4
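The deviations can be computed for all 86 observations at once. This is a minimal sketch: the `sleep` vector re-enters the sorted values printed on this slide as value counts (a hand transcription, not the original data file `d`).

```r
# The 86 sorted sleep values printed on this slide, entered as value counts
# (a hand transcription of the output, not the original data file).
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))

x_bar <- mean(sleep)          # about 4.6
deviations <- sleep - x_bar   # x_i - x_bar for every observation

print(round(head(deviations, 3), 1))  # first three deviations
print(round(tail(deviations, 1), 1))  # last deviation
print(sum(deviations))                # zero, up to floating-point rounding
```

Because positive and negative deviations cancel, their sum is always zero, which is why a measure of spread cannot simply average the raw deviations.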


Variance

Population Variance, σ²

Roughly the average squared deviation from the mean

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]


Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant from the mean are weighted equally.

To weight larger deviations more heavily.


Variance

Sample Variance, s²

Roughly the average squared deviation from the mean

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Given that the sample mean is 4.6, the sample variance of the hours of sleep students get per night can be calculated as:

\[ s^2 = \frac{(1 - 4.6)^2 + (1 - 4.6)^2 + \cdots + (9 - 4.6)^2}{86 - 1} = 2.76 \]
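The calculation above can be checked in R. As a sketch, the `sleep` vector below re-enters the 86 sorted values printed on the Deviation slide as value counts (a hand transcription, not the original data file); the other variable names are illustrative.

```r
# The 86 sleep values from the Deviation slide, entered as value counts
# (a hand transcription, not the original data file).
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))

n <- length(sleep)
squared_dev <- (sleep - mean(sleep))^2

s2_manual    <- sum(squared_dev) / (n - 1)  # sample variance, by the formula
s2_builtin   <- var(sleep)                  # var() also divides by n - 1
sigma2_style <- sum(squared_dev) / n        # dividing by n, as in the population formula

print(round(c(manual = s2_manual, builtin = s2_builtin, divide_by_n = sigma2_style), 3))
```

R's `var()` uses the n − 1 denominator, so it agrees with the sample formula and is slightly larger than the result of dividing by n.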


Notation Recap

              mean   variance   SD
sample         x̄       s²       s
population     µ       σ²       σ

Do you see a trend in what types of letters are used for sample statistics vs. population parameters?

Latin letters for sample statistics, Greek letters for population parameters.


Application exercise: Variability


Variability vs. diversity

Which of the following sets of cars has a more diverse composition of colors?

Set 1:

Set 2:


Variability vs. diversity (cont.)

Which of the following sets of cars has more variable mileage?

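Set 1: less variable

[Figure: distribution of mileage for Set 1; horizontal axis from 10 to 60]

Set 2: more variable

[Figure: distribution of mileage for Set 2; horizontal axis from 10 to 60]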


Spread Standard Deviation

Standard deviation

Standard deviation, s

Roughly the typical deviation around the mean. It is calculated as the square root of the variance and has the same units as the data.

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

The standard deviation of the number of hours the students slept is:

\[ s = \sqrt{2.759} \approx 1.66 \]


Standard Deviation

The standard deviation gives a rough estimate of the typical distance of a data point from the mean.

The larger the standard deviation, the more variability there is in the data and the more spread out the data are.

[Figure: two histograms on the same horizontal axis (−15 to 15; Frequency from 0 to 200): "Standard Deviation of 2", drawn from rnorm(1000,0,2), and "Standard Deviation of 4", drawn from rnorm(1000,0,4)]
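The two samples behind these histograms can be regenerated with the `rnorm()` calls named in the panel labels. This is a sketch: the seed is an arbitrary assumption (the slides do not give one), so the exact draws will differ from the plotted ones.

```r
set.seed(101)  # arbitrary seed, assumed here only for reproducibility

draws_sd2 <- rnorm(1000, mean = 0, sd = 2)
draws_sd4 <- rnorm(1000, mean = 0, sd = 4)

# Sample SDs should land close to 2 and 4, respectively.
print(round(c(sd_2 = sd(draws_sd2), sd_4 = sd(draws_sd4)), 2))

# Optional: draw both histograms on a shared horizontal axis.
# hist(draws_sd2, xlim = c(-15, 15), main = "Standard Deviation of 2")
# hist(draws_sd4, xlim = c(-15, 15), main = "Standard Deviation of 4")
```

Keeping `xlim` identical across the two plots is what makes the difference in spread visible at a glance.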


Variability and Z-scores

Variability in Student Sleep

sleep, x̄ = 4.6, s_x = 1.66

[Figure: dot plot of hours of sleep; axis from 2 to 8]

69 out of 86 students (80%) are within 1 SD of the mean.

80 out of 86 students (93%) are within 2 SDs of the mean.

86 out of 86 students (100%) are within 3 SDs of the mean.
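These counts can be reproduced in R. As a sketch, `sleep` re-enters the 86 sorted values printed on the Deviation slide as value counts (a hand transcription, not the original data file), and `within_k_sd` is an illustrative helper name.

```r
# The 86 sleep values from the Deviation slide, entered as value counts
# (a hand transcription, not the original data file).
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))

# Count observations whose distance from the mean is at most k SDs.
within_k_sd <- function(x, k) sum(abs(x - mean(x)) <= k * sd(x))

counts <- sapply(1:3, function(k) within_k_sd(sleep, k))
result <- data.frame(
  k       = 1:3,
  count   = counts,
  percent = round(100 * counts / length(sleep))
)
print(result)
```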


95% Rule

95% Rule: If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.

For a population, about 95% of the data will be between µ − 2σ and µ + 2σ.

http://rchsbowman.files.wordpress.com/2008/09/empirical-rule-3.jpg
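The rule can be checked by simulation. The sketch below assumes hypothetical bell-shaped data generated with `rnorm()`; the seed, sample size, mean, and SD are arbitrary choices, not values from the course.

```r
set.seed(2015)  # arbitrary seed, assumed here only for reproducibility

# Hypothetical bell-shaped data: 10,000 normal draws.
x <- rnorm(10000, mean = 50, sd = 10)

# TRUE for observations within two standard deviations of the mean.
inside <- abs(x - mean(x)) <= 2 * sd(x)

prop_inside <- mean(inside)  # proportion of TRUE values
print(round(prop_inside, 3))
```

For skewed or heavy-tailed data the proportion can differ noticeably from 95%, which is why the rule is stated only for roughly symmetric, bell-shaped distributions.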


Z-Scores

Z-Score

The z-score for a data value, x_i, is

\[ z = \frac{x_i - \bar{x}}{s} \]

For a population, x̄ is replaced with µ and s is replaced with σ.

Values farther from 0 are more extreme.


Z-Scores: Why?

A z-score puts values on a common scale.

A z-score is the number of standard deviations a value falls from the mean.

For an approximately bell-shaped distribution, about 95% of all z-scores fall between −2 and 2.

z-scores beyond −2 or 2 can be considered extreme.


Z-Scores: Example

Which is better, (A) an ACT score of 28 or (B) a combined SAT score of 2100?

ACT: x̄ = 21, s = 5

SAT: x̄ = 1500, s = 325

ACT:

\[ z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4 \]

SAT:

\[ z = \frac{2100 - 1500}{325} = \frac{600}{325} \approx 1.85 \]

[Figure: "Histogram of Z-Scores"; horizontal axis: Z-Score, from −3 to 3; vertical axis: Frequency, from 0 to 300]
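The ACT and SAT comparison can be written as a small R function. `z_score` and its argument names are illustrative, not part of the course code; the means and SDs are the ones given on this slide.

```r
# Illustrative helper: number of SDs a value lies from the center.
z_score <- function(x, center, spread) (x - center) / spread

z_act <- z_score(28,   center = 21,   spread = 5)
z_sat <- z_score(2100, center = 1500, spread = 325)

print(round(c(ACT = z_act, SAT = z_sat), 2))
```

Because both scores are now in units of standard deviations, they can be compared directly: the SAT score lies farther above its mean than the ACT score does.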


Categorical Variables Relationship between two categorical variables

Mosaic plots

A survey question asked students, “Have you ever used Adderall for an exam or to study?” Based on their responses, does there appear to be a relationship between gender and having used Adderall for an exam or to study?

[Figure: mosaic plot of Adderall use (no / yes) by gender (female / male)]

% of females who used Adderall < % of males


Contingency table and mosaic plot

In 1973, the University of California-Berkeley was sued for sex discrimination. The numbers looked pretty incriminating: the graduate schools had just accepted 44% of male applicants but only 35% of female applicants.

          Admit   Deny   Total
Male       3738   4704    8442
Female     1494   2827    4321
Total      5232   7531   12763

% Males admitted: 3738 / 8442 = 44%

% Females admitted: 1494 / 4321 = 35%

[Figure: mosaic plot of admission status (admit / deny) by gender (female / male)]
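The table and the two admission rates can be reproduced in R from the four cell counts on this slide. The object names are illustrative, and the mosaic plot call is left commented out as an optional step.

```r
# Cell counts from the contingency table on this slide.
admissions <- matrix(
  c(3738, 4704,
    1494, 2827),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Male", "Female"), status = c("Admit", "Deny"))
)

print(addmargins(admissions))  # adds the row and column totals

# Row proportions: share of each gender admitted.
admit_rate <- prop.table(admissions, margin = 1)[, "Admit"]
print(round(100 * admit_rate))

# Optional: mosaic plot with gender on the horizontal axis.
# mosaicplot(admissions, main = "Admission status by gender")
```

Using `margin = 1` conditions on the rows, so each rate is a percentage of that gender's applicants rather than of all applicants.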


Further analysis of these data:

“If the data are properly pooled...there is a small but statistically significant bias in favor of women.”

Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404.

http://www.unc.edu/~nielsen/soci708/cdocs/Berkeley_admissions_bias.pdf


Proper pooling

Let’s take a closer look at the top 6 departments:

vs.

Play with it at http://vudlab.com/simpsons.


Simpson’s paradox

Every Simpson’s paradox involves at least three variables:

1. the response variable (accepted/not accepted)
2. the observed explanatory variable (male/female)
3. the lurking explanatory variable (what department did you apply to)

If the effect of the observed explanatory variable on the response variable changes directions when you account for the lurking explanatory variable, you’ve got a Simpson’s Paradox.
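A direction reversal of this kind can be sketched in R with a tiny made-up example. The counts below are hypothetical and are NOT the Berkeley department figures; they were invented only so that the three variables show the reversal.

```r
# Hypothetical counts (NOT the Berkeley data), two departments.
apps <- data.frame(
  dept     = c("A", "A", "B", "B"),
  gender   = c("male", "female", "male", "female"),
  applied  = c(100, 20, 20, 100),
  admitted = c(80, 18, 4, 30)
)

# Admission rate within each department (accounts for the lurking variable).
apps$rate <- apps$admitted / apps$applied
print(apps)

# Pooled admission rate by gender (ignores department).
pooled <- aggregate(cbind(admitted, applied) ~ gender, data = apps, FUN = sum)
pooled$rate <- pooled$admitted / pooled$applied
print(pooled)
```

In this made-up data women have the higher admission rate in each department, yet the lower rate overall, because most women applied to the department that admits few applicants.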
