
Unit 1: Introduction to data

Lecture 3: EDA (cont.) and introduction to statistical inference via simulation

Statistics 101

Nicole Dalzell

May 15, 2015

Spread

Measures of Spread

The population Variance, σ², measures the average squared deviation of the observations from the mean.

The population Standard Deviation, σ, is the square root of the variance.

The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in box plots.

Link

Statistics 101 ( Nicole Dalzell) U1 - L3: EDA + Inference May 15, 2015 2 / 1


Box Plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week, axis 10 to 40]



Anatomy of a Box Plot

[Figure: box plot of # of study hours / week (axis 0 to 40) with its parts labeled: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers]


http://mih5.github.io/statapps/


Measures of Location

The 25th percentile is also called the first quartile, Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile, Q3.

summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00

Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = 20 − 10 = 10
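The same calculation can be sketched in R. The quartiles 10 and 20 come from the summary output; the vector `hours` is a small hypothetical data set, since the survey data frame `d` is not reproduced here.

```r
# Quartiles reported by summary(): Q1 = 10, Q3 = 20
q1 <- 10
q3 <- 20
iqr_reported <- q3 - q1                      # 10

# The same idea on a hypothetical vector of weekly study hours
hours <- c(3, 8, 10, 12, 15, 15, 18, 20, 25, 40)
q <- quantile(hours, probs = c(0.25, 0.75))  # first and third quartiles
iqr_by_hand <- unname(q[2] - q[1])           # Q3 - Q1
iqr_builtin <- IQR(hours)                    # built-in equivalent
```

Note that R offers several quantile definitions; `quantile()` and `IQR()` share the same default, so the two results agree.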



Whiskers and Outliers

Whiskers of a box plot can extend up to 1.5 * IQR away from the quartiles.

max upper whisker reach : Q3 + 1.5 ∗ IQR = 20 + 1.5 ∗ 10 = 35

max lower whisker reach : Q1 − 1.5 ∗ IQR = 10 − 1.5 ∗ 10 = −5

An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
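A minimal R sketch of the whisker rule, using Q1 = 10 and Q3 = 20 from the study hours summary. The vector `hours` is hypothetical and only illustrates how observations get flagged.

```r
q1 <- 10
q3 <- 20
iqr <- q3 - q1

upper_reach <- q3 + 1.5 * iqr   # 20 + 15 = 35
lower_reach <- q1 - 1.5 * iqr   # 10 - 15 = -5

# Hypothetical observations; anything beyond the reach is a suspected outlier
hours <- c(3, 10, 15, 20, 36, 40)
outliers <- hours[hours < lower_reach | hours > upper_reach]
outliers
```

The lower reach is negative here, so no small value can be flagged; the whisker itself stops at the smallest observation inside the reach.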



Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.



Why visualize?

What does a response of 0 mean in this distribution?

[Figure: box plot of the number of drinks it takes students to get drunk, axis 0 to 12, with three suspected outliers]


Spread Robust Statistics

Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

[Figure: dot plot of household income ($ thousands), axis 0 to 1000]



Income Example

[Figure: dot plot of household income ($ thousands), axis 0 to 1000]

                                 robust           not robust
scenario                         median   IQR     x̄         s
original data                    165K     150K    211K      180K
move largest to $10 million      165K     150K    398K      1,422K
move smallest to $10 million     190K     163K    4,186K    1,424K
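The same experiment can be tried in R. This is a sketch on a hypothetical `income` vector (in $ thousands), not the data behind the table, so the numbers differ; `spread_summary` is a helper name made up for this example.

```r
# Hypothetical household incomes, in $ thousands
income <- c(40, 65, 90, 120, 150, 165, 180, 220, 300, 450, 900)

# Helper (hypothetical name): the four statistics from the table
spread_summary <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

original <- spread_summary(income)

# Replace the largest value with $10 million (10000 thousand)
big <- income
big[which.max(big)] <- 10000
largest_moved <- spread_summary(big)

# Replace the smallest value with $10 million
small <- income
small[which.min(small)] <- 10000
smallest_moved <- spread_summary(small)

round(rbind(original, largest_moved, smallest_moved), 1)
```

In this sketch the median and IQR do not move at all when the largest value changes, while the mean and SD jump.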



Robust statistics

Since the median and IQR are more robust to skewness and outliers than mean and SD:

skewed → median and IQR

symmetric → mean and SD

If you were searching for a car, and you are price conscious, would you be more interested in the mean or median vehicle price when considering a car?



Range and IQR

Range

Range of the entire data.

range = max − min

IQR

Range of the middle 50% of the data.

IQR = Q3 − Q1

Is the range or the IQR more robust to outliers?
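One way to explore the question is to add a single extreme value to a data set and watch both measures. A small R sketch with hypothetical numbers; `range_width` is a helper defined only for this example.

```r
x     <- c(3, 8, 10, 12, 15, 15, 18, 20, 25)   # hypothetical data
x_out <- c(x, 400)                             # same data plus one extreme value

# Helper (hypothetical name): range() returns c(min, max), so diff() gives max - min
range_width <- function(v) diff(range(v))

before <- c(range = range_width(x),     IQR = IQR(x))
after  <- c(range = range_width(x_out), IQR = IQR(x_out))
rbind(before, after)
```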



Example: Visualizing

What does our Energy Data look like?

[Figure: "Energy Use Data Boxplot"; axis: Energy Usage, 0 to 15000]



Who uses the most energy?

   Country.Name             X2011
1  Iceland               17964.44
2  Qatar                 17418.69
3  Trinidad and Tobago   15691.29
4  Kuwait                10408.28
5  Brunei Darussalam      9427.09
6  Oman                   8356.29
7  Luxembourg             8045.90
8  United Arab Emirates   7407.01
9  Bahrain                7353.16
10 Canada                 7333.28
11 North America          7062.22
12 United States          7032.35
13 Saudi Arabia           6738.42
14 Singapore              6452.33
15 Finland                6449.04



Participation question

Which of the following is false about the distribution of average number of hours students study daily?

[Figure: box plot of the average number of hours students study daily, axis 2 to 10, with one suspected outlier]

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   4.000   3.821   5.000  10.000

(a) There are no students who don’t study at all.
(b) 75% of the students study more than 5 hours daily, on average.
(c) 25% of the students study less than 3 hours, on average.
(d) IQR is 2 hours.


Side-by-side box plot

How does the average number of times students go out per week vary by involvement? Do the two variables appear to be associated or independent?

[Figure: side-by-side box plots of the number of times students go out per week (axis 0 to 5) for three groups: Greek, Independent, SLG]





Spread Deviation

Deviation

The distance of an observation from the mean is its deviation: xi − x̄.

sort(d$sleep)
 [1] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[30] 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[59] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 7 7 7 7 7 7 7 7 8 9 9 9

mean(d$sleep)
[1] 4.6

x1 − x̄ = 1 − 4.6 = −3.6

x2 − x̄ = 1 − 4.6 = −3.6

x3 − x̄ = 2 − 4.6 = −2.6

...

x86 − x̄ = 9 − 4.6 = 4.4
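The deviations can be computed for all 86 students at once. In this R sketch the vector `sleep` is rebuilt from the sorted output shown on this slide; the variable names are chosen for the example.

```r
# Hours of sleep for the 86 students, rebuilt from the sorted output
sleep <- c(rep(1, 2), rep(2, 3), rep(3, 24), rep(4, 2), rep(5, 41),
           rep(6, 2), rep(7, 8), 8, rep(9, 3))

x_bar <- mean(sleep)           # about 4.6
deviations <- sleep - x_bar    # x_i - x_bar for every student

head(deviations, 3)            # first three deviations
tail(deviations, 1)            # deviation of the 86th (largest) observation
sum(deviations)                # deviations from the mean sum to (numerically) zero
```

The slide rounds the mean to 4.6, so the values printed here differ from −3.6, −2.6 and 4.4 in the third decimal place.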



Variance

Population Variance, σ²

Roughly the average squared deviation from the mean

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$



Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant from the mean are weighted equally.

To weight larger deviations more heavily.




Variance

Sample Variance, s²

Roughly the average squared deviation from the mean

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Given that the sample mean is 4.6, the sample variance of the hours of sleep students get per night can be calculated as:

$$s^2 = \frac{(1 - 4.6)^2 + (1 - 4.6)^2 + \cdots + (9 - 4.6)^2}{86 - 1} = 2.76$$
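A short R check of this calculation. The vector `sleep` is rebuilt from the sorted output on the Deviation slide; the comparison with the divide-by-n version is added only for illustration.

```r
# Hours of sleep for the 86 students, rebuilt from the sorted output
sleep <- c(rep(1, 2), rep(2, 3), rep(3, 24), rep(4, 2), rep(5, 41),
           rep(6, 2), rep(7, 8), 8, rep(9, 3))
n <- length(sleep)

# Sample variance by hand: sum of squared deviations over n - 1
s2_by_hand <- sum((sleep - mean(sleep))^2) / (n - 1)

# var() uses the same n - 1 denominator
s2_builtin <- var(sleep)

# Dividing by n instead, as in the population formula, gives a slightly smaller value
s2_over_n <- sum((sleep - mean(sleep))^2) / n

round(c(by_hand = s2_by_hand, builtin = s2_builtin, over_n = s2_over_n), 3)
```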



Notation Recap

              mean   variance   SD
sample        x̄      s²         s
population    µ      σ²         σ

Do you see a trend in what types of letters are used for sample statistics vs. population parameters?

Latin letters for sample statistics, Greek letters for population parameters.



Application exercise: Variability



Variability vs. diversity

Which of the following sets of cars has a more diverse composition of colors?

Set 1:

Set 2:



Variability vs. diversity (cont.)

Which of the following sets of cars has more variable mileage?

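Set 1: less variable

[Figure: distribution of mileage for Set 1, x-axis 10 to 60, counts 0 to 3]

Set 2: more variable

[Figure: distribution of mileage for Set 2, x-axis 10 to 60, counts 0 to 3]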


Spread Standard Deviation

Standard deviation

Standard deviation, s

Roughly the typical deviation of observations around the mean. It is calculated as the square root of the variance, and has the same units as the data.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The standard deviation of the number of hours the students slept is:

$$s = \sqrt{2.759} \approx 1.66$$



Standard Deviation

The standard deviation gives a rough estimate of the typical distance of a data point from the mean.

The larger the standard deviation, the more variability there is in the data and the more spread out the data are.

[Figure: histogram "Standard Deviation of 2" of rnorm(1000,0,2); x-axis −15 to 15, Frequency 0 to 200]

[Figure: histogram "Standard Deviation of 4" of rnorm(1000,0,4); x-axis −15 to 15, Frequency 0 to 200]
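The two histograms can be approximated with a few lines of R. This is a sketch; the seed is an arbitrary choice made here for reproducibility, so the draws will not match the ones in the figures.

```r
set.seed(101)                              # arbitrary seed, for reproducibility only

narrow <- rnorm(1000, mean = 0, sd = 2)    # 1000 draws with standard deviation 2
wide   <- rnorm(1000, mean = 0, sd = 4)    # 1000 draws with standard deviation 4

# Sample standard deviations land close to the values used to simulate
c(sd_narrow = sd(narrow), sd_wide = sd(wide))

# Uncomment to draw both histograms on a shared x-axis:
# hist(narrow, xlim = c(-15, 15), main = "Standard Deviation of 2")
# hist(wide,   xlim = c(-15, 15), main = "Standard Deviation of 4")
```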


Variability and Z-scores

Variability in Student Sleep

[Figure: dot plot of sleep (hours), axis 2 to 8; x̄ = 4.6, s = 1.66]

69 out of 86 students (80%) are within 1 SD of the mean.

80 out of 86 students (93%) are within 2 SDs of the mean.

86 out of 86 students (100%) are within 3 SDs of the mean.
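These counts can be reproduced in R. The vector `sleep` is rebuilt from the sorted output on the Deviation slide, and `within_k_sd` is a helper name made up for this sketch.

```r
# Hours of sleep for the 86 students, rebuilt from the sorted output
sleep <- c(rep(1, 2), rep(2, 3), rep(3, 24), rep(4, 2), rep(5, 41),
           rep(6, 2), rep(7, 8), 8, rep(9, 3))

# Helper (hypothetical name): number of observations within k SDs of the mean
within_k_sd <- function(x, k) sum(abs(x - mean(x)) <= k * sd(x))

counts <- sapply(1:3, function(k) within_k_sd(sleep, k))
counts                                   # students within 1, 2 and 3 SDs
round(100 * counts / length(sleep))      # the same as percentages
```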



95% Rule

95% Rule

If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.

For a population, about 95% of the data will be between µ − 2σ and µ + 2σ.
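For an exactly normal distribution the share within two standard deviations can be computed directly, and a simulation gives a similar answer. A sketch in R; the seed, sample size, mean and SD are arbitrary choices for this example.

```r
# Exact share of a normal distribution within 2 SDs of its mean
p_two_sd <- pnorm(2) - pnorm(-2)     # about 0.954

# Simulated check on bell-shaped data
set.seed(101)                        # arbitrary seed
x <- rnorm(10000, mean = 50, sd = 10)
share <- mean(abs(x - mean(x)) <= 2 * sd(x))   # proportion within 2 sample SDs

c(exact = p_two_sd, simulated = share)
```

For skewed or otherwise non-bell-shaped data the share can be quite different, which is why the rule is stated as an approximation.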

http://rchsbowman.files.wordpress.com/2008/09/empirical-rule-3.jpg


Z-Scores

Z-Score

The z-score for a data value, xi, is

$$z = \frac{x_i - \bar{x}}{s}$$

For a population, x̄ is replaced with µ and s is replaced with σ.

Values farther from 0 are more extreme.
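A minimal R sketch of the sample z-score formula. The vector `x` is hypothetical and `z_score` is a helper defined for this example; base R's `scale()` performs the same standardization.

```r
# Helper (hypothetical name): z-scores using the sample mean and sample SD
z_score <- function(x) (x - mean(x)) / sd(x)

x <- c(1, 3, 5, 5, 7, 9)       # hypothetical data values
z <- z_score(x)
round(z, 2)

# z-scores always have mean 0 and standard deviation 1
c(mean = mean(z), sd = sd(z))

# scale() centers and divides by the sample SD, giving the same values
same_as_scale <- isTRUE(all.equal(as.numeric(scale(x)), z))
same_as_scale
```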



Z-Scores: Why?

A z-score puts values on a common scale

A z-score is the number of standard deviations a value falls from the mean

For approximately bell-shaped distributions, about 95% of all z-scores fall between −2 and 2.

z-scores beyond -2 or 2 can be considered extreme



Z-Scores: Example

Which is better, (A) an ACT score of 28 or (B) a combined SAT score of 2100?

ACT: x̄ = 21, s = 5

SAT: x̄ = 1500, s = 325

ACT:

$$z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4$$

SAT:

$$z = \frac{2100 - 1500}{325} = \frac{600}{325} \approx 1.85$$
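The comparison written out in R, using the means and standard deviations given on this slide; the variable names are chosen for the example.

```r
# ACT: mean 21, SD 5; SAT: mean 1500, SD 325 (values from the slide)
z_act <- (28 - 21) / 5           # 1.4
z_sat <- (2100 - 1500) / 325     # about 1.85

# The score with the larger z-score is farther above its own mean, in SD units
better <- if (z_sat > z_act) "SAT" else "ACT"
c(ACT = z_act, SAT = round(z_sat, 2))
better
```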

[Figure: "Histogram of Z-Scores"; x-axis: Z-Score, −3 to 3; y-axis: Frequency, 0 to 300]



Categorical Variables Relationship between two categorical variables

Mosaic plots

A survey question asked students, “Have you ever used Adderall for an exam or to study?” Based on their responses, does there appear to be a relationship between gender and having used Adderall for an exam or to study?

[Figure: mosaic plot of Adderall use (no / yes) by gender (female / male)]

% of females who used Adderall < % of males who used Adderall


Contingency table and mosaic plot

In 1973, the University of California-Berkeley was sued for sex discrimination. The numbers looked pretty incriminating: the graduate schools had just accepted 44% of male applicants but only 35% of female applicants.

          Admit   Deny   Total
Male       3738   4704    8442
Female     1494   2827    4321
Total      5232   7531   12763

% Males admitted: 3738 / 8442 = 44%

% Females admitted: 1494 / 4321 = 35%
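The table and the two admission rates can be rebuilt in R from the four inner counts. A sketch; the object names are chosen for this example.

```r
# Counts from the contingency table (rows: gender, columns: status)
admissions <- as.table(matrix(c(3738, 4704,
                                1494, 2827),
                              nrow = 2, byrow = TRUE,
                              dimnames = list(gender = c("Male", "Female"),
                                              status = c("Admit", "Deny"))))

addmargins(admissions)                                # adds the Total row and column

# Row proportions: share of each gender that was admitted
admit_rate <- prop.table(admissions, margin = 1)[, "Admit"]
round(100 * admit_rate)                               # 44 for Male, 35 for Female

# Uncomment to draw a mosaic plot of the same table:
# mosaicplot(admissions)
```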

[Figure: mosaic plot of status (admit / deny) by gender (female / male)]



Further analysis of these data:

“If the data are properly pooled...there is a small but statistically significant bias in favor of women.”

Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404.

http://www.unc.edu/~nielsen/soci708/cdocs/Berkeley_admissions_bias.pdf



Proper pooling

Let’s take a closer look at the top 6 departments:

vs.

Play with it at http://vudlab.com/simpsons.




Simpson’s paradox

Every Simpson’s paradox involves at least three variables:

1. the response variable (accepted/not accepted)
2. the observed explanatory variable (male/female)
3. the lurking explanatory variable (what department did you apply to)

If the effect of the observed explanatory variable on the response variable changes directions when you account for the lurking explanatory variable, you’ve got a Simpson’s Paradox.
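The reversal can be seen in a toy example. The counts below are hypothetical and chosen only to make the effect obvious; they are NOT the Berkeley department data.

```r
# Hypothetical applications to two departments (not the Berkeley data)
apps <- data.frame(
  dept     = c("A", "A", "B", "B"),
  gender   = c("male", "female", "male", "female"),
  applied  = c(100, 20, 20, 100),
  admitted = c(80, 18, 4, 30)
)

# Within each department, women have the higher admission rate
apps$rate <- apps$admitted / apps$applied
apps

# Pooling over departments (ignoring the lurking variable) reverses the direction
pooled <- aggregate(cbind(applied, admitted) ~ gender, data = apps, FUN = sum)
pooled$rate <- pooled$admitted / pooled$applied
pooled
```

In this toy data most women apply to department B, which admits few applicants of either gender, so the pooled rate for women is lower even though it is higher within each department.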

