Lecture 3 - Kentoana/math10041/2012InterssesionLecture3.pdf · Lecture 5 1 Lecture 3 The Population...


Lecture 5 1

Lecture 3

The Population Variance

The population variance, denoted $\sigma^2$, is the sum of the squared deviations about the population mean divided by the number of observations in the population, $N$:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}.$$

An alternative formula is:

$$\sigma^2 = \frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2 = \frac{\sum x_i^2}{N} - \mu^2.$$
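As a minimal sketch, the two expressions for the population variance (the mean of the squared deviations, and the mean of the squares minus the square of the mean) can be compared numerically. The function names and the data set below are illustrative assumptions, not part of the lecture.

```python
def population_variance_definition(values):
    """Sum of squared deviations about the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


# Hypothetical population, chosen so the arithmetic is easy to check by hand.
population = [2, 4, 4, 4, 5, 5, 7, 9]

var_definition = population_variance_definition(population)
var_shortcut = population_variance_shortcut(population)
print(var_definition, var_shortcut)
```

For this data the mean is 5 and the squared deviations sum to 32, so both forms give a variance of 4. The shortcut subtracts two numbers that can be large and close together, which is one reason rounding too early is risky.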

REMARK: To avoid round-off errors, which accumulate quickly in these formulas, do not round until the last computation, and carry as many decimal places as your calculator allows.

May 15, 2012


The Sample Variance

When the population is large, we approximate the population mean µ with the sample mean, x̄. Similarly, we approximate the population variance $\sigma^2$ by the sample variance, denoted $s^2$:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}.$$

The alternative form is:

$$s^2 = \frac{\sum x_i^2}{n - 1} - \frac{\left(\sum x_i\right)^2}{n(n - 1)}.$$

REMARK: Notice that we divide by the sample

size minus one (this is different from the formula

for the population variance).
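The sketch below computes the sample variance from the definition and from the computational shortcut $\sum x_i^2/(n-1) - (\sum x_i)^2/(n(n-1))$, then compares both with `statistics.variance` from the Python standard library, which also divides by n − 1. The helper names and the sample are assumptions made for illustration.

```python
import statistics


def sample_variance(values):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational form: sum(x^2)/(n-1) - (sum x)^2 / (n(n-1))."""
    n = len(values)
    total = sum(values)
    return sum(x * x for x in values) / (n - 1) - total ** 2 / (n * (n - 1))


# Hypothetical sample of size n = 8.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

s2_definition = sample_variance(sample)
s2_shortcut = sample_variance_shortcut(sample)
s2_library = statistics.variance(sample)  # sample variance (n - 1 divisor)
print(s2_definition, s2_shortcut, s2_library)
```

Here the squared deviations sum to 32, so the sample variance is 32/7, slightly larger than the value 32/8 that the population formula would give for the same numbers.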


Informally, we say: a sample of size n has n

degrees of freedom; one degree of freedom is “used

up” in computing x̄, so there are only n− 1

degrees of freedom available for the sample

variance.


The Standard Deviation

For both cases (the population or the sample),

the standard deviation is the square root of the

corresponding variance:

The population standard deviation is denoted by σ:

$$\sigma = \sqrt{\sigma^2}.$$

The sample standard deviation is denoted by s:

$$s = \sqrt{s^2}.$$

Advantage of the (population or sample) standard deviation: it is given in the same units as the observations.

Advantage of the (population or sample) variance: it is easier to manipulate algebraically, in some cases.

Both standard deviations and variances are interpreted as follows: the larger they are, the more spread out the distribution is. If they equal 0, the smallest possible value, then all observations must be equal.

Remark 1. Standard deviation measures spread

about the mean and should be used only when

the mean is chosen as the measure of center.

Remark 2. Standard deviation is not robust: it is strongly affected by outliers.


Remark 3. The sum of the deviations of the

observations from their mean will always be zero.
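A short derivation shows why Remark 3 holds for a sample of size n with mean x̄; the same argument works for a population with N and µ. It uses only the definition of the mean, $\bar{x} = \frac{1}{n}\sum x_i$.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

This is also why the deviations are squared before being summed in the variance: without squaring, positive and negative deviations always cancel.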


Density curves

Histograms are approximations to the exact distribution of a variable. Increasing the number of classes in a histogram makes each rectangle narrower, and as the number of rectangles approaches infinity, the graph becomes a curve, called a density curve.

Properties of the density curve

1. The curve is always on or above the x-axis: the function f(x) describing the curve is nonnegative (it could be zero) for all x.

2. The total area underneath the curve and above the x-axis equals 1.


Density curves, as we saw, have means, medians and modes as well as standard deviations. The notation is similar to that for the population mean and standard deviation (why?).

Most of the time we use software to estimate

density curves. Many times we assume that data

follows a certain density curve.


The normal distribution

Often called the Gaussian curve, the normal curve was introduced by Carl Friedrich Gauss in 1809 as an error curve for least squares regression, which we will discuss next time.

There are other symmetric bell-shaped density

curves that are not normal.

Remark 4. The curve is described completely by 2 parameters: µ, the mean, and σ, the standard deviation.


The Empirical Rule

If the distribution is approximately bell shaped

(not only normal), then:

1. Approximately 68% of the data will lie within

one standard deviation of the mean. That is,

about 68% of the data will be between µ− σ

and µ+ σ.

2. Approximately 95% of the data will lie within

two standard deviations of the mean.

3. Approximately 99.7% of the data will lie

within three standard deviations of the mean.

For exact values, we need to integrate to find the

area between two points.
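For a distribution that is exactly normal, the integral mentioned above can be expressed with the error function: the proportion within k standard deviations of the mean is erf(k/√2). The sketch below evaluates it with `math.erf` from the Python standard library; the function name is an assumption made for illustration.

```python
import math


def normal_proportion_within(k):
    """Exact proportion of a normal distribution within k standard
    deviations of the mean, i.e. the area between mu - k*sigma and
    mu + k*sigma under the normal density curve."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_proportion_within(k):.4f}")
```

The values come out to about 0.6827, 0.9545 and 0.9973, which the Empirical Rule rounds to 68%, 95% and 99.7%.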


In general, for any distribution, not only the normal distribution, Chebyshev’s rule can be applied:

The proportion of values from a data set that fall within k standard deviations of the mean is at least

$$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$$

where k > 1. This rule can be applied to samples too.
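As an illustration, the sketch below evaluates Chebyshev’s lower bound and checks it against a deliberately skewed data set. The data and the helper names are hypothetical; the population standard deviation from the `statistics` module is used here.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


# Hypothetical, strongly right-skewed data (far from bell shaped).
skewed = [1, 1, 1, 2, 2, 3, 4, 10, 25]

for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    observed = proportion_within(skewed, k)
    print(f"k = {k}: at least {bound:.1%}, observed {observed:.1%}")
```

For k = 2 the bound is 75% and for k = 3 it is about 88.9%. These guarantees are weaker than the Empirical Rule’s 95% and 99.7% because they hold for every distribution, not only bell-shaped ones.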


Finding the area under the normal density curve is not an easy task. It requires a lot of calculus. One way of avoiding this is to use tables that give us these areas (probabilities). But for each µ and σ we would need a new table. How can we avoid this? By transforming all these curves into a single standard one, with µ = 0 and $\sigma^2 = 1$.

Standardizing

Convert other values to standard units, or z-scores, by subtracting the mean and dividing by the standard deviation:

$$z = \frac{x - \mu}{\sigma}$$


Example: Standardize x = −3 with µ = 2 and σ = 4. What z-score range corresponds to (8, 17) with µ = 12 and $\sigma^2 = 9$?
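One way to work this example is sketched below. The function name `z_score` is an assumption; the only step that needs care is that the second part gives the variance, so the standard deviation is its square root, σ = 3.

```python
import math


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


# Part 1: x = -3 with mu = 2 and sigma = 4.
z_single = z_score(-3, 2, 4)

# Part 2: the interval (8, 17) with mu = 12 and variance 9.
sigma = math.sqrt(9)  # the standard deviation is the square root of the variance
z_low = z_score(8, 12, sigma)
z_high = z_score(17, 12, sigma)

print(z_single)
print(z_low, z_high)
```

The first value is z = −5/4 = −1.25. The interval (8, 17) corresponds to z between −4/3 and 5/3, roughly (−1.33, 1.67).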


Interpretation: z is the number of standard

deviations that x is away from the mean.

The z-score is unit free. We can use it to compare

observations from different sources (“apples to

oranges”).

Notation: The standard normal distribution is denoted by N(0, 1), and any other normal distribution with mean µ and standard deviation σ (variance $\sigma^2$) by N(µ, σ).


Relations between variables. Scatter diagrams

In practice, statisticians are interested in relationships among multiple variables. For 2 variables, the data come in pairs, and each pair of values forms one observation.

Sometimes we use the value of one variable in order to predict another variable. The response variable is the variable whose value can be explained by, or is determined by, the value of the explanatory variable. The response variable measures the outcome of a study. An explanatory variable explains or causes changes in the response variable.

Example:


The relationship between two variables could be

represented by crosstabulation, side by side

or clustered bar graphs, and scatterplots.


Definition 5. A scatter diagram is a graph

that shows the relationship between two

quantitative variables measured on the same

individual.

How to draw a scatter diagram:

• The explanatory variable is plotted on the

horizontal axis and the response variable is

plotted on the vertical axis.

• Each individual in the data set is represented by

a point in the scatter diagram.

• Do not connect the points when drawing a

scatter diagram.


How we interpret a scatter diagram

A scatter diagram may suggest:

• a linear relationship

• a nonlinear relationship

• no relationship

Definition 6. Two variables that are linearly related are said to be positively associated if, whenever the values of the predictor variable increase, the values of the response variable also increase, and they are said to be negatively associated if, whenever the values of the predictor variable increase, the values of the response variable decrease.


Be careful! Do not infer causation from association.


Definition 7. The linear correlation coefficient is a measure of the strength of the linear relation between two quantitative variables. The sample correlation coefficient is computed by:

$$r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

where

$\bar{x}$ is the sample mean of the predictor variable,

$s_x$ is the sample standard deviation of the predictor variable,

$\bar{y}$ is the sample mean of the response variable,

$s_y$ is the sample standard deviation of the response variable,

n is the number of individuals in the sample.


The population correlation coefficient is denoted by ρ.

Example: (0, 0), (1, 2), (2, 2), (3, 5), (4, 6)
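The slide gives only the five data points; the sketch below is one way to compute the sample correlation coefficient for them, following the definition term by term (standardize each x and y with its sample mean and sample standard deviation, multiply, sum, divide by n − 1). The function name is an assumption.

```python
import math


def sample_correlation(xs, ys):
    """Sample linear correlation coefficient r, computed from the definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 divisor).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of standardized values, divided by n - 1.
    products = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                   for x, y in zip(xs, ys))
    return products / (n - 1)


points = [(0, 0), (1, 2), (2, 2), (3, 5), (4, 6)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

r = sample_correlation(xs, ys)
print(f"r = {r:.3f}")
```

By hand: x̄ = 2, ȳ = 3, the squared deviations sum to 10 for x and 24 for y, and the products of deviations sum to 15, so r = 15/√240 ≈ 0.968, a strong positive linear relation.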


Interpretation and properties of r

• −1 ≤ r ≤ 1

• If r = 1 there is a perfect positive linear relation

between the 2 variables.

• If r = −1 there is a perfect negative linear

relation between the 2 variables.

• The closer r is to 1 the stronger the evidence of

a positive linear relation and the closer to -1 the

stronger the evidence of negative association

between the two variables.

• If r is close to 0 there is evidence of no linear

relation between the 2 variables. This does not

mean no relation, just no linear relation.


• r is a unitless measure of association.

• r is not resistant. It is strongly affected by outliers.

• Both variables should be quantitative.
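Three of the properties above can be illustrated with small made-up data sets: r near 0 does not rule out a nonlinear relation, r does not change when the units change, and a single outlier can change r drastically. All data and names in this sketch are hypothetical.

```python
import math


def sample_correlation(xs, ys):
    """Sample linear correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    # Algebraically the same as the standardized form: the n - 1 factors cancel.
    return s_xy / math.sqrt(s_xx * s_yy)


# 1. A perfect nonlinear relation (y = x^2) with no linear relation.
xs = [-2, -1, 0, 1, 2]
r_parabola = sample_correlation(xs, [x ** 2 for x in xs])

# 2. Unitless: rescaling x (say, meters to centimeters) leaves r unchanged.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [52, 60, 63, 75, 80]
r_meters = sample_correlation(heights_m, weights_kg)
r_centimeters = sample_correlation([100 * h for h in heights_m], weights_kg)

# 3. Not resistant: one outlier destroys a perfect linear relation.
line_x = [0, 1, 2, 3, 4]
line_y = [0, 1, 2, 3, 4]
r_line = sample_correlation(line_x, line_y)
r_outlier = sample_correlation(line_x + [10], line_y + [0])

print(r_parabola, r_meters, r_centimeters, r_line, r_outlier)
```

In the third case the five collinear points give r = 1, while adding the single point (10, 0) makes r slightly negative.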
