Research Methodology and Statistics Module (fouskakis/Dim-Stats.pdf)

Statistical Sampling. Interdepartmental Postgraduate Programme: Techno-Economic Systems. Dimitris Fouskakis



What do you think about Statistics?


Introduction to Statistics

Why do we need statistics?

Descriptive statistics.

Inferential statistics.


Why do we need statistics?

Why indeed?

“A distinctive function of statistics is this: it enables the scientist to make a numerical evaluation of the uncertainty of his conclusion.” (Snedecor, 1950)


The fundamental problem: sampling

How representative is my sample?


The fundamental problem: sampling

Statistics can tell us how good the chances are that the characteristics of a given sample represent the characteristics of the target population, if each individual of the target population had the same chance of being sampled (assumption of randomness).


The fundamental problem: sampling

Population: the set of all units of interest, described by a random variable X. X follows a distribution f with unknown mean µ and standard deviation σ or, more generally, with an unknown parameter θ.

Random sample: X1, …, Xn are independent and identically distributed random variables (they follow the same distribution as X).

Observed values of the random sample (sample values, sample data): x1, …, xn. These observed values help us make inferences about the population.


Descriptive and inferential statistics

Descriptive statistics: helps to describe the characteristics of a sample.

Inferential statistics: a collection of methods that help to quantify how certain we can be when we make inferences from a given sample.


Types of data

Categorical
• nominal (married, single, divorced, ...)
• ordinal (minimal, moderate, severe, ...)
• binary (success, failure)

Quantitative
• discrete (0, 1, 2, 3, 4, 5, ...), e.g. number of road accidents
• continuous, e.g. height


Descriptive statistics

Measures of location:
- Sample Mean (the sum of all the scores divided by the number of observations).
- Median (the score that lies at the midpoint when the data are ranked in order).
- Mode (the most frequently occurring score).
- Trimmed Mean (some of the largest and smallest observations are removed before calculating the mean).

The sample mean is:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
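The four measures of location can be sketched with Python's standard `statistics` module. The `scores` list is invented for illustration, and `trimmed_mean` is a hypothetical helper (not from the slides) that drops `k` observations from each end.

```python
import statistics


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


# Hypothetical scores, invented for illustration only.
scores = [2, 3, 3, 4, 5, 5, 5, 6, 7, 40]

sample_mean = statistics.mean(scores)      # sum of the scores / number of observations
sample_median = statistics.median(scores)  # midpoint of the ranked data
sample_mode = statistics.mode(scores)      # most frequently occurring score
sample_trimmed = trimmed_mean(scores, 1)   # drops the values 2 and 40 before averaging
```

The single large value 40 pulls the mean well above the median, while the trimmed mean stays close to the bulk of the data.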


Descriptive statistics (continued)

Measures of spread:
- Range (the lowest and highest values).
- Centiles (two values that encompass most rather than all of the data values, e.g. quartiles).
- Standard Deviation (SD) s (the idea is based on averaging the distance each value is from the mean).
- Variance s² (the square of SD).

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}, \qquad \mathrm{SD} = \sqrt{\mathrm{Variance}}$$
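A minimal sketch of the measures of spread, again with the standard `statistics` module. The data are invented; `statistics.variance` and `statistics.stdev` use the n − 1 denominator, and the quartiles come from the module's default ("exclusive") interpolation rule, which is only one of several common conventions.

```python
import statistics

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = (min(data), max(data))           # lowest and highest values
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (default "exclusive" method)
variance = statistics.variance(data)          # sum of squared deviations / (n - 1)
sd = statistics.stdev(data)                   # square root of the variance
```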


Graphical representations of variability

Histogram

Boxplot

Frequency polygon

Stem-and-leaf diagram:
0 | 111
0 | 222333
0 | 445
0 | 666666677
0 | 89
1 | 000000011111111
1 | 22222222222233333333


Estimate the shape of the p.d.f. of X

In order to estimate the shape of the p.d.f. f(x) of X, one can create a frequency table of the sample values x1,…, xn. This is done by dividing up the range of the values of x1,…, xn into a set of intervals. Then create a histogram and use it as an estimate of the shape of the pdf f(x) of X.
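The procedure can be sketched as follows. The equal-width binning and the function names `frequency_table` and `density_estimate` are assumptions made for illustration; the slides do not prescribe how the intervals are chosen.

```python
def frequency_table(values, num_bins):
    """Split the range of the data into equal-width intervals and count values in each."""
    lo, hi = min(values), max(values)
    if hi == lo or num_bins < 1:
        raise ValueError("need a non-degenerate range and at least one bin")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def density_estimate(table, n):
    """Histogram heights: relative frequency divided by interval width."""
    return [c / (n * (right - left)) for left, right, c in table]


# Hypothetical sample values, invented for illustration only.
values = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
table = frequency_table(values, 4)
heights = density_estimate(table, len(values))
```

Dividing by the interval width makes the histogram bars enclose a total area of 1, so the heights can be read as a rough estimate of f(x).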


Estimating probabilities

Suppose that we want to calculate the probability p = P(a ≤ X ≤ b). Let $\hat{p}$ denote the fraction of the sample data x1, …, xn that lie between the values a and b. Then $\hat{p}$ is an estimate of the required probability.
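As a small sketch of this estimate, the function below counts the fraction of observations in [a, b]. The sample values and the name `estimate_probability` are invented for illustration.

```python
def estimate_probability(values, a, b):
    """Fraction of the sample data lying between a and b (inclusive)."""
    if not values:
        raise ValueError("sample must not be empty")
    inside = sum(1 for v in values if a <= v <= b)
    return inside / len(values)


# Hypothetical sample data, invented for illustration only.
sample = [2.1, 3.4, 1.9, 4.8, 3.3, 2.7, 5.2, 3.9, 2.2, 4.1]
p_hat = estimate_probability(sample, 2.0, 4.0)  # 6 of the 10 values lie in [2, 4]
```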


Estimate the mean and the variance

The observed sample mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

can be used to estimate the true mean µ of X.

The observed variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

can be used to estimate the true variance σ² of X.


Sample Mean

The definitions of the observed sample mean and variance pertain to the observed values x1,…, xn. Let us instead look at the problem before the random sample is collected. Recall that before the sample is collected the random variables X1, …, Xn denote the uncertain values that will be obtained from the random sample.


Sample Mean

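Before the sample is collected, the sample mean and the sample variance are RANDOM VARIABLES:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$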


Sample Mean

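$$E(\bar{X}) = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}, \qquad E(S^2) = \sigma^2$$

How good an estimate of the mean µ is the observed sample mean $\bar{x}$? How reliable is this estimate?

$$\bar{X} \sim N(\mu, \sigma/\sqrt{n})$$

approximately, from the Central Limit Theorem (n ≥ 30). Here the second argument, σ/√n, is the standard deviation of the sample mean.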


Example

Berkshire Power Company (BPC) is an electric utility company that provides electric power. BPC has recently implemented a variety of incentive programs to encourage households to conserve energy in the winter months. They would like to estimate the mean µ and standard deviation σ of the distribution of household electricity consumption for January. They take a sample of n = 100 households.


Example

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 3011 \text{ KWH}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = 540483.7 \;\Rightarrow\; s = 735.18 \text{ KWH}$$

Suppose we now choose a different sample of 100 households. How different would the answers be? What if we instead chose n = 10?

Remember that from the Central Limit Theorem $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$.

The standard deviation of the distribution of the sample mean is lower when n is larger.
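The effect of the sample size can be illustrated with a small simulation. This is only a sketch under an assumed population: an exponential distribution with mean 1 and standard deviation 1, chosen purely for illustration and unrelated to the BPC data. Theory then predicts a standard deviation of the sample mean of about 1/√10 ≈ 0.32 for n = 10 and 1/√100 = 0.10 for n = 100.

```python
import random
import statistics


def simulate_sample_means(n, repetitions, rng):
    """Draw `repetitions` random samples of size n and return their sample means."""
    means = []
    for _ in range(repetitions):
        # Assumed population: exponential with rate 1 (mean 1, standard deviation 1).
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


rng = random.Random(0)  # fixed seed so the run is reproducible
means_n10 = simulate_sample_means(10, 2000, rng)
means_n100 = simulate_sample_means(100, 2000, rng)

# Spread of the sample mean across repeated samples, for each sample size.
sd_n10 = statistics.stdev(means_n10)
sd_n100 = statistics.stdev(means_n100)
```

Comparing `sd_n10` with `sd_n100` shows how much more the sample mean varies from sample to sample when n is small.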


Confidence Intervals for the Mean for Large Sample Size

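The observed sample mean $\bar{x}$ will be a more reliable estimate of µ when the sample size n is larger. We can quantify the intuitive notion of the reliability of an estimate by developing the concept of a confidence interval (C.I.).

Consider the following problem: compute the quantity b such that

$$p = P(\mu - b \le \bar{X} \le \mu + b) = 0.95$$

Standardizing gives

$$p = P\left(-\frac{b}{\sigma/\sqrt{n}} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le \frac{b}{\sigma/\sqrt{n}}\right) = 0.95$$

where the middle term is Z ~ N(0,1) for n > 29. Therefore

$$P(-1.96 \le Z \le 1.96) = 0.95 \;\Rightarrow\; P\left(\bar{X} - \frac{1.96\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{1.96\,\sigma}{\sqrt{n}}\right) = 0.95$$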


Confidence Intervals for the Mean for Large Sample Size

If n ≥ 30 then a 95% confidence interval for the mean µ is the interval

$$\left[\bar{x} - \frac{1.96\,s}{\sqrt{n}},\; \bar{x} + \frac{1.96\,s}{\sqrt{n}}\right]$$

Interpretation of a confidence interval: since both the sample mean $\bar{X}$ and the sample variance $S^2$ are random variables, each time we take a random sample we find different values for the observed sample mean $\bar{x}$ and the observed variance $s^2$. This results in a different confidence interval each time we sample. A 95% confidence interval means that 95% of the resulting intervals will contain the actual mean µ.


Confidence Intervals for the Mean for Large Sample Size

In our previous example with the Berkshire Power Company, with a sample size of n = 100, we get the following 95% confidence interval for the true mean:

$$\left[\bar{x} - \frac{1.96\,s}{\sqrt{n}},\; \bar{x} + \frac{1.96\,s}{\sqrt{n}}\right] = [2866.9,\; 3155.1]$$

If instead our sample size were smaller, our uncertainty about the true value of µ would be larger, and thus we should expect a wider confidence interval.
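The interval for the BPC sample can be reproduced with a few lines of Python. The function name `large_sample_ci_95` is an illustrative assumption; the inputs 3011, 735.18 and n = 100 are the values quoted in the example, and the n = 50 call is a hypothetical variation that keeps the same mean and standard deviation.

```python
import math


def large_sample_ci_95(x_bar, s, n):
    """95% confidence interval for the mean, valid for large samples (n >= 30)."""
    if n < 30:
        raise ValueError("the Normal approximation is used here only for n >= 30")
    half_width = 1.96 * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Values quoted in the BPC example.
lower, upper = large_sample_ci_95(3011, 735.18, 100)

# Hypothetical smaller sample with the same mean and standard deviation:
# the interval becomes wider.
lower_50, upper_50 = large_sample_ci_95(3011, 735.18, 50)
```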


Confidence Intervals for the Mean for Large Sample Size

Suppose that $\bar{x}$ is the observed sample mean and $s^2$ is the observed variance. If n ≥ 30 then a β% confidence interval for the mean µ is the interval:

$$\left[\bar{x} - \frac{z_{\alpha/2}\, s}{\sqrt{n}},\; \bar{x} + \frac{z_{\alpha/2}\, s}{\sqrt{n}}\right]$$

where $z_{\alpha/2}$ is such that:

$$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100 \quad\text{and}\quad \alpha = 1 - (\beta/100)$$

For β = 90%, α = 0.10, zα/2 = 1.645
For β = 95%, α = 0.05, zα/2 = 1.960
For β = 98%, α = 0.02, zα/2 = 2.326
For β = 99%, α = 0.01, zα/2 = 2.576

Thus in our previous example with our sample of 100 households a 99% confidence interval for the true mean is:

$$\left[\bar{x} - \frac{2.576\, s}{\sqrt{n}},\; \bar{x} + \frac{2.576\, s}{\sqrt{n}}\right] = [2821.6,\; 3200.3]$$

which is wider than the 95% one.
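The tabulated values of zα/2 can also be obtained from the standard library instead of a printed table. The sketch below uses `statistics.NormalDist` (Python 3.8+); the function names `z_value` and `mean_ci` are illustrative assumptions.

```python
import math
from statistics import NormalDist


def z_value(beta):
    """z_{alpha/2} such that P(-z <= Z <= z) = beta/100 for Z ~ N(0, 1)."""
    alpha = 1 - beta / 100
    return NormalDist().inv_cdf(1 - alpha / 2)


def mean_ci(x_bar, s, n, beta):
    """beta% confidence interval for the mean, for a large sample (n >= 30)."""
    half_width = z_value(beta) * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# The four confidence levels listed above.
z_table = {beta: round(z_value(beta), 3) for beta in (90, 95, 98, 99)}

# 99% interval for the BPC sample of 100 households.
lower_99, upper_99 = mean_ci(3011, 735.18, 100, 99)
```

Because the quantile is computed to full precision rather than rounded to three decimals, the endpoints may differ from the slide's figures in the last printed digit.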


Normal Table


Confidence Intervals for the Mean for Small Sample Size

What if our sample size is less than 30? The procedure for constructing a confidence interval for the true mean is the same as before, but this time

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

follows approximately a t-distribution with k = n − 1 degrees of freedom (this approximation works well only if the Xi are almost Normally distributed).

Thus the β% confidence interval for the true mean is:

$$\left[\bar{x} - \frac{c\, s}{\sqrt{n}},\; \bar{x} + \frac{c\, s}{\sqrt{n}}\right]$$

where c is such that $P(-c \le T \le c) = \beta/100$ and T follows the t-distribution with n − 1 degrees of freedom.


Example

In the Berkshire Power Company example, let's suppose that our sample was from only n = 10 households and gave us an observed sample mean of 3056 KWH and an observed sample standard deviation of 800 KWH. Then a 99% C.I. for the true mean is:

$$\left[\bar{x} - \frac{c\, s}{\sqrt{n}},\; \bar{x} + \frac{c\, s}{\sqrt{n}}\right] = \left[3056 - \frac{3.250 \times 800}{\sqrt{10}},\; 3056 + \frac{3.250 \times 800}{\sqrt{10}}\right]$$

where the value 3.250 can be easily obtained from the tables of the t-distribution with k = 10 − 1 = 9 degrees of freedom and β = 99%.
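The endpoints of this interval can be evaluated numerically. In the sketch below the critical value 3.250 is taken from the example as a constant rather than computed, since the Python standard library has no t-distribution quantile function; the function name `small_sample_ci` is an illustrative assumption.

```python
import math

# Critical value quoted in the example: t-distribution, 9 degrees of freedom, 99%.
T_CRITICAL_99_DF9 = 3.250


def small_sample_ci(x_bar, s, n, c):
    """Confidence interval for the mean using a t critical value c (n - 1 d.o.f.)."""
    half_width = c * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


lower, upper = small_sample_ci(3056, 800, 10, T_CRITICAL_99_DF9)
# The interval is roughly (2233.8, 3878.2) KWH.
```

This is much wider than the large-sample intervals above, reflecting both the smaller n and the larger critical value.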


Student Table


Confidence Interval for the population proportion

Suppose that the National Institute of Health (NIH) would like to estimate the proportion of teenagers who smoke. They randomly sampled 1000 teenagers and found that 253 of them are smokers. Thus the observed sample proportion is $\hat{p} = 253/1000 = 0.253$.

We would like to construct a C.I. for the estimate of the true proportion of teenagers who smoke.

Let X be the number of teenagers in the sample of size n who smoke. Then X ~ B(n, p) and therefore E(X) = np and Var(X) = np(1 − p).

If $\hat{P} = X/n$ is the sample proportion (a random variable) then $E(\hat{P}) = p$ and $\mathrm{Var}(\hat{P}) = p(1-p)/n$.


Confidence Interval for the population proportion

If np ≥ 5 and n(1 − p) ≥ 5 then from the Central Limit Theorem we have that

$$Z = \frac{\hat{P} - p}{\sqrt{\hat{P}(1-\hat{P})/n}}$$

obeys approximately the standard Normal distribution.

Using the above fact we can derive the following result:

If $\hat{p}$ is the observed sample proportion in a sample of size n, and $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$, then a β% C.I. for the population proportion p is:

$$\left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

where $z_{\alpha/2}$ is such that:

$$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100 \quad\text{and}\quad \alpha = 1 - (\beta/100)$$


Confidence Interval for the population proportion

So in our example let's compute a 99% C.I. for the proportion of teenagers who smoke. Note that $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$, and so we can use the preceding method. From the tables of the standard Normal distribution we find that c = 2.576 and thus the required C.I. is:

$$\left[\hat{p} - c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] = \left[0.253 - 2.576\sqrt{\frac{0.253(1-0.253)}{1000}},\; 0.253 + 2.576\sqrt{\frac{0.253(1-0.253)}{1000}}\right]$$

= [0.218, 0.288].
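A minimal sketch of the proportion interval, including the np̂ ≥ 5 and n(1 − p̂) ≥ 5 check. The function name `proportion_ci` is an illustrative assumption, and the critical value comes from `statistics.NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, beta):
    """beta% confidence interval for a population proportion (Normal approximation)."""
    p_hat = successes / n
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("Normal approximation needs n*p >= 5 and n*(1-p) >= 5")
    # Two-sided critical value: P(-z <= Z <= z) = beta/100.
    z = NormalDist().inv_cdf(0.5 + beta / 200)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Teenage smoking example: 253 smokers in a sample of 1000, 99% confidence.
lower, upper = proportion_ci(253, 1000, 99)
```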


Experimental Design for Estimating the Mean µ

The sample size n affects the width of the C.I.

How large should n be in order to satisfy a pre-specified tolerance in the width of the β% C.I.?

Experimental design: let L be the tolerance level, i.e. our estimate $\bar{x}$ is within plus or minus L of the true value µ with probability β/100. Then

$$n = \frac{z_{\alpha/2}^2\, s^2}{L^2}$$

where $z_{\alpha/2}$ is such that:

$$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100 \quad\text{and}\quad \alpha = 1 - (\beta/100)$$


Experimental Design for Estimating the Mean µ

If the value of n computed in the previous expression is less than 30, then we set n=30. One difficulty in using the previous expression is that we have to know the value of the sample standard deviation in advance. However, one can typically obtain a rough estimate of the sample standard deviation s by conducting a small pilot sample first.


Example

Suppose that a marketing research firm wants to conduct a survey to estimate the mean µ of the distribution of the amount spent on entertainment by each adult who visits a certain popular resort. The firm would like to estimate the mean of this distribution to within $120.00 with 95% confidence. From data regarding past operations at the resort, it has been estimated that the standard deviation of entertainment expenditures is no more than $400.00. How large should the sample size be?

$$n = \frac{z_{\alpha/2}^2\, s^2}{L^2} = \frac{1.96^2 \times 400^2}{120^2} = 42.68 \approx 43$$
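The calculation can be sketched in Python, rounding up to the next whole observation and applying the minimum of n = 30 mentioned earlier. The function name `sample_size_for_mean` is an illustrative assumption; the second call uses a hypothetical tolerance of 300 to show the minimum taking effect.

```python
import math


def sample_size_for_mean(s, tolerance, z):
    """Smallest n with z*s/sqrt(n) <= tolerance, but never below 30."""
    n = (z * s / tolerance) ** 2
    return max(math.ceil(n), 30)


# Resort example: s = 400, L = 120, 95% confidence (z = 1.96).
n_resort = sample_size_for_mean(400, 120, 1.96)

# Hypothetical looser tolerance: the raw formula gives fewer than 30, so n is set to 30.
n_loose = sample_size_for_mean(400, 300, 1.96)
```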


Experimental Design for Estimating the Proportion p

Suppose we want a β% C.I. for the proportion with a tolerance level of L. Then we obtain that:

$$n = \frac{c^2\, \hat{p}(1-\hat{p})}{L^2}$$

where c is such that $P(-c \le Z \le c) = \beta/100$, i.e. $c = z_{\alpha/2}$.

The problem with using the above formula directly is that we don't know the value of the observed sample proportion $\hat{p}$ in advance. However it can be easily proved that:

$$\hat{p}(1-\hat{p}) \le \frac{1}{4}$$

Thus if we use the value 1/4 instead of $\hat{p}(1-\hat{p})$ we obtain the "conservative" estimate:

$$n = \frac{z_{\alpha/2}^2}{4L^2}$$
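The bound of 1/4 follows from completing the square. A short derivation, written for a generic proportion p between 0 and 1:

```latex
\frac{1}{4} - p(1-p) \;=\; p^2 - p + \frac{1}{4} \;=\; \left(p - \frac{1}{2}\right)^2 \;\ge\; 0
\quad\Longrightarrow\quad p(1-p) \le \frac{1}{4}
```

Equality holds only at p = 1/2, so the conservative sample size is exact for an evenly split population and larger than necessary otherwise.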


Example

Suppose that a major American television network is interested in estimating the proportion p of American adults who are in favor of a particular national issue such as handgun control. They would like to compute a 95% C.I. whose tolerance level is plus or minus 3%. How many adults would the television network need to poll?

$$n = \frac{z_{\alpha/2}^2}{4L^2} = \frac{1.96^2}{4 \times 0.03^2} = 1{,}067.11 \approx 1{,}068$$

This is a rather remarkable fact. No matter how small or large the proportion we want to estimate is, if we randomly sample 1,068 adults, then in 19 cases out of 20 (95%) the results based on such a sample will differ by no more than 3% in either direction from what would have been obtained by polling all American adults.
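Both versions of the sample-size formula for a proportion can be sketched as follows. The function names are illustrative assumptions, and the guess p = 0.2 in the last call is hypothetical, included only to show that the conservative figure is an upper bound.

```python
import math


def conservative_sample_size(tolerance, z):
    """Sample size using the worst case p(1-p) = 1/4."""
    return math.ceil(z ** 2 / (4 * tolerance ** 2))


def sample_size_with_guess(p_guess, tolerance, z):
    """Sample size when a prior guess of the proportion is available."""
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / tolerance ** 2)


# Television network example: tolerance of 3 percentage points, 95% confidence.
n_poll = conservative_sample_size(0.03, 1.96)

# Hypothetical prior guess p = 0.2: fewer respondents are needed.
n_guess = sample_size_with_guess(0.2, 0.03, 1.96)
```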


Comparing Estimates of the Mean of Two Distributions

Suppose that a national department store chain is considering whether or not to promote its products via a direct-mail promotion campaign. They have chosen two randomly selected groups of consumers, with n1 and n2 consumers in each group. They plan to mail the promotional material to all the consumers in the first group but not to any of the second group. Then they plan to monitor the spending of each consumer in each group in their stores in the coming month in order to estimate the effectiveness of the promotional campaign.

Suppose that the true mean of the first group is µ1 with a standard deviation of σ1, and for the second group µ2 with a standard deviation of σ2. Our objective is to estimate the difference µ1 − µ2. Suppose that we plan to randomly sample n1 observations X1, …, Xn1 from the first population and n2 observations Y1, …, Yn2 from the second population.


Comparing Estimates of the Mean of Two Distributions

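The two sample means then are:

$$\bar{X} = \frac{1}{n_1}\sum_{i=1}^{n_1} X_i, \qquad \bar{Y} = \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i$$

and:

$$E(\bar{X} - \bar{Y}) = \mu_1 - \mu_2, \qquad \mathrm{Var}(\bar{X} - \bar{Y}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$

From the Central Limit Theorem then we have that:

$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1)$$

when n1, n2 ≥ 30.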


Comparing Estimates of the Mean of Two Distributions

If $\bar{x}$, $\bar{y}$ are the two observed sample means and $s_1$, $s_2$ the two observed standard deviations, then the estimate for µ1 − µ2 is the difference between the observed sample means, $\bar{x} - \bar{y}$.

A β% C.I. for the true difference µ1 − µ2 of the two population means is:

$$\left[\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\; \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right]$$

where $z_{\alpha/2}$ is such that:

$$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100 \quad\text{and}\quad \alpha = 1 - (\beta/100)$$


Comparing Estimates of the Mean of Two Distributions

Back in our example, suppose that n1 = 500 and n2 = 400 consumers. Suppose that the observed sample mean of consumer sales in the first group is $387 and in the second group $365, with an observed standard deviation in the first group of $223 and in the second of $274. Let us compute a 98% C.I. for the difference between the means µ1 − µ2 of the distribution of sales between the first group and the second group.

$$\left[\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\; \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right]$$

$$= \left[387 - 365 - 2.326\sqrt{\frac{223^2}{500} + \frac{274^2}{400}},\; 387 - 365 + 2.326\sqrt{\frac{223^2}{500} + \frac{274^2}{400}}\right] = [-\$17.43,\; \$61.43]$$

Because this C.I. contains zero, we are not 98% confident that the promotional campaign will result in any increase in consumer spending.
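The two-sample interval can be checked numerically. In the sketch below the function name `difference_of_means_ci` is an illustrative assumption, and the first-group standard deviation is taken as 223, the value that appears in the worked formula and that reproduces the stated interval to within a few cents.

```python
import math


def difference_of_means_ci(x_bar, y_bar, s1, s2, n1, n2, z):
    """Large-sample confidence interval for mu1 - mu2 (n1, n2 >= 30)."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    difference = x_bar - y_bar
    return difference - z * standard_error, difference + z * standard_error


# Department store example, 98% confidence (z = 2.326).
# Assumption: s1 = 223, as used in the worked formula.
lower, upper = difference_of_means_ci(387, 365, 223, 274, 500, 400, 2.326)

# The interval straddles zero, so an increase in spending is not established.
contains_zero = lower < 0 < upper
```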


Comparing Estimates of the Population Proportion of Two Populations

We need to estimate the difference p1-p2 between the proportions of two independent populations. Suppose we sample from both populations obtaining n1 and n2 observations respectively. Let X denote the number of observations in the first population with the characteristic of interest and Y denote the number of observations in the second population with the characteristic of interest. The sample proportions of the two populations then are:

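$$\hat{P}_1 = \frac{X}{n_1}, \qquad \hat{P}_2 = \frac{Y}{n_2}$$

and:

$$E(\hat{P}_1 - \hat{P}_2) = p_1 - p_2, \qquad \mathrm{Var}(\hat{P}_1 - \hat{P}_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$$

From the Central Limit Theorem then we have that:

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \dfrac{\hat{P}_2(1-\hat{P}_2)}{n_2}}} \sim N(0,1)$$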


Comparing Estimates of the Population Proportion of Two Populations

If the observed sample proportions are $\hat{p}_1$, $\hat{p}_2$ then the estimate for the difference between the proportions p1 − p2 is the difference between the observed sample proportions, $\hat{p}_1 - \hat{p}_2$.

If also $n_1\hat{p}_1,\; n_2\hat{p}_2,\; n_1(1-\hat{p}_1),\; n_2(1-\hat{p}_2) \ge 5$ then a β% C.I. for the difference between the proportions p1 − p2 is:

$$\left[\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\; \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right]$$

where $z_{\alpha/2}$ is such that:

$$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100 \quad\text{and}\quad \alpha = 1 - (\beta/100)$$


Example

In a ten-year study 3,806 middle-aged men with high cholesterol levels but no known heart problems were randomly divided into two equal groups. Members of the first group received a new drug designed to lower cholesterol levels, while the second group received daily doses of a placebo. Besides lowering cholesterol levels, the drug appeared to be effective in reducing the incidence of heart attacks. During the 10 years, 155 of those in the first group had a heart attack, compared to 187 in the second group.

Let p1 denote the proportion of middle-aged men with high cholesterol who will suffer a heart attack within ten years if they receive the new drug, and let p2 denote the proportion of middle-aged men with high cholesterol who will suffer a heart attack within ten years if they do not receive the new drug. Let us compute the 90% C.I. of the difference between the proportions p1 − p2.

Here we have n1 = 1,903, n2 = 1,903 and

$$\hat{p}_1 = 155/1903 = 0.08145, \qquad \hat{p}_2 = 187/1903 = 0.09827$$

For β = 90% we find that c = 1.645. Therefore a 90% C.I. is:


Example

$$\left[\hat{p}_1 - \hat{p}_2 - c\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\; \hat{p}_1 - \hat{p}_2 + c\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right]$$

= [-0.032, -0.0016].

Note that this entire range is less than zero; therefore we are 90% confident that the new drug is effective in reducing the incidence of heart attacks in middle-aged men with high cholesterol.
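The interval for the cholesterol study can be reproduced from the raw counts. The function name `difference_of_proportions_ci` is an illustrative assumption; the critical value 1.645 is the one quoted in the example.

```python
import math


def difference_of_proportions_ci(x, n1, y, n2, z):
    """Large-sample confidence interval for p1 - p2 from two independent samples."""
    p1, p2 = x / n1, y / n2
    # Conditions for the Normal approximation, as stated in the slides.
    if min(n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2)) < 5:
        raise ValueError("each of n*p and n*(1-p) must be at least 5")
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    difference = p1 - p2
    return difference - z * standard_error, difference + z * standard_error


# Cholesterol study: 155 of 1,903 (drug) vs 187 of 1,903 (placebo), 90% confidence.
lower, upper = difference_of_proportions_ci(155, 1903, 187, 1903, 1.645)

# Both endpoints are negative, matching the conclusion drawn in the example.
entirely_below_zero = upper < 0
```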