Applied Statistics And Doe Mayank

Applied Statistics and DOE

Mayank

Applied Statistics

Measures of central tendency (central position of data): Mean, Median, Mode
Mean: Population: µ   Sample: x̄

Measures of dispersion (spread of data): Variance, Standard deviation, Coefficient of variation
Variance: Population: σ²   Sample: s²
Standard deviation: Population: σ   Sample: s

Measures of Central tendency

Data: 34, 43, 81, 106, 106 and 115

Mean: average, Σx/n = 485/6 = 80.83
Mode: value with the highest frequency = 106
Median: middle score = (81 + 106)/2 = 93.5

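Measures of dispersion

Data (x): 44, 50, 38, 49, 42, 47, 40, 39, 46, 50   (n = 10)
Mean: $\bar{x} = 44.5$

x      x - x̄    (x - x̄)²
44     -0.5     0.25
50      5.5     30.25
38     -6.5     42.25
49      4.5     20.25
42     -2.5     6.25
47      2.5     6.25
40     -4.5     20.25
39     -5.5     30.25
46      1.5     2.25
50      5.5     30.25

Sum of squares: $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 188.5$

Variance: $MS = s^2 = \dfrac{SS}{n-1} = \dfrac{188.5}{9} = 20.9$

Standard deviation: $s = \sqrt{MS} = 4.57$

Most of the data lies within 44.5 ± 4.57, i.e. roughly 40 to 49.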

Coefficient of variation: $CV = \dfrac{s}{\bar{x}} \times 100\%$

CV = 4.577/44.5 × 100% = 10.28%

The standard deviation is 10.28% of the mean.
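The measures of central tendency and dispersion worked through above can be checked with a minimal Python sketch. It assumes only the standard-library `statistics` module and the two data sets quoted on the slides; the variable names are illustrative.

```python
import statistics

# Data sets quoted on the slides
central = [34, 43, 81, 106, 106, 115]
spread = [44, 50, 38, 49, 42, 47, 40, 39, 46, 50]

# Measures of central tendency
mean_c = statistics.mean(central)
median_c = statistics.median(central)
mode_c = statistics.mode(central)

# Measures of dispersion (sample formulas, divisor n - 1)
mean_s = statistics.mean(spread)
var_s = statistics.variance(spread)   # SS / (n - 1)
sd_s = statistics.stdev(spread)       # square root of the variance
cv_s = sd_s / mean_s * 100            # coefficient of variation in percent

print(f"mean={mean_c:.2f} median={median_c} mode={mode_c}")
print(f"variance={var_s:.2f} sd={sd_s:.2f} CV={cv_s:.2f}%")
```

`statistics.pvariance` and `statistics.pstdev` would give the population versions (divisor n) instead.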


Normal Distribution

Example: IQ Score

Grade             Score
Genius            >145
Gifted            130-144
Above average     115-129
Higher average    100-114
Lower average     85-99
Below average     70-84
Borderline low    55-69
Low               <55

[Figure: IQ score distribution; x-axis: Score (<55, 55, 70, 85, 100, 115, 130, 145, >145); y-axis: Count]


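Normal Distribution

Probability in each band on either side of the mean, by standard deviations (Sd) from µ:

µ to ±1σ        34.13%
±1σ to ±2σ      13.59%
±2σ to ±3σ      2.14%
±3σ to ±4σ      0.13%
±4σ to ±5σ      0.0031%
±5σ to ±6σ      0.000028%

[Figure: normal curve; x-axis: Score, in Sd from µ (-6σ to 6σ); y-axis: Probability]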


Normal Distribution

Probability within ±kσ of µ:

µ ± 1σ    68.2689%
µ ± 2σ    95.4499%
µ ± 3σ    99.7300%
µ ± 4σ    99.9936%
µ ± 5σ    99.999942669%
µ ± 6σ    99.999999802%

[Figure: normal curve; x-axis: Sd from µ (-6σ to 6σ)]
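The coverage percentages for a normal distribution follow from the error function: the probability of lying within k standard deviations of the mean is erf(k/√2). A minimal sketch using only the standard library (the function name `coverage` is illustrative):

```python
import math

def coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in range(1, 7):
    print(f"within ±{k}σ: {coverage(k) * 100:.9f}%")
```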


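Normal Distribution: Six Sigma

Specification limits: LSL (lower specification limit) and USL (upper specification limit), placed at µ ± 6σ.

Within ±6σ: 99.999999802%
Outside ±6σ (defects): 0.000000198 DPHO (defects per hundred opportunities) = 0.00198 DPMO (defects per million opportunities)

[Figure: normal curves of different spread between LSL and USL]

With a 1.5σ shift of the process mean, the defect rate becomes 3.4 DPMO.

[Figure: normal curve shifted by 1.5σ between LSL and USL; x-axis: -6σ to 6σ around µ]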
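The defect rates quoted for a six sigma process can be reproduced as tail areas of the standard normal distribution. The sketch below is an illustration under two assumptions: the specification limits sit at ±`sigma_level` standard deviations, and the process mean drifts towards the upper limit by `shift` standard deviations. The function name `dpmo` is hypothetical.

```python
from statistics import NormalDist

def dpmo(sigma_level: float, shift: float = 0.0) -> float:
    """Defects per million opportunities for limits at ±sigma_level and a mean shift."""
    std = NormalDist()  # standard normal: mean 0, sd 1
    below_lsl = std.cdf(-sigma_level - shift)
    # Upper tail written as a lower tail (symmetry) to avoid 1 - cdf rounding loss
    above_usl = std.cdf(-(sigma_level - shift))
    return (below_lsl + above_usl) * 1_000_000

print(f"centred process:  {dpmo(6.0):.5f} DPMO")
print(f"1.5 sigma shift:  {dpmo(6.0, 1.5):.2f} DPMO")
```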

Statistical significance tests

Significance tests:
Z-test
t-test
F-test
ANOVA

Z-test: Z-value

How many standard deviations away from the mean?

+ve z: the value is above the mean
-ve z: the value is below the mean

Population, 1 point compared to population: $z = \dfrac{x_i - \mu}{\sigma}$
Population, group compared to population: $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Sample: $z = \dfrac{x_i - \bar{x}}{s}$

Z-test: Sample

Sample: BMI
Mean ($\bar{x}$) = 26.20
Standard deviation (s) = 6.57

A person with a BMI of 19.2 has a z score of:

$z = \dfrac{x_i - \bar{x}}{s} = \dfrac{19.2 - 26.20}{6.57} = -1.07$

So this person has a BMI 1.07 standard deviations below the mean.

What is the probability of a person having a BMI below 19.2 (more than 1.07 sd below the mean), or above 19.2?

[Figure: normal curve; x-axis: Sd (Z score), with -1σ and µ (0) marked; y-axis: Probability. At -1σ (BMI 19.6), about 16% of people lie below 19.6 and about 84% lie above.]
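The z score and the matching tail probabilities for the BMI example can be sketched as follows. This assumes, as the slide does, that BMI is normally distributed with the sample mean and standard deviation given; the variable names are illustrative.

```python
from statistics import NormalDist

mean_bmi = 26.20   # sample mean from the slide
sd_bmi = 6.57      # sample standard deviation from the slide
bmi = 19.2

# Number of standard deviations between this BMI and the mean
z = (bmi - mean_bmi) / sd_bmi

# Area under the standard normal curve to the left and right of z
p_below = NormalDist().cdf(z)
p_above = 1 - p_below

print(f"z = {z:.2f}")
print(f"P(BMI < {bmi}) = {p_below:.1%}, P(BMI > {bmi}) = {p_above:.1%}")
```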

Z-test: Population

Test group: Employees having a two-wheeler
Test: Commuting time from home to Biocon
Claim: Average commuting time is less than 24 min

Samples: 30

18 16 23 19 25 48 13 17 20 23
16 21 18 16 29 15  8 19 20  7
15 16 24 15  6 11 14 23 18 12

At the 0.01 level of significance (α = 0.01): is there enough evidence to support the research claim?

Assumption: Population is normally distributed

[Figure: normal curve; x-axis: Score, with the sample mean x̄ and the value 24 marked; y-axis: Probability]

Hypothesis testing

Comparison of means: Test vs Population (µ = 24)

Null hypothesis, H0: x̄ ≥ µ   No difference (claim not true)
Alternate hypothesis, H1: x̄ < µ   It is different (claim is true)

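Level of significance: α = 0.01
Critical value: Zcritical = -2.33

Rejection region: Ztest < Zcritical
Acceptance region: Ztest > Zcritical

[Figure: standard normal curve; x-axis: Z value, with 0 and the critical value -2.33 marked; y-axis: Probability; the rejection region is the tail below -2.33]

Test statistic: $z = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

x̄ = 18.2, s = 7.7, µ = 24, n = 30

$z = \dfrac{18.2 - 24}{7.7 / \sqrt{30}} = -4.13$

Ztest (-4.13) < Zcritical (-2.33), so the test value lies in the rejection region.

H0: x̄ ≥ 24   Rejected
H1: x̄ < 24   Significantly supported

So is the test value significantly different from (lower than) the mean?
Yes: there is significant evidence to reject the null hypothesis and therefore accept the claim.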
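The commuting-time test can be reproduced with a short Python sketch. It follows the slides in using the sample standard deviation in place of the population value and a one-tailed (lower) critical value at α = 0.01; variable names are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

# Commuting times (min) for the 30 sampled employees
times = [18, 16, 23, 19, 25, 48, 13, 17, 20, 23,
         16, 21, 18, 16, 29, 15, 8, 19, 20, 7,
         15, 16, 24, 15, 6, 11, 14, 23, 18, 12]

mu0 = 24       # claimed upper bound on the mean commuting time
alpha = 0.01   # level of significance

x_bar = mean(times)
s = stdev(times)
z_test = (x_bar - mu0) / (s / math.sqrt(len(times)))

# Lower-tail critical value: H1 is "mean < 24"
z_critical = NormalDist().inv_cdf(alpha)
reject_h0 = z_test < z_critical

print(f"x_bar = {x_bar:.1f}, s = {s:.1f}")
print(f"z_test = {z_test:.2f}, z_critical = {z_critical:.2f}, reject H0: {reject_h0}")
```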


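t-test: Comparison of means between two groups

$t = \dfrac{\text{Signal}}{\text{Noise}} = \dfrac{\text{Difference between group means}}{\text{Variability of groups}} = \dfrac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$

where $s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

ttest > tcritical: null hypothesis will be rejected
ttest < tcritical: null hypothesis will not be rejected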


t-test: Case 1
Effect of fertilizer on plant height

Plant height:

Fertilizer        35 40 12 15 21 14 46 10 28 48 16 30 32 48 31 22 12 39 19 25
w/o Fertilizer     2 27 38 31 11 19 11 34 10 11 12 15 22 11 12 12 12 29 37  2

         Fertilizer    w/o Fertilizer
x̄        27.15         17.9
s²       156.45        122.61
n        20            20

$t_{test} = \dfrac{27.15 - 17.9}{\sqrt{\dfrac{156.45}{20} + \dfrac{122.61}{20}}} = 2.48$

tcritical with df = 2n - 2 = 38 at the 0.05 significance level = 2.03

ttest > tcritical

H0: x̄1 = x̄2   Rejected
H1: x̄1 ≠ x̄2

So x̄1 is significantly different from x̄2.

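t-test: Case 2

Plant height:

Fertilizer         2 27 38 31 100 115 11 34 10 11 12 15 22 11 12 12 12 29 37  2
w/o Fertilizer     2 27 38 31  11  19 11 34 10 11 12 15 22 11 12 12 12 29 37  2

         Fertilizer    w/o Fertilizer
x̄        27.15         17.9
s²       881.0         122.61
n        20            20

ttest = 1.3
tcritical = 2.03

ttest < tcritical

H0: x̄1 = x̄2   Not rejected
H1: x̄1 ≠ x̄2

So x̄1 is not significantly different from x̄2. The group means are the same as in Case 1, but the much larger variability in the fertilizer group lowers t.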
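Both fertilizer cases can be checked with a minimal sketch of the signal-to-noise form of the t statistic. The helper name `t_statistic` is illustrative, and the critical value is simply the one quoted on the slides rather than computed.

```python
import math
from statistics import mean, variance

def t_statistic(group_1, group_2):
    """t = (mean_1 - mean_2) / sqrt(s1^2/n1 + s2^2/n2), using sample variances."""
    standard_error = math.sqrt(
        variance(group_1) / len(group_1) + variance(group_2) / len(group_2)
    )
    return (mean(group_1) - mean(group_2)) / standard_error

without_fertilizer = [2, 27, 38, 31, 11, 19, 11, 34, 10, 11,
                      12, 15, 22, 11, 12, 12, 12, 29, 37, 2]
fertilizer_case_1 = [35, 40, 12, 15, 21, 14, 46, 10, 28, 48,
                     16, 30, 32, 48, 31, 22, 12, 39, 19, 25]
fertilizer_case_2 = [2, 27, 38, 31, 100, 115, 11, 34, 10, 11,
                     12, 15, 22, 11, 12, 12, 12, 29, 37, 2]

T_CRITICAL = 2.03  # value quoted on the slides for 38 df at the 0.05 level

t_case_1 = t_statistic(fertilizer_case_1, without_fertilizer)
t_case_2 = t_statistic(fertilizer_case_2, without_fertilizer)

for label, t in [("Case 1", t_case_1), ("Case 2", t_case_2)]:
    print(f"{label}: t = {t:.2f}, reject H0: {abs(t) > T_CRITICAL}")
```

The two cases share the same difference in means; only the noise term in the denominator changes.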


t - test Overview

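F-test: Comparison of variances

$F = \dfrac{s_1^2}{s_2^2}$, where $s_1^2$ and $s_2^2$ are the sample variances

The F hypothesis test is defined as:

H0: $\sigma_1^2 = \sigma_2^2$
Ha: $\sigma_1^2 < \sigma_2^2$, or $\sigma_1^2 > \sigma_2^2$, or $\sigma_1^2 \neq \sigma_2^2$

If Ftest > Fcritical (at the chosen significance level): H0 is rejected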
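As an illustration of the variance ratio, the sketch below computes F and its degrees of freedom for two samples. The sample values are the plant heights from the first fertilizer example in the t-test slides, reused here only as example input. The critical value would still come from an F table, which the slides do not give for this case.

```python
from statistics import variance

def f_statistic(sample_1, sample_2):
    """Return (F, df1, df2), where F = s1^2 / s2^2 is the ratio of sample variances."""
    f = variance(sample_1) / variance(sample_2)
    return f, len(sample_1) - 1, len(sample_2) - 1

fertilizer = [35, 40, 12, 15, 21, 14, 46, 10, 28, 48,
              16, 30, 32, 48, 31, 22, 12, 39, 19, 25]
without_fertilizer = [2, 27, 38, 31, 11, 19, 11, 34, 10, 11,
                      12, 15, 22, 11, 12, 12, 12, 29, 37, 2]

f_value, df_1, df_2 = f_statistic(fertilizer, without_fertilizer)
print(f"F = {f_value:.3f} with ({df_1}, {df_2}) degrees of freedom")
```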

ANOVA: ANalysis Of VAriance

One way: effect of one factor (variable)
Two way: effect of two factors (variables), and effect of their interaction

Strategy: compare the variability between groups (MSbg) to the variability within groups (MSwg)

$F = \dfrac{MS_{bg}}{MS_{wg}}$

[Figure: distributions of Group 1 and Group 2, illustrating between-groups and within-groups variability]


One way ANOVA

Is there any impact of exam room temperature on student performance?

Factor (independent variable): Temperature (cold, optimum, hot)
Effect (dependent variable): Score (marks obtained)

Null hypothesis (H0): No effect (µ1 = µ2 = µ3)
Alternate hypothesis (H1): There is an effect (at least one mean differs from the others)


One way ANOVA

Scores (n = 12 per group):

              Cold     Opt      Hot
              55       75       67
              60       65       53
              51       80       65
              65       75       49
              72       67       54
              65       68       61
              55       77       65
              72       83       72
              68       67       63
              60       56       64
              75       65       54
              67       83       65

x̄             63.75    71.75    61
Σ(x - x̄)²     638      768      524
(x̄ - X̄)²      3.06     39.06    20.25

Grand mean: X̄ = Σx̄/3 = 65.5

Between groups:
SSbg = n × Σ(x̄ - X̄)² = 12 × (3.06 + 39.06 + 20.25) = 748.5   (df = 3 - 1 = 2)
MSbg = SSbg/df = 748.5/2 = 374.25

Within groups:
SSwg = 638 + 768 + 524 = 1930   (df = (12 x 3) - 3 = 33)
MSwg = SSwg/df = 1930/33 = 58.5

[Figure: bar chart for the Cold (C), Optimum (O) and Hot (H) groups]


One way ANOVA

$F = \dfrac{MS_{bg}}{MS_{wg}} = \dfrac{374.25}{58.5} = 6.40$

Fcritical for
Numerator degrees of freedom: 2
Denominator degrees of freedom: 33
At significance level (α): 0.05
= 3.28

Ftest > Fcritical

So there is enough evidence to reject the null hypothesis.

H0: All means are the same (no effect of temperature)   Rejected

At the 95% confidence level we can say that the variation between the means is not just by chance.

Examination room temperature matters significantly.
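The one-way ANOVA calculation for the exam room example can be sketched in plain Python. The scores are the 36 values from the slides, read as three groups of 12; this grouping reproduces the slide's group means of 63.75, 71.75 and 61. The function name `one_way_anova` is illustrative.

```python
from statistics import mean

groups = {
    "cold":    [55, 60, 51, 65, 72, 65, 55, 72, 68, 60, 75, 67],
    "optimum": [75, 65, 80, 75, 67, 68, 77, 83, 67, 56, 65, 83],
    "hot":     [67, 53, 65, 49, 54, 61, 65, 72, 63, 64, 54, 65],
}

def one_way_anova(groups):
    """Return (SS_bg, SS_wg, MS_bg, MS_wg, F) for a dict of group name -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Between groups: squared distance of each group mean from the grand mean
    ss_bg = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Within groups: squared distance of each value from its own group mean
    ss_wg = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_bg = len(groups) - 1
    df_wg = len(all_values) - len(groups)
    ms_bg = ss_bg / df_bg
    ms_wg = ss_wg / df_wg
    return ss_bg, ss_wg, ms_bg, ms_wg, ms_bg / ms_wg

ss_bg, ss_wg, ms_bg, ms_wg, f_test = one_way_anova(groups)
print(f"SSbg = {ss_bg:.2f}, SSwg = {ss_wg:.2f}")
print(f"MSbg = {ms_bg:.2f}, MSwg = {ms_wg:.2f}, F = {f_test:.2f}")
```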


Two way ANOVA

Factors (independent variables): 1) Gender  2) Type of sport
Effect (dependent variable): Number of participants

Relative impact of gender or type of sport?
Any interaction between gender and type of sport?

Null hypothesis (H0a): No effect of gender
Null hypothesis (H0b): No effect of type of sport
Null hypothesis (H0c): No interaction
Alternate hypothesis (H1): There is an effect


Two way ANOVA

Number of participants (3 observations per cell):

            Man              Woman
Indoor      30, 40, 50       60, 70, 80
Outdoor     140, 150, 160    5, 10, 15

ANOVA table (g = number of gender levels, s = number of sport levels, k = observations per cell):

Source    Df               SS          MS          F
Gender    g-1              SSG         MSG         MSG/MSwithin
Sport     s-1              SSS         MSS         MSS/MSwithin
G x S     (g-1)(s-1)       SSGxS       MSGxS       MSGxS/MSwithin
Within    (k-1) x g x s    SSwithin    MSwithin

Source    Df    SS       MS       F         Fcritical (α = 0.01)
Gender    1     9075     9075     111.69    11.3
Sport     1     1875     1875     23.08     11.3
G x S     1     21675    21675    266.77    11.3
Within    8     650      81.25


Two way ANOVA

Mean number of participants:

            Woman    Man
Indoor      70       40
Outdoor     10       150

[Figure: chart of mean number of participants (y-axis 0 to 160) for Man and Woman, Indoor vs Outdoor]

Null hypothesis (H0a): No effect of gender   Rejected
Null hypothesis (H0b): No effect of type of sport   Rejected
Null hypothesis (H0c): No interaction   Rejected
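The sums of squares for the gender and sport example can be recomputed from the raw observations with a minimal sketch. It assumes a balanced design (the same number of observations in every cell) and the cell layout read from the slide: rows Indoor/Outdoor, columns Man/Woman. The function and key names are illustrative.

```python
from statistics import mean

# cells[(gender, sport)] -> observations (number of participants)
cells = {
    ("man", "indoor"): [30, 40, 50],
    ("woman", "indoor"): [60, 70, 80],
    ("man", "outdoor"): [140, 150, 160],
    ("woman", "outdoor"): [5, 10, 15],
}

def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction; returns sums of squares and F ratios."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    k = len(next(iter(cells.values())))  # observations per cell (balanced design assumed)
    grand = mean([v for obs in cells.values() for v in obs])
    a_mean = {a: mean([v for (x, _), obs in cells.items() if x == a for v in obs])
              for a in a_levels}
    b_mean = {b: mean([v for (_, y), obs in cells.items() if y == b for v in obs])
              for b in b_levels}
    ss_a = k * len(b_levels) * sum((m - grand) ** 2 for m in a_mean.values())
    ss_b = k * len(a_levels) * sum((m - grand) ** 2 for m in b_mean.values())
    ss_ab = k * sum((mean(obs) - a_mean[a] - b_mean[b] + grand) ** 2
                    for (a, b), obs in cells.items())
    ss_within = sum((v - mean(obs)) ** 2 for obs in cells.values() for v in obs)
    df_within = len(a_levels) * len(b_levels) * (k - 1)
    ms_within = ss_within / df_within
    # Each effect has 1 degree of freedom only for a 2 x 2 layout; divide generally
    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    return {
        "ss_gender": ss_a, "ss_sport": ss_b, "ss_interaction": ss_ab,
        "ss_within": ss_within, "ms_within": ms_within,
        "f_gender": (ss_a / df_a) / ms_within,
        "f_sport": (ss_b / df_b) / ms_within,
        "f_interaction": (ss_ab / (df_a * df_b)) / ms_within,
    }

result = two_way_anova(cells)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Each F ratio is then compared with the critical value for its degrees of freedom, as in the ANOVA table.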


Two way ANOVA

Factors (independent variables): 1) Temperature: 30 °C, 35 °C  2) pH: 5, 7
Effect (dependent variable): Total product (g)

          30 °C    35 °C
pH 7      70       50
pH 5      10       150

[Figure: chart of total product (y-axis 0 to 160) at 30 °C and 35 °C for pH 5 and pH 7]

Regression and correlation

Regression analysis: investigation of relationship between variables

X      Y
2      48
19     30
34     17.5
40     11
8      41
12     42
20     35
20     31
37     18
19     35
30     16
46     8.3

[Figure: scatter plot of Y against X (x-axis 0 to 50) with fitted straight line]

Fitted line (y = ax + b): y = -0.951x + 50.49, R² = 0.955

Linear

Simple linear regression (one independent variable): $y = ax + b$
Multiple linear regression: $y = a_1 x_1 + a_2 x_2 + a_3 x_3 + b$

Non linear

$y = a_1 x_1 + a_2 x_2 + a_{11} x_1^2 + a_{12} x_1 x_2 + b$

Correlation analysis:

To find how well (or badly) a line fits the observations

What is the strength of this relationship? r² (coefficient of determination) or adjusted r²
Is the relationship we have described statistically significant? Significance tests

$\hat{y} = ax + b$   (a: slope, b: intercept)

$\hat{y}$ = predicted value
$y_i$ = true value
ε = residual error = $y - \hat{y}$

The values of a and b are calculated so as to minimize the sum of squares (SS) of the residuals: $\sum (y - \hat{y})^2$ is a minimum.
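For a single predictor, the slope and intercept that minimize the residual sum of squares have a closed form: a = Sxy/Sxx and b = ȳ - a·x̄. A minimal sketch, using the X, Y pairs from the regression slide (the function name `least_squares` is illustrative):

```python
from statistics import mean

# X, Y pairs from the regression slide
x = [2, 19, 34, 40, 8, 12, 20, 20, 37, 19, 30, 46]
y = [48, 30, 17.5, 11, 41, 42, 35, 31, 18, 35, 16, 8.3]

def least_squares(x, y):
    """Slope a and intercept b of the line y = a*x + b minimizing sum((y - y_hat)^2)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b

a, b = least_squares(x, y)
print(f"y = {a:.3f}x + {b:.2f}")
```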



$r^2 = 1 - \dfrac{SS_{Error}}{SS_{Total}}$

$\text{Adjusted } r^2 = 1 - \dfrac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)}$

$SS_{Total} = \sum (y_i - \bar{y})^2$ (total error), $SS_{Error} = \sum (y - \hat{y})^2$
n = total observations, p = number of predictors

r²: coefficient of determination
Always between 0 and 1
Increases with the number of predictors

Adjusted r²:
It can also be negative
Truer representation of relationship strength
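The two formulas can be evaluated directly for the straight-line example. This sketch takes the slope and intercept quoted on the regression slide (-0.951 and 50.49) as given, rather than refitting them, and uses p = 1 predictor; variable names are illustrative.

```python
# X, Y pairs and fitted coefficients from the regression slide
x = [2, 19, 34, 40, 8, 12, 20, 20, 37, 19, 30, 46]
y = [48, 30, 17.5, 11, 41, 42, 35, 31, 18, 35, 16, 8.3]
a, b = -0.951, 50.49   # slope and intercept as quoted on the slide
p = 1                  # number of predictors

n = len(y)
y_bar = sum(y) / n

# Total variation of y around its mean, and variation left after the fit
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_error = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))

r2 = 1 - ss_error / ss_total
adjusted_r2 = 1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))

print(f"r2 = {r2:.3f}, adjusted r2 = {adjusted_r2:.3f}")
```

The adjusted value is lower because it penalizes the degrees of freedom used by the model.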

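Correlation analysis: Statistical significance of relationship

ANOVA (between groups vs within groups): $F = \dfrac{MS_{bg}}{MS_{wg}}$
Regression (model vs error): $F = \dfrac{MS_{Model}}{MS_{Error}}$

[Figure: Group 1 and Group 2 distributions (between and within groups) alongside Model and Error variation]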

Design of experiment

Traditional method: One factor at a time (OFAT)
Statistical method: Multiple factors at a time (MFAT)

How to select a design?

Number of factors    Screening                                  Optimization                        Robustness
2-4                  Full or fractional factorial               Central composite or Box-Behnken    Taguchi
5 or more            Fractional factorial or Plackett-Burman    Screen first to reduce factors      Taguchi


Design of experiment - terminology

Factors: independent variable/s
  Continuous: numeric, any value between a lower and an upper value, eg. temperature, pH, concentration
  Categorical: numeric/non-numeric, only characters or levels, eg. gender, operator, type, temperature

Levels: range of a factor/s: -1 (lower), 0 (middle), +1 (higher)

Effects: dependent variable/s (response)
  Main effect/s: effect/s due to individual factor/s
  Interaction effect/s: effect/s due to interaction of multiple factors

Confounding/Aliasing: when two or more effects cannot be distinguished, eg. a main effect is confounded with interaction effects; main effects and interaction effects are aliased

Resolution of a design (power of a design)

Resolution type    Order of interaction effects confounded with main effect    Experiment type
III                2 (eg. A with BC)                                            Screening
IV                 3 (eg. A with BCD)                                           Optimization
V                  4 (eg. A with BCDE)                                          Optimization

Higher order interactions are usually less significant than lower order interactions.

Full factorial: L^f (L = number of levels, f = number of factors)

No. of levels    No. of factors    Design type    Number of experiments
2                2                 2²             2x2 = 4
2                3                 2³             2x2x2 = 8
3                2                 3²             3x3 = 9
3                3                 3³             3x3x3 = 27
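A full factorial design is simply every combination of factor levels, which is why the run count is L^f. A minimal sketch with the standard library follows; the function name `full_factorial` and the 0-based level coding are illustrative choices.

```python
from itertools import product

def full_factorial(levels: int, factors: int):
    """All level combinations for `factors` factors, each at `levels` levels (coded 0..levels-1)."""
    return list(product(range(levels), repeat=factors))

for levels, factors in [(2, 2), (2, 3), (3, 2), (3, 3)]:
    runs = full_factorial(levels, factors)
    print(f"{levels}^{factors} design: {len(runs)} experiments")
```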


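Factorial design

2² design: 4 experiments
2³ design (factors a, b, c): 8 experiments
3² design: 9 experiments
3³ design: 27 experiments

[Figures: design points drawn on a square (2 factors) or a cube (3 factors)]

Fractional Factorial design

2³ full factorial: 8 experiments
2³⁻¹ fractional factorial: 4 experiments

[Figure: cube with axes a, b, c showing the 4 selected corner points]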
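One common way to build the 4-run half fraction of a 2³ design is to run a full 2² design in two factors and set the third from their product. The generator C = A·B used below is an assumption for illustration; the slides do not say which generator was used.

```python
from itertools import product

def half_fraction_2_3():
    """2^(3-1) design: full factorial in A and B, with C set by the assumed generator C = A*B."""
    return [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

full_design = list(product((-1, 1), repeat=3))   # 2^3 = 8 runs
half_design = half_fraction_2_3()                # 2^(3-1) = 4 runs

print(f"full: {len(full_design)} runs, half fraction: {len(half_design)} runs")
for run in half_design:
    print(run)
```

With this generator, C cannot be separated from the A x B interaction, which is the confounding that the resolution of a design describes.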


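Response surface methodology

Geometry of some important response surface designs

Box-Behnken, eg. 3 factors, 3 levels: 12 experiments
Central composite design, eg. 2 factors, 2 levels: factorial points + axial (star) points = central composite design

[Figures: Box-Behnken points on the edges of a cube; central composite design built from a square of factorial points plus star points]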
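The 12 runs of the 3-factor Box-Behnken design can be generated by taking each pair of factors through its four ±1 combinations while the remaining factor is held at its middle level (0). The sketch below covers this 3-factor case only and leaves out centre points, which practical designs usually add. The function name is illustrative.

```python
from itertools import combinations, product

def box_behnken_3():
    """Edge-midpoint runs of a 3-factor Box-Behnken design, coded -1 / 0 / +1."""
    runs = []
    for i, j in combinations(range(3), 2):          # each pair of factors
        for a, b in product((-1, 1), repeat=2):     # 2^2 factorial in that pair
            run = [0, 0, 0]                         # remaining factor stays at the middle level
            run[i], run[j] = a, b
            runs.append(tuple(run))
    return runs

design = box_behnken_3()
print(f"{len(design)} experiments")
for run in design:
    print(run)
```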


Taguchi design

Inner array: controllable variables during production (signal), eg. media, pH, feed rate
Outer array: uncontrollable variables during production (noise), eg. Temp, DO


