STT520-420: BIOSTATISTICS ANALYSIS Dr. Cuixian Chen Chapter 4: Data Plot STT520-420 1.


STT520-420: BIOSTATISTICS ANALYSIS

Dr. Cuixian Chen

Chapter 4: Data Plot


Review: STT315 for Population Quantile

Eg1: For Y~Binom(2, 0.5), find a) the 25th percentile of Y; b) the 30th percentile; c) the 60th percentile; d) the 90th percentile.

Eg2: Find the 25th percentile of the distribution with the piecewise-defined cdf F(y) shown on the slide [formula not recoverable from this transcript].

Discussion: In the definition of a percentile, why do we require P(Y ≤ φ_p) ≥ p?
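As a quick check of Eg1, the percentiles of a discrete distribution can be read off its cdf. The following is a minimal sketch in R, assuming the usual definition that the p-th quantile is the smallest value φ_p with P(Y ≤ φ_p) ≥ p; the variable names are illustrative and not taken from the slides.

```r
# cdf of Y ~ Binom(2, 0.5) at its support points 0, 1, 2
support <- 0:2
cdf <- cumsum(choose(2, support) * 0.5^2)      # 0.25, 0.75, 1.00

# p-th quantile = smallest support point whose cdf is at least p
# (a tiny tolerance guards against floating-point ties)
pct <- c(0.25, 0.30, 0.60, 0.90)
by_definition <- sapply(pct, function(p) support[which(cdf >= p - 1e-12)[1]])

# built-in quantile function for comparison
built_in <- qbinom(pct, size = 2, prob = 0.5)

print(rbind(pct, by_definition, built_in))
```

The two rows should agree, which also illustrates why the inequality "≥ p" is needed: for a discrete Y there is usually no value with P(Y ≤ φ_p) exactly equal to p.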

Sample quantile function

Suppose we have obtained data on the survival r.v. Y. How can we tell whether the data follow a certain probability model? See Eg. 2.8, page 32. In general, a QQ-plot and a hazard plot will help.

Def 4.1: Let Sn(y) be the empirical survival function. For 0≤u≤1, the sample quantile function Qn(u) is defined as: Qn(u) = inf{y: Sn(y) ≤ 1-u}.

Eg: For the ranked data 3, 5, 7, 8: a) find Sn(y) and plot the graph; b) find Qn(0.2), Qn(0.6), and Qn(0.9) by definition.
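Def 4.1 can be tried out directly on the four-point example. This is a minimal sketch, not the course's own code: the function names `Sn` and `Qn` simply mirror the notation, and the search is restricted to observed values, which is valid for 0 < u ≤ 1 because Sn only drops at observations.

```r
y <- c(3, 5, 7, 8)          # ranked data from the example
n <- length(y)

# empirical survival function: fraction of observations strictly greater than t
Sn <- function(t) sapply(t, function(s) sum(y > s) / n)

# sample quantile function for 0 < u <= 1:
# smallest observed value with Sn(value) <= 1 - u
Qn <- function(u, tol = 1e-12) {
  ys <- sort(unique(y))
  sapply(u, function(v) ys[which(Sn(ys) <= 1 - v + tol)[1]])
}

Sn(c(0, 3, 5, 7, 8))        # 1.00 0.75 0.50 0.25 0.00
Qn(c(0.2, 0.6, 0.9))        # 3 7 8
```

Base R's `quantile(y, u, type = 1)` is built on the same inverse-of-the-empirical-cdf idea and can serve as a cross-check.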

Review: Empirical survival function with confidence bands for Example 1.3

Drawback: this graph does NOT show the right-continuity property of the empirical survival function! Question: How can we improve it?

Review: Plotting the empirical survivor function (Section 4.1, pages 55-56)

Assume we have sorted data $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$. Starting at Sn(0) = 1,

$$S_n(y) = \frac{\#\{\text{obs} > y\}}{n} = \frac{1}{n}\sum_{i=1}^{n} I(Y_i > y).$$

However, this graph is difficult to draw using software in a way that shows both right-continuity and piecewise linearity. Convention: plot points located in the middle of the "jumps".
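One possible way to make right-continuity visible in R is sketched below. It assumes base R's `stepfun` and its plot method are acceptable tools here; the data are the four-point example used elsewhere in these slides, and the object names are illustrative.

```r
y <- c(3, 5, 7, 8)                      # small illustrative data set
ys <- sort(unique(y))
surv_at_jumps <- sapply(ys, function(s) mean(y > s))   # Sn at each jump point

# stepfun() builds a right-continuous step function by default (right = FALSE):
# value 1 before the first observation, then Sn(y_(j)) on [y_(j), y_(j+1))
Sn <- stepfun(ys, c(1, surv_at_jumps))

# verticals = FALSE drops the vertical connectors; the filled points mark the
# value taken at each jump, which makes the right-continuity visible
plot(Sn, verticals = FALSE, do.points = TRUE, pch = 19,
     xlab = "y", ylab = "Sn(y)", main = "Empirical survival function")
```

Because `Sn` is an ordinary function, it can also be evaluated at any point, for example just before and exactly at an observation.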


Plotting Positions


Convention: Plot points located in the middle of “jumps” on the vertical axis and at the observed data points on the horizontal axis.

Sample quantile function

We call $p_j = \frac{j-0.5}{n}$ the plotting position.

Then the empirical survivor plot for ranked data is the plot of the points $(y_{(j)},\ 1-p_j)$, where $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$.

That means $y_{(j)}$ is the $\frac{100(j-0.5)}{n}$-th sample percentile.

Or $y_{(j)}$ is the $\frac{j-0.5}{n}$-th sample quantile.
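The plotting positions are easy to tabulate. A minimal sketch for the data 3, 5, 7, 8 (the column names are illustrative):

```r
y <- sort(c(3, 5, 7, 8))
n <- length(y)
j <- seq_len(n)

p <- (j - 0.5) / n                    # plotting positions: 0.125 0.375 0.625 0.875
all.equal(p, ppoints(n, a = 0.5))     # ppoints(n, a) computes (j - a)/(n + 1 - 2a)

# each y_(j) is the 100 * p_j-th sample percentile;
# 1 - p_j is the height used in the empirical survivor plot
data.frame(y = y, p = p, percentile = 100 * p, survivor = 1 - p)
```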


Example 4.1, page 58: empirical survival plot using plotting position


## Example 4.1, page 58: review of the empirical survivor function introduced previously
par(mfrow = c(1, 2))
x <- c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
x <- sort(x)
a <- rle(x)
values <- a$values; values    # distinct values among the observations
reps <- a$lengths; reps       # replication count of each distinct value
(f <- table(x))
# We need the fractions to plot the curve, so get the sample size n first
(n <- length(x))
# We want S(0) = 1
surv1 <- 1 - cumsum(f)/n
surv2 <- c(1, surv1); surv2
# Now plot this curve; type = "s" gives a step function
t <- c(0, values)             # t is the vector of x's and surv2 is the vector of y's
plot(t, surv2, type = "s", xlab = "Failure age", ylab = "Relative Frequencies", col = "orange", pch = 15, lwd = 3)   ## or use type = "p" for points
title(main = "Empirical Survivor function")

## Empirical survivor function using plotting positions (sample quantiles)
x <- c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
x <- sort(x); n <- length(x)
p <- 1 - ppoints(n, a = 0.5)  ## or equivalently: p <- 1 - (1:n - .5)/n
plot(x, p, type = "n", xlab = "Failure age", ylab = "quantile")
points(x, p, col = "blue", pch = 15)
title(main = "Empirical Survivor plot using plotting points")

Probability Plots (QQ-plot)

If a probability model fits the data well, then the p-th sample quantile and the p-th population quantile should be in close agreement.

Def 4.4: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points

$$\big(Q(p_j),\ Q_n(p_j)\big) = \left(F^{-1}\left(\frac{j-0.5}{n}\right),\ y_{(j)}\right), \quad j = 1, 2, \dots, n,$$

in the usual rectangular coordinate system. Recall that $p_j = \frac{j-0.5}{n}$ is the plotting position.

Note: In general, the ranked data are plotted on the vertical axis, and the theoretical (population) percentiles from the specified distribution on the horizontal axis.

Probability Plots (QQ-plot) to check Normality

For data: 3, 5, 7, 8.

1. Find the sample quantile for each observation. For example: y(1) = 3 is the (1-0.5)/4 = 0.125th sample quantile.
2. Now assume the data come from a Normal distribution, and find the population quantile from R or your calculator. For example, in R: qnorm(0.125) = ?; or on a calculator: invNorm(0.125) = ?
3. Plot the QQ plot manually.

Recall that the QQ-plot is a plot of the points

$$\big(Q(p_j),\ Q_n(p_j)\big) = \left(F^{-1}\left(\frac{j-0.5}{n}\right),\ y_{(j)}\right), \quad j = 1, 2, \dots, n.$$
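The three steps can be sketched in a few lines of R. This is only an illustration under the assumption that the standard Normal is used for the population quantiles; the variable names are not from the slides.

```r
y <- sort(c(3, 5, 7, 8))            # step 1: ranked data are the sample quantiles
n <- length(y)
p <- (seq_len(n) - 0.5) / n         # plotting positions 0.125, 0.375, 0.625, 0.875

pop_quant <- qnorm(p)               # step 2: standard Normal population quantiles
round(pop_quant, 3)                 # -1.150 -0.319  0.319  1.150

# step 3: population quantiles on the horizontal axis, ranked data on the vertical
plot(pop_quant, y, xlab = "Normal quantile", ylab = "Ranked data")
```

With only four points the plot can suggest, but hardly confirm, Normality; the same recipe works unchanged for larger samples.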

Manual QQ plot for Normality

par(mfrow = c(1, 2))
pop.quant <- c(-1.15, -0.319, 0.319, 1.15)   # qnorm((1:4 - 0.5)/4), rounded
sample.quant <- c(3, 5, 7, 8)
qqplot(pop.quant, sample.quant)
# Built-in version; note that qqnorm() uses ppoints(n), whose default offset is a = 3/8
# for n <= 10, so its horizontal coordinates differ slightly from the manual ones
qqnorm(sample.quant)

# The 100-quantiles are called percentiles. The 4-quantiles are called quartiles.
# We have learned Q1, the median, and Q3, that is, the 25th, 50th, and 75th percentiles.

## QQ-Norm for Normal distribution ##
set.seed(100)
par(mfrow = c(1, 2))   # 1 x 2 pictures on one plot
x <- rnorm(1000, mean = 0, sd = 4)
qqnorm(x); qqline(x, col = 4)

## QQ-Norm for t-distribution ##
y <- rt(1000, df = 4)
qqnorm(y); qqline(y, col = 2)
dev.off()   ## clear the setting of par

# Another example: compare two samples with each other
y <- rt(1000, df = 4)
qqplot(y, rt(300, df = 5))
abline(0, 1)   # reference line y = x; qqline(y) would draw the Normal QQ line of y, which does not belong on this plot


QQ-Norm for Normal and t distribution


Probability Plots

A probability plot provides evidence as to whether (continuous) data have come from a population having a stated probability model S.

In survival analysis, the probability models most often checked are the Exponential and the Weibull.

Survival times are often transformed by logarithms to create a natural measurement scale.

Recall Example 2.1 and Example 2.2, where we used a stem-and-leaf plot and calculated the empirical hazard function to determine whether an Exponential or a Weibull is a possible model.

Probability Plots (QQ-plot)

If the points of the QQ-plot fall close to a 45-degree line through the origin (0, 0) of the plot, then the probability model fits the data well.

Points lying above/below the 45-degree line represent data points that are larger/smaller than the probability model predicts they should be.

If the probability plot looks S-shaped, then only the tails are shorter or longer than they should be.

The QQ-plot for Weibull data is most suitably presented on the logarithmic scale (for each axis). The interpretation of a straight-line fit is then relative to the scale involved.

Review: Constant Hazard model

Consider a simple hazard function, the constant hazard $h(y) = 1/\beta$ for all y≥0, where $\beta > 0$. We have seen that

$$S(y) = \exp\left(-\int_0^y h(u)\,du\right),$$

so if we evaluate this for $h(y) = 1/\beta$, we get

$$S(y) = \exp\left(-\int_0^y \frac{1}{\beta}\,du\right) = \exp(-y/\beta).$$

Since f(y) = -d(S(y))/dy, we have

$$f(y) = -\frac{d}{dy}\big(\exp(-y/\beta)\big) = \frac{1}{\beta}\exp(-y/\beta),$$

the exponential probability density with parameter $\beta$. This means the expected value is $\beta$ and the variance is $\beta^2$.
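The derivation can be checked numerically. The sketch below assumes an illustrative value β = 10 and a handful of arbitrary evaluation points; it compares the numerically integrated hazard with the closed form exp(-y/β) and with R's built-in exponential functions.

```r
beta <- 10                               # illustrative mean lifetime (assumption)
y <- c(0.5, 1, 5, 10, 25)                # a few evaluation points

h <- function(u) rep(1 / beta, length(u))      # constant hazard h(y) = 1/beta

# S(y) = exp(-integral_0^y h(u) du), with the integral done numerically
S_numeric <- sapply(y, function(t) exp(-integrate(h, 0, t)$value))
S_closed  <- exp(-y / beta)                    # closed form
S_builtin <- pexp(y, rate = 1 / beta, lower.tail = FALSE)

cbind(y, S_numeric, S_closed, S_builtin)

# mean and variance of the exponential density with parameter beta
f <- function(u) dexp(u, rate = 1 / beta)
m <- integrate(function(u) u * f(u), 0, Inf)$value          # approximately beta
v <- integrate(function(u) (u - m)^2 * f(u), 0, Inf)$value  # approximately beta^2
c(mean = m, variance = v)
```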

Review: Exponential distribution

From STT315: In R: dexp(x, 1/β); pexp(x, 1/β); qexp(per, 1/β); rexp(N, 1/β).

## Note that R parameterizes the exponential distribution by its rate (1/β),
## not by the mean β that we used in the STT315 class.
set.seed(100)
y <- rexp(10000, 0.1)
mean(y)   ## beta = 10.07184, not 0.1!

Probability Plots (QQ-plot) to check Exponential dist

For data: 3, 5, 7, 8.

1. Find the sample quantile for each observation. For example: y(1) = 3 is the (1-0.5)/4 = 0.125th sample quantile.
2. Now assume the data come from an Exponential distribution, and find the population quantile from R or your calculator. For example, in R: qexp(0.125, 1/mean(y)) = ?
3. Plot the QQ plot manually.

Recall that the QQ-plot is a plot of the points

$$\big(Q(p_j),\ Q_n(p_j)\big) = \left(F^{-1}\left(\frac{j-0.5}{n}\right),\ y_{(j)}\right), \quad j = 1, 2, \dots, n.$$
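A minimal sketch of the same three steps for the Exponential model, assuming (as in the `qexp(0.125, 1/mean(y))` hint) that β is estimated by the sample mean; the variable names are illustrative.

```r
y <- sort(c(3, 5, 7, 8))                 # ranked data
n <- length(y)
p <- (seq_len(n) - 0.5) / n              # plotting positions

beta_hat <- mean(y)                      # fitted exponential mean, 5.75 here
pop_quant <- qexp(p, rate = 1 / beta_hat)   # same as -beta_hat * log(1 - p)

plot(pop_quant, y, xlab = "Exponential quantile", ylab = "Ranked data")
abline(0, 1)                             # 45-degree reference line
```

Because the population quantiles are already on the scale of the data, the 45-degree line through the origin is the natural reference here.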

Exponential Prob Plot for Constant Hazard model

How do we decide whether a set of survival data follows the exponential distribution, that is, whether the hazard is constant?

Look over Example 2.1: 200 randomly generated exponential variables with mean = 100. The distribution has the characteristic skewed shape, with sample mean = 107.5 and sample s.d. = 106.1. (Recall that if Y ~ exp(β), then E(Y) = SD(Y) = β.) The sample stemplot and the sample mean and s.d. approximate the true shape, center, and spread of the exponential.

The estimated hazards (rightmost column) are approximately 0.01 (= 1/100), i.e. constant; see the formula on p. 22 for getting these values.

Another way to check the distribution is to compare the quantiles of the exponential distribution with the sample quantiles in a plot known as a QQ-plot. See R-Quantile plot for a way to compute the quantiles and do the plot. Recall that the p-th quantile of the distribution of a r.v. Y is the value Q such that P(Y ≤ Q) = p. So we must compute the quantiles of the theoretical distribution and compare them (smallest to smallest, next smallest to next smallest, etc.) to the sample quantiles.

Constant Hazard model: Example 2.1, page 21

  The decimal point is 1 digit(s) to the right of the |

   0 | 001234555667799011344455999
   2 | 01133455566777888901222344455677889
   4 | 0122334455566667779013688
   6 | 000012346666679900124689
   8 | 334445883467
  10 | 0082445556779
  12 | 2358880125
  14 | 0144467
  16 | 12457868
  18 | 13493
  20 | 17792369
  22 | 1934446
  24 | 2113
  26 | 05839
  28 | 77
  30 | 66
  32 | 4
  34 | 0
  36 | 6
  38 | 3
  40 |
  42 |
  44 | 78

set.seed(100)
y <- rexp(10000, 0.1)
mean(y)   ## beta = 10.07184, not 0.1!

# Random generation for the exponential distribution with rate `rate` (i.e., mean = 1/rate):
# rexp(n, rate)
# From Example 2.1, we first generate 200 exponential variables with mean = 100.
set.seed(100)
x <- rexp(n = 200, rate = 1/100)
mean(x)   # mean
sd(x)     # standard deviation

# Do a stemplot of the 200 randomly generated exponential variables with beta = 100
stem(x)

# QQ-plot: exponential quantiles (horizontal axis) against the sorted data (vertical axis)
plot(qexp(ppoints(200, a = .5), rate = 1/mean(x)), sort(x)); abline(0, 1)


QQ-plot: Example 2.1, page 21


Check Example 1.3, page 6, to see whether the placebo data come from an Exponential distribution

placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)

Step 1: Fit the placebo data to an exponential with mean = mean(placebo).
# Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.

Step 2: Now draw in the 45-degree line to give you a visual reference to compare the two distributions. Why this line?
# The function in R that draws lines on plots is abline(intercept, slope), so we want
abline(0, 1)

Step 3: What do you think? Does the placebo dataset follow an exponential distribution?

# Try other QQ-plots by randomly generating various distributions and then checking to see whether they follow various theoretical distributions.

Check Example 1.3, page 6, to see whether the placebo data come from an Exponential distribution

placebo <- c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
n <- length(placebo); n
# The first plot we can do is the so-called quantile plot:
plot(ppoints(n, a = .5), sort(placebo))
# This shows the distribution of placebo; e.g. you can see the median, the quartiles, etc.

# To see whether a set of data is following a particular distribution, plot the quantiles
# of the data against the quantiles of the specific distribution.
# Recall that there are many distributions built in to R: qexp, qnorm, etc.
# Try help(qexp) to see the general format of these functions.

# We'll fit our placebo distribution to an exponential with mean = mean(placebo)
plot(qexp(ppoints(n, a = .5), rate = 1/mean(placebo)), sort(placebo))
# Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.
# Now draw in the 45-degree line to give you a visual reference to compare the two distributions. Why this line?
# The function in R that draws lines on plots is abline(intercept, slope), so we want
abline(0, 1)
# What do you think? Does the placebo dataset follow an exponential distribution?
# Try other QQ-plots by randomly generating various distributions and then checking
# to see whether they follow various theoretical distributions.
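To follow the suggestion of trying other QQ-plots, one option is to wrap the exponential QQ-plot in a small function and feed it simulated samples. The helper `exp_qq` below is hypothetical (it is not part of the slides or of any package), and the seed, sample sizes, and distribution parameters are arbitrary choices for illustration.

```r
# Hypothetical helper: exponential QQ-plot with the 45-degree reference line.
# The exponential mean is estimated by the sample mean, as in the placebo example.
exp_qq <- function(x, main = "") {
  n <- length(x)
  theo <- qexp(ppoints(n, a = 0.5), rate = 1 / mean(x))
  plot(theo, sort(x), xlab = "Exponential quantile", ylab = "Ranked data", main = main)
  abline(0, 1)
  invisible(data.frame(theoretical = theo, sample = sort(x)))
}

set.seed(1)                                   # arbitrary seed for reproducibility
par(mfrow = c(1, 3))
q_exp  <- exp_qq(rexp(100, rate = 1 / 10), main = "Exponential data")
q_unif <- exp_qq(runif(100, 0, 20), main = "Uniform data")
q_wei  <- exp_qq(rweibull(100, shape = 3, scale = 10), main = "Weibull (shape 3) data")
```

Comparing the three panels side by side gives a feel for how departures from the 45-degree line look when the exponential model does not hold.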


Check Example 1.3, page 6 to see whether it comes from an Exponential distribution


Review: Power Hazard model

Power Hazard:

$$h(y) = \frac{\alpha}{\beta^{\alpha}}\, y^{\alpha-1}, \quad \text{for } y \ge 0;\ \alpha, \beta > 0.$$

Note this is of the form (constant)·y^(constant), and if $\alpha = 1$ this reduces to the constant hazard we just considered.

Note that

$$S(y) = \exp\left(-\left(\frac{y}{\beta}\right)^{\alpha}\right)$$

and so

$$f(y) = \frac{\alpha}{\beta^{\alpha}}\, y^{\alpha-1} \exp\left(-\left(\frac{y}{\beta}\right)^{\alpha}\right), \quad y \ge 0.$$
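The power hazard model is the Weibull distribution with shape α and scale β, which is also how R parameterizes `dweibull` and `pweibull`. The sketch below checks this numerically; the parameter values and evaluation points are arbitrary assumptions.

```r
h <- function(y, alpha, beta) (alpha / beta^alpha) * y^(alpha - 1)   # power hazard
S <- function(y, alpha, beta) exp(-(y / beta)^alpha)                 # survival function
f <- function(y, alpha, beta) h(y, alpha, beta) * S(y, alpha, beta)  # density f = h * S

y <- c(1, 5, 10, 20, 40)                 # evaluation points (arbitrary)
alpha <- 2.5; beta <- 20                 # illustrative shape and scale

all.equal(S(y, alpha, beta), pweibull(y, shape = alpha, scale = beta, lower.tail = FALSE))
all.equal(f(y, alpha, beta), dweibull(y, shape = alpha, scale = beta))

# with alpha = 1 the hazard is the constant 1/beta
h(y, alpha = 1, beta = beta)             # all equal to 1/20 = 0.05
```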

Weibull Prob Plots

Recall the power hazard model:

$$S(y) = \exp\left(-\left(\frac{y}{\beta}\right)^{\alpha}\right).$$

Substitute $y = y_{(j)}$, then take the logarithm:

$$\log_e S(y_{(j)}) = -\left(\frac{y_{(j)}}{\beta}\right)^{\alpha}.$$

Take the logarithm to base 10 again to create the log-life variable:

$$\log_{10}\left[-\log_e S(y_{(j)})\right] = \alpha\left[\log_{10} y_{(j)} - \log_{10}\beta\right].$$

We now have:

$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\log_{10}\left[-\log_e S(y_{(j)})\right].$$

Note: base-10 logs are traditionally used for lifetimes, but the natural log is equivalent up to a difference of scale.

Weibull Prob Plots

If the data fit the Weibull model:

$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\log_{10}\left[-\log_e\left(1 - \frac{j-0.5}{n}\right)\right].$$

The Weibull probability plot is the plot of the points

$$\left(\log_{10}\left[-\log_e\left(1 - \frac{j-0.5}{n}\right)\right],\ \log_{10} y_{(j)}\right), \quad j = 1, 2, \dots, n.$$

It follows a straight line with slope $1/\alpha$ and intercept $\log_{10}\beta$.

Recall: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points

$$\big(Q(p_j),\ Q_n(p_j)\big) = \left(F^{-1}\left(\frac{j-0.5}{n}\right),\ y_{(j)}\right), \quad j = 1, 2, \dots, n.$$

Summary: To check graphically if a distribution is Weibull, do essentially a QQ-plot on the log-log scale (see Section 4.4, p. 61-63). The key formulas are:

$$S(y) = \exp\left(-(y/\beta)^{\alpha}\right) \tag{4.2}$$

Now substitute the ordered data and take natural logs:

$$\ln S(y_{(j)}) = -\left(\frac{y_{(j)}}{\beta}\right)^{\alpha} \tag{4.2a}$$

Take logs again:

$$\log_{10}\left(-\ln S(y_{(j)})\right) = \alpha\left[\log_{10} y_{(j)} - \log_{10}\beta\right] \tag{4.2b}$$

Write as a linear equation:

$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\log_{10}\left[-\ln S(y_{(j)})\right] \tag{4.3}$$

$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\log_{10}\left[-\ln\left(1 - \frac{j-0.5}{n}\right)\right] \tag{4.3a}$$

So plot the points in (4.4) and look for a straight line: the slope will equal $1/\alpha$ and the intercept will equal $\log_{10}(\beta)$.

Example 4.2 on page 62.

Ranked data: 10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35. [Note: n=17]

Step 1: take y = log10 of the ranked data;

Step 2: find the plotting positions: $p_j = \frac{j-0.5}{n}$;

Step 3: find x = $\log_{10}\left[-\log_e\left(1 - \frac{j-0.5}{n}\right)\right]$;

Step 4: plot the Weibull probability plot:

$$\left(\log_{10}\left[-\log_e\left(1 - \frac{j-0.5}{n}\right)\right],\ \log_{10} y_{(j)}\right), \quad j = 1, 2, \dots, n.$$

Example 4.2, page 62

# Now let's do the Weibull probability plot as described in Chapter 4 on p. 61-63.
# Formula (4.4) shows that the slope is 1/shape and the intercept is log10(scale)
# when you do a Weibull probability plot. Let's try it with the data in Ex. 4.2.
lifetime <- c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)
# Now take logs base 10 and compute the values needed in formula (4.4)
y <- log10(lifetime)
n <- length(lifetime)
k <- (1:n - .5)/n   ## or k <- ppoints(n, a = 0.5)
x <- log10(-log(1 - k))
# Now plot the x's and y's and put in a vertical line at x = 0 to see the intercept
plot(x, y)
abline(v = 0, lty = 2)
# Similarly find the slope graphically, or use regression
lm(y ~ x)   # from the output we see intercept = 1.3813 and slope = 0.2660
abline(lm(y ~ x), col = "blue")   ## regression line
# So notice that the intercept of the regression line is 1.38,
# so log10(beta) = 1.38, which implies beta = 10^1.38 = 23.98833,
# and slope = 0.2660, which implies alpha = 1/.2660 = 3.75938

More about Weibull Prob Plots

The Weibull probability plot follows a straight line with slope $1/\alpha$ and intercept $\log_{10}\beta$.

Standard least-squares regression can provide estimates of the slope and intercept.

If the slope $1/\alpha$ is estimated to be close to 1, then it is reasonable to assume the data come from an Exponential model.
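The "slope close to 1" rule can be illustrated with a least-squares fit on the Weibull plot. The helper `weibull_plot_fit` below is hypothetical (not from the slides); it is applied first to an idealized exponential sample made of exact exponential quantiles with mean 100, an assumption chosen so that the fitted line is known in advance, and then to the data of Example 4.2.

```r
# Hypothetical helper: least-squares fit on the Weibull probability plot
weibull_plot_fit <- function(lifetime) {
  n <- length(lifetime)
  p <- (seq_len(n) - 0.5) / n          # plotting positions
  x <- log10(-log(1 - p))              # horizontal coordinate
  y <- log10(sort(lifetime))           # vertical coordinate: log-lifetimes
  est <- unname(coef(lm(y ~ x)))
  c(intercept = est[1], slope = est[2],
    alpha = 1 / est[2], beta = 10^est[1])
}

# idealized exponential sample with mean 100: the exact exponential quantiles
n <- 50
ideal <- qexp((seq_len(n) - 0.5) / n, rate = 1 / 100)
weibull_plot_fit(ideal)                # slope 1 and intercept log10(100) = 2, up to rounding

# data from Example 4.2 for comparison
lifetime <- c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)
weibull_plot_fit(lifetime)
```

For the Example 4.2 data the fitted slope is well below 1 (α well above 1), so an Exponential model would not be a natural choice there.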

Cumulative Hazard plot

A hazard plot is a plot of the points $(y_{(j)},\ \Lambda_j)$, where

$$\Lambda_j = \sum_{i=1}^{j} \frac{1}{n-i+1},$$

which are called the (cumulative) hazard plot scores.

For the Exponential model, $h(y) = 1/\beta$, so the cumulative hazard is $\Lambda(y) = y/\beta$.

If the data come from an exponential distribution, then the hazard plot will be a straight line. It is a simpler diagnostic than the probability plot.
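A minimal sketch of a cumulative hazard plot for the placebo data used earlier in these slides. It assumes the scores are the partial sums of 1/(n - i + 1) and that the ranked data go on the horizontal axis; the axis order and the names `haz_score` and `y` are choices made for this illustration.

```r
# placebo data from Example 1.3, as used earlier in these slides
placebo <- c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)
y <- sort(placebo)
n <- length(y)

# cumulative hazard plot scores: sum_{i=1}^{j} 1/(n - i + 1), j = 1, ..., n
haz_score <- cumsum(1 / (n:1))

plot(y, haz_score, xlab = "Ranked data", ylab = "Cumulative hazard score")
# under an exponential model the points should scatter around a straight line
# through the origin with slope 1/beta; beta is estimated here by the sample mean
abline(0, 1 / mean(y), lty = 2)
```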