STT520-420: BIOSTATISTICS ANALYSIS
Dr. Cuixian Chen
Chapter 4: Data Plot
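Review: STT315 for Population Quantile
Eg1: For Y ~ Binom(2, 0.5), find a) the 25th percentile of Y; b) the 30th percentile; c) the 60th percentile; d) the 90th percentile.
Eg2: Find the 25th percentile of the distribution with CDF
$$F(y) = \begin{cases} 0, & y < 0 \\ y/4, & 0 \le y < 2 \\ y^2/8, & 2 \le y < 2\sqrt{2} \\ 1, & y \ge 2\sqrt{2} \end{cases}$$
Discussion: In the definition of the percentile, why do we require P(Y ≤ φp) ≥ p?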
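The percentiles asked for in Eg1 can be checked in R. This is a minimal sketch; it relies on `qbinom` returning the smallest y with P(Y ≤ y) ≥ p, which is the percentile definition under discussion. The variable names `probs`, `pctl`, and `cdf` are illustrative only.

```r
# Percentiles of Y ~ Binom(2, 0.5), as asked in Eg1
probs <- c(0.25, 0.30, 0.60, 0.90)
pctl <- qbinom(probs, size = 2, prob = 0.5)

# CDF at the support points 0, 1, 2, to see where each percentile lands
cdf <- pbinom(0:2, size = 2, prob = 0.5)

print(data.frame(p = probs, percentile = pctl))
print(cdf)
```

Comparing each p with the printed CDF values shows why the inequality "≥ p" is needed for a discrete r.v.: the CDF jumps over most values of p.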
Sample quantile function
Suppose we have obtained data on the survival r.v. Y. How can we tell whether the data follow a certain probability model? See Eg. 2.8, page 32. In general, a QQ-plot and a hazard plot will help.
Def 4.1: Let Sn(y) be the empirical survival function. For 0 ≤ u ≤ 1, the sample quantile function Qn(u) is defined as: Qn(u) = inf{y: Sn(y) ≤ 1-u}.
Eg: For the following ranked data: 3, 5, 7, 8, a) find Sn(y) and plot the graph; b) find Qn(0.2), Qn(0.6), and Qn(0.9) by definition.
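Def 4.1 can be turned into a few lines of R. This is only a sketch: the helper names `Sn` and `Qn` are chosen here for illustration, and the search is restricted to the observed values because Sn(y) only changes at data points (so it is meant for 0 < u ≤ 1).

```r
# Ranked data from the example
y <- c(3, 5, 7, 8)

# Empirical survival function: proportion of observations strictly greater than t
Sn <- function(t, data) mean(data > t)

# Sample quantile function Qn(u) = inf{ y : Sn(y) <= 1 - u }, for 0 < u <= 1.
# The infimum is attained at one of the observed values.
Qn <- function(u, data) {
  candidates <- sort(unique(data))
  s <- sapply(candidates, Sn, data = data)
  candidates[which(s <= 1 - u)[1]]
}

q <- sapply(c(0.2, 0.6, 0.9), Qn, data = y)
print(q)
```

Working the same three cases by hand from the graph of Sn(y) is a useful check on the definition.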
Review: Empirical survival function with confidence bands for Example 1.3
Drawback: this graph does NOT show the right-continuity property of the empirical survival function! Question: How can we improve it?
Review: Plot Empirical survivor function (Section 4.1, pages 55-56)
Assume we have sorted data $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$. Starting at Sn(0) = 1:
$$S_n(y) = \frac{\#\{\text{obs} > y\}}{n} = \frac{1}{n}\sum_{i=1}^{n} I(Y_i > y).$$
However, this graph is difficult to draw using software in a way that shows both right-continuity and piecewise linearity.
Plotting Positions
Convention: Plot points located in the middle of the "jumps" on the vertical axis and at the observed data points on the horizontal axis.
Sample quantile function
We call $p_j = (j - 0.5)/n$ the plotting position.
Then the empirical survivor plot for ranked data is the plot of the points $(y_{(j)}, 1 - p_j)$, where $y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}$.
That means $y_{(j)}$ is the $100(j - 0.5)/n$-th sample percentile.
Or $y_{(j)}$ is the $(j - 0.5)/n$-th sample quantile.
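The plotting positions are easy to compute directly. A minimal sketch, using the small data set 3, 5, 7, 8 purely as an illustration: it compares the formula (j - 0.5)/n with R's built-in `ppoints(n, a = 0.5)`.

```r
y <- sort(c(3, 5, 7, 8))   # illustrative ranked data
n <- length(y)
j <- 1:n

p <- (j - 0.5) / n                 # plotting positions p_j
p_builtin <- ppoints(n, a = 0.5)   # same values from the built-in helper
surv <- 1 - p                      # heights used in the empirical survivor plot

print(data.frame(y = y, p = p, surv = surv))
```

Each height 1 - p_j sits halfway up the jump of the empirical survivor function at y(j).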
Example 4.1, page 58: empirical survival plot using plotting position
## Example 4.1, page 58: review of the empirical survivor function introduced previously
par(mfrow=c(1, 2))
x=c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
x<-sort(x)
a<-rle(x)
values<-a$values; values   # distinct values from the observations
reps<-a$lengths; reps      # replication count for each distinct value
(f<-table(x))
# We need the fractions to plot the curve, so get the sample size n first
(n=length(x))
# We want S(0)=1
surv1=1-cumsum(f)/n
surv2=c(1,surv1); surv2
# Now let's plot this curve; use type="s" to get a step function
t=c(0, values)   # t is the vector of x's and surv2 is the vector of y's
plot(t,surv2,type="s",xlab="Failure age",ylab="Relative Frequencies", col="orange", pch=15, lwd=3)   ## or use type="p" for points
title(main = "Empirical Survivor function")
############################################################
## Empirical survivor function using plotting positions (sample quantiles)
x=c(33,8,5,12,41,8,11,39,16,14,89,81,28,88,1,29,3,46,1,54,194,24,2,39,22,36,89,106,52)
x<-sort(x); n<-length(x)
p=1-ppoints(n, a=0.5)   ## or equivalently: p=1-(1:n-.5)/n
plot(x, p, type="n",xlab="Failure age",ylab="quantile")
points(x,p,col = "blue", pch=15)
title(main = "Empirical Survivor plot using plotting points")
Probability Plots (QQ-plot)
If a probability model fits the data well, then the p-th sample quantile and p-th population quantile should be in close agreement.
Def 4.4: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points
$$(Q(p_j),\, Q_n(p_j)) = \left(F^{-1}\!\left(\frac{j - 0.5}{n}\right),\; y_{(j)}\right), \quad j = 1, 2, \dots, n,$$
in the usual rectangular coordinate system.
Note: In general, the ranked data are plotted on the vertical axis, and the theoretical (population) percentiles from the specified distribution on the horizontal axis.
Probability Plots (QQ-plot) to check Normality
For data: 3, 5, 7, 8.
1. Find the sample quantile for each observation. For example, y(1) = 3 is the (1-0.5)/4 = 0.125th sample quantile.
2. Now assume the data come from a Normal distribution, and find the population quantile from R or your calculator. For example, in R: qnorm(0.125) = ?; or on a calculator: invNorm(0.125) = ?
3. Plot the QQ plot manually, i.e. plot the points $(F^{-1}((j - 0.5)/n),\; y_{(j)})$, $j = 1, 2, \dots, n$.
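The three steps above can be sketched in R as follows. This is one possible way to do the manual calculation, with the standard Normal taken as the reference distribution; the variable names are illustrative.

```r
sample.quant <- sort(c(3, 5, 7, 8))
n <- length(sample.quant)

# Step 1: plotting positions (j - 0.5)/n
p <- (1:n - 0.5) / n

# Step 2: population quantiles under a standard Normal model
pop.quant <- qnorm(p)
print(round(pop.quant, 3))

# Step 3: the manual QQ plot (theoretical on the horizontal axis)
plot(pop.quant, sample.quant,
     xlab = "Normal population quantile", ylab = "Ranked data")
```

The plot should agree with what `qqnorm(sample.quant)` draws, since `qqnorm` uses the same plotting positions for small samples.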
Manual QQ plot for Normality
par(mfrow=c(1,2))
pop.quant=c(-1.15, -0.319, 0.319, 1.15)
sample.quant=c(3, 5, 7, 8)
qqplot(pop.quant, sample.quant)
qqnorm(sample.quant)
############################################################
# The 100-quantiles are called percentiles. The 4-quantiles are called quartiles.
# We have learned Q1, the median, and Q3, that is, the 25th, 50th, and 75th percentiles.
############################################################
## QQ-Normal for Normal distribution ##
set.seed(100)
par(mfrow=c(1,2))   # 1 x 2 pictures on one plot
x=rnorm(1000, mean=0, sd=4)
qqnorm(x); qqline(x, col = 4)
## QQ-Norm for t-distribution ##
y <- rt(1000, df = 4)
qqnorm(y); qqline(y, col = 2)
dev.off()   ## clear the setting of par
# Another example: compare two samples with each other
y <- rt(1000, df = 4)
qqplot(y, rt(300, df = 5))
abline(0, 1)   # 45-degree reference line (qqline(y) would compare y with Normal quantiles instead)
QQ-Norm for Normal and t distribution
Probability Plots
A probability plot provides evidence as to whether (continuous) data have come from a population having a stated probability model S.
In survival analysis, the probability models most often checked are the Exponential and the Weibull.
Survival times are often transformed by logarithms to create a natural measurement scale.
Recall Example 2.1 and Example 2.2, where we used a stem-and-leaf plot and calculated the empirical hazard function to determine whether an Exponential or a Weibull model is possible.
Probability Plots (QQ-plot)
If the points plotted in the QQ-plot fall close to a 45-degree line through the origin (0, 0) of the plot, then the probability model fits the data well.
Points in the plot lying above/below the 45-degree line represent data points that are larger/smaller than the probability model predicts they should be.
If the probability plot has an "S" shape, then only the tails are shorter or longer than they should be.
The QQ-plot of Weibull data is most suitably presented on the logarithmic scale (for each axis). The interpretation of a straight-line fit is then relative to the scale involved.
Review: Constant Hazard model
Consider a simple hazard function, the constant hazard $h(y) = \lambda$ for all y ≥ 0. Here we assume $\lambda = 1/\beta$, where $\beta > 0$. We have seen that
$$S(y) = \exp\left(-\int_0^y h(u)\,du\right),$$
so if we evaluate this for $h(y) = 1/\beta$, we get
$$S(y) = \exp\left(-\int_0^y \frac{1}{\beta}\,du\right) = \exp(-y/\beta).$$
Since f(y) = -d(S(y))/dy, we have
$$f(y) = -\frac{d}{dy}\exp(-y/\beta) = \frac{1}{\beta}\exp(-y/\beta),$$
the exponential probability density with parameter $\beta$. This means the expected value is $\beta$ and the variance is $\beta^2$.
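The constant-hazard derivation can be checked numerically. The sketch below assumes an arbitrary illustrative value β = 10 and a few arbitrary time points; it integrates the hazard numerically and compares the result with the closed forms and with R's `pexp` and `dexp` (which take the rate 1/β).

```r
beta <- 10                      # illustrative mean parameter (an assumption)
y <- c(0.5, 2, 5, 20)           # illustrative time points

# Constant hazard h(u) = 1/beta, vectorized for integrate()
hazard <- function(u) rep(1 / beta, length(u))

# S(y) = exp(-integral_0^y h(u) du), evaluated numerically
S_numeric <- sapply(y, function(t) exp(-integrate(hazard, 0, t)$value))

S_closed <- exp(-y / beta)                       # closed form for S(y)
f_closed <- (1 / beta) * exp(-y / beta)          # closed form for f(y)

S_R <- pexp(y, rate = 1 / beta, lower.tail = FALSE)
f_R <- dexp(y, rate = 1 / beta)

print(cbind(y, S_numeric, S_closed, S_R, f_closed, f_R))
```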
Review: Exponential distribution
From STT315: In R: dexp(x, 1/β); pexp(x, 1/β); qexp(per, 1/β); rexp(N, 1/β).
## Note that R parameterizes the exponential distribution by the rate 1/β, not by the mean β that we used in the STT315 class.
set.seed(100)
y=rexp(10000, 0.1)
mean(y)   ## beta = 10.07184, not 0.1!
Probability Plots (QQ-plot) to check Exponential dist
For data: 3, 5, 7, 8.
1. Find the sample quantile for each observation. For example, y(1) = 3 is the (1-0.5)/4 = 0.125th sample quantile.
2. Now assume the data come from an Exponential distribution, and find the population quantile from R or your calculator. For example, in R: qexp(0.125, 1/mean(y)) = ?
3. Plot the QQ plot manually, i.e. plot the points $(F^{-1}((j - 0.5)/n),\; y_{(j)})$, $j = 1, 2, \dots, n$.
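The same manual steps for the Exponential model can be sketched as below. The mean β is estimated by the sample mean, as in the `qexp(0.125, 1/mean(y))` hint; the closed form used for comparison comes from solving 1 - exp(-q/β) = p for q. Variable names are illustrative.

```r
y <- sort(c(3, 5, 7, 8))
n <- length(y)
p <- (1:n - 0.5) / n            # plotting positions

beta_hat <- mean(y)             # estimated mean of the exponential model
pop.quant <- qexp(p, rate = 1 / beta_hat)

# Closed form of the exponential quantile: q = -beta * log(1 - p)
closed_form <- -beta_hat * log(1 - p)

plot(pop.quant, y, xlab = "Exponential population quantile", ylab = "Ranked data")
abline(0, 1)                    # 45-degree reference line
```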
Exponential Prob Plot for Constant Hazard model
How do we decide whether a set of survival data follows the exponential distribution, that is, whether the hazard is constant?
Look over Example 2.1: 200 randomly generated exponential variables with mean = 100. Characteristic skewed distribution; sample mean = 107.5, sample s.d. = 106.1. (Recall that if Y ~ exp(β) then E(Y) = SD(Y) = β.) The sample stemplot and the sample mean and s.d. approximate the true shape, center, and spread of the exponential.
The estimated hazards (rightmost column) are approximately 0.01 (= 1/100), i.e. constant. See the formula on p. 22 for getting these values.
But another way to check the distribution is to compare the quantiles of the exponential distribution with the sample quantiles in a plot known as a qqplot. See R-Quantile plot for a way to compute the quantiles and do the plot. Recall that the p-th quantile of the distribution of a r.v. Y is the value Q s.t. P(Y<=Q)=p. So we must compute the quantiles of the theoretical distribution and compare them (smallest to smallest, next smallest to next smallest, etc.) to the sample quantiles.
Constant Hazard model: Example 2.1, page 21
  The decimal point is 1 digit(s) to the right of the |

   0 | 001234555667799011344455999
   2 | 01133455566777888901222344455677889
   4 | 0122334455566667779013688
   6 | 000012346666679900124689
   8 | 334445883467
  10 | 0082445556779
  12 | 2358880125
  14 | 0144467
  16 | 12457868
  18 | 13493
  20 | 17792369
  22 | 1934446
  24 | 2113
  26 | 05839
  28 | 77
  30 | 66
  32 | 4
  34 | 0
  36 | 6
  38 | 3
  40 |
  42 |
  44 | 78

set.seed(100)
y=rexp(10000, 0.1)
mean(y)   ## beta = 10.07184, not 0.1!
############################
# Random generation for the exponential distribution with rate `rate` (i.e., mean=1/rate):
# rexp(n, rate)
# From Example 2.1, we first generate 200 exponential variables with mean=100.
set.seed(100)
x<-rexp(n=200,rate=1/100)
mean(x)   # mean
sd(x)     # standard deviation
# Do a stemplot of the 200 randomly generated exponential variables with beta=100
stem(x)

QQ-plot: Example 2.1, page 21
plot(qexp(ppoints(200,a=.5),rate=1/mean(x)), sort(x)); abline(0,1)
Check Example 1.3, page 6, to see whether the placebo data come from an Exponential distribution
placebo<-c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
Step 1: Fit the placebo data to an exponential with mean = mean(placebo).
# Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.
Step 2: Now draw in the 45-degree line to give you a visual reference to compare the two distributions. Why this line?
# The function in R that draws lines on plots is abline(intercept, slope), so we want
abline(0,1)
Step 3: What do you think? Does the placebo dataset follow an exponential distribution?
# Try other qqplots by randomly generating various distributions and then checking to see whether they follow various theoretical distributions.
Check Example 1.3, page 6, to see whether it comes from an Exponential distribution
placebo<-c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
n=length(placebo); n
# The first plot we can do is the so-called quantile plot (the sample quantile function):
plot(ppoints(n,a=.5),sort(placebo))
# This shows the distribution of placebo; e.g. you can see the median, the quartiles, etc.
############################################################
# To see whether a set of data is following a particular distribution, plot the quantiles of the data against the quantiles of the specific distribution.
# Recall that there are many distributions built in to R: qexp, qnorm, etc. Try help(qexp) to see the general format of these functions.
############################################################
# We'll fit our placebo distribution to an exponential with mean = mean(placebo)
plot(qexp(ppoints(n,a=.5),rate=1/mean(placebo)), sort(placebo))
# Note that the theoretical distribution goes on the horizontal axis and the sorted data go on the vertical axis.
# Now draw in the 45-degree line to give you a visual reference to compare the two distributions. Why this line?
# The function in R that draws lines on plots is abline(intercept, slope), so we want
abline(0,1)
# What do you think? Does the placebo dataset follow an exponential distribution?
# Try other qqplots by randomly generating various distributions and then checking to see whether they follow various theoretical distributions.
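One way to try the suggestion of QQ-plotting randomly generated samples is sketched below. The helper `exp_qq_points` is hypothetical (it is not part of R or of the course code), and the two simulated samples, their sizes, and the seed are arbitrary choices made for illustration.

```r
# Hypothetical helper: points of an exponential QQ-plot for a sample x,
# with the exponential mean estimated by mean(x)
exp_qq_points <- function(x) {
  n <- length(x)
  p <- ppoints(n, a = 0.5)
  data.frame(theoretical = qexp(p, rate = 1 / mean(x)),
             sample = sort(x))
}

set.seed(1)
sim_exp  <- rexp(50, rate = 1 / 10)          # truly exponential, mean 10
sim_unif <- runif(50, min = 0, max = 20)     # not exponential, same mean

par(mfrow = c(1, 2))
q_exp <- exp_qq_points(sim_exp)
plot(q_exp$theoretical, q_exp$sample, main = "Exponential sample")
abline(0, 1)
q_unif <- exp_qq_points(sim_unif)
plot(q_unif$theoretical, q_unif$sample, main = "Uniform sample")
abline(0, 1)
```

Comparing the two panels gives a feel for how much scatter around the 45-degree line is normal for data that really are exponential, and what a systematic departure looks like.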
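Review: Power Hazard model
Power Hazard:
$$h(y) = \frac{\alpha}{\beta^{\alpha}}\, y^{\alpha - 1}, \quad \text{for } y \ge 0;\ \alpha, \beta > 0.$$
Note this is of the form (constant) × y^(constant), and if $\alpha = 1$ this reduces to the constant hazard we just considered.
Note that
$$S(y) = \exp\left(-\left(\frac{y}{\beta}\right)^{\alpha}\right)$$
and so
$$f(y) = \frac{\alpha}{\beta^{\alpha}}\, y^{\alpha - 1} \exp\left(-\left(\frac{y}{\beta}\right)^{\alpha}\right), \quad y \ge 0.$$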
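The power hazard model is the Weibull distribution, so its formulas can be checked against R's Weibull functions. The sketch below assumes the parameterization S(y) = exp(-(y/β)^α), which corresponds to `shape = alpha` and `scale = beta` in R; the parameter values and time points are arbitrary illustrations.

```r
alpha <- 2.5                    # illustrative shape (an assumption)
beta  <- 40                     # illustrative scale (an assumption)
y <- c(5, 20, 40, 80)

h <- (alpha / beta^alpha) * y^(alpha - 1)   # power hazard
S <- exp(-(y / beta)^alpha)                 # survival function
f <- h * S                                  # density, since f = h * S

S_R <- pweibull(y, shape = alpha, scale = beta, lower.tail = FALSE)
f_R <- dweibull(y, shape = alpha, scale = beta)

# With alpha = 1 the hazard is constant and equal to 1/beta
h_alpha1 <- (1 / beta^1) * y^(1 - 1)

print(cbind(y, h, S, S_R, f, f_R))
```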
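Weibull Prob Plots
Recall the Power hazard model:
$$S(y) = \exp\left(-\left(\frac{y}{\beta}\right)^{\alpha}\right).$$
Substitute $y = y_{(j)}$, then take the logarithm:
$$-\log_e S(y_{(j)}) = \left(\frac{y_{(j)}}{\beta}\right)^{\alpha}.$$
Take the logarithm of base 10 again to create the log-life variable:
$$\log_{10}[-\log_e S(y_{(j)})] = \alpha\,[\log_{10} y_{(j)} - \log_{10}\beta].$$
We now have:
$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\,\log_{10}[-\log_e S(y_{(j)})].$$
Note: base-10 logs are traditionally used for lifetimes, but the natural log is equivalent up to a difference of scale.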
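Weibull Prob Plots
If the data fit the Weibull model:
$$\log_{10} y_{(j)} \approx \log_{10}\beta + \frac{1}{\alpha}\,\log_{10}\left[-\log_e\left(1 - \frac{j - 0.5}{n}\right)\right].$$
The Weibull Prob plot is the plot of the points
$$\left(\log_{10}\left[-\log_e\left(1 - \frac{j - 0.5}{n}\right)\right],\; \log_{10} y_{(j)}\right), \quad j = 1, 2, \dots, n.$$
It follows a straight line with slope $1/\alpha$ and intercept $\log_{10}\beta$.
Recall: A probability plot, quantile-quantile plot, or QQ-plot, is a plot of the points:
$$(Q(p_j),\, Q_n(p_j)) = \left(F^{-1}\!\left(\frac{j - 0.5}{n}\right),\; y_{(j)}\right), \quad j = 1, 2, \dots, n.$$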
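The straight-line claim can be verified with "perfect" Weibull data. In the sketch below the ranked data are replaced by the exact Weibull quantiles at the plotting positions, so the points lie exactly on the line; the values α = 3, β = 25, n = 17 are arbitrary illustrations.

```r
alpha <- 3      # illustrative shape (an assumption)
beta  <- 25     # illustrative scale (an assumption)
n <- 17

p <- (1:n - 0.5) / n
y_exact <- qweibull(p, shape = alpha, scale = beta)   # idealized ranked data

x    <- log10(-log(1 - p))     # horizontal axis of the Weibull prob plot
ylog <- log10(y_exact)         # vertical axis: log-life

fit <- lm(ylog ~ x)
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

print(c(slope = slope, one_over_alpha = 1 / alpha))
print(c(intercept = intercept, log10_beta = log10(beta)))
```

With real data the points scatter around the line, and the fitted slope and intercept become estimates of 1/α and log10(β).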
Summary: To check graphically whether a distribution is Weibull, do essentially a qqplot on the log-log scale (see Section 4.4, p. 61-63). The key formulas are:
$$S(y) = \exp\left(-(y/\beta)^{\alpha}\right) \tag{4.2}$$
Now substitute the ordered data and take natural logs:
$$-\ln S(y_{(j)}) = \left(\frac{y_{(j)}}{\beta}\right)^{\alpha} \tag{4.2a}$$
Take logs again (base 10):
$$\log_{10}[-\ln S(y_{(j)})] = \alpha\,[\log_{10} y_{(j)} - \log_{10}\beta] \tag{4.2b}$$
Write as a linear equation:
$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\,\log_{10}[-\ln S(y_{(j)})] \tag{4.3}$$
$$\log_{10} y_{(j)} = \log_{10}\beta + \frac{1}{\alpha}\,\log_{10}\left[-\ln\left(1 - \frac{j - 0.5}{n}\right)\right] \tag{4.3a}$$
So plot the points in (4.4) and look for a straight line; the slope will equal $1/\alpha$ and the intercept will equal $\log_{10}(\beta)$.
Example 4.2 on page 62.
Ranked data: 10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35. [Note: n=17]
Step 1: Take y = log10 of the ranked data.
Step 2: Find the plotting positions: $p_j = (j - 0.5)/n$.
Step 3: Find $x = \log_{10}\left[-\log_e\left(1 - \frac{j - 0.5}{n}\right)\right]$.
Step 4: Plot the Weibull Prob plot:
$$\left(\log_{10}\left[-\log_e\left(1 - \frac{j - 0.5}{n}\right)\right],\; \log_{10} y_{(j)}\right), \quad j = 1, 2, \dots, n.$$
Example 4.2, page 62
# Now let's do the Weibull probability plot as described in Chapter 4 on p. 61-63
# Formula (4.4) shows that the slope is 1/shape and the intercept is log10(scale)
# when you do a Weibull probability plot. Let's try it with the data in Ex. 4.2
#############
lifetime=c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)
# Now take logs base 10 and compute the values needed in formula (4.4)
y=log10(lifetime)
n=length(lifetime)
k=(1:n-.5)/n   ## or k=ppoints(n, a=0.5)
x=log10(-log(1-k))
# Now plot the x's and y's and put in a vertical line at x=0 to see the intercept
plot(x,y)
abline(v=0, lty=2)
# Similarly, find the slope graphically, or use regression
lm(y~x)   # from the output we see intercept=1.3813 and slope=0.2660
abline(lm(y~x), col="blue")   ## regression line
# So notice that the intercept of the regression line is 1.38;
# so log10(beta)=1.38, which implies beta=10^1.38=23.98833,
# and slope=0.2660, which implies alpha=1/.2660=3.75938
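Instead of reading the intercept and slope off the printed output, the estimates can be pulled from the fitted model directly. A minimal sketch for the Example 4.2 data; the names `alpha_hat` and `beta_hat` are introduced here for illustration.

```r
lifetime <- c(10, 12, 15, 17, 18, 18, 20, 20, 21, 21, 23, 25, 27, 29, 29, 30, 35)
n <- length(lifetime)

p <- (1:n - 0.5) / n               # plotting positions
x <- log10(-log(1 - p))            # horizontal axis of the Weibull prob plot
y <- log10(sort(lifetime))         # log-life of the ranked data

fit <- lm(y ~ x)
est <- coef(fit)

alpha_hat <- 1 / est[["x"]]              # slope = 1/alpha
beta_hat  <- 10^est[["(Intercept)"]]     # intercept = log10(beta)

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
```

Working from the unrounded coefficients avoids the small rounding error introduced by using 1.38 and 0.2660 by hand.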
More about Weibull Prob Plots
The Weibull prob plot follows a straight line with slope $1/\alpha$ and intercept $\log_{10}\beta$.
Standard least-squares regression can provide estimates of the slope and intercept.
If the slope $1/\alpha$ is estimated to be close to 1, then it is reasonable to assume the data come from an Exponential model.
Cumulative Hazard plot
A hazard plot is a plot of the points $(y_{(j)}, \Lambda_j)$, $j = 1, 2, \dots, n$, where
$$\Lambda_j = \sum_{i=1}^{j} \frac{1}{n - i + 1},$$
which are called the (cumulative) hazard plot scores.
For the Exponential model, $h(y) = 1/\beta$, so $\Lambda(y) = y/\beta$.
If the data come from an exponential distribution, then the hazard plot will be a straight line. It is a simpler diagnostic test than the probability plot.
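The hazard plot scores are simple to compute. The sketch below assumes the scores are the partial sums of 1/(n - i + 1) for i = 1, ..., j and that they are plotted against the ranked data; the data set 3, 5, 7, 8 is used only as a small illustration.

```r
y <- sort(c(3, 5, 7, 8))      # illustrative ranked data
n <- length(y)

# 1/(n - i + 1) for i = 1..n is 1/n, 1/(n-1), ..., 1/1, so the scores are
# the cumulative sums of 1/(n:1)
scores <- cumsum(1 / (n:1))

print(data.frame(y = y, score = scores))

plot(y, scores, xlab = "Ranked data", ylab = "Cumulative hazard score")
```

Under an exponential model the points should fall near a straight line through the origin with slope 1/β.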