Chapter 5clavius.bc.edu/.../BIOL2300/Notes/Notes/LECTURES/chap5.pdf · 2017. 9. 10. · Chapter 5...

Post on 27-Feb-2021


BIOL2300 Biostatistics

Chapter 5 Discrete probability distributions

https://www.cartoonstock.com/directory/o/odds.asp

Random variable

•  Intuitively, a random variable (r.v.) is the numerical value expressing outcome of a stochastic experiment

•  Formally, a r.v. is a function X : Ω → Reals. Since Ω has an associated probability function, we will be able to discuss P(X=k) = P({x ∈ Ω: X(x)=k})

Discrete vs continuous

•  Discrete r.v. takes either finitely many values or countably many (range of X is finite or countably infinite)

•  Continuous r.v. X has range containing an interval of the real numbers

Probability distribution

•  Probability that r.v. X takes value k: p(k) = P( {x ∈ Ω: X(x) = k} )

•  Probability distribution for r.v. X: 0 ≤ p(k) = P(X=k) ≤ 1 for all k, and the values p(k) sum to 1

Answer

•  No: sum of probabilities not equal to 1

Mean, variance, stdev of r.v. (population variance & stdev)

Round off rule

•  Round mean, variance and stdev to one more decimal place than accuracy of r.v.
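As a small illustration of the mean, variance and stdev of a discrete r.v., the sketch below computes all three from a probability table and applies the round off rule. The fair six-sided die is an assumed example, not taken from the lecture, and the helper name `mean_var_stdev` is hypothetical.

```python
import math

def mean_var_stdev(dist):
    """dist maps each value k to p(k); returns (mean, variance, stdev)."""
    total = sum(dist.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mu = sum(k * p for k, p in dist.items())
    var = sum((k - mu) ** 2 * p for k, p in dist.items())
    return mu, var, math.sqrt(var)

# Assumed example: a fair six-sided die (values are integers)
die = {k: 1 / 6 for k in range(1, 7)}
mu, var, sd = mean_var_stdev(die)

# Round off rule: one more decimal place than the r.v. itself
print(round(mu, 1), round(var, 1), round(sd, 1))
```

The check on the sum of probabilities mirrors the requirement that a valid probability distribution sums to 1.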

Hardy-Weinberg theorem

Under random mating hypothesis, genotype frequencies reach their equilibrium values in one generation

# x is AA, y is AB, z is BB: genotype frequencies
# a, b, c are old values of x, y, z
x, z = 0.5, 0.4   # assumed starting frequencies of AA and BB (not given on the slide)
y = 1 - x - z
a = x
b = y
c = z
gen = 0   # generation count
x = a**2 + 0.5*a*b + 0.5*b*a + 0.25*b*b
z = c**2 + 0.5*c*b + 0.5*b*c + 0.25*b*b
y = 1 - x - z
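The update rule in the snippet above can be iterated to see the one-generation claim numerically. This is a minimal sketch: the starting frequencies are assumed values chosen to be out of equilibrium, and `next_generation` is a hypothetical helper that rewrites the same update in terms of allele frequencies.

```python
def next_generation(x, y, z):
    """One round of random mating; x, y, z are frequencies of AA, AB, BB."""
    p = x + 0.5 * y          # frequency of allele A
    q = z + 0.5 * y          # frequency of allele B
    return p * p, 2 * p * q, q * q

# Assumed starting genotype frequencies (not in Hardy-Weinberg proportions)
x, y, z = 0.5, 0.1, 0.4
history = [(x, y, z)]
for gen in range(3):
    x, y, z = next_generation(x, y, z)
    history.append((x, y, z))

for gen, (x, y, z) in enumerate(history):
    print(gen, round(x, 4), round(y, 4), round(z, 4))
```

Generations 1, 2 and 3 print the same frequencies, since the allele frequencies p and q do not change under random mating.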

Answer: under random mating hypothesis

             Female A (p)   Female B (q)
Male A (p)   AA (p²)        AB (pq)
Male B (q)   BA (pq)        BB (q²)

•  p,q are allele frequencies. p = allele frequency of A, q = allele frequency of B. We have p+q=1

•  AA, AB and BB are genotypes. Assume that A is dominant over B, so that the only phenotypic difference is between the two classes {AA, AB} and {BB}

•  Then since E[BB]=q², we can compute the B-allele frequency q as the square root of the phenotype frequency of BB, assuming that the population is in Hardy-Weinberg equilibrium.

•  Thus p = 1-q, and E[AA]=p², E[AB]= 2pq.
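The square-root argument can be sketched in a few lines. The 16% recessive phenotype frequency is an assumed example, and the function name is hypothetical; the calculation is only meaningful if the population is in Hardy-Weinberg equilibrium.

```python
import math

def allele_freqs_from_recessive(freq_bb):
    """Estimate (p, q) from the observed frequency of the BB phenotype,
    assuming Hardy-Weinberg equilibrium."""
    q = math.sqrt(freq_bb)
    p = 1 - q
    return p, q

# Assumed example: 16% of the population shows the recessive (BB) phenotype
p, q = allele_freqs_from_recessive(0.16)
expected = {"AA": p * p, "AB": 2 * p * q, "BB": q * q}
print(p, q, expected)
```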

Binomial coefficient

•  Pascal’s triangle
1
1, 1
1, 2, 1
1, 3, 3, 1
1, 4, 6, 4, 1
1, 5, 10, 10, 5, 1

•  What is sum of each row?

More on binomial coefficients

•  “n choose k”
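The rows of Pascal's triangle are the binomial coefficients "n choose k", which makes the row-sum question easy to explore. The sketch below uses Python's standard `math.comb`; `pascal_row` is a hypothetical helper name.

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle: C(n, 0), ..., C(n, n)."""
    return [comb(n, k) for k in range(n + 1)]

for n in range(6):
    row = pascal_row(n)
    print(row, "sum =", sum(row))   # each sum equals 2**n
```

Row n sums to 2**n because it counts all subsets of a size n set, grouped by subset size.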

Binomial distribution

•  Bernoulli trial: 2 outcomes, success or failure with prob p and q=(1-p)

•  binomially distributed r.v. counts the number of successes in n trials
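A minimal sketch of the binomial probability b(k;n,p) = C(n,k) p^k q^(n-k), written with the standard library only. The parameters n=50, p=0.3 are the ones used for the plot later in this chapter; `binom_pmf` is a hypothetical helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X counting successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 50, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

print(sum(pmf))                                   # close to 1, by the binomial theorem
mode = max(range(n + 1), key=lambda k: pmf[k])    # most likely number of successes
print(mode)
```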

Binomial theorem

Answer

Graph of binomial distribution


Here n=50, p=0.3 in relative frequency plot of binomial distr.

Mean, variance, stdev of Bernoulli distributed r.v.

•  Let Y be Bernoulli r.v. with probability p of success

•  E[Y] = 1*p + 0*(1-p) = p

•  V[Y] = (1-p)²*p + (0-p)²*(1-p) = p(1-2p+p²) + p² - p³ = p - p² = p(1-p) = pq

•  stdev[Y] = sqrt(pq)

Mean and variance of linear combination of independent r.v.

•  Theorem 1: Expectation is always additive: E[X+Y] = E[X] + E[Y] and E[cX] = cE[X]

•  Theorem 2: Variance is additive if r.v. independent: V[X+Y] = V[X] + V[Y]. Also V[cX] = c²V[X]

What does it mean that two r.v. are independent?

•  X,Y independent iff for all values x,y respectively taken on by X,Y we have

•  P(X=x,Y=y) = P(X=x) * P(Y=y)

Mean, variance, stdev of binomial distributed r.v.

•  If X is r.v. that counts number of successes in n trials, then by additivity of expectation E[X] = E[Y]+...+E[Y] where there are n terms in sum and Y is Bernoulli distributed r.v.

•  Thus E[X] = np if X is binomially distributed r.v. that counts number of successes in n trials where probability of success in one trial is p

•  Since n Bernoulli trials are independent, by additivity of variance (REQUIRES independence) V[X] = nV[Y] = npq = np(1-p) if X is binomially distributed r.v. that counts number of successes in n trials where probability of success in one trial is p
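The formulas E[X] = np and V[X] = npq can be checked directly against the definition of mean and variance. This is a numerical sanity check, not a proof; the values n=100, p=0.15 are borrowed from a plot later in the chapter.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomially distributed r.v."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.15
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
var = sum((k - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

print(mean, n * p)               # both close to 15
print(var, n * p * (1 - p))      # both close to 12.75
```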

Answer: D. E[2X+7] = 2E[X]+7 = 2µ+7, and V[2X+7] = E[(2X+7 - (2µ+7))²] = 4E[(X-µ)²] = 4V[X]. So stdev[2X+7] = 2σ

Binomial distribution is for sampling with replacement

•  urn has 4 red balls and 6 black balls

•  P(red ball) = 4/10 = 0.4

•  b(r;n,p) is probability of r red balls in sample of n balls, where balls are drawn from urn WITH replacement

•  What about drawing balls without replacement?

Binomial distribution: sampling with replacement

n=100, p=0.15

Proportion of successes in n trials

•  Let Z be r.v. that counts the number of successes in n trials, where probability of success is p (absolute frequency).

•  Z = X+...+X, where P(X=1) = p.

•  Let Y be r.v. that returns the proportion of successes in n trials (relative frequency)

•  E[Y] = E[Z/n] = E[X+...+X]/n = np/n = p

•  V[Y] = V[Z/n] = V[X+...+X]/n² = npq/n² = pq/n

Hypergeometric distribution

•  R red balls in population of size N

•  what is probability of drawing r red balls in sample of size n, when drawing WITHOUT replacement?
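The standard counting answer is C(R,r)·C(N-R,n-r)/C(N,n): choose r of the R red balls and n-r of the N-R others, out of all size n samples. The sketch below applies it to the urn from the earlier slide (4 red, 6 black); the sample size n=3 is an assumed value and `hypergeom_pmf` is a hypothetical helper name.

```python
from math import comb

def hypergeom_pmf(r, n, N, R):
    """P(r red balls in a sample of n drawn without replacement
    from N balls of which R are red)."""
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

# Urn with 4 red and 6 black balls; draw 3 without replacement (n is assumed)
N, R, n = 10, 4, 3
probs = [hypergeom_pmf(r, n, N, R) for r in range(n + 1)]
for r, pr in enumerate(probs):
    print(r, pr)
```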

Mean,variance of hypergeometric dist.

•  Let p = R/N. Then mean is np. Note that the mean of hypergeometric (sampling without replacement) is same as mean of binomial (sampling with replacement)

•  Let p = R/N and q = 1-p = (N-R)/N, so that q = B/N where B is number of black balls in population. Recall that variance of binomial is npq. Then variance of hypergeometric is npq(N-n)/(N-1)
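As a numerical check of the mean np and variance npq(N-n)/(N-1), the sketch below computes both from the hypergeometric probabilities directly. The values N=100, R=50, n=30 are the ones used for the plot on the following slide.

```python
from math import comb

N, R, n = 100, 50, 30
p, q = R / N, (N - R) / N

# Hypergeometric probabilities for x = 0, ..., n red balls in the sample
pmf = [comb(R, x) * comb(N - R, n - x) / comb(N, n) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(pmf))
var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(mean, n * p)
print(var, n * p * q * (N - n) / (N - 1))
```

The factor (N-n)/(N-1) is less than 1 whenever n > 1, so sampling without replacement gives a smaller variance than the binomial npq.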

Graph of hypergeometric distribution


Here N=100, R=50, n=30. Dot (x,y) represents h(n,x;N,R), the probability of drawing x red balls in size n sample, without replacement

Lotto problem

•  In Lotto game, a player selects six numbers from 1,...,54 without repetition, thus forming an unordered set of 6 numbers.

•  P(selecting 6 winning numbers)

Continuation of Lotto Problem

•  P(selecting exactly 5 winning numbers)

Continuation of Lotto Problem

•  P(selecting exactly 3 winning numbers)

Continuation of Lotto Problem

•  P(selecting no winning number)
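The Lotto questions are hypergeometric: the six winning numbers play the role of red balls. The sketch below assumes that six winning numbers are drawn from 1,...,54 without repetition, which the slides imply but do not state explicitly; `lotto_match` is a hypothetical helper name.

```python
from math import comb

def lotto_match(k, pool=54, picks=6):
    """P(exactly k of the player's numbers are among the winning numbers),
    assuming `picks` winning numbers are drawn from `pool` without repetition."""
    return comb(picks, k) * comb(pool - picks, picks - k) / comb(pool, picks)

for k in (6, 5, 3, 0):
    print(k, lotto_match(k))
```

For k=6 this reduces to 1/C(54,6), since there is exactly one winning set among all unordered sets of six numbers.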

Poisson distribution

X is Poisson r.v. with parameter λ (λ is the mean) if P[X=k] = λ^k/k! · e^(-λ) for k = 0, 1, 2, ...

P[X=1] = 0.014872513 = µ¹/1! · e^(-µ)

P[X=4] = (3.1)⁴/4! · e^(-3.1)
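The two worked values can be evaluated with a short function for the Poisson probability λ^k/k! · e^(-λ). The first value, 0.014872513, is reproduced with µ = 6; that parameter is an inference from the number, not something stated on the slide.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson r.v. with mean lam."""
    return lam**k / factorial(k) * exp(-lam)

print(poisson_pmf(4, 3.1))   # P[X=4] with mean 3.1
print(poisson_pmf(1, 6))     # matches 0.014872513 if the mean is 6 (inferred)
```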

Graph of Poisson distribution


Here lambda=10.

Mean and variance of Poisson r.v

•  If X is Poisson distributed r.v. with parameter λ, then

– E[X] = λ
– V[X] = λ

Mean of Poisson with parameter lambda is lambda

Variance of Poisson with parameter lambda is lambda

Applications of Poisson

•  Suppose that p is small, where p is the probability that an elementary event happens in a given time or space interval (accident in 5 minute interval, nucleotide mutation in genome)

•  Poisson distributed r.v. is good distribution for fitting the NUMBER of elementary events that occur per time or space interval

Number of TATA boxes (TATAAA) in M. jannaschii (archaea)

Number of TATA boxes (TATAAA) in E. coli K12

•  http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.2

Interarrival time, or distance between successive genomic motifs

What probability distribution is this?

Remarks

•  When N is large compared with sample size n, sampling with replacement (binomial) is approximately equal to sampling without replacement (hypergeometric).

•  For large n, both binomial and hypergeometric distribution look approximately normal. For small p and large n, binomial distribution can be approximated by Poisson distribution.
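The Poisson approximation to the binomial can be seen by tabulating both side by side with λ = np. The values n=1000, p=0.003 are an assumed example of "large n, small p".

```python
from math import comb, exp, factorial

n, p = 1000, 0.003          # assumed example: large n, small p
lam = n * p                 # Poisson parameter matching the binomial mean

rows = []
for k in range(7):
    b = comb(n, k) * p**k * (1 - p) ** (n - k)     # binomial probability
    po = lam**k / factorial(k) * exp(-lam)          # Poisson probability
    rows.append((k, b, po))
    print(k, round(b, 5), round(po, 5))
```

The two columns agree closely; the agreement gets worse as p grows with n held fixed.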

Is number of women having car accidents in fixed time interval Poisson distributed?

Accidents     0    1    2    3    4    5    >5   Total
#female       447  132  42   21   3    2    0    647
E[#female]    406  189  44   7    1    0    0    647

Columns 0 and 3 deviate from values of Poisson r.v. Explanation: prudent drivers have fewer accidents, reckless drivers have more?
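The expected counts in the table can be recomputed from the observed counts. The sketch assumes that λ was estimated by the sample mean number of accidents per driver; the slides do not say how the expected row was obtained, but this choice gives the same rounded values.

```python
from math import exp, factorial

observed = {0: 447, 1: 132, 2: 42, 3: 21, 4: 3, 5: 2}     # counts from the table
total = sum(observed.values())                             # number of drivers
lam = sum(k * c for k, c in observed.items()) / total      # sample mean (assumed estimator)

# Expected number of drivers with k accidents under a Poisson model
expected = {k: total * lam**k / factorial(k) * exp(-lam) for k in observed}
for k in observed:
    print(k, observed[k], round(expected[k]))
```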

Geometric distribution

Defective component problem

•  Probability of defective component is 0.2. Find the probability that the first defect is found in the seventh component tested.

•  P(X=7), where X is geometrically distributed

•  P(X=7) = (1-0.2)⁶*(0.2) ≈ 0.0524

Mean of geometric distribution is 1/p
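A short sketch of the geometric probability q^(k-1)·p, applied to the defective component problem, with a truncated sum as a numerical check of the mean 1/p. The cutoff of 1000 terms is an arbitrary choice; `geom_pmf` is a hypothetical helper name.

```python
def geom_pmf(k, p):
    """P(first success occurs on trial k), k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.2
print(geom_pmf(7, p))    # first defect found in the seventh component

# Truncated sum approximating the mean; the tail beyond 1000 terms is negligible
approx_mean = sum(k * geom_pmf(k, p) for k in range(1, 1001))
print(approx_mean, 1 / p)
```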

Application of mean of geometric distribution

•  Recall that the absolute risk reduction is defined by |P(event occurring in treatment group) - P(event occurring in control group)|

•  The book states that the number of persons needed to treat, in order to prevent one disease, is computed by 1/(absolute risk reduction)

•  This follows since the number of persons who must be treated (without success) before treating a person with success is geometrically distributed, and the mean is 1/p, where p is absolute risk reduction.
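The number needed to treat is then a one-line calculation. The event rates below are assumed illustrative values, and `number_needed_to_treat` is a hypothetical helper name.

```python
def number_needed_to_treat(p_treatment, p_control):
    """1 / absolute risk reduction; inputs are event probabilities per group."""
    arr = abs(p_treatment - p_control)
    if arr == 0:
        raise ValueError("no risk reduction: NNT is undefined")
    return 1 / arr

# Assumed example: event rate 10% in treatment group, 15% in control group
nnt = number_needed_to_treat(0.10, 0.15)
print(nnt)   # about 20 persons
```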

Variance of geometric distribution is q/p²

Multinomial coefficients

Recall that the binomial coefficient is the number of ways of choosing a size k subset from a size n set.

The multinomial coefficient n!/(n1! n2! ... nk!) is the number of ways of partitioning a size n set into subsets of size n1,n2,...,nk

Problem on multinomial distribution

Solution to problem on multinomial distribution

•  P(A)=p1, P(B)=p2, P(C)=p3

•  Multinomial probability (generalization of binomial probability) that in a sample of size n, there are n1 items of type A, n2 items of type B and n3 items of type C is: n!/(n1! n2! n3!) · p1^n1 · p2^n2 · p3^n3

Solution to problem on multinomial distribution

•  Genetics experiment involves 6 mutually exclusive genotypes: A,B,C,D,E,F, all equally likely (so pi=1/6 for i=1,...,6)

•  Probability of 5 A’s, 4 B’s, 3 C’s, 2 D’s, 3 E’s, 3 F’s is
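The genetics problem can be evaluated with the multinomial probability n!/(n1!...nk!) · p1^n1 ··· pk^nk, here with n = 20 and all pi = 1/6. The sketch uses only the standard library; `multinomial_prob` is a hypothetical helper name.

```python
from math import factorial

def multinomial_prob(counts, probs):
    """Probability of observing exactly `counts` items of each type
    in sum(counts) independent draws with type probabilities `probs`."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)        # stays an integer at every step
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob

counts = [5, 4, 3, 2, 3, 3]          # A, B, C, D, E, F
prob = multinomial_prob(counts, [1 / 6] * 6)
print(prob)
```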

Chebyshev’s inequality

Prob that |z-score| is greater than or equal to 2 is at most 1/4

Prob that |z-score| is greater than or equal to 3 is at most 1/9

Proof of Chebyshev’s inequality

Equivalent formulation of Chebyshev’s inequality

Prob that |z-score| is less than k is greater than or equal to 1-1/k²
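Chebyshev's bound holds for any distribution with finite variance, so it can be checked against an exact tail probability. The binomial distribution with n=50, p=0.3 is an assumed example.

```python
from math import comb, sqrt

n, p = 50, 0.3                                   # assumed example distribution
mu, sigma = n * p, sqrt(n * p * (1 - p))
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

tails = {}
for k in (2, 3):
    # Exact probability that X is at least k standard deviations from its mean
    tails[k] = sum(px for x, px in enumerate(pmf) if abs(x - mu) >= k * sigma)
    print(k, tails[k], "<=", 1 / k**2)
```

The exact tails are much smaller than 1/k², which is typical: the inequality is a worst-case bound, not an approximation.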