Detecting signals in biological sequences

Detecting signals in biological sequences Leelavati Narlikar

Transcript of Detecting signals in biological sequences

Page 1: Detecting signals in biological sequences

Detecting signals in biological sequences

Leelavati Narlikar

Page 2: Detecting signals in biological sequences

August 2010

DNA: code for life

sugar-phosphate backbone

hydrogen-bonded base pairs

2

Cells store their hereditary information on DNA molecules

Complete DNA content = Genome (~3 billion base-pairs)

Four types of nucleotides:

A: adenine
C: cytosine
G: guanine
T: thymine

Page 3: Detecting signals in biological sequences

Expressing information on the DNA

3

Gene (DNA) → RNA: transcription (RNA synthesis)

RNA → PROTEIN: translation (protein synthesis)

Page 4: Detecting signals in biological sequences

Can explain differences across species...

4

Page 5: Detecting signals in biological sequences

A region along human genome...

5

Page 6: Detecting signals in biological sequences

Also explain differences within species...

6

Page 7: Detecting signals in biological sequences

but what about the same organism?

Neuron Lymphocyte

25 μm

7

[adapted from Molecular Biology of the Cell, Alberts et al.]

Page 8: Detecting signals in biological sequences

Variability in expression levels

[Figure: Gene A, Gene B, and Gene C (marked ×), each with transcription and translation steps; product A appears many times and product B once.]

Page 9: Detecting signals in biological sequences

Transcriptional regulatory code

Transcriptional regulation

[Figure: transcription factors (TFs) bound at transcription factor binding sites (TFBSs) act positively (+) or negatively (-) on the basal transcription complex at the transcription start site.]

Page 10: Detecting signals in biological sequences

What we know about TF-DNA binding

A TF binds a small DNA site (around 5 to 10 bp long)

Binding is specific: a TF will not bind an arbitrary nucleotide string. For example, a TF may bind only CAGTGT

If we knew its “preference”, we could just scan the DNA and look for matches

There are over 500 TFs in the human genome; we probably know the preferences of 50 or so
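
The scanning idea above can be sketched in a few lines. This is a minimal illustration that assumes the preference is a single exact word (the slide's CAGTGT); the function name `find_sites` and the test sequence are hypothetical, and real preferences are usually degenerate rather than exact.

```python
def find_sites(dna, word):
    """Return 0-based start positions of every (possibly overlapping)
    exact occurrence of `word` in `dna`."""
    w = len(word)
    return [i for i in range(len(dna) - w + 1) if dna[i:i + w] == word]


# Hypothetical toy sequence containing the word twice.
hits = find_sites("TTCAGTGTAACAGTGT", "CAGTGT")
```

Exact matching like this is only a starting point; it misses sites that differ from the word at one position, which is one motivation for the probabilistic motif models introduced later in the slides.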

10

Page 11: Detecting signals in biological sequences

3000bp region near gene, bound by TF ZYX

11



Page 12: Detecting signals in biological sequences

3000bp region near TSS of a real mouse gene

12

AGGAATATTCTGCTGTTTGGGATCTTGCCACAGCCACTTCCAGCCTGGGAAAAGGCATTTACTGTAAACAGCGGGAGAAGGGGCTCCTTCCCCAACAGCTGACAGCTCATTTTAACCAGAGGAACTGAGATTTGATTTTGGAGTTCATCTCCCTGGGCAGTGAAGGATCAAAAAACAAACAACAACCTGAGAGAAGGGGTGGAGGTTTCCATAGAGGAAAGTACTGGGGCGGGGATGGAGGCTTGGTGGTGGGGGCTTGGGGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGGGGTGGGGAGGGAAGCATCTAACTTCCTGGGTTCTAACCTGGCTTTCTCTCAGGTTCTGGCTGACACTGAGCAGGACACTTTATTGCATTGCAGCCTTGTGGGTTTGCCTATCTGTAAAATTAGGAATAAAAAGAGGCCCAGTCAGGATATTTGAAGATGGAAGACAGAATAAATCAAGTGATCTCTCAGAACTGGGTCTCACAGAGCCAAGAATTGGGGGTAAGAAGTACAGGTGGCGGCTACACTTGCTCTGGTAATTCCAACAGAATAGCCATGCACAGTTAGGGAGGTAAAAAGTGGATACGTAAACGGCCCCATTTCTCACTGATGAAAAAACCCTTCCTGCTTCCTGTAACAAAAGCACTGTACGGCAAAGCAAGGAGAAGCTTAAGTACTTGGGACCTCCTCGTCAAAGAAGGCATCCATGGCCATCTTAAGGTGTGGAGGGGCACAAGAGGTCACTACAAAGGTACTAGGTCCCTCTGATACATATGTCAGGTGGGCAAAAGAATTCCTCCAGGAAAGGGGGGCAGTGAAGGGGACGGAAACCCTTGTCTGGATGAAGTTCTGGGGTGAAGAGTCTCTTCTGTCCAAAGCCTTTTGGGAAGATGAGGTGCCTGCATTTACTCTCTTTGCTTCTGTCAAGTGCCTGAGGGTGGCCAGATTTCGACCTGCTGGGGAGAAGGATTTTGGTCATGGTTTAGCAGGAGGTGGGGGTTTTGCAGTGGGCATGTGAAGGAAGGGATGGTGGCGGAGTGACGAAGACCACTCTGTCTGTTGTACGAAGGTCCCCAACCTGGGATAGAGGCCTCTTCCAGAGACTGTGAGAGTGCTTGAGGTGAAGGGGGGTGACTAGGGGTAGCCCGTCTTGTCCTGGCAGCTCCTACTTGCTGGTCAAAGCCCTCAGGCCGCCCTACTTGTGCACTGACCTGAGCTAATCTAAAAAATACCGCAGGGGAGGTAGAGACTGGGGTTCCCAGTGAGAGAGAGTCTCCAGACGGGAGAAAAGGAGCGGGAGTCCTTGTCATTTCTGTCAGCTTTCTTAGGCTCAGTGACAACAATGCTTCTCCTTCATCTAGGCTGGGTCCCATCTCGTGGTTTGCTGCTTAGGAGTTTGAAAGAGAACCCAGCTGGGGACGTAGACAGGGACCCACAGAAAGCAGCCGTAGCTGACCCATGCCTCATGAAGACTACAAAGGGGCTCACGCCAGCACGAACGCAAGGCAACTCCTTTCAGAAGCGCCAGCTCGGCAATGAAACTCGGCTGCGCAGCAAACCACACACGGAATACGCACGGTTACCAAAGCTGCCGCTCAGAGTTCACACAGCGCCAACCCACAGCTGTATCTAATGCGATGTCTTTGTCTCTGGATCTCTTTCGTCTTCGTGCCCGCGCGCACTCGCATGACACTCAACAGAAACATCCAAGCTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTGAGCTCCGGTCCGCAAGGCTGTCAGCTCGCCTTGCCTTTCGTCTATCCTGACCTTCTCAGATAAGCATTTGCTTACCGAGGGGGCGAGGGGGCGTCCTCAGAATCCCTCCGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGAGAGAGACAGAGACAGAGACAGAGACAGAGACACAGAGACAGAGACAGACAGAGACAGACAGAGACAGAGAGAGCGCCCAAAGGCTAGCCTTTCCCTTCCACTGCGCGCAGTTGATGGTGAGGCACCAGCTCCTACCACGGCATCCCTGGACGACAGAAACAGCTCAGATGGTCCAACCCAGGGCTGACTTTCTTCAAAAGTAATCCAAGACAGTCACTTCTGTCATCAGGATGGACTTGCAGAACAAGTGATAGATGGCAGAGACACAAACAGACGACCGATCGGCCGGCCTAGCTCTAGAGACTCTCACCTGTCTGTCCTGTTGGTTATCCGGACGGTTAGCCAGAGGATCCGGGGCCCCCGCACAGCTCCGGGACTCTGGAAGATAGTTCCGAGGGTGGGGACCTTCGAGAACCAGCCCACACTGAACTCCTCCCTCCTTGTGGCAGCAGCAAGGTGGGACGGGCCAGGACGTCTGCTTAGCACCTCCTCCAGAAATGCAGCACTTGGGGGGCCCCCACCCTTTCGCGCGCTCCTTCCCACCGACCTCCCAGGGGTGCACCTCTCCAGCCTCGGTCGCGCTTCCGAAACCTTTGGTGCCCCCTTTTCCTGGTCCCGACCCCCCACCTCACGCCCCCTGGTCTGGACAGCATCTCCCCCTCGCCGCCCTCCGCCCACGCACCGCCTGACTCCGAGGGGTGCGAGCGCATTGGGCTGCGCCCGCGTGGGGGCGCCGCGCCAGCCTCGCGTAGCTGTTCTGACGCTGCCGTCGCCGCCGCCCTCCGCAGCCCAGCCGGCACCCGCACCAGCTCTGCAGTGCACTCGTCGCCTCTCGGGCCGGTCCCACCAAGAGCCAGACTGTCGTGACCGGGGCCAGCCTCGAACGTCAGGCGCGAGGGTCATGAGCCAGAGCGCCCTGGGGCGCCGCGCGGAGACCCAGCGGAGATAGCAGTCCTCGCTGCCTTGACGCGCGCCCGCCGCGTCCCCAGA

Page 13: Detecting signals in biological sequences

Markov models for background

13

Probabilistic Model for the Background

Markov Model of Order 0: Aggregate Probabilities of Occurrence

    A     C     G     T
    0.2   0.3   0.3   0.2

• Long-range correlations are known to exist in DNA sequences.

Markov Model of Order 1: Conditional Probabilities Representing One-Step Correlations

    Current Base ↓   Next Base →   A     C     G     T
    A                              0.3   0.2   0.4   0.1
    C                              0.5   0.5   0     0
    G                              0     1     0     0
    T                              0     0.75  0     0.25

• Each row adds up to 1.

Mihir Arjunwadkar Probabilistic Pattern Discovery in Genomic Sequences
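
As a small worked example, the two tables above can be used to score a sequence under the order-1 background. The sketch below assumes the first base is drawn from the order-0 probabilities and every later base from the order-1 row of its predecessor; that choice for the first base, and the names `ORDER0`, `ORDER1`, and `sequence_probability`, are illustrative assumptions rather than something the slides specify.

```python
# Order-0 probabilities and order-1 transition table copied from the slide.
ORDER0 = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
ORDER1 = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.4, "T": 0.1},
    "C": {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    "G": {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    "T": {"A": 0.0, "C": 0.75, "G": 0.0, "T": 0.25},
}


def sequence_probability(seq):
    """Probability of a non-empty sequence under the order-1 model.

    Assumption: the first base uses the order-0 probabilities."""
    prob = ORDER0[seq[0]]
    for current, nxt in zip(seq, seq[1:]):
        prob *= ORDER1[current][nxt]
    return prob
```

For instance, "ACA" scores 0.2 × 0.2 × 0.5 = 0.02, while "ACG" scores 0 because the table gives probability 0 to G following C.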

Page 14: Detecting signals in biological sequences

Markov models for background

14

Probabilistic Model for the Background

Markov Model of Order 2: Two-Step Correlations

    Current Word ↓   Next Base →   A            C            G           T
    AA                             0.14468887   0.34738369   0.3647338   0.14319361
    AC                             0.28652560   0.04944015   0.2475591   0.41647510
    AG                             0.25500737   0.24605241   0.2790940   0.21984627
    AT                             0.24553913   0.32125812   0.2272264   0.20597631
    CA                             0.29077682   0.44232378   0.2193125   0.04758690
    CC                             0.18156346   0.30470655   0.2991050   0.21462501
    CG                             0.51227826   0.15130835   0.2561048   0.08030855
    CT                             0.33931374   0.30861055   0.1403038   0.21177192
    GA                             0.03849488   0.59090986   0.2929306   0.07766468
    GC                             0.29668576   0.01695954   0.3368234   0.34953134
    GG                             0.60471652   0.01173188   0.1304406   0.25311101
    GT                             0.27622836   0.42019980   0.1779573   0.12561454
    TA                             0.31275145   0.26811120   0.1299927   0.28914466
    TC                             0.05453213   0.28767978   0.2924408   0.36534726
    TG                             0.22519927   0.47783775   0.1882527   0.10871030
    TT                             0.11225098   0.17351505   0.3864218   0.32781218

• Background model parameters can be estimated from the same sequence data in which motifs are searched, but this is generally not recommended.

• Appropriate/optimal model order needs to be determined.
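
A table like the one above could be estimated by counting which base follows each k-letter word in a set of background sequences. The sketch below is one plausible way to do that; the function name `estimate_markov`, the pseudocount of 1, and the toy input are assumptions made for illustration, and in line with the note above the input would ideally be background sequence rather than the sequences being searched.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def estimate_markov(seq, k, pseudocount=1.0):
    """Estimate order-k transition probabilities P(next base | previous k bases).

    Every row starts at `pseudocount` (an assumption) so that unseen
    transitions do not get probability exactly 0."""
    counts = defaultdict(lambda: {b: pseudocount for b in ALPHABET})
    for i in range(k, len(seq)):
        context = seq[i - k:i]      # the "current word" of length k
        counts[context][seq[i]] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {b: c / total for b, c in row.items()}
    return model


# Toy usage on a hypothetical sequence with k = 2.
model = estimate_markov("ACGTACGTAC", 2)
```

Only contexts that actually occur in the input get a row; a longer input is needed before all 16 two-letter words appear.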

Page 15: Detecting signals in biological sequences

3000bp region near TSS of a real mouse gene

15

AGGATTATTCTGCTGTTTGGGATCTTGCCACAGCCACTTCCAGCCTGGGAAAAGGCATTTACTGTAAACAGCGGGAGAAGGGGCTCCTTCCCCAACAGCTGACAGCTCATTTTAACCAGAGGAACTGAGATTTGATTTTGGAGTTCATCTCCCTGGGCAGTGAAGGATCAAAAAACAAACAACAACCTGAGAGAAGGGGTGGAGGTTTCCATAGAGGAAAGTACTGGGGCGGGGATGGAGGCTTGGTGGTGGGGGCTTGGGGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGGGGTGGGGAGGGAAGCATCTAACTTCCTGGGTTCTAACCTGGCTTTCTCTCAGGTTCTGGCTGACACTGAGCAGGACACTTTATTGCATTGCAGCCTTGTGGGTTTGCCTATCTGTAAAATTAGGAATAAAAAGAGGCCCAGTCAGGATATTTGAAGATGGAAGACAGAATAAATCAAGTGATCTCTCAGAACTGGGTCTCACAGAGCCAAGAATTGGGGGTAAGAAGTACAGGTGGCGGCTACACTTGCTCTGGTAATTCCAACAGAATAGCCATGCACAGTTAGGGAGGTAAAAAGTGGATACGTAAACGGCCCCATTTCTCACTGATGAAAAAACCCTTCCTGCTTCCTGTAACAAAAGCACTGTACGGCAAAGCAAGGAGAAGCTTAAGTACTTGGGACCTCCTCGTCAAAGAAGGCATCCATGGCCATCTTAAGGTGTGGAGGGGCACAAGAGGTCACTACAAAGGTACTAGGTCCCTCTGATACATATGTCAGGTGGGCAAAAGAATTCCTCCAGGAAAGGGGGGCAGTGAAGGGGACGGAAACCCTTGTCTGGATGAAGTTCTGGGGTGAAGAGTCTCTTCTGTCCAAAGCCTTTTGGGAAGATGAGGTGCCTGCATTTACTCTCTTTGCTTCTGTCAAGTGCCTGAGGGTGGCCAGATTTCGACCTGCTGGGGAGAAGGATTTTGGTCATGGTTTAGCAGGAGGTGGGGGTTTTGCAGTGGGCATGTGAAGGAAGGGATGGTGGCGGAGTGACGAAGACCACTCTGTCTGTTGTACGAAGGTCCCCAACCTGGGATAGAGGCCTCTTCCAGAGACTGTGAGAGTGCTTGAGGTGAAGGGGGGTGACTAGGGGTAGCCCGTCTTGTCCTGGCAGCTCCTACTTGCTGGTCAAAGCCCTCAGGCCGCCCTACTTGTGCACTGACCTGAGCTAATCTAAAAAATACCGCAGGGGAGGTAGAGACTGGGGTTCCCAGTGAGAGAGAGTCTCCAGACGGGAGAAAAGGAGCGGGAGTCCTTGTCATTTCTGTCAGCTTTCTTAGGCTCAGTGACAACAATGCTTCTCCTTCATCTAGGCTGGGTCCCATCTCGTGGTTTGCTGCTTAGGAGTTTGAAAGAGAACCCAGCTGGGGACGTAGACAGGGACCCACAGAAAGCAGCCGTAGCTGACCCATGCCTCATGAAGACTACAAAGGGGCTCACGCCAGCACGAACGCAAGGCAACTCCTTTCAGAAGCGCCAGCTCGGCAATGAAACTCGGCTGCGCAGCAAACCACACACGGAATACGCACGGTTACCAAAGCTGCCGCTCAGAGTTCACACAGCGCCAACCCACAGCTGTATCTAATGCGATGTCTTTGTCTCTGGATCTCTTTCGTCTTCGTGCCCGCGCGCACTCGCATGACACTCAACAGAAACATCCAAGCTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTGAGCTCCGGTCCGCAAGGCTGTCAGCTCGCCTTGCCTTTCGTCTATCCTGACCTTCTCAGATAAGCATTTGCTTACCGAGGGGGCGAGGGGGCGTCCTCAGAATCCCTCCGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGAGAGAGACAGAGACAGAGACAGAGACAGAGACACAGAGACAGAGACAGACAGAGACAGACAGAGACAGAGAGAGCGCCCAAAGGCTAGCCTTTCCCTTCCACTGCGCGCAGTTGATGGTGAGGCACCAGCTCCTACCACGGCATCCCTGGACGACAGAAACAGCTCAGATGGTCCAACCCAGGGCTGACTTTCTTCAAAAGTAATCCAAGACAGTCACTTCTGTCATCAGGATGGACTTGCAGAACAAGTGATAGATGGCAGAGACACAAACAGACGACCGATCGGCCGGCCTAGCTCTAGAGACTCTCACCTGTCTGTCCTGTTGGTTATCCGGACGGTTAGCCAGAGGATCCGGGGCCCCCGCACAGCTCCGGGACTCTGGAAGATAGTTCCGAGGGTGGGGACCTTCGAGAACCAGCCCACACTGAACTCCTCCCTCCTTGTGGCAGCAGCAAGGTGGGACGGGCCAGGACGTCTGCTTAGCACCTCCTCCAGAAATGCAGCACTTGGGGGGCCCCCACCCTTTCGCGCGCTCCTTCCCACCGACCTCCCAGGGGTGCACCTCTCCAGCCTCGGTCGCGCTTCCGAAACCTTTGGTGCCCCCTTTTCCTGGTCCCGACCCCCCACCTCACGCCCCCTGGTCTGGACAGCATCTCCCCCTCGCCGCCCTCCGCCCACGCACCGCCTGACTCCGAGGGGTGCGAGCGCATTGGGCTGCGCCCGCGTGGGGGCGCCGCGCCAGCCTCGCGTAGCTGTTCTGACGCTGCCGTCGCCGCCGCCCTCCGCAGCCCAGCCGGCACCCGCACCAGCTCTGCAGTGCACTCGTCGCCTCTCGGGCCGGTCCCACCAAGAGCCAGACTGTCGTGACCGGGGCCAGCCTCGAACGTCAGGCGCGAGGGTCATGAGCCAGAGCGCCCTGGGGCGCCGCGCGGAGACCCAGCGGAGATAGCAGTCCTCGCTGCCTTGACGCGCGCCCGCCGCGTCCCCAGA

Page 16: Detecting signals in biological sequences

ChIP-chip experiments

[Excerpt shown on slide, from M.J. Buck, J.D. Lieb / Genomics 83 (2004) 349–360, p. 350]

An overview of the ChIP-chip experimental procedure

Ranging from yeast to cultured mammalian cells, there is surprisingly little variation in published ChIP-chip protocols. Generally, cells are grown under the desired experimental condition and then fixed with formaldehyde (Fig. 1A). Formaldehyde crosslinks proteins to each other primarily between the ε-amino group of lysine residues and an adjacent peptide bond. Formaldehyde can also form DNA–protein crosslinks, but only if the DNA is partially […]

Fig. 1. (A) A summary of the ChIP-chip procedure. See the text for details. (B) Comparison of the controls used for single-locus, PCR-based ChIP experiments and microarray-based experiments. Single-locus experiments use a single internal control in each sample. The intensity of the target band is compared across the IP, mock IP (or control IP), and input DNA. In microarray experiments, ratios obtained for enriched elements (boxed in white) are compared to those obtained for all other elements, which are termed non-enriched. (C) Global array normalization will slide the raw distribution (red) along the x-axis so that the median log2 ratio is equal to 0 for the normalized distribution (blue). (D) The effect of default normalization on a simulated ChIP-chip experiment in which 20% of arrayed elements detect five-fold enrichment (log2 STDev = 0.5). The simulated experiment was repeated three times, and the distribution of the average ratios are plotted. The distribution is skewed such that the median log2 ratio of the non-enriched population is at −0.25 (black). The ideal normalization would center the non-enriched population at 0 (green).

Page 17: Detecting signals in biological sequences

Problem of motif discovery

Given: a set of DNA sequences X1, X2, X3, X4, ···, Xn bound by a TF

Each Xi is believed to contain a binding site of that TF

Goal is to find:
• locations of these binding sites in the sequences, and
• description of the word (or motif)

Page 18: Detecting signals in biological sequences

E.g. Pho4, a yeast TF, has 12 binding sites listed in SCPD

Common sequence “pattern” in the binding sites: motif

How can we model a motif?

Modeling binding specificities

    Gene      Binding site
    YBR093C   ACACGTGG
    YBR093C   ACACGTGG
    YBR093C   GCACGTTT
    YBR093C   GCACGTTT
    YBR093C   GCACGTTT
    YBR093C   ACACGTGG
    YDR481C   CCACGCGC
    YGR233C   GCACGTGC
    YML123C   GCACGTGG
    YML123C   CCACGTGG
    YML123C   GCACGTTT
    YML123C   TCACGTTA

18

Consensus: GCACGTGG

Expand alphabet: gCACGTgg

Regular expr.: [GACT]CACG[TC][GT][GTCA]

Position specific scoring matrix

Frequencies:

         1     2     3     4     5     6     7     8
    A    0.25  0.00  1.00  0.00  0.00  0.00  0.00  0.08
    C    0.17  1.00  0.00  1.00  0.00  0.08  0.00  0.17
    G    0.50  0.00  0.00  0.00  1.00  0.00  0.58  0.42
    T    0.08  0.00  0.00  0.00  0.00  0.92  0.42  0.33

Counts:

         1     2     3     4     5     6     7     8
    A    3     0     12    0     0     0     0     1
    C    2     12    0     12    0     1     0     2
    G    6     0     0     0     12    0     7     5
    T    1     0     0     0     0     11    5     4
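
The count and frequency matrices on this slide can be recomputed directly from the 12 listed Pho4 sites. The following sketch does that and also derives the consensus; the function names are illustrative choices, not part of the original slides.

```python
# The 12 Pho4 binding sites listed on the slide.
SITES = [
    "ACACGTGG", "ACACGTGG", "GCACGTTT", "GCACGTTT", "GCACGTTT", "ACACGTGG",
    "CCACGCGC", "GCACGTGC", "GCACGTGG", "CCACGTGG", "GCACGTTT", "TCACGTTA",
]
ALPHABET = "ACGT"


def count_matrix(sites):
    """counts[b][a] = number of sites with nucleotide b at (0-based) position a."""
    width = len(sites[0])
    return {b: [sum(s[a] == b for s in sites) for a in range(width)]
            for b in ALPHABET}


def frequency_matrix(counts, n_sites):
    """Column-wise relative frequencies: each count divided by the number of sites."""
    return {b: [c / n_sites for c in row] for b, row in counts.items()}


def consensus(counts):
    """Most frequent nucleotide at each position."""
    width = len(counts["A"])
    return "".join(max(ALPHABET, key=lambda b: counts[b][a]) for a in range(width))


counts = count_matrix(SITES)
freqs = frequency_matrix(counts, len(SITES))
```

Rounding `freqs` to two decimals reproduces the frequency table, and `consensus(counts)` gives GCACGTGG, matching the slide.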

Page 19: Detecting signals in biological sequences

Basics of probability theory

Probability distribution

19

Introduction to probability models¹

Background

• We define probability in terms of a sample space S, a set whose elements are elementary events.

• Each elementary event can be viewed as the outcome of an experiment.

• For example, consider flipping a coin twice. The sample space consists of all possible outcomes, that is, S = {HH, TT, HT, TH}. Let A denote the event that we obtain at least one head. Then the probability that event A occurs is Pr(A) = 3/4, using

$$P(\text{event } A) = \frac{\text{number of elements in the sample space favorable to event } A}{\text{size of sample space}}$$

• Random variable: a function used to assign unique numerical values to all possible outcomes of a random experiment. E.g., we can let the random variable X be the number of heads in the coin tosses. So X can be 0, 1, or 2, and we need to find P(X ≥ 1) in our example.

Probability distribution

• A probability distribution P assigns a probability to each value of a particular random variable such that the following axioms are satisfied.

1. $P(X = u) \geq 0 \;\; \forall u$
2. $\sum_{u} P(X = u) = 1$

• Say you flip a coin n times, and you want to find the probability of getting k heads.

• Let’s say the coin is biased, so you have probability p of getting a head and 1 − p of getting a tail. What then?

Continuous probability distribution / probability density

• Associated with continuous random variables. E.g., the weight of a newborn.

• $f(x) \geq 0$

• $\int_{-\infty}^{+\infty} f(x)\,dx = 1$

• $P(a \leq x \leq b) = \int_{a}^{b} f(x)\,dx \geq 0$
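
The coin-flipping questions in the notes above have a standard answer, the binomial distribution: the probability of k heads in n independent flips with head probability p is C(n, k) p^k (1 − p)^(n − k). The notes pose this as a question rather than stating the formula, so the sketch below is offered as one way to answer it; the function name `binomial_pmf` is an illustrative choice.

```python
import math


def binomial_pmf(n, k, p):
    """Probability of exactly k heads in n independent flips,
    where each flip is a head with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Two flips of a fair coin: P(X >= 1) = P(X = 1) + P(X = 2).
p_at_least_one_head = binomial_pmf(2, 1, 0.5) + binomial_pmf(2, 2, 0.5)
```

For two fair flips this gives 0.5 + 0.25 = 0.75, which agrees with Pr(A) = 3/4 obtained by counting outcomes in the sample space.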

¹ These notes are based on lecture notes from a Fall 2002 course at Duke University taught by Prof. Hartemink.


Page 20: Detecting signals in biological sequences

Basics of probability theory

Probability distribution

Conditional distribution: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$

Bayes’ theorem: $P(A \mid B) = \dfrac{P(A)\,P(B \mid A)}{P(B)}$

Marginalization: $P(A) = \sum_{B} P(A, B)$
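
These three identities can be checked numerically on a small joint distribution. The joint table below is entirely hypothetical (two values each for A and B), and exact fractions are used so the equalities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(A, B); the four entries sum to 1.
JOINT = {
    ("a1", "b1"): F(1, 8), ("a1", "b2"): F(3, 8),
    ("a2", "b1"): F(1, 4), ("a2", "b2"): F(1, 4),
}


def marginal_a(a):
    """Marginalization: P(A = a) = sum over B of P(A = a, B)."""
    return sum(p for (x, _), p in JOINT.items() if x == a)


def marginal_b(b):
    """P(B = b) = sum over A of P(A, B = b)."""
    return sum(p for (_, y), p in JOINT.items() if y == b)


def a_given_b(a, b):
    """Conditional distribution: P(A | B) = P(A and B) / P(B)."""
    return JOINT[(a, b)] / marginal_b(b)


def b_given_a(b, a):
    """P(B | A) = P(A and B) / P(A)."""
    return JOINT[(a, b)] / marginal_a(a)


# Bayes' theorem: P(A | B) = P(A) P(B | A) / P(B).
bayes_rhs = marginal_a("a1") * b_given_a("b1", "a1") / marginal_b("b1")
```

Here `a_given_b("a1", "b1")` and `bayes_rhs` both evaluate to 1/3.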

Page 21: Detecting signals in biological sequences

Problem of de novo motif discovery

Given: a set of DNA sequences X1, X2, X3, X4, ···, Xn bound by a TF

Each Xi is believed to contain a binding site of that TF

Goal is to find:
• locations of these binding sites in the sequences, and
• parameters of the motif model describing these binding sites

Page 22: Detecting signals in biological sequences

Some more notation

$Z$: vector indicating the starting position of the binding site in each sequence, $Z = (Z_1, \ldots, Z_n)$ for sequences $X_1, \ldots, X_n$

$W$: length of the binding site

$\phi$: parameters of the motif model (we assume a position specific scoring matrix, or PSSM). $\phi_{a,b}$ is the probability of finding nucleotide $b$ at location $a$ within the binding site

$\phi_0$: background model parameters - use a k-order Markov model

Page 23: Detecting signals in biological sequences

Likelihood of a sequence

23

[Figure 3.1 diagram: sequences $X_1, \ldots, X_n$; sequence $X_i$ has length $l_i$ and a binding site starting at $Z_i$.]

Figure 3.1: We are given n DNA sequences $X_1, \ldots, X_n$. Each $X_i$ is believed to contain one binding site, depicted in red, at an unknown position denoted by $Z_i$. The goal is to infer the value of $Z$ as well as the motif parameters $\phi$ that best describe the variability in the binding sites.

3.2.2 Sequence model

For simplicity, as depicted in Figure 3.1, each $X_i$ is assumed to contain exactly one binding site of that TF. Let $Z$ be a vector of length n denoting the starting location of the binding site in each sequence: $Z_i = j$ if there is a binding site at position j in sequence $X_i$. The nucleotides not belonging to the binding sites are assumed to be drawn from some background model parameterized by $\phi_0$.

Thus if the sequence $X_i$ is of length $l_i$, and it contains a binding site at location $Z_i$, we can compute the likelihood of the sequence as:

$$P(X_i \mid \phi, Z_i, \phi_0) = P(X_{i,1}, \ldots, X_{i,Z_i-1} \mid \phi_0) \times \left[ \prod_{a=1}^{W} \phi_{a, X_{i,Z_i+a-1}} \right] \times P(X_{i,Z_i+W}, \ldots, X_{i,l_i} \mid \phi_0) \tag{3.8}$$

Each sequence $X_i$ can thus be partitioned into two regions based on the values of $Z_i$ and $W$: one that contains the nucleotides in the binding site, and one that contains the nucleotides that are not part of the binding site. For simplicity, let us use $P_M(X_i \mid \phi, Z_i)$ to denote the likelihood of the region that is explained by the motif model $\phi$, and $P_M(X_i \mid Z_i, \phi_0)$ to denote the likelihood of the region that is explained by the background model $\phi_0$.


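In other words,

$$P_M(X_i \mid \phi, Z_i) = \prod_{a=1}^{W} \phi_{a, X_{i,Z_i+a-1}} \tag{3.9}$$

and

$$P_M(X_i \mid Z_i, \phi_0) = P(X_{i,1}, \ldots, X_{i,Z_i-1} \mid \phi_0) \times P(X_{i,Z_i+W}, \ldots, X_{i,l_i} \mid \phi_0) \tag{3.10}$$

Thus equation (3.8) can be written as:

$$P(X_i \mid \phi, Z_i, \phi_0) = P_M(X_i \mid Z_i, \phi_0) \times P_M(X_i \mid \phi, Z_i) \tag{3.11}$$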
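
The factorization in equations (3.8) to (3.11) can be sketched as a short function. To keep the example small it assumes an order-0 background (each background base is independent), whereas the slides allow a k-order Markov model; positions are 0-based, and the names `log_likelihood`, `phi`, and `bg` are illustrative.

```python
import math


def log_likelihood(seq, z, phi, bg):
    """log P(X_i | phi, Z_i, phi_0) for one sequence.

    seq : DNA string X_i
    z   : 0-based start of the binding site (Z_i)
    phi : list of W dicts, phi[a][b] = P(nucleotide b at motif position a)
    bg  : dict of order-0 background probabilities (a simplifying assumption)
    """
    width = len(phi)
    total = 0.0
    for j, base in enumerate(seq):
        if z <= j < z + width:
            total += math.log(phi[j - z][base])   # motif region, eq. (3.9)
        else:
            total += math.log(bg[base])           # background region, eq. (3.10)
    return total


# Hypothetical toy parameters: a width-1 "motif" that prefers C.
toy_phi = [{"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
toy_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_ll = log_likelihood("AAC", 2, toy_phi, toy_bg)
```

Working in log space avoids underflow on long sequences; here `toy_ll` is the log of 0.25 × 0.25 × 0.7.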

3.2.3 Objective function

We wish to find $\phi$ and $Z$ that maximize the joint posterior distribution of the unknowns conditional on the data. Therefore, our objective function is:

$$\arg\max_{\phi, Z} P(\phi, Z \mid X, \phi_0) = \arg\max_{\phi, Z} P(X \mid \phi, Z, \phi_0)\, P(\phi)\, P(Z) \tag{3.12}$$

assuming independent priors $P(\phi)$ and $P(Z)$ over $\phi$ and $Z$, respectively.

3.2.4 Collapsed Gibbs sampling

If we applied traditional Gibbs sampling to the optimization problem described in equation (3.12), we would have to sample each $Z_i$ and the high dimensional $\phi$. However, collapsing $\phi$, as proposed by Liu [1994], results in a more efficient algorithm with n components.

We note that

$$\begin{aligned}
P(Z \mid X, \phi_0) &\propto P(Z, X \mid \phi_0) \\
&= \int_{\phi} P(\phi, X \mid Z, \phi_0)\, P(Z)\, d\phi \\
&= P(Z) \int_{\phi} P(X \mid \phi, Z, \phi_0)\, P(\phi)\, d\phi \\
&= P(Z)\, P_M(X \mid Z, \phi_0) \int_{\phi} P_M(X \mid \phi, Z)\, P(\phi)\, d\phi
\end{aligned} \tag{3.13}$$


Page 24: Detecting signals in biological sequences

Objective function

Need to find optimal values for $Z$ and $\phi$ which will maximize the posterior distribution:

$$\arg\max_{\phi, Z} P(\phi, Z \mid X, \phi_0) = \arg\max_{\phi, Z} P(X \mid \phi, Z, \phi_0) \times P(\phi) \times P(Z)$$

Page 25: Detecting signals in biological sequences

Traditional Gibbs sampling

Goal is to generate samples from the joint distribution $P(\theta_1, \theta_2, \ldots, \theta_k \mid D)$

$\theta_i$: random variables

$D$: data or known parameters

Gibbs sampling is used when the joint distribution is not known explicitly, but the conditional of each $\theta_i$ can be computed

1. Initialize $\theta^{(0)}_{1:k}$

2. For $t = 1$ to $N$

• Sample $\theta^{(t)}_1 \sim P(\theta_1 \mid \theta^{(t-1)}_2, \theta^{(t-1)}_3, \ldots, \theta^{(t-1)}_k, D)$

• Sample $\theta^{(t)}_2 \sim P(\theta_2 \mid \theta^{(t)}_1, \theta^{(t-1)}_3, \ldots, \theta^{(t-1)}_k, D)$

  ...

• Sample $\theta^{(t)}_k \sim P(\theta_k \mid \theta^{(t)}_1, \theta^{(t)}_2, \ldots, \theta^{(t)}_{k-1}, D)$
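
The loop above can be illustrated on a textbook case that is not part of these slides: a standard bivariate normal with correlation rho, where each conditional is itself normal, x | y ~ N(rho·y, 1 − rho²) and symmetrically for y. The target distribution and the function names below are assumptions chosen only because the conditionals are easy to sample.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler with k = 2 components (x, y) for a standard
    bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    x, y = 0.0, 0.0                   # step 1: initialize
    samples = []
    for _ in range(n_samples):        # step 2: for t = 1 to N
        x = rng.gauss(rho * y, sd)    # sample x given the current y
        y = rng.gauss(rho * x, sd)    # sample y given the new x
        samples.append((x, y))
    return samples


def correlation(samples):
    """Sample Pearson correlation of the (x, y) pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples)
    sxx = sum((x - mx) ** 2 for x, _ in samples)
    syy = sum((y - my) ** 2 for _, y in samples)
    return sxy / math.sqrt(sxx * syy)
```

With enough iterations the sample correlation of the draws should approach rho, even though the sampler never evaluates the joint density directly.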

Page 26: Detecting signals in biological sequences

Collapsed Gibbs sampling

Ideal qualities of a Gibbs sampler:

• sampling one component conditional on others must be fast

• sample autocorrelation must be low to promote better exploration of the sample space

Instead of sampling from $P(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_k)$, reduce the number of components by integrating out one or more of them:

$$P(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_{k-1}) = \int P(\theta_i, \theta_k \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_{k-1})\, d\theta_k$$

Page 27: Detecting signals in biological sequences

Gibbs sampler for motif discovery

Objective:

$$\arg\max_{\phi, Z} P(X \mid \phi, Z, \phi_0) \times P(\phi) \times P(Z)$$

Collapsed Gibbs sampling (Liu 1995): Sample only $Z$ by integrating out the parameters $\phi$


Page 28: Detecting signals in biological sequences

Gibbs sampler for motif discovery

If we assume a Dirichlet distribution for the prior over $\phi$, parameterized by $\alpha$

28

We also note that, since $\phi$ is a multinomial distribution, if we assume a conjugate product Dirichlet prior over $\phi$, we can get a closed form solution for the integral in equation (3.13). In particular, let $\alpha_a$ for $1 \leq a \leq W$ be the parameters for the Dirichlet prior, where $\alpha_a = (\alpha_{a1}, \ldots, \alpha_{a4})$ for the four nucleotides. Thus the integral in equation (3.13) can be simplified as:

$$\int_{\phi} P_M(X \mid \phi, Z)\, P(\phi)\, d\phi \propto \prod_{a=1}^{W} \prod_{b=1}^{4} \Gamma(c_{ab}(X) + \alpha_{ab}) \tag{3.14}$$

where $c_{ab}(X)$ denotes the counts of nucleotide b obtained from the data $X$ at the a-th position in the binding site, where the positions of the binding sites are determined based on the corresponding $Z$.

For the Gibbs sampler, we need to be able to draw $Z_i \sim P(Z_i \mid Z_{[-i]}, X, \phi_0)$. Using equations (3.13) and (3.14) we get:

$$\begin{aligned}
P(Z_i \mid Z_{[-i]}, X, \phi_0) &= \frac{P(Z \mid X, \phi_0)}{P(Z_{[-i]} \mid X, \phi_0)} \\
&\propto \frac{P(Z)\, P_M(X \mid Z, \phi_0) \prod_{a=1}^{W} \prod_{b=1}^{4} \Gamma(c_{ab}(X) + \alpha_{ab})}{P(Z_{[-i]})\, P_M(X_{[-i]} \mid Z_{[-i]}, \phi_0) \prod_{a=1}^{W} \prod_{b=1}^{4} \Gamma(c_{ab}(X_{[-i]}) + \alpha_{ab})} \\
&= P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \prod_{b=1}^{4} \frac{\Gamma(c_{ab}(X) + \alpha_{ab})}{\Gamma(c_{ab}(X_{[-i]}) + \alpha_{ab})}
\end{aligned} \tag{3.15}$$

where $c_{ab}(X_{[-i]})$ denotes the counts obtained from the data $X$ without the sequence $X_i$, based on positions $Z_{[-i]}$. Also, $c_{ab}(X) = c_{ab}(X_{[-i]}) + c_{ab}(X_i)$, where $c_{ab}(X_i)$ denotes the counts in sequence $X_i$. We now use the fact that if $c_1 \gg c_2$,

$$\frac{\Gamma(c_1 + c_2)}{\Gamma(c_1)} \approx c_1^{c_2} \tag{3.16}$$

Note that $c_{ab}(X_i)$ is “1” for exactly one value of b for each a, depending on the nucleotide present in position $Z_i + a - 1$ in sequence $X_i$. It is “0” otherwise.
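
The approximation in equation (3.16) is easy to check numerically with the log-gamma function. The sketch below is only a sanity check; `gamma_ratio` is an illustrative helper name, and the values 50, 1000, and 3 are arbitrary.

```python
import math


def gamma_ratio(c1, c2):
    """Gamma(c1 + c2) / Gamma(c1), computed via log-gamma to avoid overflow."""
    return math.exp(math.lgamma(c1 + c2) - math.lgamma(c1))


# With c2 = 1 the ratio is c1 itself, since Gamma(c + 1) = c * Gamma(c).
ratio_c2_is_1 = gamma_ratio(50.0, 1)

# With c1 = 1000 and c2 = 3 the ratio is 1000 * 1001 * 1002,
# about 0.3% above the approximation 1000**3.
relative = gamma_ratio(1000.0, 3) / 1000.0**3
```

Since $c_{ab}(X_i)$ only takes the values 0 and 1 here, the ratio in equation (3.15) equals 1 or $c_{ab}(X_{[-i]}) + \alpha_{ab}$ exactly in those two cases, by the identity Γ(c + 1) = c Γ(c).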


Page 29: Detecting signals in biological sequences

Gibbs sampler for motif discovery

29

We also note that, since φ is a multinomial distribution, if we assume a conjugate

product Dirichlet prior over φ, we can get a closed form solution for the integral

in equation (3.13). In particular, let αa for 1 ≤ a ≤ W be the parameters for the

Dirichlet prior where αa = (αa1, . . . ,αa4) for the four nucleotides. Thus the integral

in equation (3.13) can be simplified as:

φ

PM(X | φ, Z)P (φ)dφ ∝W�

a=1

4�

b=1

Γ(cab(X) + αab) (3.14)

where cab(X) denotes the counts of nucleotide b obtained from the data X at the ath

position in the binding site, where the positions of the binding sites are determined

based on corresponding Z.

For the Gibbs sampler, we need to be able to draw Zi ∼ P (Zi | Z[−i], X, φ0).

Using equations (3.13) and (3.14) we get:

P (Zi | Z[−i], X, φ0) =P (Z | X, φ0)

P (Z[−i] | X, φ0)

∝P (Z)PM(X | Z, φ0)

W�a=1

4�b=1

Γ(cab(X)) + αab)

P (Z[−i])PM(X[−i] | Z[−i], φ0)W�

a=1

4�b=1

Γ(cab(X[−i]) + αab)

= PM(Xi | Zi, φ0)P (Z)

P (Z[−i])

W�

a=1

4�

b=1

Γ(cab(X) + αab)

Γ(cab(X[−i]) + αab)(3.15)

where $c_{ab}(X_{[-i]})$ denotes the counts obtained from the data $X$ without the sequence $X_i$, based on positions $Z_{[-i]}$. Also, $c_{ab}(X) = c_{ab}(X_{[-i]}) + c_{ab}(X_i)$, where $c_{ab}(X_i)$ denotes the counts in sequence $X_i$. We now use the fact that if $c_1 \gg c_2$,

$$\frac{\Gamma(c_1 + c_2)}{\Gamma(c_1)} \approx c_1^{\,c_2} \tag{3.16}$$

Note that $c_{ab}(X_i)$ is 1 for exactly one value of $b$ for each $a$, depending on the nucleotide present at position $Z_i + a - 1$ in sequence $X_i$. It is 0 otherwise. Therefore, as long as the total number of sequences $n$ is large, we can assume that $c_{ab}(X_{[-i]}) + \alpha_{ab} \gg c_{ab}(X_i)$. Using the approximation in equation (3.16), equation (3.15) can be simplified as:

$$P(Z_i \mid Z_{[-i]}, X, \phi_0) \;\propto\; P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \prod_{b=1}^{4} \big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big)^{c_{ab}(X_i)} \tag{3.17}$$
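The approximation in equation (3.16) can be checked numerically. The sketch below compares the exact ratio, computed through log-Gamma, with $c_1^{c_2}$ for a few arbitrary values; the helper names and test values are assumptions for illustration. Because $\Gamma(c + 1) = c\,\Gamma(c)$, the two agree up to floating-point error when $c_2$ is 0 or 1, which are the only values $c_{ab}(X_i)$ takes here.

```python
from math import lgamma, exp

def gamma_ratio(c1, c2):
    """Exact Gamma(c1 + c2) / Gamma(c1), computed through log-Gamma."""
    return exp(lgamma(c1 + c2) - lgamma(c1))

def power_approx(c1, c2):
    """The approximation c1 ** c2 from equation (3.16)."""
    return c1 ** c2

def relative_error(c1, c2):
    exact = gamma_ratio(c1, c2)
    return abs(exact - power_approx(c1, c2)) / exact

# Arbitrary illustrative values: c1 plays the role of count + pseudocount.
for c1, c2 in [(50.5, 0), (50.5, 1), (50.5, 2), (500.5, 2)]:
    print(c1, c2, gamma_ratio(c1, c2), power_approx(c1, c2),
          relative_error(c1, c2))
```

For $c_2 = 2$ the relative error is $1/(c_1 + 1)$, so it shrinks as $c_1$ grows.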

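Note that $\forall a$, $\sum_{b=1}^{4} c_{ab}(X_{[-i]}) = n - 1$. Thus if we divide each $\prod_{b=1}^{4} \big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big)^{c_{ab}(X_i)}$ in equation (3.17) by $(n - 1) + \sum_{b=1}^{4} \alpha_{ab}$, we get:

$$\begin{aligned}
P(Z_i \mid Z_{[-i]}, X, \phi_0) &\propto P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \prod_{b=1}^{4} \big(\hat{\phi}_{a,b}\big)^{c_{ab}(X_i)} \\
&= P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \hat{\phi}_{a,\,X_{i,Z_i+a-1}}
\end{aligned} \tag{3.18}$$

where $\hat{\phi}$ is the posterior mean of $\phi$ conditioned on sequences $X_{[-i]}$ and positions $Z_{[-i]}$, that is, $\hat{\phi}_{a,b} = \big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big) / \big((n - 1) + \sum_{b'=1}^{4} \alpha_{ab'}\big)$.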
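The posterior mean in equation (3.18) is just normalized counts plus pseudocounts. The sketch below computes it for a small count matrix and evaluates the motif term $\prod_a \hat{\phi}_{a, X_{i,Z_i+a-1}}$ for one candidate site. The function names, the counts, and the pseudocounts are hypothetical.

```python
NUCLEOTIDES = "ACGT"

def posterior_mean_pssm(counts, alpha):
    """phi_hat[a][b] = (c_ab + alpha_ab) / (sum_b c_ab + sum_b alpha_ab)."""
    pssm = []
    for row_c, row_a in zip(counts, alpha):
        total = sum(row_c) + sum(row_a)   # (n - 1) + sum_b alpha_ab
        pssm.append([(c + al) / total for c, al in zip(row_c, row_a)])
    return pssm

def motif_term(site, pssm):
    """Product over site positions a of phi_hat[a][nucleotide at a]."""
    prob = 1.0
    for a, nt in enumerate(site):
        prob *= pssm[a][NUCLEOTIDES.index(nt)]
    return prob

# Hypothetical counts from n - 1 = 4 held-in sequences, motif width W = 2.
counts = [[3, 1, 0, 0],
          [0, 0, 4, 0]]
alpha = [[1.0] * 4 for _ in range(2)]   # assumed pseudocounts

phi_hat = posterior_mean_pssm(counts, alpha)
print(phi_hat)
print(motif_term("AG", phi_hat))
```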

3.3 Framework of PRIORITY

We have developed PRIORITY, a program that performs motif discovery using a collapsed Gibbs sampling approach similar to the one described in the previous section. In the following subsections, we describe the properties of PRIORITY that make it different from traditional motif discovery programs.

3.3.1 Relaxing the assumption of exactly one binding site

PRIORITY allows for the possibility that some sequences do not possess a binding site. This is particularly important when dealing with noisy data. The idea is similar to the ZOOPS (Zero or One Occurrence Per Sequence) model used in …

[Diagram: sequences $X_1, \ldots, X_n$, with sequence $X_i$ of length $l_i$, and the binding-site positions $Z_1, \ldots, Z_n$ marked on them.]

Figure 3.1: We are given $n$ DNA sequences $X_1, \ldots, X_n$. Each $X_i$ is believed to contain one binding site, depicted in red, at an unknown position denoted by $Z_i$. The goal is to infer the value of $Z$ as well as the motif parameters $\phi$ that best describe the variabilities in the binding sites.

3.2.2 Sequence model

For simplicity, as depicted in Figure 3.1, each $X_i$ is assumed to contain exactly one binding site of that TF. Let $Z$ be a vector of length $n$ denoting the starting location of the binding site in each sequence: $Z_i = j$ if there is a binding site starting at position $j$ in sequence $X_i$. The nucleotides not belonging to the binding sites are assumed to be drawn from some background model parameterized by $\phi_0$.

Thus if the sequence $X_i$ is of length $l_i$ and contains a binding site at location $Z_i$, we can compute the likelihood of the sequence as:

$$P(X_i \mid \phi, Z_i, \phi_0) = P(X_{i,1}, \ldots, X_{i,Z_i-1} \mid \phi_0) \times \left[ \prod_{a=1}^{W} \phi_{a,\,X_{i,Z_i+a-1}} \right] \times P(X_{i,Z_i+W}, \ldots, X_{i,l_i} \mid \phi_0) \tag{3.8}$$

Each sequence $X_i$ can thus be partitioned into two regions based on the values of $Z_i$ and $W$: one that contains the nucleotides in the binding site, and another that contains the nucleotides that are not part of the binding site. For simplicity, let us use $P_M(X_i \mid \phi, Z_i)$ to denote the probability of the region that is explained by the motif model $\phi$, and $P_M(X_i \mid Z_i, \phi_0)$ to denote the probability of the region that is explained by the background model $\phi_0$.
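To make equation (3.8) concrete, the sketch below evaluates the log-likelihood of one sequence for a given site start. It assumes a zero-order background in which each nucleotide is drawn independently; the text only says "some background model", so that choice, the function name, the toy PSSM, and the 0-based start index are assumptions for illustration.

```python
from math import log, exp

NUCLEOTIDES = "ACGT"

def sequence_log_likelihood(seq, z, pssm, background):
    """log P(X_i | phi, Z_i, phi_0) as in (3.8), zero-order background assumed.

    z is the 0-based start of the site; pssm[a][b] plays the role of phi_{a,b}.
    """
    width = len(pssm)
    total = 0.0
    for j, nt in enumerate(seq):
        b = NUCLEOTIDES.index(nt)
        if z <= j < z + width:
            total += log(pssm[j - z][b])    # position inside the binding site
        else:
            total += log(background[b])     # position explained by phi_0
    return total

# Hypothetical width-2 motif favouring "AG", with a uniform background.
pssm = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.7, 0.1]]
background = [0.25, 0.25, 0.25, 0.25]

for z in range(3):
    print(z, exp(sequence_log_likelihood("TAGC", z, pssm, background)))
```

In this toy case the likelihood is largest when the site starts at the "AG" in the middle of the sequence.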

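Assuming a uniform prior over positions, and dropping factors that do not depend on $Z_i$, the sampling distribution in equation (3.18) reduces to a ratio of motif to background probabilities over the candidate site:

$$P(Z_i \mid Z_{[-i]}, X, \phi_0) \;\propto\; \prod_{a=1}^{W} \frac{\hat{\phi}_{a,\,X_{i,Z_i+a-1}}}{P(X_{i,Z_i+a-1} \mid X_i, \phi_0)}$$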

Page 30: Detecting signals in biological sequences

The algorithm

Initialize vector $Z$ randomly

Two-step iterative procedure:

1. Hold out one of the $n$ sequences, $X_i$, at random (or in some specified order). Compute $\hat{\phi}$ based on the alignment of all sequences except the held-out sequence.

2. Calculate $P(Z_i \mid Z_{[-i]}, X, \phi_0)$ for each value of $Z_i$ using the new PSSM and sample a position from it.
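The two-step procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the PRIORITY implementation: it assumes exactly one site per sequence, a zero-order background, a uniform prior over positions, and a fixed hold-out order. The function name, parameters, and toy sequences are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def gibbs_motif_sampler(seqs, width, iterations=200, pseudocount=1.0,
                        background=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Collapsed Gibbs sampler sketch; returns 0-based site starts Z."""
    rng = random.Random(seed)
    idx = [[NUCLEOTIDES.index(nt) for nt in s] for s in seqs]
    # Initialize vector Z randomly.
    z = [rng.randrange(len(s) - width + 1) for s in idx]
    for _ in range(iterations):
        for i in range(len(idx)):            # hold out sequences in order
            # Step 1: posterior-mean PSSM from all sequences except X_i.
            counts = [[pseudocount] * 4 for _ in range(width)]
            for k, s in enumerate(idx):
                if k != i:
                    for a in range(width):
                        counts[a][s[z[k] + a]] += 1
            pssm = [[c / sum(row) for c in row] for row in counts]
            # Step 2: motif/background ratio for every candidate start,
            # then sample a new position in proportion to these weights.
            weights = []
            for j in range(len(idx[i]) - width + 1):
                w = 1.0
                for a in range(width):
                    b = idx[i][j + a]
                    w *= pssm[a][b] / background[b]
                weights.append(w)
            z[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return z

# Hypothetical toy input: four sequences, each containing "CAGTGT".
toy_seqs = ["ACGTTACAGTGTAAC",
            "GGCAGTGTTTACGAT",
            "TTTAGCCAGTGTCCA",
            "CAGTGTGGATCCTAA"]
toy_z = gibbs_motif_sampler(toy_seqs, width=6, iterations=50, seed=1)
print(toy_z)
print([s[p:p + 6] for s, p in zip(toy_seqs, toy_z)])
```

Because the sampler is stochastic, different seeds can end at different alignments; in practice one would run it from several starting points and keep the best-scoring result, as the next slide illustrates.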

Page 31: Detecting signals in biological sequences

Iterations in a typical Gibbs sampler

[Plot: motif score (y-axis, −1000 to 400) against iteration (x-axis, 0 to 5000); legend: "with informative prior", "with uniform prior".]

Figure 4.2: Motif scores for two Gibbs samplers: one with and the other without the informative prior, over 5,000 iterations. Both programs were run five times from different starting locations. The two black plots are the best and worst runs for the program with the uniform prior. The two grey plots are the best and worst runs for the program with the informative prior. Although the absolute values of the scores are not comparable (due to an arbitrary constant value assigned to the uniform prior), it is clear that the number of iterations taken to converge for the algorithm with the informative prior is almost half. Also, each of the five runs converges to a similar final motif in the case of the algorithm incorporating the informative prior. On the other hand, during the worst of the five runs for the other program with the uniform prior, the sampler gets stuck in a local maximum that corresponds to a suboptimal motif.