EXPECTATION MAXIMIZATION MEETS SAMPLING IN MOTIF FINDING

Zhizhuo Zhang

Post on 14-Jan-2016


OUTLINE

Review of Mixture Model and EM Algorithm
Importance Sampling
Re-sampling EM
Extending EM
Integrate Other Features
Result

REVIEW MOTIF FINDING: MIXTURE MODELING

Given a dataset X, a motif model Θ, and a background model θ0, the likelihood of the observed X is defined as a mixture of a motif component and a background component.

Optimizing this likelihood is NP-hard; the EM algorithm tackles the problem by introducing missing data. Assume the missing data Zi is a Boolean flag indicating whether site i is a binding site.
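A standard way to write such a two-component mixture is sketched below. The mixing weight \lambda, the site index i, and the site count n are assumed notation, not taken from the slide, so treat this as an illustration of the model's shape rather than the slide's exact formula.

```latex
% Assumed notation: X_1..X_n are the candidate sites, \lambda is the prior
% probability that a site is a binding site (neither appears in the slide text).
L(\Theta, \theta_0, \lambda \mid X)
  = \prod_{i=1}^{n} \Bigl[
      \underbrace{\lambda \, P(X_i \mid \Theta)}_{\text{motif component}}
      + \underbrace{(1-\lambda) \, P(X_i \mid \theta_0)}_{\text{background component}}
    \Bigr]

% With the Boolean flags Z_i treated as known, the complete-data likelihood factorizes:
L_c(\Theta, \theta_0, \lambda \mid X, Z)
  = \prod_{i=1}^{n}
      \bigl[ \lambda \, P(X_i \mid \Theta) \bigr]^{Z_i}
      \bigl[ (1-\lambda) \, P(X_i \mid \theta_0) \bigr]^{1 - Z_i}
```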

REVIEW MOTIF FINDING: EM

E-step:

M-step:
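A minimal sketch of one standard EM formulation for a motif/background mixture, assuming fixed-width sites, a PWM stored as a list of per-column dictionaries, a mixing weight `lam`, and a small pseudocount. All function and parameter names here are hypothetical; the slide's own update formulas may differ.

```python
ALPHABET = "ACGT"

def site_prob(site, pwm):
    """Probability of a site under a PWM (one dict of base -> prob per column)."""
    p = 1.0
    for col, base in zip(pwm, site):
        p *= col[base]
    return p

def background_prob(site, bg):
    """Probability of a site under an order-0 background model."""
    p = 1.0
    for base in site:
        p *= bg[base]
    return p

def e_step(sites, pwm, bg, lam):
    """Posterior probability z_i that each site is a binding site."""
    z = []
    for s in sites:
        motif = lam * site_prob(s, pwm)
        back = (1.0 - lam) * background_prob(s, bg)
        z.append(motif / (motif + back))
    return z

def m_step(sites, z, width, pseudo=0.01):
    """Re-estimate the PWM and the mixing weight from the posteriors."""
    pwm = []
    for j in range(width):
        counts = {b: pseudo for b in ALPHABET}  # pseudocount avoids zero probabilities
        for s, zi in zip(sites, z):
            counts[s[j]] += zi
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    lam = sum(z) / len(z)
    return pwm, lam

def run_em(sites, pwm, bg, lam, iterations=20):
    """Alternate E-step and M-step for a fixed number of iterations."""
    z = e_step(sites, pwm, bg, lam)
    for _ in range(iterations):
        pwm, lam = m_step(sites, z, len(pwm))
        z = e_step(sites, pwm, bg, lam)
    return pwm, lam, z
```

Each iteration touches every site once, which is the linear per-iteration cost mentioned among the pros below.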

PROS AND CONS

Pros:
Pure probabilistic modeling
EM is a well-known method
The complexity of each iteration is linear

Cons:
In each iteration, it examines all the sites (most of which are background sites)
EM is sensitive to its starting condition
The length of the motif is assumed to be given

SAMPLING IDEA (1)

Simple example: 20 As and 10 Bs
AAAAAAAAAAAAAAAAAAAABBBBBBBBBB

Let’s define a sampling function Q(x), where Q(x)=1 when x is sampled.

E.g., P(Q(A)=1)=0.1 and P(Q(B)=1)=0.2. The sampled data may be: AABB

We can recover the original counts from “AABB”:
2 As in sample / 0.1 = 20 As in original
2 Bs in sample / 0.2 = 10 Bs in original
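The arithmetic above can be sketched in a few lines: each sampled item of type x stands for 1 / P(Q(x)=1) items in the original data. The function name `recover_counts` is hypothetical.

```python
from collections import Counter

def recover_counts(sample, sampling_prob):
    """Estimate original counts by dividing sampled counts by the sampling probability."""
    observed = Counter(sample)
    return {x: n / sampling_prob[x] for x, n in observed.items()}

# The slide's example: sample "AABB" with P(Q(A)=1)=0.1 and P(Q(B)=1)=0.2.
estimate = recover_counts("AABB", {"A": 0.1, "B": 0.2})
print(estimate)
```

The estimate is exact here only because the sample happens to contain the expected number of each letter; for a random sample it is correct on average.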

SAMPLING IDEA (2)

Almost any sampling function that gives every item a nonzero chance of being sampled lets us recover the statistics of the original data by reweighting; this is known as “importance sampling”.

We can define a good sampling function on the sequence data, one that prefers to sample binding sites over background sites. Because of its higher parameter complexity, the motif model needs more samples than the background model to achieve the same level of accuracy.

RE-SAMPLING EM

Sampling function Q(·) and sampled data XQ

E-step: the same as in the original EM

M-step:
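One plausible reading of the re-sampled M-step, in the spirit of the importance-sampling example: each sampled site contributes its posterior z_i divided by its sampling probability Q(x_i), so that the weighted counts estimate the counts over the full data. This weighting is an assumption, and the names below are hypothetical; the slide's own formula is not available in the text.

```python
ALPHABET = "ACGT"

def resampled_m_step(sampled_sites, z, q, width, pseudo=0.01):
    """Weighted M-step on sampled sites.

    sampled_sites: sites kept by the sampling function
    z: posterior binding probability of each kept site (from the E-step)
    q: probability with which each kept site was sampled (must be > 0)
    """
    pwm = []
    for j in range(width):
        counts = {b: pseudo for b in ALPHABET}
        for site, zi, qi in zip(sampled_sites, z, q):
            # The importance weight 1/qi undoes the bias of the sampling function.
            counts[site[j]] += zi / qi
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm
```

A site sampled with probability 0.1 counts ten times as much as a site that is always kept, which is how biased sampling can still yield unbiased count estimates.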

RE-SAMPLING EM

How to find a good sampling function?

Intuitively, the motif PWM is the natural choice for a good sampling function, but we cannot know the motif PWM beforehand.

Fortunately, an approximate PWM model already does a good job in practice.

HOW TO FIND A GOOD APPROXIMATING PWM?

Unknown length
Unknown distribution

EXTENDING EM

Start from all over-represented 5-mers.

For each given 5-mer, we find a motif model (PWM) that contains the 5-mer and maximizes the likelihood of the observed data.

We define an extending EM process that optimizes which flanking columns are included in the final PWM.
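The text does not say how over-representation is scored. As an illustration only, the sketch below ranks 5-mers by the ratio of their frequency in the input sequences to their frequency in a set of background sequences; the ratio criterion, the `min_ratio` threshold, and the pseudocount are all assumptions, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(sequences, k=5):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def over_represented_kmers(foreground, background, k=5, min_ratio=2.0, pseudo=1.0):
    """Return (k-mer, enrichment ratio) pairs with ratio >= min_ratio, best first."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    seeds = []
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        # Pseudocount keeps the ratio finite for k-mers absent from the background.
        bg_freq = (bg[kmer] + pseudo) / (bg_total + pseudo)
        ratio = fg_freq / bg_freq
        if ratio >= min_ratio:
            seeds.append((kmer, ratio))
    seeds.sort(key=lambda item: item[1], reverse=True)
    return seeds
```

Each returned 5-mer would then serve as the seed for one run of the extending EM process.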

EXTENDING EM

Imagine a length-25 PWM Θ with the 5-mer q = “ACTTG” in the middle, which is wide enough to target any motif of length less than 15 bp (Wmax).

Pos  1     2     ...  10    11  12  13  14  15  16    ...  24    25
A    0.25  0.25  ...  0.25  1   0   0   0   0   0.25  ...  0.25  0.25
C    0.25  0.25  ...  0.25  0   1   0   0   0   0.25  ...  0.25  0.25
G    0.25  0.25  ...  0.25  0   0   0   0   1   0.25  ...  0.25  0.25
T    0.25  0.25  ...  0.25  0   0   1   1   0   0.25  ...  0.25  0.25
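The initial PWM in the table can be built mechanically: uniform 0.25 columns everywhere, with one-hot columns for the seed 5-mer in the middle. The sketch below assumes the same list-of-dictionaries PWM layout as a convenience; `seed_pwm` is a hypothetical name.

```python
ALPHABET = "ACGT"

def seed_pwm(kmer, total_width=25):
    """Uniform PWM of total_width columns with the seed k-mer fixed in the middle."""
    flank = (total_width - len(kmer)) // 2
    pwm = [{b: 0.25 for b in ALPHABET} for _ in range(total_width)]
    for offset, base in enumerate(kmer):
        # One-hot column: probability 1 for the seed base, 0 for the others.
        pwm[flank + offset] = {b: (1.0 if b == base else 0.0) for b in ALPHABET}
    return pwm

pwm = seed_pwm("ACTTG")
print(len(pwm), pwm[10], pwm[9])
```

With a width of 25 and a 5-mer seed, the seed occupies positions 11 to 15 (1-based), leaving 10 uniform flanking columns on each side.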

EXTENDING EM

We use two indices to maintain the start and end of the real motif PWM.

EXTENDING EM

The M-step is the same as in the original EM, but we need to determine which columns should be included. The decision is based on the increase in log-likelihood obtained by including column j.
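Under the mixture model, a column outside the motif is scored by the background, so including column j changes the expected log-likelihood by the posterior-weighted log ratio of the column's probabilities to the background's. The sketch below computes that quantity and uses it to move the two boundary indices outward. It is an assumption consistent with the model, and the `threshold` parameter and function names are hypothetical; the slide's own criterion may include additional terms.

```python
import math

ALPHABET = "ACGT"

def column_gain(sites, z, j, bg, pseudo=0.01):
    """Expected log-likelihood gain from modeling column j with its own
    distribution instead of the background distribution."""
    counts = {b: pseudo for b in ALPHABET}
    for site, zi in zip(sites, z):
        counts[site[j]] += zi
    total = sum(counts.values())
    gain = 0.0
    for site, zi in zip(sites, z):
        base = site[j]
        gain += zi * math.log((counts[base] / total) / bg[base])
    return gain

def extend_bounds(sites, z, start, end, bg, threshold):
    """Grow the motif's [start, end] column range while the next flanking
    column adds more than `threshold` to the log-likelihood."""
    width = len(sites[0])
    while start > 0 and column_gain(sites, z, start - 1, bg) > threshold:
        start -= 1
    while end < width - 1 and column_gain(sites, z, end + 1, bg) > threshold:
        end += 1
    return start, end
```

A column whose base frequencies match the background yields a gain near zero and is left out, while a conserved column yields a large positive gain.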

CONSIDER OTHER FEATURES IN EM

Other features:
Positional bias
Strand bias
Sequence rank bias

We integrate them into the mixture model:
New likelihood ratio
Boolean variable to determine whether to include each feature

CONSIDER OTHER FEATURES IN EM

If the feature data are modeled as multinomial, a chi-square test is used to decide whether a feature should be included.

The multinomial parameters φ can also be learned in the M-step.
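A hedged sketch of how such a test might look for a binned feature such as position: posterior-weighted bin counts are compared with a uniform null distribution using Pearson's chi-square statistic, and φ is estimated as the normalized weighted counts. The uniform null, the binning, and the caller-supplied critical value are assumptions, and the function names are hypothetical.

```python
def weighted_bin_counts(bins, z, n_bins):
    """Sum the posterior weights z_i of the sites falling in each feature bin."""
    counts = [0.0] * n_bins
    for b, zi in zip(bins, z):
        counts[b] += zi
    return counts

def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic of observed counts against expected proportions."""
    total = sum(observed)
    stat = 0.0
    for obs, p in zip(observed, expected_probs):
        expected = total * p
        stat += (obs - expected) ** 2 / expected
    return stat

def select_feature(bins, z, n_bins, critical_value):
    """Decide whether to include the feature and estimate its multinomial phi.

    bins: bin index of each site for this feature (0 .. n_bins - 1)
    z: posterior binding probability of each site (sum must be positive)
    critical_value: chi-square cutoff for n_bins - 1 degrees of freedom
    """
    observed = weighted_bin_counts(bins, z, n_bins)
    uniform = [1.0 / n_bins] * n_bins  # assumed null: no bias across bins
    include = chi_square_statistic(observed, uniform) > critical_value
    total = sum(observed)
    phi = [c / total for c in observed]
    return include, phi
```

For example, with 4 bins (3 degrees of freedom) a critical value of 7.815 corresponds to a significance level of 0.05.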

ALL TOGETHER

PWM model
Position prior model
Peak rank prior model

SIMULATION RESULT

[Figure: AUC difference, AUC(SEME) - AUC(MEME), plotted against the rank of the AUC difference. Panels: MEME Pax4 motif, SEME Pax4 motif, JASPAR Pax4 motif.]

SIMULATION RESULT

[Figure: Running Time Comparison of MEME, CUDA-MEME, and SEME. x-axis: Total Length of Input Sequences (bp); y-axis: Running Time (min).]

REAL DATA RESULT

163 ChIP-seq datasets
Comparison of 6 popular motif finders
Half of the data for training, half for testing

REAL DATA RESULT

[Figure: de novo AP1 model, de novo FOXA1 model, de novo ER model.]

CONCLUSION

SEME can:
perform EM on biased sampled data while still estimating parameters without bias
vary the PWM size during the EM procedure, starting from a short 5-mer
automatically learn and select other feature information during EM iterations