EXPECTATION MAXIMIZATION MEETS SAMPLING IN MOTIF FINDING
Zhizhuo Zhang

OUTLINE

Review of Mixture Model and EM algorithm
Importance Sampling
Re-sampling EM
Extending EM
Integrate Other Features
Result

REVIEW MOTIF FINDING: MIXTURE MODELING

Given a dataset X, a motif model Θ, and a background model θ0, the likelihood of the observed X is defined as:

Optimizing the likelihood above is NP-hard; the EM algorithm tackles the problem through the concept of missing data. Assume the missing data Zi is a Boolean flag marking whether site i is a binding site, so that each site falls under one of two components:

Motif Component, Background Component
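The equations on this slide did not survive in the transcript. A hedged reconstruction, assuming the standard two-component mixture used by MEME-style motif finders; the mixing weight λ and the site index i = 1..n are assumed notation, not symbols taken from the text above:

```latex
% Assumed notation: X_i is the i-th candidate site, \lambda the motif mixing weight.
L(X \mid \Theta, \theta_0, \lambda)
  = \prod_{i=1}^{n} \Big[ \lambda\, P(X_i \mid \Theta) + (1-\lambda)\, P(X_i \mid \theta_0) \Big]

% Complete-data likelihood with the missing flags Z_i \in \{0, 1\}:
P(X, Z \mid \Theta, \theta_0, \lambda)
  = \prod_{i=1}^{n}
    \underbrace{\big[ \lambda\, P(X_i \mid \Theta) \big]^{Z_i}}_{\text{motif component}}
    \underbrace{\big[ (1-\lambda)\, P(X_i \mid \theta_0) \big]^{1 - Z_i}}_{\text{background component}}
```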

REVIEW MOTIF FINDING: EM

E-step:

M-step:
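The E-step and M-step formulas are missing from the transcript. As a minimal sketch, assuming a two-component mixture over fixed-width sites, a zero-order background, and a mixing weight `lam` (all names here are illustrative, not the author's), the two steps might look like this:

```python
import math

BASES = "ACGT"

def site_prob_motif(site, pwm):
    """P(site | PWM): product of per-column base probabilities."""
    return math.prod(pwm[j][BASES.index(b)] for j, b in enumerate(site))

def site_prob_background(site, bg):
    """P(site | background): zero-order model, one probability per base."""
    return math.prod(bg[BASES.index(b)] for b in site)

def e_step(sites, pwm, bg, lam):
    """Posterior probability z_i that each site is a binding site."""
    z = []
    for s in sites:
        m = lam * site_prob_motif(s, pwm)
        b = (1.0 - lam) * site_prob_background(s, bg)
        z.append(m / (m + b))
    return z

def m_step(sites, z, width, pseudo=0.01):
    """Re-estimate the PWM and mixing weight from the soft assignments z."""
    counts = [[pseudo] * 4 for _ in range(width)]
    for s, zi in zip(sites, z):
        for j, b in enumerate(s):
            counts[j][BASES.index(b)] += zi
    pwm = [[c / sum(col) for c in col] for col in counts]
    lam = sum(z) / len(z)
    return pwm, lam

def run_em(sites, pwm, bg, lam, iters=20):
    """Alternate E-step and M-step for a fixed number of iterations."""
    for _ in range(iters):
        z = e_step(sites, pwm, bg, lam)
        pwm, lam = m_step(sites, z, len(pwm))
    return pwm, lam, z
```

Each iteration touches every site once, which is the linear per-iteration cost noted on the next slide.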

PROS AND CONS

Pros:
Pure probabilistic modeling
EM is a well-known method
The complexity of each iteration is linear

Cons:
In each iteration, it examines all the sites (most of which are background sites)
EM is sensitive to its starting condition
The length of the motif is assumed to be given

SAMPLING IDEA (1)

Simple example: 20 As and 10 Bs
AAAAAAAAAAAAAAAAAAAABBBBBBBBBB

Let’s define a sampling function Q(x), where Q(x)=1 when x is sampled.

E.g., P(Q(A)=1)=0.1 and P(Q(B)=1)=0.2. The sampled data may be: AABB

We can recover the original counts from “AABB”:
2 As in sample / 0.1 = 20 As in original
2 Bs in sample / 0.2 = 10 Bs in original
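The arithmetic above can be sketched in a few lines: each sampled item is weighted by the inverse of its sampling probability. The function name `recover_counts` is illustrative only.

```python
from collections import Counter

def recover_counts(sample, inclusion_prob):
    """Estimate original counts by weighting each sampled item by 1 / P(Q(x)=1)."""
    estimate = Counter()
    for x in sample:
        estimate[x] += 1.0 / inclusion_prob[x]
    return dict(estimate)

# The slide's example: sample "AABB" with P(Q(A)=1)=0.1 and P(Q(B)=1)=0.2.
estimated = recover_counts("AABB", {"A": 0.1, "B": 0.2})
```

For this sample the estimate comes out at about 20 As and 10 Bs; for other random samples it is only correct on average.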

SAMPLING IDEA (2)

Almost any sampling function that gives every item a nonzero chance of being sampled lets us recover the statistics of the original data by reweighting; this is known as “importance sampling”.

We can define a good sampling function on the sequence data, one that prefers to sample binding sites over background sites.
Given its higher parameter complexity, the motif model needs more samples than the background model to achieve the same level of accuracy.

RE-SAMPLING EM

Sampling function Q(·) and sampled data X_Q

E-step: the same as in the original EM
M-step:
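The M-step formula is not present in the transcript. One plausible reading, offered as a sketch rather than the author's exact update, is that each sampled site contributes its soft count z_i scaled by 1 / q_i, where q_i is the probability with which that site was sampled. The names `weighted_m_step`, `z`, and `q` are assumptions.

```python
BASES = "ACGT"

def weighted_m_step(sampled_sites, z, q, width, pseudo=0.01):
    """M-step on sampled sites; each site's soft count z_i is scaled by 1 / q_i."""
    counts = [[pseudo] * 4 for _ in range(width)]
    total = 0.0
    for site, zi, qi in zip(sampled_sites, z, q):
        w = zi / qi  # importance-weighted expected count
        total += w
        for j, b in enumerate(site):
            counts[j][BASES.index(b)] += w
    pwm = [[c / sum(col) for c in col] for col in counts]
    # total estimates the number of motif sites in the full (unsampled) data
    return pwm, total
```

With q_i = 1 for every site this reduces to the ordinary M-step, so the reweighting only matters when sampling is biased.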

RE-SAMPLING EM

How do we find a good sampling function?

Intuitively, the motif PWM is the natural choice of sampling function, but we cannot know the motif PWM beforehand.

Fortunately, an approximate PWM model already does a good job in practice.
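To illustrate how an approximate PWM could drive sampling, here is a sketch in which the sampling probability grows with the PWM-to-background likelihood ratio. The clipping rule, the scale `tau`, and the floor `q_min` are hypothetical choices made for this example, not taken from the slides; the floor keeps every site's probability nonzero so that reweighting stays valid.

```python
import math
import random

BASES = "ACGT"

def likelihood_ratio(site, pwm, bg):
    """P(site | approximate PWM) / P(site | background)."""
    return math.prod(pwm[j][BASES.index(b)] / bg[BASES.index(b)]
                     for j, b in enumerate(site))

def inclusion_probability(site, pwm, bg, tau, q_min=0.01):
    """Hypothetical Q: grows with the likelihood ratio, clipped to [q_min, 1]."""
    return max(q_min, min(1.0, likelihood_ratio(site, pwm, bg) / tau))

def sample_sites(sites, pwm, bg, tau, q_min=0.01, rng=random):
    """Keep each site with probability q; return (site, q) pairs for reweighting."""
    kept = []
    for s in sites:
        q = inclusion_probability(s, pwm, bg, tau, q_min)
        if rng.random() < q:
            kept.append((s, q))
    return kept
```

Storing q alongside each kept site is what allows a later M-step to undo the sampling bias.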

HOW TO FIND A GOOD APPROXIMATING PWM?

Unknown length
Unknown distribution

EXTENDING EM

Start from all over-represented 5-mers.
Similarly, for each 5-mer we find a motif model (PWM) containing that 5-mer which maximizes the likelihood of the observed data.

We define an extending EM process which optimizes the flanking columns included in the final PWM.

EXTENDING EM

Imagine we have a length-25 PWM Θ with the 5-mer q = “ACTTG” in the middle, which is wide enough to target any motif shorter than 15 bp (Wmax).

Pos   1     2     …   10    11  12  13  14  15  16    …   24    25
A     0.25  0.25  …   0.25  1   0   0   0   0   0.25  …   0.25  0.25
C     0.25  0.25  …   0.25  0   1   0   0   0   0.25  …   0.25  0.25
G     0.25  0.25  …   0.25  0   0   0   0   1   0.25  …   0.25  0.25
T     0.25  0.25  …   0.25  0   0   1   1   0   0.25  …   0.25  0.25
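The starting PWM shown above can be built mechanically. A small sketch, with the PWM stored as a list of columns in A, C, G, T order; the function name `seed_pwm` and the `flank` parameter are illustrative:

```python
BASES = "ACGT"

def seed_pwm(kmer, flank=10, bg=(0.25, 0.25, 0.25, 0.25)):
    """PWM of length 2*flank + len(kmer): one-hot core, background flanks."""
    core = [[1.0 if b == base else 0.0 for b in BASES] for base in kmer]
    left = [list(bg) for _ in range(flank)]
    right = [list(bg) for _ in range(flank)]
    return left + core + right

pwm = seed_pwm("ACTTG")  # 25 columns; columns 11-15 (1-based) hold the 5-mer
```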

EXTENDING EM

We use two indices to track the start and end of the real motif within the PWM.

EXTENDING EM

The M-step is the same as in the original EM, but we also need to determine which columns should be included. The increase in log-likelihood from including column j:
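The slide's formula for this increase is not in the transcript. A common form, assumed here, compares modeling column j with its own maximum-likelihood frequencies against modeling it with the background: gain_j = Σ_b c_jb · log(p_jb / θ0_b), where c_jb is the expected count of base b at column j. Because this gain is never negative, the sketch uses a hypothetical `threshold` to decide inclusion; the slides may use a different criterion.

```python
import math

def column_gain(counts, bg):
    """Log-likelihood gain of modeling a column with its own MLE frequencies
    instead of the background; counts are expected base counts (sums of z_i)."""
    n = sum(counts)
    gain = 0.0
    for c, b in zip(counts, bg):
        if c > 0:
            gain += c * math.log((c / n) / b)
    return gain

def extend_bounds(column_counts, bg, start, end, threshold):
    """Grow the half-open range [start, end) outward while the adjacent
    flanking column's gain exceeds the (hypothetical) threshold."""
    while start > 0 and column_gain(column_counts[start - 1], bg) > threshold:
        start -= 1
    while end < len(column_counts) and column_gain(column_counts[end], bg) > threshold:
        end += 1
    return start, end
```

The two returned indices play the role of the start and end markers of the real motif inside the wide PWM.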

CONSIDER OTHER FEATURES IN EM

Other features:
Positional bias
Strand bias
Sequence rank bias

We integrate them into the mixture model:
New likelihood ratio
A Boolean variable determines whether each feature is included

CONSIDER OTHER FEATURES IN EM

If the feature data are modeled as multinomial, a chi-square test is used to decide whether a feature should be included:

The multinomial parameters φ can also be learned in the M-step:
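Both formulas are missing from the transcript. A sketch under stated assumptions: the feature (for example strand) is binned, φ is estimated as the z-weighted bin frequencies, and a Pearson chi-square statistic compares the weighted counts with a null distribution. The function names and the caller-supplied `critical_value` are illustrative, and applying Pearson's statistic to fractional weighted counts is an approximation.

```python
def learn_phi(feature_bins, z, n_bins, pseudo=0.0):
    """M-step for multinomial feature parameters: z-weighted bin frequencies."""
    counts = [pseudo] * n_bins
    for k, zi in zip(feature_bins, z):
        counts[k] += zi
    total = sum(counts)
    return [c / total for c in counts], counts

def chi_square_stat(observed, expected_probs):
    """Pearson statistic comparing weighted counts with a null distribution."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))

def include_feature(observed, expected_probs, critical_value):
    """Include the feature only if it departs significantly from the null."""
    return chi_square_stat(observed, expected_probs) > critical_value
```

For a two-bin feature such as strand (one degree of freedom), 3.841 is the usual critical value at the 0.05 level.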

ALL TOGETHER

[Figure: PWM Model, Position Prior Model, and Peak Rank Prior Model]

SIMULATION RESULT

[Figure: AUC difference (AUC_SEME − AUC_MEME, axis from −0.1 to 0.6) plotted against the rank of the AUC difference (1 to 73)]

[Figure: Pax4 motif from MEME, SEME, and JASPAR]

SIMULATION RESULT

[Figure: Running Time Comparison of MEME, CUDA-MEME, and SEME. x-axis: Total Length of Input Sequences (bp), 200,000 to 4,000,000; y-axis: Running Time (min), 0 to 2,000]

REAL DATA RESULT

163 ChIP-seq datasets
Compare 6 popular motif finders
Half for training, half for testing

REAL DATA RESULT

De novo AP1 Model

De novo FOXA1 Model

De novo ER Model

CONCLUSION

SEME can:
perform EM on biased sampled data but estimate parameters unbiasedly
vary the PWM size in the EM procedure by starting with a short 5-mer
automatically learn and select other feature information during EM iterations
