EXPECTATION MAXIMIZATION MEETS SAMPLING IN MOTIF FINDING

Zhizhuo Zhang

OUTLINE

Review of Mixture Model and EM Algorithm
Importance Sampling
Re-sampling EM
Extending EM
Integrate Other Features
Result

REVIEW MOTIF FINDING: MIXTURE MODELING

Given a dataset X, a motif model Θ, and a background model θ0, the likelihood of the observed X is defined as:

Optimizing the likelihood above is NP-hard; the EM algorithm approaches the problem through the concept of missing data. Assume the missing data Zi is a Boolean flag that marks whether site i is a binding site:

[Equation not preserved; its two terms were labeled "motif component" and "background component".]
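For reference, a sketch of the usual two-component mixture form, written under assumptions because the slide formulas do not survive in this transcript. The mixing weight λ and the site index i = 1..n are symbols introduced here for illustration; they are not taken from the slides.

```latex
\begin{align}
L(X \mid \Theta, \theta_0, \lambda)
  &= \prod_{i=1}^{n} \bigl[\, \lambda\, P(X_i \mid \Theta) + (1-\lambda)\, P(X_i \mid \theta_0) \,\bigr] \\
P(X, Z \mid \Theta, \theta_0, \lambda)
  &= \prod_{i=1}^{n} \bigl[\lambda\, P(X_i \mid \Theta)\bigr]^{Z_i}
     \bigl[(1-\lambda)\, P(X_i \mid \theta_0)\bigr]^{1 - Z_i}
\end{align}
```

In the second line, the factor raised to Z_i plays the role of the motif component and the factor raised to 1 - Z_i plays the role of the background component.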

REVIEW MOTIF FINDING: EM

E-step:

M-step:
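The E-step and M-step formulas are not preserved in this transcript. The following is a minimal sketch of one common form of EM for a two-component mixture, not necessarily the exact update used in the slides. It assumes fixed-width candidate sites, a PWM stored as a list of per-column dictionaries, a mixing weight `lam`, and a small pseudocount; all function and variable names are illustrative.

```python
ALPHABET = "ACGT"

def site_prob_motif(site, pwm):
    """P(site | PWM): product of the per-column base probabilities."""
    p = 1.0
    for j, base in enumerate(site):
        p *= pwm[j][base]
    return p

def site_prob_background(site, bg):
    """P(site | background): product of independent base frequencies."""
    p = 1.0
    for base in site:
        p *= bg[base]
    return p

def e_step(sites, pwm, bg, lam):
    """Posterior probability z_i that each site is a binding site."""
    z = []
    for s in sites:
        motif = lam * site_prob_motif(s, pwm)
        background = (1.0 - lam) * site_prob_background(s, bg)
        z.append(motif / (motif + background))
    return z

def m_step(sites, z, width, pseudo=0.1):
    """Re-estimate the PWM and mixing weight from z-weighted counts."""
    pwm = []
    for j in range(width):
        counts = {a: pseudo for a in ALPHABET}  # assumed pseudocount
        for s, zi in zip(sites, z):
            counts[s[j]] += zi
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in ALPHABET})
    lam = sum(z) / len(z)
    return pwm, lam

def run_em(sites, pwm, bg, lam=0.5, iterations=20):
    """Alternate E-step and M-step for a fixed number of iterations."""
    z = e_step(sites, pwm, bg, lam)
    for _ in range(iterations):
        pwm, lam = m_step(sites, z, len(pwm))
        z = e_step(sites, pwm, bg, lam)
    return pwm, lam, z

# Toy usage: five copies of a planted site among unrelated 4-mers.
sites = ["ACGT"] * 5 + ["TTAA", "GCGC", "CATG", "GGTA", "TACC"]
background = {a: 0.25 for a in ALPHABET}
start = [{a: (0.4 if a == c else 0.2) for a in ALPHABET} for c in "ACGT"]
pwm, lam, z = run_em(sites, start, background)
```

Each iteration touches every site once, which is the linear per-iteration cost noted on the next slide.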

PROS AND CONS

Pros:
Pure probabilistic modeling
EM is a well-known method
The complexity of each iteration is linear

Cons:
In each iteration, it examines all the sites (most of which are background sites)
EM is sensitive to its starting condition
The length of the motif is assumed to be given

SAMPLING IDEA (1)

Simple example: 20 As and 10 Bs
AAAAAAAAAAAAAAAAAAAABBBBBBBBBB

Let’s define a sampling function Q(x), and Q(x)=1 when x is sampled:

E.g., P(Q(A)=1) = 0.1 and P(Q(B)=1) = 0.2.

The sampled data may be: AABB. We can recover the original counts from "AABB":
2 A in sample / 0.1 = 20 A in original
2 B in sample / 0.2 = 10 B in original
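The arithmetic above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the simulation at the end simply checks that the inverse-probability estimate is right on average, even though any single sample gives a noisy estimate.

```python
import random
from collections import Counter

def estimate_original_counts(sample, inclusion_prob):
    """Divide each sampled count by the probability of sampling that symbol."""
    counts = Counter(sample)
    return {sym: counts[sym] / inclusion_prob[sym] for sym in counts}

def draw_sample(data, inclusion_prob, rng):
    """Keep each item independently with its own inclusion probability."""
    return "".join(x for x in data if rng.random() < inclusion_prob[x])

def mean_estimate(data, inclusion_prob, trials, seed=0):
    """Average the estimate over many independent samples."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(trials):
        sample = draw_sample(data, inclusion_prob, rng)
        for sym, value in estimate_original_counts(sample, inclusion_prob).items():
            totals[sym] += value
    return {sym: totals[sym] / trials for sym in inclusion_prob}

probs = {"A": 0.1, "B": 0.2}
estimates = estimate_original_counts("AABB", probs)
averaged = mean_estimate("A" * 20 + "B" * 10, probs, trials=2000)
```

Here `estimates` reproduces the slide's calculation for the sample AABB, and `averaged` should land close to 20 As and 10 Bs.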

SAMPLING IDEA (2)

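Almost any sampling function lets us recover the statistics of the original data, provided every item has a known, nonzero probability of being sampled and each sampled item is reweighted by the inverse of that probability. This is the idea behind importance sampling.

We can define a good sampling function on the sequence data, one that prefers to sample binding sites over background sites.

Because the motif model has more parameters than the background model, it needs more samples than the background to achieve the same level of accuracy.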

RE-SAMPLING EM

Given a sampling function Q(·) and the sampled data X_Q:

E-step: the same as in the original EM.

M-step:
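The re-sampling M-step formula is not preserved in this transcript. One plausible reading, consistent with the sampling idea on the earlier slides, is that each sampled site contributes its posterior weight divided by its probability of being sampled. The sketch below assumes that form; `q` (the per-site inclusion probabilities), the pseudocount, and all names are illustrative rather than taken from the slides.

```python
ALPHABET = "ACGT"

def resampled_m_step(sampled_sites, z, q, width, pseudo=0.1):
    """M-step on sampled data, with each site reweighted by 1 / q_i.

    sampled_sites: sites that were kept by the sampling function
    z: posterior motif probability of each sampled site (from the E-step)
    q: probability with which each sampled site was kept
    """
    pwm = []
    for j in range(width):
        counts = {a: pseudo for a in ALPHABET}  # assumed pseudocount
        for s, zi, qi in zip(sampled_sites, z, q):
            counts[s[j]] += zi / qi
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in ALPHABET})
    # The mixing weight uses the same reweighting in numerator and denominator.
    weighted_motif = sum(zi / qi for zi, qi in zip(z, q))
    weighted_total = sum(1.0 / qi for qi in q)
    lam = weighted_motif / weighted_total
    return pwm, lam

# Two motif-like sites kept with certainty, one background site kept half the time.
pwm, lam = resampled_m_step(["AC", "AC", "GT"], z=[1.0, 1.0, 0.0],
                            q=[1.0, 1.0, 0.5], width=2)
```

The background site counts twice in the denominator of `lam` because it stands in for roughly two background sites in the full data.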

RE-SAMPLING EM

How do we find a good sampling function?

Intuitively, the motif PWM is the natural choice for a sampling function, but we cannot know the motif PWM beforehand.

Fortunately, an approximate PWM already does a good job in practice.
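The slides do not say how the approximate PWM is turned into sampling probabilities. As a purely hypothetical sketch, one option is to map the PWM-to-background likelihood ratio of a site into a probability and to enforce a floor, so that background sites are still sampled occasionally and the weights 1/q stay bounded. The mapping r / (1 + r) and the `floor` parameter are assumptions made for this example.

```python
import random

def likelihood_ratio(site, pwm, bg):
    """P(site | approximate PWM) / P(site | background)."""
    ratio = 1.0
    for j, base in enumerate(site):
        ratio *= pwm[j][base] / bg[base]
    return ratio

def inclusion_probability(site, pwm, bg, floor=0.05):
    """Hypothetical sampling function: higher for motif-like sites."""
    r = likelihood_ratio(site, pwm, bg)
    return max(floor, r / (1.0 + r))

def sample_sites(sites, pwm, bg, rng, floor=0.05):
    """Return (site, q) pairs for the sites that were kept."""
    kept = []
    for s in sites:
        q = inclusion_probability(s, pwm, bg, floor)
        if rng.random() < q:
            kept.append((s, q))
    return kept

bg = {a: 0.25 for a in "ACGT"}
approx_pwm = [{a: (0.7 if a == c else 0.1) for a in "ACGT"} for c in "AC"]
q_motif_like = inclusion_probability("AC", approx_pwm, bg)
q_background = inclusion_probability("GT", approx_pwm, bg)
kept = sample_sites(["AC", "GT", "AC", "TT"], approx_pwm, bg, random.Random(0))
```

Keeping `q` alongside each sampled site matters because the reweighting in the M-step needs it.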

HOW TO FIND A GOOD APPROXIMATING PWM?

Unknown length
Unknown distribution

EXTENDING EM

Start from all over-represented 5-mers.

Similarly, for each 5-mer we find a motif model (PWM) that contains the given 5-mer and maximizes the likelihood of the observed data.

We define an extending EM process that optimizes which flanking columns are included in the final PWM.

EXTENDING EM

Imagine we have a length-25 PWM Θ with the 5-mer q "ACTTG" in the middle, which is wide enough for us to target any motif up to 15 bp long (Wmax).

Pos  1     2     …  10    11  12  13  14  15  16    …  25
A    0.25  0.25  …  0.25  1   0   0   0   0   0.25  …  0.25
C    0.25  0.25  …  0.25  0   1   0   0   0   0.25  …  0.25
G    0.25  0.25  …  0.25  0   0   0   0   1   0.25  …  0.25
T    0.25  0.25  …  0.25  0   0   1   1   0   0.25  …  0.25
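The starting PWM in this table can be built mechanically from the seed 5-mer. The sketch below assumes uniform flanking columns and 0-based indices; the function name and the returned start/end pair are illustrative.

```python
ALPHABET = "ACGT"

def seed_pwm(kmer, w_max):
    """Build a PWM with `kmer` fixed in the middle and uniform flanks.

    Each flank has w_max - len(kmer) columns, so any motif of width up to
    w_max that contains the k-mer fits inside the PWM.
    """
    flank = w_max - len(kmer)
    uniform = {a: 1.0 / len(ALPHABET) for a in ALPHABET}
    pwm = [dict(uniform) for _ in range(flank)]
    for base in kmer:
        pwm.append({a: (1.0 if a == base else 0.0) for a in ALPHABET})
    pwm.extend(dict(uniform) for _ in range(flank))
    start, end = flank, flank + len(kmer)  # 0-based, end exclusive
    return pwm, start, end

pwm, start, end = seed_pwm("ACTTG", w_max=15)
```

With the 5-mer ACTTG and a maximum width of 15, this gives 10 + 5 + 10 = 25 columns, and the seed occupies columns 11 to 15 in the 1-based numbering of the table.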

EXTENDING EM

We use two indices to track the start and end of the real motif PWM.

EXTENDING EM

The M-step is the same as in the original EM, but we also need to determine which columns should be included. The increase in log-likelihood from including column j:
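The formula for this increase is not preserved in the transcript. One natural candidate, offered as an assumption rather than as the slides' definition, is the gain in expected log-likelihood when column j is modeled by its own maximum-likelihood base frequencies instead of the background. The actual criterion may include a penalty or threshold that this sketch omits.

```python
import math

def column_gain(weighted_counts, bg):
    """Log-likelihood gain from modeling a column with its own frequencies.

    weighted_counts: base -> sum of posterior weights z_i of the sites that
                     show that base in column j
    bg: base -> background frequency
    """
    total = sum(weighted_counts.values())
    gain = 0.0
    for base, count in weighted_counts.items():
        if count > 0:
            gain += count * math.log((count / total) / bg[base])
    return gain

bg = {a: 0.25 for a in "ACGT"}
flat_column = column_gain({"A": 2.5, "C": 2.5, "G": 2.5, "T": 2.5}, bg)
conserved_column = column_gain({"A": 10.0, "C": 0.0, "G": 0.0, "T": 0.0}, bg)
```

A column that looks like background gains nothing, while a conserved column gains the most, so extending the start and end indices outward would stop once the flanking columns stop paying for themselves.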

CONSIDER OTHER FEATURES IN EM

Other features:
Positional bias
Strand bias
Sequence rank bias

We integrate them into the mixture model:
New likelihood ratio
A Boolean variable determines whether each feature is included

CONSIDER OTHER FEATURES IN EM

If the feature data are modeled as multinomial, a chi-square test is used to decide whether a feature should be included:

The multinomial parameters φ can also be learned in the M-step:
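Neither formula is preserved in the transcript. The sketch below shows a generic Pearson chi-square test on binned feature counts and a posterior-weighted estimate of φ. The null distribution, the 5% significance level, the binning, and the pseudocount are assumptions made for illustration.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic for observed counts against a null."""
    total = sum(observed)
    stat = 0.0
    for obs, p in zip(observed, expected_probs):
        expected = total * p
        stat += (obs - expected) ** 2 / expected
    return stat

# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def include_feature(observed, expected_probs):
    """Include the feature only if it departs significantly from the null."""
    df = len(observed) - 1
    return chi_square_statistic(observed, expected_probs) > CRITICAL_05[df]

def learn_phi(bins, z, n_bins, pseudo=0.5):
    """Estimate multinomial parameters from posterior-weighted bin counts."""
    counts = [pseudo] * n_bins
    for b, zi in zip(bins, z):
        counts[b] += zi
    total = sum(counts)
    return [c / total for c in counts]

# Strand bias example: 70 predicted sites on one strand, 30 on the other.
strand_counts = [70, 30]
stat = chi_square_statistic(strand_counts, [0.5, 0.5])
use_strand = include_feature(strand_counts, [0.5, 0.5])
phi = learn_phi(bins=[0, 0, 1], z=[1.0, 1.0, 0.0], n_bins=2)
```

A 70/30 split is far from what a strand-neutral motif would produce, so the feature is kept; a 52/48 split would not pass the same test.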

ALL TOGETHER

[Figure: the combined model, with panels labeled PWM model, position prior model, and peak rank prior model.]

SIMULATION RESULT

[Figure: AUC difference (AUC_SEME - AUC_MEME) plotted against the rank of the AUC difference, with Pax4 motif logos from MEME, SEME, and JASPAR.]

SIMULATION RESULT

[Figure: running time comparison of MEME, CUDA-MEME, and SEME against the total length of input sequences (bp).]

REAL DATA RESULT

163 ChIP-seq datasets
Compare 6 popular motif finders
Half for training, half for testing

REAL DATA RESULT

[Figure: de novo AP1, FOXA1, and ER models.]

CONCLUSION

SEME can:
perform EM on biased sampled data but still estimate the parameters without bias
vary the PWM size during the EM procedure by starting with a short 5-mer
automatically learn and select other feature information during the EM iterations
