Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu.


Finding repeat pattern in human genome

by TEIRESIAS algorithm

Xiaojun Hu

Human Genome

TEIRESIAS Algorithm

Define Pattern

• A pattern has the form Σ(Σ ∪ {‘.’})*Σ
• Σ: the alphabet of residues, like ACTG in DNA, or the set of all amino acids
• ‘.’: wild card character
• Pattern: any string that begins and ends with a residue, and contains an arbitrary combination of residues and ‘.’ characters

Define Pattern

• The pattern is defined by three parameters
• L: the minimal number of literals
• W: the pattern length
• K: the minimal repeat times
• Example: pattern A.CH..E has L = 4, W = 7, so it is a <4, 7> pattern

Define Pattern

• P: pattern ‘A.CH..E’
• S: a set of sequences
S = {LFAADCHFFEDTR, LKLALCHESESDR, AFAGCHADELFT}
• Ls(P) = {(i,j) | sequence s_i matches P at offset j}
Ls(‘A.CH..E’) = {(1,4), (2,4), (3,3)}
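The offset list above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `matches_at` and `offset_list` are illustrative and do not come from the slides, and offsets are reported 1-based to match the example.

```python
def matches_at(pattern, seq, start):
    """True if `pattern` matches `seq` beginning at 0-based index `start`.
    A '.' in the pattern matches any residue."""
    if start + len(pattern) > len(seq):
        return False
    window = seq[start:start + len(pattern)]
    return all(p == "." or p == c for p, c in zip(pattern, window))


def offset_list(pattern, sequences):
    """Return Ls(P) as a list of (i, j): 1-based sequence number, 1-based offset."""
    hits = []
    for i, seq in enumerate(sequences, start=1):
        for j in range(len(seq) - len(pattern) + 1):
            if matches_at(pattern, seq, j):
                hits.append((i, j + 1))
    return hits


S = ["LFAADCHFFEDTR", "LKLALCHESESDR", "AFAGCHADELFT"]
print(offset_list("A.CH..E", S))  # [(1, 4), (2, 4), (3, 3)]
```

The support of a pattern is then simply `len(offset_list(P, S))`, which is the number compared against K.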

Problem Definition

• Given a set S = {s_1, s_2, …, s_n} of input sequences and parameters L, W, K, find all maximal <L, W> patterns whose support is at least K, i.e. that repeat at least K times.

TEIRESIAS Algorithm

• The algorithm contains two steps: scanning and convolution
• 1. Scanning the sequences in the input set S and locating all elementary patterns with support at least K.
• 2. Combining the elementary patterns to recover the maximal patterns.

Scanning Process

Convolution Process

• Prefix(P) is the prefix subpattern of P that has exactly (L-1) residues
Example: Prefix(‘F.ASTS’) = ‘F.A’ if L = 3
• Suffix(P) is the suffix subpattern of P with exactly (L-1) residues
Example: Suffix(‘F.A...S’) = ‘A...S’ if L = 3
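The two definitions can be checked mechanically. The sketch below is illustrative only; the helper names `prefix` and `suffix` are my own, and the suffix example is read as ‘F.A...S’ with three wild cards.

```python
def prefix(pattern, L):
    """Shortest leading subpattern of `pattern` containing exactly L-1 residues.
    Assumes L >= 2; '.' is the wild card and does not count as a residue."""
    need = L - 1
    seen = 0
    for idx, ch in enumerate(pattern):
        if ch != ".":
            seen += 1
            if seen == need:
                return pattern[:idx + 1]
    raise ValueError("pattern has fewer than L-1 residues")


def suffix(pattern, L):
    """Shortest trailing subpattern containing exactly L-1 residues."""
    return prefix(pattern[::-1], L)[::-1]


print(prefix("F.ASTS", 3))   # F.A
print(suffix("F.A...S", 3))  # A...S
```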

Convolution Process

• P, Q are arbitrary patterns with at least L residues each
• R denotes a new pattern
• Q’ denotes what remains of Q after Prefix(Q) is thrown away
• Φ denotes the empty string

Convolution Process

• Forward extension
If Suffix(P) = Prefix(Q), P is extended to the right with Q’ (what is left of Q after its prefix is removed)
Example (L = 3): ‘F.AS’ convolutes with ‘AST’ to form ‘F.AST’
• Backward extension
If Prefix(P) = Suffix(Q), P is extended to the left with Q’ (what is left of Q after its suffix is removed)
Example (L = 3): ‘F.AS’ convolutes with ‘TF.A’ to form ‘TF.AS’
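Both extension examples can be reproduced with one convolution function. This is a hedged sketch of the string side of the operation only (offset lists are ignored here); `prefix`, `suffix` and `convolve` are illustrative names, and L = 3 is assumed for the examples.

```python
def prefix(pattern, L):
    """Leading subpattern with exactly L-1 residues ('.' is not a residue)."""
    seen = 0
    for idx, ch in enumerate(pattern):
        if ch != ".":
            seen += 1
            if seen == L - 1:
                return pattern[:idx + 1]
    raise ValueError("pattern has fewer than L-1 residues")


def suffix(pattern, L):
    """Trailing subpattern with exactly L-1 residues."""
    return prefix(pattern[::-1], L)[::-1]


def convolve(p, q, L):
    """Return p extended to the right by Q' if Suffix(p) == Prefix(q),
    otherwise the empty string."""
    if suffix(p, L) != prefix(q, L):
        return ""
    return p + q[len(prefix(q, L)):]


print(convolve("F.AS", "AST", 3))   # F.AST  (forward extension of 'F.AS')
print(convolve("TF.A", "F.AS", 3))  # TF.AS  (backward extension of 'F.AS')
```

Backward extension of P by Q is written here as `convolve(Q, P, L)`, so a single function covers both directions.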

Application 1: Finding repeat pattern in ChrM

• Define the elementary pattern: L=W=5
• Scan the ChrM sequence
Option 1. Generate all possible elementary patterns (4^5 = 1024), implement in R
Option 2. Use Hash function to count elementary patterns (1017), implement in Perl
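Option 2 amounts to counting every window of length L in a hash table. The sketch below shows the idea in Python rather than Perl, on a toy sequence; it is not the script used for the slides, and `count_kmers` is an illustrative name.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every length-k window of `seq` in a hash table (Counter)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequence (not ChrM): the 8-mer GATCACAG occurs twice.
toy = "GATCACAGGTCTGATCACAG"
counts = count_kmers(toy, 5)

K = 2  # minimal repeat times for this toy example
elementary = sorted(kmer for kmer, n in counts.items() if n >= K)
print(elementary)  # ['ATCAC', 'CACAG', 'GATCA', 'TCACA']
```

Only patterns that actually occur are stored, so the table never needs all 4^L possible entries.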

Application 1: Finding repeat pattern in ChrM

[Figure: Elementary patterns distribution in the chrM. x-axis: pattern repeat times, in bins from 1_9 to 70_79; y-axis: pattern number %, from 0 to 50.]

Application 1: Finding repeat pattern in ChrM

• Scan the sequence
Save the pattern offset (index)
ChrM: GATCACAGGTCT … (G is base 1, A is base 2, T is base 3)
Pattern: GATCA Offset: 1
Pattern: ATCAC Offset: 2
Pattern: TCACA Offset: 3
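Saving offsets while scanning can be sketched as an index from each elementary pattern to the list of positions where it starts. The function name `index_kmers` is an assumption for illustration; offsets are 1-based as on the slide.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map every length-k window to the sorted list of its 1-based offsets."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return index


idx = index_kmers("GATCACAGGTCT", 5)
print(idx["GATCA"], idx["ATCAC"], idx["TCACA"])  # [1] [2] [3]
```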

Application 1: Finding repeat pattern in ChrM

• Convolution
Simple repeat pattern offset example:
Repeat Pattern1: GATCACAG Offset: 1
Repeat Pattern2: GATCACAG Offset: 10
Repeat Pattern3: GATCACAG Offset: 20
Elementary patterns and their offsets:
GATCA 1 10 20
ATCAC 2 11 21
TCACA 3 12 22
CACAG 4 13 23
From the offsets, if the pattern is extended forward by 1 base, the offset increases by 1

Application 1: Finding repeat pattern in ChrM

Repeat Pattern: GATCACAG (the last elementary pattern, CACAG, starts at base 4)
Elementary patterns and their offsets:
CACAG 4 13 23
TCACA 3 12 22
ATCAC 2 11 21
GATCA 1 10 20
From the offsets, if the pattern is extended backward by 1 base, the offset decreases by 1
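The offset rule for both directions reduces to intersecting two offset lists after a shift. A minimal sketch using the toy offsets from the slides; the function names are illustrative.

```python
def extend_forward(p_offsets, q_offsets, shift=1):
    """Offsets of P that are followed by Q starting `shift` bases later.
    The extended pattern keeps P's start offsets."""
    q = set(q_offsets)
    return [o for o in p_offsets if o + shift in q]


def extend_backward(p_offsets, q_offsets, shift=1):
    """Offsets of Q that are followed by P starting `shift` bases later.
    The extended pattern starts where Q starts, i.e. P's offset minus `shift`."""
    p = set(p_offsets)
    return [o for o in q_offsets if o + shift in p]


offsets = {
    "GATCA": [1, 10, 20],
    "ATCAC": [2, 11, 21],
    "TCACA": [3, 12, 22],
    "CACAG": [4, 13, 23],
}

# GATCA + ATCAC -> GATCAC, offsets unchanged
print(extend_forward(offsets["GATCA"], offsets["ATCAC"]))   # [1, 10, 20]
# TCACA + CACAG -> TCACAG, offsets are those of TCACA (one less than CACAG)
print(extend_backward(offsets["CACAG"], offsets["TCACA"]))  # [3, 12, 22]
```

The length of the returned list is the repeat count of the extended pattern, which is what gets compared with K.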

Application 1: Finding the repeat pattern in ChrM

• Convolution
Pattern: CACAACAA, length 8, repeat 5 times, offset: 6399 7724 12487 12930 15824
L=W=5, K=5
Elementary patterns and their offsets:
ACAAC: 492 1712 3303 3400 3445 3806 4040 4073 5057 5396 6065 6197 6400 7185 7725 8643 8709 8768 9198 9668 10040 10134 10300 10839 10897 10900 10944 11263 11335 11997 12031 12178 12250 12488 12545 12554 12741 12858 12931 13380 13746 13775 14331 14478 14691 15364 15680 15767 15825 16079
CAACA: 1110 1122 2214 2399 2430 2969 3304 3762 3807 3940 4041 4074 4698 5581 6198 6401 6549 6598 7186 7726 8347 8641 8663 8710 9199 9354 9560 9791 10041 10086 10617 10716 10780 10835 10895 10898 11264 11333 11954 11998 12032 12179 12251 12489 12829 12932 13747 13776 13861 13906 14188 14651 14822 15121 15681 15826 16077 16281
Forward extension:
Offset difference = pattern length (5) - pattern suffix length (4) = 1
Repeat position (repeat times) = 20 > 5, accept
New pattern: ACAACA (elementary pattern ACAAC forward extends 1 base A)
New offset: 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825

Application 1: Finding the repeat pattern in ChrM

• Convolution
Patterns and their offsets:
ACAACA: 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825
AACAA: 240 283 362 646 1123 2400 2766 2920 2970 3302 3721 4039 4699 5173 5287 5395 6196 6402 7727 8508 8642 8664 8691 9667 10299 10420 10781 10896 10899 10943 11334 11573 12030 12249 12416 12490 12553 12740 12933 13388 13395 13745 13774 13777 13862 14029 14189 14193 14573 14658 15363 15603 15679 15682 15827 15867 16078 16282
Forward extension:
Offset difference = new pattern length (6) - pattern suffix length (4) = 2
Repeat position (repeat times) = 8 > 5, accept
New pattern: ACAACAA (pattern ACAACA forward extends 1 base A)
New offset: 6400 7725 10897 12488 12931 13775 15680 15825

Application 1: Finding the repeat pattern in ChrM

• Convolution
Patterns and their offsets:
ACAACAA: 6400 7725 10897 12488 12931 13775 15680 15825
CACAA: 1011 1052 1619 3156 3805 3810 3991 4072 5280 5680 5938 6399 7184 7724 8053 8456 8563 8708 8767 8916 9106 10133 10838 10985 11262 11357 11996 12001 12217 12487 12544 12930 13218 13379 13546 13942 14330 14624 15419 15581 15824 16431
Backward extension:
Offset difference = pattern length (5) - pattern suffix length (4) = 1
Repeat position (repeat times) = 5 >= 5, accept
New pattern: CACAACAA (pattern ACAACAA backward extends 1 base C)
New offset: 6399 7724 12487 12930 15824
• Forward and backward extension finished, repeat times >= K, report it
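The walk-through above (extend forward one base at a time, then backward, accepting each step while the repeat count stays at or above K) can be sketched as a small greedy loop. This is a simplified illustration for exact patterns without wild cards, run on a toy sequence rather than ChrM; it is not the program used for the slides, and `index_kmers` and `grow` are hypothetical names.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map every length-k window to its 1-based offsets."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return index


def grow(seed, index, k, K):
    """Greedily extend `seed` forward, then backward, one base at a time,
    keeping only offsets that survive; stop when support would drop below K."""
    pattern, offsets = seed, list(index.get(seed, ()))
    extended = True
    while extended:                            # forward extension
        extended = False
        tail = pattern[-(k - 1):]              # suffix shared with the next k-mer
        shift = len(pattern) - (k - 1)         # offset difference
        for base in "ACGT":
            q = set(index.get(tail + base, ()))
            kept = [o for o in offsets if o + shift in q]
            if len(kept) >= K:
                pattern, offsets, extended = pattern + base, kept, True
                break
    extended = True
    while extended:                            # backward extension
        extended = False
        head = pattern[:k - 1]                 # prefix shared with the previous k-mer
        for base in "ACGT":
            q = set(index.get(base + head, ()))
            kept = [o - 1 for o in offsets if o - 1 in q]
            if len(kept) >= K:
                pattern, offsets, extended = base + pattern, kept, True
                break
    return pattern, offsets


# Toy sequence with GATCACAG at offsets 1, 10 and 20.
toy = "GATCACAG" + "T" + "GATCACAG" + "CC" + "GATCACAG"
index = index_kmers(toy, 5)
print(grow("GATCA", index, k=5, K=3))  # ('GATCACAG', [1, 10, 20])
print(grow("CACAG", index, k=5, K=3))  # ('GATCACAG', [1, 10, 20])
```

Because the loop takes the first base that keeps the support at K or more, it follows a single extension path; a full implementation would have to explore every acceptable extension and check maximality.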

Application 1: Finding the repeat pattern in ChrM

Pattern   Length   Repeat times   Positions
CACAACA   7        11             3805 4072 6399 7184 7724 8708 11262 11996 12487 12930 15824
ACCCCAT   7        10             535 638 4391 5437 6584 7491 13692 14340 14629 14814
CCCACTA   7        14             15 110 2282 3739 4195 5705 7113 7403 9058 11216 11267 12361 13624 13681
CACCCTA   7        10             1086 2219 11816 11871 12970 13147 13882 14391 15701 16268
CCCCCAT   7        12             4880 7956 8029 8407 11428 12085 12238 12387 12433 14426 14589 16191
CCCTCTA   7        10             726 1192 2068 3513 8275 8284 8608 9086 11980 13785
CCTAGCA   7        11             4198 6357 9061 11744 12988 13132 13306 13468 13925 15320 15647
CCTCCTA   7        12             3322 3610 4168 9279 9729 10092 10952 12095 12985 13990 15335 15491
AGCCCTA   7        10             1098 6924 10290 10353 10695 11969 13069 13558 13837 15317

The pattern report is saved in the ChrM_pattern.txt file. The table shows part of the result (L=5, K=10).

Application 1: Finding the repeat pattern in ChrM

[Figure: The relationship among K, pattern number and average pattern length. x-axis: K (5, 10, 20, 30, 40); plotted series: pattern number and average pattern length; axis range 0 to 700.]

Application 2: Finding the repeat pattern in Chr1

• Define the elementary pattern: L=W=17
• Scan the Chr1 sequence
• Problem:
1. The Chr1 file is 252,194,721 bytes, too large to be read by R
2. The number of all possible elementary patterns (4^17 = 17,179,869,184) exceeds the R memory limit
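The arithmetic behind problem 2 can be checked directly. In the sketch below, the 4-byte counter size and the 2-bit base encoding are assumptions made for illustration; the slides do not say how a full table would have been stored.

```python
L = 17
n_patterns = 4 ** L
print(f"{n_patterns:,}")  # 17,179,869,184

# Assumption: one 32-bit counter per possible pattern in a dense table.
bytes_per_counter = 4
table_gib = n_patterns * bytes_per_counter / 2 ** 30
print(table_gib)  # 64.0 (GiB)

# A DNA base fits in 2 bits, so a 17-mer fits in 34 bits and can serve
# as an integer key instead of a string.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for ch in kmer:
        value = (value << 2) | CODE[ch]
    return value


print(encode("T" * L) == n_patterns - 1)  # True
```

A sequence of about 252 million bases contains at most about 252 million windows, far fewer than 4^17, which is why counting only the patterns that occur (a hash) is more practical than enumerating all of them.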

Application 2: Finding the repeat pattern in Chr1

• Solution:
Use another computer language, like C or Perl
The C program gcat and the scanning result were obtained from Dr. Steven. Parallel programs were used to run the scan. The elementary patterns with over 2000 repeat times are used for convolution.
I also wrote a Perl script: scan.pl
Usage: perl scan.pl <chr file> <output file> <L value>
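Reading the chromosome as a stream avoids loading the whole file at once. The sketch below is a Python illustration of that idea and is not the author's scan.pl; it assumes FASTA-style input (header lines starting with ">") and skips windows containing anything other than A, C, G, T. The name `scan_fasta` is hypothetical.

```python
import io
from collections import Counter, deque


def scan_fasta(handle, k):
    """Count k-mers from a FASTA-like text stream, one line at a time."""
    counts = Counter()
    window = deque(maxlen=k)          # rolling window of the last k bases
    for line in handle:
        if line.startswith(">"):      # header: start a new sequence
            window.clear()
            continue
        for ch in line.strip().upper():
            if ch not in "ACGT":      # e.g. N: no window may span it
                window.clear()
                continue
            window.append(ch)
            if len(window) == k:
                counts["".join(window)] += 1
    return counts


toy = io.StringIO(">toy\nGATCA\nCAGNN\nGATCACAG\n")
counts = scan_fasta(toy, 5)
print(counts["GATCA"], sum(counts.values()))  # 2 8
```

For a real chromosome, pass an open file handle instead of the `io.StringIO` object. An in-memory counter for L = 17 on Chr1 would still be large, so this only illustrates the streaming part.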

Application 2: Finding the repeat pattern in Chr1

• Convolution
Problem: the index file is too large to be read by R
My solution:
perl convolute.pl <index file> <output file> <pattern length(L)> <Repeat times(K)>
It works, but it is still very slow. PVM (Parallel Virtual Machine) techniques should be used to increase the speed.

Application 2: Finding the repeat pattern in Chr1

Some of the patterns found in Chr1

Pattern                                         Pattern length   Repeat times
ATCTCAGCTCACTGCAACCTCTGCCTCCCGGGTTCAAGCGATTCT   45               40
ATCCTCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGG        40               42
CAATCTCAGCTCACTGCAACCTCCACCTCCCAGGTTCAAG        40               49
GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTC        40               108
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC            36               45
CCCACCTCAGCCTCCCAAAGTGCTGGGATTATAGG             35               66
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTA              34               42
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGC              34               42
CCCACCTCAGCCTCCCAAAGTGCTGAGATTACAG              34               54
CAATCTCAGCTCACTGCAACCTCCACCTCCTGGG              34               62

• There are two more patterns, namely,
AGCAATTCTCCTGCCTCA: 2858
CTCCTGCCTCAGCCTCC: 14168
which can convolute with
gctcactgcaacctccgcctcccaggttcaagcaattctc: 108
to form the maximal pattern
GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCC: 46
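The string side of this last convolution can be verified by merging the three patterns on their longest suffix/prefix overlap. The sketch checks only that the sequences chain into the reported maximal pattern; it does not recompute the repeat counts, which depend on the offset lists. `merge_overlap` is an illustrative helper, not part of the original scripts.

```python
def merge_overlap(a, b, min_overlap=1):
    """Join `a` and `b` on the longest suffix of `a` that is a prefix of `b`.
    Returns None if they overlap by fewer than `min_overlap` characters."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None


p40 = "GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTC"
p18 = "AGCAATTCTCCTGCCTCA"
p17 = "CTCCTGCCTCAGCCTCC"

step1 = merge_overlap(p40, p18)      # overlap on AGCAATTCTC
maximal = merge_overlap(step1, p17)  # overlap on CTCCTGCCTCA
print(maximal)
# GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCC
print(len(maximal))  # 54
```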

Application 2: Finding the repeat pattern in Chr1

• Discussion
1. For pattern discovery in the human genome, a high-performance computer and parallel programs should be used. R is not a good tool for solving this problem.
2. The algorithm should be implemented carefully for better performance, for example by reducing the number of loop iterations.

Reference

• Isidore Rigoutsos, Aris Floratos. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 1998, Vol(14) no.1:55-67
• Isidore Rigoutsos et al. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. PNAS, 2006, Vol(103) no.17:6605-6610