DNA duplex stability as discriminative characteristic for Escherichia coli σ54- and σ28- dependent...

7
DNA duplex stability as discriminative characteristic for Escherichia coli s 54 - and s 28 - dependent promoter sequences Scheila de Avila e Silva * , Franciele Forte, Ivaine T.S. Sartor, Tahila Andrighetti, Günther J.L. Gerhardt, Ana Paula Longaray Delamare, Sergio Echeverrigaray Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas,1130, CEP 95070-560 Caxias do Sul, RS, Brazil article info Article history: Received 28 August 2013 Accepted 1 October 2013 Keywords: Promoter prediction Neural networks DNA duplex stability abstract The advent of modern high-throughput sequencing has made it possible to generate vast quantities of genomic sequence data. However, the processing of this volume of information, including prediction of gene-coding and regulatory sequences remains an important bottleneck in bioinformatics research. In this work, we integrated DNA duplex stability into the repertoire of a Neural Network (NN) capable of predicting promoter regions with augmented accuracy, specicity and sensitivity. We took our method beyond a simplistic analysis based on a single sigma subunit of RNA polymerase, incorporating the six main sigma-subunits of Escherichia coli. This methodology employed successfully re-discovered known promoter sequences recognized by E. coli RNA polymerase subunits s 24 , s 28 , s 32 , s 38 , s 54 and s 70 , with highlighted accuracies for s 28 - and s 54 - dependent promoter sequences (values obtained were 80% and 78.8%, respectively). Furthermore, the discrimination of promoters according to the s factor made it possible to extract functional commonalities for the genes expressed by each type of promoter. The DNA duplex stability rises as a distinctive feature which improves the recognition and classication of s 28 - and s 54 - dependent promoter sequences. The ndings presented in this report underscore the usefulness of including DNA biophysical parameters into NN learning algorithms to increase accuracy, specicity and sensitivity in promoter beyond what is accomplished based on sequence alone. Ó 2013 The International Alliance for Biological Standardization. Published by Elsevier Ltd. All rights reserved. 1. Introduction The rst and key step in the process of gene expression is the recognition of the promoter sequence by the RNA polymerase enzyme (RNAP) [1]. In the simplest denition, a promoter can be dened as an element located before the transcription start site (TSS) for an open reading frame (ORF) that may play a regulatory role [1,2]. The proper regulation of transcription is crucial for a single-cell prokaryotes because their environment can change dramatically and instantly [3]. In bacteria, the RNAP holoenzyme consists of ve subunits (2 a, b, b 0 , u) and an additional sigma (s) subunit factor [4, 5]. A collection of different s subunits is responsible for directing the RNAP binding to specic promoter regions and the consequent activation of genes in response to environmental changes. The s factors are labeled according to their molecular weights (s 24 , s 28 , s 32 , s 38 , s 54 and s 70 ). Each s factor has been assigned to a global function. For instance, s 32 and s 24 play a role in the heat-shock response, s 28 is associated with the expression of agellar genes during normal growth, s 54 is involved in nitrogen metabolism and s 70 is the factor responsible for the bulk housekeeping transcriptional activity [5e7]. The canonical prokaryotic promoter has two consensus hex- amers located before the rst nucleotide transcribed (TSS). One is centered at approximately 35 nucleotides (called the 35 region), and another is centered at 10 nucleotides (called the 10 region or Pribnow Box). For s 54 , the positions of these regions are centered at 24 and 12 bp. The hexamers present different consensus se- quences according to the s factor that recognizes them. For instance, s 70 , s 32 , s 28 and s 54 have the following consensus se- quences: TTGACA and TATAAT separated by 16e18 bp; CCCTTGAA and CCCGATNT separated by 13e15 bp; CTAAA and GCCGATAA separated by 15 bp; GGCACG and TTGC separated by 4 bp [5,7]. The correct classication of a given DNA sequence as a promoter or non-promoter is an attractive research area in bioinformatics because it improves genome annotation and allows the generation of hypotheses in the context of the bacterial transcription initiation process and gene function [8]. The use of in silico promoter * Corresponding author. Tel.: þ55 54 3218 21 00x2075; fax: þ55 54 3218 21 49. E-mail addresses: [email protected], [email protected] (S. de Avila e Silva), [email protected] (F. Forte), [email protected] (I. T.S. Sartor), [email protected] (T. Andrighetti), [email protected] (G. J.L. Gerhardt), [email protected] (A.P. Longaray Delamare), [email protected] (S. Echeverrigaray). Contents lists available at ScienceDirect Biologicals journal homepage: www.elsevier.com/locate/biologicals 1045-1056/$36.00 Ó 2013 The International Alliance for Biological Standardization. Published by Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.biologicals.2013.10.001 Biologicals 42 (2014) 22e28

Transcript of DNA duplex stability as discriminative characteristic for Escherichia coli σ54- and σ28- dependent...

lable at ScienceDirect

Biologicals 42 (2014) 22e28

Contents lists avai

Biologicals

journal homepage: www.elsevier .com/locate/biologicals

DNA duplex stability as discriminative characteristic for Escherichiacoli s54- and s28- dependent promoter sequences

Scheila de Avila e Silva*, Franciele Forte, Ivaine T.S. Sartor, Tahila Andrighetti,Günther J.L. Gerhardt, Ana Paula Longaray Delamare, Sergio EcheverrigarayUniversidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil

a r t i c l e i n f o

Article history:Received 28 August 2013Accepted 1 October 2013

Keywords:Promoter predictionNeural networksDNA duplex stability

* Corresponding author. Tel.: þ55 54 3218 21 00x2E-mail addresses: [email protected], scheila.silva@gm

[email protected] (F. Forte), ivaine.sautSartor), [email protected] (T. Andrighetti), gunthGerhardt), [email protected] (A.P. Longaray Dela(S. Echeverrigaray).

1045-1056/$36.00 � 2013 The International Alliancehttp://dx.doi.org/10.1016/j.biologicals.2013.10.001

a b s t r a c t

The advent of modern high-throughput sequencing has made it possible to generate vast quantities ofgenomic sequence data. However, the processing of this volume of information, including prediction ofgene-coding and regulatory sequences remains an important bottleneck in bioinformatics research. Inthis work, we integrated DNA duplex stability into the repertoire of a Neural Network (NN) capable ofpredicting promoter regions with augmented accuracy, specificity and sensitivity. We took our methodbeyond a simplistic analysis based on a single sigma subunit of RNA polymerase, incorporating the sixmain sigma-subunits of Escherichia coli. This methodology employed successfully re-discovered knownpromoter sequences recognized by E. coli RNA polymerase subunits s24, s28, s32, s38, s54 and s70, withhighlighted accuracies for s28- and s54- dependent promoter sequences (values obtained were 80% and78.8%, respectively). Furthermore, the discrimination of promoters according to the s factor made itpossible to extract functional commonalities for the genes expressed by each type of promoter. The DNAduplex stability rises as a distinctive feature which improves the recognition and classification of s28-and s54- dependent promoter sequences. The findings presented in this report underscore the usefulnessof including DNA biophysical parameters into NN learning algorithms to increase accuracy, specificityand sensitivity in promoter beyond what is accomplished based on sequence alone.

� 2013 The International Alliance for Biological Standardization. Published by Elsevier Ltd. All rightsreserved.

1. Introduction

The first and key step in the process of gene expression is therecognition of the promoter sequence by the RNA polymeraseenzyme (RNAP) [1]. In the simplest definition, a promoter can bedefined as an element located before the transcription start site (TSS)for an open reading frame (ORF) thatmay play a regulatory role [1,2].The proper regulation of transcription is crucial for a single-cellprokaryotes because their environment can change dramaticallyand instantly [3]. In bacteria, the RNAP holoenzyme consists of fivesubunits (2a,b,b0,u) andanadditional sigma (s) subunit factor [4, 5].A collection of different s subunits is responsible for directing theRNAP binding to specific promoter regions and the consequentactivation of genes in response to environmental changes. The sfactors are labeled according to theirmolecularweights (s24,s28,s32,

075; fax: þ55 54 3218 21 49.ail.com (S. de Avila e Silva),

[email protected] (I. [email protected] (G. J.L.mare), [email protected]

for Biological Standardization. Pub

s38,s54 ands70). Eachs factorhas been assigned to a global function.For instance,s32 ands24 play a role in the heat-shock response,s28 isassociated with the expression of flagellar genes during normalgrowth, s54 is involved in nitrogen metabolism and s70 is the factorresponsible for the bulk housekeeping transcriptional activity [5e7].

The canonical prokaryotic promoter has two consensus hex-amers located before the first nucleotide transcribed (TSS). One iscentered at approximately 35 nucleotides (called the �35 region),and another is centered at 10 nucleotides (called the �10 region orPribnow Box). For s54, the positions of these regions are centeredat �24 and �12 bp. The hexamers present different consensus se-quences according to the s factor that recognizes them. Forinstance, s70, s32, s28 and s54 have the following consensus se-quences: TTGACA and TATAAT separated by 16e18 bp; CCCTTGAAand CCCGATNT separated by 13e15 bp; CTAAA and GCCGATAAseparated by 15 bp; GGCACG and TTGC separated by 4 bp [5,7].

The correct classification of a given DNA sequence as a promoteror non-promoter is an attractive research area in bioinformaticsbecause it improves genome annotation and allows the generationof hypotheses in the context of the bacterial transcription initiationprocess and gene function [8]. The use of in silico promoter

lished by Elsevier Ltd. All rights reserved.

S. de Avila e Silva et al. / Biologicals 42 (2014) 22e28 23

prediction techniques increases because thesemethods require lesstime and they are not as demanding when compared with molec-ular techniques [9]. Promoter compilation and analysis have led tocomputer programs that predict promoter sequences on the basisof their homology or consensus sequences. Unfortunately, thepredictive power of these algorithms is reduced due the frequentvariations in the consensus sequence position and nucleotidecomposition [10]. Many current promoter prediction methods arebased on position weight matrices [8,11], support vector machine(SVM) [9,10,12] or neural network (NN) [13e17] techniques, andthey are devoted to predicting s70-dependent promoter sequences.In addition to the prediction, another interesting procedure is ruleextraction from trained NNs or SVMs because it gives an explana-tion for how each decision is made [18]. Despite the importance ofthis procedure, it is not often carried out because it requires theapplication of specific algorithms [18,19].

Considering the role of the promoter in the open complex for-mation [1], structural DNA features (such as stability, bendabilityand curvature) are expected to distinguish promoters from othergenomic sequences. It is known that the eukaryotic promoter andprokaryotic s70-dependent promoter sequences have lower sta-bility, higher curvature and lesser bendability than coding se-quences [20]. Consequently, DNA duplex stability is considered apromising parameter for promoter recognition [17,20e23]. Therelative stability of a DNA duplex depends on the identity of thenearest neighbor bases [24,25], and different bioinformatic ap-proaches have used DNA duplex stability [17,21,23].

This paper applies Escherichia coli promoter sequence duplexstability as the input data in NN simulations. The follows approachwas employed: the DNA duplex stability was calculated by applyingthe Nearest Neighbors for the prediction of free energy, and thevalues obtained were filtered using a moving average. This data setwas used for NN learning, and in contrast fromprevious reports, thesimulations were carried out separately for the six s-dependentpromoters of E. coli: s70, s28, s38, s24, s32 and s54. After the clas-sification analysis, the rule extraction (another infrequent proce-dure) was performed to identify significant results obtained fromthe best NN architecture simulated for each s factor promoter set.

2. Methods

2.1. Data set

The positive examples were obtained from the RegulonDBdatabase [26] in the April, 2010 version. A total of 1035 promotersequences belonging to different E. coli s factors were employed.Therefore, for s24, s28, s32, s38, s54 and s70, a total of 69, 20, 69, 99,38 and 740 sequences were obtained, respectively. The negativeexamples were randomly chosen from E. coli non-promoter inter-genic regions in the same number and length (80 nucleotides) asthe positive examples.

To carry out the NN simulation, the following procedure [21,27]was used:

DG0 ¼ DG0iniþ DG0symXn�1

i¼1

DG0ijþ 1 (1)

According to Kanhere and Bansal [27], the terms DG0sym andDG0ini were not considered. The DNA duplex stability scores of allobserved di-, tri-, tetra- or dodeca- nucleotides in sliding over-lapping windows (one nucleotide step) were calculated and aver-aged to evaluate the mean scores of individual promoter and non-promoter sequences [22].

The data sets were smoothed with a moving average using thesoftware Low121 [28]. To find the optimal number of iterations in

the moving average, semi empirical tests were carried out usingfour smoothing values: 3, 5, 8 and 12 degree.

2.2. Neural network simulation

The NNs are powerful computational tools inspired by thestructure and behavior of biological neurons [5]. NNs are compu-tational tools with complex non-linear functions. Many bio-informatic tools based on NN have been developed, includingpromoter prediction [13e15,17], protein analysis [29] and otherapplications [30]. The NNs are particularly well suited for promoterprediction and recognition due to their ability to identify degen-erated, imprecise and incomplete patterns merged within thosesequences, and they can achieve high performance when process-ing extended genome sequences [5].

The NN architecture usedwas amultilayer perceptron (MLP) withthree layers: an input layer, a hidden layer and an output layer. Foreach s dependent-promoter sequence set, different architectureswith the cross combinations of the followingparameterswere tested:

1. Number of neurons in the input layer: 80, 79, 78, 76 and 70.These numbers were obtained according to the size of thewindow (2, 3, 4, 6 and 12) applied in formula 1.

2. Number of neurons in the hidden layer: 1 to 8.3. Number of neurons in the output layer: 1

The R environment was used to carry out the NN simulations[31]. The approach made use of a back-propagation algorithm [15].The k-fold cross-validation methodology was chosen because itprovides statistically valid results. At each iteration, one of the ksubsets was used as the test-set and the others formed a trainingset. The average error across all k trials was then computed [10]. Thek value for each data set was determined according to the numberof promoter sequences available. In this context, the k-values were10 for s70 promoters; 2 for s28 and s54 promoters; and 3 for s24, s32

and s38 promoters.

2.3. Results analysis

The promoter prediction problem is a type of binary classifica-tion, as the input sequence can be classified in only one class of twonon-overlapping classes. In this way, the four possible outcomes ofa classification model evaluate this correctness [5,32]:

- TP: promoter sequences classified as promoter (true positive);- TN: non-promoter sequences recognized as non-promoters(true negative);

- FP: non-promoter sequences classified as promoter (falsepositive);

- FN: promoters classified as non-promoter sequences (falsenegative).

The analysis of the results used the performance measures:Accuracy (A), Specificity (S) and Sensitivity (SN), which werecalculated according to the following formulae:

A ¼ TPþ TNTNþ TPþ FNþ FP

(2)

S ¼ TNTNþ FP

(3)

SN ¼ TPTPþ FN

(4)

Table 1The best architecture trained for each NN trained with the promoters according tothe s factor which recognize the sequence.

s factor Number ofneurons ininput layer

Number ofiterationson Low121

Number ofneurons inhidden layer

Number ofneurons inoutput layer

s 24 80 5 3 1s 28 80 3 4 1s 32 80 12 3 1s 38 80 12 5 1s 54 80 3 5 1s 70 80 5 4 1

Fig 1. The ROC curve for the best architecture for each NN trained with the promotersaccording to the s factor that recognizes the sequence.

S. de Avila e Silva et al. / Biologicals 42 (2014) 22e2824

Specificity is the proportion of negative test sequences that arecorrectly classified, sensitivity is the proportion of positive testsequences that are correctly classified, and accuracy is the pro-portion of correctly classified sequences for the entire data set [32].The specificity and sensitivity values were used to prepare an ROCcurve aiming at the determination of cut-off values for each data setclassification.

The cut-off values, which classify a sequence as a promoter ornon-promoter, were determined by the ROC curve analysis(Receiver Operator Characteristic curve) for each s-dependent dataset. The ROC curve is a plot of sensitivity against specificity fordifferent cut-off points. The ROC curve is widely used because itprovides a visual as well as a numerical summary of a predictor’sbehavior. Further information about the ROC curve can be found inRefs. [32,33].

2.4. Rule extraction

NNs have been used during the last few decades in a wide va-riety of applications [15]. When they are applied to supportdecision-making, such as in classification or clustering problems, itis desirable to understand how the NNs determine their own de-cision. Rule extraction attempts to address this problem by makingexplicit the relationships between the input and output data[18,19].

The rules were extracted based on the values of the hiddenneurons using the FAGNIS technique (Fuzzy Automatically Gener-ated Neural Inferred System). This methodology has been suc-cessfully used to extract rules for promoter prediction [14,23] andfor other biological problems using NNs [29].

Briefly, the FAGNIS technique is based on the idea that theactivation functions (fA) work on a small input range due to pecu-liarities in the data. FAGNIS substitutes the activation functions for aset of linear segments. The value of fA can be approximated by a setof linear segments, using a relationship as demonstrated in Eq. (5):

f A�aj�wSi

�Fi�aj�*�pi*aj þ qi

��(5)

where fA(aj) is the original non-linear function, aj is the activationsignal (weighted sum of the input vector), Fi(aj) is a function thatlinks each value of aj to a correspondent linear segment(s) andpi*aj þ qi are linear segments. To improve the precision of theapproximation, Fi(aj) must be a fuzzy number.

The results are conveniently presented by rule prototypes,which are defined as the average model of the input data set. Therule can be written as a linear equation: “If x z prototype theny ¼ constant of linear equation þ (coefficients of the linear equa-tion)”. Here, x is an input example, y corresponds to the NN outputand the coefficients of the linear equation are the nucleotides of thesequence. Further explanation and the details of FAGNIS can beobtained in Ref. [29].

3. Results and discussion

An initial result of the NN training process was the definition ofthe optimal architecture among all those tested for each s-dependent promoter data set (Table 1). A greater number of neu-rons in the hidden layer did not increase the accuracy, specificity orsensitivity values.

Aiming to evaluate the prediction results and to determinate acut-off value, an ROC curve was analyzed (Fig. 1) for the best NNsimulated for each s-dependent promoter sequence. The cut-offvalues established were 0.6 (s70 and s32), 0.5 (s28 and s54) and0.7 (s38 and s24). In Fig. 1, it is possible to observe the samemathematical behavior for all data sets used. The performance

measures (Table 2) obtained with the chosen cut-off values showedsatisfactory values when compared with the NN prediction litera-ture [13,17,23].

The best NN architecture for each s-dependent promoter ach-ieved the lower accuracy value of 56.52% for s24 and the highestvalue of 80% for s28 (Table 2). The specificity varied from 59.74%(s24) to 77.64% (s54), and the sensitivity varied from and 53.3% (s24)to 83% (s28). The similar specificity and sensitivity values obtainedfor each s-dependent promoter sequences (Table 2) are indicativeof the consistency of the NN learning process. In other words, thisresult shows the capability of the NN to recognize promoters andnon-promoter sequences.

A comparison of the performance values obtained with NNsimulations using DNA duplex stability and orthogonal input data(Table 2) showed that DNA duplex stability simulations were moreefficient to predict s28-, s38- and s54-dependent promoters. Thes70-dependent promoters exhibited almost the same accuracy,specificity and sensitivity values for both input data. Conversely, theperformance values for s24 and s32 were lower than those obtainedwith the orthogonal simulations.

The literature includes many efforts to solve the bacterial pro-moter prediction problem [7,8,10,13,16,17,21,22]. In this paper, wecarried out a comparison of the results achieved with previous

Table 2Performance measures results for the best NN architecture presented in Table 1. In bold the best result when compared with the orthogonal simulation presented by de Avila eSilva et al. [14].

s factor Accuracy (%) Sensibility (%) Specificity (%)

Stability codification Orthogonalcodificationc

Stability codification Orthogonalcodificationc

Stability Codification Orthogonalcodificationc

IRb RSa IRb RSa IRb RSa

s 24 58.7 58.2 71.6 59.7 60.8 69.1 53.3 57.7 73.9s 28 80.2 80.4 70.2 83.0 85.9 66.6 77.4 80.1 73.8s 32 64.7 68.3 72.4 52.5 68.8 71.4 75.5 76.9 73.4s 38 73.1 71.2 67.7 75.2 76.6 68.5 71.0 73.3 66.8s 54 78.8 76.4 73.5 80.0 88.5 73.2 77.6 80.5 73.8s 70 76.9 77.9 77.0 73.6 78.9 76.7 80.2 81.4 77.2

a RS random sequences as negative examples.b IR intergenic region as negative examples.c Tested with random sequences as negative examples.

S. de Avila e Silva et al. / Biologicals 42 (2014) 22e28 25

approaches using DNA stability information. In this context, the NNresults presented in this paper are comparable to those that havebeen reported in other publications in the context of s70-depen-dent promoter sequences.

Ranganann and Bansal [21] developed their own method forpromoter prediction using DNA duplex stability and obtainedmodest accuracy of 52.2% with a sensitivity of 99%. Promoter sta-bility values were also used by [17,23] as NN input data. Askary et al.[17] used an orthogonal codification for stability values for theirpredictions of s70-dependent promoters and achieved sensitivity

Fig. 2. The prototypes extracted for each NN trained with the promoters according to the s

non-promoter sequence prototypes. The nucleotide positions (presented in x-axis) are relatireport are represented by a continuous line (___) as follows: (A) s38, (B) s32, (C) s24, (D) dif

and specificity of 94%. Using the nearest neighbors stability valuesfor all the promoters from the RegulonDB (without s factor sepa-ration), de Avila e Silva et al. [23] obtained accuracy, specificity andsensitivity of approximately 70%.

An alternative analysis of the promoter was carried out by Song[34]. The author used a Z-curve method to extract features fromprokaryotic promoters and use them as input data for classificationby using a partial least squares technique. According to the inputdata set (composed of different types of sequences) the accuracy forE. coliwas 96.05% (s70 promoters, coding negative samples), 90.44%

factor that recognizes the sequence. In all figures, the dashed line (- - -) represents theve to the TSS (Transcription Start Site). The prototypes for the s factors analyzed in thisferent prototypes obtained from s70, (E) s54, (F) s28

Fig. 4. The PCA analysis for the promoter sequences belonging for the prototypes. Thethree components that explain 79.3% of the variance of the model are presented.

S. de Avila e Silva et al. / Biologicals 42 (2014) 22e2826

(s70 promoters, non-coding negative samples), 92.13% (knownsigma-factor promoters, coding negative samples) and 92.50%(known sigma-factor promoters, non-coding negative samples).The accuracies obtained by Song [34] for s70 promoterswere higherthan those obtained in the present paper. However, the resultscannot be directly compared, as the author did not specify the ac-curacy and the rules.

As the performance parameters obtained were indicative ofgood NN prediction and recognition, it was desirable to determinewhich promoter features played an important role in the NNlearning process (Fig. 2). Except for s70, two rules were extractedfor each s-dependent promoter: one from promoter sequences andanother from the negative examples. The three rules obtained fromthe s70-dependent promoter sequences were the consequence ofthe large number of examples and the high level of heterogeneity ofthese promoters [10,34,35]. To illustrate this case, according toHuerta and Collado-Vides [34], the �35 region can be dispensableeither in the presence of a �10 extended (TG 1 bp immediatelyupstream of �10 region) or due to its substitution by upstreamactivator sites (centered at �40 bp).

Considering the DNA duplex stability profiles presented by theprototypes (Fig. 2), it is possible to perceive some similarities anddifferences despite the s factor responsible for the sequencerecognition. In general, a clear difference was detected between thepromoter and non-promoter prototypes. Most of the prototypesexhibited lower stability values in the conserved motifs (�25and �10 region), as previously reported by Kanhere and Bansal[20]. The low stability around the �10 region is expected because itis the DNAmelting regionwhere the transcription bubble is formed[1,3]. Besides that, these results show that DNA duplex stability isan efficient feature to distinguish s28- and s54-dependent pro-moters from each other and from those recognized by others sfactors.

The s28- and s54-dependent promoters are associated with theregulation of bacterial envelope structures, pathogenic mecha-nisms, transport and biosynthesis of the metabolic intermediates,

Fig. 3. The WebLogo frequency plot of the promoter data set used: (A) s38 dependent-prompromoters; (E) s54 dependent-promoters; and (F) s28 dependent-promoters. Here, the nucleto TSS is indicated for the highlight promoter regions.

flagella, lipids and other biological molecules [35,36]. In addition tothe role in biological mechanisms, these promoters are widelydistributed among bacteria. Despite their relevance, the alternativefactors are less studied as the major housekeeping s70-dependentpromoters [37]. For this reason, their recognition became a relevantand still open issue [35e37].

As can be observed in Fig. 3, the �12 to �16 region that char-acterize s28-dependent promoters correspond to a consensualsequence GCCG of high stability presided by a ATAA (�8 to �11)region of low stability [38] . According to Koo et al. [39], s28- ands32-dependent promoter sequences present goodmatches with thecanonical motifs �35, extended �10, and �10 in order to start thetranscription. For promoter sequences recognized by s54, the mostdiscriminatory region (�23 to �27) correspond to the consensualsequence CGGT of high stability described by Barrios et al. [35].Particularly, this is the most important region in the specificinteraction between the s54- RNAP holoenzyme and the promoter.The s54- factor is structurally and functionally distinct from allother s-factors. Gene expression leaded by it requires activator

oters; (B) s32 dependent-promoters; (C) s24 dependent-promoters; (D) s70 dependent-otide 61 is the þ1 nucleotide (assigned with an arrow). The nucleotide position relative

Table 3The average stability values (kcal/mol) for the most important regions according to the PCA analysis.

s Factor Component 1 Component 2 Component 3

Stability of �12to �16

Stability of �46to �51

Stability of �41to �44

Stability of �31to �35

Stability of �8to �11

Stability of þ9to þ12

Stability of �23to �27

s 24 �1.36 �1.32 �1.36 �1.37 �1.29 �1.35 �1.34s 28 �1.51 �1.41 �1.25 �1.21 �1.11 �1.36 �1.34s 32 �1.41 �1.32 �1.41 �1.24 �1.18 �1.31 �1.34s 38 �1.26 �1.27 �1.31 �1.31 �1.18 �1.37 �1.34s 54 �1.26 �1.29 �1.33 �1.19 �1.22 �1.33 �1.55s 70 rule 1 �1.27 �1.29 �1.34 �1.32 �1.23 �1.47 �1.3s 70 rule 2 �1.13 �1.23 �1.24 �1.2 �1.04 �1.31 �1.34s 70 rule 3 �1.15 �1.37 �1.26 �1.29 �1.07 �1.12 �1.4

S. de Avila e Silva et al. / Biologicals 42 (2014) 22e28 27

ATPases, which bind to promoter-distal enhancer DNA sequencesfor starting the process [36,37].

For the others s-dependent promoters, the DNA duplex stabilityshows some highlighted features but without the capability todiscriminate the sequences from each others. The s38-dependentpromoters did not exhibit important variations on DNA stabilityalong the sequence (Fig. 2). However, the less stable region (�3to �8) correspond to the consensual sequence of high A/T content(Fig. 3). The lower stability region of s70- promoters (A/T reachregion) include the well-known Pribnow box motif, which islocated on nucleotides �8 to �12 (Fig. 3). The s24- and s32-dependent promoters, that showed low prediction accuracy,exhibited poorly-defined stability features (Fig. 2) that are associ-ated with sparse consensual regions all along the sequence (Fig. 3).In general, the data showed a correspondence between well-defined conserved A/T or G/C regions formed by 4 or more nucle-otides. This relation is expected as the regional DNA stability de-pends not only on the number of hydrogen-bonds betweencomplementary nucleotides but also that of their neighbors [24,25].

Previous reports, such as that of van Hijum et al. [40], haveindicated that some E. coli promoters present a UP-element (AT-rich sequence), which is an additional cis-element located imme-diately upstream from the �35 region (�40 to �60 bp). The UP-element facilitates RNAP binding and stimulates the intrinsictranscription by up to two orders of magnitude [4]. The lower orhigher stability in addition to the well-known promoter motifs(Figs. 2 and 3) may be related to these elements. The stabilitypromoter profiles can be helpful for explaining how the promotersremain functional despite significant differences in the consensussequences [21,41].

To increase the knowledge about the DNA duplex stabilitypromoter profile and provide a reliable analysis, a principal com-ponents multivariate analysis was carried out (Fig. 4 and Table 3).The aim of this technique is to define, in a statistical way, the mostdiscriminatory differences between each s-dependent promotersequence. A combined analysis of Table 3 and Fig. 3 was conductedto explain the stability features of the prototypes. The first PCAcomponent (which explained 38.6% of the variance) allowed toseparate s28 from the other sequences and grouped the sequencesbelonging to rule 2 and 3 for s70. This separation was associatedwith high and low stability values in the �12 to �16 region of s28

and s70 (rules 2 and 3), respectively. The unusual stability proper-ties of the region that precedes the �10 consensual motif may beimplicated in promoter recognition by s28 and s70 factors.

The second component (22.5% of the variance) separated the s28

and s70 (rule 3) prototypes from the others. These sequences werecharacterized by high stability in the �41 to �44 region, interme-diate values for the �31 to �35 and þ9 to þ12 regions and lowstability in the �8 to �11 region, the consensus motif [4]. The thirdPCA component (18.2% of the variance) separated the s54 prototype

from the others and was associated with the high stability (�1.55)of s54 promoters in the �23 to �27 region (Table 3). This finding isin accordance with the biological background because this highstability region matches one of the consensus motif (definedas �24) of s54 promoters [27].

4. Conclusions

In thispaper, aNNtechnique for thepredictionand recognitionofE. coli promoters based on the stability of the sequence wasemployed. In contrast to the majority of the related literature, thispaper not only investigated the identification of s70-dependentpromoters, but it also included promoter classification for the others factors. The accuracies of the prediction for s24, s28, s32, s38, s54

and s70were 56.5%, 80%, 61.9%, 70.8%, 78.8% and 76.5%, respectively.Among these results, s28- and s54- dependent promoter sequencesshowed the greatest accuracies and exhibited important discrimi-nation features. These results are comparablewith those reported inrelated papers, and they represent additional information that canbe used as a promoter classificationparameter for the establishmentof the DNA duplex stability values. As the regions within promotershows multiple overlapping of sequences [41], the stability infor-mation can reduce the number of false positives presented by insilico promoter prediction. The performance values obtained in thepresent paper indicate that the information generated by the rulesextracted from NN learning can be used to increase accuracy, spec-ificity and sensitivity of promoter prediction programs based onsequence data, such as the BacPP tool [14]. In future works, the au-thors intend to implement the rules presented in this paper usingthe BacPP tool and make the results available on the web.

References

[1] Ramprakash J, Schwarz FP. Energetic contributions to the initiation in E. coli.Biophys Chem 2008;138:91e8.

[2] Hook-Barnard I, Johnson XB, Hinton DM. Escherichia coli RNA polymeraseenzyme of s70- dependent promoter requiring a �35 DNA element and anextended �10 TGn motif. J Bacteriol 2006;188:8352e9.

[3] Huffmann JL, Brennan RG. Prokaryotic transcription regulators: more than justthe helix-turn-helix motif. Curr Opin Struct Biol 2002;12:98e106.

[4] Borukhov S, Nudler E. RNA polymerase: the vehicle of transcription. TrendsMicrobiol 2008;16(3):126e34.

[5] de Avila e Silva S, Echeverrigaray S. Bacterial promoter features descriptionand their application on E. coli in silico prediction and recognition approaches.In: Pérez-Sánchez H, editor. Bioinformatics. Rijeka: InTech; 2012. p. 241e60.

[6] Potvin E, Sanschagrin F, Levesque RC. Sigma factors in Pseudomonas aerugi-nosa. Feder Eur Microbiol Soc 2008;32:38e55.

[7] Janga SC, Collado-Vides J. Structure and evolution of gene regulatory networksin microbial genomes. Res Microbiol 2007;158:787e94.

[8] Jacques P-E, Rodrigue S, Gaudreau L, Goulet J, Brzezinski R. Detection ofprokaryotic promoters from the genomic ditribution of hexanucleotides pairs.BMC Bioinform 2006;7:423.

[9] Towsey M, Timms P, Hogan J, Mathews SA. The cross-species prediction ofbacterial promoters using a support vector machine. Comput Biol Chem2008;32:359e66.

S. de Avila e Silva et al. / Biologicals 42 (2014) 22e2828

[10] Polate K, Günes S. A novel approach to estimation of E. coli promoter genesequences: combining feature selection and least square support vector ma-chine (FS_LSSVN). Appl Math Comput 2007;190:1574e82.

[11] Li Q-Z, Lin H. The recognition and prediction of s70 promoters in Escherichiacoli K-12. J Theor Biol 2006;242:135e41.

[12] Gordon L, Chervonenkis A, Gammerman AJ, Shahmuradov IA, Solovyev VV.Sequence alignment for recognition of promoter regions. Bioinformatics2003;19(15):1964e71.

[13] Burden S, Lin Y-X, Zhang R. Improving promoter prediction for the NNPP2.2algorithm: a case study using Escherichia coli DNA sequences. Bioinformatics2005;21(5):601e7.

[14] de Avila e Silva S, Echeverrigaray S, Gerhardt GJL. BacPP: bacterial promoterpredictionda tool for accurate sigma-factor specific assignment in enter-obacteria. J Theor Biol 2011;287:92e9.

[15] Kalate RN, Tambe SS, Kulkarni BD. Artificial neural networks for prediction ofmycobaterial promoter sequences. Comput Biol Chem 2003;27:555e64.

[16] Rani TS, Bhavani SD, Bapi RS. Analysis of E. coli promoter recognition problemin dinucleotide feature space. Bioinformatics 2007;23(5):582e8.

[17] Askary A, Masoudi-Nejad A, Sharafi R, Mizbani A, Parizi SN, Purmasjedi M. N4:a precise and highly sensitive promoter predictor using neural network fed bynearest neighbors. Genes Genet Syst 2009;84(6):425e30.

[18] Odajima K, Hayashi Y, Tianxia G, Setiono R. Greedy rule generation fromdiscrete data and its use in neural network rule extraction. Neural Networks2008;21:1020e8.

[19] Andrews R, Diederich J, Tickle AB. A survey and critique of techniques forextracting rules from trained artificial neural networks. Knowledge-BasedSyst 1995;8(6):373e89.

[20] Kanhere A, Bansal M. Structural properties of promoters: similarities anddifferences between prokaryotes and eukaryotes. Nucleic Acids Res2005;33(10):3165e75.

[21] Rangannan V, Bansal M. Identification and annotation of promoter regions inmicrobial genome sequences on the basis of DNA stability. J Biosci 2007;32(5):851e62.

[22] Akan P, Deloukas P. DNA sequence and structural properties as predictors ofhuman and mouse promoters. Gene 2008;410(1):165e76.

[23] de Avila e Silva S, Gerhardt GJL, Echeverrigaray S. Rules extraction from neuralnetworks applied to the prediction and recognition of prokaryotic promoters.Genet Mol Biol 2011;34(2):353e60.

[24] Breslauer KJ, Frank R, Blocker H, Marky LA. Predicting DNA duplex stabilityfrom the base sequence. Proc Natl Acad Sci USA 1986;83:3746e50.

[25] SantaLucia J, Hicks D. The thermodynamics of DNA structural motifs. AnnuRev Biophys Biomol Struct 2004;33:415e40.

[26] Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Peñaloza-Spinola MI, Contreras-Moreira B, et al. RegulonDB (version 6.0): gene regu-lation model of Escherichia coli K-12 beyond transcription, active (experi-mental) annotated promoters and Text press navigation. Nucleic AcidsResearch 2008;36:D120e4.

[27] Kanhere A, Bansal M. A novel method for prokaryotic promoter predictionbased on DNA stability. BMC Bioinform 2005;6:1471e2105.

[28] Hegger R, Kantz H, Schreiber T. Practical implementation of nonlinear timeseries methods: the TISEAN package. CHAOS 1999;9:413e35.

[29] Battistella E, Cechin AL. The protein folding problem solved by a fuzzyinference system extracted from an artificial neural network. Lecture NotesComput Sci 2004;3315:474e83.

[30] Tan Ah-H, Pan H. Predictive neural networks for gene expression data anal-ysis. Neural Networks 2005;18:297e306.

[31] R Development Core Team. R: a language and environment for statisticalcomputing. Vienna, Austria: R Foundation for Statistical Computing; 2008.URL: http://www.R-project.org.

[32] Sokolova M, Lapalme G. A systematic analysis of performance measures forclassification tasks. Inform Process Manage 2009;427:427e37.

[33] Sonego P, Kocsor A, Pongor S. ROC analysis: applications to the classification ofbiological sequences and 3D structures. Brief Bioinform 2008;9(3):198e209.

[34] Song K. Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res 2012;40(3):963e71.

[35] Barrios H, Valderrama B, Morett E. Compilation and analysis of s54-dependentpromoter sequences. Nucleic Acids Res 1999;27(22):4305e13.

[36] Shingler V. Signal sensory systems that impact s54-dependent transcription.FEMS Microbiol Rev 2011;35(3):425e40.

[37] L. J. Friedman, J.Gelles, Mechanism of transcription initiation at an activator-dependent promoter defined by single-molecule observation, Cell 148(4)679e689.

[38] Song W, Maiste PJ, Naiman DQ, Ward MJ. Sigma 28 promoter prediction inmembers of the gammaproteobacteria. FEMS 2007;271:222e9.

[39] Koo BM, Rhodius VA, Nonaka G, deHaseth PL, Gross CA. Reduced capacity ofalternative sigmas to melt promoters ensures stringent promoter recognition.Genes Dev 2009;23:2426e36.

[40] van Hijum SAF, Medema MH, Kuipers OP. Mechanisms and evolution ofcontrol logic in prokaryotic transcriptional. Microbiol Mol Biol Rev2009;73(3):481e509.

[41] Huerta AM, Collado-Vides J. Sigma70 promoters in Escherichia coli: specifictranscription in dense regions of overlapping promoter-like signals. J Mol Biol2003;333:261e78.