[IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA...

5
Exploratory Tools for Follow-Up Studies to Microarray Experiments Kaushik Sinha * Ruoming Jin Gagan Agrawal * Helen Piontkivska δ * Department of Computer Science and Engineering Ohio State University, Columbus OH 43210 {sinhak,agrawal}@cse.ohio-state.edu Department of Computer Science Kent State University, Kent OH 44242 {jin}@cs.kent.edu δ Department of Biological Sciences Kent State University, Kent OH 44242 {opiontki}@kent.edu ABSTRACT In this paper we present two different exploratory tools for data anal- ysis following a gene microarray experiment. The first tool is based on a novel data mining formulation, called hypergraph mining. In the second tool, we provide a family of expectation based similarity mea- sures between sets of genes or between a set and a single gene. We have evaluated these two tools using output from two different mi- croarray studies. We showed how many interesting observations could be made using our tools, and how the results from the two tools were similar in many ways. 1. INTRODUCTION In recent years, DNA microarrays and other high-throughput gene and protein assays have become the essential tools for biologists. Biol- ogists can now quickly identify tens, hundreds, and even thousands of candidate genes associated with a target disease or functionality. The next step in biological inquiry typically is to find out What is known about these genes?, and How are these genes related to each other, and to other genes identified in similar studies?. Traditionally, biologists performed a thorough literature review and sequence/structure analysis to answer these questions. Unfortunately, the sheer volume and rapid growth of biological literature and other available data sources has made this practice extremely time-consuming and tedious. In the past several years, many research efforts have focused on liter- ature/text mining to automatically associate existing knowledge in the literature with the genes of interest [29, 28, 13, 11]. Existing research can be largely divided into the following categorizes: 1) Applying Na- ture Language Processing (NLP) or rule-based approach to extract bi- ological information [16, 7, 12, 14, 10], 2) Utilizing co-occurrence relationship to recover relationships (protein-protein interaction) be- tween genes/proteins [33, 17], 3) Incorporating ontologies into mining process [8, 30], and 4) biological information retrieval and text cate- gorization [23, 15, 21, 4]. However, the major difficulties that still remain are 1) How do we extract key properties shared by a set of candidate genes?, 2) How do we generate reasonable scientific hypothesis to explain them ? and 3) How do we define and evaluate similarity between sets of genes ? In this paper we present two different exploratory tools for data anal- ysis following a gene microarray experiment. The main underlying assumption behind our techniques is that if some genes are related or responsible for some disease, then that information probably have been reported in the existing biological literature. Thus, we use data min- ing techniques to extract useful information from the occurrence/co- occurrence of gene names or gene ontology terms in one or more arti- cles in the scientific literature. The first tool is based on a novel data mining formulation, called hypergraph mining and serves the following purpose. Suppose a biol- ogist has a list of genes and she has a hypothesis that a subset of these genes are likely responsible for a disease. Now, if her intuition is right, then a set of genes might appear together in the literature. However, some complications could arise. For example, suppose gene G1 and G2 are actually related or responsible for a disease, and there is also another gene, G3, which is weakly related or responsible for the same disease. So, it is unreasonable to assume that G1 and G2 will always occur together in the same scientific document. Instead, a more reason- able expectation might be that there will be some publication where G1 and G3 will appear together, some other publication where G2 and G3 will appear together, and yet another article where G1 and G2 appear together. The gene G3 here can be considered as a linking gene that helps to conclude that G1 and G2 are actually related. The purpose of hypergraph mining is to capture such contributions from linking genes, i.e. G3 in the above example, to make conclusions about close rela- tion between G1 and G2. Further, we should use not only such linking genes, but also linking gene ontology terms, to make such conclusions. In the second tool, we provide a family of expectation based similar- ity measures that can be used as follows. Suppose, after a gene micro array experiment, a biologist finds a set of genes A that are closely related to a particular disease. Now, she also has multiple large sets of potential genes that might be related or responsible to the same disease. So, a natural question is, “which of these gene sets is most similar to the given set A?”. Our family of expectation based similarity mea- sures can be used to answer such questions. Further, once a set B is found to be most similar, a biologist might be interested in ranking the genes from the set B. The next two sections provide details of these two approaches. We have evaluated these two tools using output from two different mi- croarray studies. We show how many interesting observations could be made using our tools, and how the results from the two tools were similar in many ways. 2. HYPERGRAPH MINING Let D1,...,DN represent N biological literature abstracts, we refer to them as documents. In order to have easy representation of such documents, we use the standard bag of word format, where each doc- ument can be represented as a set of gene names and gene ontology terms appearing in that document, which we refer to as keywords. We also need a dictionary that contains all potential linking keywords of Sixth IEEE Symposium on BionInformatics and BioEngineering (BIBE'06) 0-7695-2727-2/06 $20.00 © 2006

Transcript of [IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA...

Page 1: [IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA (2006.10.16-2006.10.18)] Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06)

Exploratory Tools for Follow-Up Studies to MicroarrayExperiments

Kaushik Sinha∗ Ruoming Jin† Gagan Agrawal∗ Helen Piontkivskaδ

∗Department of Computer Science and EngineeringOhio State University, Columbus OH 43210

{sinhak,agrawal}@cse.ohio-state.edu† Department of Computer Science

Kent State University, Kent OH 44242{jin}@cs.kent.edu

δDepartment of Biological SciencesKent State University, Kent OH 44242

{opiontki}@kent.edu

ABSTRACTIn this paper we present two different exploratory tools for data anal-ysis following a gene microarray experiment. The first tool is basedon a novel data mining formulation, called hypergraph mining. In thesecond tool, we provide a family of expectation based similarity mea-sures between sets of genes or between a set and a single gene. Wehave evaluated these two tools using output from two different mi-croarray studies. We showed how many interesting observations couldbe made using our tools, and how the results from the two tools weresimilar in many ways.

1. INTRODUCTIONIn recent years, DNA microarrays and other high-throughput gene

and protein assays have become the essential tools for biologists. Biol-ogists can now quickly identify tens, hundreds, and even thousands ofcandidate genes associated with a target disease or functionality. Thenext step in biological inquiry typically is to find out What is knownabout these genes?, and How are these genes related to each other, andto other genes identified in similar studies?.

Traditionally, biologists performed a thorough literature review andsequence/structure analysis to answer these questions. Unfortunately,the sheer volume and rapid growth of biological literature and otheravailable data sources has made this practice extremely time-consumingand tedious.

In the past several years, many research efforts have focused on liter-ature/text mining to automatically associate existing knowledge in theliterature with the genes of interest [29, 28, 13, 11]. Existing researchcan be largely divided into the following categorizes: 1) Applying Na-ture Language Processing (NLP) or rule-based approach to extract bi-ological information [16, 7, 12, 14, 10], 2) Utilizing co-occurrencerelationship to recover relationships (protein-protein interaction) be-tween genes/proteins [33, 17], 3) Incorporating ontologies into miningprocess [8, 30], and 4) biological information retrieval and text cate-gorization [23, 15, 21, 4].

However, the major difficulties that still remain are 1) How do weextract key properties shared by a set of candidate genes?, 2) How dowe generate reasonable scientific hypothesis to explain them ? and 3)How do we define and evaluate similarity between sets of genes ?

In this paper we present two different exploratory tools for data anal-ysis following a gene microarray experiment. The main underlyingassumption behind our techniques is that if some genes are related orresponsible for some disease, then that information probably have beenreported in the existing biological literature. Thus, we use data min-ing techniques to extract useful information from the occurrence/co-

occurrence of gene names or gene ontology terms in one or more arti-cles in the scientific literature.

The first tool is based on a novel data mining formulation, calledhypergraph mining and serves the following purpose. Suppose a biol-ogist has a list of genes and she has a hypothesis that a subset of thesegenes are likely responsible for a disease. Now, if her intuition is right,then a set of genes might appear together in the literature. However,some complications could arise. For example, suppose gene G1 andG2 are actually related or responsible for a disease, and there is alsoanother gene, G3, which is weakly related or responsible for the samedisease. So, it is unreasonable to assume that G1 and G2 will alwaysoccur together in the same scientific document. Instead, a more reason-able expectation might be that there will be some publication where G1

and G3 will appear together, some other publication where G2 and G3

will appear together, and yet another article where G1 and G2 appeartogether. The gene G3 here can be considered as a linking gene thathelps to conclude that G1 and G2 are actually related. The purpose ofhypergraph mining is to capture such contributions from linking genes,i.e. G3 in the above example, to make conclusions about close rela-tion between G1 and G2. Further, we should use not only such linkinggenes, but also linking gene ontology terms, to make such conclusions.

In the second tool, we provide a family of expectation based similar-ity measures that can be used as follows. Suppose, after a gene microarray experiment, a biologist finds a set of genes A that are closelyrelated to a particular disease. Now, she also has multiple large sets ofpotential genes that might be related or responsible to the same disease.So, a natural question is, “which of these gene sets is most similar tothe given set A?”. Our family of expectation based similarity mea-sures can be used to answer such questions. Further, once a set B isfound to be most similar, a biologist might be interested in ranking thegenes from the set B.

The next two sections provide details of these two approaches. Wehave evaluated these two tools using output from two different mi-croarray studies. We show how many interesting observations couldbe made using our tools, and how the results from the two tools weresimilar in many ways.

2. HYPERGRAPH MININGLet D1,...,DN represent N biological literature abstracts, we refer

to them as documents. In order to have easy representation of suchdocuments, we use the standard bag of word format, where each doc-ument can be represented as a set of gene names and gene ontologyterms appearing in that document, which we refer to as keywords. Wealso need a dictionary that contains all potential linking keywords of

Sixth IEEE Symposium on BionInformatics and BioEngineering (BIBE'06)0-7695-2727-2/06 $20.00 © 2006

Page 2: [IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA (2006.10.16-2006.10.18)] Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06)

interest. This dictionary depends on the context of the particular ex-periment.

Let k1,...,kT represent a dictionary KT with T keywords. Eachdocument Di can be represented as a set of keywords {kj}, wherekj ∈ KT . A biologist provides a subset of M keywords KM , whichshe hypothesizes might be related or responsible for a particular dis-ease, such that KM ⊆ KT . The goal of our hypergraph mining al-gorithm is to find frequently occurring hyperedges, each of which in-volves a set of keywords from the set KM as the nodes, and potentially,another set of keywords from the set KT − KM as the annotation orlinking keywords. Again, the linking keywords are either linking genenames or linking gene ontology terms as mentioned before.

As in frequent item-set mining, the hyperedges of interest are theones which have a support above a user-specified threshold. The maindifficulty in computing the hyperedges arises because of the notion oflinking keywords, and how they impact the calculation of this support.

2.1 Weighing the HyperedgesThe main challenge here is to build a model that incorporates con-

tribution from linking keywords. If keywords, k1, . . . , kn form a hy-peredge then, either one of the following can be true:

1. All ki,i = 1, . . . , n keywords appear together in one or moredocuments, depending upon the support supplied by the user.

2. n keywords can be one-level apart. By this, we mean, we canform a partition, i.e., two disjoint subsets out of the n keywords,such that each subset appears in different documents. Further,there exists a non-empty set of linking keywords that appear inboth the documents. Thus, these linking keywords form, in asense, a connection between the two subsets.

For any hyperedge H consisting of n keywords, each of the aboveevents can contribute some weight. The weight contributed due to thefirst event, when the n keywords appear together in a document, isthe usual support used in frequent itemset mining algorithms. To takeinto account the second possibility, i.e., when n keywords appear one-level apart and are connected by linking keywords, we introduce a newweight function called cross support. Thus, the overall weight W of ahyperedge H can be expressed as

W (H) = S(H) + CS(H)

where, S(H) is the usual support and CS(H) is the cross support.Evaluating the cross support part is not very straight forward, and

we need a systematic procedure. Let us consider any partition i ofH , which has li linking keywords and occurs ni times in the availableset of documents. Then, we can consider this partition as contributinga weight li × ni. The idea here is that larger the number of linkingkeywords from the dictionary, the higher is the significance of such alink. Using this idea, if we have Nmax different possible partitions,each with different number of linking keywords, then for any p > 0we model cross support as follows. We define a function F ,

F (H) =

NmaxX

i=1

(li × ni)p (1)

where the parameter p controls the effect of linking keywords in com-puting the weight. Now, we define CS as,

CS(H) = eF (H) − 1 (2)

Introducing a new parameter p, and making CS(H) exponential, pro-vides us more flexibility to calculate the contribution from cross sup-port. Further, 1 is subtracted to ensure that CS(H) = 0 when F (H) =0.

Now, if the cross support can have value at most U , then we can clipthe value of CS(H) to U whenever it would otherwise exceed U , i.e.,

CS(H) =

U if eF (H) − 1 ≥ U

eF (H) − 1 otherwise(3)

With CS defined in the above manner, the overall weight functiondoes not necessarily satisfy the down-closure property. However, sincethe overall weight function is formed by adding two functions, one ofwhich respects the down-closure property and the other function isbounded, effective pruning can still be performed. .

3. SIMILARITY MEASURES AMONG DIFFER-ENT SETS OF GENES

Suppose we are given three sets of gene names,- a small set A andtwo much larger sets B and C, and we are interested in finding outwhich of the two larger sets is more similar to the small set A. A stan-dard approach of using intersection set size, i.e., if |A∩B| > |A∩C|then B is more similar to A, could be misleading here because, a) theintersection set size would likely be quite small and empty in manycases and b) a given gene name can have multiple other aliases ordifferent gene symbols. Therefore, we need other ways of comput-ing similarities that will involve correlation among these gene names.Note that we are using the term correlation loosely here, and do notmean the standard statistical correlation.

So, we need other auxiliary information to compute such similarity.One way of obtaining this auxiliary information is to use the freelyavailable on-line biological article abstracts. We can represent eachsuch abstract using the bag of word model, as we had done with ourhypergraph mining approach. Thus, each abstract now acts as a trans-action containing gene names as items.

3.1 Extending the Notion of SimilarityOnce we have the biological article abstracts represented as trans-

actions, we can extend the notion of similarity to be based on unionamong sets, instead of intersection among sets. Suppose, in the fre-quent gene pairs among these abstracts, we see genes from A and B

appearing more frequently together than genes from A and C. Thisimplies that in some sense, the set A is more similar to the set B thanto set C.

We assume we have a user-provided support threshold θ. Next, wedefine parameters that will be used to define the similarity measure.

Ne(A, B, θ)= number of gene pairs exceeding θ s.t. for each suchgene pairs IS, ∃a, b ∈ IS s.t. a ∈ A, b ∈ B.

Nb(A, B, θ)= number of gene pairs exceeding θ s.t. for each suchgene pairs IS, ∀a, b ∈ IS a, b ∈ A ∩ B.

Nt(A,B, θ)= total number of gene pairs exceeding θ. where bygene pair we mean a set of genes containing two or more gene names.

Using the parameters defined as above, the naive similarity measures1 can be defined as,

s1(A,B, θ) =β ∗ Nb + (1 − β) ∗ Ne

Nt

(4)

Here β ∈ [0, 1] is a parameter that can be used to decide how muchweightage we want to give for each component of s1. It should benoted that this measure is normalized, i.e., s1 ∈ [0, 1] and higher thesimilarity measure, more similar are two sets.

3.2 Expectation based Similarity MeasureA more sophisticated measure of similarity can be obtained by using

the idea of statistical expectation. We assume that X, Y are two setscontaining gene names, where X has n genes, x1, . . . , xn and Y hasm genes, y1, . . . , ym. Now, we can consider X and Y as randomvariables where X can take values from x1, ..., xn and Y can takevalues from y1, ..., ym.

Sixth IEEE Symposium on BionInformatics and BioEngineering (BIBE'06)0-7695-2727-2/06 $20.00 © 2006

Page 3: [IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA (2006.10.16-2006.10.18)] Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06)

Since X and Y are discrete random variables, we can think of theirjoint probability mass function as, pX,Y (x, y) = P (X = x, Y = y)where, x ∈ {x1, ..., xn} and y ∈ {y1, ..., ym}. Next, let (X, Y )be a random vector and let Z = g(X, Y ), for a non-negative realvalued function g such that g : X × Y → [0,∞). Then, Z is arandom variable and we can determine its expectation as, E(Z) =P

x

P

yg(x, y)pX,Y (x, y)

Now, let us see how we can use the above idea in our context. Weare given two sets X and Y , containing gene names and we representthe abstracts as transactions. Now, we consider only those transactionswhich have gene names from either of the two given sets and has atleast two gene names. We call these transactions as valid transactions.The transactions where all the gene names come from a single setdo not give much information and therefore, we simply discard thesetransactions. Next, for each (xi, yj), i ∈ {1, ..., n}, j ∈ {1, ..., m},we count how many times (xi, yj) appears together in the set of validtransactions. We store this information in a matrix M(i, j), where M

is a n × m matrix. We call this matrix the probability matrix.Now, we assign pX,Y (xi, yj) = M(i,j)

P

i

P

j M(i,j). Clearly, pX,Y (., .)

is a joint probability mass function of variables X and Y since,-P

i

P

jpX,Y (xi, yj) = 1. So, the expectation now becomes,-

E(Z) = E(g(X,Y )) =P

i

P

j

g(xi,yj)M(i,j)

Mt , where,M t =

P

i

P

jM(i, j)

We can now directly use the above expectation as a measure of sim-ilarity. Clearly for any three sets X, Y, Z, if X is more similar to Y ,then E(g(X,Y )) > E(g(X,Z)).

At this point, we have purposefully kept the function g unspecified.We focus next on the choice of this function.

3.2.1 Family of Expectation Based similarity MeasuresThe idea of expectation based similarity measure we describe above

is very general, in the sense that it actually provides us with a familyof measures. This is because each choice of the function g results ina new measure. Here, we will describe two simple choices for thefunction g.

As a first choice of the g, for any pair (xi, yj), we can simply usethe number of times the pair (xi, yj) appear together over all validtransactions, i.e., g(xi, yj) = M(i, j). With this choice of g, weobtain the first expectation based similarity measure se1 where,

se1(X, Y ) =X

i

X

j

M(i, j)2

M t(5)

The above similarity measure does not take into account the trans-action length. To elaborate this point, suppose a specific pair (x1, y2)appears in two transactions t1, t2, where t1, t2 are:

t1 : x1, y2

t2 : x1, x2, y2, y5

Intuitively, we would like to give more weightage to the occur-rences of t2 like transactions since they contain more genes from twofiles. To take this into account, for any pair (xi, yj), we can defineg(xi, yj) = tot len(xi, yj) where, tot len(xi, yj) counts the sumof transaction lengths of the set of valid transactions, VT , such that(xi, yj) ⊆ VT . With this choice of g, we define the second expecta-tion based similarity measure se2 as,

se2(X, Y ) =X

i

X

j

tot len(xi, yj) × M(i, j)

M t(6)

3.3 Ranking Genes from the Most Similar SetOnce we have determined that a set Y is most similar to X , among

other possible sets, a biologist might be interested in obtaining morespecific information. In particular, if the file Y is quite large, then, a

Hyperedge Linking LinkingTerms Genesfrom GO

PATHOGENESISIMMUNE-RESPONSESECRETIONTRANSCRIPTION

{CCL5,CXCL13} INFLAMMATORY-RESPONSE CCL4CELL-ADHESION CD14CHEMOTAXIS PSMB6KINASE-ACTIVITY FYBINDING LARGE

Table 1: Analysis of a Hyperedge and Linking Terms from 21dataset

Hyperedge Linking LinkingTerms Genesfrom GO

BINDINGTRANSCRIPTIONAPOPTOSIS NKX3-1PHOSPHORYLATION ABCC4

{KLK2,KLK3} METABOLISM PSMB6{KLK3,SPDEF} SECRETION JUN

Table 2: Analysis of Hyperedges and Linking Terms from 31dataset

natural question is, “What are the top 20 or 30 genes from Y that aremost similar to X?” To answer this question, we need a procedure torank the genes from Y .

Interestingly, the expectation based similarity measure frameworkthat we developed above can be used to answer this question. How-ever, the concept of random variable needs to be used in a slightly dif-ferent manner. In the previous section, we considered Y as a randomvariable, but now for each gene name yl ∈ Y , we consider a randomvariable Ul, where Ul is a set containing a single gene yl.

Thus, we compute se(X, Ul) for all Ul, where Ul corresponds toyl ∈ Y . Next, we can sort the genes yl ∈ Y in decreasing orderof se(X, Ul). Finally, we can simply return the first N genes fromthe sorted list. Here se indicates that the similarity measure could beeither se1 or se2.

In terms of implementation of this approach, we do not need to buildseparate transaction list for each pair of (X, Ul). Instead, we can usethe same transaction list and evaluate s(X, Ul) in an iterative manner.

4. RESULTSWe first describe the datasets we have used and data preprocessing

steps we took. Then, we report the results obtained from the two dataanalysis tools.

4.1 Datasets and Data PreprocessingWe used two lists of 21 and 31 genes, respectively, that are dif-

ferentially expressed between prostate epithelial and stromal cells inprostate cancer patients (including genes that were significantly up-regulated in stromal cells). These lists contain Affymetrix probe ID,which could be searched via Affymetrix website1and GenBank ac-cession number. Since there is no common standard for represent-ing gene names/symbols, in order to avoid ambiguity among approvedgene symbols, previous symbols, and aliases, we searched Genbank2 ,HUGO3 databases and the Affymetrix website to represent each geneby a unique gene identifier.

Next, to form the dictionary we used 300 genes from [5]. This in-cludes genes that were significantly up or down-regulated in tumor andadjacent normal tissues when compared with a normal donor tissue.Each gene was validated by three biological databases and represented1http://www.affymetrix.com/support/technical/manual/probe set display manual.affx2http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide3http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl

Sixth IEEE Symposium on BionInformatics and BioEngineering (BIBE'06)0-7695-2727-2/06 $20.00 © 2006

Page 4: [IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA (2006.10.16-2006.10.18)] Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06)

A B C D

Nb 0 0 0 0Ne 16 6 1 1Nt 21 7 3 2

Table 3: Similarity Measure s1 could be Misleading

Metric A B C D

s1 0.7619 0.8571 0.3333 0.5000se1 22.1144 5.9386 2.8800 5.0990se2 51.2019 16.8380 7.6200 2.8000

Table 4: Similarity Measurement between the 21 dataset and thefiles A, B, C, and D

by a unique identifier, as described above. We also added Gene Ontol-ogy (GO4) terms corresponding to the genes already in the dictionary.

After having the user-provided set of gene names and the above dic-tionary, the next step was to represent the biological literature infor-mation. For the purpose of our experiments, we used freely availablebiological literature abstracts from Pubmed5. We use the standard bagof words format to represent each such document. Specifically, eachdocument is represented by a set of keywords {wi}, where each wi isin our dictionary. Since these abstracts are available only after fillingout an on-line query form manually, we implemented a Java programthat fetches the abstracts for each user-provided keyword iteratively,automatically filling out the on-line form in the process. The numberof documents we extracted and represented varied between 700 and1100 for our experiments. As we considered only the documents thathad at least one of the key-words from our set, this is a very smallfraction of all datasets available in Pubmed.

4.2 Results from Hypergraph MiningIn this section we describe the results obtained using our first tool,

which is based on hypergraph mining. We focus on the biologicalsignificance of the interesting gene pairs that were found as hyperedgesby our first tool.

4.2.1 Results from 21 datasetAn interesting gene pair found as hyperedges by our first tool is re-

ported in Table 1. Of particular importance here is CCL4 as a linkingword for gene pair {CCL5, CXCL13}. The inflammatory chemokine,CCL5, is known for its participation in several types of cancers, in-cluding the breast cancer and the prostate cancer (PC) ( [22], [31]).Thus, an elevated level of CCL5 expression detected in the PC sam-ples, i.e. its presence in the 21-genes list, is consistent with the earlierobservations ( [31]), suggesting that indeed CCL5 plays a large rolein PC development. Other chemokines and chemokine receptors, in-cluding CCL4 and CXCL13, among others, are also known to exhibitabnormal expression patterns in various cancers, thus controlling theimmune response to the tumor (e.g., [19]). Note that immune responseis one of the linking words from GO that our algorithm reports.

4.2.2 Results from 31 datasetAgain interesting gene pairs found as hyperedges by our first tool are

reported in Table 2. Results showed significant relationships betweenKLK2 and KLK3 genes and SPDEF (kallikrein 2 and 3, and SAMpointed domain containing ETC transcription factor, respectively). Kallikreingene family encodes for serine proteases, some of which are expressedin prostate at high levels [6, 20], and were known to contribute toprostate cancer [27, 32] and other cancers, including breast cancer [25].An elevated level of the KLK3 protein expression, also known as PSA(prostate-specific antigen), is frequently used as prostate cancer biomarker.

4http://www.geneontology.org5http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

Metric A B C D

se1 3.4657 3.1538 9.2457 3.2146se2 8.577 8.0123 22.6275 8.2486

Table 5: Similarity Measurement between the 31 dataset and thefiles A, B, C, and D

Top 20 Top 20 Linking words fromusing se1 using se2 Hypergraph Mining

NR4A2 NR4A2MYH11 MYH11LARGE LARGE

JUN JUNJUNB PLP1

IGFBP4 JUNBPSMB6 IGFBP4PLP1 PSMB6CD14 CD14FOS FOS

EGR1 FYKLKB1 ADFP

STIL ERBB3SHBG CAV1

TNFRSF1A IFI27FY KLKB1 LARGE

IFI27 STIL FYAQP1 AQP1 PSMB6PLP2 CCL4 CD14CCL4 CD163 CCL4

Table 6: Gene Ranking using Expectation based Similarity Mea-sures and Comparison to Results from Hypergraph Mining (21dataset)

SPDEF acts as an androgen-independent transcriptional activator ofthe PSA promoter [26], and thus is also a very important gene inprostate cancer. Further, KLK3 gene was found to be significantly up-regulated in tumor samples compared with normal tissues in Chandranet al. study [5].

Significant relationships were also found between KLK3 and NKX3-1 (NK3 transcription factor related gene). The latter gene is primar-ily expressed in prostate epithelial cells, and may also contribute toprostate cancer [1, 34]. However, the precise role of NKX3-1 in prostatecancer is not entirely clear at this moment [18, 3]. A detailed studylooking at the co-expression of these androgen-dependent genes, NKX3-1 and KLK3 (PSA), may provide further clues.

4.3 Results from Similarity MeasuresIn this section we describe the results obtained by using our second

tool which is based on expectation based similarity measures. The goalof this set of experiments is to show the effectiveness of the similaritymeasures. We first create four files containing gene names. We namethese four files, A, B, C, and D, respectively. Each of these four filescontains 300 gene names. Our goal was to determine which of thesefour files is most similar to the 21 or 31 dataset.

The file A contains the same 300 gene names as described in Sec-tion 4.1. Each of the files B, C, and D comprised 300 genes randomlychosen form the list of nearly 2600 human genes found in superarray’sDNA micro-array experiment6.

4.3.1 Results from 21 datasetIn this section we describe the effect of using similarity measures

to find the file among A, B, C, and D that is most similar to the 21dataset. Clearly, the way four files were chosen, one would expect thefile A to be most similar to the 21 dataset. However, using a naivesimilarity measure like s1 fails to identify this. As shown in Table 4,naive similarity measure incorrectly indicates that the file B is mostsimilar to the 21 dataset. The reason for this is explained in Table 3. Incase of the file A, out of a total 21 gene pairs that appeared frequently,i.e., exceeding a user-defined threshold, 16 gene pairs had one of thegenes from the 21 dataset and another one from the file A. Whereas,these numbers were 7 and 6, respectively for the file B.6http://www.superarray.com/totalgene human.html

Sixth IEEE Symposium on BionInformatics and BioEngineering (BIBE'06)0-7695-2727-2/06 $20.00 © 2006

Page 5: [IEEE Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06) - Arlington, VA (2006.10.16-2006.10.18)] Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06)

Top 20 using se1 Top 20 using se2

ABCC1 ABCC1AKR1B1 AKR1B1

CASP8AP2 CASP8AP2CDKN1A CDKN1A

KLF5 NQO1CA9 KLF5

CDC42 TMPRSS2LCK RHO

RAD50 CA9NQO1 ESR1RHO PPIA

TMPRSS2 FZD6ESR1 MAPKAPK2

SQSTM1 CD81TOP2B SQSTM1ERCC5 CDC42MDM2 SOX5VDR TOP2BLIG1 KRAS2

CASP10 ERCC5

Table 7: Gene Ranking using Expectation based Similarity Mea-sures for the file C (31 dataset)

This inconsistency above points out two drawbacks of the naivemeasure. First, we might have situation where the total number offrequently occurring gene pairs might be quite small (7 in the case ofthe file B) and still percentage of gene pairs not all coming from thesame file might be high ( 6/7=0.8571), leading to misleading result.Second, it shows that using threshold based measures, i.e., consideringonly those gene pairs exceeding a threshold might not work very well,as the result may be biased based on the choice of threshold.

Using similarity measures based on statistical expectation addressesthe above problems. As can be seen from Table 4, the similarity scorefor the file A, based on expectation based similarity measures (eitherse1 or se2), is much higher than the other files, indicating that the fileA is most similar to the 21 dataset than the files B, C, or D.

4.3.2 Results from 31 datasetResults from the 31 dataset show a very interesting fact. They

demonstrate that our second tool may sometime discover some unex-pected but useful information. Similar to 21 dataset, one would expectthat file A would be most similar to the 31 dataset. On the contrary,the results shown in Table 5 indicate that the file C is actually the mostsimilar to the 31 dataset. This unexpected result makes sense, becauseas we will show later in this section, the top ranking genes from thefile C are actually responsible for cancer.

4.4 Results from Gene RankingWe now describe the results from the use of expectation based sim-

ilarity measure for gene ranking.

4.4.1 Results from 21 datasetAfter finding that the file A was most similar to the 21 dataset, as

shown above, we ranked the genes from the file A in order of decreas-ing similarity to the 21 dataset. The top 20 genes from the file A usingsimilarity measure se1 and se2 are reported in Table 6. As can be seenfrom Table 6, the linking genes found from the hypergraph mining al-gorithm are all included in this top 20 gene list. This indicates that theresults obtained using these two tools are similar to an extent.

4.4.2 Results from 31 datasetAgain, we list in, Table 7, the top 20 genes found from the file C,

using the similarity measures se1 and se2 that were most similar tothe genes from the 31 dataset. Top ranked genes from this list werefound to be related to cancer!! The top four genes from this list wereABCC1, AKR1B1, CASP8AP2, and CDKN1A.

The multi drug resistance-associated protein ABCC1 was identifiedwith small lung cancer in 1992 [24]. AKR1B1 was listed in Androgen-responsive prostate cancer-related genes7. The caspase 8 associated7http://www.superarray.com/gene array product/HTML/HS-031.html

protein 2 (CASP8AP2) gene was reported for having a significant rolein apoptosis and glucocorticoid signaling in leukemia [9]. Role of cellcycle-regulatory CDKN1A in cellular response of colon cancer celllines during treatment and recovery was reported in [2].

These findings provide us with the justification as to why the file C

was found to be most similar to the 31 dataset.

5. CONCLUSIONSThis paper has focused on data mining based exploratory tools to

assist a biologist find interesting relationships from output of a mi-croarray experiment. We have evaluated two such tools using outputfrom two different microarray studies. We showed how many interest-ing observations could be made using our tools, and how the resultsfrom the two tools were similar in many ways.

6. REFERENCES[1] S. A. Abdulkadir. Mechanisms of prostate tumorigenesis: roles of transcription factors nkx3.1 and ehr1. Ann N Y Acad

Sci, 1059:33–40, 2005.[2] P. M. De Angelis, D. H. Svendsrud, K. L. Kravik, and T. Stokke. Cellular response to 5-fluorouracil (5-fu) in

5-fu-resistant colon cancer cell lines during treatment and recovery. Molecular Cancer, 2006.[3] G. Aslam, B. Irer, B. Tuna, K. Yorukoglu, F. Saatcioglu, and I. Celebi. Analysis of nkx3.1 expression prostate cancer

tissues in correlation with clinicopathologic features. Pathol Res Pract, 202:93–98, 2006.[4] Stapley BJ and Benoit G. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names

in medline abstracts. In Pac Symp Biocomput, pages 529–540, 2000.[5] U. R. Chandran, R. Dhir, C. Ma, G. Michalopoulos, M. Becich, and J. Gilbertson. Differences in gene expression in

prostate cancer, normal appearing prostate tissue adjacent to cancer and prostate tissue from cancer free organ donors.BMC Cancer, 5(1):45, 2005.

[6] E. P. Diamandis, G. P. Yousef, L. Y. Luo, A. Magklara, and C. V. Obiezu. The new human kallikrein gene family:implications in carcinogenesis. Trends Endocrinol Metab, 11:54–60, 2000.

[7] Ian Donaldson, Joel Martin, Berry de Bruijn, Cheryl Wolting, Vicki Lay, Brigitte Tuekam, Shudong Zhang, BerivanBaskin, Gary Bader, Katerina Michalickova, Tony Pawson, and Christopher Hogue. Prebind and textomy - mining thebiomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics, 4(1):11,2003.

[8] SS Dwight, MA Harris, K Dolinski, CA Ball, G Binkley, KR Christie, DG Fisk, L Issel-Tarver, M Schroeder,G Sherlock, A Sethuraman, S Weng, D Botstein, and JM Cherry. Saccharomyces genome database (sgd) providessecondary gene annotation using the gene ontology (go). Nucleic Acids Res, 30:69–72, 2002.

[9] C. Flotho, E. Coustan-Smith, D. Pei, S. Iwamoto, G. Song, C. Cheng, C. H. Pui, J. R. Downing, and D. Campana.Genes contributing to minimal residual disease in childhood acute lymphoblastic leukemia: prognostic significance ofcasp8ap2. Blood, 2006.

[10] C friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a natural-language processing system for theextraction of molecular pathways from journal articles. Bioinformatics, 17 Suppl 1::S74–82, 2001.

[11] M. Ganiz, W. M. Pottenger, and C. D. Janneck. Recent advances in literature based discovery. Journal of the AmericanSociety for Information Science and Technology, JASIST (Submitted).

[12] Yu Hao, Xiaoyan Zhu, Minlie Huang, and Ming Li. Discovering patterns to extract protein-protein interactions from theliterature: Part ii. Bioinformatics, 21(15):3294–3300, August 2005.

[13] L. Hirschman, J. C. Park, J. Tsujii, L. Wong, and C. H. Wu. Accomplishments and challenges in literature data miningfor biology. Bioinformatics, 18(12):1553–1561, December 2002.

[14] R. Hoffmann and A. Valencia. A gene network for navigating the literature. Nat Genet, 36(7):664, 2004.[15] Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry. Gene clustering by latent semantic indexing of

medline abstracts. Bioinformatics, 21(1):104–115.[16] Z. Hu, M. Narayanaswamy, K. Ravikumar, K. Vijay-Shanker, and C. Wu. Literature mining and database annotation of

protein phosphorylation using a rule-based system. Bioinformatics, 21(11):2759–2765, 2005.[17] T. K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig. A literature network of human genes for high-throughput

analysis of gene expression. Nat Genet, 28(1):21–28, 2001.[18] C. G. Korkmaz, K. S. Korkmaz, J. Manola, Z. Xi, B. Risberg, H. Danielsen, J. Kung, W. R. Sellers, M. Loda, and

F Saatcioglu. Analysis of androgen regulated homeobox gene nkx3.1 during prostate carcinogenesis. J Urol,172:1134–1139, 2004.

[19] H. Kulbe, N. R. Levinson, F. Balkwill, and J. L. Wilson. The chemokine network in cancer- much more than directingcell movement. Int J Dev Biol, 48:489–496, 2004.

[20] H. Lilja. Biology of prostate-specific antigen. Urology, 67:27–33, 2003.[21] Stephens M, Palakal M, Mukhopadhyay S, Raje R, and Mostafa J. Detecting gene relations from medline abstracts. In

Pac Symp Biocomput.[22] S. Manes, E. Mira, R. Colomer, S. Montero, L. M. Real, C. Gomez-Mouton, S. Jimenez-Baranda, A. Garzon, R. A.

Lacalle, K. Harshman, A. Ruiz, and A. C. Martinez. Ccr5 expression influences the prgression of human breast cancerin a p53-dependent manner. J Exp Med, 198:1381–1389, 2003.

[23] EM Marcotte, I Xenarios, and D Eisenberg. Mining literature for protein-protein interactions. Bioinformatics,17:359–363, 2001.

[24] S. Modok, H. R. Mellor, and R. Callaghan. Modulation of multidrug resistance efflux pump activity to overcomechemoresistance in cancer. Current Opinion in Pharmacology, 2006.

[25] C. V. Obiezu and E. P. Dimandis. Human tissue kallikrein gene family: applications in cancer. Cancer Lett, 224:1–22,2005.

[26] P. Oettgen, E. Finger, Z. Sun, Y. Akbarali, U. Thamrongsak, J. Boltax, F. Grall, A. Dube, A. Weiss, L. Brown,G. Quinn, K. Kas, G. Endress, C. Kunsch, and T. A. Libermann. A novel prostate epithelium-specific ets transcriptionfactor, interacts with the androgen receptor and activates prostate-specific antigen gene expression. J Biol Chem,275:1216–1225, 2000.

[27] A. W. Partin, G. E. Hanks, E. A. Klein, J. W. Moul, W. G Nelson, and H. I. Scher. Prostate specific antigen as a markerof disease activity in prostate cancer. oncology, 16:1218–1224, 2002.

[28] H. Shatkay and R. Feldman. Mining the biomedical literature in the genomic era: an overview. J Comput Biol,10(6):821–855, 2003.

[29] Hagit Shatkay. Hairpins in bookstacks: Information retrieval from biomedical text. Briefings in Bioinformatics,6(3):222–238, 2005.

[30] N. Tiffin, J. F. Kelso, A. R. Powell, H. Pan, V. B. Bajic, and W. A. Hide. Integration of text- and data-mining usingontologies successfully selects disease gene candidates. Nucleic Acids Res, 33(5):1544–1552, 2005.

[31] G. Vaday, D. M. Peehl, P. A. Kadam, and D. M. Lawrence. Expression of ccl5 (rantes) and ccr5 in prostate cancer.Prostate, 66:124–134, 2006.

[32] T. L. Veveris-Lowe, M. G. Lawrence, R. L. Collard, L. Bui, A. C. Herington, D. L. Nicole, and J. A. Clements.Kallikrein 4 (hk4) and prostate specific antigen (psa) are associated with loss of e-cadherin and epithelial-mesenchymaltransition (emt)-like effect in prostate cancer cells. Endocr Relat Cancer, 12:631–643, 2005.

[33] Dennis M. Wilkinson and Bernardo A. Huberman. A method for finding communities of related genes. PNAS,101(suppl-1):5241–5248, 2004.

[34] S. L. Zheng, J. H., B. L. Chang, E. Ortner, J. Sun, S. D. Isaacs, K. E. Wiley, , W. Liu, M. Zemedkun, P. C. Walsh,J. Ferretti, J. Gruschus, W. B. Issacs, E. P. Gelmann, and J. Xu. Germ-line mutation of nkx3.1 cosegregates withhereditary prostate cancer and alters the homeodomain structure and function. Cancer Res, 66:69–77, 2006.

Sixth IEEE Symposium on BionInformatics and BioEngineering (BIBE'06)0-7695-2727-2/06 $20.00 © 2006