
PROJECT FOR COURSE HY566 – SEMANTIC WEB

"A SURVEY ON THE TOPIC OF ONTOLOGY MERGING"

Kyriakos E. Kritikos – PhD Candidate

Miltos Stratakis – Postgraduate Student

Friday, June 13, 2003


Table of Contents

1 Introduction
1.1 Applications of Representation Matching
1.2 Using Ontologies as Representations
1.2.1 Content Explication
1.2.1.1 Single Ontology Approaches
1.2.1.2 Multiple Ontology Approaches
1.2.1.3 Hybrid Approaches
1.2.2 Additional Roles of Ontologies
1.2.2.1 Query Model
1.2.2.2 Verification
1.3 Ontologies in the Semantic Web: The need for Ontology Matching (Integration)
1.4 Terminology and Content Specification of the Report
2 Problems with Ontology Combination
2.1 Mismatches between Ontologies
2.1.1 Language level Mismatches
2.1.2 Ontology level Mismatches
2.2 Ontology Versioning
2.3 Practical problems
3 Current Approaches and Techniques
3.1 Solving Language Mismatches
3.1.1 Superimposed Meta-model
3.1.2 Layered approach to interoperability
3.1.3 Open Knowledge Base Connectivity (OKBC)
3.1.4 OntoMorph
3.2 Ontology Level Integration and User Support
3.2.1 Algebra for Scalable Knowledge Composition
3.2.2 Chimaera
3.2.3 PROMPT
3.2.4 Anchor-PROMPT
3.2.5 FCA-MERGE
3.2.6 Information-Flow-based Ontology Mapping
3.2.7 CAIMAN SYSTEM
3.2.8 HICAL
3.2.9 GLUE
3.2.10 Common Top Level Model
3.3 Versioning
3.3.1 SHOE Ontology Integration
4 Overview of Approaches
5 Conclusion and Remarks
6 References


Table of Figures

Figure 1: Single Ontology Approach
Figure 2: Multiple Ontology Approach
Figure 3: Hybrid Ontology Approach
Figure 4: The resulting framework of issues that are involved in ontology combining
Figure 5: Anchor-PROMPT’s traversal of paths between anchors
Figure 6: FCA-MERGE method
Figure 7: IF-Map scenario for ontology mapping
Figure 8: The IF-Map architecture
Figure 9: Classification of instances between two categories
Figure 10: HICAL main algorithm
Figure 11: The GLUE architecture

Table of Tables

Table 1 – Table of problems and approaches for combined use of ontologies
Table 2 – Table of problems and approaches for combined use of ontologies (continued)


1 Introduction

Representation matching, as defined in [Doan, 2002], is the problem of creating semantic mappings between two data representations. Examples of mappings are “element location of one representation maps to element address of the other”, “contact-phone maps to agent-phone”, and “listed-price maps to price * (1 + tax-rate)”. Representation matching is a fundamental step in numerous data management applications. The manual creation of semantic mappings is extremely labor intensive and has become a key bottleneck hindering the widespread deployment of such applications.
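To make the notion of a mapping concrete, the following minimal sketch encodes the example correspondences above as data and code. It is illustrative only: the element names come from the examples in the text, while the tax rate and record values are invented.

```python
# Illustrative sketch of semantic mappings between two representations.
# Element names follow the examples in the text; TAX_RATE is invented.

TAX_RATE = 0.25  # hypothetical tax rate

# Simple 1:1 correspondences: element of representation A -> element of B.
direct_mappings = {
    "location": "address",
    "contact-phone": "agent-phone",
}

def listed_price(record):
    """Complex mapping: listed-price maps to price * (1 + tax-rate)."""
    return record["price"] * (1 + TAX_RATE)

def translate(record_b):
    """Translate a record of representation B into representation A."""
    record_a = {a: record_b[b] for a, b in direct_mappings.items()}
    record_a["listed-price"] = listed_price(record_b)
    return record_a

print(translate({"address": "Main St 1", "agent-phone": "555-0101", "price": 100.0}))
# -> {'location': 'Main St 1', 'contact-phone': '555-0101', 'listed-price': 125.0}
```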

1.1 Applications of Representation Matching

The key commonalities underlying applications that require semantic mappings are that they use structured representations (e.g., relational schemas, ontologies, and XML DTDs) to encode the data, and that they employ more than one representation. As such, the applications must establish semantic mappings among the representations, either to enable their manipulation (e.g., merging them or computing the differences) or to enable the translation of data and queries across the representations. Many such applications have arisen over time and have been studied actively by the database and AI communities.

One of the earliest such applications is schema integration: merging a set of given schemas into a single global schema. This problem has been studied since the early 1980s. It arises in building a database system that comprises several distinct databases, and in designing the schema of a database from the local schemas supplied by several user groups. The integration process requires establishing semantic mappings between the component schemas.

As databases become widely used, there is a growing need to translate data between multiple databases. This problem arises when organizations consolidate their databases and hence must transfer data from old databases to the new ones. It forms a critical step in data warehousing and data mining, two important research and commercial areas since the early 1990s. In these applications, data coming from multiple sources must be transformed to data conforming to a single target schema, to enable further data analysis.

During the late 1980s and the 1990s, applications of representation matching arose in the context of knowledge base construction, which is studied in the AI community. Knowledge bases store complex types of entities and relationships, using “extended database schemas” called ontologies. As in databases, there is a strong need to build new knowledge bases from several component ones, and to translate data among multiple knowledge bases. These tasks require solving the ontology matching problem: find semantic mappings between the involved ontologies.

In recent years, the explosive growth of information online has given rise to even more application classes that require representation matching. One application class builds data integration systems. Such a system provides users with a uniform query interface to a multitude of data sources. The system provides this interface by enabling users to pose queries against a mediated schema, which is a virtual schema that captures the domain’s salient aspects. To answer queries, the system uses a set of semantic mappings between the mediated schema and the local schemas of the data sources, in order to reformulate a user query into a set of queries on the data sources. Therefore, a critical problem in building a data integration system is to supply the set of semantic mappings between the mediated and source schemas.
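As a rough sketch of this reformulation step (all source names and attribute mappings below are invented; real systems rewrite full queries, not just projections):

```python
# Toy query reformulation: a projection over the mediated schema is
# rewritten into one projection per source via the semantic mappings.

source_mappings = {
    "listings-db": {"location": "address", "listed-price": "price"},
    "homeseekers": {"location": "area", "listed-price": "asking_price"},
}

def reformulate(mediated_attrs):
    """Rewrite mediated-schema attributes into per-source attributes."""
    return {
        source: [mapping[a] for a in mediated_attrs if a in mapping]
        for source, mapping in source_mappings.items()
    }

# A query for location and listed-price over the mediated schema becomes:
print(reformulate(["location", "listed-price"]))
# -> {'listings-db': ['address', 'price'], 'homeseekers': ['area', 'asking_price']}
```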

Another important application class is peer data management, which is a natural extension of data integration. A peer data management system does away with the notion of mediated schema and allows peers (i.e. participating data sources) to query and retrieve data directly from each other. Such querying and data retrieval require the creation of semantic mappings among the peers.

Recently, there has also been considerable attention on model management, which creates tools for easily manipulating models of data (e.g., data representations, website structures and ER diagrams). Here, matching has been shown to be one of the central operations.

The data sharing applications described above arise in numerous real-world domains. Applications in databases have now permeated all areas of our life. Knowledge base applications are often deployed in diverse domains such as medicine, commerce, and the military. These applications also play important roles in emerging domains such as e-commerce, bioinformatics, and ubiquitous computing.

Some recent developments should dramatically increase the need for and the deployment of applications that require mappings. The Internet has brought together millions of data sources and made data sharing among them possible. The widespread adoption of XML as a standard syntax for sharing data has further streamlined and eased the data sharing process. Finally, the vision of the Semantic Web is to publish data (e.g., by marking up webpages) using ontologies, thus making the data available on the Internet even more structured. The growth of the Semantic Web will further fuel data sharing applications and underscore the key role that representation matching plays in their deployment.

Therefore, representation matching is truly pervasive. Variations of this problem have been referred to in the literature as schema matching, ontology matching, ontology alignment, schema reconciliation, mapping discovery, reconciling representations, matching XML DTDs, and finding semantic correspondences.

1.2 Using Ontologies as Representations

Ontologies were originally introduced as an “explicit specification of a conceptualization” [Gruber, 1993]. Therefore, ontologies can be used in an integration task to describe the semantics of the information sources and to make the content explicit (section 1.2.1). With respect to the integration of data sources, they can be used for the identification and association of semantically corresponding information concepts. However, in several projects ontologies take on additional roles. These roles are presented in section 1.2.2.


1.2.1 Content Explication

Ontologies are used in nearly all ontology-based integration approaches for the explicit description of the information source semantics. However, the way in which the ontologies are employed can differ. In general, three different directions can be identified:

1) Single ontology approaches
2) Multiple ontology approaches
3) Hybrid approaches

Figures 1, 2 and 3 give an overview of the three main architectures.

Figure 1: Single Ontology Approach (one global ontology describing all information sources)

Figure 2: Multiple Ontology Approach (one local ontology per information source)


Figure 3: Hybrid Ontology Approach (local ontologies built from a shared vocabulary)

Integration based on a single ontology seems to be the simplest approach, since it can be simulated by the other two approaches. Some approaches provide a general framework in which all three architectures can be implemented (e.g. DWQ [Calvanese et al., 2001]). The following sections give a brief overview of the three main ontology architectures.

1.2.1.1 Single Ontology Approaches

Single ontology approaches use one global ontology providing a shared vocabulary for the specification of the semantics. All information sources are related to one global ontology (Figure 1). A prominent approach of this kind of ontology integration is SIMS [Arens et al., 1996]. The SIMS model of the application domain includes a hierarchical terminological knowledge base. Each source is simply related to the global domain ontology.

The global ontology can also be a combination of several specialized ontologies. A reason for combining several ontologies can be the modularization of a potentially large monolithic ontology. The combination is supported by ontology representation formalisms, i.e. importing other ontology modules (e.g. ONTOLINGUA [Gruber, 1993]).

Single ontology approaches can be applied to integration problems where all information sources to be integrated provide nearly the same view on a domain. But if one information source has a different view on a domain, e.g. by providing another level of granularity, finding the minimal ontology commitment becomes a difficult task. Also, single ontology approaches are susceptible to changes in the information sources, which can affect the conceptualization of the domain represented in the ontology. These disadvantages led to the development of multiple ontology approaches.


1.2.1.2 Multiple Ontology Approaches

In multiple ontology approaches, each information source is described by its own ontology (Figure 2). For example, in OBSERVER [Mena et al., 1996] the semantics of an information source is described by a separate ontology. In principle, a “source ontology” can be a combination of several other ontologies, but it cannot be assumed that the different “source ontologies” share the same vocabulary. The advantage of multiple ontology approaches is that no common and minimal ontology commitment about one global ontology is needed. Each source ontology can be developed without regard to other sources or their ontologies. This ontology architecture can simplify the integration task and supports change, i.e. the addition and removal of sources. On the other hand, the lack of a common vocabulary makes it difficult to compare different source ontologies.

1.2.1.3 Hybrid Approaches

To overcome the drawbacks of the single or multiple ontology approaches, hybrid approaches were developed (Figure 3). Similar to multiple ontology approaches, the semantics of each source is described by its own ontology. But in order to make the local ontologies comparable to each other, they are built from a global shared vocabulary. The shared vocabulary contains basic terms (the primitives) of a domain, which are combined in the local ontologies in order to describe more complex semantics. Sometimes the shared vocabulary is itself an ontology.

In hybrid approaches, the interesting point is how the local ontologies are described. In COIN [Goh, 1997], the local description of an information item, the so-called context, is simply an attribute-value vector. The terms for the context stem from a global domain ontology and from the data itself. In MECOTA [Wache et al., 1999], each source concept is annotated by a label which combines the primitive terms from the shared vocabulary. The combination operators are similar to the operators known from description logics, but are extended, e.g. by an operator which indicates that a piece of information is an aggregation of several separate pieces of information (e.g. a street name with a number). In BUSTER [Stuckenschmidt et al., 2000b], the shared vocabulary is a (general) ontology which covers all possible refinements, e.g. by defining the attribute value ranges of its concepts. A source ontology is one (partial) refinement of the general ontology, e.g. it restricts the value ranges of some attributes. Because the source ontologies only use the vocabulary of the general ontology, they remain comparable.
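The refinement idea behind BUSTER can be sketched as follows. This is a toy version under invented concept names and value ranges; the real system operates on full ontologies:

```python
# Shared vocabulary: a general ontology defining attribute value ranges.
shared_vocabulary = {
    "Building": {
        "height_m": (0, 1000),                             # numeric range
        "usage": {"residential", "office", "industrial"},  # value set
    },
}

# A source ontology is a (partial) refinement restricting those ranges.
source_ontology = {
    "Building": {"height_m": (0, 30), "usage": {"residential"}},
}

def is_refinement(source, shared):
    """Check that every source restriction stays within the shared ranges."""
    for concept, attrs in source.items():
        for attr, restriction in attrs.items():
            general = shared[concept][attr]
            if isinstance(general, tuple):      # numeric range
                lo, hi = restriction
                if not (general[0] <= lo and hi <= general[1]):
                    return False
            elif not restriction <= general:    # enumerated value set
                return False
    return True

print(is_refinement(source_ontology, shared_vocabulary))  # -> True
```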

The advantage of a hybrid approach is that new sources can easily be added without the need to modify the existing ones. It also supports the acquisition and evolution of ontologies. The use of a shared vocabulary makes the source ontologies comparable and avoids the disadvantages of multiple ontology approaches. The drawback of hybrid approaches, however, is that existing ontologies cannot easily be reused; they have to be redeveloped from scratch.


1.2.2 Additional Roles of Ontologies

Some approaches use ontologies not only for content explication, but also as a global query model or for the verification of the (user-defined or system-generated) integration description. In the following sections, these additional roles of ontologies are presented in more detail.

1.2.2.1 Query Model

Integrated information sources normally provide an integrated global view. Some integration approaches use the ontology as the global query schema. For example, in SIMS [Arens et al., 1996] the user formulates a query in terms of the ontology. Then SIMS reformulates the global query into sub-queries for each appropriate source, collects and combines the query results, and returns the results.

Using an ontology as a query model has the advantage that the structure of the query model should be more intuitive for the user, because it corresponds more closely to the user’s view of the domain. But from a database point of view, this ontology only acts as a global query schema. If a user formulates a query, he has to know the structure and the content of the ontology; he cannot formulate the query according to a schema he would prefer. Therefore, it is questionable whether the global ontology is an appropriate query model.

1.2.2.2 Verification

During the integration process, several mappings must be specified from a global schema to the local source schemas. The correctness of such mappings can be greatly improved if the mappings can be verified automatically. A sub-query is correct with respect to a global query if the local sub-query provides a part of the queried answers, i.e. the sub-queries must be contained in the global query (query containment). Because an ontology contains a complete specification of the conceptualization, the mappings can be validated with respect to the ontologies. Query containment then means that the ontology concepts corresponding to the local sub-queries are contained in the ontology concepts related to the global query.

In DWQ [Calvanese et al., 2001], each source is assumed to be a collection of relational tables. Each table is described in terms of its ontology with the help of conjunctive queries. A global query and the decomposed sub-queries can be unfolded to their ontology concepts. The sub-queries are correct, i.e. contained in the global query, if their ontology concepts are subsumed by the global ontology concepts. The PICSEL project [Goasdoue et al., 1999] can also verify the mapping, but in contrast to DWQ it can also generate mapping hypotheses automatically, which are then validated with respect to a global ontology.
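The containment check itself can be pictured with a toy subsumption test. This is only a sketch of the idea (the concept hierarchy is invented; DWQ uses full description-logic reasoning over conjunctive queries):

```python
subclass_of = {  # child -> parent; an invented concept hierarchy
    "ScientificBook": "Book",
    "Book": "Publication",
    "Article": "Publication",
}

def subsumed_by(concept, ancestor):
    """True if `concept` equals or is a descendant of `ancestor`."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = subclass_of.get(concept)
    return False

def verify_mapping(global_concept, sub_query_concepts):
    """All source sub-queries must be contained in the global query."""
    return all(subsumed_by(c, global_concept) for c in sub_query_concepts)

print(verify_mapping("Publication", ["ScientificBook", "Article"]))  # -> True
print(verify_mapping("Book", ["Article"]))                           # -> False
```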


1.3 Ontologies in the Semantic Web: The need for Ontology Matching (Integration)

In the last few years, a lot of effort has been put into the development of techniques that aim at the “Semantic Web”. This envisioned next generation of the World Wide Web will enable computers to use the information on the internet. This means that computers will not only distribute and render the information, but also select, combine, transform and relate data, to support humans in performing tasks. Such a web will be targeted more at problem solving and query answering than at mere information retrieval.

A lot of those newly developed techniques require a formal description of a part of our human environment, i.e., a description of a part of the real world. These descriptions, in various degrees of formality and specificity, are the ontologies that were described above. Nowadays, people are taking the first steps towards the “Semantic Web” by annotating (web) data with standard terminologies and other semantic metadata. This is providing us with a lot of freely accessible domain-specific ontologies. However, to form a real web of semantics (which will allow computers to combine and infer implicit knowledge), those separate ontologies should be linked and related to each other. Adaptation of existing ontologies, and composition of new ontologies from ontologies that are already around, will play a central role in this process, as it will both simplify the specification of new knowledge and leverage the interconnection.

As we said above, in the envisaged distributed context of a Semantic Web, ontologies from different sources will be linked and combined. However, the reuse of existing ontologies is often not possible without considerable effort. The ontologies either need to be integrated, which means that they are merged into one new ontology, or the ontologies can be kept separate. In both cases, the ontologies have to be aligned, which means that they have to be brought into mutual agreement.

Ontology integration consists of (the iteration of) the following steps:
1) Find the places in the ontologies where they overlap
2) Relate concepts that are semantically close via equivalence and subsumption relations (aligning)
3) Check the consistency, coherency and non-redundancy of the result

The process of aligning two ontologies is equivalent to the process of ontology matching. The alignment of concepts between ontologies is especially difficult, because it requires understanding of the meaning of concepts. Aligning two ontologies implies changes to at least one of them. Changes to an ontology will result in a new version of that ontology. If the ontologies are not represented in the same language, a translation is often required.

1.4 Terminology and Content Specification of the Report

In this report, we will use the following terms consistently according to their specified meaning:


combining: Using two or more different ontologies for a task in which their mutual relation is relevant.

merging, integrating: Creating a new ontology from two or more existing ontologies with overlapping parts; the result can be either virtual or physical.

aligning: Bringing two or more ontologies into mutual agreement, making them consistent and coherent.

mapping: Relating similar (according to some metric) concepts or relations from different sources to each other by an equivalence relation. A mapping results in a virtual integration.

articulation: The points of linkage between two aligned ontologies, i.e. the specification of the alignment.

translating: Changing the representation formalism of an ontology while preserving the semantics.

transforming: Changing the semantics of an ontology slightly (possibly also changing the representation) to make it suitable for purposes other than the original one.

version: The result of a change that may exist next to the original.

versioning: A method to keep the relation between newly created ontologies, the existing ones, and the data that conforms to them consistent.

The content of this report is structured as follows. In section 2, we will investigate the problems that arise when ontologies are combined or related, which results in a framework of relevant issues. Next, in section 3, we will use the framework to examine existing approaches for ontology combining. In section 4, we summarize the techniques and give an overview of the different approaches that are used. Finally, in section 5 we will conclude the report and make some remarks based on our observations.


2 Problems with Ontology Combination

The combined use of multiple ontologies is hindered by several problems [Klein, 2001]. In this section, we will investigate and describe them.

The problems that underlie the difficulties in merging and aligning are the mismatches that may exist between separate ontologies. In the next subsection, we will discuss these mismatches. We will then look at the different types of problems involved with versioning and revisioning. Finally, we will discuss some practical problems that come up when one tries to combine ontologies.

In doing so, we will build a framework of the different types of problems that can occur when relating ontologies. This framework can be used when we compare the existing approaches and tools.

2.1 Mismatches between Ontologies

Mismatches between ontologies are the key type of problem that hinders the combined use of independently developed ontologies. We will now explore how ontologies may differ. In the literature, many possible mismatches are mentioned, which are not always easily comparable. To make them more comparable, we try to classify the different types of mismatches and relate them to each other.

As a first step, we will distinguish between two levels at which mismatches may appear:

1) First level: The language or meta-model level. This is the level of the language primitives that are used to specify an ontology. Mismatches at this level are mismatches between the mechanisms to define classes, relations and so on.

2) Second level: The ontology or model level, at which the actual ontology of a domain lives. A mismatch at this level is a difference in the way the domain is modelled.

The distinction between these two levels of differences is made very often. [Kitakami et al., 1996] and [Visser et al., 1997] call these kinds of differences non-semantic and semantic differences, respectively. Others make this distinction implicitly, by only concentrating on one of the two levels. For example, [Wiederhold, 1994] analyses domain differences (i.e. the ontology level), while [Grosso et al., 1998] and [Bowers and Delcambre, 2000] look at language level differences. In the following, we will avoid the use of the words “semantic differences” for ontology level differences, because we reserve those words for a more specific type of difference (which will be described below).

In the next sub-sections, we will give an overview and characterization of different types of mismatches that can appear at each of those two levels.


2.1.1 Language level Mismatches

Mismatches at the language level occur when ontologies written in different ontology languages are combined. [Chalupsky, 2000] defines mismatches in syntax and expressivity. In total, we distinguish four types of mismatches that can occur, although they often coincide:

1) Syntax: Obviously, different ontology languages often use different syntaxes. For example, to define the class of chairs in RDF Schema one uses <rdfs:Class ID="Chair">. In LOOM, the expression (defconcept Chair) is used to define the same class. This difference is probably the simplest kind of mismatch. However, this mismatch often doesn’t come alone, but is coupled with other differences at the language level. A typical example of a “syntax only” mismatch is an ontology language that has several syntactical representations. In this simple case, a rewrite mechanism is sufficient to solve the problem (a toy example follows at the end of this subsection).

2) Logical representation: A slightly more complicated mismatch at this level is the difference in the representation of logical notions. For example, in some languages it is possible to state explicitly that two classes are disjoint (e.g. disjoint A B), whereas in other languages it is necessary to use negation in subclass statements (e.g. A subclass-of (NOT B), B subclass-of (NOT A)). The point here is not whether something can be expressed (the statements are logically equivalent) but which language constructs should be used to express it. Note also that this mismatch is not about the representation of concepts, but about the representation of logical notions. This type of mismatch is still relatively easy to solve, e.g. by giving translation rules from one logical representation to another.

3) Semantics of primitives: A more subtle possible difference at the meta-model level is the semantics of language constructs. Despite the fact that sometimes the same name is used for a language construct in two languages, the semantics may differ (e.g. when there are several interpretations of A equalTo B). We can note that even when two ontologies seem to use the same syntax, the semantics can differ. For example, the OIL RDF Schema syntax interprets multiple <rdfs:domain> statements as the intersection of the arguments, whereas RDF Schema itself uses union semantics.

4) Language expressivity: The mismatch at the meta-model level with the most impact is the difference in expressivity between two languages. This difference implies that some languages are able to express things that are not expressible in other languages. For example, some languages have constructs to express negation and others do not. Other typical differences in expressivity concern the support of lists and sets, default values, etc. This type of mismatch probably has the most impact and is mentioned by several other authors. The “fundamental differences” between knowledge models that are described in [Grosso et al., 1998] are also very close to the interpretation used here.

The above list of differences at the language level can be seen as more or less compatible with the broad term “language heterogeneity” of [Visser et al., 1997].
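For the syntax mismatch of item 1, the rewrite mechanism can be as simple as the following toy translator. It is illustrative only; a real translator would parse both languages properly rather than pattern-match a single form:

```python
import re

def loom_to_rdfs(expr):
    """Rewrite a LOOM class definition (defconcept X) into RDF Schema."""
    match = re.fullmatch(r"\(defconcept\s+(\w+)\)", expr.strip())
    if match is None:
        raise ValueError(f"unsupported expression: {expr}")
    return f'<rdfs:Class ID="{match.group(1)}"/>'

print(loom_to_rdfs("(defconcept Chair)"))  # -> <rdfs:Class ID="Chair"/>
```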


2.1.2 Ontology level Mismatches

Mismatches at the ontology (or model) level happen when two or more ontologies that describe (partly) overlapping domains are combined. These mismatches may occur when the ontologies are written in the same language, as well as when they use different languages. Based on the literature, we can distinguish several types of mismatches at the model level.

[Visser et al., 1997] make a very useful distinction between mismatches in the conceptualization and explication of ontologies.

A conceptualization mismatch is a difference in the way a domain is interpreted (conceptualized), which results in different ontological concepts or different relations between those concepts.

An explication mismatch is a difference in the way the conceptualization is specified.

This distinction can manifest itself in mismatches in definitions, mismatches in terms and combinations of both. [Visser et al., 1997] list all the combinations. Four of them are related to homonym terms and synonym terms.

[Wiederhold, 1994] also mentions the problems with synonym terms (called naming differences) and homonym terms (subjective meaning). Besides that, he describes possible differences in the scope of concepts, which is an example of a conceptual mismatch. Finally, he mentions value encoding differences, for example, differences in the currency of prices.

[Chalupsky, 2000] lists four types of mismatches in ontologies. One of them, inference system bias, is not considered a real mismatch, but rather a reason for modelling style differences. The other three mismatches (modelling conventions, coverage and granularity, and paradigms) can be categorized as instances of the two main mismatch types of Visser et al. We will describe them in a slightly altered way below.

Now, we are going to relate the different types of mismatches that are distinguished by the authors cited above. The first two mismatches at the model level that we distinguished are instances of the conceptualization mismatches of Visser et al. These are semantic differences, i.e. not only the specification, but also the conceptualization of the domain is different in the ontologies that are involved.

Scope: Two classes seem to represent the same concept, but do not have exactly the same instances, although these intersect. The standard example is the class “employee”: several administrations use slightly different concepts of employee. In [Visser et al., 1997], this is called a class mismatch and is worked out further into detailed descriptions at class- or relation-level.

Model coverage and granularity: This is a mismatch in the part of the domain that is covered by the ontology, or the level of detail to which that domain is modelled. Chalupsky (2000) gives the example of an ontology about cars: one ontology might model cars but not trucks. Another one might represent trucks but only classify them into a few categories, while a third one might make very fine-grained distinctions between types of trucks based on their general physical structure, weight, purpose, etc.

Conceptualization differences as described above cannot be solved automatically, but require knowledge and decisions of a domain expert. In the second case, the mismatch is often not a problem, but a motive to use different ontologies together. In that case, the remaining problem is to align the overlapping parts of the ontologies.

The other ontology-level mismatches can be categorized as explication mismatches, in the terminology of Visser et al. The first two of them result from explicit choices of the modeller about the style of modelling:

Paradigm: Different paradigms can be used to represent concepts such as time, action, plans, causality, propositional attitudes, etc. For example, one model might use temporal representations based on interval logic while another might use a representation based on points. The use of different “top-level” ontologies is also an example of this kind of mismatch.

Concept description: This type of difference is called modelling conventions in [Chalupsky, 2000]. Several choices can be made for the modelling of concepts in the ontology. For example, a distinction between two classes can be modelled using a qualifying attribute or by introducing a separate class. These choices are sometimes influenced by the intended inference system. Another choice in concept descriptions is the way in which the is-a hierarchy is built; distinctions between features can be made higher or lower in the hierarchy. For example, we can consider the place where the distinction between scientific and non-scientific publications is made: a dissertation can be modelled as dissertation < book < scientific publication < publication, or as dissertation < scientific book < book < publication, or even as a subclass of both book and scientific publication.

Further, the next two types of differences can be classified as terminological mismatches.

Synonym terms: Concepts are represented by different names. A trivial example is the use of the term “car” in one ontology and the term “automobile” in another ontology. This type of problem is called a term mismatch (T or TD) in [Visser et al., 1997]. A special case of this problem arises when the natural languages in which the ontologies are described differ. Although the technical solution to this type of problem seems relatively simple (the use of thesauri; see the sketch after the next item), the integration of ontologies with synonyms or different languages usually requires a lot of human effort and comes with several semantic problems. In particular, one must be careful not to overlook a scope difference.

Homonym terms: The meaning of a term is different in another context. For example, the term “conductor” has a different meaning in a music domain than in an electrical engineering domain. Visser et al. (1997) call this a concept mismatch (C or CD). This inconsistency is much harder to handle; human knowledge is required to resolve the ambiguity.
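A minimal sketch of the thesaurus idea from the synonym item above. The synonym table is invented; a real system would consult a resource such as WordNet, and the homonym problem just described still requires a human judge:

```python
# Invented thesaurus: each entry is an unordered pair of synonym terms.
synonyms = {
    frozenset({"car", "automobile"}),
    frozenset({"employee", "worker"}),
}

def possibly_same_concept(term_a, term_b):
    """True if the terms are identical or listed as synonyms."""
    return term_a == term_b or frozenset({term_a, term_b}) in synonyms

print(possibly_same_concept("car", "automobile"))  # -> True, a candidate match
# Note: "conductor" == "conductor" would also pass, yet may be a homonym.
```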

Finally, there is one trivial type of difference left.

Encoding: Values in the ontologies may be encoded in different formats. For example, a date may be represented as “dd/mm/yyyy” or as “mm-dd-yy”, distance may be described in miles or in kilometres, etc. There are many mismatches of this type, but they are all very easy to solve. In most cases, a transformation step or wrapper is sufficient to eliminate all those differences, as sketched below.
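For illustration, such a transformation step might look like this. The formats and the conversion follow the examples in the text; error handling is omitted:

```python
from datetime import datetime

def convert_date(value, src="%d/%m/%Y", dst="%m-%d-%y"):
    """Re-encode a date string, e.g. "dd/mm/yyyy" -> "mm-dd-yy"."""
    return datetime.strptime(value, src).strftime(dst)

def miles_to_km(miles):
    """Re-encode a distance value from miles to kilometres."""
    return miles * 1.609344

print(convert_date("13/06/2003"))  # -> 06-13-03
print(round(miles_to_km(10), 2))   # -> 16.09
```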

2.2 Ontology Versioning

The problems listed above are mismatches between ontologies. Most projects and approaches focus on solving these mismatches. However, mismatches are not the only problems that have to be solved when one wants to use several ontologies together for one task.

As changes to ontologies are inevitable in an open domain, it becomes very important to keep track of the changes and of their impact on everything that depends on the ontology. It is often not practically possible to synchronize the changes of an ontology with the revisions of the applications and data sources that use it. Therefore, a versioning method is needed to handle revisions of ontologies and their impact on existing sources. In some sense, the versioning problem can also be regarded as a derivative of ontology combination, because it results from changes (possibly required by combination tasks) to individual ontologies.

Ontology versioning covers several aspects. Although the problem is introduced by subsequent changes to one specific ontology, the most important problems are caused by the dependencies on that ontology. Therefore, it is useful to distinguish the aspects of ontology versioning. A versioning scheme should take care of the following aspects:

1) The relation between succeeding revisions of one ontology
2) The relation between the ontology and its dependencies:
a. instance data that conforms to the ontology
b. other ontologies that are built from, or import, the ontology
c. applications that use the ontology

The central question that a versioning scheme answers is: how can existing ontologies be reused in new situations without invalidating the existing ones? A versioning scheme provides ways to disambiguate the interpretation of concepts for users of the ontology revisions, and it makes the compatibility of the revisions explicit. Consequently, the following requirements can be imposed on a versioning scheme, in increasing order of difficulty:

For every use of a concept or a relation, a versioning framework should provide an unambiguous reference to the intended definition (identification)

A versioning framework should make the relation of one version of a concept or relation to other versions of that construct explicit (change tracking)


A versioning framework should as far as possible automatically perform conversions from one version to another, to enable transparent access (transparent translating).
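To make the three requirements concrete, here is a minimal sketch. All names, URIs and the conversion rule are invented for illustration:

```python
# Identification: every use of a concept carries a versioned reference.
concept_ref = "http://example.org/onto/v2#Employee"

# Change tracking: relations between versions of a construct are explicit.
change_log = [
    {"concept": "Employee", "from": "v1", "to": "v2",
     "change": "narrowed scope: contractors are no longer included"},
]

# Transparent translating: registered converters map instances across versions.
def employee_v1_to_v2(instance):
    """Convert a v1 Employee instance to v2 (drop contractors)."""
    return None if instance.get("contractor") else instance

converters = {("Employee", "v1", "v2"): employee_v1_to_v2}

convert = converters[("Employee", "v1", "v2")]
print(convert({"name": "A. Smith", "contractor": False}))  # -> kept in v2
print(convert({"name": "B. Jones", "contractor": True}))   # -> None (dropped)
```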

We will examine the current approaches with respect to these requirements.

2.3 Practical problems

Besides the technical problems that were presented in the previous sections, there are also practical problems that hinder the easy use of combined ontologies. Aligning and merging ontologies, the central aspect of ontology combining, is a complicated process and requires serious effort from the ontology designers. Until now, this task has mostly been done by hand, which makes it difficult to manage in two respects:

1) It is difficult to find the terms that need to be aligned
2) The consequences of a specific mapping (unforeseen implications) are difficult to see

Because it is unrealistic to hope that merging or alignment at the semantic level can be performed completely automatically, we should take these practical aspects into consideration.

Another practical problem is the repeatability of merges. Most often, the sources that are used for the merging continue to evolve. The alignments that are created for the merging should be as reusable as possible for the merging of the revised ontologies. This issue is very important in the context of ontology maintenance. The repeatability could, for example, be achieved by an executable specification of the alignment.

Summarizing the previous sections, we can construct the framework of issues that is depicted in Figure 4.

Figure 4: The resulting framework of issues that are involved in ontology combining


3 Current Approaches and Techniques

In this section, we will use the above framework to examine several tools and techniques that are aimed at ontology combining. We begin at the top of the framework, looking at techniques for solving language mismatches. Then, we will look at techniques for solving ontology level mismatches and at user support. Of course, it is not possible to make a strict distinction between the types of problems that a technique solves, because some tools or techniques provide support for several types of problems. The place where we mention them therefore does not imply a classification, but serves as a guideline only.

3.1 Solving Language Mismatches

There are several approaches for solving the problem of integrating ontologies that are written in different knowledge representation languages. Some of them are just techniques; others also provide some kind of automated support. These approaches are presented in the following sub-sections.

3.1.1 Superimposed Meta-model

Bowers and Delcambre (2000) describe an approach to transforming information from one representation to another. Their focus is on model-based information where the information representation scheme provides structural modelling constructs (analogous to a data model in a database). For example, the XML model includes elements, attributes, and permits elements to be nested. Similarly, RDF models information through resources and properties. The goal of their work is to enable the user to apply tools of interest to the information at hand. Their approach is to represent information for a wide variety of model-based applications in a uniform way and to provide a mapping formalism that can easily transform information from one representation to another.

To achieve this, the ontology languages (i.e. the specific constructs in the language that are used to describe an ontology) are each represented in a meta-model, a so-called “superimposed” model. These superimposed models are represented as RDF triples. Mappings are then specified using production rules. The rules are defined over the triples of the RDF representation for superimposed information. Since a triple is a simple predicate, mapping rules can be specified using a logic-based language such as Prolog, which allows both specifying and implementing the mappings. There is no requirement that the mappings between superimposed layers be complete, since only part of a model or schema may be needed while using a specific tool.
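A rough Python rendering of such triple-level mapping rules follows. The paper specifies them in a logic-based language such as Prolog; the predicate names and the XML-to-RDFS rules below are invented for illustration:

```python
# Source superimposed model, serialized as (subject, predicate, object) triples.
source_triples = [
    ("Chair", "rdf:type", "xml:Element"),
    ("Chair", "xml:hasAttribute", "legs"),
]

def map_triples(triples):
    """Production rules: rewrite source-model triples into target-model ones."""
    out = []
    for s, p, o in triples:
        if p == "rdf:type" and o == "xml:Element":
            out.append((s, "rdf:type", "rdfs:Class"))   # XML element -> class
        elif p == "xml:hasAttribute":
            out.append((o, "rdfs:domain", s))           # attribute -> property
    return out

print(map_triples(source_triples))
# -> [('Chair', 'rdf:type', 'rdfs:Class'), ('legs', 'rdfs:domain', 'Chair')]
```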

If superimposed information from a source language is mapped to the target language, it is possible to convert data from the source layer into data that conforms to the target layer. Although the focus is on conversion, it is also possible to perform integration between superimposed layers. Integration goes a step further by combining the source and target data. The mapping rules can be used to provide integration at both the schema and instance levels.

The “superimposed model approach” provides mechanisms to solve language level mismatches of syntax, representation and semantics. The mappings between the language constructs have to be specified manually. The semantic resolution of mismatches at the ontology level is not covered by this approach.

3.1.2 Layered approach to interoperability

Melnik and Decker (2000) introduce some initial ideas targeted at facilitating data interoperation using a layered approach. Their approach resembles the layering principles used in internetworking. To harness the complexity of data interoperation, Melnik and Decker suggest viewing Web-enabled information models as a series of layers:

The syntax layer: This layer is responsible for “dumping down” object-oriented information into document instances and byte streams

The object or frame layer: Its basic function is to provide applications with an object-oriented view of their domain

The semantic or knowledge representation layer: Deals with conceptual modelling and knowledge engineering tasks.

Each layer has a number of sub-layers, each of which corresponds to a specific data modelling feature (e.g. aggregation or reification) that can be logically implemented in different ways.

We can observe that a clean separation between different layers may ease the achievement of interoperability. The first two layers that the authors distinguish map nicely onto the first two types of mismatches that were presented in Section 2.1. However, all other types of mismatches that we distinguish are subsumed under the “semantic layer”. Therefore, the layering as described in its initial version only solves some of the language level mismatches.

As the authors also have noticed, considering data models in a layered fashion is a contemporary approach. For example, OIL is presented as an extension to RDF Schema, and Euzenat (2001) investigates the characteristics of interoperability of knowledge representations at various levels.

3.1.3 Open Knowledge Base Connectivity (OKBC)

The Open Knowledge Base Connectivity [Chaudhri et al., 1998] is a generic interface to Knowledge Representation Systems (KRS). An Application Programming Interface (API) specifies the operations that an application program can use to access a system. When specifying this API for knowledge representation systems, some assumptions are made about the representation used by the KRS. These assumptions are made explicit in the OKBC knowledge model.


A specific knowledge representation (or ontology) language can be bound to OKBC by defining a mapping from the OKBC knowledge model to the specific language. The users of the ontology are then isolated from the peculiarities of the specific language and can use the OKBC model. The interoperability achieved by using OKBC is at the level of the OKBC knowledge model.

OKBC can thus solve some mismatches at the language level. However, semantic differences that go beyond representation cannot be solved by providing mappings to the OKBC knowledge model. More generally, when using a mapping to a common knowledge model, notions that require a higher level of expressivity than that model provides will be lost.

3.1.4 OntoMorph

OntoMorph [Chalupsky, 2000] is a transformation system for symbolic knowledge. It facilitates ontology merging and the rapid generation of knowledge base translators. It combines two mechanisms to describe knowledge base transformations:

1) Syntactic rewriting via pattern-directed rewrite rules that allow the concise specification of sentence-level transformations based on pattern matching

2) Semantic rewriting which modulates syntactic rewriting via (partial) semantic models and logical inference via an integrated KRS.

The integration of these mechanisms allows transformations to be based on any mixture of syntactic and semantic criteria. The OntoMorph architecture facilitates incremental development and scripted replay of transformations, which is particularly important during merging operations.
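As a toy illustration of mechanism 1, consider the following pattern-directed rewrite rule. OntoMorph's real rules operate on s-expressions of the knowledge base; this regex-based version only conveys the flavour:

```python
import re

# Rule: rename the concept "automobile" to "car" in defconcept forms.
rules = [
    (re.compile(r"\(defconcept\s+automobile\b"), "(defconcept car"),
]

def rewrite(sentence):
    """Apply each rewrite rule in order to a knowledge-base sentence."""
    for pattern, replacement in rules:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(rewrite("(defconcept automobile :is-primitive vehicle)"))
# -> (defconcept car :is-primitive vehicle)
```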

OntoMorph focuses on transformations of individual ontologies that are needed to align two or more ontologies. This is a small but important step in the process of merging ontologies. In fact, step 2 of the ontology merging process in Section 1.3 is split into three sub-steps:

2a) Design transformations to bring the sources into mutual agreement
2b) Edit or morph the sources to carry out the transformations
2c) Take the union of the morphed sources

OntoMorph facilitates step 2b by transforming the ontologies into a common format with common names, common syntax, uniform modelling assumptions, etc.

Step 2a, the design of the transformations, involves understanding of the meaning of the representation, and is therefore a less automatable task. Additionally, this step often involves human negotiation to reconcile competing views on how a particular modelling problem should be solved.

OntoMorph is able to solve several problems at the language level of the ontology mismatch framework. Of course, a difference in expressivity between two languages is not solvable as such; translating to a less expressive language implies loss of knowledge. Solutions for ontology level problems can also be formulated in OntoMorph. Because OntoMorph requires a clear and executable specification of the transformation, the process can be repeated with modified versions of the original ontologies.

EVALUATION: OntoMorph has already been successfully applied in a couple of domains. One involved the translation of military scenario information for a plan critiquing system. In the second, it formed the core of an agent translation service called Rosetta, where it was used to translate messages between two communicating planning agents that used different representations of goals.

3.2 Ontology Level Integration and User Support

In the previous section, we observed that OntoMorph also provides mechanisms and support for some model level integration. We will now look at a transformation system that allows the specification and execution of transformations of individual ontologies. Then, we will present two tools that assist the user in the complicated task of performing a merge. Finally, we will closely examine techniques that try to automate the mapping/merging of two ontologies. The first is a method complementary to one of the above tools. Then, two methods will be exhibited that use formal mathematical foundations to automate the alignment task. In the end, three different techniques will be presented that use machine learning for automatically aligning two ontologies. For every tool or method presented here, we also include an evaluation with respect to the task of merging/alignment.

3.2.1 Algebra for Scalable Knowledge Composition

The Scalable Knowledge Composition (SKC) project developed an algebra for ontology composition. This algebra is used in the ONION system (ONtology compositION), described in [Mitra et al., 2000]. The current work in the SKC project is not solely in the area of ontology combination, but in the broader field of integrating heterogeneous data sources.

The algebra operates on ontologies that are represented by nodes and arcs (terms and relationships) in a directed labelled graph. Each algebraic operator takes as input a graph of semi-structured data and transforms it to another graph. This guarantees that the algebra is composable. The algebra operations are themselves knowledge driven, using articulation rules. The rules can be both logical rules (e.g. semantic implication between terms across ontologies) and functional rules (e.g. dealing with conversion between terms across ontologies). The composition rules are partly suggested by expert and lexical knowledge.

Intersection is the most crucial operation, since it identifies the terms where linkage occurs among the domains, which is called “the articulation”. An articulation ontology contains the terms from the source ontologies that are related, together with their relation, and can be seen as a specification of an alignment. This separate specification facilitates repeated executions of the composition.

When we relate this to the framework in Figure 4, we can observe that the algebra allows the specification of solutions to several conceptual and terminological mismatches. Via the functional rules, term synonyms and encoding problems can be eliminated. The logical articulation rules provide a means to solve mismatches in scope, coverage and homonym terms. By using expert and lexical knowledge to suggest articulations, the system also addresses the practical problem of finding alignments.

One of the main advantages of using an algebra for combination is reusability. The unified ontology is not a physical entity, but a term denoting the result of applying the articulation rules. This approach ensures minimal coupling between the sources, so that they can be developed and maintained independently.
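
To make the algebraic view concrete, the following Python sketch shows an intersection-style articulation between two small term graphs, driven by declarative articulation rules. The graph encoding and the rule set are hypothetical illustrations, not ONION's actual data structures or API.

    # Illustrative sketch: an intersection-style articulation between two
    # term graphs, driven by declarative articulation rules (not ONION's API).

    ontology1 = {("Car", "is_a", "Vehicle"), ("Truck", "is_a", "Vehicle")}
    ontology2 = {("Automobile", "is_a", "Transport")}

    # Logical articulation rules relating terms across the two ontologies.
    articulation_rules = [("Car", "equivalent", "Automobile"),
                          ("Vehicle", "subclass_of", "Transport")]

    def articulate(o1, o2, rules):
        """Build the articulation ontology: the linked terms plus their relations."""
        terms1 = {t for edge in o1 for t in (edge[0], edge[2])}
        terms2 = {t for edge in o2 for t in (edge[0], edge[2])}
        return {(a, rel, b) for a, rel, b in rules if a in terms1 and b in terms2}

    # Both rules link a term of ontology1 to a term of ontology2,
    # so both end up in the articulation ontology.
    print(articulate(ontology1, ontology2, articulation_rules))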

3.2.2 Chimaera

Chimaera [McGuinness et al., 2000] is an ontology merging and diagnosis tool developed by the Stanford University Knowledge Systems Laboratory (KSL). Its initial design goal was to provide substantial assistance with the task of merging KBs produced by multiple authors in multiple settings. Later, it took on another goal of supporting testing and diagnosing ontologies as well. Finally, inherent in the goals of supporting merging and diagnosis are requirements for ontology browsing and editing. It is mainly targeted at lightweight ontologies.

The two major tasks in merging ontologies that Chimaera supports are:

1) Coalescing two semantically identical terms from different ontologies, so that they are referred to by the same name in the resulting ontology
2) Identifying terms that should be related by subsumption, disjointness or instance relationships, and providing support for introducing those relationships

There are many auxiliary tasks inherent in these, such as identifying the locations for editing, performing the edits, and identifying when two terms could be identical if they had small modifications, such as a further specialization on a value-type constraint.

Chimaera generates name-resolution lists that help the user in the merging task by suggesting terms, each from a different ontology, that are candidates to be merged or to have taxonomic relationships not yet included in the merged ontology. It also generates a taxonomy-resolution list, suggesting taxonomy areas that are candidates for reorganization. It uses a number of heuristic strategies for finding such edit points.
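
The published description does not spell out the heuristics themselves, so the following Python sketch merely illustrates the kind of lexical normalization on which such a name-resolution list could be built (assumed heuristics: case folding, hyphen/underscore removal and naive de-pluralization).

    # Illustrative lexical heuristics for candidate merge pairs, in the style
    # of a name-resolution list (the real Chimaera heuristics are not published).

    def normalize(term):
        """Normalize a term: lowercase, drop hyphens/underscores and a plural 's'."""
        t = term.lower().replace("-", "").replace("_", "")
        return t[:-1] if t.endswith("s") else t

    def name_resolution_list(terms1, terms2):
        """Suggest pairs (one term from each ontology) that normalize identically."""
        index = {normalize(t): t for t in terms2}
        return [(t, index[normalize(t)]) for t in terms1 if normalize(t) in index]

    print(name_resolution_list(["Sports-Car", "Boat"], ["sports_cars", "Plane"]))
    # -> [('Sports-Car', 'sports_cars')]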

Finally, Chimaera also has diagnostic support for verifying, validating and critiquing ontologies. Those functions only include domain independent tests that showed value in previous experiments.

We can observe that Chimaera can be used to solve mismatches at the terminological level. It is also able to find some similar concepts that have a different description at the model level. Further, Chimaera seems to do a great job in helping the user find possible edit points. The diagnostic functions are difficult to evaluate, because their description is very brief.

EVALUATION: A set of experiments was conducted, designed to produce a measure of the performance of Chimaera. In each experiment the focus was on: (1) coalescing ontology names, (2) performing taxonomic edits, (3) identifying ontology edit points, and (4) testing overall effectiveness in a substantial merging task. The results produced were also compared to the performance of other tools that a KB developer might reasonably use to do such a merge, absent the KB merging tool itself. At each stage of the experiment, the goal was to control for as many factors as possible and to ensure that the experimental settings corresponded closely to the settings in which the tool would actually be used.

The primary results of these experiments are the following:

• Merging two or more substantial ontologies was essentially not doable in a time-effective manner using a text-editing tool, primarily because of the difficulty of examining the taxonomy of any non-trivial ontology using only a text editor.

• Chimaera is approximately 3.46 times faster than an ontology editing tool (Ontolingua) for merging substantial taxonomies. Moreover, for the portion of the taxonomy-merging task for which Chimaera’s name-resolution heuristics apply, Chimaera is approximately 14 times faster than an ontology editing tool (Ontolingua).

• Almost all of the operations performed during a taxonomy merge would be more error-prone if they were performed using an ontology editing tool (Ontolingua), and the critical “merge class” operations would be extremely error-prone if performed using a KB editing tool.

3.2.3 PROMPT

PROMPT (formerly known as SMART) is an interactive ontology-merging tool [Noy and Musen, 2000]. It guides the user through the merging process, making suggestions, determining conflicts and proposing conflict-resolution strategies. PROMPT starts with linguistic-similarity matches of frame names for the initial comparison, but concentrates on finding clues based on the structure of the ontology and the user's actions. After the user selects an operation to perform, PROMPT determines the conflicts in the merged ontology that the operation has caused and proposes possible solutions. It then considers the structure of the ontology around the arguments to the latest operations (i.e. relations among the arguments and other concepts in the ontology) and proposes other operations that the user should perform.

In the PROMPT project, a set of knowledge-base operations for ontology merging or alignment is identified. For each operation in this set, the following are defined:

1) The changes that PROMPT performs automatically
2) The new suggestions that PROMPT presents to the user
3) The conflicts that the operation may introduce and that the user needs to resolve

When the user invokes an operation, PROMPT creates members of these three sets based on the arguments to the specific invocation of the operation. The conflicts that may appear in the merged ontology as a result of these operations are:

• Name conflicts
• Dangling references
• Redundancy in the class hierarchy
• Inconsistencies

PROMPT not only points to the places where changes should be made, but also presents a list of actions to the user.

Summarizing, PROMPT gives iterative suggestions for concept merges and changes, based on linguistic and structural knowledge, and it points the user to possible effects of these changes.
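
The following Python sketch illustrates the shape of this interactive cycle for one conflict type (dangling references). The data structures and the operation format are invented for illustration and are not PROMPT's actual API.

    # Shape of PROMPT's operation-then-conflict cycle for one conflict type
    # (invented data structures, not the tool's actual API).

    def detect_dangling(ontology):
        """Dangling references: slots whose range class is not defined."""
        classes = set(ontology["classes"])
        return [s for s in ontology["slots"] if s[2] not in classes]

    def merge_step(ontology, operation):
        """Execute one user-chosen operation, then report the conflicts it caused."""
        operation(ontology)
        return detect_dangling(ontology)

    # Example: removing class 'Vehicle' leaves the 'drives' slot dangling,
    # which a PROMPT-style tool would report together with proposed fixes.
    onto = {"classes": ["Person", "Vehicle"],
            "slots": [("Person", "drives", "Vehicle")]}
    conflicts = merge_step(onto, lambda o: o["classes"].remove("Vehicle"))
    print(conflicts)  # -> [('Person', 'drives', 'Vehicle')]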

EVALUATION: PROMPT has been evaluated formatively. The quality of its suggestions and its utility were measured and it was compared with another ontology-merging tool. Three controlled experiments were performed, in which human experts merged two ontologies using Protégé-2000 with PROMPT, generic Protégé-2000 and Chimaera.

The two source ontologies were the same for all the experiments: (1) the ontology for the unified problem-solving method development language ([Gennari et al., 1998]; [Fensel et al., 1999]) and (2) the ontology for the method-description language [Gennari et al., 1998]. Both ontologies describe reusable problem-solving methods, and merging them was a real-life task. The source ontologies contained a total of 134 class and slot frames, and the resulting merged ontology had 117 frames.

The results concerning PROMPT's suggestions were impressive: the human experts followed 90% of PROMPT's suggestions and 75% of the conflict-resolution strategies that PROMPT proposed. During the merging process, PROMPT suggested 74% of the total knowledge-base operations that the users invoked.

Concerning PROMPT versus generic Protégé-2000, the merged ontologies were quite similar: there was only one difference in class hierarchy, and a number of minor differences in slot names and their types. The user who was using generic Protégé-2000 was able to find all the classes that should have been merged. However, using the generic Protégé-2000 required performing and explicitly specifying 60 knowledge-base operations. Since PROMPT generated and suggested most of the necessary knowledge-base operations, the PROMPT user needed to specify explicitly only 16 operations.

Concerning PROMPT versus Chimaera, PROMPT produced 30% more correct suggestions than Chimaera did, and Chimaera's suggestions constituted a proper subset of PROMPT's. 20% of Chimaera's correct suggestions were roughly the same as PROMPT's; the remaining 80% were much less specific than the corresponding PROMPT suggestions: Chimaera pointed to the class in the ontology where the action was required, whereas PROMPT also suggested the specific action (or alternative actions) required for that frame.

3.2.4 Anchor-PROMPT

The semi-automated approaches to ontology merging that exist today, such as PROMPT and Chimaera, analyze only local context in the ontology structure: given two similar classes, the algorithms consider classes and slots that are directly connected to the classes in question. In contrast, Anchor-PROMPT [Noy and Musen, 2001] uses a set of heuristics to analyze non-local context. The goal of Anchor-PROMPT is not to provide a complete solution to automated ontology merging, but rather to augment existing methods, like PROMPT and Chimaera, by determining additional possible points of similarity between ontologies.

Anchor-PROMPT takes as input a set of pairs of related terms – anchors – from the source ontologies. Either the user identifies the anchors manually, or the system generates them automatically by using lexical processing techniques and/or by taking into account the ontologies' structure. From this set of previously identified anchors, Anchor-PROMPT produces a set of new pairs of semantically close terms. To do that, Anchor-PROMPT traverses the paths between the anchors in the corresponding ontologies. A path follows the links between classes defined by hierarchical relations or by slots and their domains and ranges. The algorithm then compares the terms along these paths to find similar terms.

Figure 5: Anchor-PROMPT’s traversal of paths between anchors

For example, suppose we identify two pairs of anchors: classes A and B and classes H and G (Figure 5). Figure 5 shows one path from A to H in the first ontology and one path from B to G in the second ontology. We traverse the two paths in parallel, incrementing the similarity score between each two classes that we reach in the same step. In this example, the similarity score between the classes C and D and between the classes E and F is incremented. The process is repeated for all the existing paths that originate and terminate in the anchor points, cumulatively aggregating the similarity score. The key point behind Anchor-PROMPT is that if two pairs of terms are similar and there are paths connecting the terms, then the elements in those paths are similar as well.
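
A minimal Python sketch of this parallel traversal is given below. For simplicity it assumes the anchor-to-anchor paths are already enumerated and of equal length, and it omits equivalence groups; the real algorithm enumerates the paths itself.

    # Minimal sketch of Anchor-PROMPT's similarity scoring along parallel paths
    # (simplified: paths are given explicitly; equivalence groups are omitted).

    from collections import defaultdict
    from itertools import product

    def score_paths(paths1, paths2, anchors):
        """Increment similarity for terms met at the same step on anchor-to-anchor paths."""
        scores = defaultdict(int)
        for p1, p2 in product(paths1, paths2):
            # Only compare paths that start and end in anchor pairs.
            if (p1[0], p2[0]) in anchors and (p1[-1], p2[-1]) in anchors:
                for t1, t2 in zip(p1[1:-1], p2[1:-1]):
                    scores[(t1, t2)] += 1
        return scores

    # The example of Figure 5: anchors (A, B) and (H, G).
    anchors = {("A", "B"), ("H", "G")}
    paths1 = [["A", "C", "E", "H"]]   # a path from anchor A to anchor H
    paths2 = [["B", "D", "F", "G"]]   # a path from anchor B to anchor G
    print(dict(score_paths(paths1, paths2, anchors)))
    # -> {('C', 'D'): 1, ('E', 'F'): 1}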

Consequently, the terms that often appear in the same positions on the paths going from one pair of anchors to another will get the highest score and will be regarded as similar. A special remark must be made about how Anchor-PROMPT treats subclass-superclass links in relation to the other kinds of slots. The algorithm joins the terms linked by the subclass-superclass relation into equivalence groups, which are treated as single nodes. The set of incoming edges for an equivalence-group node is the union of the sets of incoming edges of the group's elements; similarly, the set of outgoing edges is the union of the sets of outgoing edges of its elements. The similarity score between a node and an equivalence-group node may be treated differently, as will be seen in the next paragraph. Last but not least, it should be mentioned that all ontologies are first mapped to the OKBC model.

EVALUATION: A formative evaluation of Anchor-PROMPT was performed by testing it on a pair of ontologies that were developed independently by different groups of researchers but covered similar subject domains. This pair of ontologies was imported from the DAML ontology library [DAML, 2001] and contained:

1. An ontology for describing individuals, computer-science academic departments, universities, and activities that occur at them, developed at the University of Maryland (UMD), and

2. An ontology for describing employees in academic institutions, publications, and relationships among research groups and projects developed at Carnegie Mellon University (CMU).

In the experiments performed, the set of anchor pairs given as input to the algorithm was varied, as were several parameters such as the maximum path length, the maximum size of equivalence groups, and the constants in the similarity-score computation.

The results of these experiments revealed three key points. The first was that if the maximum equivalence-group size is less than 2, the experiments produce many empty result sets. The second concerned the similarity score for equivalence-group members: if the classes that shared their position with other members of an equivalence group received only 1/3 of the score, this differentiation improved the correctness of the results by 14%. The third and most crucial point concerned the maximum length L of a path (the number of slots in the path). If L was 0, the results were equivalent to those produced by Chimaera; if L was 1, the results were equivalent to PROMPT's. If L was 2, the result precision was maximized and reached 100%. With L=3 the average result precision dropped to 61%, and with L=4 the precision went up slightly to 65%. These results were expected, because very long paths are unlikely to produce accurate results, whereas very short paths consistently produce extremely small but extremely precise result sets.

Although Anchor-PROMPT is a very promising method for 1-1 alignment of concepts, it has some drawbacks, which we analyze below:

• It does not discard false matches. These false matches include unrelated terms that appear in identical positions in one pair of paths. However, such false matches can be removed by applying a lower bound on the similarity score and discarding the pairs of terms with a score below that bound.

• The approach does not work well when the source ontologies are constructed differently. More specifically, when one of the source ontologies is a deep one with many interlinked classes and the other is a shallow one, where the hierarchy has few levels, most of the slots are attached to top-level concepts and few notions are reified, the results are no different from the results produced by the approaches that consider only local context.

3.2.5 FCA-MERGE

The FCA-MERGE [Stumme and Maedche, 2001] method for ontology merging is a bottom-up approach which offers a global structural description of the merging process. Its mechanism is based on application-specific instances of two given ontologies O1 and O2 that are to be merged. The overall process of merging two ontologies is depicted in figure 6 and consists of three main steps, namely:

i. Instance extraction from a given set of domain-specific text documents by applying natural language processing techniques, together with the computation of two formal contexts K1 and K2 based on the extracted instances.

ii. The FCA-MERGE core algorithm that derives a common context and computes a concept lattice by applying mathematically founded techniques taken from Formal Concept Analysis ([Wille, 1982]; [Ganter and Wille, 1999]).

iii. The generation of the final merged ontology based on the concept lattice with the help of an ontology engineer.

Figure 6: FCA-MERGE method

This ontology-merging method takes as input the two ontologies and a set D of natural language documents. The documents may be taken from the target application which requires the final merged ontology. From the documents in D, instances are extracted by applying linguistic techniques. An information-extraction-based approach for ontology-based extraction is used, implemented on top of SMES (Saarbrucken Message Extraction System), a shallow text processor for German [Neumann et al, 1997]. The architecture of SMES comprises a tokenizer based on regular expressions, a lexical analysis component including a word lexicon and a domain lexicon, and a chunk parser. The tokenizer scans the text to identify the boundaries of words and complex expressions and to expand abbreviations. The lexicon contains stem entries and subcategorization frames describing information used for lexical analysis and chunk parsing. Furthermore, the domain-specific part of the lexicon contains lexical entries that express natural language representations of concepts and relations. Lexical analysis uses the lexicon to perform: (i) morphological analysis, (ii) recognition of named entities, (iii) part-of-speech tagging, and (iv) retrieval of domain-specific information. Step (iv) associates single words or complex expressions with a concept from the ontology if a corresponding entry exists in the domain-specific part of the lexicon.

The extraction of instances from documents is necessary because there are usually no instances that are already classified by both ontologies. If such instances do exist, however, one can skip the first step and use their classification directly as input for the two formal contexts.

The second step of the ontology-merging process is the FCA-MERGE core algorithm. This algorithm merges the two contexts and computes a pruned concept lattice from the merged context using FCA techniques. The pruned concept lattice has the same degree of detail as the two source ontologies. The computation of the pruned concept lattice is performed with the TITANIC algorithm [Stumme et al, 2000]. Compared to other algorithms for computing concept lattices, TITANIC has the advantage that it computes the formal concepts via their key sets (or minimal generators). Key sets serve two purposes. Firstly, they indicate whether the generated formal concept gives rise to a new concept in the target ontology or not: a concept is new if and only if it has no key sets of cardinality one. Secondly, the key sets of cardinality two or more can be used as generic names for new concepts, and they indicate the arity of new relations.
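
For illustration, the following Python sketch derives the formal concepts of a tiny binary context by brute force; a formal concept is a pair (extent, intent) that is closed under both derivation operators. TITANIC computes the same lattice far more efficiently via key sets, so this is only a conceptual aid, with an invented tourism-style context.

    # Brute-force derivation of the formal concepts of a small binary context
    # (TITANIC computes the same lattice far more efficiently via key sets).

    from itertools import combinations

    # Context: which instances (objects) exhibit which concepts (attributes).
    context = {"doc1": {"Hotel", "Accommodation"},
               "doc2": {"Hotel", "Accommodation", "Spa"},
               "doc3": {"Campsite", "Accommodation"}}

    objects = set(context)
    attributes = set.union(*context.values())

    def intent(objs):
        """Attributes shared by all objects in objs."""
        return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

    def extent(attrs):
        """Objects having all attributes in attrs."""
        return {o for o in objects if attrs <= context[o]}

    # A formal concept is a pair (extent, intent) closed under both derivations.
    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(sorted(objects), r):
            a = intent(set(objs))
            concepts.add((frozenset(extent(a)), frozenset(a)))
    for ext, att in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(ext), "<->", sorted(att))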

Instance extraction and the FCA-MERGE core algorithm are fully automatic. The final step of deriving the merged ontology from the concept lattice requires human interaction. Based on the pruned concept lattice and the sets of relation names R1 and R2, the ontology engineer creates the concepts and relations of the target ontology. More specifically, each formal concept of the pruned concept lattice is a candidate for a concept, a relation or a new subsumption in the target ontology. In addition, there are a number of queries which may be used to focus on the most relevant parts of the pruned concept lattice. Last but not least, graphical support from the ontology-engineering environment OntoEdit is offered for this final step of the ontology-merging method.

For obtaining good results, a few assumptions have to be met by the input data. Firstly, the documents have to be relevant to each of the source ontologies: a document from which no instance is extracted for some source ontology can be discarded. Secondly, the documents have to cover all concepts from the source ontologies; concepts which are not covered have to be treated manually after the merging procedure (or the set of documents has to be expanded). Last but not least, the documents must separate the concepts well enough. If two concepts which are considered as different always appear in the same documents, FCA-MERGE will map them to the same concept in the target ontology (unless this decision is overruled by the knowledge engineer). When this situation appears too often, the knowledge engineer might want to add documents which further separate the concepts.

EVALUATION: No evaluation of the FCA-MERGE method has been conducted so far. However, it is planned to use FCA-MERGE to generate independently a set of merged ontologies that will be compared using standard information-retrieval measures, which will allow the performance of FCA-MERGE to be evaluated. It is worth noting, though, that FCA-MERGE was tested in general experiments based on tourism ontologies that had been modeled in an ontology-engineering seminar. These ontologies contained approximately 50-60 concepts and 20-30 relations, and the results of the alignment were quite promising.

3.2.6 Information-Flow-based Ontology Mapping

The IF-Map [Kalfoglou and Schorlemmer, 2002] approach and method for ontology mapping and merging exhibits the following core characteristics: it uses formal definitions of ontology mapping; it uses Information Flow theory; it is expressed in a declarative and executable language in a domain- and tool-independent manner; it is applied both as a method and as a theory for ontology mapping; and it is, under certain circumstances, fully automatic.

Figure 7 illustrates the approach to ontology mapping. In particular, the focus is on the use of IF (Information Flow) as the underpinning mathematical foundation for establishing mappings between two ontologies; these mappings are later formalized in terms of logic infomorphisms. The solid rectangular lines surrounding Reference Ontology, Local Ontology 1 and Local Ontology 2 denote existing ontologies. It is assumed that Local Ontology 1 and Local Ontology 2 are ontologies used by different communities and populated with their instances, while Reference Ontology is an agreed understanding that favors knowledge sharing and is not supposed to be populated. The dashed rectangular line surrounding Global Ontology denotes an ontology that does not exist yet, but will be constructed “on the fly” for the purpose of alignment. The solid arrow lines linking Reference Ontology with Local Ontology 1 and Local Ontology 2 denote information flow between these ontologies, formalized as logic infomorphisms. The dashed arrow lines denote the embedding of Local Ontology 1 and Local Ontology 2 into Global Ontology. The latter is the sum of the local ontologies modulo the Reference Ontology and the generated logic infomorphisms. This extension constitutes the merging part of IF-Map, which is not yet implemented.

Figure 7: IF-Map scenario for ontology mapping

The IF-Map process, as depicted in figure 8, is a step-wise process that consists of four major steps: (a) ontology harvesting, (b) translation, (c) infomorphisms, and (d) display of results. In the ontology-harvesting step, ontology acquisition is performed. Ontologies are acquired by a variety of methods: existing ontologies are used; they are downloaded from ontology libraries (like the Ontolingua [Farquhar et al., 1997] or WebOnto [Domingue, 1998] servers); they are edited in ontology editors (like Protégé); or they are harvested from the Web. As a result of this versatile acquisition step, various ontology-language formats, ranging from KIF [Genesereth and Fikes, 1992] and Ontolingua to OCML [Motta, 1999], RDF, Prolog and native Protégé knowledge bases, have to be dealt with.

Figure 8: The IF-Map architecture

The latter motivates the second step of the IF-Map process, that of translation. Because the IF-Map method is specified in Horn logic and executed with the aid of a Prolog engine, the above formats are partially translated to Prolog clauses, taking into account only the constructs of class taxonomy, relations and representative instances of classes. These translator programs are either written in-house or, whenever possible, public translators are used.

The next step in the IF-Map process is the main mapping mechanism: the IF-Map method. This step finds logic infomorphisms, if any, between the two ontologies under examination, and in the last step of the process these are displayed in RDF format. This last step involves translating back from Prolog to RDF triples with the aid of an intermediary Java layer, where the RDF is produced using the Jena RDF API [McBride, 2001]. In addition, the results of the mapping are stored in a knowledge base for future reference and maintenance.
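
The infomorphism condition itself can be stated operationally: a pair of maps between two classifications is an infomorphism when classification is preserved in both directions. The following Python toy check illustrates this fundamental property; the classifications and maps are invented examples, not IF-Map code.

    # Checking the fundamental infomorphism condition between two classifications
    # (a toy check of the Information Flow property, not the IF-Map implementation):
    # f_down(token_b) is classified by type_a in A  iff  token_b is classified
    # by f_up(type_a) in B.

    def is_infomorphism(A, B, f_up, f_down):
        """A and B map tokens to sets of types; f_up maps A-types to B-types,
        f_down maps B-tokens to A-tokens."""
        a_types = {t for types in A.values() for t in types}
        return all((type_a in A[f_down(tok_b)]) == (f_up(type_a) in B[tok_b])
                   for tok_b in B
                   for type_a in a_types)

    # Reference classification A and local classification B (invented data).
    A = {"inst1": {"Researcher"}, "inst2": set()}
    B = {"yannis": {"Lecturer"}, "marco": {"Student"}}
    f_up = {"Researcher": "Lecturer"}.get                # type map
    f_down = {"yannis": "inst1", "marco": "inst2"}.get   # token map
    print(is_infomorphism(A, B, f_up, f_down))  # -> True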

Although the whole IF-Map process can be fully automatic, there is a significant drawback that might turn it into a semi-automatic process. One step of the IF-Map method enforces an arity-compatibility check to ensure that the local instances of ontologies are mapped onto appropriate reference types. However, when automating this step, one must be aware of undesired assignments that occur when the prospective relations to be mapped share the same types but do not have the same semantics. This problem is tackled in two ways: (a) a partial map of concepts from one ontology to concepts of the other is given, or (b) some representative instances from the Local ontologies are classified against their Reference counterparts. The first way is fully automated through a set of heuristics that work in a purely syntactic fashion, using the is-a hierarchy and type checking to find types which are shared by relations in both ontologies; if it fails, the second way must be used, which requires the knowledge engineer to classify representative instances manually.

EVALUATION: A large-scale experiment was conducted in the context of the AKT project. The setting of the experiment is as follows: in the AKT project, five participating universities contribute their own ontologies representing their important concepts in the domain of computer-science departments in the UK. There is also a reference ontology, AKT Reference, which was built collaboratively by interested participants from all five sites. The local ontologies were populated, while the reference ontology was not.

IF-Map was applied to map AKT Reference to Southampton's and Edinburgh's local ontologies. These local ontologies are populated with a few thousand instances (ranging from 5k to 18k) and a few hundred concepts; a few axioms are defined, and both have relations. The AKT Reference ontology is more compact: it has no instances and approximately 65 concepts with 45 relations, with a few axioms defined as well. The result of the mapping was a number of automatically generated concept and relation mappings.

The algorithm implemented so far is of exponential complexity in the number of concepts. This can be improved by basing the IF-Map method on an incremental construction of ontology morphisms, which would make large-scale ontologies tractable: first, only certain manageable fragments of the ontologies are mapped, and then these fixed maps are used to guide the generation of larger fragments. The automatic identification of such fragments is currently under investigation.

3.2.7 CAIMAN SYSTEM

Each member of a Community of Interest organizes his documents according to his own categorization scheme (ontology). CAIMAN [Lacher and Groh, 2001] exploits this personal ontology, which is essentially the user's perspective on a domain, to retrieve and publish relevant documents. For each concept node in the personal ontology, a corresponding node in the community ontology is identified. Documents that are only assigned to a node in the community ontology can then be proposed to the user, and vice versa. The system's approach to mapping the concept nodes of two ontologies onto each other is based on an implicit extensional representation of the ontologies: each concept node is implicitly represented by the documents assigned to it. Using machine learning techniques for text classification, a measure of the probability that two concepts correspond is calculated for each concept node in the personal ontology.

Ontology mapping in CAIMAN is performed in the following way: for each concept node in ontology graph A, a corresponding node in ontology graph B has to be found. A probability measure is calculated that, for a node Ai in the personal ontology, node Bj in the community ontology is the corresponding node; the node Bj with the maximum probability measure wins. The algorithm for finding the best-suited Bj for a specific Ai is as follows:

1) For all nodes Bj ∈ B, calculate p(Ai, Bj).
2) Find the nodes Bj and Bk with the two highest probability measures.
3) If d = p(Ai, Bj) - p(Ai, Bk) is above a certain threshold, then Ai and Bj are the corresponding nodes.
4) Otherwise, calculate the probability difference of the parent nodes of Bj and Bk, until the difference is above the threshold.
5) If there is no parent node for one of the B nodes, or they have a common parent, then the decision is left to the user.
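
A Python sketch of this decision procedure is given below. The probability function p is assumed to be given (in CAIMAN it comes from text-classification similarity, as described next), and the reading that a near-tie is resolved by mapping to the sufficiently separated ancestor is our interpretation of the published description.

    # Sketch of the CAIMAN node-mapping decision (assumed probability function;
    # the real system derives p from text-classification similarity).

    def best_match(a_node, b_nodes, p, parent, threshold=0.1):
        """Pick the community node for a_node, walking up parents on near-ties."""
        ranked = sorted(b_nodes, key=lambda b: p(a_node, b), reverse=True)
        bj, bk = ranked[0], ranked[1]
        while p(a_node, bj) - p(a_node, bk) < threshold:
            bj, bk = parent.get(bj), parent.get(bk)
            if bj is None or bk is None or bj == bk:
                return None          # decision is left to the user
        return bj

    # Invented example: B1 and B2 are nearly tied, so their parents decide.
    p_table = {("Ai", "B1"): 0.62, ("Ai", "B2"): 0.58,
               ("Ai", "P1"): 0.70, ("Ai", "P2"): 0.20}
    parent = {"B1": "P1", "B2": "P2"}
    p = lambda a, b: p_table.get((a, b), 0.0)
    print(best_match("Ai", ["B1", "B2"], p, parent))  # -> 'P1'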

Currently, the mapping does not consider the graph structure of the personal ontology, but this will be included in future versions of the CAIMAN system.

The calculation of the probability measure p(Ai, Bj) employs machine learning techniques for text categorization. A representative feature vector is calculated for each concept node in an ontology, and the similarity of two such class vectors is measured by a simple cosine measure. The representative feature vector of a concept node is calculated as a modified Rocchio centroid vector; in other words, it represents an average of all documents assigned to that concept node. Before text classification is performed, the feature vectors have to be extracted from the documents and weighted. In CAIMAN, the BOW toolkit (http://www.cs.cmu.edu/~mcallum/bow/) is used to perform stop-word removal as well as subsequent feature stemming. A word-count feature vector is generated and the features are weighted with a TF/IDF (Term Frequency/Inverse Document Frequency) weighting scheme. The current implementation of the CAIMAN system considers a flat list of classes for classification; hierarchical feature selection and classification are planned, as presented in [Chakrabarti et al., 1998].
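
The following Python sketch illustrates this extensional similarity computation: a centroid word-count vector per concept node and the cosine measure between centroids. It deliberately omits IDF weighting, stop-word removal and stemming, so it is a simplification of the actual Rocchio/TF-IDF pipeline.

    # Sketch of the extensional similarity: word-count centroid vectors per
    # concept node and cosine similarity between them (simplified Rocchio,
    # without IDF weighting, stop-word removal or stemming).

    import math
    from collections import Counter

    def centroid(documents):
        """Average word-count vector of the documents assigned to a concept node."""
        total = Counter()
        for doc in documents:
            total.update(doc.lower().split())
        return {w: c / len(documents) for w, c in total.items()}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    ci = centroid(["semantic web ontologies", "ontologies for the web"])
    cj = centroid(["web ontology languages"])
    print(round(cosine(ci, cj), 3))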

EVALUATION: The CAIMAN system is currently at the prototype stage. First tests of the quality of the concept mapping have been conducted, but no thorough evaluation of CAIMAN's services has been performed. The first test was conducted with a collection of bookmarked pages that had been categorized in the Open Directory Project (http://www.dmoz.org). For this test, a stemming algorithm and a naive Bayes classifier were used to calculate the node mapping. The results were unsatisfactory, because the web pages the bookmarks pointed to were mostly title pages with links to a number of additional pages that held the actual documents. The second test is performed on two types of testbeds. The first consists of user bookmarks that link to documents in PDF or PS format as the user repository. For the community repository, there are two testbeds: RESEARCHINDEX (http://www.researchindex.com) and a self-implemented community repository managed by the Community Items Tool [Koch and Lacher, 2000]. A number of experiments showed that CAIMAN was able to identify corresponding nodes in two ontologies that had been assembled by the same person. CAIMAN is currently being evaluated for the case of more essentially different ontologies, as well as for a larger number of users.

3.2.8 HICAL

The rapid advances in computer technology have allowed us to archive much more information than ever before. To maintain such archives, hierarchical categorization is often used, by which information is categorized according to a concept hierarchy. The hierarchical organization is usually maintained by a single organization or individual for consistency. In practice, one concept hierarchy is not suitable for multiple purposes, so concept hierarchies have to be constructed for individual application domains; in other words, concept hierarchies are maintained in each domain separately. This makes it inconvenient to reuse information, because concept categorization differs between concept hierarchies. HICAL [Ichise et al., 2001] solves this problem by introducing a machine learning method that allows a concept in one concept hierarchy to be aligned with a concept in another concept hierarchy. In this way, information that is classified under a concept in concept hierarchy C1 but is not classified in concept hierarchy C2 will be classified in C2 if and only if there is an alignment rule that associates the concept in C1 with a concept in C2. No change to concept hierarchy C2 is needed for this to happen.

To find similar categories, the degree of similarity between two categorization criteria is determined using the 'k-statistic' method [Fleiss, 1973], an established method of evaluating the similarity between two criteria. It works as follows. Suppose there are two categorization criteria, S1 and S2. Because we can decide whether a particular instance belongs to a particular category or not, the instances are divided into four classes, as shown in figure 9.

Figure 9: Classification of instances between two categories

For example, N11 denotes the number of instances which belong to both categories S1 and S2. If S1 and S2 have the same criterion of categorization, then N12 and N21 become close to zero. On the other hand, if S1 and S2 have a different categorization criterion, then N11 and N22 become close to zero. The ‘k-statistic’ method utilizes the above principle to determine the similarity of the categorization criteria.

The relationship between two categorization criteria is examined from 'top' to 'bottom'. The alignment algorithm is shown in figure 10. First, the most general categories in the two concept hierarchies are compared using the 'k-statistic'. If the comparison confirms that the two categories are similar, the algorithm outputs an alignment rule for them. At the same time, the algorithm pairs one of the two similar categories with a 'child' category of the other; this new pair is evaluated recursively using the 'k-statistic' method. Each output alignment rule is applied to decide whether a particular instance in C1 fits the concept hierarchy in C2.
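
The text does not reproduce the 'k-statistic' formula. Assuming the standard kappa statistic over the 2x2 table of figure 9, a Python version of the similarity computation looks as follows (the instance data are invented for illustration):

    # The k-statistic (Cohen's kappa) over the 2x2 instance table of figure 9
    # (assumed formula; the instance sets are invented examples).

    def kappa(n11, n12, n21, n22):
        """Agreement between two categorization criteria beyond chance."""
        n = n11 + n12 + n21 + n22
        po = (n11 + n22) / n
        pe = ((n11 + n12) * (n11 + n21) + (n21 + n22) * (n12 + n22)) / (n * n)
        return (po - pe) / (1 - pe) if pe != 1 else 0.0

    def counts(cat1, cat2, instances):
        """2x2 table: membership of shared instances in two categories."""
        n11 = len(cat1 & cat2); n12 = len(cat1 - cat2)
        n21 = len(cat2 - cat1); n22 = len(instances - cat1 - cat2)
        return n11, n12, n21, n22

    cat1 = {"url1", "url2", "url3"}            # category in hierarchy C1
    cat2 = {"url1", "url2", "url4"}            # category in hierarchy C2
    shared = {f"url{i}" for i in range(1, 9)}  # instances common to both directories
    print(round(kappa(*counts(cat1, cat2, shared)), 2))  # -> 0.47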

EVALUATION: In order to evaluate the HICAL algorithm, three experiments were conducted using the Yahoo! Japan [Yahoo! Japan, 2000] and LYCOS Japan [LYCOS Japan, 2000] directories as concept hierarchies, and the links (URLs) in each directory as information instances. The Yahoo! directory contains approximately 41000 categories and 224000 URLs; LYCOS contains approximately 5700 categories and 48000 URLs. Approximately 25000 URLs are common to both. Generally speaking, as a concept hierarchy Yahoo! contains more knowledge than LYCOS; however, half of the URLs in LYCOS are not contained in Yahoo!. This demonstrates that even a concept hierarchy containing an enormous amount of information does not cover all information. In this study, three category pairs (and their sub-categories) were used as the two concept hierarchies. The locations in Yahoo! and LYCOS are as follows:

Yahoo!: Arts / Humanities / LiteratureLYCOS: Arts / Literature

Yahoo!: Business and Economy / CompaniesLYCOS: Business Industry / Company

Yahoo!: RecreationLYCOS: Hobby Sports

Figure 10: HICAL main algorithm

Ten-fold cross validation over the shared instances was conducted: the shared instances were divided into 10 data sets, 9 of which were used for training and the remaining one for testing, and ten experiments were conducted for each data set. The significance level for the 'k-statistic' method was set at 5%. The results showed that 80% of the instances used in the experiments, with the exception of the company domain, were categorized correctly by HICAL; in the company domain, 60% of the instances were categorized appropriately. Another finding was that the Yahoo-to-Lycos alignment was more accurate than the inverse operation. This can be explained by the fact that the learned rules of the Yahoo-to-Lycos alignment are likely to involve transferring instances from relatively complex categories to relatively general categories, a trend likely to yield relatively accurate results.

The main drawback of the HICAL algorithm is that it does not work in cases where the more general and more specific categories within a concept hierarchy are independent. The reason is that HICAL works in a 'top-down' manner. This problem will be addressed in future versions of the HICAL method, where a 'bottom-up' approach will also be considered.

3.2.9 GLUE

The GLUE [Doan et al., 2002] system applies machine learning techniques to semi-automatically create semantic mappings between concepts of two ontologies. First of all, well-founded notions of semantic similarity are described based on the joint probability distribution of the concepts involved. Instead of committing to a particular definition of similarity, GLUE calculates the joint distribution of the concepts, and lets the application use the joint distribution to compute any suitable similarity measure. Hence, such notions make this approach applicable to a broad range of ontology-matching problems that employ different similarity measures. Secondly, multi-strategy learning is used for finding the joint probability distribution, and thus the similarity value of any concept pair in two given taxonomies. The GLUE system utilizes many different types of information to maximize matching accuracy such as an instance name, an instance’s value format or the word frequencies in an instance’s value. Multi-strategy learning also makes this system extensible to additional learners, as they become available.

Finally, GLUE attempts to exploit available domain constraints and general heuristics in order to improve the matching accuracy. For the above reason, a unified approach was developed based on relaxation labeling, a powerful technique used extensively in the vision and image processing community and successfully adopted to solve matching and classification problems in natural language processing and hypertext classification.

Figure 11: The GLUE architecture

The basic GLUE architecture is shown in figure 11. It consists of three main modules: the Distribution Estimator, the Similarity Estimator and the Relaxation Labeler. The Distribution Estimator takes as input two taxonomies O1 and O2, together with their data instances, and applies machine learning techniques to compute, for every pair of concepts A ∈ O1 and B ∈ O2, their joint probability distribution. This joint distribution consists of four numbers: P(A,B), P(A,¬B), P(¬A,B) and P(¬A,¬B). Thus a total of 4|O1||O2| numbers is computed, where |Oi| is the number of nodes in taxonomy Oi. The Distribution Estimator uses a set of base learners and a meta-learner. GLUE feeds these numbers into the Similarity Estimator, which applies a user-supplied similarity function to compute a similarity value for each pair of concepts. The output of this module is a similarity matrix between the concepts in the two taxonomies. The Relaxation Labeler module then takes the similarity matrix, together with domain-specific constraints and heuristic knowledge, and searches for the mapping configuration that best satisfies the domain constraints and the common knowledge, taking into account the observed similarities. This mapping configuration is the output of GLUE.
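
As an illustration of the distribution-based similarity, the following Python sketch computes the four joint probabilities for two concepts directly from instance sets and derives from them a Jaccard-style measure, P(A,B) / (P(A,B) + P(A,¬B) + P(¬A,B)), one similarity function that can be plugged in. GLUE itself estimates these probabilities with learned classifiers rather than from shared instances, so this is only a conceptual aid.

    # Joint distribution of two concepts from their instance sets, and a
    # Jaccard-style similarity derived from it (illustrative; GLUE estimates
    # these probabilities with learned classifiers, not with shared instances).

    def joint_distribution(inst_a, inst_b, universe):
        n = len(universe)
        p_ab        = len(inst_a & inst_b) / n
        p_a_notb    = len(inst_a - inst_b) / n
        p_nota_b    = len(inst_b - inst_a) / n
        p_nota_notb = len(universe - inst_a - inst_b) / n
        return p_ab, p_a_notb, p_nota_b, p_nota_notb

    def jaccard_sim(dist):
        """P(A and B) / P(A or B), one similarity measure GLUE can plug in."""
        p_ab, p_a_notb, p_nota_b, _ = dist
        denom = p_ab + p_a_notb + p_nota_b
        return p_ab / denom if denom else 0.0

    a = {"course1", "course2", "course3"}
    b = {"course2", "course3", "course4"}
    u = {f"course{i}" for i in range(1, 7)}
    print(round(jaccard_sim(joint_distribution(a, b, u)), 2))  # -> 0.5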

The GLUE system interprets an ontology as a taxonomy of concepts, a set of attributes for each concept and a set of relations between concepts. However, it discovers only semantic 1-1 mappings between two taxonomies; it does not deal with special kinds of mappings, such as 'name maps to the concatenation of first-name and last-name', or a concept mapping to a union of concepts. Extending matching to attributes and relations and considering more complex types of mappings is the subject of ongoing research. Apart from the 1-1 semantic mapping of concepts, GLUE can place a concept A into an appropriate taxonomy T, and it can use different similarity measures for different concepts. In addition, it makes heavy use of the fact that data instances are associated with the ontologies being matched.

Although the accuracy of GLUE is quite impressive, some limitations prevent it from obtaining even higher accuracy. These limitations are listed below:

• Some nodes cannot be matched because of insufficient training data. While there is no general solution to this problem, in many cases it can be mitigated by adding base learners that exploit domain characteristics to improve matching accuracy.

• The Relaxation Labeler performs local optimizations that sometimes converge to local maxima, thereby failing to find correct mappings for all concepts. The challenge here is to develop search techniques that take a more 'global' perspective while retaining the runtime efficiency of local optimization.

• The two base learners used in the implementation of the system are rather simple general-purpose text classifiers. Using learners that perform domain-specific feature selection and comparison could also improve the accuracy.

• Some nodes cannot be matched automatically because they are simply ambiguous. A solution to this problem is to incorporate user interaction into the matching process.

• GLUE currently tries to predict the best match for every node in the taxonomy. However, in some cases such a match simply does not exist. GLUE should be extended to detect such cases.

EVALUATION: GLUE was evaluated on three domains. The domains Course Catalog I and II describe courses at Cornell University and the University of Washington. The taxonomies of Course Catalog I have 34 – 39 nodes, and are fairly similar to each other. The taxonomies of Course Catalog II are much larger (166 – 176 nodes) and much less similar to each other. Courses are organized into schools and colleges, then into departments and centres within each college. The Company Profile domain uses ontologies from Yahoo.com and TheStandard.com and describes the current business status of companies. Companies are organized into sectors, then into industries within each sector. In each domain, two taxonomies were downloaded. The entire set of data instances was downloaded for each taxonomy, and some trivial data cleaning was performed such as removing HTML tags and phrases such as ‘course not offered’ from the instances. Instances of size less than 130 bytes were also removed, because they tend to be empty or vacuous, and thus do not contribute to the matching process. Then all nodes with fewer than 5 instances were removed, because such nodes cannot be matched reliably due to lack of data.

For each domain, two experiments were performed. In each experiment, GLUE was applied to find the mappings from one taxonomy to the other. The matching accuracy of a taxonomy is the percentage of the manual mappings (for that taxonomy) that GLUE predicted correctly. Each experiment was performed four times, under four different scenarios representing the accuracy produced by: (1) the name learner alone, (2) the content learner alone, (3) the meta-learner using the previous two learners, and (4) the relaxation labeller on top of the meta-learner (i.e., the complete GLUE system).

The results show that GLUE achieves high accuracy across all three domains, ranging from 66 to 97%. In contrast, the best matching results of the base learners, achieved by the content learner, are only 52-83%. It is interesting that the name learner achieves very low accuracy, 12-15%, in four out of six scenarios. This is because all instances of a concept, say B, have very similar full names. Hence, when the name learner for a concept A is applied to B, it classifies all instances of B as A or as not A; in cases where this classification is incorrect, which might be quite often, using the name learner alone leads to poor estimates of the joint distributions. The poor performance of the name learner underscores the importance of data instances and multi-strategy learning in ontology matching. The results clearly show the utility of the meta-learner and relaxation labeller. Even though in half of the cases the meta-learner only minimally improves the accuracy, in the other half it makes substantial gains, between 6 and 15%. And in all but one case, the relaxation labeller further improves accuracy by 3-18%, confirming that it is able to exploit the domain constraints and general heuristics; in one case (from Standard to Yahoo), the relaxation labeller decreased accuracy by 2%. In the current experiments, GLUE utilized on average only 30 to 90 data instances per leaf node. The high accuracy in these experiments suggests that GLUE can work well with only a modest amount of data.

3.2.10 Common Top Level Model

A different approach for enabling model level interoperability is the use of a common top level ontology. One of the projects that take this approach is ABC [Brickley et al., 1999], a common conceptual model to facilitate interoperability among application metadata vocabularies.

ABC aims at the interoperability of multiple metadata packages that may be associated with and across resources. These packages are by nature not semantically distinct, but overlap and relate to each other in numerous ways. It exploits the fact that many entities and relationships (e.g. people, places, creations, organizations, events, certain relationships and the like) are so frequently encountered that they do not fall clearly into the domain of any particular metadata vocabulary but apply across all of them. ABC is an attempt to:

• Formally define these underlying common entities and relationships
• Describe them and their inter-relationships in a simple logical model
• Provide the framework for extending these common semantics to domain- and application-specific metadata vocabularies

A comparable approach (although more general) is the IEEE Standard Upper Ontology (IEEE SUO Working Group). This standard will specify the semantics of a general-purpose upper-level ontology, limited to the upper level: it provides definitions for general-purpose terms and a structure for compliant lower-level domain ontologies. It is estimated to contain between 1000 and 2500 terms plus roughly ten definitional statements for each term, and it is intended to provide the foundation for ontologies of much larger size and more specific scope.

These approaches can solve some interoperability problems, but they require a manual mapping of the source ontologies to the common ontology.

3.3 Versioning

Only one technique was found that provides support for ontology-versioning problems. Of course, there is a lot of experience with these kinds of problems in the areas of software engineering and databases, but it is not yet clear whether this experience can be directly applied to web-based ontologies.

3.3.1 SHOE Ontology Integration

SHOE [Heflin and Hendler, 2000] is an HTML-based ontology language. The language includes techniques for combining and integrating different ontologies; in particular, SHOE provides a rule mechanism to align ontologies, so that common items between ontologies can be mapped by inference rules. First, terminological differences can be mapped using simple if-and-only-if rules. Second, scope differences require mapping a category to the most specific category in the other domain that subsumes it. Third, some encoding differences can be handled by mapping individual values. Not all encodings can be mapped in SHOE; for example, arithmetic functions would be needed to map meters to feet.
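
The following Python sketch illustrates, in SHOE's spirit rather than SHOE's syntax, how such rule sets can be applied to rewrite statements: a terminological iff-rule, a scope rule mapping to a subsuming category, and a value-encoding rule. All rule contents are invented examples.

    # Applying simple mapping rules between two vocabularies, in the spirit
    # of SHOE's inference rules (illustrative only, not SHOE syntax).

    term_map = {"Auto": "Car"}                   # terminological iff-rule
    scope_map = {"Minivan": "Vehicle"}           # map to most specific subsumer
    value_map = {("currency", "USD"): ("currency", "dollars")}  # encoding rule

    def translate(statement):
        """Rewrite a (subject, predicate, value) statement into the target ontology."""
        subj, pred, value = statement
        subj = term_map.get(subj, scope_map.get(subj, subj))
        pred, value = value_map.get((pred, value), (pred, value))
        return (subj, pred, value)

    print(translate(("Auto", "currency", "USD")))
    # -> ('Car', 'currency', 'dollars')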

To solve versioning problems, the SHOE project gives version numbers to ontologies and suggests three ways to incorporate the results of an ontology integration effort. These revision schemes allow for the different effects of revisions on compatibility:

1) In the first approach, a new mapping ontology that extends all the existing ones is created. Users of the integrated ontology should explicitly conform to the newly created ontology

2) Second, each ontology that is involved in the integration could be revised to include the mutual relations to the other ontologies

3) Third, it is possible to create a new intersection ontology, which will be extended by the already existing ontologies.

Any source can commit to the ontology of its choice, thus saying that it agrees with any conclusions that logically follow from its statements and the rules in the ontology. Agents are free to pick which ontologies they use to interpret a source and depending on the differences between these two ontologies they may get the intended meaning or an alternate one.

The SHOE versioning facilities provide both identification of revisions and an explicit specification of their relation to other versions.

4 Overview of Approaches

Below is an overview of the approaches used in the projects and techniques mentioned above. We also list some papers that describe similar approaches.

Tables 1 and 2 relate the tools and approaches that were presented to the above framework. The tables should be read as follows:

• An ‘A’ means “tool or technique solves this without user interaction (automatically)”
• A ‘U’ means “tool or technique suggests solutions to the user”
• An ‘M’ means “tool or technique provides a mechanism for specifying the solution to this problem”

Issues (columns: SKC, Chim., PROMPT, SHOE, OntoM., Metamodel, OKBC, Layering)

Language level mismatches
  Syntax: M M M M
  Representation: M M M M
  Semantics: M M M
  Expressivity: -
Ontology level mismatches
  Paradigm: M
  Concept description: M
  Coverage of model: -
  Scope of concepts: M U U M M
  Synonyms: M U U M M
  Homonyms: M U
  Encoding: M M M
Practical problems
  Finding alignments: U U U
  Diagnosis of results: A A A
  Repeatability: A A A
Ontology versioning
  Identification: M
  Change tracking: M
  Translation: -

Table 1 – Table of problems and approaches for combined use of ontologies

We discovered four different approaches that aim at enabling interoperability between different ontologies at the language level:

1) Aligning the meta-model: The constructs in the language are formally specified in a general model (Bowers and Delcambre, 2000)

2) Layered interoperability: Aspects of the language are split up into clearly defined layers, so that interoperability can be solved layer by layer (Melnik and Decker, 2000)

3) Transformation rules: The relation between two specific constructs in different ontology languages is described with a rule that specifies the transformation from the one to the other (OntoMorph)

4) Mapping onto a common knowledge model: The constructs of an ontology language are mapped onto a common knowledge model (OKBC)

We can observe that the third approach can be used to implement the fourth.

Issues (columns: Anchor-PROMPT, FCA-MERGE, IF-MAP, CAIMAN, HICAL, GLUE)

Language level mismatches
  Syntax: M U
  Representation: M U
  Semantics: -
  Expressivity: -
Ontology level mismatches
  Paradigm: M
  Concept description: U A
  Coverage of model: -
  Scope of concepts: U A A A
  Synonyms: U U A A A A
  Homonyms: U A A A A
  Encoding: M
Practical problems
  Finding alignments: A U U A A U or A
  Diagnosis of results: U
  Repeatability: A A A A A A
Ontology versioning
  Identification: -
  Change tracking: -
  Translation: -

Table 2 – Table of problems and approaches for combined use of ontologies (continued)

We want to recall that the alignment of concepts is a task that requires understanding of the meaning of concepts and cannot be fully automated. Consequently, at the model level, we found tools that suggest alignments and mappings based on heuristic matching algorithms and provide means to specify these mappings. Such tools support the user in finding the concepts in the separate ontologies that might be candidates for merging. Some tools go a bit further by also suggesting the actions that should be performed. Roughly speaking, there are two types of heuristics:

1) Linguistic-based matches: terms with the same word stem, nearby terms in WordNet, or even simpler heuristics, like omitting hyphens and capitalizing all terms. Examples can be found in (Hovy, 1998; Knight and Luk, 1994).

2) Structural and model similarity: see, for example, the techniques described in Chimaera and in Weinstein and Birmingham (1999).

Although full automation of the alignment task is not possible, new techniques have arisen that automate it but either require a manual or semi-automatic kick-start or produce results that are not always satisfactory. These techniques are grouped into three categories. The first category takes a completely different approach from the other two. The other two categories are similar in that they take into account the instances of the ontologies in question and use formal methods. The use of instances, together with the formal methods, helps in dealing with synonyms, homonyms and the scope of concepts, but it requires that both ontologies in question have a significant number of appropriate instances. The three categories are:


1. Anchor-PROMPT, an attempt to extend structural heuristic methods such as those of Chimaera and PROMPT by considering global context in addition to local context.

2. Techniques that use formal mathematical foundations to automate the alignment process, taking ontology instances into account. We analyzed FCA-MERGE and IF-MAP as representative methods of this category. FCA-MERGE also applies linguistic analysis to the documents/instances of the ontologies.

3. Approaches that use machine-learning techniques (CAIMAN, HICAL and GLUE) and require the input ontologies to be instantiated. The results of these methods are comparable to those of the previously mentioned methods. The GLUE system is particularly innovative and extensible and stands out from the other two approaches. (The instance-based intuition shared by the last two categories is sketched after this list.)
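The following toy sketch illustrates the common intuition behind the instance-based categories: concepts from two ontologies that classify many of the same instances are candidate alignments. The Jaccard measure and the dictionary encoding are our own simplifications and do not reproduce the actual FCA-MERGE, IF-MAP or GLUE algorithms.

```python
# Toy instance-based matcher: score concept pairs by the overlap of the
# instance sets they classify, and suggest pairs above a threshold.

def jaccard(instances_a, instances_b):
    a, b = set(instances_a), set(instances_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_alignments(onto_a, onto_b, threshold=0.5):
    """onto_a, onto_b: dicts mapping concept names to sets of instance ids."""
    suggestions = []
    for ca, ia in onto_a.items():
        for cb, ib in onto_b.items():
            score = jaccard(ia, ib)
            if score >= threshold:
                suggestions.append((ca, cb, round(score, 2)))
    return suggestions

onto_a = {"Paper": {"doc1", "doc2", "doc3"}, "Author": {"p1", "p2"}}
onto_b = {"Article": {"doc1", "doc2", "doc4"}, "Writer": {"p1", "p2", "p3"}}
print(suggest_alignments(onto_a, onto_b))
# [('Paper', 'Article', 0.5), ('Author', 'Writer', 0.67)]
```

The sketch also makes the stated precondition visible: with few or no shared instances, every score collapses towards zero and no alignment can be suggested.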

A slightly different approach for interoperability at the model level is the use of a common top-level ontology. This approach is only useful if there is a willingness to conform to a common standard.

There are also different approaches for diagnosing or checking the results of the alignments. We have observed two types of checks:

1) Domain-independent verification and validation checks: name conflicts, dangling references, etc. (Chimaera and others); a small sketch is given after this list

2) Validation that requires some kind of reasoning: redundancy in the class hierarchy, value restrictions that violate class inheritance, etc. (OntoMorph, PROMPT)
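A minimal sketch of a check of the first kind: after a merge, every class referenced in the subclass hierarchy should itself be present in the merged ontology. The encoding of the ontology is hypothetical and chosen only for brevity.

```python
# Domain-independent diagnosis: find dangling references in a merged
# class hierarchy (edges that mention classes missing from the merge).

def dangling_references(classes, subclass_of):
    """classes: set of class names; subclass_of: (child, parent) pairs."""
    return [(child, parent)
            for child, parent in subclass_of
            if parent not in classes or child not in classes]

merged_classes = {"Person", "Student"}
merged_hierarchy = [("Student", "Person"), ("PhDStudent", "Student")]
print(dangling_references(merged_classes, merged_hierarchy))
# [('PhDStudent', 'Student')] – 'PhDStudent' was not carried over into the merge
```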

Several tools support an executable specification of the mappings and transformations (SKC, OntoMorph, PROMPT, Anchor-PROMPT, FCA-MERGE, IF-MAP, CAIMAN, HICAL, GLUE). This allows revised ontologies to be re-merged, so the intellectual effort invested in finding and formulating the alignments is preserved, as the sketch below illustrates.
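The following sketch shows why an executable mapping specification preserves effort: the mappings are stored as data and can simply be re-applied when a source ontology is revised. The operation records are invented for illustration and do not follow any particular tool's format.

```python
# Executable mapping specification: stored mapping operations can be
# replayed against a revised version of a source ontology.

mappings = [
    {"op": "merge", "source": "Auto", "target": "Car"},
    {"op": "rename", "source": "Colour", "target": "Color"},
]

def apply_mappings(concepts, mappings):
    concepts = set(concepts)
    for m in mappings:
        if m["source"] in concepts:
            concepts.discard(m["source"])
            concepts.add(m["target"])
    return concepts

# Re-running the same stored mappings on a *revised* ontology:
revised = {"Auto", "Colour", "Engine"}    # 'Engine' was added in a new version
print(apply_mappings(revised, mappings))  # {'Car', 'Color', 'Engine'}
```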

Finally, most techniques and tools do not deal with versioning. Only SHOE elaborates on schemes that enable the combined use of different ontologies; its authors mention three ways to integrate separate (revisions of) ontologies without invalidating the existing ones.


5 Conclusion and Remarks

In this paper, we analyzed the problems that hinder the combined use of ontologies. These problems are of several kinds and may occur at several levels. The analysis of the problems yielded a framework that we used to examine which solutions are provided by current tools and techniques. This examination is still very general and will be worked out further in the future.

We have seen that several approaches provide reasonable support for language-level interoperability. Mismatches in expressiveness between languages are inherently unsolvable, and consequently none of the approaches attempts to address them.

The most difficult problems are those of conceptual integration. There are many techniques and heuristics for suggesting alignments, and many of them automate or semi-automate the alignment task. However, we think that semantic mapping at the model level will remain a task that requires a certain level of human intervention, because the alignment techniques are not formally evaluated and their results are not always correct.

Finally, in an open environment such as the Web, versioning methods will be very important. We have seen that this aspect is underdeveloped in most approaches. We think that more comprehensive schemes for the interoperability of ontologies are required.


6 References

[Arens et al., 1996] Yigal Arens, Chun-Nan Hsu, and Craig A. Knoblock. Query processing in the sims information mediator. In Advanced Planning Technology. AAAI Press, California, USA, 1996.

[Bowers and Delcambre, 2000] Shawn Bowers and Lois Delcambre. Representing and transforming model-based information. In First Workshop on the Semantic Web at the Fourth European Conference on Digital Libraries, Lisbon, Portugal, September 18–20, 2000.

[Brickley et al., 1999] Dan Brickley, Jane Hunter, and Carl Lagoze. ABC: A logical model for metadata interoperability, October 1999. Harmony discussion note, see http://www.ilrt.bris.ac.uk/discovery/harmony/docs/abc/abc_draft.html.

[Calvanese et al., 2001] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Description logics for information integration. In Computational Logic: From Logic Programming into the Future (In honour of Bob Kowalski), Lecture Notes in Computer Science. Springer-Verlag, 2001. To appear.

[Chakrabarti et al., 1998] Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7:163–178, 1998.

[Chalupsky, 2000] Hans Chalupsky. OntoMorph: A translation system for symbolic logic. In Anthony G. Cohn, Fausto Giunchiglia, and Bart Selman, editors, KR2000: Principles of Knowledge Representation and Reasoning, pages 471–482, San Francisco, CA, 2000. Morgan Kaufmann.

[Chaudhri et al., 1998] Vinay K. Chaudhri, Adam Farquhar, Richard Fikes, Peter D. Karp, and James P. Rice. OKBC: A programmatic foundation for knowledge base interoperability. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98) and of the 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 600–607, Menlo Park, July 26–30 1998. AAAI Press.

[DAML, 2001] DAML (2001). DAML ontology library. http://www.daml.org/ontologies/.

[Doan, 2002] AnHai Doan. Learning to map between structured representations of data. PhD thesis, University of Washington, 2002.

[Doan et al., 2002] A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the Semantic Web. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), Hawaii, USA, May 2002.


[Domingue, 1998] J. Domingue. Tadzebao and WebOnto: Discussing, Browsing, and Editing Ontologies on the Web. In Proceedings of the 11th Knowledge Acquisition, Modelling and Management Workshop, KAW’98, Banff, Canada, April 1998.

[Euzenat, 2001] Jérôme Euzenat. Towards a principled approach to semantic interoperability. In Asunción Gómez-Pérez, Michael Gruninger, Heiner Stuckenschmidt, and Michael Uschold, editors, Workshop on Ontologies and Information Sharing, IJCAI'01, Seattle, USA, August 4–5, 2001.

[Farquhar et al., 1997] A. Farquhar, R. Fikes, and J. Rice. The Ontolingua Server: a tool for collaborative ontology construction. International Journal of Human-Computer Studies, 46(6):707–728, June 1997.

[Fensel et al., 1999] D. Fensel, V. R. Benjamins, E. Motta, and B. Wielinga. UPML: A framework for knowledge system reuse. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, 1999.

[Fleiss, 1973] Joseph L. Fleiss. Statistical Methods for Rates and Proportions. John Wiley & Sons, 1973.

[Ganter and Wille, 1999] Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1999.

[Gennari et al., 1998] Gennari, J.H., Grosso, W. and Musen, M.A. (1998). A method-description language: An initial ontology with examples. In: Proceedings of the Eleventh Banff Knowledge Acquisition for Knowledge-Bases Systems Workshop, Banff, Canada.

[Genesereth and Fikes, 1992] M. R. Genesereth and R. E. Fikes. Knowledge Interchange Format, Version 3.0. Technical Report Logic-92-1, Computer Science Department, Stanford University, 1992.

[Goasdoué et al., 1999] François Goasdoué, Véronique Lattes, and Marie-Christine Rousset. The use of the CARIN language and algorithms for information integration: The PICSEL project. International Journal of Cooperative Information Systems (IJCIS), 9(4):383–401, 1999.

[Goh, 1997] Cheng Hian Goh. Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Sources. PhD thesis, MIT, 1997.

[Grosso et al., 1998] William E. Grosso, John H. Gennari, Ray W. Fergerson, and Mark A. Musen. When knowledge models collide (how it happens and what to do). In Proceedings of the 11th Workshop on Knowledge Acquisition, Modeling and Management (KAW ’98), Banff, Canada, April 18–23 1998.

[Gruber, 1993] Tom Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220, 1993.


[Heflin and Hendler, 2000] Jeff Heflin and James Hendler. Dynamic ontologies on the web. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), pages 443–449. AAAI/MIT Press, Menlo Park, CA, 2000.

[Ichise et al., 2001] R. Ichise, H. Takeda, and S. Honiden. Rule induction for concept hierarchy alignment. In Proceedings of the Workshop on Ontology Learning at the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001.

[Kalfoglou and Schorlemmer, 2002] Y. Kalfoglou, and M. Schorlemmer (2002). Information Flow based ontology mapping. In Proceedings 1st International Conference on Ontologies, Databases and Applications of Semantics (ODBASE'02), Irvine, CA, USA.

[Kitakami et al., 1996] H. Kitakami, Y. Mori, and M. Arikawa. An intelligent system for integrating autonomous nomenclature databases in semantic heterogeneity. In Database and Expert System Applications, DEXA’96, number 1134 in Lecture Notes in Computer Science, pages 187–196, Zürich, Switzerland, 1996.

[Klein, 2001] Klein M. Combining and relating ontologies: an analysis of problems and solutions. In: Proceedings of the Workshop on Ontologies and Information Sharing, IJCAI'01, Seattle, USA, August 4-5, 2001.

[Koch and Lacher, 2000] M. Koch and M. S. Lacher. Integrating community services – a common infrastructure proposal. In Proceedings of Knowledge-Based Intelligent Engineering Systems and Allied Technologies, pages 56–59, 2000.

[Lacher and Groh, 2001] M. Lacher and G. Groh. Facilitating the exchange of explicit knowledge through ontology mappings. In Proceedings of the 14th Int. FLAIRS conference, 2001.

[LYCOS Japan, 2000] LYCOS Japan. http://www.lycos.co.jp, 2000.

[McBride, 2001] B. McBride. Jena: Implementing the RDF model and syntax specification. Technical report, HP Labs at Bristol, UK, December 2001. Semantic Web Activity, http://www.hpl.hp.com/semweb/.

[McGuinness et al., 2000] Deborah L. McGuinness, Richard Fikes, James Rice, and Steve Wilder. An environment for merging and testing large ontologies. In Anthony G. Cohn, Fausto Giunchiglia, and Bart Selman, editors, KR2000: Principles of Knowledge Representation and Reasoning, pages 483–493, San Francisco, 2000. Morgan Kaufmann.

[Melnik and Decker, 2000] Sergey Melnik and Stefan Decker. A layered approach to information modeling and interoperability on the web. In Electronic proceedings of the ECDL 2000 Workshop on the Semantic Web, Lisbon, Portugal, September 21 2000.


[Mena et al., 1996] E. Mena, V. Kashyap, A. Sheth, and A. Illarramendi. OBSERVER: An approach for query processing in global information systems based on interoperability between pre-existing ontologies. In Proceedings of the 1st IFCIS International Conference on Cooperative Information Systems (CoopIS ’96), Brussels, 1996.

[Mitra et al., 2000] Prasenjit Mitra, Gio Wiederhold, and Martin Kersten. A graph-oriented model for articulation of ontology interdependencies. In Proceedings of Conference on Extending Database Technology, (EDBT 2000), Konstanz, Germany, March 2000. Also, Stanford University Technical Note, CSL-TN-99-411, August, 1999.

[Motta, 1999] E. Motta. Reusable Components for Knowledge Models: Case Studies in Parametric Design Problem Solving, volume 53 of Frontiers in Artificial Intelligence and Applications. IOS Press, 1999.

[Neumann et al., 1997] G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information extraction core system for real world German text processing. In Proceedings of the Conference on Applied Natural Language Processing (ANLP-97), Washington, DC, 1997.

[Noy and Musen, 2001] N. Noy and M. Musen. Anchor-PROMPT: Using Non-Local Context for Semantic Matching. In Proceedings of the Workshop on Ontologies and Information Sharing at the International Joint Conference on Artificial Intelligence (IJCAI), 2001.

[Noy and Musen, 2000] Natalya Fridman Noy and Mark A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Austin, TX, 2000. AAAI/MIT Press.

[Stuckenschmidt et al., 2000b] Heiner Stuckenschmidt, Holger Wache, Thomas Vögele, and Ubbo Visser. Enabling technologies for interoperability. In Ubbo Visser and Hardy Pundt, editors, Workshop on the 14th International Symposium of Computer Science for Environmental Protection, pages 35–46, Bonn, Germany, 2000. TZI, University of Bremen.

[Stumme et al., 2000] G. Stumme, R. Taouil, Y. Bastide, N. Pasquier, and L. Lakhal. Fast computation of concept lattices using data mining techniques. In Proceedings of KRDB '00, pages 129–139, 2000. http://sunsite.informatik.rwth-aachen.de/Publications/CEURWS/.

[Stumme and Maedche, 2001] G. Stumme and A. Maedche. Ontology Merging for Federated Ontologies on the Semantic Web. In Proceedings of the International Workshop for Foundations of Models for Information Integration (FMII-2001), Viterbo, Italy, September 2001.

[Visser et al., 1997] Pepijn R. S. Visser, Dean M. Jones, T. J. M. Bench-Capon, and M. J. R. Shave. An analysis of ontological mismatches: Heterogeneity versus interoperability. In AAAI 1997 Spring Symposium on Ontological Engineering, Stanford, USA, 1997.


[Wache et al., 1999] H. Wache, Th. Scholz, H. Stieghahn, and B. König-Ries. An integration method for the specification of rule-oriented mediators. In Yahiko Kambayashi and Hiroki Takakura, editors, Proceedings of the International Symposium on Database Applications in Non-Traditional Environments (DANTE’99), pages 109–112, Kyoto, Japan, November 28–30, 1999.

[Wiederhold, 1994] Gio Wiederhold. An algebra for ontology composition. In Proceedings of 1994 Monterey Workshop on Formal Methods, pages 56–61, U.S. Naval Postgraduate School, Monterey CA, September 1994.

[Wille, 1982] R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In I. Rival, editor, Ordered Sets, pages 445–470. Reidel, Dordrecht, 1982.

[Yahoo! Japan, 2000] Yahoo! Japan. http://www.yahoo.co.jp/, 2000.
