ΔΙΑΠΑΝΕΠΙΣΤΗΜΙΑΚΟ – ΔΙΑΤΜΗΜΑΤΙΚΟ …  · Web viewIntroduction. 1....


Internship Report

July-December 2012

MARY S. MOUROUTSOU

USAGE OF THE ILSP FOCUSED MONOLINGUAL CRAWLER FOR DEVELOPING THESAURI OF

HISTORICAL PERIOD NAMES RETRIEVED FROM THE WEB

ATHENS, 2013

Supervisors: Dr. Stella Markantonatou, Dr. Vasilis Papavasiliou

INSTITUTE FOR LANGUAGE AND SPEECH PROCESSING/R.C. “ATHENA”


This is an internship report. I did my internship at the Institute for Language and Speech Processing of the “Athena” Research and Innovation Center in Information, Communication and Knowledge Technologies (ILSP/R.C. “Athena”), in the framework of the postgraduate course “Technoglossia”, which is organized by the Department of Linguistics/Division of Literature/School of Philosophy/National and Kapodistrian University of Athens and the National Technical University of Athens.


Contents

List of Pictures

CHAPTER ONE
1. Introduction

CHAPTER TWO
2. Corpora and Crawlers
2.1 Introduction
2.2 WEB CRAWLER
2.2.1 Definition
2.2.2 Examples of Web crawlers
2.2.3 Crawling Strategies
2.3 Corpora

CHAPTER THREE
3. The FMC
3.1 Focused Monolingual Crawler (FMC)
3.1.1 FMC Architecture
3.1.2 FMC – User Interface

CHAPTER FOUR
4. The experiments
4.1 Experiment I
4.1.1 Input
4.1.2 Output
4.2 Experiment II
4.2.1 Input
4.2.2 Output
4.3 Experiment III
4.3.1 Input
4.3.2 Output


CHAPTER FIVE
5. Conclusions and Future Work
5.1 Conclusions
5.1.1 The functional features
5.2 Future work

Bibliography


List of Pictures

Picture 1.1: The root of the URL
Picture 1.2: The architecture of a Web Crawler
Picture 2.1: A typical workflow for acquiring monolingual domain-specific data
Picture 2.2: User Interface
Picture 3.1: TermList – Experiment I
Picture 3.2: URLlist – Experiment I
Picture 3.3: Optional Parameters
Picture 3.4: Text review Database
Picture 3.5: TermList – Experiment II
Picture 3.6: Text review – Experiment II
Picture 3.7: TermList – Experiment III
Picture 3.8: Text review – Experiment III
Picture 4.1: Detailed documentation


CHAPTER ONE

1. Introduction

The aim of this work is to study period names of Modern Greek history. We develop a thesaurus that will be useful for information retrieval, machine translation, digital libraries and educational technologies. The value of such resources for lexicographic work in English and other languages, as well as the background of the use of corpora in lexicography, has been described elsewhere (Kilgarriff and Rundell 2002, Kilgarriff et al. 2004).

A thesaurus is a “controlled and dynamic vocabulary of semantically and generically related terms, which covers a specific domain of knowledge” (Foskett, 1997).

In order to develop thesauri, corpora dense in terms from the particular thematic domain are necessary. There are no Greek corpora dense in period names. Some general-purpose Greek corpora do exist, the Hellenic National Corpus being the standard example, along with the Corpus of Greek Texts, but they are not built for use in specialized domains such as education or medicine.

We therefore develop a thesaurus drawing on domain corpora retrieved from the Web 2.0. To construct these corpora we needed period names as seeds.


We could develop our thesaurus by retrieving period names from history books, but our approach is to seek them on the Web. This decision was taken for three main reasons:

1. Period names are geographically constrained. For instance, in the Ionian Islands of Greece the period name “Σεισμός” denotes the great earthquake of 1953 and the few months that followed it. However, nowhere else in Greece is the same word, which literally means “earthquake”, used as a period name, or at least as a period name with this denotation. So, in order to collect a good variety of period names, one would have to go through a massive bibliography. The Web seemed to offer a good alternative.

2. If large and varied corpora of Modern Greek existed, one might be able to use them rather than the Web. However, such corpora do not exist, so the Web is the obvious solution.

3. Even if such corpora existed, they would probably consist of texts of some age. Period names, however, are very much like Multi Word Expressions, in the sense that new ones are coined all the time. Again, the Web seems to be the best choice for our purposes.

For the construction of our corpus from the Web we used a language tool developed by the Institute for Language and Speech Processing (ILSP/ “Athena” R.C.). ILSP’s Focused Monolingual Crawler (FMC) is a program that explores the Web and automatically downloads pages for a specific domain. We used ILSP’s FMC because:

It is available as a web service

It is the only crawler for Greek

It is inexpensive, as it is free.

It has a user-friendly interface that does not require programming knowledge.

The crawler was not trained to retrieve period names, so we had to train it. In this document we report on the training of the particular crawler and the kind and the amount of resources we collected.


CHAPTER TWO

2. Corpora and Crawlers

2.1 Introduction

The sum of human knowledge is increasing continuously. This large amount of data has created the need for a thorough and easy-to-use method of storage and organization. The World Wide Web arose in the early 1990s as an answer to this exact need. Even though this was a most impressive storage feat, there were still some problems concerning the retrieval of specific data to answer user enquiries. Thus the development of specialized software began, a process that led to the design of the first search engines.

A search engine is a combination of different software sets that includes:

a crawler (alternatively known as a bot or a spider), which explores the Web, beginning from a group of sites predetermined by the user and using the hypertext links found on these sites to discover more relevant pages online

a catalogue, which creates a comprehensive index of the sites found

an application that compares the list of retrieved sites with the initial search request and returns a finalized index (the search results).

2.2 WEB CRAWLER

2.2.1 Definition

A web crawler is a program that automatically and methodically visits web pages and indexes them. This process is known as web crawling or spidering and is used by search engines to download pages from the Web, index them and provide fast searches.

The user supplies a list of URLs to visit, which is called the seed queue. A URL (Uniform Resource Locator) is a URI (Uniform Resource Identifier) that specifies where an identified resource is located and the mechanism for retrieving it. The web crawler visits a URL from the seed queue, downloads the web page, identifies the hyperlinks in the page, extracts the URLs from its HTML and adds the new URLs to the “crawl frontier”, which is the new list of URLs to visit. This cycle of visits is repeated recursively. A web crawler may, for example, be fed only with the home page of a site and then download the rest of it. A site has a tree structure: the root is the first URL, the hypertext links on it are the children of the root, and so on (Picture 1.1). This process, and the order in which the visits are made, differs for each web crawler according to a set of policies.
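The visiting cycle described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `crawl` function, the `fetch` callable and the in-memory `PAGES` dictionary are hypothetical names introduced here, and the "site" is a stand-in for the Web so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages starting from the seeds; return URLs in visiting order.

    `fetch` is a hypothetical callable mapping a URL to its HTML (or None).
    """
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# A tiny in-memory "site" standing in for the Web (illustrative data only).
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/c">C</a>',
    "http://example.org/b": '<a href="/">home</a>',
    "http://example.org/c": "",
}

order = crawl(["http://example.org/"], PAGES.get)
print(order)
```

Starting from the home page alone, the sketch reaches the rest of the site through the extracted links, which mirrors the tree structure of Picture 1.1. A real crawler would replace `PAGES.get` with an HTTP download and add the policies discussed below.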


Picture 1.1: The root of the URL

The size of the Web and the ongoing change in the contents of its pages make the task assigned to a web crawler particularly demanding. The main challenge facing a web crawler is thus to cope with these obstacles and nonetheless provide suitable results in an acceptable amount of time. For this reason the crawler applies a normalization process, which brings addresses to a common form, so that it does not visit the same pages again and again.
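Assuming that the normalization meant here is URL normalization, the idea can be illustrated with the following sketch. The function `normalize_url` and its small set of rules are assumptions made for illustration (real crawlers apply more rules, and this version drops any user information and does not handle IPv6 hosts).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of `url` (a simplified set of rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default one for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it addresses a position inside the same page.
    return urlunsplit((scheme, host, path, parts.query, ""))


print(normalize_url("HTTP://Example.ORG:80/index.html#top"))
print(normalize_url("http://example.org"))
```

Two addresses that normalize to the same string can then be treated as one page, for instance by storing the normalized form in the set of already seen URLs.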

The behavior of a web crawler is the outcome of a combination of policies:

A selection policy that states which pages to download. Because the Web is gigantic and a web crawler cannot download all of its pages, the crawler must select and visit the pages most relevant to the domain. Lawrence and Giles (1999) showed that no search engine indexed more than 16% of the Web, meaning that even the most comprehensive search engine indexed only a small fraction of the entire Web.

A re-visit policy that states when to check for changes to the pages. Web pages change at different rates. By the time a crawl has finished, new pages may have been created by Web users, and existing pages may have been updated to a minor or major degree. In addition, an average of 5.3% of the links returned by search engines point to deleted pages, as Lawrence and Giles maintain. All of these factors lead to the conclusion that the web crawler has to revisit the pages and refresh them.


A politeness policy that states how to avoid overloading Web sites. The use of a web crawler can have an impact on the overall performance of a site. The interval between successive accesses by known crawlers to the same site ranges from 20 seconds to 3-4 minutes, which explains why many web server administrators complain about overloading. A partial solution is the robots exclusion protocol (or robots exclusion standard), which states which parts of a Web server crawlers are not allowed to access.

A parallelization policy that states how to coordinate distributed Web crawlers. Because of the size of the Web, there are crawlers that run multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization. The parallel crawlers should not visit the same web pages more than once; consequently, the system requires a policy for assigning the new URLs discovered during the crawling process.
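As an illustration of the politeness policy, the sketch below checks URLs against robots exclusion rules with Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT` and the user-agent name `example-crawler` are invented for the example; a real crawler would download the rules from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; a real crawler would download it
# from the site's /robots.txt before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

AGENT = "example-crawler"  # hypothetical user-agent name


def allowed(url):
    """True if the exclusion rules permit this crawler to fetch `url`."""
    return robots.can_fetch(AGENT, url)


print(allowed("http://example.org/history/1821.html"))
print(allowed("http://example.org/private/notes.html"))
print(robots.crawl_delay(AGENT))
```

The crawl delay, when a site declares one, can be used as the minimum waiting time between two requests to that site.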

To sum up, a crawling system has to be characterized by certain qualities. Foremost is flexibility, meaning that it should be suitable for a wide variety of scenarios. Moreover, high performance and scalability are of utmost importance: such software should scale to at least one thousand pages per second and ideally extend up to millions of pages. Furthermore, one should not forget fault tolerance, as the program should not only process invalid HTML code and deal with unexpected Web server behavior, but also be able to handle stopped processes or interruptions in network services. Last but not least come maintainability and configurability. The interface should be appropriate for monitoring the crawling process and should include parameters such as download speed, statistics on the pages, or the amount of data stored. Shkapenyuk and Suel noted that while it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.

The architecture of a web crawler is most often kept as a business secret. A crawler can be written in any programming language, although Java, Perl and C# are the most popular ones. The typical high-level architecture of web crawlers is shown in Picture 1.2.


Picture 1.2: The architecture of a Web Crawler

2.2.2 Examples of Web crawlers

Below follows a list of published crawlers in chronological order, with some details for each one.

RBSE (Eichmann, 1994) is the first published crawler. It consisted of two programs: the first one, the “spider” maintains a queue in a relational database, and the second one, the “mite”, is a modified WWW ASCII browser that downloads pages from the Web.

The WebCrawler (Pinkerton, 1994) was developed by a student and was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and on another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler.

The World Wide Web Worm (McBryan, 1994) was used to build a simple index of document titles and URLs. The index could be searched by using the grep UNIX command.

The crawler of the Internet Archive was designed to archive periodic snapshots of a large portion of the Web.

The personal search agent SPHINX consists of a Java class library that implements multi-threaded Web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, to extract the downloaded data and to implement a basic text-based search engine.


An early version of the Google crawler (Brin and Page, 1998) is described in some detail, but the reference concerns only an early version of its architecture, which was based on C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sent lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to the URL server, which checked whether each URL had been previously seen. If not, the URL was added to the queue of the URL server.

The CobWeb (da Silva et al., 1999) uses a central "scheduler" and a series of distributed "collectors". The collectors parse the downloaded Web pages and send the discovered URLs to the scheduler, which in turn assigns them to the collectors. The scheduler enforces a breadth-first search order with a politeness policy to avoid overloading Web servers. The crawler is written in Perl.

Mercator (Heydon and Najork, 1999; Najork and Heydon, 2001) is a distributed, modular web crawler written in Java. Its modularity arises from the usage of interchangeable "protocol modules" and "processing modules". Protocol modules determine how the Web pages are acquired (e.g. by HTTP), and processing modules determine how the Web pages are processed. The standard processing module just parses the pages and extracts new URLs, but other processing modules can be used to index the text of the pages, or to gather statistics.

WebRACE (Zeinalipour-Yazti and Dikaiakos, 2002) is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading Web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for "subscriptions" to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of "seed" URLs, WebRACE is continuously receiving new starting URLs to crawl from.


2.2.3 Crawling Strategies

There are a number of different scenarios in which a web crawler is used for data mining and acquisition. Below, we briefly describe some strategies that a crawler can use:

Breadth-first crawling: the crawler starts from the set of pages that are given by the user and then explores other Web pages by following hypertext links exactly in the order they are discovered.

Repetitive crawling: because web sites change quickly, crawling of some pages has to be repeated periodically to keep the indexes up to date.

Targeted crawling: to increase the likelihood of downloading web pages of a desired type or category, a web crawler can use a targeted approach.

Deep Web crawling: not all data are directly accessible by following links on the Web. A crawler uses this strategy when data are held in databases and can be reached only by submitting appropriate requests through special forms.
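The difference between breadth-first and targeted crawling lies mainly in how the frontier is ordered. The following sketch shows a best-first frontier in which the most promising URL is always visited next. The link graph, the scores and the function name `best_first_order` are illustrative assumptions, not a description of any particular crawler.

```python
import heapq


def best_first_order(seeds, links, score, max_pages=10):
    """Visit pages in order of decreasing relevance score.

    `links` maps a URL to the URLs it points to and `score` maps a URL
    to a relevance estimate; both are hypothetical stand-ins for
    fetching a page and classifying it.
    """
    # heapq is a min-heap, so scores are negated to pop the best first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(target), target))
    return visited


LINKS = {"seed": ["sports", "civil-war"], "civil-war": ["occupation"]}
SCORES = {"seed": 1.0, "sports": 0.1, "civil-war": 0.9, "occupation": 0.8}

print(best_first_order(["seed"], LINKS, lambda u: SCORES.get(u, 0.0)))
```

With these invented scores the low-scoring "sports" page is visited last, whereas a breadth-first crawler would visit it immediately after the seed, in the order in which the links were discovered.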

2.3 Corpora

In order to construct a rich thesaurus with period names from Modern Greek history and to study their linguistic behavior, a very large corpus with different types of text should be used. The need for a large variety of texts is intensified by the fact that period names are very often created and used within geographically well-defined communities. The available Greek corpora are of small or medium size and certainly not dense in such expressions. So, tools for easily constructing corpora rich in the respective material of interest are of great importance. Crawlers, being programs that search the web to fetch texts that fit a given description, have been presented as a solution to the problem of corpus creation.

For our study we used a revised version of the Focused Monolingual Crawler (FMC) developed by the Institute for Language and Speech Processing (ILSP/ “Athena” R.C.). We have chosen this particular crawler because it is available as a web service and it has a friendly interface for users with medium-to-low familiarity with such programs. Furthermore, it can work with Greek. The aim of this research is to develop, using texts retrieved from the web, a domain-specific corpus dense in the specialized terminology for historical periods.


CHAPTER THREE

3. The FMC

3.1 Focused Monolingual Crawler (FMC)

3.1.1 FMC Architecture

As mentioned above, a crawler is a program that automatically downloads pages from the Web. It starts with a seed set of pages given by the user, downloads them, extracts hyperlinks and crawls the new pages. This process is repeated until all hyperlinks have been checked (Brandman et al.). Focused Monolingual Crawlers (FMC) “seek, acquire, index, and maintain pages on a specific set of topics that represent a narrow segment of the web” (Chakrabarti et al., 1999). As Skadina et al. (2012) note:

Given a narrow domain (topic) and a language, FMC has to be fed with two input datasets: (i) a list of topic definition multi-word term expressions and (ii) a list of topic related URLs. The user can configure FMC in a variety of ways, e.g. set file types to download, domain filtering options, self-terminating conditions, crawling politeness parameters, etc.

A typical workflow for acquiring monolingual domain-specific data is illustrated in Picture 2.1.

Picture 2.1: A typical workflow for acquiring monolingual domain-specific data.

First of all, the user manually collects a list of seed URLs and provides it to the Frontier, as the scheduler of the FMC is called. Next comes the page fetcher. Every single crawling loop requires a large amount of time, so this component lets the crawler visit more than one web page in parallel (multi-threading), which provides a reasonable speed-up and efficient use of the available bandwidth. At this step the user can also set a number of parameters, such as the number of harvesters (the parallel fetching threads; see the ThreadsNumber parameter) to be used. The normalizer detects the text encoding and the format while parsing each page. Next, it transforms all pages into a unified format (plain text) and text encoding (UTF-8). Web pages are promising sources for text-analytical research. However, they may turn out to be troublesome when not cleaned of “noise”. The noise in web documents can be categorized into two groups, global noise and local noise (Yi et al., 2003), and includes advertisements, navigation bars and copyright notices. Such elements are therefore tracked down and marked as “boilerplate” by Boilerpipe (Kohlschütter et al., 2010). The next module identifies the language. As Prokopidis et al. (2011) report:

The FMC uses the Cybozu language identification library that considers n-grams as features and exploits a Naïve Bayes classifier for language identification. If the document is not in the target language, the webpage is excluded from the next step.
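To make the idea of n-gram features with a Naïve Bayes classifier more concrete, here is a toy sketch. It does not reproduce the Cybozu library or its API; the class `NgramLanguageIdentifier`, the two training sentences and the add-one smoothing are all assumptions made for illustration, and a usable identifier would be trained on far more text.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-grams of the lower-cased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Toy Naive Bayes classifier over character n-gram counts."""

    def __init__(self):
        self.counts = {}        # language -> Counter of n-grams
        self.vocabulary = set()

    def train(self, language, text):
        grams = ngrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocabulary.update(grams)

    def identify(self, text):
        best_language, best_score = None, -math.inf
        for language, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of log-probabilities with add-one smoothing; a uniform
            # prior over languages is assumed, so it is left out.
            score = sum(
                math.log((counter[g] + 1) / (total + len(self.vocabulary)))
                for g in ngrams(text)
            )
            if score > best_score:
                best_language, best_score = language, score
        return best_language


identifier = NgramLanguageIdentifier()
identifier.train("en", "the period after the civil war was a time of recovery")
identifier.train("el", "η περίοδος μετά τον εμφύλιο πόλεμο ήταν εποχή ανασυγκρότησης")

print(identifier.identify("the war of independence"))
print(identifier.identify("η εποχή του πολέμου"))
```

A page whose identified language differs from the target language would simply be dropped before topic classification.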

Likewise, a topic classifier makes relevance judgments between the domain that has been specified by the user and the crawled pages, in order to decide on link expansion. A page relevance score p is calculated by the formula:

$$p = \sum_{i=1}^{N} \sum_{j=1}^{4} n_{ij} \cdot w_i^{t} \cdot w_j^{l}$$

where N is the number of terms in the domain definition, $w_i^{t}$ is the weight of term i, $w_j^{l}$ is the weight of location j, and $n_{ij}$ denotes the number of occurrences of term i in location j. The four discrete locations in a web page are title, metadata, keywords, and plain text. The corresponding weights for these locations are 10, 4, 2, and 1.
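The relevance score can be computed as in the sketch below, which sums, over every term and every location, the number of occurrences multiplied by the term weight and the location weight. The location weights are the ones given in the text; the function name, the dictionary layout and the sample page are assumptions introduced for the example, and occurrences are counted by plain substring matching.

```python
# Location weights as given in the text: title, metadata, keywords, plain text.
LOCATION_WEIGHTS = {"title": 10, "metadata": 4, "keywords": 2, "text": 1}


def relevance_score(term_weights, page):
    """Sum occurrences * term weight * location weight over terms and locations.

    `term_weights` maps each term to its weight; `page` maps each
    location name to the text found there. Occurrences are counted by
    plain substring matching, which is a simplification.
    """
    score = 0
    for term, term_weight in term_weights.items():
        for location, location_weight in LOCATION_WEIGHTS.items():
            occurrences = page.get(location, "").lower().count(term.lower())
            score += occurrences * term_weight * location_weight
    return score


terms = {"εμφύλιος πόλεμος": 95, "περίοδος": 45}
page = {
    "title": "Ο εμφύλιος πόλεμος",
    "text": "Η περίοδος αυτή και η περίοδος εκείνη",
}
print(relevance_score(terms, page))
```

For this sample page the first term occurs once in the title (1 x 95 x 10 = 950) and the second term twice in the plain text (2 x 45 x 1 = 90), giving a score of 1040.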

Moreover, links are extracted from all pages in the target language, whether or not the pages are relevant to the domain. Links found on pages that belong to the target domain are given priority, so that relevant web pages are visited earlier. An additional step is the Exporter. In this phase, the stored web documents are scanned and their metadata are extracted. Much information is included there, such as the title, the original URL, keywords, etc. An example of an XML file is given in Figure 2:

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href='http://nlp.ilsp.gr/panacea/xces-xslt/cesDoc.xsl' type='text/xsl'?>
<cesDoc version="0.4" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.xces.org/schema/2003" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <cesHeader version="0.4">
    <fileDesc>
      <titleStmt>
        <title>ΔΕΚΕΜΒΡΙΑΝΑ 1944: Μια μεθοδευμένη Σφαγή «Anarchy press gr</title>
        <respStmt>
          <resp>
            <type>Crawling and normalization</type>
            <name>ILSP</name>
          </resp>

Figure 2. Part of XML data.
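Metadata of this kind can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the `SAMPLE` string is a shortened stand-in modelled on the excerpt above, with a shortened title and with closing tags added only so that it is well-formed, so it should not be taken as the exact structure of a complete exported file.

```python
import xml.etree.ElementTree as ET

# A minimal, well-formed stand-in for an exported file. The closing tags
# are added here only so that the fragment can be parsed.
SAMPLE = """<?xml version='1.0' encoding='UTF-8'?>
<cesDoc version="0.4" xmlns="http://www.xces.org/schema/2003">
  <cesHeader version="0.4">
    <fileDesc>
      <titleStmt>
        <title>ΔΕΚΕΜΒΡΙΑΝΑ 1944</title>
        <respStmt>
          <resp>
            <type>Crawling and normalization</type>
            <name>ILSP</name>
          </resp>
        </respStmt>
      </titleStmt>
    </fileDesc>
  </cesHeader>
</cesDoc>
"""

# The default namespace declared on <cesDoc> must be used in every query.
NS = {"ces": "http://www.xces.org/schema/2003"}

root = ET.fromstring(SAMPLE.encode("utf-8"))
title = root.find(".//ces:titleStmt/ces:title", NS).text
responsible = root.find(".//ces:resp/ces:name", NS).text
print(title, "|", responsible)
```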

There are web pages which have the same content but are referenced by different URLs. As a result, crawler resources are wasted in fetching duplicate pages, the storage cost increases and the quality of search indexes is reduced. An estimate by Fetterly et al. (2003) shows that approximately 29% of web pages are duplicates and that this proportion is increasing. An efficient solution was adopted for this problem, namely the de-duplication strategy included in the Nutch framework, which involves the construction of a text profile based on quantized word frequencies (Prokopidis et al., 2011).
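The principle of a text profile based on quantized word frequencies can be sketched as follows. This is a simplified illustration written for this report and not the code of the Nutch framework; the function name, the parameters `quant_rate` and `min_token_len`, and the choice of MD5 as the hash are assumptions.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash of a text profile built from quantized word frequencies."""
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # The quantization step grows with the highest frequency in the text.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    # Round each frequency down to a multiple of the step and drop
    # words that fall to zero, so small differences do not matter.
    profile = {}
    for word, n in counts.items():
        q = (n // quant) * quant
        if q > 0:
            profile[word] = q
    ordered = sorted(profile.items(), key=lambda item: (-item[1], item[0]))
    data = "\n".join(f"{word} {q}" for word, q in ordered)
    return hashlib.md5(data.encode("utf-8")).hexdigest()


doc_a = "civil war civil war period period 1944 page viewed today"
doc_b = "civil war civil war period period 1944 page viewed yesterday"
doc_c = "world war world war recovery recovery"

print(text_profile_signature(doc_a) == text_profile_signature(doc_b))
print(text_profile_signature(doc_a) == text_profile_signature(doc_c))
```

Because rare words are dropped from the profile, two copies of a page that differ only in a few incidental words receive the same signature and can be treated as duplicates, while pages with different frequent words do not.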


3.1.2 FMC – User Interface

The FMC is a language technology tool available on the official web page of ILSP NLP Web Services (http://nlp.ilsp.gr/soaplab2-axis/#ilsp.ilsp_fmc_row) (Picture 2.2). The documentation of this web service is at http://registry.elda.org/services/160. The user sets the parameters as he or she chooses. There are three mandatory parameters and four optional ones.

Picture 2.2: User Interface

MANDATORY PARAMETERS

Language: The user can choose any language from English, Greek, Spanish, Italian and German. The language identifier checks if the web page is written in the target language.


TermList: a list containing a description of each term as a triplet <relevance, term, topic>. It is very important to evaluate the relevance weight correctly, otherwise the results may be misleading. Consider example (1):

o 90: μετά τον εμφύλιο = ΧΡΠΡ
o 95: περίοδος εμφυλίου πολέμου = ΧΡΠΡ
o 20: Ελευθέριος Βενιζέλος = ΛΟ
o 100: κίνημα στο Γουδί = ΧΡΠΡ
o 45: περίοδος = ΛΟ

UrlList: a seed list of URLs relevant to the topic, given to the tool manually.
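The TermList entries shown in example (1) follow the pattern "weight: term = topic". A small sketch of how such lines could be read into structured records is given below; the `Term` type and the `parse_term_list` function are hypothetical helpers written for illustration and are not part of the FMC.

```python
from typing import NamedTuple


class Term(NamedTuple):
    weight: int
    term: str
    topic: str


def parse_term_list(lines):
    """Parse lines of the form 'weight: term = topic' into Term tuples."""
    terms = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        weight, rest = line.split(":", 1)
        term, topic = rest.rsplit("=", 1)
        terms.append(Term(int(weight), term.strip(), topic.strip()))
    return terms


example = [
    "90: μετά τον εμφύλιο = ΧΡΠΡ",
    "20: Ελευθέριος Βενιζέλος = ΛΟ",
    "100:κίνημα στο Γουδί=ΧΡΠΡ",
]
for entry in parse_term_list(example):
    print(entry)
```

Reading the list this way makes it easy to check, before a crawl, that every weight is a number and that every term carries a topic label.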

OPTIONAL PARAMETERS

Insert_xslt: it defines the stylesheet of the output XML file. It contributes to a more readable form of the output.

MaxTime: determines the time after which the crawler stops. The user can set it from one to a hundred minutes. Crawling proceeds in cycles, so if the defined time expires during a cycle, the crawler does not stop until the end of the running cycle.

MinimumLength: the minimum number of tokens that a paragraph should have. If the number of tokens is less than MinimumLength, the paragraph is marked with the value ooi-length.

ThreadsNumber: the number of harvesters.
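The effect of the MinimumLength parameter can be pictured with the following sketch. It is an assumption-laden illustration: the function name is invented, and tokens are approximated by whitespace-separated words, since the way the crawler itself tokenizes text is not described here.

```python
def flag_short_paragraphs(paragraphs, minimum_length):
    """Pair each paragraph with 'ooi-length' if it has too few tokens.

    Tokens are approximated by whitespace-separated words; how the
    crawler itself tokenizes text is not specified here.
    """
    flagged = []
    for paragraph in paragraphs:
        n_tokens = len(paragraph.split())
        label = "ooi-length" if n_tokens < minimum_length else None
        flagged.append((paragraph, label))
    return flagged


sample = [
    "Αρχική σελίδα",
    "Η περίοδος μετά τον εμφύλιο πόλεμο ήταν δύσκολη",
]
for paragraph, label in flag_short_paragraphs(sample, minimum_length=5):
    print(label, "|", paragraph)
```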


CHAPTER FOUR

4. The experiments

This section describes the three experiments conducted with the crawler and the results obtained. Crawling helps researchers who want to collect pages from the Web and build their own Web corpus.

As mentioned in Section 2.2, a focused crawler is a special type of crawler that seeks out pages about a specific topic (e.g. history) and avoids irrelevant areas of the Web.

Greek was used in all three experiments. There are over 6000 languages in the world (Gordon, 2005). Greek is not an under-resourced language, as the basic information technology is available for it. Furthermore, Greek has a relatively substantial presence on the Web (Berment, 2004). On the other hand, there is a clear need for large corpora, electronic lexica and some state-of-the-art software. It is therefore very important that the FMC is a language technology tool that applies to Greek.


4.1 Experiment I

4.1.1 Input

An important parameter is the TermList. This list plays a key role because it is used to filter content specific to historical periods from the documents. We chose to look for period names of Greek history from 1821 until today. We focused on periods such as World War I and the Greco-Turkish War, World War II, the Greek Civil War, Postwar Recovery, and the Restoration of Democracy. Initially, our terms were collected from high school history textbooks. Evaluating the relevance weight of each term did not seem too difficult at this stage, because at the beginning our keywords were only historical periods (Picture 3.1).

90:ελληνική επανάσταση=ΧΡΠΡ

100:εποχή του 1821=ΧΡΠΡ

90:περίοδος ανεξαρτησίας=ΧΡΠΡ

90:περίοδος του Όθωνα=ΧΡΠΡ

90:φάση της μοναρχίας=ΧΡΠΡ

Picture 3.1: TermList – Experiment I

As explained in Section 1, crawling starts from a given set of URLs. The list of URLs was retrieved using Web search engines. A set of historical period names in Greek was used as search terms to find relevant seed URLs. In the first experiment only a few dozen URLs were used in the crawling phase.

http://el.wikipedia.org/wiki/%CE%A0%CF%8D%CE%BB%CE%B7:%CE%9A%CF%8D%CF%81%CE%B9%CE%B1


http://www.schools.ac.cy/klimakio/Themata/Epikaira/1821/index.html

http://1gym-chiou.chi.sch.gr/historia.htm

http://www.historical-museum.gr/home.html

http://www.nhmuseum.gr/

Picture 3.2: Sample of the URLlist – Experiment I

In all three experiments we used the same values for the optional parameters. The time parameter specified that the crawler would run for an hour before it stopped. Furthermore, the minimum number of tokens per paragraph was set to ten and the number of harvesters to twenty (Picture 3.3).

Optional Parameters Value

MaxTime 60

MinimumLength 10

ThreadsNumber 20

Picture 3.3: Optional Parameters

4.1.2 Output

The output of the FMC is a text file with a list of URLs pointing to XML documents. The user can choose whether or not to see the boilerplate. The tool can also provide other types of information, such as the title or the heading, as explained in Section 2.??.
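Since the output is a plain list of URLs, its size can be checked with a few lines of code. The sketch below assumes one URL per line, which the report does not state explicitly; read_output_urls is a hypothetical helper and the example.org addresses are placeholders, not actual FMC output.

```python
from typing import List


def read_output_urls(text: str) -> List[str]:
    """Collect the URLs from the crawler's output text, one per line (assumed)."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like web addresses; skip blanks and notes.
        if line.startswith(("http://", "https://")):
            urls.append(line)
    return urls


# Placeholder content standing in for a real output file.
sample_output = "http://example.org/doc1.xml\n\n  http://example.org/doc2.xml \nnot a url\n"
sample_urls = read_output_urls(sample_output)
```

The length of the returned list gives the kind of figure reported for each experiment, namely the number of XML documents collected.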

Our research on periodization expressions needs the corpus for two reasons:

1. To confirm the historical periods we already know

2. To determine the contexts where the specified historical names occur or their characteristics. The aim was to enhance our searching tools and obtain more results relevant to the periods in question.

In Experiment I, the output contained almost 500 URLs, a rather small number. For this reason, after reviewing the texts, we decided to enrich the TermList with new terms that we identified in the Experiment I results. Picture 3.4 below shows some of the historical periods and contexts that emerged, stored in a purpose-made database.

Picture 3.4: text review Database

We realized that the FMC would return more useful results if the TermList included more expressions than historical period names alone. In the next experiment we therefore enriched the terms, as explained in the next section.

4.2 Experiment II

4.2.1 Input

Drawing on the experience from Experiment I, we decided to make several changes, starting with the TermList. The enriched list contained not only historical periods but relevant terms as well. In this round, we used names of important historical persons (named entities) and events, such as battles. The challenge now was to assign the relevance weights correctly; otherwise the results would misguide the search. Furthermore, we created another topic to distinguish the terms that we added from the historical periods. We noticed that certain words functioned as heads of period names (e.g. πόλεμος "war", εποχή "era", περίοδος "period"). In addition, we identified certain contexts where period names occurred more frequently. Picture 3.5 shows some examples from the new TermList:

20:Κολοκοτρώνης=ΛΟ

20:Ανδρούτσος=ΛΟ

60:άλωση της Τριπολιτσάς=ΛΟ

70:πριν την απελευθέρωση=ΧΡΠΡ

70:μετά την απελευθέρωση=ΧΡΠΡ

90:πριν τον εμφύλιο πόλεμο=ΧΡΠΡ

90:μετά τον εμφύλιο πόλεμο=ΧΡΠΡ

100:κατά τη διάρκεια του εμφυλίου πολέμου=ΧΡΠΡ

50:περίοδος=ΛΟ

60:συνταγματική μοναρχία=ΛΟ

70:εμφύλια διαμάχη=ΛΟ

70:εμφύλια σύγκρουση=ΛΟ

60:Εθνικός Διχασμός=ΧΡΠΡ

50:καθεστώς Μεταξά=ΛΟ

Picture 3.5: Extract of the TermList – Experiment II

Moreover, we doubled the size of the URL list.


4.2.2 Output

The output of Experiment II contained much better results. The text file consisted of almost 1000 URLs. We processed the texts manually and filled our database with examples of each term. We decided that the relevance weight of person named entities should be lower than the initially specified 20%, because we did not want the crawler to return biographical stories.

A positive finding was that many of our terms were repeated: as the number of URLs increased, the number of our terms for specific periods remained the same (Picture 3.6). Despite this result, we decided to run one more experiment with a richer URL list.
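The observation that the number of terms levelled off while the number of URLs kept growing can be made measurable by counting the distinct terms seen after each reviewed document. The review reported here was done manually; the function below is only a hypothetical sketch, and the input values are toy placeholders rather than data from the experiments.

```python
from typing import Iterable, List, Set


def cumulative_distinct_terms(terms_per_document: Iterable[Set[str]]) -> List[int]:
    """Return the running count of distinct terms after each document."""
    seen: Set[str] = set()
    counts: List[int] = []
    for terms in terms_per_document:
        seen.update(terms)
        counts.append(len(seen))
    return counts


# Toy input: each set holds the terms found in one reviewed document.
toy_documents = [{"t1", "t2"}, {"t2"}, {"t3"}, {"t1", "t3"}]
toy_counts = cumulative_distinct_terms(toy_documents)
```

When the final values of the returned list stop increasing, further documents are no longer contributing new terms for the period in question.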

Picture 3.6: Text review – Experiment II


4.3 Experiment III

4.3.1 Input

In Experiment II we focused on the TermList. The only correction now needed was to the relevance weight of some words or expressions. More specifically, we noticed that the words meant to assist the key terms had to be assigned a weight of less than 25%. Also, the more common a word, the lower its relevance weight should be.

The third experiment focused on the URL list. We decided to run an advanced search with the terms that came out of the previous stage. In this phase we tried to find URLs of blogs and forums in order to capture written language that is closer to idioms. This goal proved more difficult to achieve. Idioms are linguistic expressions or lexical items representing objects, concepts or phenomena of material life particular to a given culture (Adelnia et al. 2011).

50:φάση της κάμψης=ΧΡΠΡ

50:απελευθέρωση=ΛΟ

70:φάση των επιτυχιών=ΧΡΠΡ

90:περίοδος των δεκαεπτά ετών=ΧΡΠΡ

90:η εποχή της οκταετίας=ΧΡΠΡ

90:η εποχή του Καραμανλή=ΧΡΠΡ

90:επί Βενιζέλου=ΧΡΠΡ

Picture 3.7: Extract of TermList – Experiment III

4.3.2 Output

After all the corrections, the text file gave us 1650 URLs, even though the given time was again one hour. No idioms were found, but the contents of the historical periods were specified. Examples are given in Picture 3.8.

Picture 3.8: Text review – Experiment III


CHAPTER FIVE

5. Conclusions and Future Work

In this Section we present our conclusions drawn from the work we described so far and discuss open issues and future research.


5.1 Conclusions

The large size and dynamic nature of the World Wide Web underline the need to build and evaluate tools that help the user to search, collect and manage information. Focused crawlers try to solve these problems within a specific domain (Hesham, 2008). The quality of the downloaded data and the effectiveness of focused crawling depend on many factors.

The Web holds a variety of data, such as religious scripts, medical texts, blogs and forums, audio and video (Hoda, 2010). We believe that the FMC is a language tool that helps the scientific community take a step forward in corpus-based lexicography, language learning, and linguistic research for Greek. Likewise, other NLP tasks that require further analysis may benefit from a corpus carefully collected from the Web.

The construction of a thesaurus based on web texts is an important application for various fields. Using the framework described here, it is possible to collect a much larger corpus of freely available web texts. A researcher who builds a corpus from the Web takes advantage of:

The size

The range

The up-to-date texts

The availability

The multimodality: sound, sight and text (Fletcher, 2011)

Our contribution lies in evaluating the FMC and in proposing hints for its effective usage in order to build a thesaurus. We take up this discussion in the next section.


5.1.1 The functional features

The tool has already proved to be useful. Based on our experiments, one very significant observation concerns the URL list. To start, the FMC needs a seed list of URLs, which unfortunately has to be compiled manually. As a starting point we recommend general search engines or additional sources, such as literature references and databases. The user interface should provide a hyperlink to such sources.

Furthermore, when putting together the TermList, it is very important to assign the relevance weight correctly; otherwise the results may be misleading. The user has to remember that the more common a word is in everyday use, the lower its relevance weight for the domain should be. This is because commonly used words can easily be found in irrelevant texts.

The FMC automatically deletes words of three letters or fewer. Consequently, most Greek conjunctions are deleted. In our case, several of the useful linguistic contexts contained such words, and this reduced the precision of the results.
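The effect of this deletion on period expressions can be illustrated with a short sketch. It mimics only the rule as stated in this report (words of three letters or fewer are dropped); drop_short_words is a hypothetical function and does not reproduce the FMC's own code.

```python
import unicodedata


def drop_short_words(phrase: str, max_dropped_length: int = 3) -> str:
    """Remove every word of max_dropped_length characters or fewer."""
    # NFC normalization keeps an accented Greek letter as a single character,
    # so that len() counts letters rather than combining marks.
    words = unicodedata.normalize("NFC", phrase).split()
    return " ".join(word for word in words if len(word) > max_dropped_length)


after_civil_war = drop_short_words("μετά τον εμφύλιο πόλεμο")
under_venizelos = drop_short_words("επί Βενιζέλου")
```

In the first phrase only the article τον is lost, but in the second the three-letter preposition επί disappears, leaving the bare name Βενιζέλου, which no longer reads as a period expression.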

ISO (International Organization for Standardization) defines the usability of a human-made object, whether a software application or a tool like the FMC, as “the extent to which a product can be used by specified users to achieve goals with effectiveness, efficiency and satisfaction in specified context of use” (ISO, 2011). The usability of the FMC is quite good. There is detailed documentation with examples (Picture 4.1).


Picture 4.1: Detailed documentation

The user is able to start using the FMC after a short introduction, which is a very positive feature. On the negative side, no undo function is available.

As regards the interface, the button “Run Servise” is at the top. It should be moved elsewhere, perhaps to the bottom, beside the “Reset Fields” button. The “Reset Fields” button would also benefit from a confirmation prompt. Lastly, user-friendliness would improve for both native speakers and learners of Greek if a Greek interface were provided, with the option to choose between English and Greek tag sets.

Another positive feature of the FMC is that the user is notified when a mistake is made: an error report is created and the crawling stops automatically. However, the program does not tolerate even small errors.

The output that the FMC exports provides a variety of information. It would be helpful to store all these data in a database, so that further processing would be facilitated.

5.2 Future work

We used the FMC to develop a corpus that would allow us to do two things:

1. To collect a significant amount of periodization terms

2. To identify and study a significant amount of linguistic contexts in which periodization terms occur. This study would help us to retrieve more periodization terms

3. Greece is a country whose various geographical parts did not experience a completely identical historical evolution. Therefore, we could look for periodization and localization in parallel.

At the moment we have developed a database (see the Appendix). We will organize the periodization terms along several dimensions: denotation, locality, period of usage, linguistic structure, English equivalents. We will probably use some, hopefully simple, ontological schema to encode the terms and their relations.
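The dimensions listed above could be captured in a simple record type before any ontological schema is chosen. The following is a minimal sketch: PeriodTerm and its field names are hypothetical and simply mirror the dimensions named in the text; they do not describe the database in the Appendix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeriodTerm:
    """One periodization term with the dimensions planned for the thesaurus."""
    term: str                                   # the expression as found in the corpus
    denotation: Optional[str] = None            # the period the term refers to
    locality: Optional[str] = None              # region where the term is used
    period_of_usage: Optional[str] = None       # when the term itself was in use
    linguistic_structure: Optional[str] = None  # e.g. head word plus modifier
    english_equivalents: List[str] = field(default_factory=list)


# Only the term comes from the TermList; the other fields are left unfilled.
example_term = PeriodTerm(term="εποχή του 1821")
```

Keeping every dimension optional allows terms to be recorded as soon as they are found and annotated later.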


Bibliography

McEnery T., R. Xiao & Y. Tono. (2006). Corpus-Based Language Studies: An Advanced Resource Book. London & New York: Routledge.

Marianne Hundt, Nadja Nesselhauf, Carolin Biewer. (2007). Corpus Linguistics and the Web. Amsterdam: Rodopi.

Skadina, I., Aker, A., Mastropavlos, N., Su, F., Tufis, D., Verlic, M., Vasiljevs, A., Babych, B., Glaros, N. (2012). Collecting and Using Comparable Corpora for Statistical Machine Translation. In Proceedings of LREC 2012, 21-27 May, Istanbul, Turkey.

Onn Brandman, Junghoo Cho, Hector Garcia-Molina, Narayanan Shivakumar. (20??). Crawler-Friendly Web Servers. Dept. of Computer Science Stanford.

Chakrabarti, S., Van Den Berg, M., and Dom, B. 1999. Focused Crawling: A New Approach to Topic-Specific Resource Discovery. In Proceedings of the Eighth World Wide Web Conference, Toronto, Canada.

L. Yi and B. Liu. 2003. Web page cleaning for Web mining through feature weighting. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico.

Kohlschütter, C., Fankhauser, P., and Nejdl, W. 2010. “Boilerplate Detection using Shallow Text Features”. The Third ACM International Conference on Web Search and Data Mining.

D. Fetterly, M. Manasse, and M. Najork. November 2003. On the evolution of clusters of near-duplicate web pages. In LA-WEB '03: Proceedings of the First Conference on Latin American Web Congress, page 37.

Gordon, R.G.J., (ed.). 2005. Ethnologue: Languages of the World, Fifteenth edition. Dallas, Tex.: SIL International.


Berment V., 2004. “Méthodes pour informatiser des langues et des groupes de langues peu dotées”. PhD Thesis, J. Fourier University – Grenoble I, May 2004.

A. Adelnia, H. V. Dastjerdi. 2011. Translation of Idioms: A Hard Task for the Translator. Theory and Practice in Language Studies. July 2011.

Hesham A., 2008. “Self Ranking and Evaluation Approach for Focused Crawler Based on Multi-Agent System”. The International Arab Journal of Information Technology.

Almpanidis, G., Kotropoulos, C. and Pitas, I. 2007. Combining text and link analysis for focused crawling – an application for vertical search engines. Information Systems, 32(6), 886-908.

Lindgaard, G., Fernandes, G., Dudek, C. & Brown, I. (2006). Attention web designers: You have 50 milliseconds to make a good first impression! Behaviour and Information Technology, 25 (2): 115-126.

"ISO 9241-1:1992". International Organization for Standardization. Retrieved 22 July 2011

K. Hoda. An overview of Urdu on the Web. Date retrieved: 14/09/2010. Available at http://www.urdustudies.com/pdf/20/25Resources-Hoda.pdf

W. H. Fletcher. 2011. Corpus Analysis of the World Wide Web. Encyclopedia of Applied Linguistics. Wiley-Blackwell.

Foskett, D.J.: Thesaurus. In: Sparck Jones, K., Willett, P. (eds.): Readings in Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California (1997) 111-134.
