μRaptor: A DOM-based system with appetite for hCard elements

Post on 06-Jul-2015

Winning system of the Linked Data for Information Extraction (LD4IE) Challenge 2014 at ISWC

μRaptor is hungry

Training Phase

Clean the HTML
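
The slides do not say what the cleaning step removes. As a minimal sketch, assuming it means dropping scripts, styles, and comments while keeping tags, attributes, and text, the step could look like this with Python's standard `html.parser` (the names `HtmlCleaner`, `clean_html`, and `SKIPPED_TAGS` are hypothetical, not taken from μRaptor):

```python
import html
import re
from html.parser import HTMLParser

# Assumption: these elements carry no hCard content and can be dropped.
SKIPPED_TAGS = {"script", "style", "noscript"}


class HtmlCleaner(HTMLParser):
    """Re-emits markup without skipped elements and without comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            rendered = "".join(
                f' {name}="{html.escape(value, quote=True)}"'
                for name, value in attrs
                if value is not None
            )
            self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text, collapsing whitespace runs to one space.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(html.escape(re.sub(r"\s+", " ", data), quote=False))

    # Comments are dropped: the default handle_comment does nothing.


def clean_html(raw):
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.parts)
```

A real pipeline would probably also repair malformed markup; this sketch only filters what the parser reports.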

DOM sub-trees
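
One way to picture this step, assuming the sub-trees of interest are those rooted at an element whose class list contains the hCard root class `vcard` (the slides do not state this explicitly), is the sketch below. `Node`, `TreeBuilder`, and `hcard_subtrees` are hypothetical names for illustration only.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.classes = (self.attrs.get("class") or "").split()
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; tolerant of stray end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def hcard_subtrees(markup):
    """Return the outermost nodes carrying class 'vcard', in document order."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    found, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        if "vcard" in node.classes:
            found.append(node)
            continue  # do not descend into an already collected sub-tree
        stack.extend(reversed(node.children))
    return found
```

Each returned node can then be inspected in isolation, which keeps the later steps independent of the rest of the page.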

CSS class co-occurrence

author
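
The slide gives `author` as an example class. A plausible reading, which is an assumption here, is that the system counts how often ordinary CSS classes such as `author` appear on the same element as hCard classes such as `vcard`. The sketch below counts such pairs; `HCARD_CLASSES` is a small illustrative subset of hCard class names, and `ClassCollector` and `class_cooccurrence` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative, incomplete subset of hCard class names.
HCARD_CLASSES = {"vcard", "fn", "n", "email", "tel", "url",
                 "org", "adr", "bday", "photo"}


class ClassCollector(HTMLParser):
    """Records the set of classes found on each element."""

    def __init__(self):
        super().__init__()
        self.class_sets = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if classes:
            self.class_sets.append(set(classes))


def class_cooccurrence(pages):
    """Count (hCard class, other class) pairs seen on the same element."""
    counts = Counter()
    for page in pages:
        collector = ClassCollector()
        collector.feed(page)
        collector.close()
        for classes in collector.class_sets:
            for hcard in classes & HCARD_CLASSES:
                for other in classes - HCARD_CLASSES:
                    counts[(hcard, other)] += 1
    return counts
```

Frequent pairs would hint that a site-specific class marks the same content as an hCard property, even on pages without hCard markup.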

CSS Selectors
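
How μRaptor derives its selectors is not shown on the slides. As a rough sketch, assuming each training example is a property name plus the path of `(tag, classes)` steps from the sub-tree root to the annotated element, a selector per property could be learned by rendering every path as a child-combinator selector and keeping the most frequent one. `selector_for` and `learn_selectors` are hypothetical helpers.

```python
from collections import Counter, defaultdict


def selector_for(path):
    """Render a root-to-target path as a CSS selector, e.g. 'div.vcard > span.fn'."""
    return " > ".join(
        tag + "".join("." + name for name in classes)
        for tag, classes in path
    )


def learn_selectors(examples):
    """Map each property to the selector seen most often for it in training."""
    by_property = defaultdict(Counter)
    for prop, path in examples:
        by_property[prop][selector_for(path)] += 1
    return {
        prop: counts.most_common(1)[0][0]
        for prop, counts in by_property.items()
    }
```

Keeping only the single most frequent selector is a simplification; retaining several selectors with their counts would be an equally reasonable design.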

Value Constraints

We could determine patterns for emails, for example:

vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE

vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com

vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE

vcard:email mailto : ALPHA @ ALPHANUMERIC . com

vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE

… or even for birthdays:

vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER

vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
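
The email and birthday patterns above suggest that each value is split into tokens and every token is replaced by a coarse class. The sketch below reproduces that idea under several assumptions that are not from the slides: the tokenizer, the `mailto` keyword list, and the numeric thresholds (below 100 is `SMALL_NUMBER`, below 10000 is `MEDIUM_NUMBER`) are all guesses, and only the most specific pattern per value is produced. All function names are hypothetical.

```python
import re

# Assumption: scheme keywords stay literal, as 'mailto' does on the slide.
LITERALS = {"mailto"}


def token_class(token):
    """Map one token to a coarse class; punctuation and keywords stay literal."""
    if token in LITERALS or not token.isalnum():
        return token
    if token.isdigit():
        number = int(token)
        if number < 100:        # assumed threshold
            return "SMALL_NUMBER"
        if number < 10000:      # assumed threshold
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"
    return "ALPHANUMERIC"


def value_pattern(value):
    """Generalize a raw value into a space-separated token-class pattern."""
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", value)
    return " ".join(token_class(token) for token in tokens)


def learn_constraints(samples):
    """samples: (property, value) pairs seen in training pages."""
    constraints = {}
    for prop, value in samples:
        constraints.setdefault(prop, set()).add(value_pattern(value))
    return constraints


def satisfies(prop, value, constraints):
    """True if the value's pattern was observed for this property."""
    return value_pattern(value) in constraints.get(prop, set())
```

For instance, `value_pattern("1985-03-12")` yields `MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER` under these thresholds, which has the same shape as the second birthday pattern on the slide. The slide also lists more general variants (such as `NUMBER` for the first field), which this sketch does not generate.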

Extraction Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

Value Constraints

Pattern Detection

CSS Selectors

Elements Qualification

Models Validation

RDF model from μRaptor vs. RDF model from the test set

= 0.94 = 0.7 = 0.8
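
The three values carry no labels in this transcript. If they are precision, recall, and F1 (an assumption), the numbers are at least mutually consistent: the harmonic mean of 0.94 and 0.7 is about 0.80. A minimal sketch of such a comparison, treating both RDF models as sets of `(subject, predicate, object)` triples, is given below; `evaluate` and `f1_score` are hypothetical helpers, not the challenge's official scorer.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(extracted, gold):
    """Compare two collections of RDF triples by exact match."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Usage with made-up triples:
gold = {("_:p1", "vcard:fn", "Ann"), ("_:p1", "vcard:bday", "1985-03-12"),
        ("_:p2", "vcard:fn", "Bob"), ("_:p2", "vcard:email", "mailto:bob@example.org")}
extracted = {("_:p1", "vcard:fn", "Ann"), ("_:p2", "vcard:fn", "Bob"),
             ("_:p2", "vcard:fn", "Robert")}
precision, recall, f1 = evaluate(extracted, gold)
```

Exact matching on blank-node labels is a simplification; comparing real RDF graphs would need blank nodes aligned first.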

μRaptor

https://github.com/emir-munoz/uraptor

We made the discovery of the new μRaptor species and I am very pleased some researchers helped us understand its feeding habits

Godzilla is a doll compared to μRaptor! I am currently working on a script for an upcoming movie

As a kid I always wanted to see an actual dinosaur. Today my dream comes true

Damn, he is better than me!