μRaptor: A DOM-based system with appetite for hCard elements

Winning system of the Linked Data for Information Extraction (LD4IE) Challenge 2014 at ISWC

Transcript of μRaptor: A DOM-based system with appetite for hCard elements

Page 1: μRaptor: A DOM-based system with appetite for hCard elements

μRaptor: A DOM-based system with appetite for hCard elements

Page 2: μRaptor: A DOM-based system with appetite for hCard elements

μRaptor

is hungry

Page 3: μRaptor: A DOM-based system with appetite for hCard elements
Page 4: μRaptor: A DOM-based system with appetite for hCard elements

Training Phase

Clean the HTML

Page 5: μRaptor: A DOM-based system with appetite for hCard elements

Training Phase

Clean the HTML

DOM sub-trees
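
As an illustration of the "DOM sub-trees" step, the sketch below collects the sub-trees rooted at elements that carry the hCard root class `vcard`. It is not μRaptor's code: the `HCardSubtrees` class, the choice of Python's standard `html.parser`, and the decision to keep only the outermost `vcard` root are all assumptions made for this example, and it expects HTML that has already been cleaned.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HCardSubtrees(HTMLParser):
    """Hypothetical helper: lists (tag, classes) for each node under a vcard root."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open elements as (tag, opened_a_vcard_root)
        self.current = None   # sub-tree being filled, or None outside a vcard
        self.subtrees = []    # one list of (tag, classes) per outermost vcard

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        opens_root = self.current is None and "vcard" in classes
        if opens_root:
            self.current = []
            self.subtrees.append(self.current)
        if self.current is not None:
            self.current.append((tag, classes))
        if tag not in VOID:
            self.stack.append((tag, opens_root))

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match nothing currently open.
        if tag not in [open_tag for open_tag, _ in self.stack]:
            return
        # Pop up to the matching element, tolerating unclosed children.
        while self.stack:
            open_tag, was_root = self.stack.pop()
            if was_root:
                self.current = None
            if open_tag == tag:
                break


parser = HCardSubtrees()
parser.feed('<div><div class="vcard"><span class="fn">Ann</span>'
            '<a class="email" href="mailto:a@b.com">mail</a></div>'
            '<p class="x">no</p></div>')
print(parser.subtrees)
```

Each collected sub-tree is a flat, document-order list of nodes; a real system would more likely keep the full tree structure.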

Page 6: μRaptor: A DOM-based system with appetite for hCard elements

Training Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

author
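
One way to picture the "CSS class co-occurrence" step is to count which non-hCard classes (such as `author` on this slide) appear together with hCard classes. The sketch below counts pairs on the same element; that granularity, the function name `class_cooccurrence`, and the short `HCARD_CLASSES` list are assumptions for illustration, since the slides do not say how μRaptor measures co-occurrence.

```python
from collections import Counter
from itertools import product

# A few hCard class names from the microformat; deliberately not the full list.
HCARD_CLASSES = {"vcard", "fn", "n", "email", "tel", "adr", "org", "url", "bday"}


def class_cooccurrence(elements):
    """Count (hCard class, other class) pairs that share an element.

    `elements` is an iterable of class lists, one list per DOM element.
    """
    counts = Counter()
    for classes in elements:
        present = set(classes)
        hcard = present & HCARD_CLASSES
        other = present - HCARD_CLASSES
        counts.update(product(sorted(hcard), sorted(other)))
    return counts


counts = class_cooccurrence([
    ["vcard", "author"],
    ["fn", "author", "name"],
    ["vcard", "author"],
    ["footer"],
])
print(counts.most_common(1))  # [(('vcard', 'author'), 2)]
```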

Page 7: μRaptor: A DOM-based system with appetite for hCard elements

Training Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

CSS Selectors
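
The slides do not show how the CSS selectors are built. As a rough sketch only, a selector could be derived from a path of (tag, classes) steps leading to an hCard property; the `css_selector` function and the use of the child combinator `>` are assumptions for this example.

```python
def css_selector(path):
    """Join (tag, classes) steps into a CSS selector using child combinators."""
    steps = []
    for tag, classes in path:
        steps.append(tag + "".join("." + name for name in classes))
    return " > ".join(steps)


print(css_selector([("div", ["vcard"]), ("span", ["fn", "author"])]))
# div.vcard > span.fn.author
```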

Page 8: μRaptor: A DOM-based system with appetite for hCard elements

Training Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

Value Constraints

CSS Selectors

We could determine patterns for emails, for example:

vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE

vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com

vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE

vcard:email mailto : ALPHA @ ALPHANUMERIC . com

vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE

… or even for birthdays:

vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER

vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
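
The token patterns on this slide can be approximated by splitting a value into tokens and replacing each token with a coarse class name. The sketch below is one possible reading, not μRaptor's actual rules: the number thresholds (below 100 for `SMALL_NUMBER`, below 10000 for `MEDIUM_NUMBER`), the meaning of `ALPHA` as mixed-case letters, and the `KEEP_LITERAL` set are all assumptions chosen so that the output resembles the patterns shown.

```python
import re

# Assumed: a few frequent tokens are kept verbatim instead of being generalized.
KEEP_LITERAL = {"mailto", "com"}


def token_class(token):
    """Map one token to a class name; literals and punctuation are kept as-is."""
    if token in KEEP_LITERAL or not token.isalnum():
        return token
    if token.isdigit():
        value = int(token)
        if value < 100:          # assumed threshold
            return "SMALL_NUMBER"
        if value < 10000:        # assumed threshold
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"           # assumed: mixed-case letters
    return "ALPHANUMERIC"


def value_pattern(value):
    """Tokenize into alphanumeric runs and single punctuation marks, then classify."""
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", value)
    return " ".join(token_class(token) for token in tokens)


print(value_pattern("mailto:John@example42.com"))  # mailto : ALPHA @ ALPHANUMERIC . com
print(value_pattern("1985-07-23"))                 # MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
```

Patterns learned this way from training values could then serve as value constraints: a candidate value whose pattern was never seen for `vcard:email` or `vcard:bday` would be treated with suspicion.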

Page 9: μRaptor: A DOM-based system with appetite for hCard elements

Extraction Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

Value Constraints

Pattern Detection

CSS Selectors

Page 10: μRaptor: A DOM-based system with appetite for hCard elements

Extraction Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

Value Constraints

Pattern Detection

Elements Qualification

CSS Selectors

Page 11: μRaptor: A DOM-based system with appetite for hCard elements

Extraction Phase

Clean the HTML

DOM sub-trees

CSS class co-occurrence

Value Constraints

Pattern Detection

Elements Qualification

Models Validation

CSS Selectors

RDF Model From μRaptor

RDF Model Test set

?

= 0.94 = 0.7 = 0.8
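
The three scores above are unlabeled in the transcript. They are consistent with precision, recall and F1 (the harmonic mean of 0.94 and 0.7 is about 0.80), but that reading is an assumption. Under it, comparing the RDF model from μRaptor with the test-set model could be sketched as a set comparison over triples; the function name and the example triples below are hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Compare two collections of RDF triples, treated as sets."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical triples as (subject, predicate, object) tuples.
gold = {
    ("_:c1", "vcard:fn", "Ann Lee"),
    ("_:c1", "vcard:email", "mailto:ann@example.org"),
    ("_:c1", "vcard:bday", "1985-07-23"),
    ("_:c1", "vcard:org", "Example Ltd"),
}
extracted = {
    ("_:c1", "vcard:fn", "Ann Lee"),
    ("_:c1", "vcard:email", "mailto:ann@example.org"),
    ("_:c1", "vcard:tel", "555-0100"),
}
print(precision_recall_f1(extracted, gold))
```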

Page 12: μRaptor: A DOM-based system with appetite for hCard elements

μRaptor

https://github.com/emir-munoz/uraptor

Page 13: μRaptor: A DOM-based system with appetite for hCard elements

We made the discovery of the new μRaptor species, and I am very pleased that some researchers helped us understand its feeding habits.

Godzilla is a doll compared to μRaptor! I am currently working on a script for an upcoming movie.

As a kid I always wanted to see an actual dinosaur. Today my dream comes true.

Damn, he is better than me!