Children’s Hospital Informatics...
Transcript of Children’s Hospital Informatics...
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Expanding a Gazetteer-Based Approach forGeo-Parsing Disease Alerts
Mikaela Keller, John S. Brownstein, Clark C. Freifeld
Children’s Hospital Informatics Programat the Harvard-MIT Division of Health Sciences and Technology
fistname.lastname[at]childrens.harvard.edu
July 9th, 20081 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
MotivationHealthMap
2 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
MotivationHealthMap
I System of informal sources surveillance for automaticdetection of disease outbreak alerts.
I Provides automatic filtering into several taxonomies andmapping into world map.
I Who is interested in this information?I ≈ 20,000 unique visitors per month from around the
world.I Cited as a resource on: UN.org, FDA.gov,
EpiCentro.iss.it, PBS.org and a variety academic Websites
I Website usage tracking → Most avid users:government-related domains, including WHO, CDC,ECDC, etc.
3 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
MotivationHealthMap
4 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geo-parsing
An Extension of the Gazetteer-based Approach
Experiments
5 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geo-parsing
An Extension of the Gazetteer-based Approach
Experiments
6 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geographical Mapping
I Geo-indexing: Associate a geographic index(ie lat/long) to a text
I Gazetteer: List of geographic keywords and phraseswith their associated index
I Geo-parsing: Extraction of geographically relevantkeywords and phrases from a document
7 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geographical Mapping
I Geo-indexing: Associate a geographic index(ie lat/long) to a text
I Gazetteer: List of geographic keywords and phraseswith their associated index
I Geo-parsing: Extraction of geographically relevantkeywords and phrases from a document
brazil
brazil nut
canada
canada goose
england
queen alexandra hospital
inserm
taiwan cdc
cdc
100
193
106
6
0
12
0
279
279
137
137
137
137
137
186186
153
153
kabul
afghan officials
taliban
province of badakhshan
bangladesh
japan bangladesh hospital
afghanistan
aussies announce
australia
geographical reference country code
7 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geographical Mapping
I Geo-indexing: Associate a geographic index(ie lat/long) to a text
I Gazetteer: List of geographic keywords and phraseswith their associated index
I Geo-parsing: Extraction of geographically relevantkeywords and phrases from a document
newspaper The Sun reports that a staff member was rushed to The
British Maranello.employee is quoted as saying from"We’re not happy," a
Ferrari F1 factoryDisease outbreak at
Posted 19 May 2008 at 08:25 GMT
working despite an outbreak of the infectious disease tuberculosis.
The Sun said the staff member was a carbon fibre specialist, and Ferrari
It is reported that the 25 technicians whose job it is to build Kimi Räikkönen
tests to see if they have also been infected.
"We think the lab should be shut down until we get an all−clear. It feels as if
Ferrari are putting cars before people."
Ferrari said Räikkönen and Massa will not need tests.
and Felipe Massa’s cars for each grand prix are currently undergoing blood
confirmed that he has a "mild form" of tuberculosis.
untreated − kills up to half of its victims.
hospital and diagnosed with TB, a common and deadly disease that − if
Ferrari’s Formula One factory Italy Workers at in have been told to keep
British
7 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Human Geo-parsing
Investigations into measles outbreak in Long Urun in Belagadistrict reveal the cases originated from Penans living incramped and unsanitary conditions.
I Human reader rely both on lexical and contextual info.
I Interesting features: syntax, capitalization, voweldistribution, etc.
8 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Automatic Geo-parsing
I Geo-parsing can be seen as an Information Extractionproblem.
I Recent approaches to IE (Named Entity Recognition,Semantic Role Labelling, etc.) rely on heavily annotateddataset (expensive to obtain).
I HealthMap so far:I purposely crafted gazetteer and look up tree algorithm,I country resolution → higher resolution,
I Question: Can we do something before datasetannotation?
→ Gazetteer contains tons of info.
9 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
HealthMap Gazetteer
brazil
brazil nut
canada
canada goose
england
queen alexandra hospital
inserm
taiwan cdc
cdc
100
193
106
6
0
12
0
279
279
137
137
137
137
137
186186
153
153
kabul
afghan officials
taliban
province of badakhshan
bangladesh
japan bangladesh hospital
afghanistan
aussies announce
australia
geographical reference country code
10 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Main Idea
I Use the gazetteer to label a dataset with locations.
I Augment the dataset with extra linguistic information(Part-of-Speech tags, capitalization status, etc.).
I Hide the lexical information → Generalize on thecontext.
NNS IN VBP VBG NN INRB DT NN NNS
Health authorities in are closely monitoring an upsurge of dengue fever cases.NNP
NNP NNS IN VBP VBG NN INRB DT NN NNS
Health authorities in are closely monitoring an upsurge of dengue fever cases.NNP
Sentence:
Train sentence:
NNP
New Caledonia
11 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geo-parsing
An Extension of the Gazetteer-based Approach
Experiments
12 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Word Representation
ix =
capitalizationstatus
0, ... 0, 1, 0, ... 0, 0, ... 0, 1, 0,... 0, 0, 1, 0
dictionary index
part−of−speech
index
I dictionary index i ∈ 0, . . . , |D|I part-of-speech index (obtained from NEC’s SENNA
part-of-speech tagger, reported acc = 96.85%)
I capitalization status: first, all or none
13 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Hiding words
I Location words are less frequent than typical words
I ⇒ Threshold on frequency of words: hide less frequentwords
0 5 10 15 20Minimum term frequency
0
5
10
15
20
25
30
35
40
Perc
en
tag
e
hidden wordshidden location terms
14 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
A Neural Network
φ φ φφ
xi−hw xi+hwxi−hw+1 xi
ψ
P(loc|x) P(none|x)
I NN(i , x) = P(yi |xi−hw , . . . , xi , . . . , xi+hw )
I Inspired by (Bengio et al JMLR2003, Collobert&WestonACL2007)
15 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
A Neural Network
φ φ φφ
xi−hw xi+hwxi−hw+1 xi
ψ
P(loc|x) P(none|x)
I φ : xi → φ(xi ) ∈ Rd ← 1st Multi-Layer Perceptron.
I φ(xi ) compact representation of xi learned duringtraining.
I ψ : Rd×n 7→ Ω ← 2nd Multi-Layer Perceptron.15 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Geo-parsing
An Extension of the Gazetteer-based Approach
Experiments
16 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Training
I Two training sets built from disease outbreak alertsretrieved by HealthMap in 2007:
I T0: 1000 alerts (354,898 tagged words)I T1: 2500 alerts (1,130,139 tagged words)
I Sub-dictionary sizes:
min. freq. 0 4 10 20
T0 dict. size 15,000 5,000 3,000 1,700
T1 dict. size 25,300 9,500 5,900 3,900
I A validation set: 1000 alerts (465,297 words to tag,11,169 with target loc)
I Trained several neural networks for each min. freq.threshold, with tuned number of hidden units, fixedwindow size=9, and fixed number of dimension ofφ(x)=10
17 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Training set increase
0 5 10 15 20Minimium term frequency
0.0
0.2
0.4
0.6
0.8
1.0
F-s
core
loc (T0 trained)loc (T1 trained)
18 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Training set increase
0 5 10 15 20Minimium term frequency
0.0
0.2
0.4
0.6
0.8
1.0
F-s
core
’loc’ wordshidden ’loc’ wordsvisible ’loc’ words
I ’none’ target F-score ≈ 99% (≈ 95% for hidden ’none’words).
19 / 20
GazetteerExpansion
M. Keller,J. Brownstein,
C. Freifeld
Motivation
Geo-parsing
GazetteerExtension
Experiments
Conclusion
Conclusion
I We presented a method that transfers prior knowledgeencoded in a rule-based approach into a statisticallearning approach.
I It is able to discover/recover geographic referencesbased on the context in which they appear.
I The performance increases with training set size despitethe fabricated nature of the data.
I Future Work:I Test on manually labeled data.I Integrate to a more conventional method for
geo-parsing relying on a labeled dataset.I Add more complex features such as Semantic Role
Labels...
20 / 20