Children’s Hospital Informatics...

23
Gazetteer Expansion M. Keller, J. Brownstein, C. Freifeld Motivation Geo-parsing Gazetteer Extension Experiments Conclusion Expanding a Gazetteer-Based Approach for Geo-Parsing Disease Alerts Mikaela Keller, John S. Brownstein, Clark C. Freifeld Children’s Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology fistname.lastname[at]childrens.harvard.edu July 9th, 2008 1 / 20

Transcript of Children’s Hospital Informatics...

Page 1: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Expanding a Gazetteer-Based Approach forGeo-Parsing Disease Alerts

Mikaela Keller, John S. Brownstein, Clark C. Freifeld

Children’s Hospital Informatics Programat the Harvard-MIT Division of Health Sciences and Technology

fistname.lastname[at]childrens.harvard.edu

July 9th, 20081 / 20

Page 2: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

MotivationHealthMap

2 / 20

Page 3: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

MotivationHealthMap

I System of informal sources surveillance for automaticdetection of disease outbreak alerts.

I Provides automatic filtering into several taxonomies andmapping into world map.

I Who is interested in this information?I ≈ 20,000 unique visitors per month from around the

world.I Cited as a resource on: UN.org, FDA.gov,

EpiCentro.iss.it, PBS.org and a variety academic Websites

I Website usage tracking → Most avid users:government-related domains, including WHO, CDC,ECDC, etc.

3 / 20

Page 4: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

MotivationHealthMap

4 / 20

Page 5: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geo-parsing

An Extension of the Gazetteer-based Approach

Experiments

5 / 20

Page 6: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geo-parsing

An Extension of the Gazetteer-based Approach

Experiments

6 / 20

Page 7: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geographical Mapping

I Geo-indexing: Associate a geographic index(ie lat/long) to a text

I Gazetteer: List of geographic keywords and phraseswith their associated index

I Geo-parsing: Extraction of geographically relevantkeywords and phrases from a document

7 / 20

Page 8: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geographical Mapping

I Geo-indexing: Associate a geographic index(ie lat/long) to a text

I Gazetteer: List of geographic keywords and phraseswith their associated index

I Geo-parsing: Extraction of geographically relevantkeywords and phrases from a document

brazil

brazil nut

canada

canada goose

england

queen alexandra hospital

inserm

taiwan cdc

cdc

100

193

106

6

0

12

0

279

279

137

137

137

137

137

186186

153

153

kabul

afghan officials

taliban

province of badakhshan

bangladesh

japan bangladesh hospital

afghanistan

aussies announce

australia

geographical reference country code

7 / 20

Page 9: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geographical Mapping

I Geo-indexing: Associate a geographic index(ie lat/long) to a text

I Gazetteer: List of geographic keywords and phraseswith their associated index

I Geo-parsing: Extraction of geographically relevantkeywords and phrases from a document

newspaper The Sun reports that a staff member was rushed to The

British Maranello.employee is quoted as saying from"We’re not happy," a

Ferrari F1 factoryDisease outbreak at

Posted 19 May 2008 at 08:25 GMT

working despite an outbreak of the infectious disease tuberculosis.

The Sun said the staff member was a carbon fibre specialist, and Ferrari

It is reported that the 25 technicians whose job it is to build Kimi Räikkönen

tests to see if they have also been infected.

"We think the lab should be shut down until we get an all−clear. It feels as if

Ferrari are putting cars before people."

Ferrari said Räikkönen and Massa will not need tests.

and Felipe Massa’s cars for each grand prix are currently undergoing blood

confirmed that he has a "mild form" of tuberculosis.

untreated − kills up to half of its victims.

hospital and diagnosed with TB, a common and deadly disease that − if

Ferrari’s Formula One factory Italy Workers at in have been told to keep

British

7 / 20

Page 10: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Human Geo-parsing

Investigations into measles outbreak in Long Urun in Belagadistrict reveal the cases originated from Penans living incramped and unsanitary conditions.

I Human reader rely both on lexical and contextual info.

I Interesting features: syntax, capitalization, voweldistribution, etc.

8 / 20

Page 11: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Automatic Geo-parsing

I Geo-parsing can be seen as an Information Extractionproblem.

I Recent approaches to IE (Named Entity Recognition,Semantic Role Labelling, etc.) rely on heavily annotateddataset (expensive to obtain).

I HealthMap so far:I purposely crafted gazetteer and look up tree algorithm,I country resolution → higher resolution,

I Question: Can we do something before datasetannotation?

→ Gazetteer contains tons of info.

9 / 20

Page 12: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

HealthMap Gazetteer

brazil

brazil nut

canada

canada goose

england

queen alexandra hospital

inserm

taiwan cdc

cdc

100

193

106

6

0

12

0

279

279

137

137

137

137

137

186186

153

153

kabul

afghan officials

taliban

province of badakhshan

bangladesh

japan bangladesh hospital

afghanistan

aussies announce

australia

geographical reference country code

10 / 20

Page 13: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Main Idea

I Use the gazetteer to label a dataset with locations.

I Augment the dataset with extra linguistic information(Part-of-Speech tags, capitalization status, etc.).

I Hide the lexical information → Generalize on thecontext.

NNS IN VBP VBG NN INRB DT NN NNS

Health authorities in are closely monitoring an upsurge of dengue fever cases.NNP

NNP NNS IN VBP VBG NN INRB DT NN NNS

Health authorities in are closely monitoring an upsurge of dengue fever cases.NNP

Sentence:

Train sentence:

NNP

New Caledonia

11 / 20

Page 14: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geo-parsing

An Extension of the Gazetteer-based Approach

Experiments

12 / 20

Page 15: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Word Representation

ix =

capitalizationstatus

0, ... 0, 1, 0, ... 0, 0, ... 0, 1, 0,... 0, 0, 1, 0

dictionary index

part−of−speech

index

I dictionary index i ∈ 0, . . . , |D|I part-of-speech index (obtained from NEC’s SENNA

part-of-speech tagger, reported acc = 96.85%)

I capitalization status: first, all or none

13 / 20

Page 16: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Hiding words

I Location words are less frequent than typical words

I ⇒ Threshold on frequency of words: hide less frequentwords

0 5 10 15 20Minimum term frequency

0

5

10

15

20

25

30

35

40

Perc

en

tag

e

hidden wordshidden location terms

14 / 20

Page 17: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

A Neural Network

φ φ φφ

xi−hw xi+hwxi−hw+1 xi

ψ

P(loc|x) P(none|x)

I NN(i , x) = P(yi |xi−hw , . . . , xi , . . . , xi+hw )

I Inspired by (Bengio et al JMLR2003, Collobert&WestonACL2007)

15 / 20

Page 18: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

A Neural Network

φ φ φφ

xi−hw xi+hwxi−hw+1 xi

ψ

P(loc|x) P(none|x)

I φ : xi → φ(xi ) ∈ Rd ← 1st Multi-Layer Perceptron.

I φ(xi ) compact representation of xi learned duringtraining.

I ψ : Rd×n 7→ Ω ← 2nd Multi-Layer Perceptron.15 / 20

Page 19: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Geo-parsing

An Extension of the Gazetteer-based Approach

Experiments

16 / 20

Page 20: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Training

I Two training sets built from disease outbreak alertsretrieved by HealthMap in 2007:

I T0: 1000 alerts (354,898 tagged words)I T1: 2500 alerts (1,130,139 tagged words)

I Sub-dictionary sizes:

min. freq. 0 4 10 20

T0 dict. size 15,000 5,000 3,000 1,700

T1 dict. size 25,300 9,500 5,900 3,900

I A validation set: 1000 alerts (465,297 words to tag,11,169 with target loc)

I Trained several neural networks for each min. freq.threshold, with tuned number of hidden units, fixedwindow size=9, and fixed number of dimension ofφ(x)=10

17 / 20

Page 21: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Training set increase

0 5 10 15 20Minimium term frequency

0.0

0.2

0.4

0.6

0.8

1.0

F-s

core

loc (T0 trained)loc (T1 trained)

18 / 20

Page 22: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Training set increase

0 5 10 15 20Minimium term frequency

0.0

0.2

0.4

0.6

0.8

1.0

F-s

core

’loc’ wordshidden ’loc’ wordsvisible ’loc’ words

I ’none’ target F-score ≈ 99% (≈ 95% for hidden ’none’words).

19 / 20

Page 23: Children’s Hospital Informatics Programprior-knowledge-language-ws.wdfiles.com/local--files/start/keller... · i-hw+1 x i x i+hw y P(loc|x) P(none|x) I φ: x i →φ(x i) ∈Rd

GazetteerExpansion

M. Keller,J. Brownstein,

C. Freifeld

Motivation

Geo-parsing

GazetteerExtension

Experiments

Conclusion

Conclusion

I We presented a method that transfers prior knowledgeencoded in a rule-based approach into a statisticallearning approach.

I It is able to discover/recover geographic referencesbased on the context in which they appear.

I The performance increases with training set size despitethe fabricated nature of the data.

I Future Work:I Test on manually labeled data.I Integrate to a more conventional method for

geo-parsing relying on a labeled dataset.I Add more complex features such as Semantic Role

Labels...

20 / 20