Wikipedia as controlled vocabulary

Post on 12-May-2015

description

The Essentials of Metadata and Taxonomy - Henry Stewart Event
The Next Wave: Using Wikipedia as a Controlled Vocabulary
* Leveraging an online resource for internal use
* Integrating pre-existing unique identification numbers (UIDs)
* Inherited relations
* Capturing and cataloging
* Risks and remedies
Chris Sizemore, BBC Future Technology & Media, and Silver Oliver, BBC Future Technology & Media

Transcript of Wikipedia as controlled vocabulary

Chris Sizemore, Silver Oliver, BBC

Wikipedia as controlled vocabulary

BBC Topic Page

I’m about ‘Victorians’

viktorianisch

V잊도 r 이안

Ελληνικά

NY Times, flickr, Wikipedia

Outside the BBC

BBC silo #1, BBC silo #2, BBC silo #3

An index language exists primarily to:

• Allow an indexer to represent the subject matter of documents in a consistent way

• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer

• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate

F.W. Lancaster, Vocabulary control for information retrieval

Could Wikipedia be used as a universal language for identifying subjects?

Story of Wikipedia-as-CV

Story of Wikipedia-as-CV: personal origins

We needed a system to categorise movie & TV reviews

So of course we built a categorisation system from scratch -- including its own controlled vocab

And when people saw the system, they always said: “Hey, that reminds me of Internet Movie Database…”

It struck me that the way Internet Movie Database is set up isn’t dissimilar to the structure of a thesaurus or a very flat taxonomy…

But it’s one where the emphasis is on “related to”, not broader/narrower, synonym, antonym, etc

From then, I couldn’t help but be drawn to websites where the structure is clearly: “a single primary Concept per page -- and pages for related Concepts link to each other”

Could those “one Concept per page” webpages be used as “terms” as in a controlled vocabulary?

Are some websites actually “indexing languages” in disguise?
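
The “one Concept per page” structure can be sketched as a very flat vocabulary in which each page URL acts as a term identifier and the only relation is “related to”. This is an illustrative sketch only: the `Concept` class, the `link` helper, and the sample entries are assumptions, not part of any system described in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Concept:
    """One page = one term. The page URL doubles as the term's unique ID."""
    url: str
    label: str
    related: Set[str] = field(default_factory=set)  # URLs of linked pages


def link(vocab: Dict[str, Concept], a: str, b: str) -> None:
    """Record a symmetric "related to" relation between two concept pages."""
    vocab[a].related.add(b)
    vocab[b].related.add(a)


# Hypothetical sample entries, keyed by page URL.
vocab = {
    c.url: c
    for c in [
        Concept("http://en.wikipedia.org/wiki/Science_fiction", "Science fiction"),
        Concept("http://en.wikipedia.org/wiki/Time_travel", "Time travel"),
        Concept("http://en.wikipedia.org/wiki/BBC", "BBC"),
    ]
}
link(vocab,
     "http://en.wikipedia.org/wiki/Science_fiction",
     "http://en.wikipedia.org/wiki/Time_travel")

print(sorted(vocab["http://en.wikipedia.org/wiki/Time_travel"].related))
```

There is deliberately no broader/narrower hierarchy here, which mirrors the emphasis on “related to” links.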

conText -- a Wikipedia-as-CV auto-categoriser prototype: http://sells.welcomebackstage.com:5000/item/submit

Demo of conText -- a Wikipedia-as-CV auto-categoriser prototype: Take text from audience!

Wikipedia is already being used across the Web as a form of subject identification & disambiguation, in a grassroots way: in the form of hyperlinks embedded by authors in blog posts, news articles, music reviews, etc everywhere!

http://en.wikipedia.org/wiki/British

http://en.wikipedia.org/wiki/Science_fiction

http://en.wikipedia.org/wiki/BBC

http://en.wikipedia.org/wiki/Time_travel

http://en.wikipedia.org/wiki/Dr_who

http://en.wikipedia.org/wiki/Tardis

These days, by convention, when you link to Wikipedia from your webpage, more than saying “go and have a look at this other page”, you are more likely giving a definition to a concept referred to in your content…
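
Links used this way can be harvested mechanically and treated as subject identifiers for the page that contains them. The following is a minimal sketch using only the Python standard library; the class name, the sample HTML fragment, and the rule for recognising an article link are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class WikipediaLinkCollector(HTMLParser):
    """Collect Wikipedia article URLs found in <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.subjects = []  # URLs in order of first appearance

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlparse(href)
        # Assumption: an article link looks like <lang>.wikipedia.org/wiki/<Title>.
        is_article = (parts.netloc.endswith(".wikipedia.org")
                      and parts.path.startswith("/wiki/"))
        if is_article and href not in self.subjects:
            self.subjects.append(href)


# Hypothetical blog-post fragment.
html = (
    '<p>The <a href="http://en.wikipedia.org/wiki/BBC">BBC</a> series about '
    '<a href="http://en.wikipedia.org/wiki/Time_travel">time travel</a> and the '
    '<a href="http://en.wikipedia.org/wiki/Tardis">Tardis</a>; see also '
    '<a href="http://example.com/fan-page">this fan page</a>.</p>'
)

collector = WikipediaLinkCollector()
collector.feed(html)
print(collector.subjects)
```

The non-Wikipedia link is ignored, so only the three concept URLs are kept as subject identifiers.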

Also used in this way for specific domains are Internet Movie Database (for films & TV programmes), MySpace (for bands), Amazon (for books), etc

For general knowledge, though, Wikipedia is becoming the Web’s de facto controlled vocabulary

http://en.wikipedia.org/wiki/Heerlen

http://en.wikipedia.org/wiki/Beethoven

http://en.wikipedia.org/wiki/Amsterdam

http://en.wikipedia.org/wiki/Van_Gogh_Museum

Wikipedia pages provide the best scope notes in the world

Wikipedia-as-CV benefits from being developed through a social process, maintained and kept current by the Wikipedia community

Each concept represents a consensus view and its meaning can be understood simply by reading the associated Wikipedia page

For each Concept, the document edit history, discussion around concept definition, & debate are important here…

So, we can tag pretty accurately semi-automatically with globally unique subject identifiers using this approach…

So what?

Un-silo your content repository quickly and cheaply, by connecting it to the Web via Wikipedia
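
As a rough illustration of tagging and un-siloing, the sketch below suggests Wikipedia URLs for items by naive phrase matching and then groups items from separate repositories under the identifiers they share. It is not how conText works (the talk does not describe its internals); the vocabulary, the silo names, the item texts, and the matching rule are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mini-vocabulary: surface phrase -> Wikipedia URL (the subject ID).
VOCAB = {
    "time travel": "http://en.wikipedia.org/wiki/Time_travel",
    "science fiction": "http://en.wikipedia.org/wiki/Science_fiction",
    "beethoven": "http://en.wikipedia.org/wiki/Beethoven",
}


def suggest_tags(text):
    """Suggest subject IDs by naive case-insensitive phrase matching.

    The suggestions are meant for an editor to confirm ("semi-automatic").
    """
    lowered = text.lower()
    return sorted(url for phrase, url in VOCAB.items() if phrase in lowered)


# Hypothetical items held in two separate repositories ("silos").
silos = {
    "tv-reviews": {"r1": "A science fiction drama built around time travel."},
    "radio-archive": {
        "a7": "A documentary on time travel in popular culture.",
        "a9": "Beethoven's late string quartets.",
    },
}

# Index every item by subject ID; a shared ID links items across silos.
by_subject = defaultdict(list)
for silo, items in silos.items():
    for item_id, text in items.items():
        for url in suggest_tags(text):
            by_subject[url].append((silo, item_id))

print(by_subject["http://en.wikipedia.org/wiki/Time_travel"])
```

Because both repositories use the same external identifier, the TV review and the radio documentary end up under one subject without either system knowing about the other.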

Now playing vs. the Web

Why not bring BBC Archive materials into this service via Wikipedia-as-CV tagging and a linked data bridge between Wikipedia & MusicBrainz?

By using Wikipedia-as-CV, you can get your repository onto this diagram quickly, for free

An index language exists primarily to:

• Allow an indexer to represent the subject matter of documents in a consistent way

• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer

• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate

F.W. Lancaster, Vocabulary control for information retrieval

A Web-scale, globally accessible index language accidentally exists:

• It encourages multiple indexers across the Web to represent the subject matter of any content in a consistent way

• It brings the vocabulary used by info seekers into coincidence with the vocabulary used by indexers -- the searchers ARE indexers, and vice versa

• It provides means whereby a searcher can modulate a search and/or browse strategy to attain comprehensive or selective results as user needs dictate

• It adds Web-scale navigation & cross-reference possibilities
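
One way to picture “modulating” a search between selective and comprehensive results is to match either on a single subject identifier or on that identifier plus its directly related concepts. This is a sketch under stated assumptions: the `RELATED` links, the tagged `ITEMS`, and the `search` function are hypothetical and stand in for whatever link data and index a real repository would have.

```python
# Hypothetical "related to" links between concepts, keyed by Wikipedia URL.
RELATED = {
    "http://en.wikipedia.org/wiki/Dr_who": {
        "http://en.wikipedia.org/wiki/Tardis",
        "http://en.wikipedia.org/wiki/Time_travel",
    },
}

# Hypothetical tagged items: item ID -> set of subject IDs.
ITEMS = {
    "clip-1": {"http://en.wikipedia.org/wiki/Dr_who"},
    "clip-2": {"http://en.wikipedia.org/wiki/Tardis"},
    "clip-3": {"http://en.wikipedia.org/wiki/Beethoven"},
}


def search(subject, comprehensive=False):
    """Return IDs of items tagged with `subject`.

    With comprehensive=True, items tagged with directly related concepts
    also match, trading precision for recall.
    """
    wanted = {subject}
    if comprehensive:
        wanted |= RELATED.get(subject, set())
    return sorted(item for item, tags in ITEMS.items() if tags & wanted)


dr_who = "http://en.wikipedia.org/wiki/Dr_who"
print(search(dr_who))                      # selective
print(search(dr_who, comprehensive=True))  # comprehensive
```

The selective call returns only the item tagged with the exact concept; the comprehensive call also picks up the item tagged with a related concept.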

Wikipedia as controlled vocabulary

Wikipedia is a controlled vocabulary

Much thanks!

Questions, comments, & constructive criticism?

http://flickr.com/photos/deniscollette/1817034358/