Post on 12-May-2015
Chris Sizemore, Silver Oliver, BBC
Wikipedia as controlled vocabulary
[Diagram labels: BBC Topic Page ("I’m about ‘Victorians’"); BBC silo #1, BBC silo #2, BBC silo #3; outside the BBC: NY Times, flickr, Wikipedia; labels in other languages, e.g. "viktorianisch", "Ελληνικά"]
An index language exists primarily to:
• Allow an indexer to represent the subject matter of documents in a consistent way
• Bring the vocabulary used by the searcher into coincidence with the vocabulary used by the indexer
• Provide means whereby a searcher can modulate the search strategy to attain comprehensive or selective results as user needs dictate
F.W. Lancaster, Vocabulary control for information retrieval
Could Wikipedia be used as a universal
language for identifying subjects?
Story of Wikipedia-as-CV
Story of Wikipedia-as-CV: personal origins
We needed a system to categorise movie & TV
reviews
So of course we built a categorisation system from scratch -- including its own
controlled vocab
And when people saw the system, they always said: “Hey, that reminds me of
Internet Movie Database…”
It struck me that the way Internet Movie Database is set up isn’t dissimilar to the structure of a
thesaurus or a very flat taxonomy…
But it’s one where the emphasis is on “related to”, not broader/narrower,
synonym, antonym, etc
Story of Wikipedia-as-CV: personal origins
From then, I couldn’t help but be drawn to websites where the structure
is clearly:
Story of Wikipedia-as-CV: personal origins
From then, I couldn’t help but be drawn to websites where the structure
is clearly: “a single primary Concept per page --
and pages for related Concepts link to each other”
Could those “one Concept per page” webpages be used as “terms” as in a
controlled vocabulary?
Are some websites actually “indexing
languages” in disguise?
conText -- a Wikipedia-as-CV auto-categoriser prototype:
http://sells.welcomebackstage.com:5000/item/submit
Demo of conText -- a Wikipedia-as-CV auto-categoriser
prototype:
Take text from audience!
Wikipedia is already being used across the Web as a form of
subject identification & disambiguation, in a grassroots
way:
in the form of hyperlinks embedded by authors in blog
posts, news articles, music reviews, etc everywhere!
http://en.wikipedia.org/wiki/British
http://en.wikipedia.org/wiki/Science_fiction
http://en.wikipedia.org/wiki/BBC
http://en.wikipedia.org/wiki/Time_travel
http://en.wikipedia.org/wiki/Dr_who
http://en.wikipedia.org/wiki/Tardis
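Links like the ones above can be read back out of a page as subject tags. What follows is only a minimal sketch of that idea, using the Python standard library; it is not how conText works. The class name `WikipediaLinkCollector`, the helper `subjects_of` and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaLinkCollector(HTMLParser):
    """Collect Wikipedia article titles from <a href="..."> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        host = parsed.netloc
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        # Only links into the /wiki/ article namespace count as subject identifiers.
        if is_wikipedia and parsed.path.startswith("/wiki/"):
            title = unquote(parsed.path[len("/wiki/"):])
            if title and title not in self.subjects:
                self.subjects.append(title)


def subjects_of(html):
    """Return the Wikipedia article titles an author linked to, in order of appearance."""
    collector = WikipediaLinkCollector()
    collector.feed(html)
    return collector.subjects


# Invented sample content; the Wikipedia URLs are the ones shown on the slide.
sample_post = (
    '<p>The <a href="http://en.wikipedia.org/wiki/BBC">BBC</a> series about '
    '<a href="http://en.wikipedia.org/wiki/Time_travel">time travel</a> in a '
    '<a href="http://en.wikipedia.org/wiki/Tardis">blue box</a>; see also '
    '<a href="http://example.com/fan-page">this fan page</a>.</p>'
)

print(subjects_of(sample_post))  # ['BBC', 'Time_travel', 'Tardis']
```

The non-Wikipedia link is ignored, so only the author's own concept links end up as tags.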
These days, by convention, when you link to Wikipedia from your webpage, more than saying “go and have a look at this other
page”, you are more likely giving a definition to a concept referred to in your content…
Also used in this way for specific domains are Internet Movie Database (for films & TV
programmes), MySpace (for bands), Amazon (for books), etc
For general knowledge, though,
Wikipedia is becoming the Web’s de facto
controlled vocabulary
http://en.wikipedia.org/wiki/Heerlen
http://en.wikipedia.org/wiki/Beethoven
http://en.wikipedia.org/wiki/Amsterdam
http://en.wikipedia.org/wiki/Van_Gogh_Museum
Wikipedia pages provide the best scope
notes in the world
Wikipedia-as-CV benefits from being developed through a social process, maintained and kept
current by the Wikipedia community
Each concept represents a consensus view and its meaning can be understood simply by reading the
associated Wikipedia page
For each Concept, the page’s edit history and the discussion & debate around the concept’s definition are
important here…
So, we can tag semi-automatically, and pretty accurately, with globally
unique subject identifiers using this approach…
So what?
Un-silo your content repository quickly and cheaply, by connecting
it to the Web via Wikipedia
Now playing vs. the Web
Why not bring BBC Archive materials into this service via Wikipedia-as-CV tagging and a linked data bridge between Wikipedia & MusicBrainz?
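The bridge idea can be sketched as a simple join. This is a toy illustration under stated assumptions, not the BBC's implementation. The mapping `WIKIPEDIA_TO_MUSICBRAINZ`, the archive item names and the identifier strings are all hypothetical placeholders; only the Wikipedia URLs come from the slides.

```python
# Hypothetical bridge: Wikipedia article URL -> MusicBrainz artist identifier.
# The identifier below is a placeholder, not a real MusicBrainz ID.
WIKIPEDIA_TO_MUSICBRAINZ = {
    "http://en.wikipedia.org/wiki/Beethoven": "mbid-placeholder-beethoven",
}

# Hypothetical archive items, each tagged with Wikipedia URLs as subject identifiers.
ARCHIVE_TAGS = {
    "archive/programme-001": [
        "http://en.wikipedia.org/wiki/Beethoven",
        "http://en.wikipedia.org/wiki/Amsterdam",
    ],
    "archive/programme-002": [
        "http://en.wikipedia.org/wiki/Van_Gogh_Museum",
    ],
}


def archive_items_for_artist(mbid):
    """Find archive items whose Wikipedia tags bridge to the given artist identifier."""
    wanted = {url for url, m in WIKIPEDIA_TO_MUSICBRAINZ.items() if m == mbid}
    return sorted(
        item for item, tags in ARCHIVE_TAGS.items() if wanted.intersection(tags)
    )


# A "now playing" service that knows only the artist identifier can still
# reach archive material, because both sides share the Wikipedia URL.
print(archive_items_for_artist("mbid-placeholder-beethoven"))  # ['archive/programme-001']
```

Neither repository needs to know about the other's internal IDs; the shared Wikipedia URL does the joining.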
By using Wikipedia-as-CV, you can get your
repository onto this diagram quickly,
for free
A Web-scale, globally accessible index language accidentally exists:
• It encourages multiple indexers across the Web to represent the subject matter of any content in a consistent way
• It brings the vocabulary used by info seekers into coincidence with the vocabulary used by indexers -- the searchers ARE indexers, and vice versa
• It provides means whereby a searcher can modulate a search and/or browse strategy to attain comprehensive or selective results as user needs dictate
• It adds Web-scale navigation & cross-reference possibilities
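The properties listed above can be sketched with a small index keyed by Wikipedia URLs. This is an illustrative toy, not something taken from the talk. The class `ConceptIndex`, its `mode` parameter and the item names are assumptions; "comprehensive" is modelled here as matching any of the concepts and "selective" as matching all of them.

```python
from collections import defaultdict

BBC = "http://en.wikipedia.org/wiki/BBC"
SCIFI = "http://en.wikipedia.org/wiki/Science_fiction"
TIME_TRAVEL = "http://en.wikipedia.org/wiki/Time_travel"


class ConceptIndex:
    """Toy index mapping Wikipedia concept URLs to tagged items (hypothetical)."""

    def __init__(self):
        self._items_by_concept = defaultdict(set)

    def tag(self, item, *concept_urls):
        # Any indexer, anywhere, uses the same URLs for the same concepts.
        for url in concept_urls:
            self._items_by_concept[url].add(item)

    def search(self, concept_urls, mode="comprehensive"):
        sets = [self._items_by_concept.get(url, set()) for url in concept_urls]
        if not sets:
            return set()
        if mode == "comprehensive":
            return set().union(*sets)       # items about ANY of the concepts
        if mode == "selective":
            return set.intersection(*sets)  # items about ALL of the concepts
        raise ValueError("mode must be 'comprehensive' or 'selective'")


index = ConceptIndex()
# Invented items from different "silos", tagged with the same vocabulary.
index.tag("blog/doctor-who-review", SCIFI, TIME_TRAVEL)
index.tag("news/bbc-archive-story", BBC, SCIFI)
index.tag("silo-2/programme-page", BBC)

print(sorted(index.search([BBC, SCIFI], mode="comprehensive")))
print(sorted(index.search([BBC, SCIFI], mode="selective")))
```

The first search returns all three items and the second only `news/bbc-archive-story`, which is the "modulate the search strategy" idea in miniature.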
Chris Sizemore, Silver Oliver, BBC
Wikipedia is a controlled vocabulary
Much thanks!
Questions, comments, & constructive criticism?
http://flickr.com/photos/deniscollette/1817034358/