Introduction to...
Date posted: 21-Jan-2016
: , , : (1) , (2)
keywords? (semantics) ! :
30 : (classification) (categorization) : ( ) :
Web search engines, digital libraries, peer to peer, Web services, data mining
50-60
1945: Vannevar Bush's "As We May Think"
1960+: Gerard Salton
1978: ACM SIGIR
1992: TREC
Unstructured (text) vs. structured (database) data in 1996
Unstructured (text) vs. structured (database) data in 2006
Computer-Centered View
Human-Centered View
(Retrieval) (Browsing) (Hidden web)
format ;pdf/word/excel/html? ; ;
; ; e-mail; email ;o ;
( )
Queries
stop lists
User Interface
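IR model: a quadruple [D, Q, F, R(qi, dj)], where:
1) D is a set of logical views (representations) of the documents in the collection.
2) Q is a set of logical views (representations) of the user information needs, i.e. the queries.
3) F is a framework for modeling document representations, queries, and their relationships.
4) R(qi, dj) is a ranking function that associates a real number with a query qi in Q and a document dj in D. The ranking defines an ordering of the documents with respect to the query qi.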
Inverted file: a structure for efficiently locating the occurrences of a term inside a text collection.
Structure: a set of inverted lists that are stored in a file on disk.
Inverted list: a list that contains the occurrences of a term inside the texts of a collection.
Structure of an inverted list [3]: the number of documents in the inverted list that contain the specific term, followed by (document, frequency) pairs. For example, the pair (1, 2) means that the term appears twice in document 1.
Depending on the requirements of the application, an inverted list record can contain various kinds of information (e.g. the number of the paragraph where the term appears).
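The inverted-list structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the cited source: the function name `build_inverted_file`, the sample documents, and the whitespace tokenisation are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """Map each term to its inverted list of (doc_id, term_frequency) pairs."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        # Naive whitespace tokenisation (an assumption); a real system would
        # also normalise terms and apply stop lists.
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            inverted[term].append((doc_id, tf))
    return dict(inverted)


# Hypothetical three-document collection.
docs = {
    1: "web search engines index the web",
    2: "digital libraries index documents",
    3: "peer to peer search",
}
index = build_inverted_file(docs)

for term in ("web", "index", "search"):
    postings = index[term]
    # Number of documents containing the term, then the (document, frequency) pairs.
    print(term, len(postings), postings)
```

Here the entry for "web" holds one pair saying the term occurs twice in document 1. In-memory dictionaries stand in for the on-disk file that the slide mentions.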
[Figure: Algorithm for inverted file creation. A document collection (d1, d2, d3) over terms t1-t5 is processed, mapping terms to inverted lists in the inverted file.]
WWWurl(.. Yahoo) ,
URL = Uniform Resource Locator: access method, host name, page name
http://www.ceid.upatras.gr/ir/
2-10B , 8-12 : 10-100
http://www.netcraft.com/Survey
/: , W3C : 55 : 82%, 15: 13% : , , (marketing) (30-40% ) ? ~8 / Bow-tie
: : + :
Informational (~40%)
Navigational (~25%)
Transactional (web-mediated) (~35%)
Gray areas: find a good hub; exploratory search, "see what's there"
(html, xml), mp3, images, video, ... = database access, the invisible web, proprietary content, etc.
(80% ) bandwidth , feedback,
: .. Google crawl anchor-text. : ( n , ...) crawling.
: URL . : URLs
URLs 20,000 URLs random conjunctive query
Choose random searches extracted from a local log, or build random searches.
Use only queries with small result sets.
Count normalized URLs in result sets.
Use ratio statistics.
Advantage: might be a good reflection of the human perception of coverage.
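For random pages p1, p2: Pr[p1 reachable from p2] ~ 1/4
Maximum distance between 2 nodes in the SCC: > 28
Average directed distance between 2 nodes: ~16
Average undirected distance: ~7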
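Power Laws
Two quantities x and y follow a power law if $y \propto x^{-c}$, i.e. $\log y = -c \log x + \text{const}$.
Zipf's law: with y the frequency of a word and x its rank, $y \propto x^{-c}$.
Power law with c = 1: $y \propto 1/x$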
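Because a power law is a straight line in log-log space, the exponent c can be estimated as the negated slope of a least-squares fit of log y against log x. The sketch below illustrates this on synthetic data; the function name `fit_power_law_exponent` and the sample values are assumptions, and a plain least-squares fit is only one of several possible estimators.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Return c such that y is approximately proportional to x**(-c)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least-squares slope of log y against log x.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    return -sxy / sxx


# Synthetic data: an exact Zipf-like curve (c = 1) and a steeper curve.
ranks = list(range(1, 101))
zipf_like = [1000.0 / x for x in ranks]
steeper = [1000.0 * x ** -2.09 for x in ranks]

c_zipf = fit_power_law_exponent(ranks, zipf_like)
c_steep = fit_power_law_exponent(ranks, steeper)
print(round(c_zipf, 3), round(c_steep, 3))
```

On real, noisy data the fitted value depends heavily on how the tail is binned, so the result should be read as a rough estimate.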
Power laws on the Web? (Broder et al., 1999)
x = number of in-links, y = number of pages with x in-links: $y \propto x^{-2.09}$
x = number of out-links, y = number of pages with x out-links: $y \propto x^{-2.72}$
Web Web Web
Web graph model (Kumar et al., "Stochastic Models for the Web Graph", FOCS 2000): vertex v_{t+1} is added to the Web graph of time t.
Web t+1 d d>1 - ?
vt+1 1- i- v
Web .
Power laws!To d :
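The fragments above appear to describe the copying model from the cited Kumar et al. paper, in which each new page receives d out-links that are either chosen at random or copied from an existing page. The sketch below assumes that reading. The parameter name `alpha`, its value, the seed graph, and the function name `copying_model` are all illustrative assumptions rather than details taken from the slides.

```python
import random
from collections import Counter


def copying_model(n_pages, d=3, alpha=0.2, seed=0):
    """Grow a graph one page at a time; every page has exactly d out-links."""
    rng = random.Random(seed)
    # Assumed seed graph: a single page whose d out-links point to itself.
    out_links = [[0] * d]
    for new_page in range(1, n_pages):
        prototype = rng.randrange(new_page)  # existing page chosen uniformly
        links = []
        for i in range(d):
            if rng.random() < alpha:
                # With probability alpha: a uniformly random existing page.
                links.append(rng.randrange(new_page))
            else:
                # With probability 1 - alpha: copy the i-th link of the prototype.
                links.append(out_links[prototype][i])
        out_links.append(links)
    return out_links


graph = copying_model(5000)
in_degree = Counter(dst for links in graph for dst in links)
# Pages sorted by in-degree; copying tends to concentrate links on a few pages.
print(in_degree.most_common(5))
```

Plotting the in-degree counts on log-log axes is one way to check informally whether the simulated graph shows the heavy tail that the slide attributes to this model.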
Enterprise search, Peer-2-Peer (P2P) search
Peer-to-Peer: Gnutella, Kazaa, Bearshare, Aimster, Grokster, Morpheus
- on-page
- off-page, web-specific: link (connectivity), click-through (what users click on), anchor text
? context
Boolean: terms (exact, prefix, phrase); operators (AND, OR, AND NOT, NEAR); fields (TITLE:, URL:, HOST:); AND
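The Boolean operators AND, OR and AND NOT map directly onto set operations over the document sets of each term. The following is a minimal sketch under that assumption; the postings are made-up data, `lookup` is a hypothetical helper, and NEAR and the field restrictions are not covered because they need positional and per-field information.

```python
# Hypothetical postings: term -> set of document ids containing it.
postings = {
    "web": {1, 2, 4},
    "search": {1, 3, 4},
    "engine": {1, 4, 5},
    "library": {2, 5},
}


def lookup(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return postings.get(term, set())


# web AND search AND NOT library
both_not_library = (lookup("web") & lookup("search")) - lookup("library")

# library OR engine
either = lookup("library") | lookup("engine")

print(sorted(both_not_library), sorted(either))
```

Production engines usually intersect sorted posting lists by merging rather than building sets, but the logic is the same.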
TF : TF, keywords, , (headers), ... IDF : IDF, corpus, query log,
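As an illustration of how TF and IDF combine, the sketch below uses one common weighting, tf * log(N / df), where N is the number of documents and df the number of documents containing the term. This particular formula, the function name `tf_idf`, and the sample corpus are assumptions; the slides do not say which variant is intended.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for words in tokens.values() for term in set(words))
    weights = {}
    for doc_id, words in tokens.items():
        tf = Counter(words)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


# Hypothetical corpus.
corpus = {
    1: "web search web",
    2: "web mining",
    3: "link analysis",
}
weights = tf_idf(corpus)
print(weights[1])
```

A term that occurs in every document gets weight zero under this variant, which is why very common words contribute little to ranking.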
- off-page, web-specific
- Link (connectivity)
- Click-through
- Anchor-text
Crawling- corpus
Query language determination and different ranking
Integration of Search and Text Analysis
Context determination:
- spatial (user location/target location)
- query stream (previous queries)
- personal (user profile)
Context use:
- Result restriction
- Ranking modulation
(Crawling) ;
; (refresh policy)
;
;
Crawler . Crawler Crawler . Crawler . Crawlers (Crawling)
Crawling - Interest Driven
Crawling - Interest Driven: "Focused crawling: a new approach to topic-specific Web resource discovery", Chakrabarti et al., 8th WWW Conference, 1999. If Q is the user interest then:
Crawling - Popularity Driven, Location Driven
Context Graph:
- A context graph is created for each seed document.
- The root is the seed document.
- Nodes at each level are documents with links to documents at the next higher level.
- The graph is updated during the crawl itself.
Approach:
- Construct the context graph and classifiers, using the seed documents as training data.
- Perform crawling using the classifiers and the context graph created.
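The layered structure of a context graph can be sketched as a breadth-first walk over backlinks, starting from the seed document. This is a minimal illustration only: the `backlinks` map is made-up data (a real crawler would have to obtain backlinks from some external source), the function name is hypothetical, and the classifier training step is left out.

```python
def build_context_graph(seed, backlinks, max_level):
    """Return a list of levels; level k holds pages that are k links away from the seed."""
    levels = [{seed}]
    seen = {seed}
    for _ in range(max_level):
        next_level = set()
        for page in levels[-1]:
            # Pages that link to `page` belong one level further out.
            for parent in backlinks.get(page, ()):
                if parent not in seen:
                    seen.add(parent)
                    next_level.add(parent)
        if not next_level:
            break
        levels.append(next_level)
    return levels


# Hypothetical backlink data: page -> pages that link to it.
backlinks = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["seed"],
}
levels = build_context_graph("seed", backlinks, max_level=2)
print(levels)
```

Each page is assigned to the first level at which it is reached, so a page that links back in a cycle is not added twice.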
Context Graph Crawling
Crawling - f ( ) =
f=F(i)
Crawling -
Crawling - "Synchronizing a database to improve freshness". J. Cho, H. Garcia-Molina. In Proceedings of the International Conference on Management of Data, 2000.
Page Repository
Page Repository : RPA Streaming Access
                   Log   Hash   Hash-Log
Streaming Access   +!    -!     +
RPA                ~     +!     ~
Page Addition      +!    -!     ~
Page Repository conflicts vs. freshness obsolete pages :
Indexing
Indexing:
- text index: inverted files, suffix arrays, signature files
- structure (link) index
- utility index
Ranking and Link Analysis
PageRank: "The PageRank citation ranking: Bringing order to the Web". Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. Technical report, Computer Science Department, Stanford University, 1998. (Google)
HITS: "Authoritative sources in a hyperlinked environment". Jon Kleinberg. Journal of the ACM, 46(5):604-632, November 1999. (Clever - IBM, Teoma)
PageRank: www.upatras.gr, #in_links = 760; www.stanford.edu, #in_links = 33600
PageRank: strongly connected graph
PageRank: random surfer model; strongly connected assumption; problem: rank leak, rank sink
PageRank random surfer model
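(1) A Markov chain consists of n states, plus an n×n transition probability matrix P. At each step, we are in exactly one of the states. For $1 \le i, j \le n$, the matrix entry $P_{ij}$ gives the probability that j is the next state, given that we are currently in state i. Markov chains are abstractions of random walks.
(2) For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state distribution. Let $a = (a_1, \dots, a_n)$ be the row vector of steady-state probabilities. If the current position is distributed as a, then the next step is distributed as aP. Since a is the steady state, $a = aP$, so a is a (left) eigenvector of P. (It corresponds to the principal eigenvector of P, the one with the largest eigenvalue.)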
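The steady-state vector with a = aP can be approximated by repeatedly multiplying a starting distribution by P (power iteration). The sketch below is one way to do this under stated assumptions: the three-page link graph is made up, the function names are hypothetical, and the teleport probability of 0.15 is an assumed value added so that the chain is ergodic even when the link graph has rank sinks or dead ends.

```python
def transition_matrix(links, n, teleport=0.15):
    """Build the random-surfer matrix P for pages 0..n-1."""
    matrix = []
    for i in range(n):
        out = links.get(i, [])
        row = []
        for j in range(n):
            if out:
                follow = out.count(j) / len(out)  # follow a random out-link
            else:
                follow = 1.0 / n  # dead end: jump to any page
            row.append((1 - teleport) * follow + teleport / n)
        matrix.append(row)
    return matrix


def steady_state(P, iterations=200):
    """Approximate the row vector a with a = aP by power iteration."""
    n = len(P)
    a = [1.0 / n] * n
    for _ in range(iterations):
        a = [sum(a[i] * P[i][j] for i in range(n)) for j in range(n)]
    return a


# Hypothetical link graph: page -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
P = transition_matrix(links, 3)
a = steady_state(P)
print([round(x, 4) for x in a])
```

Page 2 ends up ranked above page 1 because it receives links from both other pages. For large graphs a sparse representation would replace the dense matrix used here.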
Hypertext Induced Topic Search (HITS): for a query Q, find authority and hub pages. Example: Q = greek university
Authority: www.upatras.gr, www.auth.gr
Hub: www.gunet.gr, Universities Worldwide (http://geowww.uibk.ac.at/univ/world.html)
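The hub and authority scores can be computed by mutual reinforcement: a page's authority score sums the hub scores of the pages linking to it, and its hub score sums the authority scores of the pages it links to. The sketch below follows that idea on a made-up link graph; the page names, the function name `hits`, and the fixed iteration count are assumptions, and the step that selects the query-specific subgraph is omitted.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a small link graph."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalise both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two directory pages pointing at university sites.
links = {
    "directory_1": ["univ_a", "univ_b"],
    "directory_2": ["univ_a", "univ_b", "univ_c"],
    "univ_a": ["univ_b"],
}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the larger directory page comes out as the best hub and the most linked-to site as the best authority.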
Hypertext Induced Topic Search (HITS)max{d}
Hypertext Induced Topic Search (HITS): jaguar, randomized algorithms, abortion
Tag/position heuristics tags , sections
Anchor Text , anchor text . hubs/authorities. Anchor text 6-8 , link anchor.
Web sites, site
Web Mining Taxonomy
Web Content Mining: Keyword, Term Association, Similarity Search, Classification, Clustering, Natural Language Processing
Web Usage Mining
Pattern                     Ordering  Duplicates  Consecutive  Maximal  Support
Association Rules           N         N           N            N        Freq(X)/#transactions
Episodes                    Y         N           N            N        Freq(X)/#timewindows
Sequential patterns         Y         N           N            Y        Freq(X)/#customers
Forward sequences           Y         N           Y            Y        Freq(X)/#forward sequences
Maximal forward sequences   Y         Y           Y            Y        Freq(X)/#clicks
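As a small worked example of the association-rules row, support can be computed as Freq(X)/#transactions, where ordering and duplicates inside a transaction are ignored. The sessions below are made-up data and the function name `support` is hypothetical; treating each user session as one transaction is an assumption made for this sketch.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    matches = sum(1 for t in transactions if wanted <= set(t))
    return matches / len(transactions)


# Hypothetical usage log: each transaction is the list of pages in one session.
sessions = [
    ["/", "/courses", "/ir"],
    ["/", "/ir"],
    ["/courses"],
    ["/", "/staff"],
]

s_home_ir = support({"/", "/ir"}, sessions)
s_courses = support({"/courses"}, sessions)
print(s_home_ir, s_courses)
```

The other rows of the table change what counts as an occurrence and what the denominator is, so they would need their own counting logic.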
R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
Christopher Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html)
I. Witten, A. Moffat, T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, 1999.
G. Salton, M. McGill, An Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.
C. J. van Rijsbergen, Information Retrieval, London: Butterworths, 1979.
C. J. van Rijsbergen, The Geometry of Information Retrieval, Cambridge University Press, 2005.
W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
http://mmlab.ceid.upatras.gr/ir
B. Allen, Information Tasks: Towards a User-Centered Approach to Information Systems, Academic Press, San Diego, CA, 1996.
M. Atallah, ed., Algorithms and Theory of Computation Handbook, CRC Press, 1999.
D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
V.S. Subrahmanian, Principles of Multimedia Database Systems, Morgan Kaufmann, 1998.
Ian H. Witten, Alist