Introduction to Information Retrieval and Its Applications


Date posted: 21-Jan-2016

Introduction to Information Retrieval and Its Applications. Introductory notes. IR: representation, storage, organization of, and access to information items. The focus is the user's information need. User's information need:

Transcript of Introduction to Information Retrieval and Its Applications

  • Keywords? Semantics.

  • Classification, categorization

  • Web Search Engines, Digital Libraries, Peer to Peer, Web Services

  • Data Mining

  • 1945: Vannevar Bush, "As We May Think"

    1960+: Gerard Salton

    1978: ACM SIGIR

    1992: TREC

  • Unstructured (text) vs. structured (database) data in 1996

  • Unstructured (text) vs. structured (database) data in 2006

  • Computer-Centered View

    Human-Centered View

  • Retrieval, Browsing, Hidden Web

  • Format? pdf/word/excel/html?

    E-mail?

  • Queries

    Stop lists

    User Interface

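  • An information retrieval model is a quadruple [D, Q, F, R(qi, dj)], where:

    1) D is a set of logical views (representations) of the documents in the collection.

    2) Q is a set of logical views of the user information needs (queries).

    3) F is a framework for modeling document representations, queries, and their relationships.

    4) R(qi, dj) is a ranking function that associates a real number with a query qi ∈ Q and a document representation dj ∈ D. This ranking defines an ordering of the documents with respect to the query qi.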
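The role of the ranking function R(qi, dj) in the quadruple [D, Q, F, R(qi, dj)] can be illustrated with a minimal sketch. The set-of-terms representation and the overlap score below are assumptions chosen for brevity (a toy framework F), not the model used in the slides; the names `overlap_rank` and `rank` and the sample documents are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

Document = FrozenSet[str]   # logical view of a document: its set of terms
Query = FrozenSet[str]      # logical view of an information need


def overlap_rank(q: Query, d: Document) -> float:
    """Toy R(q, d): the number of query terms that occur in the document."""
    return float(len(q & d))


def rank(q: Query, docs: Dict[str, Document],
         r: Callable[[Query, Document], float] = overlap_rank) -> List[str]:
    """Order document names by decreasing R(q, d); ties broken by name."""
    return sorted(docs, key=lambda name: (-r(q, docs[name]), name))


# Hypothetical collection D and query q.
docs = {
    "d1": frozenset({"web", "search", "engine"}),
    "d2": frozenset({"digital", "library"}),
    "d3": frozenset({"web", "crawler"}),
}
query = frozenset({"web", "search"})
ordering = rank(query, docs)
print(ordering)
```

Swapping `overlap_rank` for another function changes the model while D and Q stay the same, which is the point of treating R as a separate component.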


  • Inverted file: a structure for efficiently locating the occurrences of a term inside a text collection.

    Structure: a set of inverted lists, stored in a file on disk.

    Inverted list: a list that contains the occurrences of a term in the texts of a collection.

    Structure of an inverted list: a header such as [3] gives the number of documents in the list that contain the specific term; each pair (document number, frequency) records one document, for example that the term appears in document 1 twice.

    Depending on the requirements of the application, an inverted list record can contain other kinds of information (e.g. the number of the paragraph where the term appears).

  • [Figure: algorithm for inverted file creation. A document collection (d1, d2, d3) over the terms t1 to t5 is mapped, term by term, to inverted lists that form the inverted file; the list headers shown are [3], [3], [2], [2], [2].]
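The inverted file described above can be sketched in a few lines. This is a minimal in-memory illustration, assuming whitespace tokenization and lowercase folding; the function name `build_inverted_file` and the three sample documents are hypothetical, and a real system would store the lists on disk.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# term -> (number of documents containing the term, [(doc_id, term frequency), ...])
InvertedFile = Dict[str, Tuple[int, List[Tuple[int, int]]]]


def build_inverted_file(docs: Dict[int, str]) -> InvertedFile:
    """Map every term to its inverted list of (document, frequency) pairs."""
    postings: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for doc_id in sorted(docs):                      # keep lists ordered by document
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    # The list header is the number of documents that contain the term.
    return {term: (len(plist), plist) for term, plist in postings.items()}


# Hypothetical document collection.
docs = {1: "web search web", 2: "search engine", 3: "web crawler"}
index = build_inverted_file(docs)
print(index["web"])
```

Here the entry for "web" has header 2 and contains the pair (1, 2), meaning the term appears in document 1 twice.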

  • WWW, URL (e.g. Yahoo)

  • URL = Uniform Resource Locator: access method, host name, page name

    http://www.ceid.upatras.gr/ir/
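As a small illustration, the three parts of the example URL can be extracted with Python's standard `urllib.parse` module. The variable names are chosen here to mirror the slide's labels.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.ceid.upatras.gr/ir/")

access_method = parts.scheme    # the protocol used to fetch the page
host_name = parts.hostname      # the server that hosts the page
page_name = parts.path          # the location of the page on that server

print(access_method, host_name, page_name)
```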

  • 2-10B , 8-12 : 10-100

    http://www.netcraft.com/Survey

  • /: , W3C : 55 : 82%, 15: 13% : , , (marketing) (30-40% ) ? ~8 / Bow-tie

  • Informational (~40%)

    Navigational (~25%)

    Transactional (web-mediated) (~35%)

    Finding a good hub; "see what's there"

  • (html, xml), mp3, images, video, ...; database access; the invisible web; proprietary content, etc.

  • (80% ) bandwidth , feedback,

  • : .. Google crawl anchor-text. : ( n , ...) crawling.

  • : URL . : URLs

    URLs 20,000 URLs random conjunctive query

  • Choose random searches extracted from a local log, or build random searches. Use only queries with small result sets. Count normalized URLs in the result sets. Use ratio statistics. Advantage: might be a good reflection of the human perception of coverage.
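The counting step of the random-searches method can be sketched as follows. This is only an illustration under stated assumptions: the normalization rule (lowercase host, no trailing slash, scheme and fragment ignored), the function names, and the two result sets are hypothetical, and real studies use more careful normalization and statistics.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Assumed normalization: lowercase host plus path without trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return host + path


def count_urls(results: Dict[str, List[str]]) -> int:
    """Total number of distinct normalized URLs, summed over all queries."""
    return sum(len({normalize(u) for u in urls}) for urls in results.values())


def coverage_ratio(results_a: Dict[str, List[str]],
                   results_b: Dict[str, List[str]]) -> float:
    """Ratio of result counts of engine A to engine B on the same queries."""
    return count_urls(results_a) / count_urls(results_b)


# Hypothetical result sets of two engines for the same two random queries.
engine_a = {"q1": ["http://A.example/x/", "http://a.example/x"],
            "q2": ["http://a.example/y"]}
engine_b = {"q1": ["http://b.example/z"], "q2": []}
ratio = coverage_ratio(engine_a, engine_b)
print(ratio)
```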

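  • For random pages p1, p2: Pr[p1 reachable from p2] ~ 1/4. Diameter of the SCC: >28. Average directed path length: ~16. Average undirected path length: ~7.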

  • Power Laws

    Two quantities x and y follow a power law if $y \propto x^{-c}$, that is, $\log y = -c \log x$ up to an additive constant.
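Because a power law is a straight line with slope -c on a log-log plot, the exponent can be estimated with a least-squares fit in log space. The sketch below assumes noise-free synthetic data and a hypothetical helper name `fit_power_law_exponent`; it is not the estimation procedure used in the cited studies.

```python
import math
from typing import Sequence


def fit_power_law_exponent(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return c such that log y is approximately -c * log x + const."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least-squares slope of log y against log x.
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
             / sum((a - mean_x) ** 2 for a in log_x))
    return -slope


# Synthetic data generated from y = 100 * x^(-2).
xs = [1, 2, 4, 8, 16]
ys = [100.0 * x ** -2.0 for x in xs]
c = fit_power_law_exponent(xs, ys)
print(round(c, 6))
```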

  • Example of a power law: Zipf's law

    y: frequency of a word; x: rank of the word (the x-th most frequent)

    A power law with c = 1: $y \propto 1/x$

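  • Power laws on the Web? (Broder et al. 1999) x = number of links into page i, y = number of pages with x incoming links: $y \propto x^{-2.09}$

  • Power laws on the Web? (continued) x = number of links out of page i, y = number of pages with x outgoing links: $y \propto x^{-2.72}$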

  • Web Web Web

  • Kumar et al., "Stochastic Models for the Web Graph", FOCS 2000: the Web graph at time t and the new vertex v_{t+1}

  • Web t+1 d d>1 - ?

    vt+1 1- i- v

  • Web .

    Power laws!To d :

  • Enterprise search, Peer-to-Peer (P2P) search

  • Peer-to-Peer: Gnutella, Kazaa, BearShare, Aimster, Grokster, Morpheus

  • On-page data

    Off-page, web-specific data: Link (connectivity), Click-through (what users click on), Anchor-text

    Context

  • Boolean queries. Matching: exact, prefix, phrase. Operators: AND, OR, AND NOT, NEAR. Fields: TITLE:, URL:, HOST:.

    TF-based factors: TF, keywords, headers, ... IDF-based factors: IDF, corpus, query log.
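The TF and IDF factors are often combined into a single term weight. The sketch below uses one common textbook formulation, raw term frequency times log(N / df), which is an assumption and not necessarily the variant the slides had in mind; `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def tf_idf(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Weight of term t in document d: tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights: Dict[str, Dict[str, float]] = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[name] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


# Hypothetical two-document corpus.
docs = {"d1": "web search web", "d2": "web crawler"}
weights = tf_idf(docs)
print(weights["d1"])
```

A term that occurs in every document (here "web") gets weight zero, which is why IDF statistics taken from the corpus or a query log help separate discriminating terms from common ones.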

  • Off-page, web-specific data: Link (connectivity), Click-through, Anchor-text

    Crawling: corpus

  • Query language determination and different ranking

    Integration of search and text analysis

    Context determination: spatial (user location/target location), query stream (previous queries), personal (user profile)

    Context use: result restriction, ranking modulation

  • Crawling

    Refresh policy?

  • Crawler . Crawler Crawler . Crawler . Crawlers (Crawling)

  • Crawling - Interest Driven

  • Crawling - Interest Driven. Chakrabarti et al., "Focused crawling: a new approach to topic-specific Web resource discovery", 8th WWW Conference, 1999. If Q is the user interest then:

  • Crawling - Popularity Driven, Location Driven

  • Context Graph Crawling

    Context graph: a context graph is created for each seed document. The root is the seed document. Nodes at each level are documents with links to documents at the next higher level. The graph is updated during the crawl itself.

    Approach: construct the context graph and classifiers using the seed documents as training data, then perform crawling using the classifiers and context graph created.

  • Context Graph Crawling

  • Crawling - f ( ) =

    f=F(i)

  • Crawling -

  • Crawling - J. Cho, H. Garcia-Molina, "Synchronizing a database to improve freshness". In Proceedings of the International Conference on Management of Data, 2000.

  • Page Repository

  • Page Repository: RPA, Streaming Access

                        Log    Hash   Hash-Log
    Streaming Access    +!     -!     +
    RPA                 ~      +!     ~
    Page Addition       +!     -!     ~

  • Page Repository conflicts vs. freshness obsolete pages :

  • Indexing

  • Indexing: text index (inverted files, suffix arrays, signature files), structure (link) index, utility index

  • Ranking and Link Analysis

    PageRank: Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Computer Science Department, Stanford University, 1998. (Google)

    HITS: Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, November 1999. (Clever IBM, Teoma)

  • PageRank. www.upatras.gr: #in_links = 760. www.stanford.edu: #in_links = 33600.

  • PageRank: strongly connected graph

  • PageRank: random surfer model. Strongly connected assumption. Problem: rank leak, rank sink.

  • PageRank: random surfer model

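  • (1) Markov chains. A Markov chain consists of n states, plus an n×n transition probability matrix P. At each step, we are in exactly one of the states. For 1 ≤ i, j ≤ n, the matrix entry Pij is the probability that j is the next state, given that we are currently in state i.

  • (2) Markov chains. Steady-state distribution: let a = (a1, ..., an) be the row vector of steady-state probabilities. If the current position is described by a, then the next step is distributed as aP. Because a is the steady state, a = aP, so a is a (left) eigenvector of P. (It corresponds to the principal eigenvector of P, the one with the largest eigenvalue.)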
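The relation a = aP suggests a simple way to approximate the steady state: start from any distribution and multiply by P repeatedly (power iteration). The sketch below assumes an ergodic chain so that the iteration converges; the function name `steady_state` and the 2-state matrix are hypothetical, and a full PageRank computation would additionally handle rank sinks and leaks.

```python
from typing import List


def steady_state(P: List[List[float]], iterations: int = 100) -> List[float]:
    """Approximate the row vector a with a = aP by repeated multiplication."""
    n = len(P)
    a = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(iterations):
        # One step of the chain: new a[j] = sum over i of a[i] * P[i][j].
        a = [sum(a[i] * P[i][j] for i in range(n)) for j in range(n)]
    return a


# Hypothetical 2-state chain; each row of P sums to 1.
P = [[0.50, 0.50],
     [0.25, 0.75]]
a = steady_state(P)
print([round(p, 6) for p in a])
```

For this matrix the equation a = aP gives a2 = 2 * a1, so the iteration settles at (1/3, 2/3).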

  • Hypertext Induced Topic Search (HITS): authority and hub pages for a query Q. Example: Q = greek university

    Authorities: www.upatras.gr, www.auth.gr

    Hubs: www.gunet.gr, Universities Worldwide (http://geowww.uibk.ac.at/univ/world.html)

  • Hypertext Induced Topic Search (HITS)max{d}

  • Hypertext Induced Topic Search (HITS)


  • Hypertext Induced Topic Search (HITS): jaguar, randomized algorithms, abortion
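The hub and authority scores of HITS can be sketched with the usual mutual-reinforcement iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is a minimal sketch of that standard formulation, assuming a small hand-made link graph; the function name `hits` and the node names are hypothetical, and building the query-specific subgraph is left out.

```python
import math
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]],
         iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a graph given as source -> targets."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages that link here.
        auth = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: two hub pages pointing at two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```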

  • Tag/position heuristics tags , sections

  • Anchor Text , anchor text . hubs/authorities. Anchor text 6-8 , link anchor.

  • Web sites, site

  • Web Mining Taxonomy

  • Web Content Mining: keyword, term association, similarity search, classification, clustering, natural language processing

  • Web Usage Mining

                                 Ordering  Duplicates  Consecutive  Maximal  Support
    Association Rules            N         N           N            N        Freq(X)/#transactions
    Episodes                     Y         N           N            N        Freq(X)/#timewindows
    Sequential patterns          Y         N           N            Y        Freq(X)/#customers
    Forward sequences            Y         N           Y            Y        Freq(X)/#forward sequences
    Maximal forward sequences    Y         Y           Y            Y        Freq(X)/#clicks
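For the association-rules row, support is Freq(X)/#transactions, where ordering and duplicates are ignored. A minimal sketch follows, assuming each transaction is the set of pages visited in one session; the function name `support` and the sample sessions are hypothetical.

```python
from typing import Iterable, List


def support(itemset: Iterable[str], transactions: List[List[str]]) -> float:
    """Freq(X) / #transactions: fraction of transactions that contain all of X."""
    wanted = set(itemset)
    freq = sum(1 for t in transactions if wanted <= set(t))
    return freq / len(transactions)


# Hypothetical sessions, each listing the pages requested.
transactions = [
    ["/home", "/ir"],
    ["/home", "/courses"],
    ["/home", "/ir", "/courses"],
    ["/ir"],
]
s = support({"/home", "/ir"}, transactions)
print(s)
```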

  • R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

    Christopher Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html)

    I. Witten, A. Moffat, T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, 1999.

    G. Salton, M. McGill, An Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.

    Van Rijsbergen, Information Retrieval, London: Butterworths, 1979.

    Van Rijsbergen, The Geometry of Information Retrieval, Cambridge University Press, 2005.

    W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, NJ, USA, 1992.

    http://mmlab.ceid.upatras.gr/ir

  • B. Allen, Information Tasks: Towards a User-Centered Approach to Information Systems, Academic Press, San Diego, CA, 1996.

    M. Atallah ed., Algorithms and Theory of Computation Handbook, CRC Press, 1999.

    D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.

    V.S. Subrahmanian, Principles of Multimedia Database Systems, Morgan Kaufmann, 1998.

    Ian H. Witten, Alist