Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay

Page 1

Ziv Bar-Yossef et al., IBM Almaden and T.J. Watson Research Centers

Mark Strohmaier

Page 2

Problem Motivation

Determining if a link is dead is not trivial

Using dead links as a decay signal is very noisy

Page 3

Estimating Proportion of Dead Pages

Begin at a page

At each step, with probability σ stop and declare success; with probability 1-σ follow a randomly chosen outgoing link

If the walk reaches a dead page, declare failure

The decay score of a page is the probability that a walk started at that page ends in failure
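
The walk described above can be turned into a small computation. The following is a minimal sketch, not the authors' implementation: it assumes the link graph fits in a dictionary, that a live page with no outgoing links counts as a success (score 0), and it uses simple fixed-point iteration. The names `decay_scores`, `links`, `dead`, and `iterations` are illustrative.

```python
def decay_scores(links, dead, sigma=0.1, iterations=100):
    """Approximate decay scores by fixed-point iteration.

    links: dict mapping each page to a list of pages it links to.
    dead:  set of pages known to be dead.
    sigma: probability of stopping with success at each step.
    """
    pages = set(links) | set(dead)
    for outs in links.values():
        pages.update(outs)

    # Dead pages always fail; start every live page at 0.
    score = {p: (1.0 if p in dead else 0.0) for p in pages}

    for _ in range(iterations):
        new_score = {}
        for p in pages:
            if p in dead:
                new_score[p] = 1.0
                continue
            outs = links.get(p, [])
            if not outs:
                # Assumption: a live page with no outlinks cannot fail.
                new_score[p] = 0.0
            else:
                # Continue with probability 1 - sigma to a uniformly
                # chosen outlink and inherit its chance of failure.
                avg = sum(score[q] for q in outs) / len(outs)
                new_score[p] = (1.0 - sigma) * avg
        score = new_score
    return score


if __name__ == "__main__":
    graph = {"a": ["b", "x"], "b": ["a"]}
    print(decay_scores(graph, dead={"x"}, sigma=0.5))
```

Because each step shrinks the influence of distant pages by a factor of 1-σ, the iteration settles quickly, and only pages within a few links of the starting page contribute noticeably to its score.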

Page 4

Differences from PageRank

Decay of a page can be processed in isolation

Very easy to reduce a page's decay score

Page 5

Three types of dead pages

Malformed URL

Host does not exist

Page does not exist on host
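
These three cases can be told apart by checking the URL syntax, then the host, then the page. The sketch below is illustrative only: `host_exists` and `page_exists` are hypothetical callbacks standing in for a DNS lookup and an HTTP fetch, and the restriction to `http`/`https` is an assumption, not something stated on the slide.

```python
from enum import Enum
from typing import Callable
from urllib.parse import urlsplit


class LinkStatus(Enum):
    MALFORMED_URL = "malformed URL"
    HOST_NOT_FOUND = "host does not exist"
    PAGE_NOT_FOUND = "page does not exist on host"
    ALIVE = "alive"


def classify_link(
    url: str,
    host_exists: Callable[[str], bool],
    page_exists: Callable[[str], bool],
) -> LinkStatus:
    """Classify a link by running the three checks in order."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some syntactically invalid URLs outright.
        return LinkStatus.MALFORMED_URL

    # Assumption: only http(s) URLs with a host part are well formed.
    if parts.scheme not in ("http", "https") or not host:
        return LinkStatus.MALFORMED_URL
    if not host_exists(host):
        return LinkStatus.HOST_NOT_FOUND
    if not page_exists(url):
        return LinkStatus.PAGE_NOT_FOUND
    return LinkStatus.ALIVE
```

Passing the network checks in as callbacks keeps the ordering logic separate from any particular DNS or HTTP client.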

Page 6

Detecting 'soft-404s'

Query a given server for a page that almost certainly does not exist

Record the server's response to the dummy request

Observe the behaviour when the legitimate request is sent
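
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' exact procedure: the probe URL is a random 25-letter name in the same directory, "same behaviour" is taken to mean the same final URL and the same body, and `fetch` is a hypothetical callback that returns the status code, the final URL after redirects, and the body.

```python
import random
import string
from typing import Callable, NamedTuple
from urllib.parse import urljoin


class Response(NamedTuple):
    status: int      # HTTP status code of the final response
    final_url: str   # URL reached after following redirects
    body: str        # response body


def random_sibling(url: str, length: int = 25) -> str:
    """Build a URL in the same directory that is very unlikely to exist."""
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return urljoin(url, name)


def is_soft_404(url: str, fetch: Callable[[str], Response]) -> bool:
    """Return True if the server answers `url` the way it answers a dummy URL."""
    # Step 1 and 2: send the dummy request and record the response.
    probe = fetch(random_sibling(url))
    if probe.status >= 400:
        # The server reports missing pages with a real error code,
        # so a successful answer for `url` can be trusted.
        return False

    # Step 3: send the legitimate request and compare behaviour.
    real = fetch(url)
    if real.status >= 400:
        return False  # a hard error, not a soft one
    return real.final_url == probe.final_url and real.body == probe.body
```

One known weakness of this simplified comparison: if a server redirects every unknown URL to its home page, the home page itself would be flagged, so a practical version would need to treat that case separately.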

Page 7

Measure of Decay

If D is the set of all dead pages
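
The formula that accompanied this line is not preserved in the transcript. One form consistent with the random walk described under "Estimating Proportion of Dead Pages" is given here as an assumption rather than as the slide's exact notation; out(p) denotes the set of pages that p links to, and the expression applies to live pages with at least one outgoing link.

```latex
D_\sigma(p) =
\begin{cases}
1 & \text{if } p \in D, \\[4pt]
(1-\sigma)\,\dfrac{1}{|\mathrm{out}(p)|}
  \displaystyle\sum_{q \in \mathrm{out}(p)} D_\sigma(q) & \text{otherwise.}
\end{cases}
```

The factor 1-σ is the probability that the walk continues past p instead of stopping with success, and the average over out(p) reflects the uniformly random choice of the next link.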

Page 8

Experimental Procedure

Page 9

First Round of Experiments

1000 pages were chosen at random from a web crawl of two billion pages

475 were already dead

Of 710 dead links, 207 (about 29%) pointed to soft-404s

Page 10

First Round of Experiments

Page 11

Decay Score versus Dead Links

Page 12

Second Experiment – WWW Conference links

Page 13

Third Experiment – Yahoo Leaf Nodes

Page 14

Final Experiment – FAQs.org

Page 15

Conclusions and Remarks

A number of tools exist for identifying dead links, but few exist for identifying decay

Incorporating decay scores into search ranking could improve the quality of results

Decay computations could also be used to improve crawling