Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay
Ziv Bar-Yossef et al., IBM Almaden and T.J. Watson Research Centers
Mark Strohmaier
Problem Motivation
Determining if a link is dead is not trivial
Using dead links as a decay signal is very noisy
Estimating Proportion of Dead Pages
Begin at a page
At each step: with probability σ, declare success; with probability 1-σ, follow a randomly chosen outlink
If the walk reaches a dead page, declare failure
The decay score of a page is the probability that a walk started at that page ends in failure
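The walk above can be approximated by simulation. The following is a minimal sketch, not the authors' implementation: the names `estimate_decay`, `outlinks`, and `dead` are hypothetical, and treating a live page with no outlinks as a success is an assumption made here so that the walk always terminates.

```python
import random


def estimate_decay(start, outlinks, dead, sigma=0.1, trials=10_000, rng=None):
    """Estimate the decay score of `start` by simulating the random walk.

    outlinks: dict mapping each page to the list of pages it links to
    dead:     set of pages known to be dead
    sigma:    probability of declaring success at each step
    """
    rng = rng or random.Random()
    failures = 0
    for _ in range(trials):
        page = start
        while True:
            if page in dead:
                failures += 1  # the walk reached a dead page: failure
                break
            if rng.random() < sigma:
                break  # declare success
            links = outlinks.get(page, [])
            if not links:
                break  # assumption: nowhere left to walk counts as success
            page = rng.choice(links)  # follow a random outlink
    return failures / trials


# Tiny hypothetical graph: page "a" links to one dead and one live page.
graph = {"a": ["b", "c"], "c": []}
print(estimate_decay("a", graph, dead={"b"}, sigma=0.1))
```

On this toy graph the exact score is (1-σ) × 1/2 = 0.45, so the printed estimate should land close to that value. Only the start page and the pages reachable from it are touched, which matches the point that decay can be computed for one page in isolation.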
Differences from PageRank
Decay of a page can be processed in isolation
Very easy to reduce a page's decay score
Three types of dead pages
Malformed URL
Host does not exist
Page does not exist on host
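The three cases can be checked in order, from cheapest to most expensive. The sketch below is illustrative only: `classify_link`, the returned labels, and the injectable `resolve` and `fetch_status` callables are hypothetical names, and treating every HTTP status of 400 or above as "page does not exist" is a simplifying assumption.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def http_status(url, timeout=10):
    """Return the HTTP status code for `url` (error codes included)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def classify_link(url, resolve=socket.gethostbyname, fetch_status=http_status):
    """Return 'malformed', 'no-host', 'no-page', or 'live'."""
    # 1. Malformed URL: cannot be parsed, or lacks an http(s) scheme or a host.
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return "malformed"
    if parts.scheme not in ("http", "https") or not host:
        return "malformed"
    # 2. Host does not exist: the name lookup fails.
    try:
        resolve(host)
    except (OSError, UnicodeError):
        return "no-host"
    # 3. Page does not exist on host: the server answers with an error status.
    #    Assumption: any status >= 400 is counted as a missing page.
    if fetch_status(url) >= 400:
        return "no-page"
    return "live"
```

Passing `resolve` and `fetch_status` as parameters keeps the classification logic testable without network access. A link classified as 'live' here may still be a soft-404, which needs the separate check described next.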
Detecting 'soft-404s'
Query a given server for a page that is unlikely to exist
Record the server's response to the dummy request
Observe the behaviour when the legitimate request is sent
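The three steps above can be sketched as a comparison between two responses. This is a simplified illustration under stated assumptions, not the procedure from the talk: `fetch` is a hypothetical caller-supplied function returning `(status, final_url, body)` after following redirects, `dummy_url` and `is_soft_404` are hypothetical names, and the rule "same final URL or same body" is an assumed similarity test.

```python
import random
import string
from urllib.parse import urlsplit, urlunsplit


def dummy_url(url, rng=random, length=25):
    """Build a URL in the same directory that almost certainly does not exist."""
    parts = urlsplit(url)
    junk = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    directory = parts.path.rsplit("/", 1)[0]
    return urlunsplit((parts.scheme, parts.netloc, f"{directory}/{junk}.html", "", ""))


def is_soft_404(url, fetch, rng=random):
    """Return True if `url` behaves like the server's response to a bogus page."""
    status, final_url, body = fetch(url)
    if status >= 400:
        return False  # a hard error, not a soft-404
    # Record how the server answers a request for a page that should not exist.
    d_status, d_final, d_body = fetch(dummy_url(url, rng))
    if d_status >= 400:
        return False  # the server reports missing pages honestly
    # Assumption: identical redirect target or identical body means soft-404.
    return final_url == d_final or body == d_body
```

A production detector would need more care than this, for example when the requested URL is itself the page that bogus requests are redirected to, or when pages carry dynamic content that makes exact body comparison unreliable.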
Measure of Decay
Let D be the set of all dead pages
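Given the set D, the random walk described earlier suggests a recursive form for the decay score. The following is a sketch consistent with that walk; the symbols D_σ(p) for the decay score of page p and out(p) for the set of pages that p links to are notation assumed here.

```latex
D_\sigma(p) =
\begin{cases}
1 & \text{if } p \in D, \\[4pt]
(1-\sigma)\,\dfrac{1}{|\mathrm{out}(p)|} \displaystyle\sum_{q \in \mathrm{out}(p)} D_\sigma(q) & \text{otherwise.}
\end{cases}
```

The factor (1-σ) is the chance of continuing the walk rather than declaring success, and the average over out(p) reflects the uniform choice of outlink. A live page with no outlinks needs a separate convention; a score of 0 is one natural assumption.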
Experimental Procedure
First Round of Experiments
1000 pages were randomly chosen from a web crawl of two billion
475 were already dead
Of 710 dead links, 207 (about 29%) pointed to soft-404s
Decay Score versus Dead Links
Second Experiment – WWW Conference links
Third Experiment – Yahoo Leaf Nodes
Final Experiment – FAQs.org
Conclusions and Remarks
A number of tools exist for identifying dead links, but few exist for identifying decay
Incorporating decay calculations into search results could be used to improve rankings
Decay computations could also be used to improve crawling