DiffusionRank: A Possible Penicillin for Web Spamming. Haixuan Yang, Group Meeting, Jan. 16, 2006.


1

DiffusionRank: A Possible Penicillin for Web Spamming

Haixuan Yang

Group Meeting, Jan. 16, 2006

2

Outline

Introduction
DiffusionRank
  Model Establishment
  Computation consideration
  Discussion on γ
Results
Conclusions

3

Introduction

PageRank
  Tries to find the importance of a Web page based on the link structure.
  The importance of a page i is defined recursively in terms of the pages that point to it:
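The formula from the slide is not preserved in this transcript. A common way to write such a recursive definition, assumed here rather than taken from the slide, uses r_i for the rank of page i, E for the set of links, and d_j for the number of outgoing links of page j (all three symbols are assumptions):

```latex
r_i = \sum_{j \,:\, (j,i) \in E} a_{ij}\, r_j ,
\qquad
a_{ij} = \frac{1}{d_j}
```

With this choice every page j splits one unit of vote equally among the pages it links to, so each column of the matrix (a_{ij}) sums to one.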

  It has proved effective for ranking Web pages.

4

Introduction
PageRank

Two problems:
  Incomplete information about the Web structure.
    Solution: predict the Web structure as a random graph.
  Web pages manipulated by people for commercial interests.
    About 70% of all pages in the .biz domain are spam.
    About 35% of the pages in the .us domain belong to the spam category.
    Two methods used for manipulation: link stuffing and keyword stuffing.
    Solution: DiffusionRank

5

An example for manipulation

The rank value of node 1 can be increased greatly!

6

Why? Two reasons
  Over-democratic: all pages are born equal, so every page has equal voting ability (the sum of each column is equal to one).
  Input-independent: for any given non-zero initial input, the iteration converges to the same stable distribution.
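Input-independence can be checked numerically with a minimal sketch. The 3-page graph, the matrix A, and the function name power_iteration are hypothetical and chosen only for illustration; no teleportation term is included.

```python
def power_iteration(A, x, steps=200):
    """Repeatedly apply the column-stochastic matrix A to the vector x."""
    n = len(x)
    for _ in range(steps):
        x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# A[i][j] is the share of page j's vote given to page i,
# so each column sums to one ("over-democratic").
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

# Two different non-zero starting vectors with the same total mass.
r1 = power_iteration(A, [1.0, 0.0, 0.0])
r2 = power_iteration(A, [0.2, 0.3, 0.5])

print(r1)
print(r2)
```

Both runs settle on the same vector, so the starting values have no influence on the final ranking.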

Heat Diffusion Model: a natural way to avoid these two factors
  Pages are not equal: some pages are born with high temperatures while others are born with low temperatures.
  Different initial temperature distributions give rise to different temperature distributions after a fixed time period.

7

DiffusionRank
On an undirected graph
  Assumption: the amount of heat flowing from j to i is proportional to the heat difference between i and j.
  Solution:
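The solution formula is not preserved in this transcript. Under the stated assumption it can be sketched as follows, where f_i(t) is the temperature of node i at time t, γ is the proportionality constant, d_i is the degree of node i, and E is the edge set (the symbols H, d_i and E are assumptions introduced for this sketch):

```latex
\frac{f_i(t+\Delta t) - f_i(t)}{\Delta t}
  = \gamma \sum_{j \,:\, (j,i) \in E} \bigl( f_j(t) - f_i(t) \bigr)
\;\Longrightarrow\;
\frac{d}{dt} f(t) = \gamma H f(t),
\qquad
f(1) = e^{\gamma H} f(0),
\qquad
H_{ij} =
\begin{cases}
  -d_i, & i = j \\
  1,    & (j,i) \in E \\
  0,    & \text{otherwise}
\end{cases}
```

Because f(1) depends linearly on f(0), different initial temperatures lead to different results after unit time.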

8

DiffusionRank

On a directed graph
  Assumption: there is extra energy imposed on the link (j, i) such that heat flows only from j to i if there is no link (i, j).
  Solution:

On a random directed graph
  Assumption: the heat flow is proportional to the probability of the link (j, i).
  Solution:

9

DiffusionRank
On a random directed graph
  Solution:

The initial value f(i,0) in f(0) is set to 1 if page i is trusted and 0 otherwise; trusted pages are selected according to the inverse PageRank.
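How the trusted set shapes the result can be illustrated with a minimal sketch. It makes several assumptions that are not from the slides: a fixed (non-random) 3-page graph, a diffusion matrix R = A - I built from a column-stochastic A, and N small Euler steps in place of the exact solution. The name diffusion_rank and the trusted sets are hypothetical.

```python
def diffusion_rank(A, f0, gamma=1.0, N=100):
    """Diffuse heat for unit time using N small steps with R = A - I (assumed form)."""
    n = len(f0)
    f = list(f0)
    h = gamma / N
    for _ in range(N):
        Af = [sum(A[i][j] * f[j] for j in range(n)) for i in range(n)]
        f = [f[i] + h * (Af[i] - f[i]) for i in range(n)]
    return f

def initial_heat(n, trusted):
    """f(i,0) = 1 if page i is trusted, 0 otherwise."""
    return [1.0 if i in trusted else 0.0 for i in range(n)]

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (columns sum to one).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

f_a = diffusion_rank(A, initial_heat(3, {0}))  # page 0 trusted
f_b = diffusion_rank(A, initial_heat(3, {1}))  # page 1 trusted

print(f_a)
print(f_b)
```

In this sketch the total heat is conserved, and the two trusted sets produce different orderings of pages 0 and 1, unlike the input-independent PageRank iteration.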

10

Computation consideration
Approximation of heat kernel
  N = ?
  When N >= 30, the real eigenvalues of [expression missing] are less than 0.01; when N >= 100, they are less than 0.005. We use N = 100 in the paper.

When N tends to infinity
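A plausible reading, assumed here because the formulas are not preserved in the transcript, is that the heat kernel exp(γR) is approximated by the product (I + (γ/N) R)^N, which approaches the kernel as N grows. The sketch below compares the two on a hypothetical 3-page graph with R = A - I. It measures the error on one vector, not the eigenvalues quoted on the slide, and the helper names are illustrative.

```python
def matvec(M, v):
    """Multiply the square matrix M by the vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def approx_heat(R, f0, gamma, N):
    """Apply (I + gamma/N * R)^N to f0, one factor at a time."""
    f = list(f0)
    for _ in range(N):
        Rf = matvec(R, f)
        f = [fi + (gamma / N) * ri for fi, ri in zip(f, Rf)]
    return f

def exact_heat(R, f0, gamma, terms=60):
    """Apply exp(gamma * R) to f0 via a truncated Taylor series."""
    total = list(f0)
    term = list(f0)
    for k in range(1, terms):
        term = [gamma * x / k for x in matvec(R, term)]
        total = [t + x for t, x in zip(total, term)]
    return total

# Hypothetical column-stochastic matrix and the assumed R = A - I.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
R = [[A[i][j] - (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]

f0 = [1.0, 0.0, 0.0]
exact = exact_heat(R, f0, gamma=1.0)
errors = {}
for N in (30, 100):
    approx = approx_heat(R, f0, 1.0, N)
    errors[N] = max(abs(a - e) for a, e in zip(approx, exact))

print(errors)
```

On this toy input the error shrinks as N increases from 30 to 100, which is the behavior the choice N = 100 relies on.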

11

Discussion on γ

γ can be understood as the thermal conductivity.
  When γ = 0, the ranking value is most robust to manipulation since no heat is diffused, but the Web structure is completely ignored.
  When γ = ∞, DiffusionRank becomes PageRank, which can be manipulated easily.
  When γ = 1, DiffusionRank works well in practice.
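The two extremes can be illustrated with a minimal sketch, under the same assumptions as a simplified model: a hypothetical 3-page graph, R = A - I with a column-stochastic A, and Euler steps instead of the exact solution. A large finite γ stands in for γ = ∞, and the function name diffuse is illustrative.

```python
def diffuse(A, f0, gamma, N):
    """Approximate exp(gamma * (A - I)) f0 with N Euler steps (assumed model)."""
    n = len(f0)
    f = list(f0)
    h = gamma / N
    for _ in range(N):
        Af = [sum(A[i][j] * f[j] for j in range(n)) for i in range(n)]
        f = [f[i] + h * (Af[i] - f[i]) for i in range(n)]
    return f

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (columns sum to one).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
f0 = [0.0, 1.0, 0.0]  # only page 1 starts hot

# gamma = 0: no heat moves, so the link structure is ignored.
no_diffusion = diffuse(A, f0, gamma=0.0, N=100)

# Large gamma: the heat settles into the stationary vector of A,
# which for this graph is (0.4, 0.2, 0.4) regardless of f0.
long_diffusion = diffuse(A, f0, gamma=50.0, N=5000)

print(no_diffusion)
print(long_diffusion)
```

Intermediate values such as γ = 1 sit between these extremes: the result still depends on the trusted starting heat while also reflecting the links.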

12

DiffusionRank
Advantages
  Can detect group-group relations
  Can cut graphs
  Anti-manipulation

[Figure labels: +1, -1; γ = 0.5 or 1]

13

DiffusionRank
Experiments
  Data:
    a toy graph (6 nodes)
    a middle-size real-world graph (18542 nodes)
    a large-size real-world graph crawled from CUHK (607170 nodes)
  Compare with TrustRank and PageRank

14

Results
The tendency of DiffusionRank when γ becomes larger, on the toy graph

15

Anti-manipulation on the toy graph

16

Anti-manipulation on the middle graph and the large graph

17

Stability: the order difference between an algorithm's ranking results before it is manipulated and those after manipulation
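The slide does not define the order difference. One simple reading, assumed here and possibly different from the measure used in the experiments, counts the pairs of pages whose relative order flips between the two rankings. The function name and the scores are hypothetical.

```python
from itertools import combinations

def order_difference(before, after):
    """Count page pairs whose relative order differs between two score lists."""
    count = 0
    for i, j in combinations(range(len(before)), 2):
        d_before = before[i] - before[j]
        d_after = after[i] - after[j]
        # A negative product means the pair is ranked in opposite order.
        if d_before * d_after < 0:
            count += 1
    return count

# Hypothetical scores for three pages before and after manipulation.
before = [0.5, 0.3, 0.2]
after = [0.3, 0.5, 0.2]

flipped = order_difference(before, after)
unchanged = order_difference(before, before)
print(flipped, unchanged)
```

A smaller count means the ranking is more stable under manipulation.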

18

Conclusions

The anti-manipulation feature makes DiffusionRank a candidate penicillin for Web spamming.

DiffusionRank is a generalization of PageRank (PageRank is the case γ = ∞).

DiffusionRank can be employed to detect group-group relations.

DiffusionRank can be used to cut graphs.