1 DiffusionRank: A Possible Penicillin for Web Spamming Haixuan Yang Group Meeting Jan. 16, 2006.

DiffusionRank: A Possible Penicillin for Web Spamming

Haixuan Yang

Group Meeting, Jan. 16, 2006.

Outline

  • Introduction
  • DiffusionRank
      • Model establishment
      • Computational considerations
      • Discussion on γ
  • Results
  • Conclusions

Introduction

  • PageRank tries to find the importance of a Web page based on the link structure.
  • The importance of a page i is defined recursively in terms of the pages that point to it.
  • It has proved to be effective for ranking Web pages.
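The recursive definition can be illustrated with a short power iteration. This is a minimal sketch of the commonly used damped form of PageRank, not necessarily the exact formula shown on the slide; the `damping` value, the function name `pagerank`, and the three-page graph are assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a dict {page: [pages it links to]}.

    Assumption: every page has at least one out-link (no dangling pages).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives a small uniform share (the teleport term).
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for j in nodes:
            # Page j splits its current rank equally among its out-links.
            share = damping * rank[j] / len(out_links[j])
            for i in out_links[j]:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical three-page Web: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Page c collects links from both a and b, so it ends up with the highest rank; page b, with a single half-weight in-link, ends up lowest.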

Introduction: PageRank

Two problems:

  • Incomplete information about the Web structure.
      • Solution: predict the Web structure as a random graph.
  • Web pages manipulated by people for commercial interests.
      • About 70% of all pages in the .biz domain are spam.
      • About 35% of the pages in the .us domain belong to the spam category.
      • Two methods are used for manipulation: link stuffing and keyword stuffing.
      • Solution: DiffusionRank.

An example of manipulation

The rank value of node 1 can be increased greatly!
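The figure for this slide is not reproduced here, so the following sketch uses a made-up graph to show the same effect: link stuffing raises the PageRank of a target page. The 4-page ring, the five spam pages, and the damping value are all assumptions for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=200):
    """Damped PageRank by power iteration; assumes no dangling pages."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for j in nodes:
            share = damping * rank[j] / len(out_links[j])
            for i in out_links[j]:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical honest graph: a ring of 4 pages, so every page ranks equally.
honest = {0: [1], 1: [2], 2: [3], 3: [0]}

# Link stuffing: five made-up spam pages (4..8) that all link to page 1.
stuffed = dict(honest)
stuffed.update({spam: [1] for spam in range(4, 9)})

before = pagerank(honest)
after = pagerank(stuffed)
print("page 1 before:", round(before[1], 4))
print("page 1 after: ", round(after[1], 4))
```

In this toy setting page 1 moves from a four-way tie to the single top-ranked page, even though no honest page changed its links.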

Why? Two reasons

  • Over-democratic: all pages are born equal. Each page has equal voting ability: the sum of each column is equal to one.
  • Input-independent: for any given non-zero initial input, the iteration converges to the same stable distribution.

Heat diffusion model: a natural way to avoid these two factors

  • Pages are not equal: some pages are born with high temperatures while others are born with low temperatures.
  • Different initial temperature distributions give rise to different temperature distributions after a fixed time period.
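Both PageRank properties can be checked on a small example. The sketch below is illustrative only: the 3x3 matrix `M` is a hypothetical column-stochastic link matrix (column j holds the votes cast by page j), and no damping term is used because this particular graph is strongly connected and aperiodic.

```python
def column_sums(matrix):
    """Sum of each column: the total voting weight each page hands out."""
    n = len(matrix)
    return [sum(matrix[i][j] for i in range(n)) for j in range(n)]


def iterate(matrix, start, steps=200):
    """Repeatedly apply x <- M x, then normalise the result to sum to one."""
    x = list(start)
    size = len(x)
    for _ in range(steps):
        x = [sum(matrix[i][j] * x[j] for j in range(size)) for i in range(size)]
    total = sum(x)
    return [v / total for v in x]


# Hypothetical graph a -> b, a -> c, b -> c, c -> a (pages indexed 0, 1, 2).
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

print(column_sums(M))                      # every page casts exactly one vote
from_uniform = iterate(M, [1.0, 1.0, 1.0])
from_skewed = iterate(M, [100.0, 0.0, 0.0])
print(from_uniform, from_skewed)           # same limit from both inputs
```

Whatever non-zero input is supplied, the iteration forgets it, which is why trust information cannot be injected through the starting vector alone.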

DiffusionRank

On an undirected graph

  • Assumption: the amount of heat flowing from j to i is proportional to the heat difference between i and j.
  • Solution:

On a directed graph

  • Assumption: extra energy is imposed on the link (j, i) so that heat flows only from j to i if there is no link (i, j).
  • Solution:

On a random directed graph

  • Assumption: the heat flow is proportional to the probability of the link (j, i).
  • Solution:

DiffusionRank

On a random directed graph

  • Solution:
  • The initial value f(i,0) in f(0) is set to 1 if page i is trusted and 0 otherwise, where trust is determined according to the inverse PageRank.
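The closed-form solutions from the slides are not reproduced in this transcript. As a rough illustration of the undirected assumption only (heat flow from j to i proportional to the heat difference), the sketch below integrates df_i/dt = γ Σ_j (f_j − f_i) over one unit of time with small explicit steps. The path graph, the trusted set, and the names `diffuse` and `steps` are hypothetical.

```python
def diffuse(neighbors, f0, gamma=1.0, steps=100):
    """Explicit Euler integration of df_i/dt = gamma * sum_j (f_j - f_i)
    over the time interval [0, 1] on an undirected graph."""
    f = dict(f0)
    dt = 1.0 / steps
    for _ in range(steps):
        # Net heat flowing into each node from its neighbours.
        flow = {i: sum(f[j] - f[i] for j in neighbors[i]) for i in f}
        f = {i: f[i] + gamma * dt * flow[i] for i in f}
    return f


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Initial temperatures: 1 for trusted pages, 0 otherwise (page 0 is trusted).
trusted = {0}
f0 = {i: 1.0 if i in trusted else 0.0 for i in neighbors}

f1 = diffuse(neighbors, f0)
print(f1)
```

Heat is conserved on the undirected graph, and pages closer to the trusted page end up warmer, so the result depends on where the initial heat was placed.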

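Computational considerations

Approximation of the heat kernel

  • Choice of N: when N ≥ 30, the real eigenvalues of the approximation error are less than 0.01; when N ≥ 100, they are less than 0.005. We use N = 100 in the paper.
  • When N tends to infinity, the approximation converges to the exact heat kernel.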
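The effect of N can be seen numerically. The sketch below assumes the approximation has the compound form (I + (γ/N) H)^N f(0), which is the usual discretisation of exp(γH) f(0); the slide's own formula is not preserved, so this form, the matrix `H` (adjacency minus degree matrix of a 4-node path), and the helper names are assumptions. The error measured here is the largest entry-wise gap, not the eigenvalue criterion quoted on the slide.

```python
def mat_vec(m, v):
    """Matrix-vector product for plain nested lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def approx_heat(h, f0, gamma, n_steps):
    """Apply (I + gamma/N * H) to f0, N times in a row."""
    f = list(f0)
    for _ in range(n_steps):
        hf = mat_vec(h, f)
        f = [fi + gamma / n_steps * hfi for fi, hfi in zip(f, hf)]
    return f


def exact_heat(h, f0, gamma, terms=60):
    """Taylor series for exp(gamma * H) f0 = sum_k (gamma * H)^k f0 / k!."""
    result = list(f0)
    term = list(f0)
    for k in range(1, terms):
        term = [gamma * x / k for x in mat_vec(h, term)]
        result = [r + t for r, t in zip(result, term)]
    return result


# Hypothetical H: adjacency minus degree matrix of the path 0 - 1 - 2 - 3.
H = [
    [-1.0, 1.0, 0.0, 0.0],
    [1.0, -2.0, 1.0, 0.0],
    [0.0, 1.0, -2.0, 1.0],
    [0.0, 0.0, 1.0, -1.0],
]
f0 = [1.0, 0.0, 0.0, 0.0]

exact = exact_heat(H, f0, gamma=1.0)
errors = {
    n: max(abs(a - e) for a, e in zip(approx_heat(H, f0, 1.0, n), exact))
    for n in (10, 30, 100)
}
print(errors)
```

The error shrinks roughly in proportion to 1/N, so a moderate N already gives a close approximation at a cost of N sparse matrix-vector products.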

Discussion on γ

  • γ can be understood as the thermal conductivity.
  • When γ = 0, the ranking value is most robust to manipulation since no heat is diffused, but the Web structure is completely ignored.
  • When γ = ∞, DiffusionRank becomes PageRank, which can be manipulated easily.
  • When γ = 1, DiffusionRank works well in practice.
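The two extremes of γ can be reproduced on a toy example. The sketch below assumes the diffusion is driven by R = P − I, where P is a column-stochastic link matrix, and approximates exp(γR) f(0) with small steps; this form of R, the 3-page matrix `P`, and the function name `diffusion_rank` are assumptions for illustration, not the slide's exact model.

```python
def diffusion_rank(p, f0, gamma, steps_per_unit=100):
    """Approximate exp(gamma * (P - I)) f0 with N small explicit steps."""
    n_steps = max(1, int(steps_per_unit * gamma))
    size = len(f0)
    f = list(f0)
    for _ in range(n_steps):
        pf = [sum(p[i][j] * f[j] for j in range(size)) for i in range(size)]
        f = [fi + (gamma / n_steps) * (pfi - fi) for fi, pfi in zip(f, pf)]
    return f


# Hypothetical column-stochastic matrix for a -> b, a -> c, b -> c, c -> a.
# Its stationary vector (the PageRank-style fixed point) is [0.4, 0.2, 0.4].
P = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
f0 = [1.0, 0.0, 0.0]  # all initial heat on the first page

none = diffusion_rank(P, f0, gamma=0.0)   # no diffusion: f0 is returned
mid = diffusion_rank(P, f0, gamma=1.0)    # partial diffusion
far = diffusion_rank(P, f0, gamma=50.0)   # close to the stationary vector
print(none, mid, far)
```

With γ = 0 the link structure is ignored entirely, with a large γ the initial heat placement is forgotten, and γ = 1 sits between the two.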

DiffusionRank: advantages

  • Can detect group-group relations
  • Can cut graphs
  • Anti-manipulation

[Figure labels: +1, -1; γ = 0.5 or 1]

DiffusionRank: experiments

  • Data:
      • a toy graph (6 nodes)
      • a middle-size real-world graph (18,542 nodes)
      • a large-size real-world graph crawled from CUHK (607,170 nodes)
  • Comparison with TrustRank and PageRank

Results

The tendency of DiffusionRank as γ becomes larger, on the toy graph

Anti-manipulation on the toy graph

Anti-manipulation on the middle graph and the large graph

Stability: the order difference between an algorithm's ranking results before it is manipulated and its results after manipulation.
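One simple way to quantify an order difference is to count the page pairs whose relative order flips between two rankings. The exact measure used in the talk is not given in this transcript, so the pairwise count below is an assumption, and the scores are made-up values for illustration.

```python
from itertools import combinations


def order_difference(scores_before, scores_after):
    """Count page pairs whose relative order flips between two score dicts."""
    pages = sorted(set(scores_before) & set(scores_after))
    flips = 0
    for a, b in combinations(pages, 2):
        before = scores_before[a] - scores_before[b]
        after = scores_after[a] - scores_after[b]
        # Opposite signs mean the two pages swapped places.
        if before * after < 0:
            flips += 1
    return flips


# Hypothetical scores before and after a manipulation that promotes p3.
before = {"p1": 0.40, "p2": 0.30, "p3": 0.20, "p4": 0.10}
after = {"p1": 0.25, "p2": 0.30, "p3": 0.35, "p4": 0.10}

print(order_difference(before, after))
```

A smaller count means the ranking is more stable under manipulation; identical rankings give zero.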

Conclusions

  • Its anti-manipulation feature makes DiffusionRank a candidate penicillin for Web spamming.
  • DiffusionRank is a generalization of PageRank (PageRank is the case γ = ∞).
  • DiffusionRank can be employed to detect group-group relations.
  • DiffusionRank can be used to cut graphs.