Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali

A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies

Bilbao, Spain, 30 Aug – 3 Sep

CERTH-ITI, AUTH

Page 2: A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies

Symeon Papadopoulos (CERTH-ITI, AUTH) 2

overview

• tag clustering / intro

• existing solutions - limitations

• hybrid graph clustering (HGC)

– core set detection

– (μ,ε)-space exploration

– core set expansion

• evaluation

• conclusions

tag clustering

• starting point: a folksonomy, i.e. the annotation scheme produced by the set of users, resources and tags of a social tagging system, e.g. delicious, flickr, BibSonomy (Mika, 2005)

• observation I: folksonomies are a direct encoding of users' views on how content items should be organized through a flexible annotation scheme

• observation II: tags used to describe the same resources tend to be related to each other (meaningful semantic association)
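Observation II suggests one common way of deriving a tag graph from a folksonomy: connect two tags whenever they annotate the same resource, and weight the edge by the number of resources they share. A minimal sketch, assuming the folksonomy is available as (user, resource, tag) triples; the function name and the toy data are illustrative and not taken from the talk:

```python
from collections import defaultdict
from itertools import combinations

def build_tag_graph(assignments):
    """Return {(tag_a, tag_b): weight}, where weight = number of shared resources."""
    tags_per_resource = defaultdict(set)
    for _user, resource, tag in assignments:
        tags_per_resource[resource].add(tag)

    edges = defaultdict(int)
    for tags in tags_per_resource.values():
        # every unordered pair of tags on the same resource co-occurs once
        for a, b in combinations(sorted(tags), 2):
            edges[(a, b)] += 1
    return dict(edges)

# Hypothetical tag assignments: (user, resource, tag)
toy = [
    ("u1", "r1", "python"), ("u1", "r1", "code"),
    ("u2", "r1", "python"), ("u2", "r2", "python"),
    ("u2", "r2", "code"), ("u3", "r2", "tutorial"),
]
graph = build_tag_graph(toy)
```

Here "code" and "python" share two resources, so their edge gets weight 2, while "tutorial" is linked to each of them with weight 1.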

Mika, P.: Ontologies are us: A unified model of social networks and semantics. ISWC 2005, LNCS 3729, 522-536, Springer-Verlag (2005)

why is tag clustering useful?

• information exploration and navigation (Begelman et al., 2006; Simpson, 2008)

• automatic content annotation (Brooks, 2006)

• user profiling (Gemmell, 2008)

• content clustering (Giannakidou, 2008)

• tag sense disambiguation (Au Yeung, 2009)

Giannakidou, E., Koutsonikola, V. A., Vakali, A., Kompatsiaris, Y.: Co-Clustering Tags and Social Data Sources. Proceedings of WAIM 2008: 9th International Conference on Web-Age Information Management. IEEE, 317-324 (2008)

Brooks, C. H., Montanez, N.: Improved annotation of the blogosphere via autotagging and hierarchical clustering. Proceedings of WWW '06: 15th international Conference on World Wide Web. ACM, New York, NY, 625-632 (2006)

Gemmell, J., Shepitsen, A., Mobasher, B., Burke, R.: Personalizing Navigation in Folksonomies Using Hierarchical Tag Clustering. Data Warehousing and Knowledge Discovery 5182, 196-205 (2008)

Begelman, G., Keller, P., Smadja, F.: Automated Tag Clustering: Improving search and exploration in the tag space. Online article: http://www.pui.ch/phred/automated_tag_clustering (2006)

Simpson, E.: Clustering Tags in Enterprise and Web Folksonomies. Technical Report HPL-2008-18 (2008)

Au Yeung, C. M., Gibbins, N., Shadbolt, N.: Contextualising Tags in Collaborative Tagging Systems. Proceedings of 20th ACM Conference on Hypertext and Hypermedia, pages 251-260, Turin, Italy, 29 June - 1 July, ACM (2009)

existing solutions (i) :: conventional clustering

• conventional clustering schemes represent tags in some feature space and employ a standard clustering method, e.g.:

– k-means (Giannakidou et al., 2008)

– hierarchical agglomerative clustering (HAC) (Brooks et al., 2006; Gemmell et al., 2008)

existing solutions (i) :: conventional clustering

• problems with conventional clustering

– the number of clusters needs to be defined in advance: very hard to even estimate it in large-scale tagging systems

– not easily scalable:

k-means (Lloyd's): O(I·C·n·D)

HAC: O(n² log n)

n: number of tags, I: number of iterations, C: number of clusters, D: number of dimensions

– HAC is hardly applicable since it requires O(n²) memory for storing the dissimilarity matrix
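To make the memory argument concrete, here is a rough back-of-the-envelope calculation. It assumes a condensed (upper-triangle) dissimilarity matrix of 8-byte floats; the tag counts are illustrative and are not the sizes of the datasets used in the talk:

```python
def hac_matrix_bytes(n_tags, bytes_per_value=8):
    """Bytes needed to store the n(n-1)/2 pairwise dissimilarities."""
    return n_tags * (n_tags - 1) // 2 * bytes_per_value

for n in (10_000, 100_000, 1_000_000):
    gib = hac_matrix_bytes(n) / 2**30
    print(f"{n:>9,} tags -> {gib:,.1f} GiB")
```

Under these assumptions the requirement grows from under half a GiB at 10,000 tags to several TiB at a million tags, which is why the quadratic term dominates long before running time does.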

existing solutions (ii) :: community detection

• use of community detection methods on tag graphs (derived from folksonomies) to find groups of tags that are more densely connected to each other than to the rest of the graph

• community detection methods (Begelman et al., 2006; Simpson, 2008; Au Yeung et al., 2009) largely address the shortcomings of conventional clustering schemes:

– efficient: complexity O(n log n)

– do not require the number of communities to be provided as input (typically use modularity maximization)

existing solutions (ii) :: community detection

• existing community detection schemes also suffer from problems

– modularity maximization typically leads to a highly skewed cluster size distribution (Simpson, 2008): few gigantic clusters and numerous small ones; gigantic clusters (containing even half of all objects) are not useful for IR

– not possible to leave noisy objects out of cluster structure

– not possible to have overlap among clusters (which is useful in the context of tag clustering)

hybrid graph clustering

• our solution is based on a structure-connected community detection approach (Xu et al., 2007) that builds on the concepts of structural similarity and (μ,ε)-cores:

– nodes of the graph are structurally similar when they have many neighbors in common

– a (μ,ε)-core is a node that has at least μ neighboring nodes with which it has structural similarity of at least ε

• extended in two ways:

– parameter space exploration removes the need for setting parameters

– core community expansion permits overlap among communities

Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: SCAN: A Structural Clustering Algorithm for Networks. Proceedings of KDD '07: 13th international Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, 824-833 (2007)

hybrid graph clustering

• hybrid scheme:

– (μ,ε)-core identification and structure-connected cluster extraction (original approach)

– (μ,ε)-parameter space exploration: makes the scheme completely parameter-free

– cluster expansion: increases coverage, permits overlap among clusters

structure connected cluster extraction

• structural similarity between nodes u, w on a graph G = (V, E):

• ε-neighborhood:

• (μ,ε)-core:

• direct structure reachability of w w.r.t. core u:

• cluster extraction (Xu et al., 2007): starting from a (μ,ε)-core node, grow the cluster to contain all nodes that are directly structure reachable from it, or reachable through a chain of nodes that are directly structure reachable from each other
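For reference, the definitions listed above can be written as follows, following SCAN (Xu et al., 2007). The notation (Γ, σ, N_ε, CORE, DirREACH) is that paper's and may differ from what the original slide showed:

```latex
\begin{align*}
\Gamma(u) &= \{\, w \in V \mid (u,w) \in E \,\} \cup \{u\} \\
\sigma(u,w) &= \frac{|\Gamma(u) \cap \Gamma(w)|}{\sqrt{|\Gamma(u)|\,|\Gamma(w)|}} \\
N_{\varepsilon}(u) &= \{\, w \in \Gamma(u) \mid \sigma(u,w) \ge \varepsilon \,\} \\
\mathrm{CORE}_{\mu,\varepsilon}(u) &\iff |N_{\varepsilon}(u)| \ge \mu \\
\mathrm{DirREACH}_{\mu,\varepsilon}(u,w) &\iff \mathrm{CORE}_{\mu,\varepsilon}(u) \,\wedge\, w \in N_{\varepsilon}(u)
\end{align*}
```

Note that Γ(u) is a closed neighborhood, so under this definition a node counts itself among its own ε-neighbors.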

structure connected cluster extraction

• edge labels denote structural similarity values between nodes

• blue nodes are (μ, ε)-cores for μ = 5 and ε = 0.65

• gray nodes are directly structure reachable from (μ, ε)-cores

• the remaining nodes are left out of the cluster structure
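The extraction step can be sketched in a few lines. This is a simplified illustration in the spirit of SCAN (Xu et al., 2007), not the authors' implementation: it assumes an unweighted, undirected graph stored as a dict of neighbor sets, uses closed neighborhoods for the similarity, and omits SCAN's hub/outlier labelling. The toy graph and the parameter values are made up for the example:

```python
import math
from collections import deque

def structural_similarity(adj, u, w):
    """Shared closed-neighborhood size, normalized by the geometric mean of sizes."""
    gu, gw = adj[u] | {u}, adj[w] | {w}
    return len(gu & gw) / math.sqrt(len(gu) * len(gw))

def eps_neighborhood(adj, u, eps):
    """Nodes of u's closed neighborhood whose similarity to u is at least eps."""
    return {w for w in adj[u] | {u} if structural_similarity(adj, u, w) >= eps}

def extract_clusters(adj, mu, eps):
    """Grow one cluster per unassigned core; unreachable nodes stay unassigned."""
    hood = {u: eps_neighborhood(adj, u, eps) for u in adj}
    cores = {u for u in adj if len(hood[u]) >= mu}
    assigned, clusters = set(), []
    for seed in sorted(cores):
        if seed in assigned:
            continue
        cluster, queue = {seed}, deque([seed])
        assigned.add(seed)
        while queue:
            u = queue.popleft()
            for w in hood[u]:          # w is directly structure reachable from core u
                if w in assigned:
                    continue
                assigned.add(w)
                cluster.add(w)
                if w in cores:         # only cores extend the chain further
                    queue.append(w)
        clusters.append(cluster)
    return clusters

# Toy graph: two 4-cliques joined by the bridge d-e, plus a pendant node x.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("e", "h"), ("f", "g"), ("f", "h"),
         ("g", "h"), ("a", "x")]
adj = {}
for u, w in edges:
    adj.setdefault(u, set()).add(w)
    adj.setdefault(w, set()).add(u)

clusters = extract_clusters(adj, mu=3, eps=0.7)
```

With these settings the two cliques come out as separate clusters: the bridge d-e has low structural similarity, and the pendant node x is left out of the cluster structure, mirroring the behavior described on the slide.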

parameter space exploration

• the original approach needs parameter setting, which is troublesome for complex datasets

• parameter interpretation:

– μ: a high value for μ will lead to fewer and larger clusters, since only nodes with degree of at least μ can be considered cores

– ε: a high value for ε will make the cluster extraction process stricter, i.e. fewer nodes will be assigned to clusters

• in fact, a single (μ,ε) parameter pair is unlikely to discover all interesting clusters

parameter space exploration

• search for clusters at multiple parameter pairs

• identify the highest quality clusters first (high μ, high ε), then proceed to less pronounced clusters

• exclude nodes that have already been assigned to a cluster from being re-assigned (makes the process faster)

• log-sampling along the μ axis for faster exploration
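The exploration loop can be sketched as follows. The slide does not give the grid itself, so the halving schedule for μ, the ε range and step, and the `extract` callback signature are all assumptions made for illustration:

```python
def mu_schedule(mu_max, mu_min=2):
    """Log-sampled mu values from mu_max down to mu_min (halving each step)."""
    values, mu = [], mu_max
    while mu > mu_min:
        values.append(mu)
        mu //= 2
    values.append(mu_min)
    return values

def eps_schedule(eps_max=0.9, eps_min=0.5, step=0.1):
    """Evenly spaced eps values from eps_max down to eps_min."""
    count = round((eps_max - eps_min) / step) + 1
    return [round(eps_max - i * step, 10) for i in range(count)]

def explore(nodes, extract, mu_max):
    """Run `extract` over the (mu, eps) grid, strictest settings first.

    `extract(mu, eps, candidates)` is a caller-supplied, hypothetical function
    returning a list of node sets found among `candidates`.
    """
    unassigned, found = set(nodes), []
    for mu in mu_schedule(mu_max):
        for eps in eps_schedule():
            for cluster in extract(mu, eps, unassigned):
                found.append((mu, eps, cluster))
                unassigned -= cluster   # assigned nodes are not revisited
    return found, unassigned
```

For example, `mu_schedule(16)` yields `[16, 8, 4, 2]`, so only four μ values are tried instead of fifteen. Shrinking `unassigned` after each hit is what makes later, looser passes cheaper.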

cluster expansion

• the original structure-connected approach may be too strict and thus leave too many nodes out of the clustering structure

• an expansion process attempts to mitigate this weakness

• for each extracted core cluster, a local expansion process is conducted that attaches neighboring nodes

• the expansion is based on a simple greedy maximization of a local cluster density measure called subgraph modularity (Luo et al., 2006)

• nodes with very high degree (in the top 10% of the degree distribution) are not considered in this process, in order to make the expansion more efficient
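A minimal sketch of such a greedy expansion, assuming subgraph modularity M(S) is the number of edges inside S divided by the number of edges leaving S, as in Luo et al. (2006). The `max_degree` argument stands in for the high-degree cutoff and is expected to be computed by the caller; the function names and the toy graph are illustrative:

```python
def subgraph_modularity(adj, members):
    """Edges inside `members` divided by edges leaving it."""
    inside = sum(len(adj[u] & members) for u in members) // 2
    outside = sum(len(adj[u] - members) for u in members)
    return float("inf") if outside == 0 else inside / outside

def expand(adj, core, max_degree):
    """Greedily attach the neighbor that most increases M(S); stop when none does."""
    members = set(core)
    while True:
        frontier = {w for u in members for w in adj[u]} - members
        frontier = {w for w in frontier if len(adj[w]) <= max_degree}
        best, best_score = None, subgraph_modularity(adj, members)
        for w in sorted(frontier):
            score = subgraph_modularity(adj, members | {w})
            if score > best_score:
                best, best_score = w, score
        if best is None:
            return members
        members.add(best)

# Toy graph: core triangle {1, 2, 3}, node 4 tied to all three, node 5 mostly external.
adj = {
    1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4, 5}, 4: {1, 2, 3},
    5: {3, 6, 7, 8}, 6: {5}, 7: {5}, 8: {5},
}
expanded = expand(adj, {1, 2, 3}, max_degree=10)
```

In this example node 4 is attached (M rises from 0.75 to 6.0) while node 5 is rejected because it would bring three new outgoing edges. Since each core cluster is expanded independently, the same node may be attached to more than one cluster, which is how overlap can arise.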

Luo, F., Wang, J. Z., Promislow, E.: Exploring Local Community Structures in Large Networks. Proceedings of the 2006 IEEE/WIC/ACM international Conference on Web Intelligence. IEEE Computer Society, 233-239 (2006)

cluster expansion

(a) before attaching node 11: M(S) = 1.429

(b) after attaching node 11: M(S) = 2.4

evaluation :: overview

goal: compare the quality of the tag clusters produced by our method (HGC) with those produced by state-of-the-art methods, namely:

(a) the modularity-maximization method by Clauset et al., 2004 (CNM)

(b) the original structure-connected graph clustering by Xu et al., 2007 (SCAN)

two kinds of evaluation:

• direct small-scale evaluation: subjective assessment of the produced tag clusters, by eyeballing whether tags belonging to the same cluster are related

• indirect large-scale evaluation: evaluate how useful the produced cluster structure is for some IR task, namely tag recommendation; if the tag clusters are good, the performance of tag recommendation based on them will be good as well

evaluation :: datasets

• three different folksonomy datasets of various sizes:

• resulting tag graphs (large component), described by average degree and average clustering coefficient

direct evaluation (i)

examples of unrelated tags placed in the same gigantic community by CNM

direct evaluation (ii)

examples of interesting HGC communities

indirect evaluation :: setup (i)

• process

– simple tag recommender based on tag clusters: (1) take an input tag, (2) find its containing community, (3) recommend the most frequent tags of the same community; a naïve technique, but fair for comparing the effectiveness of the tag cluster structures used

– the three competing tag cluster structures (CNM, SCAN, HGC) were used by the recommender

– historic tagging data were used as ground truth: for each user, one tag was used as input and the rest were considered the "correct" output

– very frequent tags (top 5%) were left out of this process so that trivial (very generic) recommendations would not mask the actual results
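The recommender described above is simple enough to sketch directly. The function name, the tie-breaking rule (alphabetical) and the toy clusters and counts are assumptions for illustration; because clusters may overlap, the sketch pools all clusters that contain the input tag:

```python
from collections import Counter

def recommend(tag, clusters, tag_counts, k=5):
    """Recommend the k most frequent other tags from the cluster(s) containing `tag`."""
    candidates = set()
    for cluster in clusters:
        if tag in cluster:
            candidates |= cluster
    candidates.discard(tag)
    # most frequent first; ties broken alphabetically for determinism
    ranked = sorted(candidates, key=lambda t: (-tag_counts[t], t))
    return ranked[:k]

# Hypothetical cluster structure and tag frequencies
clusters = [{"python", "code", "programming"}, {"travel", "photo"}]
tag_counts = Counter({"python": 40, "code": 25, "programming": 30,
                      "travel": 12, "photo": 18})

top = recommend("python", clusters, tag_counts, k=2)
```

A tag that belongs to no cluster simply yields an empty list, so how many nodes a clustering leaves unassigned directly limits how often the recommender can answer.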

indirect evaluation :: setup (ii)

• measures

– RTP: number of correct recommendations per recommender instance

– UTP: number of unique correct recommendations

– P: precision, i.e. ratio of correct recommendations over total recommendations per recommender instance

– R: recall, i.e. ratio of correct recommendations of a recommender instance over all correct tags according to ground truth

– F-measure

– P@1, P@5: Precision in the top-1/top-5 recommendations
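The per-instance measures can be computed as in the sketch below. The slide does not say which F-measure variant was used, so the balanced F1 is assumed here, and P@k is assumed to divide by k even when fewer than k tags are recommended:

```python
def precision_recall_f(recommended, relevant):
    """Precision, recall and balanced F-measure for one recommender instance."""
    relevant = set(relevant)
    correct = sum(1 for t in recommended if t in relevant)
    precision = correct / len(recommended) if recommended else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k slots filled with correct recommendations."""
    relevant = set(relevant)
    return sum(1 for t in recommended[:k] if t in relevant) / k

# Hypothetical instance: 4 recommended tags, 3 ground-truth tags, 2 in common
p, r, f = precision_recall_f(["a", "b", "c", "d"], {"a", "c", "e"})
```

For this instance P = 0.5, R = 2/3 and F = 4/7; averaging such values over all recommender instances gives figures comparable across cluster structures.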

indirect evaluation :: results

• for SCAN, we used the (μ,ε)-pair that yielded the highest F-measure

• both SCAN and HGC perform considerably better than CNM

• HGC results in more unique correct recommendations and higher recall

• the cluster expansion step was responsible for the largest increase in recall and the corresponding drop in precision

conclusion: given the task and the evaluation setup, we would prefer HGC, since: (a) it is parameter-free, (b) it leads to more correct recommendations

conclusions

contributions:

• efficient tag clustering scheme that addresses several shortcomings of previous approaches

– no need for setting the number of clusters

– no gigantic communities

– noisy tags left out of the cluster structure

– possibility for overlap among communities

caveats:

• despite being efficient compared to conventional clustering schemes, the method is still much slower than the original SCAN (Xu et al., 2007)

• the fact that previously assigned nodes are not taken into account when a new (μ,ε) pair is explored distorts the actual clustering results

future work:

• investigate means of making parameter exploration more efficient

• evaluate the value of permitting overlap among communities