A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies

  • date post

    11-May-2015
  • Category

    Technology

description

Conference presentation. Full reference: S. Papadopoulos, Y. Kompatsiaris, A. Vakali. “A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies”. In Proceedings of DaWaK'10, 12th International Conference on Data Warehousing and Knowledge Discovery (Bilbao, Spain), Springer-Verlag, 65-76

Transcript of A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies

1. A Graph-based Clustering Scheme for Identifying Related Tags in Folksonomies
  • Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali
  • Bilbao, Spain, 30 Aug - 3 Sep
  • CERTH-ITI, AUTH

2. overview
  • tag clustering / intro
  • existing solutions - limitations
  • hybrid graph clustering (HGC)
    • core set detection
    • (μ, ε)-space exploration
    • core set expansion
  • evaluation
  • conclusions
  Symeon Papadopoulos (CERTH-ITI, AUTH)

4. tag clustering
  • starting point: folksonomy, i.e. the annotation scheme produced by the set of users, resources and tags of a social tagging system, e.g. delicious, flickr, BibSonomy (Mika, 2005)
  • observation I: folksonomies are a direct encoding of the views of users on how content items should be organized through a flexible annotation scheme
  • observation II: tags used to describe the same resources are related to each other (meaningful semantic association)
  Mika, P.: Ontologies are us: A unified model of social networks and semantics. ISWC 2005, LNCS 3729, 522-536, Springer-Verlag (2005)

5. why is tag clustering useful?
  • information exploration and navigation (Begelman et al., 2006; Simpson, 2008)
  • automatic content annotation (Brooks, 2006)
  • user profiling (Gemmell, 2008)
  • content clustering (Giannakidou, 2008)
  • tag sense disambiguation (Au Yeung, 2009)
  Begelman, G., Keller, P., Smadja, F.: Automated Tag Clustering: Improving search and exploration in the tag space. Online article: http://www.pui.ch/phred/automated_tag_clustering (2006)
  Simpson, E.: Clustering Tags in Enterprise and Web Folksonomies. Technical Report HPL-2008-18 (2008)
  Au Yeung, C. M., Gibbins, N., Shadbolt, N.: Contextualising Tags in Collaborative Tagging Systems. Proceedings of 20th ACM Conference on Hypertext and Hypermedia, pages 251-260, Turin, Italy, 29 June - 1 July, ACM (2009)
  Giannakidou, E., Koutsonikola, V. A., Vakali, A., Kompatsiaris, Y.: Co-Clustering Tags and Social Data Sources. Proceedings of WAIM 2008: 9th International Conference on Web-Age Information Management. IEEE, 317-324 (2008)
  Brooks, C. H., Montanez, N.: Improved annotation of the blogosphere via autotagging and hierarchical clustering. Proceedings of WWW 06: 15th International Conference on World Wide Web. ACM, New York, NY, 625-632 (2006)
  Gemmell, J., Shepitsen, A., Mobasher, B., Burke, R.: Personalizing Navigation in Folksonomies Using Hierarchical Tag Clustering. Data Warehousing and Knowledge Discovery 5182, 196-205 (2008)

7. existing solutions (i) :: conventional clustering
  • conventional clustering schemes represent tags in some feature space and employ a standard clustering method, e.g.:
    • k-means (Giannakidou et al., 2008)
    • hierarchical agglomerative clustering (HAC) (Brooks et al., 2006; Gemmell et al., 2008)

8. existing solutions (i) :: conventional clustering
  • problems with conventional clustering:
    • needs the number of clusters to be defined: very hard to even estimate in large-scale tagging systems
    • not easily scalable:
      • k-means (Lloyd's): O(I C n D)
      • HAC: O(n^2 log n)
      • n: number of tags, I: number of iterations, C: number of clusters, D: number of dimensions
    • HAC is hardly applicable since it requires n^2 memory for storing the dissimilarity matrix

9. existing solutions (ii) :: community detection
  • use of community detection methods on tag graphs (derived from folksonomies) to find groups of tags that are more densely connected to each other than to the rest of the graph
  • community detection methods largely address the shortcomings of conventional clustering schemes (Begelman et al., 2006; Simpson, 2008; Au Yeung et al., 2009):
    • efficient: complexity O(n log n)
    • do not require the number of communities to be provided as input (typically use modularity maximization)

10. existing solutions (ii) :: community detection
  • existing community detection schemes also suffer from problems:
    • modularity maximization typically leads to a highly skewed cluster size distribution (Simpson, 2008): few gigantic clusters and numerous small ones
    • gigantic clusters (representing even half the number of objects) are not useful for IR
    • not possible to leave noisy objects out of the cluster structure
    • not possible to have overlap among clusters (which is useful in the context of tag clustering)

12. hybrid graph clustering
  • our solution is based on a structure-connected community detection approach (Xu et al., 2007) that builds on the concepts of structural similarity and (μ, ε)-cores:
    • nodes on the graph are structurally similar when they have many neighbors in common
    • a (μ, ε)-core is a node that has at least μ neighboring nodes with which it has structural similarity of at least ε
  • extended in two ways:
    • parameter space exploration removes the need for setting parameters
    • core community expansion permits overlap among communities
  Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: SCAN: A Structural Clustering Algorithm for Networks. Proceedings of KDD 07: 13th International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, 824-833 (2007)

13. hybrid graph clustering
  • hybrid scheme:
    • (μ, ε)-core identification and structure-connected cluster extraction (original approach)
    • (μ, ε)-parameter space exploration: makes the scheme completely parameter-free
    • cluster expansion: increases coverage, permits overlap among clusters

14. structure connected cluster extraction
  • definitions for nodes u, w on a graph G = {V, E} (Xu et al., 2007):
    • structural similarity between u and w
    • ε-neighborhood
    • (μ, ε)-core
    • direct structure reachability of w w.r.t. core u
  • cluster extraction (Xu et al., 2007): starting from a (μ, ε)-core node, grow the cluster to contain all nodes that are directly structure reachable from it, or reachable through a chain of nodes that are directly structure reachable from each other

15. structure connected cluster extraction
  • edge labels denote structural similarity values between nodes
  • blue nodes are (μ, ε)-cores for μ = 5 and ε = 0.65
  • gray nodes are directly structure reachable from (μ, ε)-cores
  • the rest of the nodes are left out of the cluster structure

16. parameter space exploration
  • the original approach needs parameter setting, which is troublesome for complex datasets
  • parameter interpretation:
    • μ: a high value for μ will lead to fewer and larger clusters, i.e. only nodes with degree of at least μ will be considered to be cores
    • ε: a high value for ε will make the cluster extraction process stricter, i.e. fewer nodes will be assigned to clusters
  • in fact, a single (μ, ε) parameter pair is unlikely to discover all interesting clusters

17. parameter space exploration
  • search for clusters at multiple parameter pairs
  • identify the highest quality clusters (high μ, high ε), then proceed to less profound clusters
  • exclude nodes that have already been assigned to a cluster from being re-assigned: makes the process faster
  • log-sampling along one parameter axis for faster exploration

18. cluster expansion
  • the original structure connected approach may be too strict and thus leave too many nodes out of the clustering structure
  • an expansion process attempts to mitigate this weakness
  • for each extracted core cluster, a local expansion process is conducted that attaches neighboring nodes
  • the expansion is based on a simple greedy maximization of a local cluster density measure called subgraph modularity (Luo et al., 2006)
  • nodes with very high degree (belonging to the top 10 percentile of the degree distribution) are not considered in this process, in order to make the expansion process more efficient
  Luo, F., Wang, J. Z., Promislow, E.: Exploring Local Community Structures in Large Networks. Proceedings o
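
The greedy expansion described on the cluster expansion slide could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation. Subgraph modularity is assumed here to be the ratio of edges inside the cluster to edges leaving it, which is how the measure is usually attributed to Luo et al. The function names and the `max_degree` cutoff (a stand-in for the top-10-percentile degree filter) are hypothetical.

```python
def subgraph_modularity(adj, nodes):
    """Assumed definition: (edges inside `nodes`) / (edges leaving `nodes`)."""
    inside = 0
    outside = 0
    for u in nodes:
        for v in adj[u]:
            if v in nodes:
                inside += 1   # every internal edge is counted from both ends
            else:
                outside += 1
    inside //= 2
    return float("inf") if outside == 0 else inside / outside


def expand_cluster(adj, core, max_degree=None):
    """Greedily attach the neighbor that most increases subgraph modularity.

    adj        -- dict mapping each node to the set of its neighbors
    core       -- iterable of nodes forming the extracted core cluster
    max_degree -- hypothetical cutoff: nodes with a higher degree are skipped
    """
    cluster = set(core)
    while True:
        current = subgraph_modularity(adj, cluster)
        frontier = {v for u in cluster for v in adj[u]} - cluster
        best_gain, best_node = 0.0, None
        for v in frontier:
            if max_degree is not None and len(adj[v]) > max_degree:
                continue  # leave very high degree nodes out of the expansion
            gain = subgraph_modularity(adj, cluster | {v}) - current
            if gain > best_gain:
                best_gain, best_node = gain, v
        if best_node is None:
            return cluster  # no neighbor improves the measure any further
        cluster.add(best_node)


# Toy graph: triangle a-b-c, node d attached to a and b, hub e behind d.
example_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d", "f", "g", "h"},
    "f": {"e"},
    "g": {"e"},
    "h": {"e"},
}
expanded = expand_cluster(example_adj, {"a", "b", "c"})
```

On this toy graph the core {a, b, c} absorbs d, because that raises the ratio from 3/2 to 5/1, while e is rejected because attaching it would add three outgoing edges. Since each core cluster is expanded independently, the same node may be attached to more than one cluster, which is how the scheme permits overlap.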