The Number of Groups in an Aggregated Approach in Taxonomy with the Use of Stability Measures and Classical Indices – A Comparative Analysis
DOI:
https://doi.org/10.18778/0208-6018.357.04
Keywords:
taxonomy, clustering, cluster ensemble, cluster stability
Abstract
Two concepts that have recently received much attention in the literature on taxonomy are the cluster ensemble and cluster stability. Șenbabaoğlu, Michailidis, and Li proposed an interesting combination of the two: a stability measure, the proportion of ambiguously clustered pairs (PAC), used to select the optimal number of groups in a cluster ensemble. The proposal originated in genetic research, but, as the authors themselves note, the method can also be applied successfully in other research areas.
The aim of this paper is to compare the number of clusters (the k parameter) indicated in the aggregated approach in taxonomy by the above-mentioned stability measure with the number indicated by classical indices (e.g. Caliński–Harabasz, Dunn, Davies–Bouldin).
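As a rough illustration of the PAC idea mentioned above, the sketch below builds a consensus (co-association) value for every pair of objects from several partitions and counts the pairs whose value is neither close to 0 nor close to 1. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names are hypothetical, the label vectors are made-up toy data, the bounds 0.1 and 0.9 are assumed defaults, and every run is assumed to cluster all objects (resampling-based variants would count only the runs in which both objects were drawn).

```python
from itertools import combinations


def consensus_values(partitions):
    """Map each pair of objects to the fraction of partitions
    in which the two objects fall into the same cluster."""
    n_objects = len(partitions[0])
    values = {}
    for i, j in combinations(range(n_objects), 2):
        together = sum(1 for labels in partitions if labels[i] == labels[j])
        values[(i, j)] = together / len(partitions)
    return values


def pac(consensus, lower=0.1, upper=0.9):
    """Proportion of pairs whose consensus value lies strictly
    between the bounds (assumed defaults: 0.1 and 0.9)."""
    ambiguous = sum(1 for v in consensus.values() if lower < v < upper)
    return ambiguous / len(consensus)


def select_k(partitions_by_k, lower=0.1, upper=0.9):
    """Return the k with the lowest PAC together with all PAC scores."""
    scores = {
        k: pac(consensus_values(parts), lower, upper)
        for k, parts in partitions_by_k.items()
    }
    return min(scores, key=scores.get), scores


# Toy ensemble (made-up labels): four runs for k = 2 and four runs for k = 3.
partitions_by_k = {
    2: [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]],
    3: [[0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 2, 2],
        [0, 1, 1, 2, 2, 2], [0, 0, 1, 1, 1, 2]],
}

best_k, scores = select_k(partitions_by_k)
print(best_k, scores)
```

In this toy case the k = 2 runs agree up to a relabelling of the groups, so no pair is ambiguous, whereas the k = 3 runs disagree on several pairs. A lower PAC is read as a more stable partition.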
References
Aldenderfer M.S., Blashfield R.K. (1984), Cluster analysis, Sage, Beverly Hills.
Anderberg M.R. (1973), Cluster analysis for applications, Academic Press, New York–San Francisco–London.
Ben-Hur A., Guyon I. (2003), Detecting stable clusters using principal component analysis, “Methods in Molecular Biology”, no. 224, pp. 159–182.
Brock G., Pihur V., Datta S., Datta S. (2008), clValid: an R package for cluster validation, “Journal of Statistical Software”, vol. 25(4), pp. 1–22, https://doi.org/10.18637/jss.v025.i04
Caliński R.B., Harabasz J. (1974), A dendrite method for cluster analysis, “Communications in Statistics”, vol. 3, pp. 1–27.
Chiu D.S., Talhouk A. (2018), diceR: an R package for class discovery using an ensemble driven approach, “BMC Bioinformatics”, no. 19, 11, https://doi.org/10.1186/s12859-017-1996-y
Davies D.L., Bouldin D.W. (1979), A Cluster Separation Measure, “IEEE Transactions on Pattern Analysis and Machine Intelligence”, vol. 1(2), pp. 224–227.
Dudoit S., Fridlyand J. (2003), Bagging to improve the accuracy of a clustering procedure, “Bioinformatics”, vol. 19(9), pp. 1090–1099.
Dunn J.C. (1974), Well-Separated Clusters and Optimal Fuzzy Partitions, “Journal of Cybernetics”, vol. 4(1), pp. 95–104.
Eurostat (2019), Database, https://ec.europa.eu/eurostat/web/main/data/database (accessed: 20.11.2021).
Everitt B.S., Landau S., Leese M. (2001), Cluster analysis, Edward Arnold, London.
Fang Y., Wang J. (2012), Selection of the number of clusters via the bootstrap method, “Computational Statistics and Data Analysis”, no. 56, pp. 468–477.
Fred A., Jain A.K. (2002), Data clustering using evidence accumulation, “Proceedings of the Sixteenth International Conference on Pattern Recognition”, pp. 276–280.
Gordon A.D. (1987), A review of hierarchical classification, “Journal of the Royal Statistical Society”, ser. A, pp. 119–137.
Gordon A.D. (1996), Hierarchical classification, [in:] P. Arabie, L.J. Hubert, G. de Soete (eds.), Clustering and classification, World Scientific, Singapore, pp. 65–121.
Hennig C. (2007), Cluster-wise assessment of cluster stability, “Computational Statistics and Data Analysis”, no. 52, pp. 258–271.
Hornik K. (2005), A CLUE for CLUster ensembles, “Journal of Statistical Software”, no. 14, pp. 65–72.
Kaufman L., Rousseeuw P.J. (1990), Finding groups in data: an introduction to cluster analysis, Wiley, New York.
Kuncheva L.I., Vetrov D.P. (2006), Evaluation of stability of k-means cluster ensembles with respect to random initialization, “IEEE Transactions on Pattern Analysis & Machine Intelligence”, vol. 28(11), pp. 1798–1808.
Leisch F. (1999), Bagged clustering, “Adaptive Information Systems and Modeling in Economics and Management Science”, Working Papers, SFB, no. 51.
Lord E., Willems M., Lapointe F.J., Makarenkov V. (2017), Using the stability of objects to determine the number of clusters in datasets, “Information Sciences”, no. 393, pp. 29–46.
Marino V., Presti L.L. (2019), Stay in touch! New insights into end-user attitudes towards engagement platforms, “Journal of Consumer Marketing”, no. 36, pp. 772–783.
Monti S., Tamayo P., Mesirov J., Golub T. (2003), Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data, “Machine Learning”, no. 52, pp. 91–118.
Șenbabaoğlu Y., Michailidis G., Li J.Z. (2014), Critical limitations of consensus clustering in class discovery, “Scientific Reports”, no. 4, 6207, https://doi.org/10.1038/srep06207
Shamir O., Tishby N. (2008), Cluster stability for finite samples, “Advances in Neural Information Processing Systems”, no. 20, pp. 1297–1304.
Sokołowski A. (1995), Percentage points of the similarity measure for partitions, “Statistics in Transition”, vol. 2(2), pp. 195–199.
Suzuki R., Shimodaira H. (2006), Pvclust: an R package for assessing the uncertainty in hierarchical clustering, “Bioinformatics”, vol. 22(12), pp. 1540–1542.
Volkovich Z., Barzily Z., Toledano-Kitai D., Avros R. (2010), The Hotteling’s metric as a cluster stability index, “Computer Modelling and New Technologies”, vol. 14(4), pp. 65–72.