The Number of Clusters in Binary Data Sets
Keywords: clustering algorithms, number of clusters, binary data
Abstract: Partitioning a given data set into homogeneous groups is a widely used method in exploratory data analysis and data mining. Despite the variety of clustering algorithms found in the statistical and neural network literature, a central question remains unsolved: how many clusters are there in a data set? The most commonly used method for finding the number of clusters is to compute certain index measures for solutions with different numbers of clusters.
Binary data sets are found frequently in data mining problems, for example in the analysis of questionnaires. In our presentation we compare and analyze the performance of several index measures on binary data sets. The indexes presented include some that are well known in the literature and performed well in previously published studies (on non-binary data), as well as some new indexes. To evaluate performance, we use artificial binary data sets with known properties, generated to resemble real-world data sets, together with some empirical data sets.
It turns out that none of the indexes found thus far in the literature performs satisfactorily in every situation. Thus, a combination of several indexes seems to be necessary.
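The index-based procedure described in the abstract can be illustrated with a minimal sketch. It is not the study's actual setup: the abstract does not name the indexes or the clustering algorithm, so k-means and the Calinski-Harabasz index are assumptions chosen here for illustration, and the prototype-plus-noise generator (`generate_binary_data`, `flip_prob`) is a hypothetical stand-in for the artificial binary data sets.

```python
import random


def generate_binary_data(prototypes, n_per_cluster, flip_prob, rng):
    """Hypothetical generator: noisy copies of each binary prototype."""
    data = []
    for proto in prototypes:
        for _ in range(n_per_cluster):
            # Flip each bit independently with probability flip_prob.
            data.append([bit ^ (rng.random() < flip_prob) for bit in proto])
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(data, k, rng, n_iter=50):
    """Plain k-means (an assumed algorithm; the abstract names none)."""
    centers = [list(map(float, p)) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j]))
                  for x in data]
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return labels, centers


def calinski_harabasz(data, labels):
    """Calinski-Harabasz index: between/within variance ratio."""
    n = len(data)
    groups = {}
    for x, lab in zip(data, labels):
        groups.setdefault(lab, []).append(x)
    k = len(groups)  # count only non-empty clusters
    if k < 2 or k >= n:
        return 0.0
    grand = mean(data)
    between = 0.0
    within = 0.0
    for members in groups.values():
        center = mean(members)
        between += len(members) * sq_dist(center, grand)
        within += sum(sq_dist(x, center) for x in members)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


rng = random.Random(0)
prototypes = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]
data = generate_binary_data(prototypes, n_per_cluster=50, flip_prob=0.1, rng=rng)

# Compute the index for solutions with different numbers of clusters.
scores = {}
for k in range(2, 7):
    labels, _ = kmeans(data, k, rng)
    scores[k] = calinski_harabasz(data, labels)

best_k = max(scores, key=scores.get)
for k in sorted(scores):
    print(k, round(scores[k], 2))
print("suggested number of clusters:", best_k)
```

A single index evaluated on a single k-means run can be misled by a poor local optimum, which is consistent with the abstract's conclusion that several indexes may need to be combined.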