Enhancing K-Means Clustering with Post-Redistribution

ABSTRACT


INTRODUCTION
Clustering is a widely used unsupervised learning technique applied across many domains, including market segmentation [1], scientific discovery [2], and recommendation systems [3]. K-means is widely favored as a clustering algorithm due to its simplicity and efficiency [4]. However, K-means can get stuck in a local optimum, where it converges to a suboptimal solution. The algorithm tries to minimize the sum of squared distances between the samples and their cluster centers (inertia, or SSE), but it can stop at a local optimum, producing cluster assignments that are not the best overall solution. This characteristic should be considered when employing K-means for clustering tasks.
To overcome this limitation, several techniques and variations of K-means have been proposed. Researchers often follow two broad approaches in clustering analysis. The first is determining the optimal number of clusters (k), which is crucial for capturing the inherent structure of the data [5]. The second is obtaining a balanced population distribution within clusters, which improves robustness by lowering the effect of noise and outliers and leads to more reliable clustering results [6]. Balancing representation improves cluster quality by addressing biases in imbalanced datasets, ensuring fair representation of all clusters, and preventing minority groups from being underrepresented [7]. This population equilibrium not only enhances robustness but also makes clustering algorithms generalize better [6,7], so the results can be used in real-world situations beyond the training dataset.
For example, in customer segmentation, a few dominant customer segments may emerge while others remain underrepresented. In network analysis, a handful of highly connected nodes can skew the population distribution across clusters. Such imbalances pose challenges when analyzing relative cluster importance and prevalence. They can also obscure useful patterns within underpopulated groups.
In this context, inter-dependence and intra-dependence aim to balance compactness within clusters and separation between clusters while achieving a balanced population distribution within clusters [8]. The inter-dependence approach maximizes cluster separation using metrics like the silhouette score [9] and the Davies-Bouldin index [10]. The intra-dependence approach ensures internal cohesion, minimizing intra-cluster variance. Regardless of the approach, a balanced population distribution within clusters is essential to prevent biased results. The Gini coefficient [11] is a valuable metric for achieving this balance.
Building on the many existing ideas for enhancing the clustering quality of the K-means algorithm, we propose an enhanced version of K-means that introduces a novel post-processing redistribution step. This step is designed to address the issue of cluster imbalance and improve the overall quality and compactness of the clusters. By adding the diameter-based post-processing redistribution technique, we were able to lower the evaluation metrics used to judge clustering performance. This enhancement improved the balance within the clusters, leading to better-organized and more tightly grouped data points within each cluster.
The remainder of this paper is structured as follows. Section 2 provides an overview of the preliminary concepts and background information relevant to our study. Section 3 presents a detailed analysis of the algorithm that we have developed. The experimental results and discussion are outlined in Sections 4 and 5, and a brief conclusion is presented in Section 6.

K-means algorithm
The K-means algorithm begins by randomly selecting K centers. To calculate the distance between a sample $x_j$ and a center $c_i$, the Euclidean distance formula is used:

$$\lVert x_j - c_i \rVert = \sqrt{\sum_{l=1}^{d} (x_{jl} - c_{il})^2}$$

Here, d represents the dimensionality of the samples. Next, each sample is assigned to the cluster center that is closest to it. In the subsequent step, the cluster centers are updated using the mean of the samples assigned to each cluster:

$$c_i = \frac{1}{m_i} \sum_{x_j \in C_i} x_j$$

where $C_i$ is the set of samples assigned to the center $c_i$ and $m_i$ represents the total number of samples belonging to that cluster. The distances between the samples and cluster centers are recalculated, and this process is repeated until the algorithm converges.
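The two alternating steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `kmeans`, the `max_iter` and `seed` parameters, and the handling of empty clusters are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Start from K distinct samples chosen at random as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned samples.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster simply keeps its old center.
                new_centers.append(centers[i])
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, labels
```

Because the initial centers are random, different seeds can converge to different local optima, which is the limitation the proposed redistribution step targets.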

The Davies-Bouldin Index (DBI)
The Davies-Bouldin Index (DBI) is a measure used to evaluate the quality of clustering results. It quantifies the average similarity between clusters, taking into account both the inter-cluster and intra-cluster distances. A lower DBI value indicates better clustering results.
The formula for calculating the Davies-Bouldin Index for a set of clusters is as follows:

$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{SSW_i + SSW_j}{SSB_{i,j}}$$

where DBI is the Davies-Bouldin Index, K is the total number of clusters, $SSW_i$ (Sum of Squares Within a cluster) is the cohesion metric of cluster i, and $SSB_{i,j}$ (Sum of Squares Between clusters) is the separation metric between clusters i and j.
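As an illustration, the index can be computed directly from a partition. The sketch below assumes the common convention in which the within-cluster term is the mean distance of the members to their centroid and the between-cluster term is the distance between centroids; the helper names `centroid` and `davies_bouldin` are hypothetical.

```python
import math


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def davies_bouldin(clusters):
    """clusters: list of at least two non-empty lists of points (tuples)."""
    centers = [centroid(m) for m in clusters]
    # Within-cluster scatter: mean distance of members to their centroid.
    scatter = [
        sum(math.dist(p, c) for p in m) / len(m)
        for m, c in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i with any other cluster j.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

For two clusters with scatter 1 each and centroids 10 apart, the ratio is (1 + 1) / 10, so the index is 0.2; tighter or more distant clusters give a lower value.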

The Gini coefficient
The Gini coefficient is a statistical measure that is used to represent the level of income or wealth inequality within a population. It was developed by Italian statistician Corrado Gini in 1912. The coefficient ranges between 0 and 1, where 0 represents perfect equality (everyone has the same income or wealth) and 1 represents perfect inequality (one individual or household possesses all the income or wealth, while others have none).
Mathematically, the Gini coefficient can be expressed as:

$$G = \frac{A}{B}$$

where A is the area between the Lorenz curve (a graphical representation of income distribution) and the line of perfect equality, and B is the area under the line of perfect equality.
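For a finite list of values, such as the number of points in each cluster, the same quantity can be computed without drawing the Lorenz curve. The sketch below uses the standard rank-weighted formula for sorted values; applying it to cluster sizes, and the function name `gini`, are assumptions made for illustration.

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for degenerate input
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

Equal cluster sizes give 0, while concentrating all points in one of n clusters gives (n - 1) / n, the largest value attainable for a finite list.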

THE PROPOSED METHODOLOGY
Given the abundance of promising approaches for addressing the challenge of escaping local optima, our paper introduces an enhanced K-means algorithm that incorporates a post-clustering redistribution technique. The contribution of our work lies in proposing this redistribution method as a means to tackle the problem effectively.
The proposed algorithm, shown in Figure 1, is a modified version of the K-means algorithm with iterative refinement. The standard K-means algorithm is an iterative clustering algorithm that aims to partition a given dataset into K clusters, where each data point belongs to the cluster with the nearest mean.
In the enhanced K-means process, we begin with an initial value of K and apply the standard K-means algorithm with K clusters to the given dataset. After this initial step, we have two variants for refining the cluster formation.
In the first variant, SSE-Based Cluster Splitting, denoted SSE-SPLITTING_KMEANS, we calculate the SSE (sum of squared errors) for each cluster:

$$SSE = \sum_{i} \lVert x_i - \bar{x} \rVert^2$$

where $x_i$ denotes each individual data point and $\bar{x}$ represents the mean (average) of all the data points. This helps us identify the cluster with the highest SSE. We then take the $n_{c_i} - n_{c_i}/r$ points closest to the center of this cluster $c_i$, and these points define the first cluster. The remaining points are considered residual points.
For the second variant, Iterative Diameter-Based K-Means, denoted DIAMETER_KMEANS, we focus on the diameter of each cluster. Within each cluster, we sort the points by their distance from the center, from the nearest point to the farthest one. Next, we take the $n_{c_i} - n_{c_i}/r$ points nearest to the center and calculate the new center of each cluster $c_i$ based on these selected points. Subsequently, we take the residual points and determine, for each cluster, the point that has the maximum distance to the new center. The cluster that minimizes the distances between these maximum distances is then selected as the target cluster. We repeat the same operation as in the first variant for this selected cluster.
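The split into core and residual points, shared by both variants, can be sketched as follows. The function names are hypothetical, and truncating $n_{c_i}/r$ to an integer is an assumption, since the text does not state how a fractional count is handled.

```python
import math


def sse(members, center):
    """Sum of squared Euclidean distances from the members to the center."""
    return sum(math.dist(p, center) ** 2 for p in members)


def split_core_residual(members, center, r):
    """Keep the n - n/r points nearest to the center; the rest are residual."""
    n = len(members)
    keep = n - int(n / r)  # assumption: n/r is truncated to an integer
    ordered = sorted(members, key=lambda p: math.dist(p, center))
    return ordered[:keep], ordered[keep:]


def worst_cluster_by_sse(clusters, centers):
    """Index of the cluster with the highest SSE (first variant's target)."""
    return max(range(len(clusters)), key=lambda i: sse(clusters[i], centers[i]))
```

With four points and r = 2, two points are kept and two become residual; a larger r keeps more points in the core, which is why the choice of r shapes the final partition.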
The next step involves merging the residual points from the selected cluster with the other clusters. This merging follows a combined criterion that optimizes both SSE and diameter; in other words, each residual point is assigned to the cluster that minimizes the increase in SSE and diameter. The algorithm thus refines the cluster structure by integrating residual points into clusters that remain compact and whose spatial spread stays controlled. As a result, we reduce the number of clusters by setting K = K - 1. We then apply the K-means algorithm again, this time on the combined points from the residual clusters and merged points (residual points) with the updated value of K.
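The SSE part of this assignment rule can be illustrated with a short sketch. It relies on the fact that adding a point p to a cluster of m points with centroid c raises that cluster's SSE by m/(m+1) times the squared distance from p to c. The diameter term is omitted because the text does not specify how the two criteria are combined, and the function names are hypothetical.

```python
import math


def sse_increase(point, center, size):
    """SSE growth when `point` joins a cluster of `size` points at `center`."""
    return size / (size + 1) * math.dist(point, center) ** 2


def assign_residuals(residuals, clusters):
    """Greedily add each residual point to the cluster whose SSE grows least.

    clusters: list of non-empty lists of points, modified in place.
    """
    for p in residuals:

        def cost(i):
            members = clusters[i]
            # Recompute the centroid so earlier assignments are reflected.
            center = tuple(sum(v) / len(members) for v in zip(*members))
            return sse_increase(p, center, len(members))

        best = min(range(len(clusters)), key=cost)
        clusters[best].append(p)
    return clusters
```

Points are processed one at a time, so the result can depend on the order of the residual points; this greedy ordering is a simplification of the sketch.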
We continue iterating through the previous steps until K becomes 0, meaning all clusters have been merged. This iterative process leads to a more refined and optimized clustering solution for the given dataset.
The algorithm aims to enhance the overall clustering quality by merging the data points that exhibit the highest distance within a cluster. The choice of the r value can significantly impact the clustering results.
In the rest of this section, we explain the reasons for using SSE and diameter for redistribution. We also discuss the metrics used to evaluate the proposed enhancements and analyze the complexity of the enhanced K-means algorithm.

Rationale for SSE and diameter criteria in redistribution
The use of SSE (Sum of Squared Errors) and diameter as criteria for redistribution in clustering algorithms is rooted in their distinct advantages and complementary roles in evaluating cluster quality. SSE, by measuring the compactness of clusters, encourages the formation of tightly knit groups, ensuring that data points are closely associated with their respective centroids. This metric offers an intuitive and straightforward interpretation, making it a valuable criterion for assessing intra-cluster cohesion. Diameter, on the other hand, assesses the spatial spread within clusters. It represents the maximum distance between data points, offering insights into overall dispersion. This metric is particularly valuable for accommodating irregular cluster shapes and contributes to creating well-rounded, spatially balanced clusters.

Proposed enhancement evaluation
The evaluation of these algorithmic variants relies on two key metrics, namely the Davies-Bouldin Index (DBI) and the Gini coefficient. The following points give the rationale for selecting these metrics.
Why DBI? We chose the DBI as the clustering metric due to its superior performance in both K-means and Bisecting K-means algorithms when compared to external validation metrics. External indexes require prior knowledge, which is often not available in real applications, making them unsuitable for determining the number of clusters. On the other hand, DBI is an internal validation metric that does not rely on prior knowledge and has shown good discrimination ability [39].
In a comparison of various internal validation metrics [40], DBI ranked second overall, just behind another metric called SDbw [41], but DBI was more suitable for real-world applications and showed better performance with more than two clusters. Other internal metrics, like the Dunn Index [42], were less effective due to their sensitivity to boundary points. Additionally, in a comprehensive comparison of 30 cluster validity indices, DBI belonged to the group of better-performing indices, while some others, like the Dunn Index, did not yield statistically significant results.
Why Gini? The Gini coefficient is frequently chosen as a clustering evaluation metric in various fields due to its ability to quantify inequality or diversity within a set of values [11]. When applied to clustering, the Gini coefficient measures clustering quality and the homogeneity of clusters [43]. Its sensitivity to cluster compactness makes it valuable in assessing the balance and tightness of data points within clusters [43]. Moreover, its adaptability to clusters of arbitrary shapes and sizes makes it versatile for analyzing clustering results with diverse data distributions. The Gini coefficient's single scalar output ensures easy interpretability and facilitates comparisons across different clustering experiments [11].
Therefore, the measurements we have chosen are based on the goal of thoroughly evaluating the proposed enhancements. DBI is employed to measure the compactness of clusters, providing insights into the intra-cluster similarity and separation between clusters. Meanwhile, the Gini coefficient is utilized to quantify the balance within clusters, offering a robust measure of population distribution.

The complexity of enhanced K-means algorithm
Analyzing the time complexity of the enhanced K-means algorithm requires examining the computational cost of each step. For the standard K-means initialization, the complexity is $O(n \cdot K \cdot d \cdot I)$, where n is the number of data points, K is the initial number of clusters, d is the dimensionality of the data, and I is the number of iterations until convergence.
For the SSE-Based Cluster Splitting variant (SSE-SPLITTING_KMEANS), an additional computational complexity of $O(n \cdot K \cdot I)$ is introduced. This involves calculating the sum of squared errors (SSE) for each cluster and selecting points based on SSE. The Iterative Diameter-Based K-Means variant (DIAMETER_KMEANS) introduces a complexity of $O(n \cdot K \cdot \log(n) \cdot I)$, as it entails sorting points based on their distances within each cluster.
Merging residual clusters contributes $O(n \cdot I)$ to the computational complexity, involving the merging of points and updating the number of clusters. For the overall iterative process, let T denote the number of iterations until K becomes 0. The total computational complexity is then $O(T \cdot (n \cdot K \cdot d \cdot I + n \cdot K \cdot I + n \cdot I))$ for SSE-SPLITTING_KMEANS and $O(T \cdot (n \cdot K \cdot d \cdot I + n \cdot K \cdot \log(n) \cdot I + n \cdot I))$ for DIAMETER_KMEANS. These expressions account for each phase of the enhanced K-means algorithm.

Experimental environment
For the implementation of our proposition, we use a 5-core CPU PC running a 64-bit Mac OS operating system, with 8 GB of memory and a 128 GB SSD. To support our algorithm, we employ Anaconda, an open-source platform for Python data science. Furthermore, we adapt and use the K-means implementation in Scikit-learn, a free and efficient Python-based machine learning tool, for our experiments.

Selection of TSP benchmark for evaluation
To validate our proposal, we chose datasets used for the well-known TSP (Traveling Salesman Problem) [44]. The use of TSP optimization benchmarks as datasets for machine learning techniques is one of the main focuses of the optimization community [45][46][47]. Clustering as a step in optimization techniques has been introduced in many works, such as [48][49][50][51][52].
The instances used are TSP problems in which the cities are represented as points in a Euclidean space. In these instances, the cities are typically defined by their (x, y) coordinates in a two-dimensional plane, and the distance between two cities is calculated using the Euclidean distance [53].
We assess the performance of our clustering algorithm on five different datasets: Berlin52, eil51, eil76, kroA100, and eil101. These datasets are selected from the TSPLIB [44] library and represent sample instances for the TSP. All of these datasets are used for clustering, and we evaluate each of them using various metrics.
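For readers who want to reproduce the setup, the coordinates of such an instance can be read with a few lines of Python. The sketch assumes the usual TSPLIB text layout for two-dimensional Euclidean instances, in which a `NODE_COORD_SECTION` line is followed by one `index x y` line per city and a closing `EOF` line; the function name `parse_tsplib_coords` is hypothetical.

```python
def parse_tsplib_coords(text):
    """Extract (x, y) tuples from the NODE_COORD_SECTION of a TSPLIB file."""
    coords = []
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "NODE_COORD_SECTION":
            in_section = True  # coordinate lines start after this marker
            continue
        if line == "EOF":
            break
        if in_section and line:
            # Each coordinate line holds: city index, x, y.
            _, x, y = line.split()[:3]
            coords.append((float(x), float(y)))
    return coords
```

The resulting list of tuples can be passed directly to a clustering routine that works on two-dimensional points.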
Given that the dataset instances lie in a two-dimensional plane, we aim to employ a quadtree structure to represent these instances. Our focus is on using the quadtree representation for the proposed enhancement of the K-means clustering approach with a targeted range of 3 to 4 clusters.

RESULTS AND DISCUSSION
The evaluation of the proposed algorithm and its two variants involves applying it to five distinct instances. Additionally, a comparison is made between the performance of enhanced K-means and standard K-means. The evaluation metrics used are the Davies-Bouldin Index (DBI), to assess clustering quality and cluster compactness, and the Gini coefficient, to measure the distribution of points among the clusters. The value of r in our proposition is a key parameter that directly impacts both the quality of clustering and the distribution of points across the clusters.
The comparative analysis of clustering techniques presented in Table 1, including standard K-means and the enhanced K-means variants (1 and 2), was conducted on multiple instances of the TSP. The results indicate that the enhanced variants consistently outperformed standard K-means in terms of the DBI metric, with DIAMETER_KMEANS achieving the best cluster separation and compactness. While Gini values were similar across all methods, the enhanced variants showed potential for a more balanced data distribution among clusters. Overall, the enhanced K-means variants, particularly DIAMETER_KMEANS, demonstrate higher clustering quality and compactness, and potentially a more balanced distribution of data points among clusters, compared to the standard K-means algorithm.
Focusing solely on DBI values, the results suggest that both enhanced K-means variants (1 and 2) perform slightly better than the standard K-means algorithm in terms of clustering quality and cluster compactness. DIAMETER_KMEANS consistently achieves the best results in terms of DBI (Table 2). For instance, in Eil51, DIAMETER_KMEANS achieves the lowest DBI of 1.16, compared to 1.20 for K-means and 1.17 for SSE-SPLITTING_KMEANS. Similar trends are observed in the Berlin52, Eil76, and KroA100 instances, where DIAMETER_KMEANS consistently yields the lowest DBI values (see Figure 2).
As measured by the Gini values, the results show that both SSE-SPLITTING_KMEANS and DIAMETER_KMEANS achieve a slightly more balanced distribution of data points among clusters than standard K-means. The best Gini results are consistently obtained with DIAMETER_KMEANS (Table 3). For some instances (e.g., KroA100, Eil76, and Eil101), DIAMETER_KMEANS achieves the lowest Gini, respectively 0.26, 0.22, and 2.21, and outperforms both K-means and SSE-SPLITTING_KMEANS. KroA100 has higher Gini indices across all methods, indicating that balanced clustering may be more challenging for this instance (see Figure 3).
Our experiments also show how two key parameters, denoted r1 and r2, influence the performance of the proposed enhanced K-means variants across problem instances. Fixing K at 3 clusters, we vary r1 and r2 and measure their impact on SSE-SPLITTING_KMEANS and DIAMETER_KMEANS using the evaluation metrics. The ideal parameter settings depend on the optimization objective, as summarized in Table 4. When the goal is cluster separation and compactness as measured by the DBI, the best r1 and r2 values lie between 3.5 and 6 for DIAMETER_KMEANS and between approximately 1.5 and 8 for SSE-SPLITTING_KMEANS. Conversely, when a balanced data distribution as measured by the Gini coefficient is the priority, both variants perform best with r1 and r2 between 1.25 and 4. Under the composite metric combining Gini and DBI, both variants perform robustly across instances. In summary, r1 and r2 strongly influence the behavior of the enhanced K-means variants, and the appropriate range depends on the chosen optimization objective.

CONCLUSION
In this study, we proposed an enhanced K-means clustering algorithm with a post-processing step to achieve balanced cluster sizes. Our algorithm uses SSE and a diameter-based criterion during redistribution of points between clusters. The key findings of our experiments are:
(1) Our enhanced algorithm resulted in an average 2.6-4% reduction in Davies-Bouldin Index compared to standard K-means, demonstrating improved cluster compactness and separation.
(2) For the Gini coefficient metric, we achieved a more balanced cluster size distribution than baseline methods on the 5 datasets tested.
Balanced and high-quality clustering outputs are important for applications where relative cluster population sizes carry meaning, such as market segmentation, recommendation systems, and social network analysis. Our approach addresses the common real-world challenge of imbalanced clusters. In these applications, balanced clusters ensure all subgroups are well represented and avoid one or two segments dominating the analysis. This leads to more insightful segment profiles.
Moreover, balanced and compact clusters are particularly crucial in domains involving predictive risk analysis, such as healthcare, fraud detection, and sustainability. Reliable identification and characterization of high-risk clusters requires representative coverage of all subgroups.
Future work will focus on building upon this balanced redistribution approach. We plan to evaluate our algorithm on additional types of datasets, such as text and images. Expanding our technique to handle multi-dimensional data more efficiently could improve its applicability. We will also explore integrating cluster validity indices to automatically select algorithm parameters.
In conclusion, our experiments demonstrate the effectiveness of the proposed enhanced K-means algorithm at achieving balanced cluster sizes while maintaining high clustering quality. This work provides a foundation for developing balanced clustering methods applicable across diverse real-world problem domains.

Figure 2. Illustration of DBI values of K-means and enhanced K-means for different instances

Figure 3. Illustration of Gini values of K-means and enhanced K-means for different instances

Table 1. DBI and Gini values for different instances and methods in K-means and enhanced K-means variants

Table 2. DBI values of K-means and enhanced K-means variants for different instances

Table 3. Gini values of K-means and enhanced K-means variants for different instances

Table 4. Ideal r values to optimize performance of enhanced K-means variants for different instances