Abstract
The k-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global k-means is a deterministic algorithm proposed to tackle the random initialization problem of k-means, but it is well known that it requires a high computational cost. It partitions the data into K clusters by solving all k-means subproblems incrementally for all \({k=1,\ldots , K}\). For each k-cluster problem, the method executes the k-means algorithm N times, where N is the number of datapoints. In this paper, we propose the global k-means++ clustering algorithm, which is an effective way of acquiring quality clustering solutions akin to those of global k-means with a reduced computational load. This is achieved by exploiting the center selection probability that is effectively used in the k-means++ algorithm. The proposed method has been tested and compared on various benchmark datasets, yielding very satisfactory results in terms of clustering quality and execution speed.
1 Introduction
Clustering is one of the most important and fundamental tasks in machine learning, data mining, and pattern recognition, with numerous applications in computer science and many other scientific fields [1,2,3]. It is defined as the process of partitioning a set of objects into groups, called clusters, so that the data in the same cluster share common characteristics and differ from those in other clusters. While its definition is simple, it is actually a challenging problem to solve. The typical form of clustering is the partitioning of a given dataset \(X =\{x_1,\ldots , x_N\}\), \(x_i \in \mathbb {R}^d\) into K disjoint clusters \(\mathcal {C} = \{C_1, \ldots , C_K\}\) so that a specific criterion is optimized. The most widely used optimization criterion is the clustering error. It is defined as the within-cluster sum of squared distances between each data point \(x_i \in C_k\) and its cluster center \(m_k\), as defined in

\(E(\mathcal {C}) = \sum _{k=1}^{K} \sum _{i=1}^{N} {\textbf {1}}_{C_k}(x_i)\, \Vert x_i - m_k \Vert ^2,\)  (1)

where \({\textbf {1}}_{C_k}\) is the indicator function of the set \(C_k\), while \(M = \{m_1,\ldots , m_K \}\) is the set of K centers computed as the mean of the datapoints of each cluster. The number of ways in which a set of N objects can be partitioned into K non-empty groups is given by the Stirling numbers of the second kind:

\(S(N, K) = \frac{1}{K!} \sum _{j=0}^{K} (-1)^{K-j} \binom{K}{j} j^{N},\)  (2)
which can be approximated by \(K^{N}/K!\) as \(N \rightarrow +\infty \) [4]. It is evident from (2) that a complete enumeration of all possible clusterings to determine the global minimum of (1) is computationally prohibitive. It is worth mentioning that this non-convex optimization problem is \(\textsf{NP}\)-hard [5, 6] not only for two clusters (\(K=2\)) [7], but also for two-dimensional datasets (\(d=2\)) [8]. The k-means algorithm [9, 10] is a widely used method for minimizing the clustering error. It is an iterative algorithm that has been extensively utilized in many clustering applications due to its simplicity and speed [11].
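To make the objective concrete, the short numpy sketch below evaluates the clustering error (1) for a toy partition. The function and variable names are ours, not from the paper:

```python
import numpy as np

def clustering_error(X, centers, labels):
    # Within-cluster sum of squared distances, eq. (1).
    return sum(float(np.sum((X[labels == k] - m) ** 2))
               for k, m in enumerate(centers))

# Toy example: two well-separated 1-D clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centers = np.array([[0.5], [10.5]])
labels = np.array([0, 0, 1, 1])
err = clustering_error(X, centers, labels)
```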
The k-means algorithm is computationally fast, but its performance relies heavily on the initialization of the cluster centers, which may lead to a local minimum of the clustering error. Various stochastic global optimization methods, such as simulated annealing and genetic algorithms, have been proposed to overcome this problem, but have not gained wide acceptance. These methods involve several hyperparameters that are difficult to tune (for example, starting temperature, cooling schedule, population size, and crossover/mutation probability) and usually require a large number of iterations, which renders them prohibitive for large datasets. Consequently, stochastic global optimization methods are usually not preferred; instead, methods with multiple random restarts are commonly used [1]. The k-means++ algorithm [12] is probably the most widely used center initialization algorithm. It selects the cluster centers by sampling datapoints from a probability distribution that strives to effectively spread them away from each other. Specifically, the algorithm comprises two key steps: i) computing a probability distribution for center selection and ii) sampling a datapoint from this distribution as the initial cluster center position. The method iteratively recalculates the distribution in order to select subsequent cluster centers. This iterative process terminates when all K centers have been initialized.
The global k-means [13] algorithm has also been proposed to effectively solve the k-means initialization problem. It constitutes a deterministic global optimization method that does not depend on center initialization. Instead of randomly selecting starting positions for all cluster centers, global k-means proceeds incrementally, attempting to optimally add one new cluster center at each stage k by exploiting the solution for \(k-1\) clusters. This approach is deterministic and does not depend on any initial conditions or empirically adjustable parameters. Nonetheless, a notable limitation of the method is its high computational cost. It has been empirically shown that both the global k-means and k-means++ methods significantly surpass the performance of standard k-means [14, 15].
It is important to note that the performance of random initialization algorithms, such as k-means++, deteriorates as K grows compared to that of the global k-means algorithm. Large values of K are frequently utilized in clustering scenarios where the objective is to capture finer substructures within a dataset. In this context, we define a clustering of X as an overclustering when \(K>k^{\star }\), with \(k^{\star }\) representing the actual number of clusters present in X. In various fields such as speech recognition [16] and computational biology [17], overclustering is common practice. For example, in computational biology, overclustering ensures that relevant cell types can be discovered even if an expected population is split into subpopulations. Moreover, overclustering is a vital algorithmic step in more sophisticated clustering algorithms, particularly when dealing with non-convex data structures [18, 19]. In overclustering settings, obtaining solutions from the global k-means clustering algorithm is desirable. However, due to the computational burden associated with large values of N and K, computationally cheaper alternatives are used in practice.
In this paper, we propose the global k-means++ clustering algorithm, which is an effective way of acquiring clustering solutions with clustering errors comparable to those of global k-means, without its high computational cost. This is achieved by employing the effective center selection probability distribution of the k-means++ method. The global k-means++ algorithm is an attempt to retain the effectiveness of the incremental clustering strategy of global k-means while reducing its computational demand using the efficient stochastic center initialization of the k-means++ algorithm. The proposed algorithm proceeds incrementally by solving all intermediate subproblems with \(k \in \{1, 2,\ldots , K-1\}\) to provide the clustering solution for K clusters. The underlying idea of the proposed method is that an optimal solution for a clustering problem with K clusters can be obtained using a series of local k-means searches that are appropriately initialized. More specifically, at each local search, the \(k-1\) cluster centers are always initially placed at their optimal positions corresponding to the clustering problem with \(k-1\) clusters, while the remaining \(k\)-th cluster center is placed at several starting positions within the data space, randomly selected by sampling from the k-means++ probability distribution.
The rest of the paper is organized as follows. In Section 2, we give a brief overview of the k-means algorithm, the initialization method of k-means++, the global k-means clustering algorithm, and its variations. In Section 3, we present the proposed global k-means++ clustering algorithm. In Section 4, we provide extensive comparative experimental results and discussion. Finally, Section 5 presents conclusions and suggestions for future work.
2 Related work
The k-means algorithm finds locally optimal solutions with respect to the clustering error (1). It is an iterative center-based clustering algorithm that starts with K cluster centers. In its purest form (Lloyd's algorithm [10]), the K centers are typically selected uniformly at random from the set of datapoints X. Then a two-step algorithmic procedure is applied iteratively until convergence, consisting of a data assignment step and a center optimization step. At the assignment step, every data point \(x_i\) is assigned to the cluster \(C_j\) with the nearest center \(m_j\):

\(C_j = \left\{ x_i \in X : j = \mathop {\mathrm {arg\,min}}\limits _{l \in \{1,\ldots ,K\}} \Vert x_i - m_l \Vert ^2 \right\}.\)  (3)

At the optimization step, each center is updated to the mean of all datapoints assigned to its cluster:

\(m_j = \frac{1}{|C_j|} \sum _{x_i \in C_j} x_i.\)  (4)
This simple optimization methodology is computationally fast and effective. The main disadvantage of the standard k-means algorithm is that it is a local optimization method with high sensitivity to the initial positions of the cluster centers. A poor center selection usually leads to local minima of the clustering error. Therefore, to obtain near-optimal solutions using the k-means algorithm, several runs must be scheduled, differing in the starting positions of the cluster centers. Among these runs, the solution with the lowest clustering error is retained.
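The two alternating steps above can be sketched in a few lines of numpy. This is a minimal illustration of Lloyd's algorithm, with names of our choosing, not the authors' implementation:

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=100):
    # Alternate the assignment step (3) and the update step (4)
    # until the centers stop moving.
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        # Assignment: label of the nearest center for every point.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Update: each center moves to the mean of its cluster.
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

X = np.array([[0.0], [1.0], [10.0], [11.0]])
centers, labels = lloyd_kmeans(X, np.array([[0.0], [10.0]]))
```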
The k-means++ algorithm [12] is considered the most successful method for choosing the initial positions of the cluster centers. It randomly selects the center positions from the datapoints using a multinomial probability distribution that strives to effectively spread the centers away from each other. Specifically, the method begins by choosing the first center position uniformly at random from the dataset X. Then it computes the distances

\(d_i = \min _{m_j \in M} \Vert x_i - m_j \Vert ^2,\)  (5)

where \(i = 1, \ldots , N\), of each datapoint \(x_i\) from its nearest center, and forms the probability vector \(P = \left( p_1, \ldots , p_N\right) \), where each component \(p_i\) is given by

\(p_i = \frac{d_i}{\sum _{j=1}^{N} d_j}.\)  (6)
Then, it samples the next center position m from the data using the multinomial distribution with probability vector P. This two-step procedure is repeated until the number of centers equals K. After the k-means++ center initialization procedure, the standard k-means algorithm proceeds with the assignment and optimization steps described in (3) and (4), respectively. These steps are applied iteratively until convergence is achieved. It is important to note that choosing centers using the k-means++ algorithm is computationally fast, while it also guarantees an \(\mathcal {O}(\log {k})\)-competitive clustering solution [12].
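The seeding procedure just described can be sketched with numpy's weighted sampling. The helper name `kmeanspp_init` is ours; this is an illustrative sketch, not the reference implementation:

```python
import numpy as np

def kmeanspp_init(X, K, rng=None):
    # Choose K seeds with the k-means++ distribution: the first seed
    # uniformly at random, the rest proportionally to d_i, eqs. (5)-(6).
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance of every point to its nearest chosen seed.
        d = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centers)

X = np.array([[0.0], [1.0], [10.0], [11.0]])
seeds = kmeanspp_init(X, 2)
```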
The global k-means clustering algorithm [13] is a deterministic global optimization method that does not depend on any predefined parameter values and employs the k-means algorithm as a local search procedure. The complete global k-means method is presented in Alg. 1.
Instead of randomly selecting initial values for all cluster centers, the global k-means algorithm incrementally adds one new cluster center at each stage, attempting to place it optimally. To accomplish this, the algorithm solves a clustering problem with K clusters by sequentially solving every intermediate subproblem with k clusters (\(k \in \{1,\ldots , K\}\)). To solve the problem with k clusters, global k-means utilizes the solution with \(k-1\) cluster centers.
Note that the set of k centers is denoted as \(M_k\), the k-clustering partition is denoted as \(C_k\), and E(C) denotes the clustering error, see (1). The algorithm starts by solving the 1-means problem, whose optimal solution is the mean of the dataset X (step 1 in Alg. 1). Then, it solves the 2-means problem by performing N executions of the k-means algorithm (steps 3-5 in Alg. 1). In each execution n, the first cluster center is always initialized at the optimal solution of the 1-means subproblem, while the second center is initially set at the data point \(x_n\) \((n \in \{1,\ldots , N\})\). The best solution obtained after the N executions of the k-means algorithm, i.e., the one with the lowest error (1), is considered the solution of the 2-means clustering subproblem (step 6 in Alg. 1). Following the same incremental procedure, the solution for k clusters is obtained (steps 2-7 in Alg. 1). In general, to solve the k-cluster subproblem, the procedure begins by initializing \(k-1\) centers at the center positions provided by the solution of the \((k-1)\)-cluster problem; the new \(k\)-th center is then initialized at each datapoint \(x_n\) in turn. The k-means algorithm is executed N times while retaining the best clustering solution. The global k-means algorithm is very successful in obtaining near-optimal solutions, but computationally expensive for large N, as it involves \(\mathcal {O}(NK)\) executions of k-means on the entire dataset.
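The incremental structure of Alg. 1 can be sketched as follows. This toy implementation (names are ours) trades efficiency for clarity and is only meant to mirror the algorithmic structure:

```python
import numpy as np

def _lloyd(X, centers, iters=50):
    # Plain k-means local search; returns converged centers and error (1).
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    return centers, float(((X - centers[labels]) ** 2).sum())

def global_kmeans(X, K):
    # Alg. 1: solve k = 1..K incrementally; for each k, try every data
    # point as the initial position of the new k-th center.
    m1 = X.mean(0, keepdims=True)
    solutions = {1: (m1, float(((X - m1) ** 2).sum()))}
    for k in range(2, K + 1):
        prev_centers = solutions[k - 1][0]
        runs = [_lloyd(X, np.vstack([prev_centers, x[None]])) for x in X]
        solutions[k] = min(runs, key=lambda r: r[1])  # keep the lowest error
    return solutions

X = np.array([[0.0], [1.0], [10.0], [11.0]])
sol = global_kmeans(X, 2)
```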
Global k-means is widely acknowledged for its optimization capabilities. However, its prominent drawback lies in its high computational complexity, as each k-clustering subproblem necessitates \(\mathcal {O}(N)\) k-means executions. Consequently, the practical application of the algorithm is predominantly confined to small datasets. This limitation has prompted researchers to explore heuristic techniques for selecting initial center candidates, with the dual objective of reducing computational complexity while maintaining clustering results of comparable quality to those obtained with global k-means. Nonetheless, it is essential to acknowledge that reducing the computational complexity of the global k-means algorithm results in the loss of its deterministic nature.
The fast global k-means algorithm (FGKM) [13] constitutes an effort to accelerate global k-means. Unlike global k-means, which necessitates \(\mathcal {O}(N)\) k-means executions for each k-subproblem, the FGKM algorithm performs only a single k-means execution. Alg. 2 describes the FGKM clustering method.
In particular, for each value of k, the FGKM algorithm computes an upper bound (denoted as \(E_n\)) on the clustering error that would arise if the new center were initialized at the specific point \(x_n\) and k-means were executed until convergence. The upper bound satisfies \(E_n \le E - b_n\), where E is the error already found for the \((k-1)\)-clustering subproblem, while the value \(b_n\) is given by

\(b_n = \sum _{j=1}^{N} \max \left( d^{j}_{k-1} - \Vert x_n - x_j \Vert ^2,\, 0\right).\)  (7)

In (7), \(d^{j}_{k-1}\) is defined as the squared Euclidean distance between \(x_j\) and its closest cluster center among the \(k-1\) centers obtained so far. The initial position of the \(k\)-th cluster center is set to the data point \(x_{n^{\star }}\) with minimum \(E_n\) or, equivalently, with maximum \(b_n\):

\(n^{\star } = \mathop {\mathrm {arg\,max}}\limits _{n} b_n.\)  (8)
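The bound computation follows directly from (7) and (8). The sketch below assumes numpy and a precomputed distance vector `d_prev`; all names are of our choosing:

```python
import numpy as np

def fgkm_new_center(X, d_prev):
    # d_prev[j]: squared distance of x_j to its closest of the k-1 centers.
    # b_n, eq. (7), is the guaranteed error reduction if the new center
    # starts at x_n; the selected candidate is the argmax, eq. (8).
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # ||x_n - x_j||^2
    b = np.maximum(d_prev[None, :] - pair, 0.0).sum(1)     # b_n for every n
    return int(b.argmax()), b

X = np.array([[0.0], [1.0], [10.0], [11.0]])
d_prev = ((X - X.mean(0)) ** 2).sum(1)  # one existing center at the mean
idx, b = fgkm_new_center(X, d_prev)
```

Note that this selection requires all pairwise distances, which is exactly the \(\mathcal {O}(N^2)\) cost discussed below.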
Several methods have been proposed that modify global k-means or fast global k-means to make them more efficient [14]. The efficient global k-means clustering algorithm (EGKM) [20] defines a normalized density function criterion for selecting the top candidate for the new cluster center. However, EGKM, similarly to FGKM, demands all pairwise distances \(d(x_i, x_j)\); thus, it needs \(\mathcal {O}(N^2)\) distance computations and extra memory space. The fast modified global k-means algorithm (FMGKM) [21] defines a more sophisticated auxiliary function criterion for selecting the candidates compared to FGKM. Its main drawback is computational complexity because, at each iteration, it requires the computation of all or part of the affinity matrix, which also generally requires \(\mathcal {O}(N^2)\) operations. In addition, the FMGKM algorithm exhibits a secondary limitation due to the utilization of an auxiliary function criterion, which introduces a non-convex optimization problem. The fast global k-means clustering method based on local geometric information [22] is an attempt to reduce the computational complexity of FGKM [13] and FMGKM [21] by leveraging local geometric structure information to decrease the distance computations required in each iteration. It is claimed that the method requires \(\mathcal {O}(n_1 n_2)\) distance computations to select the new center candidate, where empirically \(n_1 \ll N\) and \(n_2 \ll N\). Fast global k-means clustering using cluster membership and inequality [23] is also an attempt to reduce the distance computations of FGKM. It is claimed that this method requires \(\mathcal {O}(N n_1)\) distance computations to select the new center candidate, where empirically \(n_1 \ll N\). It is crucial to emphasize that all related methods exhibit clustering performance comparable to FGKM rather than to global k-means.
3 The global k-means++ algorithm
Global k-means is a deterministic algorithm proposed to tackle the random initialization problem, but it is computationally expensive. It partitions the data into K clusters by solving all k-cluster subproblems sequentially for \(k \in \{1,\ldots , K\}\). To solve the k-clustering subproblem, the method executes the k-means algorithm N times, considering each of the N datapoints as a potential candidate position for the new cluster center. This comprehensive exploration of candidate positions aims to identify the optimal placement of the new cluster center and drastically improves clustering performance.
An approach to reduce the complexity of the global k-means algorithm is to consider a set of L datapoints as candidate positions for the new cluster center, where \(L \ll N\). However, an effective candidate center selection strategy is required: employing random uniform selection for the L candidates is not expected to be a viable choice. An ideal candidate selection method should meet the following requirements:
Efficiency requirements:

i. Low computational complexity.

ii. Low space complexity.

Effectiveness requirement:

i. Selection of a set of high-quality center candidates.
We consider as “high-quality” a set of center candidates assigned to separate data regions. To select such candidates, it is crucial to prioritize sampling from regions without cluster centers. Such a strategy enhances the likelihood of choosing candidates that belong to different clusters while minimizing the risk of selecting redundant or suboptimal center positions. For this reason, we have employed the k-means++ probability distribution as the candidate selection method. The effectiveness of this distribution in the initial placement of cluster centers makes it an excellent choice for addressing the aforementioned requirements.
In this work, we propose the global k-means++ algorithm, an effective relaxation of the global k-means method. Its main difference from global k-means is that the proposed method requires only L executions of k-means for each k-cluster subproblem, where \(L \ll N\). The primary goal of the proposed clustering method is to provide high-quality results comparable to those of the global k-means algorithm, while retaining low computational and space complexity. The complete global k-means++ method is presented in Alg. 3. As with any other global k-means variant, to reduce the computational requirements we sacrifice determinism by using an effective stochastic initial center selection procedure.
The algorithm requires as input the dataset X, the number of clusters K, the number of candidates L to consider, as well as the sampling method S. Two possible sampling strategies are considered, namely batch sampling and sequential sampling. It should be noted that \(D = (d_1, \ldots , d_N)\) represents the distance vector, where \(d_i\) is the distance of \(x_i\) to its closest center \(m_j\) (5).
Specifically, to solve a clustering problem with K clusters, the method proceeds as follows. Initially, it addresses the 1-means subproblem by setting the center \(m_1\) to the mean of the entire dataset X and initializes the distance vector D (steps 1-2 in Alg. 3). Then, to solve the 2-means subproblem, it utilizes the 1-means solution as the first initial center position (obtained in step 1). To initialize the second center, a set of L candidate positions \(\{c_1, \ldots , c_L\}\) is sampled using batch or sequential sampling (steps 4-8 in Alg. 3). The k-means algorithm is executed until convergence L times, once for each candidate position \(c_\ell \) (steps 9-11 in Alg. 3), and the solution with the minimum error is retained (step 12 in Alg. 3). Finally, the distance vector D is updated (steps 13-15 in Alg. 3) using the precomputed center-to-datapoint distances of the best clustering solution. Following the same reasoning, for each \(k = 2, \ldots , K\), the algorithm incrementally tackles the k-cluster subproblem by leveraging the solution of the previous \((k-1)\)-cluster subproblem: it uses the center positions obtained from the \((k-1)\) subproblem as initialization for the first \(k-1\) centers and applies a batch or sequential sampling strategy, which we discuss next, to initialize the new \(k\)-th center. Finally, the algorithm returns the clustering solution for every \(k \in \{1,\ldots ,K\}\). Importantly, steps 9-11 of Alg. 3 can be executed concurrently or in parallel, allowing for efficient computation.
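A compact sketch of the procedure just described, using the batch sampling strategy for candidate selection. This is an illustrative reimplementation under our own naming, not the authors' released code:

```python
import numpy as np

def _lloyd(X, centers, iters=50):
    # k-means local search; returns converged centers and final labels.
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    return centers, labels

def global_kmeanspp(X, K, L=10, rng=None):
    # Global k-means++ with batch sampling: only L k-means++-sampled
    # candidates per subproblem instead of all N data points.
    if rng is None:
        rng = np.random.default_rng(0)
    centers = X.mean(0, keepdims=True)
    d = ((X - centers[0]) ** 2).sum(1)               # distance vector D
    solutions = {1: centers}
    for k in range(2, K + 1):
        p = d / d.sum()                               # k-means++ distribution
        size = min(L, int(np.count_nonzero(p)))
        cand = rng.choice(len(X), size=size, replace=False, p=p)
        best_err = np.inf
        for c in cand:
            cen, lab = _lloyd(X, np.vstack([solutions[k - 1], X[c][None]]))
            err = float(((X - cen[lab]) ** 2).sum())
            if err < best_err:
                best_err, best_cen, best_lab = err, cen, lab
        solutions[k] = best_cen
        d = ((X - best_cen[best_lab]) ** 2).sum(1)   # update D, steps 13-15
    return solutions

X = np.array([[0.0], [1.0], [10.0], [11.0]])
sol = global_kmeanspp(X, 2, L=2)
```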
We can sample candidates from the probability distribution P through either batch or sequential sampling, presented in Alg. 4 and Alg. 5, respectively. On the one hand, batch sampling does not require any additional computations, as it utilizes only the precomputed center-to-datapoint distances \(D = (d_1, \dots , d_N)\). First, it computes the probability vector \(P = \left( p_1, \ldots , p_N\right) \), where \(p_i\) represents the selection probability of the datapoint \(x_i\) (step 1 in Alg. 4). The distribution P is formulated so that high selection probabilities are assigned to datapoints far from the cluster centers, while small to near-zero probabilities are assigned to datapoints near the cluster centers. Then, a set of L candidates \(\{c_1, \ldots , c_L\}\) is sampled without replacement from the constructed k-means++ probability vector P (step 2 in Alg. 4).
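Batch sampling itself reduces to two numpy calls; the sketch below is a minimal illustration with hypothetical names and toy distances:

```python
import numpy as np

def batch_sample(X, d, L, rng=None):
    # Build P from the precomputed distances and draw L candidates
    # without replacement; no extra distance computations are needed.
    if rng is None:
        rng = np.random.default_rng(0)
    p = d / d.sum()
    idx = rng.choice(len(X), size=L, replace=False, p=p)
    return X[idx]

X = np.array([[0.0], [1.0], [10.0], [11.0]])
d = np.array([0.25, 0.25, 90.25, 110.25])  # distances to a center at 0.5
cands = batch_sample(X, d, 2)
```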
On the other hand, the sequential sampling strategy selects one candidate \(c_{\ell }\) at a time using the distribution P (step 3 in Alg. 5). This procedure aims to select candidates that are far from: i) the set M consisting of the converged positions of the \(k-1\) centers, and ii) the \(\ell -1\) candidates already sampled. After each sampling step, the distance \(d_i\) of each datapoint \(x_i\) is updated accordingly (steps 4-6 in Alg. 5). This procedure is repeated L times in total to create the set of candidates \(\{c_1, \ldots , c_L\}\) (steps 1-7 in Alg. 5). Although the sequential sampling method incurs the cost of recalculating the minimum distance values, it provides a better spread of the L samples in the data space.
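Sequential sampling can be sketched as follows, again with names of our choosing; the key difference from batch sampling is the distance-vector update after every draw:

```python
import numpy as np

def sequential_sample(X, d, L, rng=None):
    # Draw one candidate at a time; after each draw, shrink d toward
    # the new candidate so later draws avoid both the existing centers
    # and the candidates picked so far.
    if rng is None:
        rng = np.random.default_rng(0)
    d = d.astype(float).copy()
    cand = []
    for _ in range(L):
        c = X[rng.choice(len(X), p=d / d.sum())]
        cand.append(c)
        d = np.minimum(d, ((X - c) ** 2).sum(1))  # distance update
    return np.array(cand)

X = np.array([[0.0], [1.0], [10.0], [11.0]])
d0 = np.array([0.25, 0.25, 90.25, 110.25])  # distances to a center at 0.5
cands = sequential_sample(X, d0, 2)
```

Because a drawn point's distance drops to zero, the same point can never be drawn twice, which is what guarantees the better spread mentioned above.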
In Fig. 1, we present an illustrative running example of the algorithm using the 2-dimensional “R15” dataset [24].^{Footnote 1} The circles denote the datapoints, while their color indicates the cluster that the global k-means++ algorithm assigns them to. The red stars represent the converged cluster centers for each clustering subproblem k, while the green cross symbols show the initial candidate positions of the next cluster center. In each subfigure 1a-1d, the algorithm solves the current k-means subproblem. Based on the current solution, the algorithm samples the next center candidates from the k-means++ distribution. The figure demonstrates that the method samples high-quality center candidates.
While global k-means requires \(\mathcal {O}(NK)\) k-means executions, global k-means++ only requires \(\mathcal {O}(LK)\), where generally \(L \ll N\). Additionally, we have empirically observed that k-means converges very fast as k grows. This speed-up in convergence is reasonable: to solve each k-cluster subproblem, the method exploits the centers of the \((k-1)\)-cluster subproblem, which are already positioned sufficiently well. Therefore, k-means does not require many iterations to converge.
It should be noted that the proposed method is a relaxation of the global k-means algorithm rather than an optimization of FGKM, unlike the methods discussed in the related work. Its simplicity and speed stem from the fact that the sampling strategies require no or minimal computations to select the L candidates, demonstrating superior computational complexity compared to other global k-means variants. In particular, the batch sampling strategy requires no additional distance computations, since the center-to-datapoint distances necessary to define the probability vector P have already been computed by the k-means procedure. Meanwhile, the sequential sampling strategy requires only \(\mathcal {O}(NL)\) distance computations for the selection process, the same as k-means++. Note also that solving the k-means problem with an incremental procedure has many advantages: due to the unsupervised nature of the clustering problem, it is usually desirable to obtain solutions for different k that are evaluated using appropriate quality criteria (e.g., the silhouette coefficient [25]) for selecting the appropriate number of clusters [26].
4 Experiments
In this section, we present our experimental study to evaluate the effectiveness of the proposed global k-means++ clustering method, both in the batch (gl++ (b)) and sequential sampling (gl++ (s)) settings. Our study involves comparisons against other clustering methods, including global k-means (gl) [13], FGKM (fgl) [13], k-means++ (kms++) [12], scalable k-means++ (kms) [27], and standard k-means with random uniform initialization (rnd) [10].^{Footnote 2} Our implementation of the global k-means++ clustering algorithm is available in the following GitHub repository: https://github.com/gvardakas/globalkmeanspp.git.
4.1 Datasets
To evaluate the proposed algorithm, a series of experiments has been conducted on various benchmark, publicly available datasets.^{Footnote 3} We deliberately selected datasets that exhibit a broad spectrum of data characteristics, encompassing factors such as the number of samples N, the data dimensionality D, the complexity, and the domain of origin. The description of each dataset is presented in Table 1. As a preprocessing step, we used min-max normalization to map the attributes of each real dataset to the [0, 1] interval, to prevent attributes with large ranges from dominating the distance calculations and to avoid numerical instabilities in the computations [30].
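Min-max normalization is a one-liner per attribute. A minimal sketch follows; the guard for constant attributes is our own convention, not stated in the paper:

```python
import numpy as np

def minmax_scale(X):
    # Map every attribute (column) to [0, 1]; constant columns map to 0.
    lo, hi = X.min(0), X.max(0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

X = np.array([[0.0, 5.0], [5.0, 5.0], [10.0, 5.0]])
Xs = minmax_scale(X)
```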
4.2 Evaluation
The evaluation of clustering methods encompasses various approaches, including internal metrics and external metrics when ground truth cluster labels are available. However, our investigation treats clustering solely as an optimization problem, with the primary objective being the minimization of the clustering error. Consequently, we intentionally disregard any class labels in the data, as our emphasis lies neither on assessing external clustering metrics nor on determining the number of clusters. Instead, we use the clustering error \(E(C_k)\) (1) as the performance measure for the different methods, enabling a direct assessment of their error minimization capabilities.
Specifically, in order to evaluate the performance of the different methods, we have computed the relative percentage error, defined as

\(PE_k = \frac{E(C_k) - E(C_{k}^{\star })}{E(C_{k}^{\star })} \times 100\%,\)  (9)

where \(E(C_{k}^{\star })\) is the clustering error of the baseline method (global k-means), while \(E(C_k)\) is the error provided by each of the compared methods (refer to Figs. 2, 3 and 4). For the mnist dataset, global k-means did not terminate in reasonable time due to its high computational complexity. In that case, we have used the difference in clustering errors relative to the best-performing algorithm to facilitate a transparent comparison (see Fig. 5). Additionally, we studied the convergence speed of each method for each k by presenting the CPU time required by each algorithm (Table 2) and the average number of iterations needed for k-means to converge (Fig. 6). In this way, we provide comprehensive information regarding each method's ability to minimize the clustering error as well as its computational speed and efficiency.
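Assuming the standard definition of relative percentage error with global k-means as the baseline, the measure can be computed as follows (the function name is ours):

```python
def percentage_error(e_method, e_baseline):
    # Relative percentage error against the global k-means baseline:
    # positive values mean a worse (higher) clustering error,
    # negative values mean the method beat the baseline.
    return 100.0 * (e_method - e_baseline) / e_baseline

pe = percentage_error(103.0, 100.0)
```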
4.3 Experimental setup
In our experimental study, we executed the compared methods for a maximum number of clusters, denoted as K, and evaluated all clustering solutions for \(k\in \{1,\ldots ,K\}\) in terms of clustering error. We chose the maximum number of clusters K based on the dataset size, with \(K=30\) for smaller datasets (breast cancer and wine) and \(K=50\) for medium (pendigits) and larger datasets (mnist). For each dataset, we ran the compared algorithms with L values of 10, 25, 50, and 100. This range of values allowed us to evaluate the impact of L on the performance of the clustering methods.
In the case of the randomly initialized methods, namely k-means++, scalable k-means++, and standard k-means, we have defined the parameter L to represent the number of restarts within each subproblem k. The scalable k-means++ method has an additional hyperparameter l, denoting the number of data points sampled in each iteration, which we set equal to L. In contrast, for the incremental optimization methods global k-means++ and FGKM, only a single execution was performed, considering L candidates. We adopted this approach to ensure a fair comparison with the randomly initialized methods, which do not utilize prior subproblem solutions in their optimization procedure. It is worth noting that FGKM, as presented in the related work [13], originally considers only one candidate (8). In our study, we relaxed this limitation by allowing the method to select, in each iteration, the top L candidates that maximize (7). In the following experiments, we evaluate the optimization capabilities of each method, the average number of iterations required, as well as the time needed for convergence to the clustering solution. In summary, for each dataset, we conducted the following experiments:
i. We selected \(K=30\) or \(K=50\) as the maximum number of clusters, depending on the size of the dataset.

ii. We executed one run of global k-means, global k-means++, and FGKM. For the two latter methods, we considered L equal to 10, 25, 50, and 100 candidates in each k-cluster subproblem.

iii. The k-means++, scalable k-means++, and standard k-means methods were initialized L times for each \(k=1,\ldots , K\).
4.4 Results
Figures 2, 3, and 4 depict the relative percentage error between each method and the baseline algorithm for the breast cancer, pendigits, and wine datasets, respectively. In the case of the mnist dataset, Fig. 5 displays the difference in clustering error between each method and the best-performing algorithm. In subfigures a, b, c, and d, we present the results of experiments considering different indicative values of L (i.e., 10, 25, 50, and 100) for the global k-means variants, k-means++, scalable k-means++, and the standard k-means algorithm. Moreover, Fig. 6 provides the average number of iterations required by each algorithm to converge, while Table 2 presents the CPU time necessary to compute all K clustering solutions for each algorithm.
The results across all datasets demonstrate that both the batch and sequential versions of global k-means++ exhibit performance similar to global k-means, while clearly outperforming FGKM, k-means++, scalable k-means++, and the standard k-means algorithm. The optimization capabilities of global k-means++ are particularly noteworthy as k increases, where minimizing the clustering error becomes more challenging. As expected, the performance of global k-means++ improves with an increasing number of candidates, better approximating the performance of global k-means. Surprisingly, in the wine dataset, global k-means++ outperformed global k-means in several k-subproblems (Fig. 4b-4d). The relative percentage error of global k-means++ was less than \(1\%\) when considering more than \(L=25\) candidates. In the pendigits dataset (Fig. 3), the baseline method and our algorithm coincide with high accuracy for \(L=25, 50, 100\). Notably, global k-means required four days of execution, while global k-means++ completed the task in significantly less time, ranging from 17 seconds up to almost 2 minutes, depending on the value of L. This indicates that we achieved comparable results in a fraction of the time. Finally, in the case of the mnist dataset (Fig. 5), when we consider \(L=25\) candidates or more, global k-means++ consistently outperforms the other methods.
It should be noted that in the wine dataset (Fig. 4d), the FGKM algorithm with \(L=100\) candidates demonstrated promising results. However, its performance across all datasets was inconsistent. It is evident that the FGKM algorithm cannot reliably approximate the clustering solution achieved by the global k-means algorithm. Even with an increasing number of candidates, the clustering solution provided by FGKM deviates significantly from the quality attained by global k-means++.
As expected, the k-means++ algorithm was successful for small values of k across all datasets. However, it becomes apparent that as k increases, the performance of the k-means++ algorithm deteriorates compared to global k-means and global k-means++. This is because selecting good initial center positions becomes more challenging as k grows. Additionally, scalable k-means++ had inferior performance compared to k-means++ in most cases, especially when k is large. As expected, the standard k-means with random uniform selection yielded the worst results.
Figure 6 presents the average number of k-means iterations performed by each method across all datasets and k subproblems. It becomes evident that as k increases, the incremental methods require fewer iterations in each run of k-means due to improved initialization of the cluster centers. This advantage arises from the fact that the previous \(k-1\) centers have already been positioned at near-optimal locations. Recall that for each cluster number k, the number of executed k-means runs is \(\mathcal {O}(Lk)\) for all compared methods except global k-means, which executes k-means \(\mathcal {O}(Nk)\) times. This means that even though global k-means tends to execute a smaller average number of k-means iterations than k-means++, this does not translate to faster execution times. Table 2 presents the CPU time required by each algorithm to compute all K clustering solutions. It turns out that if we want to find all k cluster solutions from \(k=2\) up to a cluster number K, global k-means++ is the fastest approach.
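The gap between the \(\mathcal {O}(Nk)\) and \(\mathcal {O}(Lk)\) run counts can be made concrete with a back-of-the-envelope calculation (the N, K, and L values below are illustrative, not taken from the paper's tables):

```python
# Number of k-means runs needed to compute all solutions k = 2..K.
# Global k-means launches N runs per subproblem; global k-means++
# launches only L runs per subproblem.
N, K, L = 10_000, 50, 25       # hypothetical dataset size and settings
runs_global = N * (K - 1)      # global k-means
runs_gkmpp = L * (K - 1)       # global k-means++
print(runs_global, runs_gkmpp, runs_global // runs_gkmpp)
# → 490000 1225 400 (a 400x reduction in k-means runs)
```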
4.5 Discussion
From the above experimental results, the following empirical conclusions can be drawn.
The global k-means incremental method exhibits powerful optimization capabilities. However, due to its high computational requirements, it is typically applied to smaller datasets. As the experimental results show, even for a medium-sized dataset like pendigits, global k-means required a very long execution time. Nevertheless, it is a very powerful extension of k-means and has been widely used in the clustering literature [31]. It should be noted that, although the algorithm provides high-quality solutions, it does not guarantee the discovery of the global optimum. As analyzed in the introduction, minimization of the clustering error is known to be an \(\textsf{NP}\)-hard problem. We observed such a case in the wine dataset, where global k-means did not yield the best results.
FGKM constitutes a natural fast variant of the global k-means method, which initializes the new cluster center by selecting the data point \(x_{i^{\star }}\) that minimizes an upper bound of the final k-means clustering error. Recall that we allowed the FGKM method to select the top L candidates that maximize (8), instead of just one as originally proposed. However, this candidate selection heuristic performs poorly, even compared to the random initialization method. This suggests that, while FGKM is intuitively justified, it does not constitute a very effective strategy.
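For concreteness, the FGKM-style scoring can be sketched as follows. This is our own illustrative NumPy rendering, not the authors' code: for each point \(x_i\), the score \(b_i = \sum_j \max(d_j - \Vert x_i - x_j\Vert^2, 0)\), where \(d_j\) is the squared distance of \(x_j\) to its closest current center, is a guaranteed reduction of the clustering error if \(x_i\) is added as a new center, and the top-L scorers are kept as candidates:

```python
import numpy as np

def fgkm_candidates(X, centers, L):
    """FGKM-style candidate selection (illustrative sketch).

    Returns the indices of the L points with the largest guaranteed
    error reduction b_i if promoted to a new cluster center.
    """
    # d[j]: squared distance of x_j to its nearest current center
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    # pair[i, j] = ||x_i - x_j||^2 (O(N^2) memory; fine for a sketch)
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # b[i] = sum_j max(d_j - ||x_i - x_j||^2, 0)
    b = np.maximum(d[None, :] - pair, 0.0).sum(1)
    return np.argsort(-b)[:L]
```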
The k-means++ algorithm is considered the state-of-the-art random initialization algorithm for the k-means clustering problem due to its strong empirical performance, theoretical guarantees on solution quality, and simplicity [32, 33]. As expected, it demonstrated good performance and generally surpassed most methods, including FGKM, scalable k-means++, and of course k-means with random initialization. However, global k-means and global k-means++ performed better overall, and this becomes more apparent as k gets larger.
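The \(D^2\) seeding at the heart of k-means++ (and of our candidate selection) is short enough to state in full; the following is a minimal NumPy sketch, with our own function and variable names:

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """k-means++ (D^2) seeding: each new center is drawn with
    probability proportional to the squared distance to the
    nearest already-chosen center."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2
              ).sum(-1).min(1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Points coinciding with a chosen center have zero probability, so the k seeds are distinct data points.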
The proposed global k-means++ produced very satisfactory results. It demonstrates performance similar to global k-means with a large reduction in computational complexity. Its candidate selection method chooses excellent candidates, as it is inspired by the successful k-means++ seeding strategy. Moreover, it provides the clustering solution for all intermediate values of k. Therefore, if we wish to obtain clustering solutions for different values of k, its time performance is even more impressive compared to the other random initialization methods, which have to be executed separately for each k.
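Combining the two ingredients above, the incremental scheme can be summarized in a few lines. The sketch below is our own simplified rendering of the idea (it uses a toy Lloyd routine and hypothetical names; the authors' actual implementation is in the linked repository):

```python
import numpy as np

def lloyd(X, centers, iters=100):
    """Plain Lloyd iterations from the given initial centers."""
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, ((X - centers[labels]) ** 2).sum()

def global_kmeanspp(X, K, L=10, rng=None):
    """Sketch of global k-means++: grow the solution one center at a
    time; for each k, draw L candidate initial positions from the
    k-means++ (D^2) distribution and keep the best refined result."""
    rng = np.random.default_rng(0) if rng is None else rng
    centers = X.mean(0, keepdims=True)             # k = 1: the data mean
    solutions = {1: (centers, ((X - centers) ** 2).sum())}
    for k in range(2, K + 1):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        cand = rng.choice(len(X), size=L, p=d2 / d2.sum(), replace=False)
        best = None
        for i in cand:                              # L k-means runs per k
            c, err = lloyd(X, np.vstack([centers, X[i]]))
            if best is None or err < best[1]:
                best = (c, err)
        centers = best[0]
        solutions[k] = best
    return solutions                                # all k = 1..K solutions
```

Note that the \(D^2\) probabilities reuse the distances already maintained by the incremental scheme, which is why candidate selection adds no extra distance computations.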
5 Conclusions
We have proposed the global k-means++ clustering algorithm, an effective relaxation of the global k-means algorithm that provides an ideal compromise between clustering error and execution speed. The basic idea of the proposed method is to take advantage of the superior clustering solutions that the global k-means algorithm can provide while avoiding its substantial computational requirements. Global k-means++ is an incremental clustering approach that adds one cluster center at each k cluster subproblem. For each k cluster subproblem, the method selects L data points as candidates for the initial position of the new center using the effective k-means++ selection probability distribution. The selection method is fast and requires no extra computational effort for distance computations.
Global k-means++ has been tested on various publicly available benchmark datasets and has been compared to global k-means, FGKM with multiple candidates, k-means++, scalable k-means++, and standard k-means with random uniform initialization. The experimental results reveal its superiority over all methods except global k-means. In that case, its performance is comparable but with a significant decrease in computational cost, thus establishing it as an effective relaxation.
We conclude that the proposed algorithm combines the best characteristics of the global k-means and k-means++ methods, resulting in high-quality clustering at reduced computational cost. While previous efforts focused on enhancing the FGKM algorithm, our method takes a different approach by directly addressing the global k-means method. We believe this method is an effective alternative to both global k-means and k-means++. Furthermore, the method provides all clustering solutions for \(k\in \{1,\ldots , K\}\). In this way, it is suitable for comparing clustering solutions for different k in order to determine the appropriate number of clusters, which is unknown in many applications. In this case, the method is significantly faster, even when compared to non-incremental methods such as standard k-means and k-means++.
In future work, it would first be interesting to apply the idea to generalizations of k-means that use distance measures other than the Euclidean, such as the kernel k-means algorithm or the k-medoids algorithm. We could also investigate a dynamic method for appropriately selecting the number of candidates L in each k cluster subproblem, which could lead to a further speed-up of the algorithm. In addition, we could adapt the global k-means++ algorithm to a semi-supervised setting by incorporating must-link or cannot-link constraints, utilizing the exact solver described in [34]. We also aim to test the method in real applications (such as biology, speech recognition, face clustering, etc.) that initially require a large number of clusters (over-clustering solutions). Finally, it would be interesting to integrate the method into a deep framework for clustering composite objects (graphs, images, text, etc.).
Code Availability
The code is available online: https://github.com/gvardakas/globalkmeanspp.git.
Data availability and access
The datasets analysed during the current study are available in the UCI repository, https://archive.ics.uci.edu/ml/index.php, and in the mnist database, http://yann.lecun.com/exdb/mnist/.
Notes
The synthetic dataset is available in the following GitHub repository: https://github.com/deric/clusteringbenchmark.git.
Experiments were carried out on a machine with an Intel\(^{\circledR }\) Core™ i7-8700 CPU at 3.20 GHz and 16 GB of RAM.
The time constraint is set to 7 days of execution while the available memory is 16 GB of RAM.
References
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv (CSUR) 31(3):264–323
Filippone M, Camastra F, Masulli F, Rovetta S (2008) A survey of kernel and spectral methods for clustering. Pattern Recognit 41(1):176–190
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666
Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis. John Wiley & Sons
Cohen-Addad V, Karthik C (2019) Inapproximability of clustering in lp metrics. In: 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, pp 519–539
Cohen-Addad V, Karthik C, Lee E (2021) On approximability of clustering problems without candidate centers. In: Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM, pp 2635–2648
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248
Mahajan M, Nimbhorkar P, Varadarajan K (2012) The planar k-means problem is NP-hard. Theoretical Comput Sci 442:13–21
MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA 1:281–297
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210
Arthur D, Vassilvitskii S (2006) k-means++: the advantages of careful seeding. Technical Report 2006-13, Stanford InfoLab. http://ilpubs.stanford.edu:8090/778/
Likas A, Vlassis N, Verbeek JJ (2003) The global k-means clustering algorithm. Pattern Recognit 36(2):451–461
Agrawal A, Gupta H (2013) Global k-means (GKM) clustering algorithm: a survey. Int J Comput Appl 79(2)
Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recognit 93:95–112
Ajmera J, Wooters C (2003) A robust speaker clustering algorithm. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), pp 411–416. https://doi.org/10.1109/ASRU.2003.1318476
Saeys Y, Van Gassen S, Lambrecht BN (2016) Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nature Rev Immunol 16(7):449–462
Wei Z, Chen YC Skeleton clustering: graph-based approach for dimension-free density-aided clustering. In: NeurIPS 2022 Workshop: New Frontiers in Graph Learning
Nie F, Wang CL, Li X (2019) K-multiple-means: a multiple-means clustering method with specified k clusters. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 959–967
Xie J, Jiang S, Xie W, Gao X (2011) An efficient global k-means clustering algorithm. J Comput 6(2):271–279
Bagirov AM, Ugon J, Webb D (2011) Fast modified global k-means algorithm for incremental cluster construction. Pattern Recognit 44(4):866–876
Bai L, Liang J, Sui C, Dang C (2013) Fast global k-means clustering based on local geometrical information. Inf Sci 245:168–180
Lai JZ, Huang T (2010) Fast global k-means clustering using cluster membership and inequality. Pattern Recognit 43(5):1954–1963
Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell 24(9):1273–1280
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I (2013) An extensive comparative study of cluster validity indices. Pattern Recognit 46(1):243–256
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable k-means++. Proceedings of the VLDB Endowment 5(7):622–633
Dua D, Graff C (2017) UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
LeCun Y, Cortes C (2010) MNIST handwritten digit database
Milligan GW, Cooper MC (1988) A study of standardization of variables in cluster analysis. J Classification 5(2):181–204
Ikotun AM, Ezugwu AE, Abualigah L, Abuhaija B, Heming J (2023) K-means clustering algorithms: a comprehensive review, variants analysis, and advances in the era of big data. Inf Sci 622:178–210
Bachem O, Lucic M, Hassani H, Krause A (2016) Fast and provably good seedings for k-means. Advances in Neural Information Processing Systems 29
Choo D, Grunau C, Portmann J, Rozhon V (2020) k-means++: few more steps yield constant approximation. In: International Conference on Machine Learning, PMLR, pp 1909–1917
Piccialli V, Russo AR, Sudoso AM (2022) An exact algorithm for semi-supervised minimum sum-of-squares clustering. Comput & Operations Res 147:105958
Acknowledgements
This research was supported by the project "Dioni: Computing Infrastructure for Big-Data Processing and Analysis" (MIS No. 5047222), co-funded by the European Union (ERDF) and Greece through the Operational Program "Competitiveness, Entrepreneurship and Innovation", NSRF 2014-2020.
Author information
Contributions
Both authors contributed equally to all aspects of this research. Both authors read and approved the final manuscript.
Ethics declarations
Conflict of Interests
The authors declare no conflict of interest.
Ethical and informed consent for data used
This article does not contain any studies conducted on human or animal subjects by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author selfarchiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vardakas, G., Likas, A. Global k-means++: an effective relaxation of the global k-means clustering algorithm. Appl Intell 54, 8876–8888 (2024). https://doi.org/10.1007/s10489024056362