1 Introduction

Clustering is one of the most important and fundamental tasks in machine learning, data mining, and pattern recognition, with numerous applications in computer science and many other scientific fields [1,2,3]. It is defined as the process of partitioning a set of objects into groups, called clusters, so that the data in the same cluster share common characteristics and differ from those in other clusters. While its definition is simple, it is actually a challenging problem to solve. The typical form of clustering is the partitioning of a given dataset \(X =\{x_1,\ldots , x_N\}\), \(x_i \in \mathbb {R}^d\), into K disjoint clusters \(\mathcal {C} = \{C_1, \ldots , C_K\}\) so that a specific criterion is optimized. The most widely used optimization criterion is the clustering error. It is defined as the within-cluster sum of squared distances between each datapoint \(x_i \in C_k\) and its cluster center \(m_k\), as defined in

$$\begin{aligned} E(C) = \sum _{i=1}^{N} \sum _{k=1}^{K} {\textbf {1}}_{C_k}(x_i) ||x_i - m_k||^2, \end{aligned}$$
(1)

where \({\textbf {1}}_{C_k}\) is the indicator function of the set \(C_k\), while \(M = \{m_1,\ldots , m_K \}\) is the set of K centers computed as the mean of the datapoints of each cluster. The number of ways in which a set of N objects can be partitioned into K non-empty groups is given by Stirling numbers of the second kind:

$$\begin{aligned} S(N, K) = \frac{1}{K!} \sum _{k=0}^{K} (-1)^{k} (K-k)^N {\begin{pmatrix}K\\ k \end{pmatrix}}, \end{aligned}$$
(2)

which can be approximated by \(K^{N}/K!\) as \(N \rightarrow +\infty \) [4]. It is evident from (2) that a complete enumeration of all possible clusterings to determine the global minimum of (1) is computationally prohibitive. It is worth mentioning that this non-convex optimization problem is \(\textsf{NP}\)-hard [5, 6] not only for two clusters (\(K=2\)) [7], but also for two-dimensional datasets (\(d=2\)) [8]. The k-means algorithm [9, 10] is a widely used method for minimizing the clustering error. It is an iterative algorithm that has been extensively utilized in many clustering applications due to its simplicity and speed [11].
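To make the combinatorial explosion behind (2) concrete, the short Python snippet below (an illustration added here, not part of the original analysis) evaluates S(N, K) exactly; even a toy instance with 25 datapoints and 4 clusters yields a number of partitions on the order of \(10^{13}\).

```python
from math import comb, factorial

def stirling_second_kind(N: int, K: int) -> int:
    """Number of ways to partition N objects into K non-empty groups, as in (2)."""
    return sum((-1) ** k * comb(K, k) * (K - k) ** N for k in range(K + 1)) // factorial(K)

# Even a toy instance is far beyond exhaustive enumeration:
# 25 datapoints and 4 clusters already give on the order of 10^13 partitions.
print(stirling_second_kind(25, 4))
```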

The k-means algorithm is computationally fast, but its performance relies heavily on the initialization of the cluster centers, which may lead to a local minimum of the clustering error. Various stochastic global optimization methods, such as simulated annealing and genetic algorithms, have been proposed to overcome the above problem, but have not gained wide acceptance. These types of methods involve several hyperparameters that are difficult to tune, for example, starting temperature, cooling schedule, population size, and crossover/mutation probabilities, and usually require a large number of iterations, which renders them prohibitive for large datasets. Consequently, stochastic global optimization methods are usually not preferred; instead, methods with multiple random restarts are commonly used [1]. The k-means++ [12] is probably the most widely used center initialization algorithm. It selects the cluster centers by sampling datapoints from a probability distribution that strives to effectively spread them away from each other. Specifically, the algorithm comprises two key steps: i) computing a probability distribution for center selection and ii) sampling a datapoint from this distribution as the initial cluster center position. The method iteratively recalculates the distribution in order to select subsequent cluster centers. This iterative process terminates when all K centers have been initialized.

The global k-means [13] algorithm has also been proposed to effectively solve the k-means initialization problem. It constitutes a deterministic global optimization method that does not depend on center initialization. Instead of randomly selecting starting positions for all cluster centers, the global k-means proceeds incrementally, attempting to optimally add one new cluster center at each stage k by exploiting the \(k-1\) clustering solution. This effective approach is deterministic and does not depend upon any initial conditions or empirically adjustable parameters. Nonetheless, a notable limitation of this method is its high computational cost. It has been empirically shown that both the global k-means and k-means++ methods significantly surpass the performance of standard k-means [14, 15].

It is important to note that the performance of random initialization algorithms, such as k-means++, degrades relative to that of the global k-means algorithm as K grows. Large values of K are frequently utilized in clustering scenarios where the objective is to capture finer substructures within a dataset. In this context, we define a clustering of X as an overclustering when \(K>k^{\star }\), with \(k^{\star }\) representing the actual number of clusters present in X. In various fields such as speech recognition [16] and computational biology [17], overclustering is commonly practiced. For example, in computational biology, overclustering ensures that relevant cell types can be discovered even if an expected population is split into sub-populations. Moreover, overclustering is vital as an algorithmic step in more sophisticated clustering algorithms, particularly when dealing with non-convex data structures [18, 19]. In overclustering methods, obtaining solutions from the global k-means clustering algorithm is desirable. However, due to the computational burden associated with large values of N and K, computationally cheaper alternatives are practically utilized.

In this paper, we propose the global k-means++ clustering algorithm, which is an effective way of acquiring clustering solutions of comparable clustering errors to those that global k-means produces without its high computational cost. This is achieved by employing the effective center selection probability distribution of the k-means++ method. The global k-means++ algorithm is an attempt to retain the effectiveness of the incremental clustering strategy of the global k-means while reducing its computational demand using the efficient stochastic center initialization of the k-means++ algorithm. The proposed algorithm proceeds incrementally by solving all intermediate sub-problems with \(k \in \{1, 2,\ldots , K-1\}\) to provide the clustering solution for K clusters. The underlying idea of the proposed method is that an optimal solution for a clustering problem with K clusters can be obtained using a series of local k-means searches that are appropriately initialized. More specifically, at each local search, the \(k - 1\) cluster centers are always initially placed at their optimal positions corresponding to the clustering problem with \(k-1\) clusters. However, the remaining kth cluster center is placed at several starting positions within the data space that are randomly selected by sampling from the k-means++ probability distribution.

The rest of the paper is organized as follows. First, in Section 2, we give a brief overview of the k-means algorithm, the initialization method of k-means++, the global k-means clustering algorithm, and its variations. Then, in Section 3, we present the proposed global k-means++ clustering algorithm. In Section 4, we provide extensive comparative experimental results and discussion. Finally, Section 5 presents conclusions and future work suggestions.

2 Related work

The k-means algorithm finds locally optimal solutions with respect to the clustering error (1). It is an iterative center-based clustering algorithm that starts with K cluster centers. In its purest form (Lloyd’s algorithm [10]), the K centers are typically selected uniformly at random from the set of datapoints X. Then the two-step algorithmic procedure follows iteratively until convergence, which includes the data assignment and the center optimization steps. At the assignment step, every data point \(x_i\) is assigned to the cluster \(C_j\) with the nearest center \(m_j\):

$$\begin{aligned} j =\arg \underset{k}{\min} || x_i - m_k ||^2. \end{aligned}$$
(3)

At the optimization step, each center is updated to the mean of all datapoints assigned to its cluster:

$$\begin{aligned} m_j = \frac{1}{|C_j|} \sum \limits _{x_i \in C_j} x_i. \end{aligned}$$
(4)

This simple optimization methodology is proven to be computationally fast and effective. The main disadvantage of the standard k-means algorithm is that it is a local optimization method with high sensitivity to the original starting positions of the cluster centers. A poor center selection usually leads to local minima of the clustering error. Therefore, to obtain near-optimal solutions using the k-means algorithm, several runs must be scheduled differing in the starting positions of the cluster centers. Among these runs, the solution with the lowest clustering error is naturally chosen and retained.
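For concreteness, the following minimal NumPy/SciPy sketch alternates the assignment step (3) and the update step (4) of the Lloyd iteration described above; it is an illustrative implementation (empty clusters simply keep their previous center), not a reference one.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lloyd_kmeans(X, init_centers, max_iter=100, tol=1e-6):
    """Minimal Lloyd's k-means: alternates the assignment step (3) and the update step (4)."""
    centers = init_centers.copy()
    for _ in range(max_iter):
        dists = cdist(X, centers, metric="sqeuclidean")          # squared distances, shape (N, K)
        labels = dists.argmin(axis=1)                            # assignment step (3)
        new_centers = np.array([X[labels == k].mean(axis=0)      # update step (4)
                                if np.any(labels == k) else centers[k]
                                for k in range(len(centers))])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    dists = cdist(X, centers, metric="sqeuclidean")
    return centers, dists.argmin(axis=1), dists.min(axis=1).sum()   # centers, labels, error (1)
```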

The k-means++ algorithm [12] is considered the most successful method for choosing the initial positions of the cluster centers. It randomly selects the center positions from the datapoints using a multinomial probability distribution that strives to effectively distribute the centers away from each other. Specifically, the method begins by choosing the first center position through uniform random selection from the dataset X. Then it computes the distances

$$\begin{aligned} d_i = \min _{j}||x_i - m_j||^2, \end{aligned}$$
(5)

of each datapoint \(x_i\), \(i = 1, \ldots , N\), from its nearest center, and forms the probability vector \(P = \left( p_1, \ldots , p_N\right) \), where each component \(p_i\) is given by

$$\begin{aligned} p_i = Pr(m = x_i) = d_i \bigg / \sum \limits _{j=1}^N d_j. \end{aligned}$$
(6)

Then, it samples the next center position m from the data using the multinomial distribution with probability vector P. This two-step procedure is repeated until the number of centers equals K. After the k-means++ center initialization procedure, the standard k-means algorithm proceeds with the assignment and optimization steps, described in (3) and (4) respectively. These steps are iteratively applied until convergence is achieved. It is important to note that choosing centers using the k-means++ algorithm is computationally fast, while it also guarantees an \(\mathcal {O}(\log {k})\)-competitive clustering solution [12].
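The seeding procedure of (5)-(6) can be sketched in a few lines of Python; the snippet below is an illustration rather than the reference implementation of [12].

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeanspp_init(X, K, rng=None):
    """k-means++ seeding: each new center is a datapoint sampled with probability (6)."""
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    centers = [X[rng.integers(N)]]                                       # first center: uniform random
    for _ in range(1, K):
        d = cdist(X, np.array(centers), metric="sqeuclidean").min(axis=1)  # distances d_i of (5)
        p = d / d.sum()                                                  # probability vector P of (6)
        centers.append(X[rng.choice(N, p=p)])                            # sample the next center from P
    return np.array(centers)
```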

The global k-means clustering algorithm [13] is a deterministic global optimization method that does not depend on any predefined parameter values and employs the k-means algorithm as a local search procedure. The complete global k-means method is presented in Alg. 1.

Algorithm 1: The global k-means [13]

Instead of randomly selecting initial values for all cluster centers, the global k-means algorithm incrementally adds one new cluster center at each stage, attempting to place it optimally. To accomplish that, the global k-means algorithm proceeds to solve a clustering problem with K clusters by sequentially solving every intermediate sub-problem with k clusters (\(k \in \{1,\ldots , K\}\)). In order to solve the problem with k clusters, the global k-means utilizes the solution with \(k-1\) cluster centers.

Note that the set of k centers is denoted as \(M_k\), the k clustering partition is denoted as \(C_k\), and E(C) denotes the clustering error, see (1). The algorithm starts by solving the 1-means problem, where the optimal position corresponds to the center (mean) of the dataset X (step 1 in Alg. 1). Then, it solves the 2-means problem by performing N executions of the k-means algorithm (steps 3-5 in Alg. 1). In each execution n, the first cluster center is always initialized at the optimal solution of the 1-means sub-problem, while the second center is initially set at the data point \(x_n\) \((n \in \{1,\ldots , N\})\). The best solution, i.e. the one with the lowest error (1), obtained after the N executions of the k-means algorithm is considered the solution of the 2-means clustering sub-problem (step 6 in Alg. 1). Following the same incremental procedure, the solution for k clusters is obtained (steps 2-7 in Alg. 1). In general, to solve the k-cluster sub-problem, the procedure begins by initializing the \(k-1\) centers at the center positions provided by the solution of the \((k-1)\) problem; then, the new \(k_{\text {th}}\) center is initialized at each datapoint \(x_n\). The k-means algorithm is executed N times while retaining the best clustering solution. The global k-means algorithm is very successful in obtaining near-optimal solutions, but computationally expensive for large N, as it involves \(\mathcal {O}(NK)\) executions of k-means on the entire dataset.
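The incremental procedure can be summarized by the following Python sketch, which uses scikit-learn's KMeans as the local search; it is a simplified illustration (only the centers of each sub-problem are stored), not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, K):
    """Sketch of global k-means: grow the solution one center at a time, trying
    every datapoint as the initial position of the new center (O(NK) k-means runs)."""
    centers = X.mean(axis=0, keepdims=True)                       # optimal 1-means solution
    solutions = {1: centers}
    for k in range(2, K + 1):
        best = None
        for x_n in X:                                             # every datapoint is a candidate
            km = KMeans(n_clusters=k, init=np.vstack([centers, x_n]), n_init=1).fit(X)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        centers = best.cluster_centers_                           # lowest-error k-cluster solution
        solutions[k] = centers
    return solutions
```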

The global k-means is widely acknowledged for its optimization capabilities. However, its prominent drawback lies in its high computational complexity, as each k clustering sub-problem necessitates \(\mathcal {O}(N)\) k-means executions. Consequently, the practical application of the algorithm is predominantly confined to small datasets. This limitation has prompted researchers to explore heuristic techniques for selecting initial center candidates, with the dual objective of reducing computational complexity while maintaining clustering results of comparable quality to those obtained with global k-means. Nonetheless, it is essential to acknowledge that reducing the computational complexity of the global k-means algorithm results in the loss of its deterministic nature.

The fast global k-means algorithm (FGKM)  [13] constitutes an effort to accelerate the global k-means. Unlike global k-means, which necessitates \(\mathcal {O}(N)\) k-means executions for each k sub-problem, the FGKM algorithm performs only a single k-means execution. Alg. 2 describes the FGKM clustering method.

Algorithm 2: The fast global k-means [13]

In particular, for each value of k, the FGKM algorithm computes an upper bound \(E - b_n\) on the clustering error (denoted as \(E_n\)) that would arise if the new center were initialized at the specific point \(x_n\) and k-means were executed until convergence. That is, \(E_n \le E - b_n\), where E is the error already found for the \(k-1\) clustering sub-problem, while the value \(b_n\) is given by

$$\begin{aligned} b_n = \sum _{j=1}^{N} \max (d^{j}_{k-1} - ||x_n - x_j||^2, 0). \end{aligned}$$
(7)

In (7), \(d^{j}_{k-1}\) denotes the squared Euclidean distance between \(x_j\) and its closest cluster center among the \(k-1\) centers obtained so far. The initial position of the k-th cluster center is set to the data point \(x_{i^{\star }}\) with minimum \(E_n\) or, equivalently, with maximum \(b_n\):

$$\begin{aligned} i^{\star }=\arg \max _n b_n. \end{aligned}$$
(8)
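A compact sketch of this selection rule is given below; the function name and the explicit pairwise distance matrix (which makes the \(\mathcal {O}(N^2)\) cost visible) are illustrative choices, not the original implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fgkm_candidate(X, centers):
    """Select the FGKM initial position for the new center: the datapoint x_n
    maximizing b_n in (7), i.e. minimizing the upper bound E - b_n on E_n."""
    # d_{k-1}^j: squared distance of every point x_j to its nearest current center.
    d_prev = cdist(X, centers, metric="sqeuclidean").min(axis=1)     # shape (N,)
    pair = cdist(X, X, metric="sqeuclidean")                         # all pairwise distances, O(N^2)
    b = np.maximum(d_prev[None, :] - pair, 0.0).sum(axis=1)          # b_n for every candidate x_n
    return int(np.argmax(b))                                         # index i* of (8)
```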

Several methods have been proposed that modify the global k-means or the fast global k-means to make it more efficient [14]. The efficient global k-means clustering algorithm (EGKM) [20] defines a normalized density function criterion for selecting the top candidate for the new cluster center. However, EGKM, similarly to FGKM, demands all pairwise distances \(d(x_i, x_j)\); thus it needs \(\mathcal {O}(N^2)\) distance computations and extra memory space. The fast modified global k-means algorithm (FMGKM) [21] defines a more sophisticated auxiliary function criterion for selecting the candidates compared to FGKM. Its main drawback is its computational complexity because, at each iteration, it requires the computation of the whole or part of the affinity matrix, which also generally requires \(\mathcal {O}(N^2)\) operations. In addition, the FMGKM algorithm exhibits a secondary limitation due to the utilization of an auxiliary function criterion, which introduces a non-convex optimization problem. The fast global k-means clustering based on local geometric information [22] is an attempt to reduce the computational complexity of FGKM [13] and FMGKM [21] by leveraging local geometric structure information to decrease the distance computations required in each iteration. It is claimed that the method requires \(\mathcal {O}(n_1 n_2)\) distance computations to select the new center candidate, where empirically \(n_1 \ll N\) and \(n_2 \ll N\). Fast global k-means clustering using cluster membership and inequality [23] is also an attempt to reduce the distance computations of the FGKM. It is claimed that the method requires \(\mathcal {O}(N n_1)\) distance computations to select the new center candidate, where empirically \(n_1 \ll N\). It is crucial to emphasize that all related methods exhibit clustering performance comparable to FGKM rather than to global k-means.

3 The global k-means++ algorithm

Global k-means is a deterministic algorithm proposed to tackle the random initialization problem, but it is computationally expensive. It partitions the data into K clusters by solving all k cluster sub-problems sequentially for \(k \in \{1,\ldots , K\}\). In order to solve the k clustering sub-problem, the method employs a strategy wherein the k-means algorithm is executed N times. Each algorithmic cycle considers all N datapoints as potential candidate positions for the new cluster center. This comprehensive exploration of candidate positions aims to identify the optimal placement for the new cluster center and drastically improves the clustering performance.

An approach to reduce the complexity of the global k-means algorithm is to consider a set of L datapoints as candidate positions for the new cluster center, where \(L \ll N\). However, an effective candidate center selection strategy is required. Employing a random uniform selection method for the L candidates is not expected to be a viable choice. An ideal candidate selection method should meet the following requirements:

  • Efficiency requirements

    i. Low computational complexity.

    ii. Low space complexity.

  • Effectiveness requirement

    i. Selection of a set of high-quality center candidates.

We consider as “high-quality” a set of centers assigned to separate data regions. In order to select such center candidates, prioritizing sampling from regions without cluster centers is crucial. Such a strategy enhances the likelihood of choosing candidates that belong to different clusters while minimizing the risk of selecting redundant or sub-optimal center positions. For this reason, we have employed the k-means++ probability distribution as the candidate selection method. The effectiveness of this distribution in the initial placement of cluster centers makes it an excellent choice for addressing the aforementioned requirements.

Algorithm 3: Global k-means++

Algorithm 4: Batch Sampling

Algorithm 5: Sequential Sampling

In this work, we propose the global k-means++ algorithm, an effective relaxation of the global k-means method. Its main difference from global k-means is that the proposed method requires only L executions of k-means for each k cluster sub-problem, where \(L \ll N\). The primary goal of our proposed clustering method is to provide high-quality results that are comparable to those of the global k-means algorithm, while retaining low computational and space complexity. The complete global k-means++ method is presented in Alg. 3. As with any other global k-means variant, in order to reduce the computational requirements, we sacrifice determinism by using an effective stochastic initial center selection procedure.

The algorithm requires as input the dataset X, the number of clusters K, the number of candidates L to consider, as well as the sampling method S. Two possible sampling strategies are considered, namely, batch sampling and sequential sampling. It should be noted that \(D = (d_1, \ldots , d_N)\) represents the distance vector, where \(d_i\) is the distance of \(x_i\) to its closest center \(m_j\) (5).

Fig. 1: Illustration of a running instance of the algorithm applied to the “R15” dataset [24]. Circles denote the datapoints, the cluster centers are represented by red stars, while the center candidates are marked with green crosses

Specifically, to solve a clustering problem with K clusters, the method proceeds as follows. Initially, it addresses the 1-means sub-problem by setting the center \(m_1\) as the mean of the entire dataset X and it initializes the distance vector D (steps 1-2 in Alg. 3). Then, to solve the 2-means sub-problem, it utilizes the 1-means solution as the first initial center position (obtained in step 1). To initialize the second center, a set of L candidate positions \(\{c_1, \ldots , c_L\}\) is sampled using batch or sequential sampling (steps 4-8 in Alg. 3). The k-means algorithm is executed L times until convergence, once for each candidate position \(c_\ell \) (steps 9-11 in Alg. 3), and the solution with the minimum error is retained (step 12 in Alg. 3). Finally, the distance vector D is updated (steps 13-15 in Alg. 3), using the pre-computed center-to-datapoint distances of the best clustering solution. Following the same reasoning, for each \(k = 2, \ldots , K\), it incrementally tackles the k cluster sub-problem by leveraging the solution of the previous \((k-1)\) cluster sub-problem. To solve for k clusters, it utilizes the center positions obtained from the \((k-1)\) sub-problem as initialization for the \(k-1\) centers. It then uses the batch or sequential sampling strategies, which we discuss next, to initialize the new \(k_{\text {th}}\) center. Finally, the algorithm returns the clustering solution for every \(k \in \{1,\ldots ,K\}\). Importantly, it should be noted that steps 9-11 of Alg. 3 can be executed concurrently or in parallel. This parallel execution allows for efficient computation and optimization.
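A minimal Python sketch of this loop with batch sampling, again using scikit-learn's KMeans as the local search, is given below; names are illustrative and the per-k bookkeeping of Alg. 3 is compressed.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def global_kmeans_pp(X, K, L, rng=None):
    """Sketch of global k-means++ with batch sampling: only L candidates, drawn from the
    k-means++ distribution, are tried as initial positions for each new center."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X.mean(axis=0, keepdims=True)                       # 1-means solution (step 1)
    D = cdist(X, centers, metric="sqeuclidean").min(axis=1)       # distance vector D (step 2)
    solutions = {1: centers}
    for k in range(2, K + 1):
        p = D / D.sum()                                           # k-means++ probabilities, as in (6)
        idx = rng.choice(len(X), size=min(L, np.count_nonzero(p)), replace=False, p=p)
        best = None
        for c in X[idx]:                                          # only L k-means runs per sub-problem
            km = KMeans(n_clusters=k, init=np.vstack([centers, c]), n_init=1).fit(X)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        centers = best.cluster_centers_                           # keep the lowest-error solution
        D = cdist(X, centers, metric="sqeuclidean").min(axis=1)   # update D from the best solution
        solutions[k] = centers
    return solutions
```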

We can sample candidates from the probability distribution P through either batch or sequential sampling, presented in Alg. 4 and Alg. 5, respectively. On the one hand, batch sampling does not require any additional computations, as it utilizes only the pre-computed center-to-datapoint distances \(D = (d_1, \dots , d_N)\). First, it computes the probability vector \(P = \left( p_1, \ldots , p_N\right) \), where \(p_i\) represents the selection probability of the datapoint \(x_i\) (step 1 in Alg. 4). The distribution P is formulated so that high selection probabilities are assigned to datapoints far from cluster centers, while small or near-zero probabilities are assigned to datapoints near cluster centers. Then, a set of L candidates \(\{c_1, \ldots , c_L\}\) is sampled without replacement from the constructed k-means++ probability vector P (step 2 in Alg. 4).

On the other hand, the sequential sampling strategy selects one candidate \(c_{\ell }\) at a time using the distribution P (step 3 in Alg. 5). This procedure aims to select candidates that are far from: i) the set M consisting of the converged solutions of the \(k-1\) centers, and ii) the \(\ell -1\) candidates that have already been sampled. After each sampling, the distance \(d_i\) of each datapoint \(x_i\) is updated accordingly (steps 4-6 in Alg. 5). This procedure is repeated L times in total to create the set of candidates \(\{c_1, \ldots , c_L\}\) (steps 1-7 in Alg. 5). Although the sequential sampling method incurs the cost of recalculating the minimum distance values, it provides a better spread of the L samples in the data space.
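The sequential strategy can be sketched as follows (an illustration with a hypothetical function name, not the paper's exact pseudocode):

```python
import numpy as np

def sequential_sampling(X, D, L, rng=None):
    """Sequential sampling sketch: each draw updates the distances so that later
    candidates are also pushed away from the candidates already selected."""
    rng = np.random.default_rng() if rng is None else rng
    d = D.copy()                                          # distances to the current k-1 centers
    candidates = []
    for _ in range(L):
        idx = rng.choice(len(X), p=d / d.sum())           # draw one candidate from P
        candidates.append(X[idx])
        d = np.minimum(d, ((X - X[idx]) ** 2).sum(axis=1))  # update d_i with the new candidate
    return np.array(candidates)
```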

Table 1 Descriptions of datasets

In Fig. 1, we present an illustration of a running example of the algorithm using the 2-dimensional “R15” dataset [24]. The circles denote the datapoints, while their color indicates the cluster category they belong to, as assigned by the global k-means++ algorithm. The red stars represent the converged cluster centers for each clustering sub-problem k, while the green cross symbols show the initial candidate positions of the next cluster center. In each sub-figure 1a-1d, the algorithm solves the current k-means sub-problem. Based on the current solution, the algorithm samples the next center candidates from the k-means++ distribution. The figure demonstrates that the method samples high-quality center candidates.

While the global k-means requires \(\mathcal {O}(NK)\) k-means executions, the global k-means++ only requires \(\mathcal {O}(LK)\), where generally \(L \ll N\). Additionally, we have empirically observed that k-means converges very fast as k grows. This speed-up in convergence is reasonable, since in order to solve each k cluster sub-problem, the method exploits the centers of the \((k-1)\) cluster sub-problem, which are already positioned sufficiently well. Therefore, k-means does not require many iterations to converge.

It should be noted that the proposed method is a relaxation of the global k-means algorithm rather than an optimization of the FGKM, in contrast to the methods discussed in the related work. Its simplicity and speed stem from the fact that the sampling strategies require little to no additional computation to select the L candidates, demonstrating superior computational complexity compared to other global k-means variants. In particular, the batch sampling strategy requires no additional distance computations, since the center-to-datapoint distances necessary to define the probability vector P have already been computed by the k-means procedure. Meanwhile, the sequential sampling strategy requires only \(\mathcal {O}(NL)\) distance computations for the selection process, which is the same as k-means++. Note also that solving the k-means problem with an incremental procedure has many advantages because, due to the unsupervised nature of the clustering problem, it is usually desirable to obtain solutions for different k that are evaluated using appropriate quality criteria (e.g. the silhouette coefficient [25]) for selecting the appropriate number of clusters [26].

4 Experiments

In this section, we present our experimental study to evaluate the effectiveness of the proposed global k-means++ clustering method, both in the batch (gl++ (b)) and sequential sampling (gl++ (s)) settings. Our study involved comparisons against other clustering methods, including global k-means (gl) [13], FGKM (fgl) [13], k-means++ (k-ms++) [12], scalable k-means++ (k-ms||) [27] and standard k-means with random uniform initialization (rnd) [10]. Our implementation of the global k-means++ clustering algorithm is available in the following GitHub repository: https://github.com/gvardakas/global-kmeans-pp.git.

Fig. 2: Relative Percentage Error for the breast cancer dataset, for different L values

Fig. 3: Relative Percentage Error for the pendigits dataset, for different L values

Fig. 4: Relative Percentage Error for the wine dataset, for different L values

Fig. 5: Clustering Error Differences for the mnist dataset, for different L values

Fig. 6: Average number of k-means iterations

Table 2 CPU Time. Values marked by \(\dagger \) and \(\ddagger \) denote that the method could not be executed due to memory/time and method constraints, respectively

4.1 Datasets

In order to evaluate the proposed algorithm, a series of experiments have been conducted on various benchmark and publicly available datasets. Moreover, we deliberately selected datasets that exhibit a broad spectrum of data characteristics, encompassing factors such as the number of samples N, the data dimensionality D, the complexity, and the domain of origin. The description of each dataset is presented in Table 1. As a pre-processing step, we used min-max normalization to map the attributes of each real dataset to the [0, 1] interval to prevent attributes with large ranges from dominating the distance calculations and to avoid numerical instabilities in the computations [30].
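As an illustration of this pre-processing step, min-max normalization can be applied as follows; the scikit-learn copy of the wine data is used here purely as an example and does not necessarily match the exact version listed in Table 1.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler

X = load_wine().data                                        # 178 samples, 13 real-valued attributes
X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)     # every attribute mapped to [0, 1]
```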

4.2 Evaluation

The evaluation of clustering methods encompasses various approaches, including internal and external metrics when the ground truth cluster labels are available. However, our investigation focuses solely on treating it as an optimization problem, with the primary objective being the minimization of clustering error. Consequently, we intentionally disregard any class labels in the data, as our emphasis lies not on assessing external clustering metrics or determining the number of clusters. Instead, we focus on clustering error \(E(C_k)\) (1) as the performance measure for the different methods, enabling direct assessment of their error minimization capabilities.

Specifically, in order to evaluate the performance of the different methods, we have computed the relative Percentage Error, defined as

$$\begin{aligned} PE = \frac{E(C_k) - E(C_{k}^{\star })}{E(C_{k}^{\star })} \times 100\%, \end{aligned}$$
(9)

where \(E(C_{k}^{\star })\) is the clustering error of the baseline method (global k-means), while \(E(C_k)\) is the error provided by each of the compared methods (refer to Figs. 2, 3 and 4). For the mnist dataset, global k-means did not terminate in a reasonable time due to its high computational complexity. In that case, we have used the difference in clustering errors relative to the best-performing algorithm to facilitate a transparent comparison (see Fig. 5). Additionally, we studied the convergence speed of each method for each k by presenting the CPU time required by each algorithm (Table 2) and the average number of iterations needed for k-means to converge (Fig. 6). In this way, we provide comprehensive information regarding each method’s ability to minimize the clustering error as well as its computational speed and efficiency.
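For clarity, the metric (9) reduces to the following one-line computation (a restatement of the formula, not additional methodology):

```python
def relative_percentage_error(E, E_star):
    """Relative Percentage Error (9): error E of a method against the baseline error E_star."""
    return (E - E_star) / E_star * 100.0

# For instance, an error 5% above the baseline yields PE ≈ 5.0:
# relative_percentage_error(1.05, 1.0) ≈ 5.0
```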

4.3 Experimental setup

In our experimental study, we executed the compared methods for a maximum number of clusters, denoted as K, and evaluated all clustering solutions for \(k\in \{1,\ldots ,K\}\) in terms of clustering error. We chose the maximum number of clusters K based on the dataset size, with \(K=30\) for smaller datasets (breast cancer and wine) and \(K=50\) for medium (pendigits) and larger datasets (mnist). For each dataset, we ran the compared algorithms with L values of 10, 25, 50, and 100. This range of values allowed us to evaluate the impact of L on the performance of the clustering methods.

In the case of the randomly initialized methods, namely k-means++, scalable k-means++ and standard k-means, we have defined the parameter L to represent the number of restarts within each sub-problem k. However, the scalable k-means++ method has an additional hyperparameter l, which denotes the number of data points sampled in each iteration, which we set equal to L. In contrast, for the incremental optimization methods of global k-means++ and FGKM, only a single execution was performed, considering L candidates. We implemented this approach to ensure a fair comparison with the randomly initialized methods, which do not utilize prior sub-problem solutions in their optimization procedure. It is worth noting that the FGKM, as presented in related work [13], originally considers only one candidate (8). However, in our study, we relaxed this limitation by allowing the method to select, in each iteration, the top L candidates that maximize (7). In the following experiments, we evaluate the optimization capabilities of each method, the average number of iterations required, as well as the time needed for convergence to the clustering solution. In summary, for each dataset, we conducted the following experiments:

  • We selected \(K=30\) or \(K=50\) as the maximum number of clusters depending on the size of the dataset.

  • We executed one run of global k-means, global k-means++, and FGKM. For the two latter methods, we considered L equal to 10, 25, 50 and 100 candidates in each k cluster sub-problem.

  • The k-means++, scalable k-means++ and standard k-means methods were initialized L times for each \(k=1,\ldots , K\).

4.4 Results

Figures 2, 3, and 4 depict the relative percentage error between each method and the baseline algorithm, for the respective datasets: breast cancer, pendigits, and wine. However, in the case of the mnist dataset, Fig. 5 displays the difference in clustering error between each method and the best-performing algorithm. In the subfigures a, b, c, and d, we present the results of experiments considering different indicative values of L (i.e., 10, 25, 50, and 100) for the global k-means variants, k-means++, scalable k-means++, and the standard k-means algorithm. Moreover, Fig. 6 provides the average number of iterations required by each algorithm to converge, while Table 2 presents the CPU time necessary to compute all K clustering solutions for each algorithm.

The results across all datasets demonstrate that both the batch and sequential versions of global k-means++ exhibit performance similar to global k-means, while clearly outperforming the FGKM, k-means++, scalable k-means++ and the standard k-means algorithm. The optimization capabilities of global k-means++ are particularly noteworthy, especially as k increases, where minimizing the clustering error becomes more challenging. As expected, the performance of global k-means++ improves with an increasing number of candidates, better approximating the performance of global k-means. Surprisingly, in the wine dataset, global k-means++ outperformed global k-means in several k sub-problems (Fig. 4b-4d). The relative percentage error of global k-means++ was less than \(1\%\) when considering more than \(L=25\) candidates. In the pendigits dataset (Fig. 3), the baseline model and our algorithm coincide with high accuracy for \(L=25, 50, 100\). Notably, global k-means required four days of execution, while global k-means++ completed the task in significantly less time, ranging from 17 seconds up to almost 2 minutes, depending on the value of L. This indicates that we achieved comparable results in a fraction of the time. Last but not least, in the case of the mnist dataset (Fig. 5), when we consider \(L=25\) candidates or more, the global k-means++ consistently outperforms the other methods.

It should be noted that in the wine dataset (Fig. 4d), the FGKM algorithm with \(L=100\) candidates demonstrated promising results. However, its performance across all datasets was inconsistent. It is evident that the FGKM algorithm cannot reliably approximate the clustering solution achieved by the global k-means algorithm. Even with an increasing number of candidates, the clustering solution provided by the FGKM significantly deviates from the quality attained by the global k-means++.

As expected, the k-means++ algorithm was successful for small values of k across all datasets. However, it becomes apparent that as k increases, the performance of the k-means++ algorithm deteriorates compared to global k-means and global k-means++. This is because selecting the initial center positions becomes more challenging as k grows. Additionally, scalable k-means++ had inferior performance compared to k-means++ in most cases, especially when k is large. As expected, the standard k-means with random uniform selection yielded the worst results.

Figure 6 presents the average number of k-means iterations performed by each method in all datasets and k sub-problems. It becomes evident that as k increases, the incremental methods require fewer iterations in each run of k-means due to improved initialization of the cluster centers. This advantage arises from the fact that the previous \(k-1\) centers have already been positioned in near-optimal solutions. It should be recalled that for each cluster number k, the number of executed k-means runs is \(\mathcal {O}(Lk)\) for all compared methods, except for global k-means, which executes k-means \(\mathcal {O}(Nk)\) times. This means that even if global k-means tends to execute a smaller average number of k-means iterations compared to k-means++, this does not translate to faster execution times. Table 2 presents the CPU time required by each algorithm to compute all K clustering solutions. It turns out that if we want to find all k cluster solutions from \(k=2\) up to a cluster number K, global k-means++ is the fastest approach.

4.5 Discussion

From the above experimental results, the following empirical conclusions can be drawn.

The global k-means incremental method exhibits powerful optimization capabilities. However, due to its large computational requirements, it is typically applied only to smaller datasets. As shown in the experimental results, even for a medium-sized dataset like pendigits, global k-means required a very long execution time. Nevertheless, it is a very powerful extension of k-means and has been widely used in the clustering literature [31]. It should be noted that the algorithm provides high-quality solutions; however, it does not guarantee the discovery of the global optimum. As we have analyzed in the introduction, minimization of the clustering error is known to be an \(\textsf{NP}\)-hard problem. We have observed such a case in the wine dataset, where global k-means did not yield the best results.

FGKM constitutes a natural fast variant of the global k-means method which, in order to initialize the new cluster center, selects the data point \(x_{i^{\star }}\) that minimizes an upper bound on the final k-means clustering error. We should recall that we allowed the FGKM method to select the top L candidates that maximize \(b_n\) in (7), instead of just one as originally proposed. However, this candidate selection heuristic performs poorly, even compared to the random initialization method. This suggests that, while FGKM is intuitively justified, it does not constitute a very effective strategy.

The k-means++ algorithm is considered the state-of-the-art random initialization algorithm for solving the k-means clustering problem due to its strong empirical performance, theoretical guarantees on solution quality, and simplicity [32, 33]. As expected, it demonstrated good performance and generally surpassed most methods, including the FGKM, the scalable k-means++, and of course k-means with random initialization. However, global k-means and global k-means++ had better performance overall, and this becomes more apparent as k gets larger.

The proposed Global k-means++ produced very satisfactory results. It demonstrates similar performance to the global k-means with a large reduction in computational complexity. Its candidate selection method chooses excellent candidates, as it is inspired by the successful k-means++ seeding strategy. Moreover, it provides the clustering solution for all intermediate values of k. Therefore, if we wish to obtain clustering solutions for different k, its time performance is even more impressive compared to the rest of the random initialization methods, which have to be executed separately for each k.

5 Conclusions

We have proposed the global k-means++ clustering algorithm, which is an effective relaxation of the global k-means algorithm that provides an ideal compromise between clustering error and execution speed. The basic idea of the proposed method is to take advantage of the superior clustering solutions that the global k-means algorithm can provide while avoiding its substantial computational requirements. The global k-means++ is an incremental clustering approach that dynamically adds one cluster center at each k cluster sub-problem. For each k cluster sub-problem, the method selects L datapoints as candidates for the initial position of the new center using the effective k-means++ selection probability distribution. The selection method is fast and requires no extra computational effort for distance computations.

Global k-means++ has been tested on various publicly available benchmark datasets and has been compared to the global k-means, the FGKM with multiple candidates, the k-means++, the scalable k-means++, and the standard k-means with random uniform initialization. The experimental results reveal its superiority over all methods except global k-means. In the latter case, its performance is comparable but with a significant decrease in computational cost, thus establishing it as an effective relaxation.

We conclude that the proposed algorithm balances out the best characteristics of global k-means and k-means++ methods, resulting in high quality clustering at reduced computational cost. Despite previous efforts to enhance the FGKM algorithm, our method takes a different approach by directly addressing the global k-means method. We believe this method is an effective alternative to both global k-means and k-means++. Furthermore, the method provides all clustering solutions for \(k\in \{1,\ldots , K\}\). In this way, it is suitable for comparing clustering solutions for different k to determine the appropriate number of clusters, which is unknown in many applications. In this case, the method is significantly faster, even when compared to non-incremental methods such as standard k-means and k-means++.

In future work, it would first be interesting to apply the idea to generalizations of k-means that deal with distance measures other than the Euclidean, such as the kernel k-means algorithm or the k-medoids algorithm. We could also investigate a dynamic method for appropriately selecting the number of candidates L in each k cluster sub-problem. This could potentially lead to a further speed-up of the algorithm. We could also adapt the global k-means++ algorithm to a semi-supervised setting by incorporating must-link or cannot-link constraints, utilizing the exact solver referred to in [34]. In addition, we aim to test the method in real applications (such as biology, speech recognition, face clustering, etc.) that initially require a large number of clusters (overclustering solutions). Finally, it would be interesting to integrate the method into a deep framework for clustering of composite objects (graphs, images, text, etc.).