1 Introduction

Semi-supervised learning can enhance the performance of a learning algorithm by utilizing an extensive amount of unlabeled data in real-world tasks [1, 2]. Several semi-supervised methods have been developed, including co-training [3, 4], self-training [5, 6], consistency regularization [7, 8], and holistic methods [9, 10]. The co-training approach can be described as a disagreement-based technique for semi-supervised learning [11]. During the co-training process, two separate learners are trained on different views, and they select the unlabeled instances about which they are most confident. The unlabeled instances of one learner are then given pseudo-labels and added to the training set of the other learner, which uses them to update its model parameters. Unlike graph and kernel methods [12, 13], co-training does not require additional complex computations or large amounts of memory.

Fig. 1

The framework of S2RMS. The deep metric network \(\mathrm M_T\) used in S2RMS maps the labeled data \(\textbf{L}\) and unlabeled data \(\textbf{U}\) into a d-dimensional embedding space, which facilitates subsequent operations. To establish the co-training framework, k subsets \(\{\textbf{L}_i'\}_{i=1}^k\) are randomly and repeatedly selected from the transformed \(\textbf{L}'\) and used to initialize k regressors \(\{h_j\}_{j=1}^k\). During the co-training iterations, instances are identified from the transformed \(\textbf{U}'\) using a similar instances selector and added to the data pool \(\textbf{P}\). The co-regressor then employs a label smearing technique within \(\textbf{P}\) to ensure representation stability and assigns appropriate pseudo-labels for the unlabeled instances. Instances with pseudo-labels are added to the training set of the corresponding regressor, which is then retrained

Two main problems are encountered when conducting co-training. First, some studies have indicated that casually using unlabeled data may result in a decrease in model effectiveness [14]. This is because not all unlabeled data can be used for training purposes: it is difficult to assign accurate pseudo-labels to some unlabeled instances [15, 16]. If unlabeled instances carrying incorrect pseudo-labels continue to be added to the training set, the predictive performance of the utilized learner can be seriously disturbed [17]. This situation is comparable to introducing noisy data into the training set, which is beyond our control. Second, weaker learners struggle to assign accurate pseudo-labels to unlabeled data during the initial stages of training, resulting in a failure to satisfy the expected confidence level [18]. These unsuitable instances are also added to the training set, which leads to the accumulation of errors until the end of the training process.

In this paper, we present semi-supervised regression via embedding space mapping and pseudo-label smearing (S2RMS). This approach combines co-training with a deep metric neural network to handle the above issues. To address the first problem, we use a metric network to learn a decision function that can effectively utilize unlabeled data. To address the second problem, we employ a pseudo-label smearing technique to train robust regressors. Figure 1 shows the framework of S2RMS.

Figure 1 shows that S2RMS selects multiple regressors to build a co-training model. In this study, a particular set of regressors is chosen to test the validity of S2RMS. To better cope with different environments, S2RMS uses random forests [19] with varying numbers of estimators as the base regressors. Each regressor also has different parameters to enhance the variability between different models.

Under the smoothness assumption [1], instances located in high-density areas that are close to each other typically have similar labels, while those that are further apart are more likely to have different labels. Based on this hypothesis, a regression model trained with a small amount of labeled data will have a higher prediction accuracy when confronted with unlabeled data that behave similarly to the labeled data.

The S2RMS procedure is as follows. First, a pairwise dataset is generated for all anchor instances in the labeled dataset, and this dataset comprises anchor-positive and anchor-negative instance pairs. Subsequently, a deep metric Triplet neural network is trained using this pairwise dataset to learn embedding mappings. Similar instances are located close to each other in this embedding space. The Triplet neural network is used to map labeled and unlabeled data into an embedding space. Second, k subsets are created by randomly sampling the transformed labeled data. Each subset is divided into two parts, with one part used for training the regressors and the other used to calculate the confidence level. The regressors are initialized with different parameters to enhance the diversity and ensure the efficacy of S2RMS. In each iteration, the Triplet neural network functions as an instance selector, choosing two similar unlabeled instances for each labeled instance. These similar instances are then added to the pool. The other regressors use pseudo-label smearing to select stable instances from the pool for another iteration. Afterward, these stable instances are assigned pseudo-labels with confidence. Finally, all k training subsets are augmented with these stable instances. All regressors are trained again with the updated data before moving on to the next iteration. After completing all training processes, the unknown data are predicted through averaging.

The effectiveness of the S2RMS algorithm is evaluated through experiments conducted on 14 benchmark datasets. In addition, these datasets are normalized to better visualize the experimental results. We compare the S2RMS algorithm with three excellent semi-supervised regression algorithms. These algorithms also use co-training or metric networks. The results indicate that the S2RMS algorithm outperforms these algorithms.

The remainder of this paper is organized as follows. In Section 2, popular metric learning and semi-supervised learning approaches are reviewed. Section 3 presents the proposed S2RMS algorithm. The spatial mapping method and the smearing technique are described in this section. Section 4 provides the experimental results and analysis. In Section 5, we conclude the paper and outline further work directions.

2 Related works

This section reviews two fundamental techniques that are used in S2RMS: deep metric learning and co-training.

2.1 Deep metric learning

Metric learning aims to measure the similarity between instances using an optimal distance metric for performing the learning task. The main objective is to learn a new metric that reduces the distances between instances belonging to the same class and increases the distances between instances belonging to different classes [20]. Traditional metric learning methods typically use a linear projection strategy, which may not be suitable for real-world problems with nonlinear structures. In deep metric learning, this problem is solved by activation functions with nonlinear structures [21].

The development of deep metric learning is primarily reflected in the available loss functions. The two most basic loss functions in deep metric learning are Contrastive Loss [22] and Triplet Loss [23]. Many variants have been derived from these two functions, such as the Quadruple Loss [24], ArcFace Loss [25], and Ranked List Loss [26]. Furthermore, deep metric learning has been primarily applied in three areas: 1) image retrieval, 2) person re-identification (ReID), and 3) face recognition. In the field of image retrieval, some studies [27, 28] have examined tasks such as visual sentiment analysis and massive image retrieval by designing different similarity losses. Constrained by the limitations of manual annotation, TSSML [29] improves the hard instance mining process through curriculum learning for semi-supervised ReID tasks. In the facial recognition task, DDFMs [30] use a novel deep metric network approach based on the triplet loss for facial recognition/verification and visual image classification purposes.

2.2 Co-training

Co-training is a divergence-based approach for semi-supervised learning that is designed for multi-view data [3]. This method assumes that data can be effectively represented through various perspectives and that distinct learners can be trained on each view. These learners then label unlabeled instances and incorporate their most confident unlabeled instances into the training sets of other learners. Following each iteration, the learners undergo retraining with the augmented training set.

In the field of co-training, research has focused primarily on three areas: 1) classification, 2) regression, and 3) image segmentation and recognition. In classification tasks, Tri-training [31] utilizes three classifiers with a single view and assigns pseudo-labels to a classifier only if the other two classifiers agree on the label of the example. SIVLC [32] addresses the multi-view challenges encountered in co-training by utilizing sufficient irrelevant views and label consistency. Deep Co-Training [33] successfully constructs deep collaborative classification networks on two views by imposing view difference constraints. Tri-net [34] enhances disparities by generating three other distinct classification models from a shared model with a single view.

In regression tasks, COREG [35] uses two K-nearest neighbor (K-NN) [36] models with different distance measures and K values for regression analysis. CoBCReg [37] uses three RBFNN [38] regressors based on COREG to perform confident instance selection through a committee mechanism. In addition, by integrating a safer and self-paced learning process [39], the collaborative regression model can learn security pseudo-labels with higher quality, resulting in improved model performance. In real-world scenarios, E-CoGRF [40] combines co-training and a random forest model with information entropy grouping to assess the severity levels of depressive symptoms.

In image processing tasks, particularly in the field of medical images, Co-DeepSVS [41] applies support vector regression to multi-view deep co-training networks for survival time estimation. UCMT [42] uses the mean teacher model as a base co-learner while incorporating uncertainty-guided region mixing for medical image segmentation. Some studies [43] have improved the semantic segmentation performance achieved for medical images by fusing collaborative training with adversarial training. In addition, in the field of remote sensing images, RSCoTr [44] combines multi-task learning and co-training to construct a transformer that can simultaneously handle three tasks: classification, segmentation and detection.

In addition to the previously mentioned research areas, co-training is also commonly employed in subfields such as multi-label learning [45], time series learning [46], and class-imbalanced learning [47]. Furthermore, the amalgamation of co-training with other semi-supervised techniques can significantly enhance the performance achieved in various tasks [48, 49].

3 The proposed algorithm

In this section, we present the implementation details of the S2RMS algorithm. The core techniques of the S2RMS model include embedding space mapping, a similarity-based instance selection strategy and pseudo-label smearing. Finally, the process details of the S2RMS algorithm are given. Table 1 details the notations used in the paper.

Table 1 Notations

3.1 The model

Let \(\textbf{L} = \{(x_i, y_i)\}_{i=1}^{n_l}\) denote the labeled dataset, and \(\textbf{U} = \{x_j\}_{j=n_l+1}^{n_l+n_u}\) the unlabeled dataset, where \(n_l\) and \(n_u\) represent the numbers of labeled and unlabeled instances, respectively. Let \(\textbf{H}=\{h_j\}_{j=1}^{k}\) denote a set of k regressors and \(\mathcal {L}\) be the mean square error (MSE) loss function. A general approach using co-training is to train k learners using the original labeled data \(\textbf{L}\). Then, the unlabeled data \(\textbf{U}\) are selected to facilitate mutual learning among the learners and improve the performance of the algorithm. In our algorithm, \(\textbf{L}\) and \(\textbf{U}\) are mapped to \(\textbf{L}'\) and \(\textbf{U}'\), respectively, by \(f^d\) to better select unlabeled data for training. \(f^d\) represents a metric-based neural network that estimates similarity and maps data to the embedding space, with d denoting the dimensionality of the embedding space. Furthermore, the label smearing technique is proposed as a means of obtaining stable instances. The regressors then pass through these stable instances to smooth the predicted output. The objective function can be expressed as follows:

$$\begin{aligned} {\begin{matrix} \min _{\textbf{H}}E&=\sum _{j \in \{1,\dots , k\}} \left( \sum _{x_l \in \textbf{L}'_j}\mathcal {L}(h_j(x_l), y_l)+\sum _{x_s \in \Omega _j}\mathcal {L}(h_j(x_s), y_s) \right) , \end{matrix}} \end{aligned}$$
(1)

where \(\textbf{L}'_j\) is the j-th labeled subset and \(\Omega _j\) is the j-th set of stable instances with high confidence. To optimize the training process, a pool of unlabeled instances is constructed using a similarity metric strategy. Each optimization iteration applied to the regressors is performed using the data in this pool.

Fig. 2

The unlabeled instances selection process. For each embedded labeled instance \(x_i \in \textbf{L}'\), the Triplet neural network selects the most similar instances from the embedded set \(\textbf{U}'\) according to the threshold \(\tau \). The threshold \(\tau \) decreases with the number of iterations

3.2 Embedding space mapping

To address the first problem of co-training, an effective approach must employ an improved metric for instance selection. The typical approach involves randomly selecting a specific number of unlabeled instances and assigning them pseudo-labels. This approach overlooks the distribution of the given data, potentially resulting in the selection of noisy data that can contaminate the training set. Considering the smoothness assumption, the selected unlabeled data should conform to the same distribution as that of the labeled data. Therefore, a metric method is employed to find unlabeled data that are close to the labeled data. One potential approach is to use the K-NN algorithm, which enables the direct calculation of such metrics. However, constrained by the manifold assumption, we use \(f^d\) to replace the simple linear metric methods. For \(f^d\), we utilize a Triplet neural network. This network consists of three identical subnetworks that receive an input pair and extract instance features. These features are then concatenated to form features that are associated with the input pair. The final layer of the network estimates the similarity of the input pair.

A pairwise dataset \(\{(x_i, \textbf{P}_{x_i}, \textbf{N}_{x_i})\}_{i=1}^{n_l}\) is generated from \(\textbf{L}\) to train the Triplet neural network. Specifically, the three objects in each entry are the anchor \(x_i\), the positive instance set \(\textbf{P}_{x_i}\) and the negative instance set \(\textbf{N}_{x_i}\). A positive instance has high similarity to the anchor, and a negative instance has the opposite relationship. For the anchor \(x_a \in \textbf{L}\), the positive instance set \(\textbf{P}_{x_a}\) and the negative instance set \(\textbf{N}_{x_a}\) are constructed according to \(d_{ij}\), the initial similarity between instances \(x_i\) and \(x_j\) with \(i \ne j\), which is calculated using the Euclidean distance. The numbers of positive and negative instances are both set to 2.
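The pair construction can be illustrated with a minimal sketch. The text does not spell out the rule that turns \(d_{ij}\) into positive and negative sets, so the sketch assumes that the positives of an anchor are its two nearest labeled instances and the negatives its two farthest; this rule and the helper name `build_triplet_sets` are illustrative assumptions, not the authors' implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance d_ij between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def build_triplet_sets(X, n_pos=2, n_neg=2):
    """Return a list of (anchor_index, positive_indices, negative_indices).

    Assumption (not stated in the text): positives are the n_pos nearest
    instances to the anchor, negatives are the n_neg farthest ones.
    """
    triplets = []
    for a, x_a in enumerate(X):
        # All other labeled instances, ordered from nearest to farthest.
        others = [j for j in range(len(X)) if j != a]
        others.sort(key=lambda j: euclidean(x_a, X[j]))
        triplets.append((a, others[:n_pos], others[-n_neg:]))
    return triplets

# Toy labeled set with one feature per instance.
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
triplets = build_triplet_sets(X)
```

For the first anchor, the two nearest instances (indices 1 and 2) become positives and the two farthest (indices 3 and 4) become negatives.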

After generating the pairwise dataset for all anchor instances, the Triplet neural network exploits \(n_l(n_l-1)\) pairs of instances. Then, the Ranked List Loss function \(\mathcal {L}_\textrm{RLL}\) [26] is employed as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_\textrm{P}(x_a, \textbf{P}_{x_a})&= \sum _{x_j \in \textbf{P}_{x_a}}\frac{\exp (\varepsilon (d_{aj} - (\alpha -M)))}{\sum _{x_j \in \textbf{P}_{x_a}}\exp (\varepsilon (d_{aj} - (\alpha -M)))}(d_{aj} - (\alpha -M)), \\ \mathcal {L}_\textrm{N}(x_a, \textbf{N}_{x_a})&= \sum _{x_j \in \textbf{N}_{x_a}}\frac{\exp (\varepsilon (\alpha -d_{aj}))}{\sum _{x_j \in \textbf{N}_{x_a}}\exp (\varepsilon (\alpha -d_{aj}))}(\alpha -d_{aj}), \\ \mathcal {L}_\textrm{RLL}&= (1-\gamma )\mathcal {L}_\textrm{P}(x_a, \textbf{P}_{x_a}) + \gamma \mathcal {L}_\textrm{N}(x_a, \textbf{N}_{x_a}), \end{aligned} \end{aligned}$$
(2)

where \(\alpha \) is the boundary parameter, \(\varepsilon \) is the temperature parameter, and M is the margin value. \(\mathcal {L}_\textrm{P}(x_a, \textbf{P}_{x_a})\) and \(\mathcal {L}_\textrm{N}(x_a, \textbf{N}_{x_a})\) represent the positive and negative instance losses, respectively, and \(\gamma \) is the weight that controls the proportion of each loss. In our experiments, we treat both targets equally and set \(\gamma \) to 0.5. In the feature space of the Triplet neural network, \(\mathcal {L}_\textrm{RLL}\) brings positive instances closer to each other than the negatives are by a margin of M.
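As a rough illustration, the loss in (2) can be evaluated for a single anchor in plain Python. The sketch follows (2) literally, without any hinge or clipping, and takes precomputed distances as input; the function names are hypothetical, and the default values of \(\alpha \), \(\varepsilon \), M and \(\gamma \) are the ones reported in this paper.

```python
import math

def weighted_term(values, eps):
    """Softmax-weighted sum: sum_j softmax(eps * v_j)_j * v_j."""
    shift = max(eps * v for v in values)  # subtract the max for numerical stability
    weights = [math.exp(eps * v - shift) for v in values]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def ranked_list_loss(d_pos, d_neg, alpha=40.0, eps=-0.5, margin=35.0, gamma=0.5):
    """Loss of Eq. (2) for one anchor.

    d_pos: distances d_aj from the anchor to its positive instances.
    d_neg: distances d_aj from the anchor to its negative instances.
    """
    pos_terms = [d - (alpha - margin) for d in d_pos]
    neg_terms = [alpha - d for d in d_neg]
    loss_p = weighted_term(pos_terms, eps)
    loss_n = weighted_term(neg_terms, eps)
    return (1.0 - gamma) * loss_p + gamma * loss_n

# One positive at distance 10 and one negative at distance 30.
loss = ranked_list_loss([10.0], [30.0])
```

With a single positive and a single negative, each softmax weight equals 1, so the loss reduces to \(0.5 \times (10 - 5) + 0.5 \times (40 - 30) = 7.5\).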

The output of the Triplet neural network depends on the form of the input data. In the network, the last layer is designated as the similarity output layer, while the output of the penultimate hidden layer is employed as the embedding space. In the case of a single input, the network generates the deep embedding of the input data. For pairs of inputs, the network outputs the similarity between the two inputs. In particular, when labeled or unlabeled data are employed as inputs, the network returns the data after the metric mapping. Furthermore, for pairs of inputs comprising labeled and unlabeled data, the network outputs the similarity between the paired data. Consequently, the Triplet neural network is capable of executing metric space mapping and similarity measuring operations.

Once the training of the Triplet neural network has been completed, \(\textbf{L}\) and \(\textbf{U}\) are mapped to \(\textbf{L}'\) and \(\textbf{U}'\) through the network. For each instance in \(\textbf{L}'\), the similarity between it and every instance in \(\textbf{U}'\) is computed by the network. This similarity is then compared to a threshold value, \(\tau \), and the two highest-ranked unlabeled instances with similarity values below this threshold are selected. The similarity threshold \(\tau \) is used to control the bounds of the similar instance selection procedure. These instances are used to construct an unlabeled data pool. In each iteration, the Triplet neural network selects a batch consisting of new data to update the pool according to \(\tau \). The threshold \(\tau \) decreases with the number of iterations as follows:

$$\begin{aligned} \tau _{t+1} = \lambda \tau _{t}, \end{aligned}$$
(3)

where t is the number of iterations and \(\lambda \) is the attenuation coefficient. The selected data are then removed from \(\textbf{U}'\). Figure 2 shows the unlabeled instances selection process.
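The selection step and the decay in (3) can be sketched as follows. The sketch assumes that the network output behaves like a distance, so that smaller scores mean closer instances and the best candidates are the smallest scores below \(\tau \); this reading, the function names and the toy score matrix are assumptions made for illustration only.

```python
def select_similar(scores, tau, per_anchor=2):
    """Build the unlabeled pool from a score matrix.

    scores[i][u] is the network score between labeled instance i and
    unlabeled instance u. Assumption: lower scores mean more similar.
    For each labeled instance, keep at most per_anchor unlabeled
    instances whose score is below tau, preferring the lowest scores.
    """
    pool = []
    for row in scores:
        candidates = sorted((s, u) for u, s in enumerate(row) if s < tau)
        for _, u in candidates[:per_anchor]:
            if u not in pool:  # an instance enters the pool only once
                pool.append(u)
    return pool

def decay(tau, lam=0.98):
    """Eq. (3): tau_{t+1} = lambda * tau_t."""
    return lam * tau

# Two labeled instances, four unlabeled instances (toy scores).
scores = [[0.010, 0.030, 0.015, 0.016],
          [0.020, 0.005, 0.012, 0.040]]
tau = 0.017
pool = select_similar(scores, tau)
tau_next = decay(tau)
```

Because \(\lambda < 1\), each iteration tightens the bound, so later iterations admit only instances that are scored as closer to the labeled data.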

3.3 Co-training with pseudo-label smearing

To address the second problem encountered during co-training, an effective training framework is proposed. k pairs of subsets \((\textbf{L}_i', \textbf{C}_i'), i \in \{1,\dots ,k\}\) are randomly and repeatedly selected from \(\textbf{L}'\). \(\textbf{L}_i'\) is used to initialize regressor \(h_i\), and \(\textbf{C}_i'\) is used as part of the confidence calculation. Specifically, \(\textbf{C}_i'\) is not included in the training data.
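A minimal sketch of this subset construction is given below, assuming that each pair is obtained by shuffling the indices of \(\textbf{L}'\) and holding out a fixed fraction for \(\textbf{C}_i'\). The 20% hold-out fraction, the seed and the function name `make_subsets` are illustrative assumptions.

```python
import random

def make_subsets(n_labeled, k=3, val_rate=0.2, seed=0):
    """Return k pairs (train_indices, confidence_indices) over L'.

    Each pair comes from an independent random shuffle, so the k
    training subsets differ from one another while every pair keeps
    its own training and confidence parts disjoint.
    """
    rng = random.Random(seed)
    n_val = round(n_labeled * val_rate)
    subsets = []
    for _ in range(k):
        idx = list(range(n_labeled))
        rng.shuffle(idx)
        # First n_val shuffled indices form C_i', the rest form L_i'.
        subsets.append((sorted(idx[n_val:]), sorted(idx[:n_val])))
    return subsets

subsets = make_subsets(50)
```

With 50 labeled instances this yields three pairs of 40 training indices and 10 confidence indices each.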

In each iteration, other regressors assign pseudo-labels to all instance data in the unlabeled data pool. Since the pseudo-labels assigned to the unlabeled data may be inaccurate, adding them to the training set of the current regressor will degrade the performance of this regressor. Utilizing these erroneous pseudo-labels is equivalent to introducing noise to the training set. To address this issue, we employ a smearing technique to assign stable pseudo-labels to the unlabeled data. The purpose of label smearing is to select stable instances for improving the robustness of the regressors by smoothing out the differences between different collaborative regressors. At its core, label smearing calculates the variance of the predicted values of the unlabeled data, treats the unlabeled data with smaller variances as stable instances, and smooths out the differences between the predictions of the different regressors by averaging them.

For regressor \(h_i\), the other regressors \(h_{j,j \ne i}\) compute the predicted values for all instances in the unlabeled data pool. Then, the stable instances are filtered from these instances through a stability assessment. Stability is expressed as the standard deviation of the values predicted by these regressors for the instances, as follows:

$$\begin{aligned} STD_i(x_u) = \sqrt{\frac{1}{k-1}\sum _{j=1,j \ne i}^{k}(h_j(x_u)-\frac{1}{k-1}\sum _{j=1,j \ne i}^{k}h_j(x_u))^2}, \end{aligned}$$
(4)

where \(x_u\) is an instance in the unlabeled data pool. In the context of the S2RMS algorithm, the smearing strategy involves calculating the stability of the unlabeled instances based on the standard deviation of the predictions of the regressors. The instances with smaller standard deviations are then selected as stable instances, and their predicted values are replaced with the mean prediction value, which serves as their corresponding pseudo-labels. Once stable instances are obtained and pseudo-labeling is completed, the confidence calculation can be performed with \(h_i\) as follows:

$$\begin{aligned} \Delta _{x_u} = \frac{\delta -\delta '}{\delta +\delta '}, \end{aligned}$$
(5)

where \(\delta \) and \(\delta '\) represent the validation errors of \(h_i\) on \(\textbf{C}_i'\) before and after the pseudo-labeled instance \(\left( x_u, \frac{1}{k-1}\sum _{j=1,j \ne i}^{k}h_j(x_u)\right) \) is added to its training set, respectively. In particular, the validation error in the algorithm is defined as the root mean square error (RMSE). Instances with high confidence expand the training set of \(h_i\), and the corresponding regressor is retrained.
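The stability value in (4) and the confidence in (5) can be sketched for a single pooled instance as follows. The function names and the example numbers are hypothetical; the predictions passed to `smear` are those of the \(k-1\) regressors other than \(h_i\).

```python
import math

def smear(predictions, mu):
    """Apply Eq. (4) to the predictions of the k-1 other regressors.

    Returns (std, pseudo_label, is_stable): the standard deviation of
    the predictions, their mean (used as the pseudo-label), and whether
    the instance passes the stability threshold mu.
    """
    mean = sum(predictions) / len(predictions)
    std = math.sqrt(sum((p - mean) ** 2 for p in predictions) / len(predictions))
    return std, mean, std < mu

def confidence(delta_before, delta_after):
    """Eq. (5): relative change of the validation error on C_i'."""
    return (delta_before - delta_after) / (delta_before + delta_after)

# Two other regressors predict 1.00 and 1.02 for the same instance.
std, pseudo_label, is_stable = smear([1.00, 1.02], mu=0.02)

# Validation RMSE of h_i drops from 0.5 to 0.3 after adding the instance.
conf = confidence(0.5, 0.3)
```

A positive confidence means that adding the pseudo-labeled instance reduced the validation error of \(h_i\), and a negative value means that it increased the error.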

3.4 Algorithm description

Algorithm 1

S2RMS algorithm.

Algorithm 1 shows the complete training process of S2RMS. The inputs of S2RMS include the labeled dataset \(\textbf{L} = \{(x_i, y_i)\}_{i=1}^{n_l}\), the unlabeled dataset \(\textbf{U} = \{x_j\}_{j=n_l+1}^{n_l+n_u}\), the initial value of the similarity threshold \(\tau \), the maximum number of iterations T, the stability threshold \(\mu \), the number of regressors k, and the number of confident instances R.

A Triplet neural network \(f^d\) is trained on the pairwise data generated from \(\textbf{L}\) using the loss in (2). In line 3, the labeled dataset \(\textbf{L}\) and unlabeled dataset \(\textbf{U}\) are mapped by \(f^d\) to \(\textbf{L}'\) and \(\textbf{U}'\), respectively. The transformed data \(\textbf{L}'\) are then randomly split into two subsets \(\textbf{L}_i'\) and \(\textbf{C}_i'\), as shown in line 5. k different subsets \(\textbf{L}_i',i \in \{1,\dots ,k\}\) are obtained after repeating the random selection procedure k times. These subsets are used to initialize the k regressors \(\{h_j\}_{j=1}^k\).

The next step is to train these regressors. At the beginning of each iteration, in lines 7 to 10, an unlabeled data pool \(\textbf{P}\) is created. This pool is generated by selecting unlabeled data using the metric method and the value of the threshold \(\tau \) with the help of \(f^d\). The threshold \(\tau \) is then decreased according to (3). For each regressor \(h_i\), the other regressors \(h_{j,j \ne i}\) select the instances with the highest confidence value. First, in line 14, the validation error \(\delta _i\) of \(h_i\) on \(\textbf{C}_i'\) is calculated. Next, for each \(x_u\) in \(\textbf{P}\), its stability value is calculated according to (4). In lines 17 to 20, the instances \(x_u\) that satisfy the stability threshold \(\mu \) receive pseudo-labels and are temporarily merged with the original training set \(\textbf{L}_i'\) to train the regressor \(h_i'\). Then, in line 21, the validation error \(\delta _i'\) of \(h_i'\) on \(\textbf{C}_i'\) is calculated. In line 22, the confidence \(\Delta _{x_u}\) of \(x_u\) is calculated according to (5). As shown in line 24, the instances \(x_u\) with the highest confidence values are selected and added to \(\Omega _i\). Eventually, at most R instances are contained in \(\Omega _i\). In line 29, the instances of \(\Omega _i\) are integrated into the training set of \(h_i\). Regressor \(h_i\) is subsequently retrained on the updated training set.

A set of regression models \(\textbf{H} = \{h_j\}_{j=1}^k\) is obtained for data prediction purposes at the end of the S2RMS training process. The predicted label of \(x_u\) is then given by

$$\begin{aligned} \hat{y}_{x_u} = \frac{1}{k}\sum _{j=1}^{k} h_j(x_u). \end{aligned}$$
(6)

S2RMS controls the differences among the predictions of learners by using label smearing during the training process. This ensures that the prediction results of the learners are stable without the need for weighted averaging.
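The unweighted average in (6) amounts to a few lines of Python. In this sketch the regressors are plain callables standing in for the trained random forests; these stand-ins and the name `ensemble_predict` are hypothetical.

```python
def ensemble_predict(regressors, x):
    """Eq. (6): unweighted mean of the k regressor outputs for x."""
    return sum(h(x) for h in regressors) / len(regressors)

# Three toy regressors whose predictions differ by a small offset.
regressors = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 0.3,
    lambda x: 2.0 * x - 0.3,
]
y_hat = ensemble_predict(regressors, 1.5)
```

Here the individual predictions 3.0, 3.3 and 2.7 average to 3.0.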

3.5 Complexity analysis

The co-training process is where the complexity of the entire algorithm lies. This stage involves processing each \(x_u\) contained in the unlabeled data pool individually. In the S2RMS algorithm, we use random forests as the underlying co-regressors. These random forest regressors are executed in parallel during the iterative process. Therefore, we multiply the dominant term of the corresponding time complexity by the number of iterations to form the final complexity representation.

The random forest model typically employs a classification and regression tree (CART) as its base model. The time complexity of constructing a single CART is denoted by \(O(I_\textrm{ins} \times A_\textrm{attr} \times \log I_\textrm{ins})\), where \(I_\textrm{ins}\) denotes the number of instances and \(A_\textrm{attr}\) denotes the number of attributes. Therefore, the time complexity of constructing a random forest model is \(O(T_\textrm{trees} \times I_\textrm{ins} \times A_\textrm{attr} \times \log I_\textrm{ins})\), where \(T_\textrm{trees}\) represents the number of trees. As the computation process involves random feature selection, additional time is required to handle this process. Consequently, the average time complexity of constructing a random forest model is \(O(T_\textrm{trees} \times I_\textrm{ins} \times A_\textrm{attr} \times \log I_\textrm{ins})\).

Co-training involves calculating confidence values and selecting instances based on these confidence levels. During the confidence calculation, additional random forest regressors are constructed for each stable instance \(x_u\) in the unlabeled data pool \(\textbf{P}\). When selecting confident instances, the time cost is only O(R), where R is the number of confident instances. This cost can be ignored compared to the cost of the confidence calculation. Therefore, the total time complexity is

$$O(k\times N\times S\times T_\textrm{trees} \times I_\textrm{ins} \times A_\textrm{attr} \times \log I_\textrm{ins}),$$

where N is the maximum number of iterations and S is the number of stable instances. In the worst case, all unlabeled instances in the pool are stable instances; hence, the total time complexity is

$$O(k \times N \times |\textbf{P}|\times T_\textrm{trees} \times I_\textrm{ins} \times A_\textrm{attr} \times \log I_\textrm{ins}),$$

where \(|\textbf{P}|\) is the cardinality of \(\textbf{P}\) and \(|\textbf{P}| \ge S\).

4 Experiments

To validate the effectiveness of our approach, we apply S2RMS to a diverse range of regression tasks involving inductive semi-supervised learning. First, we test it on a total of fourteen datasets obtained from machine learning repositories (e.g., UCI, Delve, and Statlib). We calculate the RMSE values of S2RMS and the comparison methods. The effectiveness of the similarity-based instance selection strategy and the pseudo-label smearing technique is evaluated through ablation experiments. All the experiments are implemented in PyTorch [50].

4.1 Experimental settings

The detailed configuration of the experimental setup is as follows.

Table 2 Datasets information

Datasets

Table 2 lists the fourteen real-world datasets used in our experiments. For each dataset, we randomly select 2000 instances as the training set, while the remaining instances are used for testing. The training set is divided into labeled and unlabeled portions based on a specific ratio. In our experiments, \(0.5\%\), \(2.5\%\) and \(5\%\) of the training set are used as labeled data in the separate scenarios. In most cases (except Section 4.3.4), \(20\%\) of the labeled data is used for validation. As an illustration, let us consider the Abalone dataset. We randomly select 2000 instances as the training set, while the remaining 2177 instances are used for testing. With a label rate of \(2.5\%\), we include 50 instances in the labeled dataset, whereas the remaining 1950 instances are included in the unlabeled dataset. Of the 50 labeled instances, 10 serve as the validation set.
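The Abalone figures can be reproduced with a short calculation. The sketch assumes that the validation instances are drawn from the labeled portion, which is what the 10-out-of-50 example implies; the function name `split_sizes` is hypothetical.

```python
def split_sizes(n_total, n_train=2000, label_rate=0.025, val_rate=0.2):
    """Sizes of the test, labeled, unlabeled and validation sets.

    Assumption: the validation set is a fraction val_rate of the
    labeled instances, as in the Abalone example.
    """
    n_labeled = round(n_train * label_rate)
    return {
        "test": n_total - n_train,
        "labeled": n_labeled,
        "unlabeled": n_train - n_labeled,
        "validation": round(n_labeled * val_rate),
    }

# Abalone contains 2000 + 2177 = 4177 instances in total.
abalone = split_sizes(4177)
```

At the lowest label rate of \(0.5\%\), the same calculation leaves only 10 labeled instances, 2 of which would be held out for validation.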

Parameters

In this study, we choose a neural network with three hidden layers as the Triplet metric model. In addition, we select specific regressors and parameters to suit our purposes. The three networks in the Triplet neural network share the same architecture, and each hidden layer uses the LeakyReLU [51] function to perform nonlinear transformations. The numbers of neurons in the three hidden layers are 16, 32 and 64. In particular, the dimensionality of the input layer is the number of features contained in the input data, while the output of the second hidden layer is taken as the embedding space. For the parameter settings of the Ranked List Loss function, \(\alpha \), \(\varepsilon \) and M are set to 40, \(-0.5\) and 35, respectively.

To increase the diversity among the learners, three random forest regressors are initialized with different parameters. Specifically, the three regressors are based on 100 estimators, 120 estimators and 150 estimators. The minimum instance splits are set to 1, 2 and 3. In the experiments, the initial similarity threshold \(\tau \) and the number of confident instances R are set to 0.017 and 2, respectively. The attenuation coefficient \(\lambda \) is set to 0.98. The stability threshold \(\mu \) is set to 0.02, so instances whose stability value (4) is less than 0.02 are selected as stable instances. The number of total experiments is set to 30.

Metrics

The RMSE and \(R^2\) metrics are chosen to evaluate the effectiveness of the tested algorithms. The RMSE is the root mean square error, i.e., the square root of the average squared difference between the predicted and ground-truth values over the \(n_{\textrm{test}}\) test observations. Because the differences are squared, this metric is particularly sensitive to extreme values in the utilized dataset. The RMSE is calculated as follows:

$$\begin{aligned} \textrm{RMSE} = \sqrt{\frac{1}{n_{\textrm{test}}}\sum _{i=1}^{n_{\textrm{test}}}(y_i - \hat{y_i})^2}, \end{aligned}$$
(7)

where \(y_i\) represents the ground-truth value, \(\hat{y_i}\) represents the predicted value and \(n_{\textrm{test}}\) represents the number of test instances.

The coefficient of determination \(R^2\) is an indicator used to measure the degree of model fitness. It indicates the proportion of dependent variable variance that the model can explain. \(R^2\) is calculated as follows:

$$\begin{aligned} R^2=1-\frac{\sum _{i=1}^{n_{\textrm{test}}}\left( y_{i}-\hat{y}_{i}\right) ^{2} / n_{\textrm{test}}}{\sum _{i=1}^{n_{\textrm{test}}}\left( y_{i}-\bar{y}_{i}\right) ^{2} / n_{\textrm{test}}}, \end{aligned}$$
(8)

where \(\bar{y}_i\) represents the mean of the ground-truth values. \(R^2\) measures the extent to which the model explains the total variance and typically lies between 0 and 1. A value closer to 1 indicates a better fit, while a value closer to 0 indicates a worse fit. A value less than 0 indicates that the model fits worse than a constant predictor that always outputs the mean of the ground-truth values.
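Both metrics in (7) and (8) can be computed directly from their definitions. The following is a minimal sketch in plain Python with toy values; the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Eq. (7): root mean square error over the test instances."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Eq. (8): one minus the ratio of residual to total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]  # a single large error on the last instance

error = rmse(y_true, y_pred)
fit = r2(y_true, y_pred)
```

The single error of 2 on the last instance gives an RMSE of 1.0 and an \(R^2\) of 0.2, which shows how strongly one extreme deviation affects both metrics. Predicting the mean 2.5 for every instance would give an \(R^2\) of exactly 0.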

4.2 Comparison with semi-supervised methods

S2RMS is compared with the following methods to further evaluate its performance.

  • COREG: COREG uses a fixed K value and different distance metric algorithms to obtain the best experimental results. Two 3-NN regressors based on the Euclidean and Mahalanobis distance metrics are utilized. A new method using the mean squared error (MSE) is employed to calculate the confidence values.

  • SAFER [17]: The SAFER semi-supervised regression method transforms the task of interest into a geometric projection problem. This approach utilizes multiple regressors to learn safe predictions, which effectively circumvents the problem that unlabeled data may degrade the performance of the utilized algorithm.

  • DML-S2R [52]: DML-S2R aims to learn the metric space of similar instances by efficiently using both unlabeled and labeled data. This method consists of two main steps: 1) pairwise similarity modeling using labeled data and 2) metric learning based on a Triplet network with a large amount of unlabeled data.

Table 3 presents the comparison results produced by S2RMS and other algorithms under different label rates.

Table 3 The values of RMSE produced by the comparison algorithms under different label rates

The best performance entries in Table 3 are identified with bullets. First, S2RMS is compared with COREG and SAFER, which are based on divergence. At all label rates, S2RMS achieves the lowest average RMSE, and the average RMSEs of COREG and SAFER are significantly greater. S2RMS outperforms both COREG and SAFER on more than 10 datasets under different label rates. With a 2.5% label rate, S2RMS outperforms COREG on 12 datasets and SAFER on 11 datasets.

Second, S2RMS is compared with DML-S2R, which is based on metric learning. Table 3 shows that S2RMS outperforms DML-S2R on all datasets under all label rates, with an average performance improvement greater than 13%. S2RMS is more effective at lower label rates, including cases with extremely small label rates.

Regardless of whether the label rate is 2.5% or 5%, S2RMS performs poorly on the Folds5x2_pp and Bank32NH datasets. There are two reasons for this result. The first reason is the inaccurate embedding space mapping process. Due to the number of features and the structural characteristics of the datasets, a fixed embedding space dimensionality may lead to data misalignment. This issue affects the training of the learner. The second reason is the special distribution of the resulting embedding space: algorithms such as K-NN perform well in this space, but the performance of random forests decreases significantly. In contrast to COREG and SAFER, S2RMS employs random forests, which are built from decision trees, for regression purposes. As a result, its performance on these datasets is lower than that of the other two methods using K-NN.

As S2RMS and COREG share co-training similarities, we adapt the co-learner of S2RMS to adopt the same K-NN regressor as that of COREG. This enables a clearer and more precise comparison between the performance results of the algorithms, as they all use the same regression model. Figure 3 shows the results of the experimental comparisons.

Fig. 3 RMSE comparisons between S2RMS-KNN and COREG conducted on fourteen datasets under different label rates. S2RMS-KNN denotes the S2RMS algorithm constructed after applying K-NN as the base regressor

The experimental results indicate that when the base regressor is replaced with the K-NN algorithm, the prediction performance of S2RMS tends to decrease on some datasets relative to that attained with random forests. Nevertheless, S2RMS-KNN outperforms COREG on most datasets and has an overall advantage over it. The exceptions are the Folds5x2_pp and Puma8NH datasets, on which S2RMS-KNN falls behind COREG in all aspects. The primary reason for this outcome is that the Euclidean distance measure, which is used by K-NN, is not suitable for the metric embedding spaces of these two datasets. Additionally, higher RMSEs are observed for the Cal_housing, Space_ga, and Wine_quality datasets when the label rate is set to 5%. The experimental results indicate that the performance of the proposed algorithm is affected by the differences among the base regressors. The random forest-based co-learner outperforms K-NN under the influence of the metric mapping strategy proposed in this paper.

4.3 Ablation study

4.3.1 Similarity-based instance selection strategy

First, the regression model is initialized by labeled data. According to the smoothness assumption in semi-supervised learning, two instances that are close in terms of distance in high-density regions should have similar outputs. Therefore, for an unlabeled instance that is close to a labeled instance, the regression model can predict its target value more accurately. Thus, we construct an unlabeled data pool by selecting instances that are similar or close to the labeled data from the unlabeled data. This unlabeled data pool not only accelerates the iteration speed of the algorithm but also provides available unlabeled data for the subsequent co-training process. Accordingly, we choose a deep metric network to simultaneously complete metric space mapping and similarity-based instance selection. In the metric space, similar instances are close to each other, while instances with large differences are far from each other. Based on this characteristic of the metric space, the selection of similar instances can be better realized, and the effectiveness of the unlabeled data pool is also guaranteed.
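The idea of selecting unlabeled instances that lie close to labeled data in the embedding space can be sketched as follows. This is only an illustration under stated assumptions: the Euclidean distance, the nearest-labeled-neighbor rule, and the names `select_similar`, `labeled_emb` and `unlabeled_emb` are hypothetical choices and may differ from the similar instances selector actually used in S2RMS.

```python
import math


def select_similar(unlabeled_emb, labeled_emb, tau):
    """Return indices of unlabeled embeddings whose nearest labeled
    embedding lies within distance tau (assumed selection rule)."""
    selected = []
    for idx, u in enumerate(unlabeled_emb):
        # Distance from this unlabeled point to its closest labeled point.
        nearest = min(math.dist(u, l) for l in labeled_emb)
        if nearest <= tau:
            selected.append(idx)
    return selected


# Toy 2-dimensional embeddings for illustration only.
labeled_emb = [(0.0, 0.0), (1.0, 1.0)]
unlabeled_emb = [(0.01, 0.0), (0.5, 0.5), (1.0, 1.01)]
pool = select_similar(unlabeled_emb, labeled_emb, tau=0.017)
print(pool)
```

In this toy setting only the first and third unlabeled points fall within the threshold of a labeled point, so the point midway between the two labeled instances is left out of the pool.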

We compile a list of the performance improvements (in percentages) resulting from the similarity-based instance selection strategy. Each improvement is calculated by subtracting the RMSE obtained with the similarity-based instance selection strategy from the RMSE obtained without it and then dividing the result by the RMSE obtained without similarity-based instance selection. Table 4 shows that the performance of the model improves on most datasets.
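The improvement percentage described above amounts to a relative RMSE reduction. A minimal Python sketch follows; the function name and the RMSE values are hypothetical and are not entries of Table 4.

```python
def relative_improvement(rmse_without, rmse_with):
    """Percentage improvement of RMSE when similarity-based selection is used.

    Positive values mean the strategy lowers the RMSE; negative values mean
    it degrades performance.
    """
    if rmse_without <= 0:
        raise ValueError("rmse_without must be positive")
    return 100.0 * (rmse_without - rmse_with) / rmse_without


# Hypothetical RMSE values for illustration only.
print(relative_improvement(rmse_without=0.20, rmse_with=0.15))  # improvement
print(relative_improvement(rmse_without=0.10, rmse_with=0.12))  # degradation
```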

Table 4 The RMSE values obtained with similarity-based instance selection and with random selection

In the absence of a similarity-based instance selection strategy, an unlabeled data pool is constructed in accordance with the pool construction strategy outlined in COREG. In particular, the size of the pool is fixed and instances are randomly selected from the unlabeled dataset to be added to the pool until it is filled. After one round of iteration, the unlabeled data pool is replenished by randomly selecting again. In this way, the construction process of the unlabeled data pool, as outlined in lines 7 to 11 of Algorithm 1, is replaced by the aforementioned method.
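The random baseline described above can be sketched in a few lines of Python. The helper names, the use of integer instance identifiers, and the assumption that sampling is done without replacement are illustrative choices, not details confirmed by COREG or by Algorithm 1.

```python
import random


def build_random_pool(unlabeled, pool_size, rng):
    """Fill a fixed-size pool by sampling unlabeled instances at random."""
    k = min(pool_size, len(unlabeled))
    return rng.sample(unlabeled, k)


def replenish_pool(pool, unlabeled, pool_size, rng):
    """Top the pool back up to pool_size with instances not already in it."""
    in_pool = set(pool)
    candidates = [u for u in unlabeled if u not in in_pool]
    n_missing = max(0, min(pool_size - len(pool), len(candidates)))
    return pool + rng.sample(candidates, n_missing)


rng = random.Random(0)
unlabeled_ids = list(range(200))          # hypothetical instance identifiers
pool = build_random_pool(unlabeled_ids, 50, rng)

# Suppose two pool instances receive pseudo-labels in this iteration.
pseudo_labeled = pool[:2]
unlabeled_ids = [u for u in unlabeled_ids if u not in pseudo_labeled]
pool = replenish_pool(pool[2:], unlabeled_ids, 50, rng)
print(len(pool))
```

Unlike the similarity-based selector, this baseline ignores the position of an instance in the embedding space, so the pool may contain instances far from any labeled data.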

Experiments are conducted under a \(2.5\%\) label rate, and each dataset is used for 30 tests. All the data from the training process are mapped to the metric space. Table 4 presents the experimental RMSE results obtained with and without similarity-based selection. The pool size of the random selection process is set to 50. The results show that the similarity-based instance selection strategy performs better on most datasets.

The data pool of S2RMS is constructed by selecting unlabeled data that are similar to the labeled data. In the embedding space, similar instances are close to one another. More appropriate unlabeled data can be selected based on the distribution of the labeled data. According to the smoothness and manifold assumptions, these unlabeled data can be effectively used to expand the training set. Thus, the proposed method significantly improves upon the performance of random selection.

However, for the Folds5x2_pp dataset, the performance degrades. One possible reason for this finding is that this dataset has fewer features. After mapping the data to a higher-dimensional embedding space, some ambiguous features cannot correctly represent the data. Therefore, it is not possible to effectively select instances. This, in turn, affects the training process of the regressor.

4.3.2 Embedding space mapping

The COREG and SAFER comparison algorithms use regression and pseudo-labeling without embedding learning. In this case, we compare the S2RMS algorithm with the COREG and SAFER algorithms after applying the embedding space mapping strategy. Specifically, we additionally apply steps 1 to 3 of Algorithm 1 to the compared algorithms.

Table 5 The values of \(R^2\) produced by the comparison algorithms at a label rate of 2.5%

We then compare the performances achieved by the COREG and SAFER algorithms before and after adopting the mapping strategy. This approach aims to accurately evaluate the impact of embedding space mapping on the resulting algorithmic performance.

Taking a 2.5% label rate as an example, Table 5 shows the results obtained in the comparison experiments by the tested methods when using the mapping strategy. The experimental results indicate that S2RMS outperforms the other methods on 11 datasets, even though COREG and SAFER also implement the mapping strategy.

Table 6 displays the results of comparison between the \(R^2\) values produced before and after applying embedding mapping in the COREG and SAFER algorithms.

Table 6 The values of \(R^2\) produced by COREG and SAFER at a label rate of 2.5%

Table 6 shows that the two algorithms produce divergent results after applying the mapping strategy. Compared with that of COREG, the performance of COREG-ESP is significantly improved on only 6 datasets and is basically equal on 3 datasets. However, performance degradations are observed on 5 datasets: Bank32nh, Folds5x2_pp, Pollen, Space_ga, and Wine_quality. As discussed in the previous subsections, the metrics used in COREG may not be suitable for metric embedding mappings, resulting in negative shifts on some datasets. In contrast, the embedding mapping strategy works relatively well in the SAFER algorithm. The performance of SAFER-ESP is effectively improved on 12 datasets, with only slight decreases on the Bank32nh and Pollen datasets. Overall, the metric mapping strategy can improve the performance of these algorithms on most datasets.

4.3.3 Pseudo-label smearing

In the pseudo-label calculation process, S2RMS uses the smearing technique to smooth out the predictions of the regressors. To evaluate the effectiveness of the proposed method, we compare the RMSE values obtained with and without pseudo-label smearing. Lines 16 to 22 of Algorithm 1 represent the process of label smearing. The core of this process is the calculation of stability, which is performed using (4) to filter the stable instances. Once the stable instances have been identified, the subsequent confidence assessment operation is performed. In the baseline experiment where label smearing is not employed, the step in line 17 of Algorithm 1 is ignored. Line 19 and subsequent operations are conducted directly without any form of judgement. The objective of this experiment is to assess the impact of stable instance filtering on the performance of the algorithm in the context of label smearing.

Figure 4 shows the results of the experimental comparisons.

Fig. 4 Average RMSE values produced over thirty tests on fourteen datasets. S2RMS-Without-PLS denotes the S2RMS algorithm without the pseudo-label smearing (PLS) technique

Table 7 The values of \(R^2\) obtained on the validation set \(\textbf{C}_i'\) under sizes of 0, 30 and 50 at a label rate of 2.5%

Under the same configuration, the pseudo-label smearing technique can effectively improve the performance of the algorithm. At a label rate of 0.5%, the average RMSE is notably lower when employing the smearing technique than when not using it. With the help of smearing, the RMSE decreases smoothly on most of the datasets as the label rate increases. Without the smearing technique, the RMSE is more volatile, especially on the Space_ga dataset. This indicates that the smearing technique can effectively leverage stable instances to optimize the regression performance of the algorithm.

4.3.4 The influence of validation set

Considering the relatively small size of the labeled dataset, the validation set \(\textbf{C}_i'\) is also expected to be small. To investigate potential biases in the confidence calculation executed during the pseudo-label smearing process caused by the size of the validation set, we conduct comparative experiments with different \(\textbf{C}_i'\) sizes.

The experiment of this subsection aims to study the impact of the validation set size. Therefore, we add some additional labeled data to increase the validation set. In this way, the size can be 30 or 50. We also consider the extreme case where the size of the validation set is 0. In this case, it is impossible to perform a confidence calculation, and the algorithm degenerates into a self-training mode in its ordinary form. Consequently, lines 14, 21, 22, and 24 of Algorithm 1 are disabled, and the strategy that replaces them is to randomly select R instances from \(\Pi _i\).

Table 7 shows the experimental results obtained on the validation set \(\textbf{C}_i'\) under different size settings. The performance of the algorithm gradually improves as the size of the validation set increases. A larger validation set leads to more accurate calculation results for (5). Inevitably, the labeled dataset size is positively correlated with the size of the validation set. With a small labeled dataset, a small validation set can cause bias in the confidence computation. Nevertheless, the algorithm also performs better with a small validation set than it does without it.

4.3.5 Parameter sensitivity

The parameters \(\tau \) and \(\mu \) are key parameters that affect the performance of the S2RMS algorithm. The first parameter, \(\tau \), is the initial similarity threshold, which is applied to the process of creating a pool of unlabeled instances and decays according to the attenuation coefficient \(\lambda \). The value of \(\tau \) determines the size of the instance pool. The number of instances in the pool directly affects the performance of the co-regressors in subsequent iterations. The use of too few unlabeled instances is meaningless for the learner, whereas an instance pool that is excessively large makes the learning process less efficient. To obtain a suitable value for \(\tau \), we observe the performance achieved by the algorithm with different parameter settings. When testing parameter \(\tau \), we focus only on the process related to \(\tau \) so that the experimental results can be clearly observed. The label smearing strategy is therefore removed from the co-training process, which eliminates the experimental bias caused by the stable instances controlled by parameter \(\mu \).

Table 8 The values of \(R^2\) produced under various settings of parameter \(\tau \) at a label rate of 2.5%
Table 9 The values of \(R^2\) produced with various settings of \(\mu \) at a label rate of 2.5%

Table 8 shows the experimental results produced by S2RMS with different \(\tau \) values. A higher \(\tau \) value implies that the conditions for the similarity threshold are relaxed to a greater extent. When \(\tau \) is set to 0.008, this low value excludes many unlabeled instances because they do not satisfy the imposed similarity requirement. This results in a final unlabeled data pool that is too small or even empty after just a few iterations due to the decaying value of \(\tau \). The performance of the subsequent regressors remains relatively low due to the insufficient availability of unlabeled data.

As the value of \(\tau \) increases to 0.017, more unlabeled instances are able to satisfy the larger threshold. Additionally, sufficient training rounds and data are available for improving the performance of the regressor until the threshold \(\tau \) decays to the point where the instance pool can no longer be constructed. As the value of \(\tau \) increases again, some unlabeled data that do not satisfy the imposed conditions are selected, affecting the performance of the algorithm. The experimental results indicate that a parameter \(\tau \) setting of approximately 0.017 is appropriate.
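The interaction between the initial threshold and its decay can be illustrated with a small Python sketch. It assumes, purely for illustration, that the threshold is multiplied by the attenuation coefficient \(\lambda \) once per iteration and that pool construction stops once the threshold falls below some floor; both the multiplicative rule and the `floor` value are hypothetical and are not taken from Algorithm 1.

```python
def threshold_schedule(tau0, decay, n_iters):
    """Thresholds tau0 * decay**t for t = 0 .. n_iters - 1 (assumed decay rule)."""
    return [tau0 * decay ** t for t in range(n_iters)]


def iterations_above(tau0, decay, floor):
    """Count iterations during which the decayed threshold stays at or above floor."""
    if not 0 < decay < 1 or floor <= 0:
        raise ValueError("decay must lie in (0, 1) and floor must be positive")
    t, tau = 0, tau0
    while tau >= floor:
        tau *= decay
        t += 1
    return t


# lambda = 0.98 as in the experimental settings; floor = 0.008 is hypothetical.
print(threshold_schedule(0.017, 0.98, 5))
print(iterations_above(0.017, 0.98, floor=0.008))
print(iterations_above(0.008, 0.98, floor=0.008))
```

Under these assumptions a larger initial threshold leaves more usable co-training rounds before the pool can no longer be constructed, which is consistent with the trend discussed above.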

The parameter \(\mu \) denotes the stability coefficient, which is used to screen for stable instances that satisfy the preset conditions. The stability of an instance is indicated by the variance of the values predicted by multiple regressors. For each instance in the pool of unlabeled instances, when its stability value is lower than the parameter \(\mu \), it is regarded as a stable instance and marked with a pseudo-label. In summary, parameter \(\mu \) controls the prediction variance among the co-learners by setting a stability threshold to ensure the robustness of the algorithm. Similar to when testing parameter \(\tau \), we focus solely on the process related to parameter \(\mu \) and exclude the similarity-based instance selection process that is controlled by parameter \(\tau \). The pool of unlabeled instances is then constructed by randomly selecting instances with a fixed size.
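The stability screening described above can be sketched in Python as a variance filter over the predictions of the co-regressors. The use of the population variance, the function name `stable_instances`, and the toy predictions are assumptions made for illustration; the exact stability measure is the one given by (4).

```python
from statistics import pvariance


def stable_instances(predictions, mu):
    """Return the ids of instances whose prediction variance is below mu.

    predictions maps an instance id to the list of values predicted for it
    by the individual regressors.
    """
    return [i for i, preds in predictions.items() if pvariance(preds) < mu]


# Hypothetical predictions from three regressors for three pool instances.
predictions = {
    "a": [1.00, 1.05, 0.95],   # regressors agree closely
    "b": [1.00, 1.60, 0.40],   # regressors disagree strongly
    "c": [2.00, 2.10, 2.20],   # moderate agreement
}
print(stable_instances(predictions, mu=0.02))
print(stable_instances(predictions, mu=0.005))
```

Lowering the threshold from 0.02 to 0.005 removes instance "c" from the stable set in this toy example, which mirrors the trade-off between the amount and the reliability of pseudo-labeled data.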

Table 9 shows the experimental results yielded by S2RMS under different \(\mu \) values. The parameter \(\mu \) indicates the degree of the predictive differences among multiple regressors. When \(\mu \) is set to 0.01, only unlabeled instances with prediction variances below 0.01 are considered stable instances. Due to variations in the regression model settings, producing identical outputs for unlabeled data is essentially impossible. Lower parameter settings may therefore not provide enough unlabeled data to augment the regressors. Conversely, higher settings, such as 0.04, may result in the selection of the vast majority of the instances in the instance pool, which would interfere with the performance of the regressor due to the presence of widely varying labels. The experiments show that setting \(\mu \) to 0.02 allows enough, but not all, instances to satisfy the stability constraint. Once assigned pseudo-labels, these unlabeled instances can effectively enhance the training set of the corresponding regressor.

5 Conclusions and further works

This paper proposes a semi-supervised regression approach via embedding space mapping and pseudo-label smearing. The method involves constructing a deep metric network based on pairwise data, mapping the raw training data to the metric embedding space through this network, and then training co-regressors on the transformed data. The parameters of the regression model are iteratively updated using the label smearing technique. Tests conducted on 14 benchmark datasets show that the proposed algorithm outperforms 3 other excellent semi-supervised regression methods. An ablation study indicates that both the metric approach and the label smearing technique are effective for improving the performance of the algorithm.

Our future research will focus on designing more suitable mapping strategies and developing new confidence calculation methods. It is important to consider both the original spatial distribution of the given training data and the feature representations during the metric mapping process. Additionally, we will investigate improved confidence evaluation methods. Inspired by the evaluation metrics used in supervised learning research, novel reliability estimation strategies will be applied to increase the accuracy of confidence calculation [53]. Since the current study focuses on single-target regression, we will explore the potential to refine the proposed methods for applications involving multi-target regression tasks in our future research.