Abstract
Label distribution learning deals with label ambiguity by describing the degree to which each label is relevant to a specific instance. As a novel machine learning paradigm, it suffers prominently from the curse of dimensionality. Feature selection is a vital preprocessing step for reducing the high dimensionality of data. However, most existing label distribution feature selection methods focus on selecting a feature subset that is relevant to all labels, ignoring the label-specific features with the maximum discriminatory power for each label. To tackle this issue, a label distribution feature selection algorithm based on label-specific features is proposed in this paper. Initially, we introduce a feature selection optimization model for label distribution data that simultaneously considers common and label-specific features, leveraging sparse learning to further investigate the intricate relationships between features and labels. Subsequently, the label correlation coefficient is employed to enhance the collaborative learning effect of labels. Finally, the relevance between features and labels is taken into account to guide the feature selection process, effectively eliminating redundant features. Comprehensive experiments demonstrate the advantage of our proposed method over other well-established feature selection algorithms in selecting label-specific features for label distribution data.
1 Introduction
In recent years, single-label learning and multi-label learning have been two representative paradigms widely used to address label ambiguity problems. Specifically, single-label learning assumes that each instance is associated with a single label and that labels are mutually exclusive. However, in real-world applications, an instance may simultaneously embody multiple semantic meanings [26, 49]. For instance, in facial emotional expressions, an expression may include a variety of complex emotions such as happiness, anger, and sadness. Similarly, in text annotation, a news article may be relevant to several topics such as politics, economics, and entertainment. In multi-label learning [2, 8], each instance is related to multiple labels, and each label is regarded as equally important. Both single-label learning and multi-label learning essentially treat the relationship between an instance and a label as binary, determining only whether the label is pertinent to the instance or not. However, some tasks consider not only whether instances are associated with labels, but also the different importance degree of each label [24]. Consequently, [6] proposed a more generalized learning paradigm, called label distribution learning, which addresses label ambiguity explicitly by depicting how relevant each label is to an instance.
Owing to the advantage of learning richer semantics from data compared to multi-label learning, label distribution learning has been widely used in various real-world applications, which can be mainly divided into the following aspects: head posture recognition, scene classification, emotion analysis, and indoor crowd counting. Taking scene classification as an example, a scene topic is generally composed of various basic words, such as 'road', 'water', and 'veg', each of which contributes in a unique way to the formation of the scene classification. However, as data from videos, images, and text continues to increase, the label distribution data in the above applications often grapples with the issue of high dimensionality. High-dimensional label distribution data not only increases the computational and storage cost, but also reduces the quality of learning models. Feature selection, an effective data preprocessing technique, reduces the data dimension and improves algorithm performance. Until now, a variety of feature selection approaches have been studied to select relevant features from label distribution data [21, 23, 28, 32]. Qian et al. [29] proposed a specialized feature selection algorithm that takes into account both feature similarity and label correlation, which can deal with high-dimensional label distribution data efficiently. [22] proposed a hierarchical feature selection method to deal with the semantic gap, which uses label enhancement to obtain a label distribution. [33] developed a multi-label feature selection algorithm for label distribution data, which utilizes label enhancement based on the deep forest to transform logical labels. These methods aim to learn a subset of features for discriminating all labels. However, this strategy may not be optimal, as each label may require its own specific features for decision-making. Therefore, it is essential to consider label-specific features explicitly, thereby enhancing the performance of label distribution learning in the feature selection process.
Recently, the strategy of label-specific features has been integrated with several tasks in label distribution learning, such as exploiting label correlation [34] and completing missing labels [20]. Ren et al. [34] developed a novel label distribution learning algorithm by leveraging label-specific features, which simultaneously selects label-specific features and common features. [20] proposed a label distribution learning approach for incomplete data by exploiting label-specific feature selection, which can fill in missing label information. In general, these methods select label-specific features for each label via a coefficient matrix W. For example, as shown in Fig. 1, let us consider 3 instances with a 6-dimensional feature space {\(f_{1},f_{2},\ldots ,f_{6}\)} and their corresponding labels \(y_{1}\) and \(y_{2}\). Upon training, we obtain a coefficient matrix W, which comprises two vectors {\(w_{1},w_{2}\)} corresponding to the two labels. The non-zero elements of {\(w_{1},w_{2}\)} allow us to identify the specific features associated with labels \(y_{1}\) and \(y_{2}\). It is worth noting that two correlated labels may have different label-specific features. Therefore, these methods employ label correlation to ensure that two strongly correlated labels yield similar outputs. Previous studies on label-specific features have considered the tasks of exploiting label correlation or completing missing labels, but they have rarely taken into account feature relevance along with the generation of label-specific features.
In light of these observations, the strategy of label-specific features has been widely accepted in data classification. However, most models deal with multi-label data, which discards some label information when applied to label distribution data [4, 5, 9, 11]. Moreover, many existing label-specific feature works ignore the relationship between features and labels [16, 18, 40]. To address this issue, a novel feature selection approach that exploits label-specific features for label distribution data is proposed in this paper. First, the optimization framework is formulated by sparse learning, which exploits specific features with \(l_{1}\)-norm regularization and common features with \(l_{2,1}\)-norm regularization. Second, the Pearson correlation coefficient is applied to measure the correlation information among labels, which assists the optimization process and enhances the generalization ability of the proposed framework. Furthermore, mutual information variation is incorporated into the proposed optimization function to examine the relevance between features and labels, which enhances the discriminatory ability of the feature selection model. Finally, a series of comprehensive experiments is conducted to verify the effectiveness of our proposed method.
The major contributions of this work are summarized in the following four points:
-
1.
A novel feature selection approach for label distribution data is proposed, which simultaneously considers common and label-specific features. Moreover, the proposed method further improves performance by exploiting label-specific features and sparse learning.
-
2.
Label correlation and feature relevance are simultaneously incorporated into feature selection, thereby effectively enhancing the generalization ability of the learning model and reducing the redundancy of features.
-
3.
The proposed method can directly deal with label distribution data, without discretizing the label information. Thus, the performance of label distribution learning can be significantly improved by selecting discriminative features.
-
4.
Extensive empirical comparisons of our method with other state-of-the-art feature selection algorithms are conducted on fifteen benchmark datasets using six evaluation metrics. The experimental results demonstrate the superiority of our method.
This paper is structured as follows. Section 2 introduces the work related to this paper. The proposed approach is detailed in Section 3. In Section 4, the experimental results obtained by the proposed approach are analyzed. In Section 5, the paper is concluded and future work is discussed.
2 Related work
2.1 Label distribution learning
Label Distribution Learning (LDL), which has demonstrated a superior ability to extract complex semantics from data compared to multi-label learning, has broad applications in areas such as gene selection and scene recognition [35, 41, 44]. Various label distribution learning approaches can be categorized into three types: problem transformation, algorithm adaptation, and specialized algorithms. Problem transformation is the simplest approach, where the LDL problem is reformulated into existing learning paradigms. [6] introduced two typical problem transformation algorithms, which treat each label as an independent binary classification problem. The algorithm adaptation method attempts to modify existing learning models to handle label distributions directly. For example, [38] adapted LogitBoost into an additive weighted function regression to fit the LDL paradigm, and [6] adopted the kNN algorithm to predict the label distribution.
Unlike the indirect strategies of problem transformation and algorithm adaptation, specialized algorithms are tailor-made for label distribution learning problems. Most specialized LDL algorithms adopt the maximum entropy model [3] as the output model, with examples including SA-IIS [6] and SA-BFGS [6]. Both approaches use the Kullback-Leibler (K-L) divergence as the objective function to quantify the loss between the predicted and actual label values. Meanwhile, some LDL algorithms use other similarity measures as the objective function. For example, [7] opted for the Jeffrey divergence as the optimization objective in the realm of head pose estimation. Akbari et al. [1] proposed the \(\gamma \)-Wasserstein loss that leverages the specific geometric structure of the age label space to enhance the impact of highly correlated ages. Additionally, there are also differences in the optimization methods of LDL algorithms. LDLSP [34] and LGGME [36] both utilize ADMM (Alternating Direction Method of Multipliers) as the optimization method, while LDL4C [37] employs the BFGS method for optimization.
Relevant studies and experiments have shown that specialized algorithms outperform those based on the problem transformation and algorithm adaptation strategies, and they have therefore received much attention from researchers. However, label distribution learning, like other conventional learning paradigms, is challenged by the curse of dimensionality. In this paper, we propose a specialized algorithm to reduce the dimensionality of label distribution datasets, which utilizes label-specific features to direct feature selection.
2.2 Label-specific features learning
Label-specific features is a processing strategy that assumes different features for each label rather than using the same original features for all labels. This strategy stems from multi-label learning and seeks to obtain features that are highly relevant to each label to improve model performance. Based on the learning process, label-specific feature learning can be classified into two types. The first generates label-specific features in a transformed feature space. For example, LIFT [47] is the pioneering multi-label learning method based on label-specific features, which uses cluster analysis on the positive and negative instances of each label to generate the corresponding label-specific features. However, LIFT is unable to distinguish the features that are strongly relevant to each label and does not take into consideration the correlation information among labels. Hang and Zhang [10] proposed a label-specific feature learning approach based on deep neural networks, which uses label semantics to select label-specific features. Zhang et al. [48] presented an effective BiLabel-specific feature learning approach, which generates label-specific features for label pairs through heuristic prototype selection and embedding. These methods based on the first strategy improve the performance of multi-label learning, but they are often unintuitive and lack interpretability.
The second generates label-specific features directly from the original feature space. Specifically, these approaches typically integrate the process of label-specific feature selection into the classification model, thereby facilitating a mutual enhancement between the two [17, 39, 45]. Among the variety of methods, most employ the \(l_{1}\)-norm to learn label-specific features. For example, [14] proposed a novel method for learning label-specific features, which incorporates the learned high-order label correlations with missing labels. Since learning label-specific features independently of the classification model may impact classification performance, [43] introduced a comprehensive learning strategy that simultaneously generates label-specific features and induces a classification model. However, these methods ignore some common features that are discriminative across all labels. Ma et al. [25] performed sample-specific and label-specific classifications, combining interlabel and interinstance correlations. [40] proposed a new two-stage partial multi-label feature selection method, which fuses label correlations into the feature selection process. Moreover, some approaches generate label-specific features by involving instance correlations. Li et al. [18] developed a novel method of learning common and label-specific features, which uses the correlation information from labels and instances in multi-label classification. Zhang et al. [46] proposed a novel label-specific feature selection framework, which simultaneously considers label-group and instance-group correlations. Peng et al. [27] proposed a joint label-common and label-specific feature selection model for semi-supervised cross-session electroencephalogram (EEG) emotion recognition.
The above research has shown that label-specific features is an efficient processing strategy for multi-label data. However, current label-specific feature methods are usually designed for multi-label learning and do not consider feature relevance. To deal with this issue, an approach based on label-specific features is proposed that involves label correlation and feature relevance, which can select distinctive features for label distribution data.
3 Proposed approach
This section proposes a novel feature selection approach based on label distribution, which simultaneously considers label-specific features for each label and common features for the label set. First, the approach of learning label-specific features is proposed. Second, the Pearson coefficient is utilized to investigate the correlation among labels, thereby improving the performance of the model. Finally, feature relevance is also considered through mutual information theory.
3.1 Problem formulation
Let \(X=[x_{1},x_{2},\ldots ,x_{i},\ldots ,x_{n}]\in R^{n\times m}\) denote the instance matrix of the training data in the m-dimensional feature space, where \(x_{i}\) is the i-th instance and n is the number of instances. Let \(Y=\{y_{1},y_{2},\ldots ,y_{l}\}\) denote the label set with l class labels. Given a training set \(S=\{(x_{1},D_{1})\), \((x_{2},D_{2}),\) \(\ldots ,(x_{n},D_{n})\}\), where \(D_{i}=\{d_{i1},d_{i2},\ldots ,d_{il}\}\) is the label distribution associated with \(x_{i}\), and \(d_{ij}\) represents the significance of label \(y_{j}\) to instance \(x_{i}\). In addition, the label distribution satisfies \(d_{ij}\in [0,1]\) and \(\sum _{j=1}^{l}d_{ij}=1\).
3.2 Feature selection model based on label-specific features
For label distribution feature selection, leveraging the common features of the label set to improve model performance is an effective and popular approach. In general, some common features exhibit comparable discriminatory power across all class labels, thereby enhancing the efficacy of selecting features from the original feature set. Nevertheless, each class label may have a distinct subset of label-specific features that are particularly relevant and discriminatory for it. Thus, the selected subset of features may be suboptimal when the label-specific features of each class label are ignored. In this paper, we adopt an optimization learning framework to learn accurate label-specific features and common features from label distribution data. The overall framework of this paper is shown in Fig. 2.
As depicted in Fig. 2, our proposed method, FSLSF, is composed of several components: data preparation, learning common and label-specific features, exploring label correlation and feature relevance, and feature selection. In the data preparation stage, we transform multi-label data into label distribution data through a label enhancement algorithm. In the second stage, we learn both common and label-specific features, assuming a linear relationship between the feature space and the label space. This relationship can be represented as \(X\varvec{W}=\hat{D}\). Here, the matrix \(\varvec{W}=[w_{1},w_{2},...,w_{l}]\in R^{m\times l}\) indicates the correlations among different labels and features, where \(w_{l}\) is a coefficient vector containing the discriminative power of all features for the l-th label. The optimization function is developed to minimize the empirical loss between the modeling outputs \(X\varvec{W}\) and the predicted label distribution \(\hat{D}\). Specifically, we use the sparsity of the \(l_{1}\) regularizer to obtain the label-specific features matrix \(\varvec{Q}\), and the \(l_{2,1}\) regularizer to retain common features. Furthermore, the correlation of labels in the label space is explored to constrain the label-specific features matrix \(\varvec{Q}\), and the relevance between features and labels is incorporated to eliminate redundant features. During the feature selection process, the final feature significance matrix \(\varvec{W}\) is obtained by fusing \(\varvec{V}\) and \(\varvec{Q}\). With this feature importance matrix, we can rank the features and select the top-ranked ones.
The corresponding optimization function for learning can be formulated as follows.
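One plausible instantiation, assuming a squared Frobenius loss against the label distribution matrix \(D\) and the decomposition \(\varvec{W}=\varvec{Q}+\varvec{V}\) described above, is

$$\min _{\varvec{Q},\varvec{V}}\ \frac{1}{2}||X(\varvec{Q}+\varvec{V})-D||_{F}^{2}+\alpha \varPhi (\varvec{Q})+\beta \varPsi (\varvec{V}),$$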
where the regularizers \(\varPhi (.)\) and \(\varPsi (.)\) are employed to promote stability in learning the label-specific features matrix \(\varvec{Q}\) and the common features matrix \(\varvec{V}\), respectively, and \(\varvec{W}\) consists of \(\varvec{Q}\) and \(\varvec{V}\). Besides, \(\alpha \) and \(\beta \) are regularization parameters that control the sparsity of the label-specific features matrix \(\varvec{Q}\) and the common features matrix \(\varvec{V}\), respectively.
Since each label is determined by several specific features of its own, the label-specific features matrix \(\varvec{Q}\) is enforced to be sparse. Meanwhile, regularization using \(l_{1}\) norm can induce sparsity among all elements in \(\varvec{Q}\), leading to certain parameters being shrunk to zero. Therefore, in order to extract label-specific features of each label, \(l_{1}\) norm regularizer is utilized to make full use of this sparsity. To regularize the common features matrix \(\varvec{V}\), we use the \(l_{2,1}\) norm, which promotes a sparse representation with a few non-zero rows. This ensures that discriminative features that are common across all labels are selected. To sum up, the optimization function can be reformulated as follows:
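(a sketch under the assumptions above; the exact weighting of the loss term is an assumption)

$$\min _{\varvec{Q},\varvec{V}}\ \frac{1}{2}||X(\varvec{Q}+\varvec{V})-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}.$$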
For label-specific features matrix \(\varvec{Q}\), a non-zero value of \(q_{ij}\) signifies that the i-th feature distinguishes the j-th label, thereby qualifying as a label-specific feature for that label. Conversely, a zero value of \(q_{ij}\) implies that the corresponding feature is not informative for the j-th label. For common features matrix \(\varvec{V}\), if \(v_{i}\ne 0\), it indicates that the i-th feature exhibits strong discriminatory power for the label set.
3.3 Label correlation
In label distribution learning, each sample is assigned an interrelated label distribution. Therefore, taking label correlation into account proves advantageous in enhancing the weights of discriminative features. In addition, the label correlation constraint has been demonstrated to provide substantial performance improvements in label distribution learning. Leveraging the labels linked with a specific label allows for the adjustment of that label's distribution value. If a robust connection exists between two labels \(l_{p}\) and \(l_{q}\), their distributions of discriminative features ought to be alike; otherwise, these distributions should differ. As a result, the objective function can be reformulated as follows:
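(a sketch, with \(q_{p}\) denoting the p-th column of \(\varvec{Q}\) and the term weights assumed as above)

$$\min _{\varvec{Q},\varvec{V}}\ \frac{1}{2}||X(\varvec{Q}+\varvec{V})-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\frac{\gamma }{2}\sum _{p=1}^{l}\sum _{q=1}^{l}r_{pq}\,q_{p}^{T}q_{q}$$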
where \(r_{pq}=\frac{1}{s_{pq}-1}\) , and \(s_{pq}\) denotes the correlation coefficient between the distributions of the p-th label \(y_{p}\) and q-th label \(y_{q}\). \(\gamma \) is the balance factor. In this paper, the Pearson correlation function is employed to compute the label correlation matrix \(s_{pq}\), as outlined below:
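(the standard Pearson form)

$$s_{pq}=\frac{\sum _{i=1}^{n}(d_{ip}-\mu _{p})(d_{iq}-\mu _{q})}{\sqrt{\sum _{i=1}^{n}(d_{ip}-\mu _{p})^{2}}\sqrt{\sum _{i=1}^{n}(d_{iq}-\mu _{q})^{2}}}$$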
where \(\mu _{p}\) and \(\mu _{q}\) denote the means of the p-th and q-th label distributions over all instances in the label distribution matrix D, respectively. In light of the fact that \(\sum _{p=1}^{l}\sum _{q=1}^{l}r_{pq}q_{p}^{T}q_{q}=tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})\), the optimization objective function can be further rewritten as the following form:
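(a sketch under the same assumptions as above)

$$\min _{\varvec{Q},\varvec{V}}\ \frac{1}{2}||X(\varvec{Q}+\varvec{V})-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\frac{\gamma }{2}tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})$$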
where tr(.) is the trace of a matrix, and \(\varvec{R}=[r_{pq}]_{l\times l}\) denotes the correlation matrix of the labels. By considering label correlation through Pearson’s coefficient, the optimization function can more efficiently select features with distinguishing ability.
3.4 Feature relevance
In label distribution learning, utilizing information theory to develop feature evaluation criteria is a common and effective approach. Generally speaking, information theory is a traditional and useful measure for determining the correlation among random variables. Formally, the mutual information shared by two random variables, X and Y, can be delineated as follows:
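(in its standard form, with \(H(\cdot )\) denoting Shannon entropy)

$$I(X;Y)=\sum _{x}\sum _{y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}=H(X)-H(X|Y)$$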
Consider Z as a discrete random variable. The conditional mutual information can also be articulated using information entropy as follows:
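(in its standard entropy-based form)

$$I(X;Y|Z)=H(X|Z)-H(X|Y,Z)$$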
In feature selection approaches based on information theory, the construction of feature relevance terms is typically dependent on the quantity of information that selected or candidate features contribute to the label set. Hu et al. [13] proposed a new feature relevance term, which considered both the altered ratio of the undetermined information quantity and the modified ratio of the established information quantity. The corresponding evaluation function is outlined below:
where F represents the whole set of features, S denotes the subset of selected features and L signifies the label set, \(f_{k}\in F-S;f_{j}\in S;l_{i}\in L\). The function \(J(f_{k};f_{j};l_{i})\) concurrently takes into account the altered ratio of the undetermined information quantity and the modified ratio of the established information quantity.
We inherit this evaluation method to assess feature importance, and propose the following optimization objective function:
where \(J\left( f_{k};f_{F-k};l_{j}\right) \) signifies the relevance between the feature \(f_{k}\) and the label \(l_{j}\), while \(w_{kj}\) represents the importance of the feature \(f_{k}\) for the label \(l_{j}\). To simplify, the similarity between them can be directly computed as follows:
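(one plausible form, assuming a Minkowski-type distance between the relevance value and the corresponding weight)

$$d\big (J(f_{k};f_{F-k};l_{j}),w_{kj}\big )=\big |J(f_{k};f_{F-k};l_{j})-w_{kj}\big |^{p}$$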
where p is set as 2.
Then, the final objective function arrives at:
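(a sketch, assuming the relevance values are collected into a matrix \(\varvec{C}=[J(f_{k};f_{F-k};l_{j})]\in R^{m\times l}\) and \(\varvec{W}=\varvec{Q}+\varvec{V}\))

$$\min _{\varvec{Q},\varvec{V}}\ \frac{1}{2}||X(\varvec{Q}+\varvec{V})-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\frac{\gamma }{2}tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\frac{\delta }{2}||\varvec{W}-\varvec{C}||_{F}^{2}$$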
where \(\delta \) is the tradeoff parameter. In summary, the proposed optimization feature selection framework uses the first term as a loss function, which calculates the difference between the predicted label distribution and the actual one. The second and third terms extract label-specific features for each class label and find common features for all labels, respectively. Besides, the coefficient matrix \(\varvec{W}\) is involved in the fourth and fifth terms, so that the optimized feature selection result is affected by label correlation and feature relevance simultaneously.
3.5 Optimization
The ultimate objective function of our approach is convex, yet it does not constitute a smooth optimization problem owing to the presence of the \(l_{2,1}\)- and \(l_{1}\)-norm regularizers. To tackle this issue, the optimization problem can be reformulated in the following manner:
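(a sketch of the relaxed problem under the same assumptions)

$$\min _{\varvec{Q},\varvec{V}}\ \frac{1}{2}||X(\varvec{Q}+\varvec{V})-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta \,tr(\varvec{V}^{T}\varvec{A}\varvec{V})+\frac{\gamma }{2}tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\frac{\delta }{2}||\varvec{W}-\varvec{C}||_{F}^{2}$$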
In this context, \(||V||_{2,1}\) is directly relaxed by \(tr(V^{T}AV)\), with \(\varvec{A}\) representing an \(m\times m\) diagonal matrix whose i-th diagonal value is expressed as \(A_{ii}=\frac{1}{2||V_{i}||_{2}}\). Furthermore, the optimization problem can be efficiently addressed by optimizing \(\varvec{Q}\) and \(\varvec{V}\) in an alternating manner.
Fix \(\varvec{Q}\), Optimize \(\varvec{V}\): With \(\varvec{Q}\) held constant, the optimization problem in relation to \(\varvec{V}\) can be restructured as follows:
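(a sketch of the \(\varvec{V}\)-subproblem, assuming the correlation term also acts on \(\varvec{V}\), consistent with the derivatives used below)

$$\min _{\varvec{V}}\ \frac{1}{2}||X\varvec{V}-\varvec{E}||_{F}^{2}+\beta \,tr(\varvec{V}^{T}\varvec{A}\varvec{V})+\frac{\gamma }{2}tr(\varvec{R}\varvec{V}^{T}\varvec{V})+\frac{\delta }{2}||\varvec{Q}+\varvec{V}-\varvec{C}||_{F}^{2}$$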
where \(\varvec{E}=D-X\varvec{Q}\). Here, matrices \(\varvec{A}\) and \(\varvec{R}\) are positive semidefinite, so \(tr(\varvec{V}^{T}\varvec{A}\varvec{V})\ge 0\) and \(tr(\varvec{R}\varvec{V}^{T}\varvec{V})\ge 0\) hold for any nonzero \(\varvec{V}\). Thus, we can get \(\frac{\partial tr(\varvec{V}^{T}\varvec{A}\varvec{V})}{\partial \varvec{V}}=(\varvec{A}^{T}+\varvec{A})\varvec{V}\) and \(\frac{\partial tr(\varvec{R}\varvec{V}^{T}\varvec{V})}{\partial \varvec{V}}=\varvec{V}(\varvec{R}^{T}+\varvec{R})\). (13) is differentiable and has a closed-form solution. Therefore, we can transform (13) into the following one:
By setting the derivative of (14) with respect to \(\varvec{V}\) to 0, we can get the solution for \(\varvec{V}\) as follows:
Fix \(\varvec{V}\), Optimize \(\varvec{Q}\): With \(\varvec{V}\) held constant, the optimization problem in relation to \(\varvec{Q}\) can be restructured as follows:
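(a sketch, analogous to the \(\varvec{V}\)-subproblem)

$$\min _{\varvec{Q}}\ \frac{1}{2}||X\varvec{Q}-\varvec{E}||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\frac{\gamma }{2}tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\frac{\delta }{2}||\varvec{Q}+\varvec{V}-\varvec{C}||_{F}^{2}$$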
where \(\varvec{E}=D-X\varvec{V}\).
Employing the accelerated proximal gradient strategy, we address the convex optimization problem, which is expressed as follows:
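(the standard composite form)

$$\min _{\varvec{Q}\in H}\ F(\varvec{Q})=f(\varvec{Q})+g(\varvec{Q})$$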
Here, H represents the real Hilbert space. According to (15) and (17), \(f(\varvec{Q})\) and \(g(\varvec{Q})\) can be derived as follows:
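(a sketch, taking the smooth and non-smooth parts of the \(\varvec{Q}\)-subproblem above)

$$f(\varvec{Q})=\frac{1}{2}||X\varvec{Q}-\varvec{E}||_{F}^{2}+\frac{\gamma }{2}tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\frac{\delta }{2}||\varvec{Q}+\varvec{V}-\varvec{C}||_{F}^{2},\qquad g(\varvec{Q})=\alpha ||\varvec{Q}||_{1}$$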
\(f(\varvec{Q})\) is both convex and smooth, while g(Q), although convex, is typically not smooth.
For \(\nabla f(\varvec{Q})\), it can be obtained by taking the derivative of (18) with respect to \(\varvec{Q}\). That is
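(for the \(f(\varvec{Q})\) sketched above, assuming \(\varvec{R}\) is symmetric)

$$\nabla f(\varvec{Q})=X^{T}(X\varvec{Q}-\varvec{E})+\gamma \varvec{Q}\varvec{R}+\delta (\varvec{Q}+\varvec{V}-\varvec{C})$$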
According to the existing studies [18], below we propose to minimize the separable quadratic approximation sequence of \(f(\varvec{Q})\) by the proximal gradient algorithm rather than minimizing it directly, which is expressed as
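(the standard proximal-gradient update)

$$\varvec{Q}^{(t+1)}=\arg \min _{\varvec{Q}}\ \frac{L_{f}}{2}||\varvec{Q}-O^{(t)}||_{F}^{2}+\alpha ||\varvec{Q}||_{1}$$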
where \(L_{f}\) is termed as the Lipschitz constant and \(O^{(t)}=\varvec{Q}^{(t)}-\frac{1}{L_{f}}\nabla f(\varvec{Q})^{(t)}\).
Proposition 1
If H is a Euclidean space endowed with the Frobenius norm \(||\cdot ||_{F}\) and the \(l_{1}\) norm \(||\cdot ||_{1}\), \(\varvec{Q}\) can be updated by soft-shrinkage operator as follows:
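(with threshold \(\alpha /L_{f}\), following the standard soft-shrinkage result)

$$\varvec{Q}^{(t+1)}=\zeta _{\frac{\alpha }{L_{f}}}\big [O^{(t)}\big ]$$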
According to Proposition 1, the soft-thresholding operator \(\zeta [O^{(t)}]\) and the closed-form solution for \(\varvec{Q}\) can be collectively expressed as follows:
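(the standard element-wise soft-thresholding expression, with the threshold \(\alpha /L_{f}\) inherited from the quadratic approximation above)

$$q_{ij}^{(t+1)}=\zeta _{\frac{\alpha }{L_{f}}}\big [o_{ij}^{(t)}\big ]={\text {sign}}\big (o_{ij}^{(t)}\big )\max \Big (\big |o_{ij}^{(t)}\big |-\frac{\alpha }{L_{f}},0\Big )$$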
where \(1\le i\le m\) and \(1\le j\le l\). This operator can be extended to vectors and matrices by applying it element-wise.
Given \(\varvec{Q}_{1}\) and \(\varvec{Q}_{2}\), the Lipschitz constant \(L_{f}\) satisfies the following inequality, from which its value can be derived:
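(under the gradient sketched above)

$$||\nabla f(\varvec{Q}_{1})-\nabla f(\varvec{Q}_{2})||_{F}=||X^{T}X(\varvec{Q}_{1}-\varvec{Q}_{2})+\gamma (\varvec{Q}_{1}-\varvec{Q}_{2})\varvec{R}+\delta (\varvec{Q}_{1}-\varvec{Q}_{2})||_{F}\le L_{f}\,||\varvec{Q}_{1}-\varvec{Q}_{2}||_{F}$$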
The Lipschitz constant \(L_{f}\) can be set as
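(one admissible choice under the gradient sketched above)

$$L_{f}=||X^{T}X||_{2}+\gamma ||\varvec{R}||_{2}+\delta ,$$

where \(||\cdot ||_{2}\) denotes the spectral norm.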
On the basis of the above optimization process, a label distribution feature selection algorithm based on label-specific features is proposed. The flow chart of this algorithm is depicted in Fig. 3.
As shown in Fig. 3, the feature matrix is initialized and the correlation coefficient of the optimization function is calculated. Then, the label-specific feature matrix \(Q^{(t)}\) and the common feature matrix \(V^{(t)}\) are progressively updated until the termination condition is met. Finally, the top-ranked features are obtained through the feature importance matrix.
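To make the alternating procedure concrete, the following is a minimal NumPy/SciPy sketch of the workflow in Fig. 3 under the objective sketched in this section; the Sylvester-based \(\varvec{V}\)-step, the fixed Lipschitz constant, the least-squares initialization, the stopping rule, and all function and parameter names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def soft_threshold(M, tau):
    """Element-wise soft-shrinkage operator zeta_tau[M]."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def fslsf_sketch(X, D, C, R, alpha=0.01, beta=0.01, gamma=0.001, delta=0.001,
                 max_iter=100, tol=1e-5):
    """Illustrative sketch of the alternating optimization in Fig. 3.

    X: (n, m) instances, D: (n, l) label distributions,
    C: (m, l) feature-relevance matrix, R: (l, l) Pearson-based label matrix.
    The objective follows the sketch in Section 3 and is an assumption,
    not the authors' released implementation.
    """
    n, m = X.shape
    XtX = X.T @ X

    # Initialize W by least squares and split it evenly between Q and V.
    W0 = np.linalg.lstsq(X, D, rcond=None)[0]
    Q, V = 0.5 * W0, 0.5 * W0

    # One admissible Lipschitz constant for the smooth part of the Q-step.
    Lf = np.linalg.norm(XtX, 2) + gamma * np.linalg.norm(R, 2) + delta

    for _ in range(max_iter):
        W_old = Q + V

        # V-step: quadratic subproblem with the l_{2,1} term relaxed as
        # tr(V^T A V), where A is diagonal with A_ii = 1 / (2 ||V_i||_2).
        A = np.diag(1.0 / (2.0 * np.linalg.norm(V, axis=1) + 1e-8))
        rhs = X.T @ (D - X @ Q) + delta * (C - Q)
        lhs = XtX + 2.0 * beta * A + delta * np.eye(m)
        # Solve lhs @ V + V @ (gamma * R) = rhs (assumes this Sylvester
        # equation is solvable, i.e. lhs and -gamma*R share no eigenvalues).
        V = solve_sylvester(lhs, gamma * R, rhs)

        # Q-step: one proximal-gradient (soft-thresholding) update.
        grad = XtX @ Q - X.T @ (D - X @ V) + gamma * Q @ R \
               + delta * (Q + V - C)
        Q = soft_threshold(Q - grad / Lf, alpha / Lf)

        if np.linalg.norm(Q + V - W_old) < tol:
            break

    W = Q + V
    # Rank features by the row norms of the importance matrix W.
    ranking = np.argsort(-np.linalg.norm(W, axis=1))
    return W, ranking
```

Given X, D, a feature-relevance matrix C, and a Pearson correlation matrix R, the returned ranking can be truncated to the top-ranked features as described in Section 4.3.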
Analysis of algorithm FSLSF. The time complexity of the proposed method FSLSF is composed of three components: initialization, computation of the correlation matrix and Lipschitz constant, and feature selection. Initially, the time complexity of initializing the variables \(\varvec{V}\) and \(\varvec{Q}\) is \(O(m^{2}n+m^{3}+mnl)\), where n, m, and l denote the number of instances, features, and labels, respectively. Then, the feature relevance matrix \(\varvec{C}\) is computed with a time complexity of O(mnl), and the label correlation matrix \(\varvec{R}\) is computed with a time complexity of \(O(ml^{2})\). The process of computing the Lipschitz constant \(L_{f}\) takes \(O(m^{3}+l^{3})\). Finally, the feature selection process has a time complexity of \(O(t(m^{2}l+ml^{2}+m^{2}n+m^{3}+mnl))\), where t indicates the number of iterations. Therefore, the total time complexity of the proposed algorithm is \(O(l^{3}+t(m^{2}l+ml^{2}+m^{2}n+m^{3}+mnl))\).
4 Experiments
In this section, we delve into the performance and feasibility of our algorithm via comparative experiments. Initially, the experimental datasets are outlined in Section 4.1. Subsequently, Section 4.2 introduces the widely used evaluation metrics. Following this, Section 4.3 establishes the settings for the parameters and the count of selected features for the experimental algorithms. Next, the effectiveness of FSLSF is confirmed through an analysis of experimental results in Section 4.4. Finally, we conduct an analysis of parameter sensitivity in Section 4.5.
4.1 Datasets
To assess the reliability of the proposed FSLSF approach, we have conducted extensive comparative experiments on feature selection tasks across sixteen benchmark datasets. The specifics of these datasets are summarized in Table 1, where '#instance', '#feature', and '#label' denote the number of instances, features, and labels, respectively. All sixteen datasets are label distribution datasets; the first fifteen can be downloaded from the associated website, and RAF-ML is a facial expression dataset that can be found in [19].
4.2 Evaluation metrics
In this study, six evaluation metrics are utilized to measure the performance of FSLSF. Considering the testing dataset \(\bar{S}=\{\bar{x_{i}};\bar{D_{i}}|1\le i\le n\}\), where \(\bar{D_{i}}=\{\bar{d_{1}},\ldots ,\bar{d_{j}},\ldots ,\bar{d_{l}}\}\) represents the actual label distribution of the instance \(\bar{x_{i}}\). The predicted label distribution for \(\bar{x_{i}}\), denoted as \(\hat{D_{i}}=\{\hat{d_{1}},\ldots ,\hat{d_{j}},\ldots ,\) \(\hat{d_{l}}\}\), is predicted by the label distribution algorithm SA-BFGS. The six evaluation metrics are categorized into two types: distance-based metrics (including Chebyshev, Clark, Canberra, and Kullback–Leibler divergence) and similarity-based metrics (comprising Cosine and Intersection). For the distance-based metrics, a lower value signifies better performance, indicated by the downward arrow (\(\downarrow \)). Conversely, for the similarity-based metrics, a higher value represents better performance, indicated by the upward arrow (\(\uparrow \)). Detailed explanations of these evaluation metrics are provided in Table 2.
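For reference, these metrics take their standard forms in the LDL literature (computed per test instance and averaged over the test set):

$$\text {Chebyshev}=\max _{j}|\bar{d_{j}}-\hat{d_{j}}|,\quad \text {Clark}=\sqrt{\sum _{j=1}^{l}\frac{(\bar{d_{j}}-\hat{d_{j}})^{2}}{(\bar{d_{j}}+\hat{d_{j}})^{2}}},\quad \text {Canberra}=\sum _{j=1}^{l}\frac{|\bar{d_{j}}-\hat{d_{j}}|}{\bar{d_{j}}+\hat{d_{j}}},$$

$$\text {KL}=\sum _{j=1}^{l}\bar{d_{j}}\ln \frac{\bar{d_{j}}}{\hat{d_{j}}},\quad \text {Cosine}=\frac{\sum _{j=1}^{l}\bar{d_{j}}\hat{d_{j}}}{\sqrt{\sum _{j=1}^{l}\bar{d_{j}}^{2}}\sqrt{\sum _{j=1}^{l}\hat{d_{j}}^{2}}},\quad \text {Intersection}=\sum _{j=1}^{l}\min (\bar{d_{j}},\hat{d_{j}}).$$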
4.3 Experimental settings
In this subsection, we discuss the experimental settings. Our proposed method, FSLSF, is evaluated against five other state-of-the-art feature selection methods: RLFC, G3WI, LRLSF, FSFL, and CLEMLFS. For FSLSF, the regularization parameters \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \) are adjusted within the range of \(\{10^{-3},10^{-2},...,10^{2}\}\). The parameters associated with the compared methods are configured as follows:
-
RLFC [31] was a relevance-based feature selection approach for label distribution data, which takes both feature relevance and label correlation into account. The parameters \(\alpha \), \(\beta \), and \(\gamma \) are searched in \(\{10^{-3},10^{-2},...,10^{2}\}\).
-
G3WI [50] was a novel gradient-based multi-label feature selection approach, which integrates three-way variable interactions into a comprehensive global optimization objective. The parameters \(\alpha \) and \(\beta \) are tuned in the range of \(\{0,10^{-3},10^{-2},\ldots ,10^{1}\}\).
-
LRLSF [42] was a method for multi-label feature selection that relied on stable label relevance and label-specific features. It integrated both global and local label relevance to learn the features specific to each label. The trade-off parameters \(\lambda _{1}\) , \(\lambda _{2}\), \(\lambda _{3}\) and \(\lambda _{4}\) are searched in \(\{10^{-3},10^{-2},...,10^{3}\}\).
-
FSFL [30] was a feature selection approach for label distribution data, which considers feature weights fusion and local label correlations. The number of clusters k is set as 5 while \(\alpha \) and \(\beta \) are tuned in \(\{10^{-3},10^{-2},...,10^{2}\}\).
-
CLEMLFS [12] was a feature selection algorithm for multi-label data, which uses a label enhancement algorithm to enhance traditional logical labels into label distribution. The balance factors \(\lambda _{1}\) and \(\lambda _{2}\) are tuned in the range of \(\{10^{-2},10^{-1},...,10^{2}\}\).
Since multi-label feature selection methods cannot directly deal with real-world label distribution datasets, we employ an equal-width strategy to discretize the label distribution datasets prior to performing dimensionality reduction. Furthermore, it is important to highlight that the algorithms mentioned earlier are capable of ranking features and managing the length of the selected feature subset. To ensure a fair comparison, each experimental algorithm selects an equal number of features on the respective datasets. Consequently, feature spaces of varying dimensionality are reduced to the sizes recommended in [15]. The specifics of this process are outlined below:
-
When \(m\le \)100, \(u=40\%\), that is, select the top 40% \(\times \) m features.
-
When \(100<m\le 500\), \(u=30\%\), that is, select the top 30% \(\times \) m features.
-
When \(500<m\le 1000\), \(u=20\%\), that is, select the top 20% \(\times \) m features.
-
When \(m>\)1000, \(u=10\%\), that is, select the top 10% \(\times \) m features.
where m represents the number of original features. We adopt a 10-fold cross-validation strategy for conducting our experiments. Subsequently, the SA-BFGS algorithm [6] is utilized to train the learner on the aforementioned datasets after feature selection. This process yields the output and provides the predictive performance during testing.
4.4 Experimental results
The purpose of this subsection is to demonstrate the efficiency of FSLSF on fifteen label distribution datasets in terms of Chebyshev, Clark, Canberra, KL, Cosine, and Intersection. All algorithms employ the same cross-validation method for performance evaluation in our experiments. The experimental results of the six algorithms on the test sets for the selected feature subsets, under the six evaluation criteria, are presented in Tables 3, 4, 5, 6, 7 and 8. Each table provides a performance overview of the experimental algorithms, evaluated with the corresponding metric. The second-to-last row in each table, labeled "Avg. Ranking", records the average rankings of these experimental algorithms. Furthermore, the Win/Tie/Loss counts comparing FSLSF to the other algorithms on the fifteen datasets are displayed in the final row. The best performance across the six feature selection algorithms is highlighted in bold.
From Tables 3-8, it is clear that our proposed method, FSLSF, consistently outperforms the other compared algorithms in most cases. Its average rankings for the evaluation metrics, namely Chebyshev distance, Clark distance, Canberra distance, Kullback-Leibler divergence, cosine similarity, and intersection similarity, are 1.25, 1.31, 1.50, 1.38, 1.56, and 1.50, respectively. Moreover, detailed comparisons of the algorithms reveal the following:
-
1.
With respect to Chebyshev in Table 3, FSLSF exhibits superior performance across 13 datasets. Taking the Natural Scene dataset as an example, FSLSF achieves the best performance of 0.2977, its Chebyshev distance being 11.31% to 31.20% lower than those of the RLFC, G3WI, LRLSF, FSFL, and CLEMLFS algorithms. Furthermore, the "Win/Tie/Loss" statistics indicate that the FSLSF algorithm performs better across all datasets than the RLFC, G3WI, and LRLSF algorithms.
-
2.
As depicted in Table 4 for the Clark metric, FSLSF delivers the best performance on almost all datasets, except for the SBU_3DFE, Yeast-heat, and Yeast-cold datasets. Notably, the performance enhancement on the SJAFFE and Movie datasets is quite significant. For example, on the SJAFFE dataset, the proposed method FSLSF is 5.46%, 18.31%, 21.74%, 36.90% and 12.34% better than RLFC, G3WI, LRLSF, FSFL, and CLEMLFS, respectively. In addition, FSLSF attains near-optimal performance on 3 datasets, marginally falling short of FSFL on the SBU_3DFE, Yeast-heat, and Yeast-cold datasets.
-
3.
According to the Canberra metric presented in Table 5, FSLSF achieves the highest or near-optimal performance across 13 datasets. In particular, FSLSF outperforms the second-ranked algorithms by 14.16% and 28.91% on Yeast-alpha and Yeast-cdc, respectively. FSLSF achieves a moderate level of performance on 3 datasets, slightly inferior to CLEMLFS and FSFL on the Yeast-heat and Yeast-cold datasets, and slightly inferior to CLEMLFS and G3WI on the RAF-ML dataset. Despite not always achieving the best performance, FSLSF consistently outperforms RLFC and LRLSF across all datasets and delivers the best average performance. This underscores the effectiveness and versatility of the proposed model.
-
4.
For the Kullback-Leibler metric, shown in Table 6, it is clear that FSLSF obtains the highest performance on 12 datasets, excluding the SBU_3DFE, Yeast-heat, Yeast-cold and RAF-ML datasets. In detail, on the Yeast-diau dataset, the proposed method obtains a Kullback-Leibler value of 0.0086, which is 23.72%-46.13% better than RLFC, G3WI, LRLSF, FSFL, and CLEMLFS. In addition, on the SBU_3DFE dataset, FSLSF ranks second, 2.42% inferior to RLFC.
-
5.
With respect to the Cosine metric from Table 7, FSLSF exhibits the highest performance on 12 datasets. In detail, on the dataset Movie, the proposed method outperforms RLFC, G3WI, LRLSF, FSFL and CLEMLFS by up to 13.63%, 4.95%, 22.52%, 9.88% and 15.36%, respectively. In addition, FSLSF performs slightly worse than RLFC and CLEMLFS on dataset SBU_3DFE and slightly inferior to FSFL and CLEMLFS on dataset Yeast-cold.
-
6.
In relation to the Intersection metric, as shown in Table 8, FSLSF achieves superior performance on 12 datasets. The statistical results of “Win\Tie\Loss” indicate that FSLSF consistently outperforms RLFC. Furthermore, when FSLSF ranks first, it surpasses the performance of the second-ranked algorithms by 1.50%, 1.36%, 0.69%, 5.39% and 4.94% on datasets SJAFFE, Human Gene, SBU_3DFE, Natural Scene and Movie, respectively. When FSLSF is behind, it is 0.27%, 0.26%, 1.26% and 0.05% inferior to the first-ranked algorithm on datasets Yeast-elu, Yeast-heat, Yeast-cold and RAF-ML, respectively. These statistics underscore that FSLSF’s performance improvements outweigh its reductions, thereby confirming FSLSF’s effectiveness.
To provide a comprehensive comparison of the performance between our proposed algorithm FSLSF and the other algorithms under consideration, we conducted a series of experiments to illustrate the performance trends for varying numbers of selected features. The fluctuations in the six evaluation metrics under different experimental methods are depicted in Fig. 4. Each figure in this set represents the performance change under a specific evaluation metric as the number of features varies. The x-coordinate represents the number of selected features, with an increasing step size set at three. Meanwhile, the y-coordinate signifies the performance after feature selection.
As seen in Fig. 4, our proposed algorithm FSLSF consistently surpasses the other five state-of-the-art algorithms in most cases, and the advantage is more pronounced when the number of features is greater than 70. Taking the Cosine similarity as an example, FSLSF outperforms RLFC, G3WI, LRLSF, FSFL, and CLEMLFS by up to 40.25%, 16.44%, 2.51%, 17.10%, and 23.03%, respectively. Moreover, it is important to note that there is no universally optimal value for the number of selected features. For instance, for the Chebyshev metric it is at the 50th feature, whereas for the Intersection metric it is at the 88th feature.
In summary, our proposed algorithm outperforms both multi-label feature selection algorithms and the label distribution feature selection algorithm. Unlike the training process of multi-label feature selection, which utilizes discretized label distribution data, our algorithm fully capitalizes on label distribution information to guide the selection of a more discriminative feature subset. Furthermore, our algorithm surpasses the performance of the label distribution feature selection algorithm. This enhanced performance can be attributed to our algorithm's ability to simultaneously learn common and label-specific features.
4.5 Parameter analysis
In this section, we examine the impact of parameters on the performance of the FSLSF algorithm, conducting a parameter analysis based on the Chebyshev, Clark, Canberra, KL, Cosine, and Intersection metrics. The parameters \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \) are the four parameters associated with the proposed FSLSF method. Here, \(\alpha \) is the regularized parameter for the label-specific features, \(\beta \) is the regularized parameter for the common features. In addition, \(\gamma \) modulates the contribution of label correlation, and \(\delta \) governs the influence of feature relevance. In experiments, we fixed two parameters while varying the others.
First, we set \(\gamma \) = 0.001 and \(\delta \) = 0.001, and find the optimal \(\alpha \) and \(\beta \) through parameter comparison experiments. For the Yeast-dtt dataset, the parameter \(\alpha \) varies from 0.001 to 100, and the parameter \(\beta \) varies from 0.001 to 100. Figure 5 shows the change in the evaluation metrics as parameters \(\alpha \) and \(\beta \) become large, and reveals the distinct roles that they play across the various evaluation metrics. For metrics such as Chebyshev, Clark, Canberra, and Kullback-Leibler, FSLSF's performance exhibits a fluctuating pattern, increasing and decreasing with the rise of parameters \(\alpha \) and \(\beta \). Conversely, under the Cosine and Intersection evaluation metrics, FSLSF's performance remains notably stable, with no significant changes observed. In addition, as \(\alpha \) increases, the influence of parameter \(\beta \) on the algorithm decreases, and FSLSF reaches stability when parameter \(\alpha \) reaches 10. Similarly, as \(\beta \) increases, there is no obvious variation in algorithm performance. When parameter \(\beta \) is 0.001, FSLSF achieves better performance in most cases. Then, we fixed \(\alpha \)=0.001 and \(\beta \)=0.001 and investigated the impacts of the parameters \(\gamma \) and \(\delta \). Figure 6 illustrates that, with the exception of the Kullback-Leibler metric, the performance of FSLSF remains consistent across nearly all other metrics. In Fig. 6(d), the Kullback-Leibler metric obtains the optimal value when \(\gamma =1\) and \(\delta =0.001\). In general, these results demonstrate that FSLSF's performance remains stable despite changes in its parameters.
4.6 The comparison of computational time
This subsection is dedicated to assessing the time efficiency of the FSLSF algorithm. Table 9 presents the experimental results of our method alongside the other feature selection algorithms. The row labeled 'Average' in this table signifies the mean computation time across all datasets, and the calculation time is given in seconds.
As shown in Table 9, FSLSF exhibits a moderate average computational time, somewhat higher than those of the G3WI and LRLSF algorithms. Since our proposed algorithm takes into account not only label correlation but also the relevance of each feature to each label based on information theory, it requires more running time. In addition, the G3WI algorithm benefits from pre-computing the feature correlation matrix and the feature interaction matrix, which significantly reduces its computational time. Although FSLSF is not optimal in terms of computational efficiency, its performance on the six evaluation metrics is better than that of the G3WI and LRLSF algorithms. In general, compared with the experimental results of the other five algorithms, the proposed feature selection algorithm FSLSF shows superior competitiveness.
5 Conclusion and future work
Some existing label distribution feature selection methods predominantly select a subset of common features that are shared by all class labels. However, in label distribution learning, an instance is linked with multiple class labels at once, and each class label may be influenced by its own specific features. Therefore, traditional feature selection approaches ignore label-specific features that are relevant and distinguishing for each class label during the feature selection process. To tackle this issue, we develop a feature selection approach that utilizes label-specific features to select highly informative features for label distribution data. Specifically, the label-specific features are extracted by exploiting \(l_{1}\)-regularization, while common features are learned through \(l_{2,1}\)-regularization. Then, the correlation information among labels and the feature relevance are exploited to direct the feature selection process and facilitate the generation of an optimal feature subset. Comprehensive experiments demonstrate that our method achieves highly competitive performance against other well-established feature selection algorithms. However, the proposed method does not directly address the complexities of incomplete label distribution data and exhibits sensitivity to noisy data. In future work, we will focus on the challenges of partial label distribution learning and propose corresponding robust feature selection algorithms.
Data Availability
Data sharing not applicable.
Code Availability
Code availability not applicable.
References
Akbari A, Awais M, Fatemifar S, Khalid SS, Kittler J (2021) A novel ground metric for optimal transport-based chronological age estimation. IEEE Trans Cybern 52(10):9986–9999
Al-Fahdawi S, Al-Waisy AS, Zeebaree DQ, Qahwaji R, Natiq H, Mohammed MA, Nedoma J, Martinek R, Deveci M (2024) Fundus-deepnet: multi-label deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images. Inf Fusion 102:102059
Berger A, Della Pietra SA, Della Pietra VJ (1996) A maximum entropy approach to natural language processing. Comput linguist 22(1):39–71
Fan Y, Chen B, Huang W, Liu J, Weng W, Lan W (2022) Multi-label feature selection based on label correlations and feature redundancy. Knowl-Based Syst 241:108256
Fan Y, Liu J, Tang J, Liu P, Lin Y, Du Y (2024) Learning correlation information for multi-label feature selection. Pattern Recognit 145:109899
Geng X (2016) Label distribution learning. IEEE Trans Knowl Data Eng 28(7):1734–1748
Geng X, Xia Y (2014) Head pose estimation based on multivariate label distribution. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1837–1842
Gupta A, Narayan S, Khan S, Khan FS, Shao L, van de Weijer J (2023) Generative multi-label zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence
Han Q, Hu L, Gao W (2024) Feature relevance and redundancy coefficients for multi-view multi-label feature selection. Inf Sci 652:119747
Hang JY, Zhang ML (2021) Collaborative learning of label semantics and deep label-specific features for multi-label classification. IEEE Trans Pattern Anal Mach Intell 44(12):9860–9871
Hao P, Hu L, Gao W (2023) Partial multi-label feature selection via subspace optimization. Inf Sci 648:119556
He Z, Lin Y, Wang C, Guo L, Ding W (2023) Multi-label feature selection based on correlation label enhancement. Inf Sci 647:119526
Hu L, Gao L, Li Y, Zhang P, Gao W (2022) Feature-specific mutual information variation for multi-label feature selection. Inf Sci 593:449–471
Huang J, Qin F, Zheng X, Cheng Z, Yuan Z, Zhang W, Huang Q (2019) Improving multi-label classification with missing labels by learning label-specific features. Inf Sci 492:124–146
Kashef S, Nezamabadi-pour H, Nikpour B (2018) Multilabel feature selection: a comprehensive review and guiding experiments. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(2):e1240
Li GL, Zhang HR, Min F, Lu YN (2023) Two-stage label distribution learning with label-independent prediction based on label-specific features. Knowl-Based Syst 267:110426
Li J, Zhang C, Zhou JT, Fu H, Xia S, Hu Q (2021) Deep-lift: deep label-specific feature learning for image annotation. IEEE Trans Cybern 52(8):7732–7741
Li J, Li P, Hu X, Yu K (2022a) Learning common and label-specific features for multi-label classification with correlation information. Pattern Recognit 121:108259
Li S, Deng W (2019) Blended emotion in-the-wild: multi-label facial expression recognition using crowdsourced annotations and deep locality feature learning. Int J Comput Vis 127(6–7):884–906
Li W, Chen J, Lu Y, Huang Z (2022b) Filling missing labels in label distribution learning by exploiting label-specific feature selection. In: 2022 International joint conference on neural networks (IJCNN), IEEE, pp 1–8
Lin Y, Liu H, Zhao H, Hu Q, Zhu X, Wu X (2022) Hierarchical feature selection based on label distribution learning. IEEE Transactions on Knowledge and Data Engineering
Liu H, Lin Y, Wang C, Guo L, Chen J (2023a) Semantic-gap-oriented feature selection in hierarchical classification learning. Inf Sci 642:119241
Liu K, Li T, Yang X, Chen H, Wang J, Deng Z (2023b) Semifree: semi-supervised feature selection with fuzzy relevance and redundancy. IEEE Transactions on Fuzzy Systems
Lu Y, Li W, Li H, Jia X (2023) Predicting label distribution from tie-allowed multi-label ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence
Ma J, Chow TW, Zhang H (2020) Semantic-gap-oriented feature selection and classifier construction in multilabel learning. IEEE Trans Cybern 52(1):101–115
Paul D, Bardhan S, Saha S, Mathew J (2023) Ml-knockoffgan: deep online feature selection for multi-label learning. Knowl-Based Syst 271:110548
Peng Y, Liu H, Li J, Huang J, Lu BL, Kong W (2022) Cross-session emotion recognition by joint label-common and label-specific eeg features exploration. IEEE Trans Neural Syst Rehabil Eng 31:759–768
Qian W, Xiong C, Qian Y, Wang Y (2022a) Label enhancement-based feature selection via fuzzy neighborhood discrimination index. Knowl-Based Syst 250:109119
Qian W, Xiong Y, Yang J, Shu W (2022b) Feature selection for label distribution learning via feature similarity and label correlation. Inf Sci 582:38–59
Qian W, Ye Q, Li Y, Dai S (2022c) Label distribution feature selection with feature weights fusion and local label correlations. Knowl-Based Syst 256:109778
Qian W, Ye Q, Li Y, Huang J, Dai S (2022d) Relevance-based label distribution feature selection via convex optimization. Inf Sci 607:322–345
Qian W, Xu F, Huang J, Qian J (2023) A novel granular ball computing-based fuzzy rough set for feature selection in label distribution learning. Knowl-Based Syst 278:110898
Qian W, Xiong Y, Ding W, Huang J, Vong CM (2024) Label correlations-based multi-label feature selection with label enhancement. Eng Appl Artif Intell 127:107310
Ren T, Jia X, Li W, Chen L, Li Z (2019) Label distribution learning with label-specific features. In: IJCAI, pp 3318–3324
Sharifi-Noghabi H, Harjandi PA, Zolotareva O, Collins CC, Ester M (2021) Out-of-distribution generalization from labelled and unlabelled gene expression data for drug response prediction. Nat Mach Intell 3(11):962–972
Su Y, Zhao W, Jing P, Nie L (2022) Exploiting low-rank latent gaussian graphical model estimation for visual sentiment distributions. IEEE Trans Multimed 25:1243–1255
Wang J, Geng X (2019) Classification with label distribution learning. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, pp 3712–3718
Xing C, Geng X, Xue H (2016) Logistic boosting regression for label distribution learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4489–4497
Xu P, Xiao L, Liu B, Lu S, Jing L, Yu J (2023a) Label-specific feature augmentation for long-tailed multi-label text classification. In: Proceedings of the AAAI conference on artificial intelligence, vol 37, pp 10602–10610
Xu T, Xu Y, Yang S, Li B, Zhang W (2023b) Learning accurate label-specific features from partially multilabeled data. IEEE Transactions on Neural Networks and Learning Systems
Yang L, Li M, Shen C, Hu Q, Wen J, Xu S (2020) Discriminative transfer learning for driving pattern recognition in unlabeled scenes. IEEE Trans Cybern 52(3):1429–1442
Yang Y, Chen H, Mi Y, Luo C, Horng SJ, Li T (2023) Multi-label feature selection based on stable label relevance and label-specific features. Inf Sci 648:119525
Yu ZB, Zhang ML (2021) Multi-label classification with label-specific feature generation: a wrapped approach. IEEE Trans Pattern Anal Mach Intell 44(9):5199–5210
Zeng Q, Geng J, Jiang W, Huang K, Wang Z (2021) Idln: iterative distribution learning network for few-shot remote sensing image scene classification. IEEE Geosci Remote Sens Lett 19:1–5
Zhang J, Liu K, Yang X, Ju H, Xu S (2023a) Multi-label learning with relief-based label-specific feature selection. Appl Intell 53(15):18517–18530
Zhang J, Wu H, Jiang M, Liu J, Li S, Tang Y, Long J (2023b) Group-preserving label-specific feature selection for multi-label learning. Expert Syst Appl 213:118861
Zhang ML, Wu L (2014) Lift: multi-label learning with label-specific features. IEEE Trans Pattern Anal Mach Intell 37(1):107–120
Zhang ML, Fang JP, Wang YB (2021) Bilabel-specific features for multi-label classification. ACM Transactions on Knowledge Discovery from Data 16(1):1–23
Zhang Q, Tsang EC, He Q, Guo Y (2023c) Ensemble of kernel extreme learning machine based elimination optimization for multi-label classification. Knowl-Based Syst 278:110817
Zou Y, Hu X, Li P (2024) Gradient-based multi-label feature selection considering three-way variable interaction. Pattern Recognition 145:109900
Acknowledgements
This work is supported by National Natural Science Foundation of China (62266018 and 62366019), and Natural Science Foundation of Jiangxi Province (20202BABL202037 and 20232BAB202052).
Author information
Contributions
Wenhao Shu: Conceptualization, Methodology, Visualization, Writing-original draft. Qiang Xia: Data curation, Software, Validation, Formal analysis, Writing-original draft. Wenbin Qian: Investigation, Supervision, Writing-review and editing.
Ethics declarations
Conflicts of Interest
The authors have no conflicts of interest to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shu, W., Xia, Q. & Qian, W. Label distribution feature selection based on label-specific features. Appl Intell 54, 9195–9212 (2024). https://doi.org/10.1007/s10489-024-05668-8