Abstract
Label distribution learning deals with label ambiguity by describing the degree of relevance of each label to a specific instance. For this novel machine learning paradigm, the curse of dimensionality is one of the prominent problems. Feature selection is a vital preprocessing step to reduce the high dimensionality of data. However, most existing label distribution feature selection methods focus on selecting a feature subset that is relevant to all labels, ignoring label-specific features with the maximum discriminatory power for each label. To tackle this issue, a label distribution feature selection algorithm based on label-specific features is proposed in this paper. First, we introduce a feature selection optimization model for label distribution data that simultaneously considers common and label-specific features, leveraging sparse learning to further investigate the intricate relationships between features and labels. Subsequently, the label correlation coefficient is employed to enhance the collaborative learning effect of labels. Finally, the relevance between features and labels is taken into account to guide the feature selection process, which can effectively eliminate redundant features. Comprehensive experiments demonstrate the advantage of our proposed method over other well-established feature selection algorithms in selecting label-specific features for label distribution data.
1 Introduction
In recent years, single-label learning and multi-label learning have been two representative paradigms, widely used in solving label ambiguity problems. Specifically, single-label learning assumes that each instance is associated with a single label, and labels are mutually exclusive. However, in real-world applications, an instance could simultaneously embody multiple semantic meanings [26, 49]. For instance, in facial emotional expressions, an expression may include a variety of complex emotions such as happiness, anger, and sadness. Similarly, in text annotation, a news item may be relevant to several topics such as politics, economics, and entertainment. In multi-label learning [2, 8], each instance is related to multiple labels, and each label is regarded as equally important. Both single-label learning and multi-label learning basically regard the relationship between an instance and its label as binary, meaning that a label is either pertinent to the instance or not. Some tasks, however, consider not only whether instances are associated with labels, but also the different degree of importance of each label [24]. Consequently, [6] proposed a more generalized learning paradigm, called label distribution learning. Label distribution learning addresses label ambiguity explicitly by describing how relevant each label is to the instance.
Owing to the advantage of learning richer semantics from data compared to multi-label learning, label distribution learning has been widely used in various real-world applications. These applications fall mainly into the following areas: head posture recognition, scene classification, emotion analysis, and indoor crowd counting. Taking scene classification as an example, a scene topic is generally composed of various basic words, namely, ‘road’, ‘water’, and ‘veg’. Each of these basic words contributes in a unique way to the formation of the scene classification. However, as data from videos, images, and text continues to increase, the label distribution data in the above applications often grapples with the issue of high dimensionality. The emergence of high-dimensional label distribution data not only increases the computational and storage costs, but also reduces the quality of learning models. Feature selection, as an effective data preprocessing technology, reduces the data dimension and improves algorithm performance. Until now, a variety of feature selection approaches have been studied to select relevant features from label distribution data [21, 23, 28, 32]. Qian et al. [29] proposed a specialized feature selection algorithm by taking into account both feature similarity and label correlation, which can deal with high-dimensional label distribution data efficiently. [22] proposed a hierarchical feature selection method to deal with the semantic gap, which uses label enhancement to obtain a label distribution. [33] developed a multi-label feature selection algorithm for label distribution data, which utilizes label enhancement based on the deep forest to transform the logical labels. These methods aim to learn a single subset of features that is used for discriminating all labels. However, this strategy may not be optimal, as each label may require its own specific features for decision-making.
Therefore, it is essential to consider label-specific features explicitly in the feature selection process, thereby enhancing the performance of label distribution learning.
Recently, the strategy of label-specific features has been integrated with several tasks in label distribution learning, such as exploiting label correlation [34] and completing missing labels [20]. Ren et al. [34] developed a novel label distribution learning algorithm by leveraging label-specific features, which simultaneously selects label-specific features and common features. [20] proposed a label distribution learning approach for incomplete data by exploiting label-specific feature selection, which can fill in missing label information. In general, these methods select label-specific features for each label through a coefficient matrix W. For example, as shown in Fig. 1, let us consider 3 instances with a 6-dimensional feature space {\(f_{1},f_{2},\ldots ,f_{6}\)}, and their corresponding labels \(y_{1}\) and \(y_{2}\). Upon training, we obtain a coefficient matrix W, which comprises two vectors {\(w_{1},w_{2}\)} corresponding to the two labels. The nonzero elements of {\(w_{1},w_{2}\)} allow us to identify the specific features associated with labels \(y_{1}\) and \(y_{2}\). It is worth noting that two correlated labels may have different label-specific features. Therefore, these methods employ label correlation to ensure that two strongly correlated labels yield similar outputs. Previous studies on label-specific features have considered the task of exploiting label correlation or completing missing labels, but they have rarely taken into account the feature relevance along with the generation of label-specific features.
In light of these observations, the strategy of label-specific features has been widely accepted in data classification. However, most models deal with multi-label data, which discards some of the label information in label distribution data [4, 5, 9, 11]. Moreover, many existing works on label-specific features ignore the relationship between features and labels [16, 18, 40]. To address these issues, this paper proposes a novel feature selection approach that exploits label-specific features for label distribution data. First, the optimization framework is formulated by sparse learning, which exploits specific features with \(l_{1}\)-norm regularization and common features with \(l_{2,1}\)-norm regularization. Second, the Pearson correlation coefficient is applied to measure the correlation information among labels, which can assist the optimization process and enhance the generalization ability of the proposed framework. Furthermore, mutual information variation is incorporated into the proposed optimization function to examine the relevance between features and labels, which can enhance the discriminatory ability of the feature selection model. Finally, a series of comprehensive experiments is conducted to verify the effectiveness of our proposed method.
The major contributions of this work are summarized in the following four points:

1.
A novel feature selection approach for label distribution data is proposed, which simultaneously considers common and label-specific features. Moreover, the proposed method further improves performance by exploiting label-specific features and sparse learning.

2.
Label correlation and feature relevance are simultaneously incorporated in feature selection, thereby effectively enhancing the generalization ability of the learning model and reducing the redundancy of features.

3.
The proposed method can directly deal with label distribution data, without discretizing the label information. Thus, the performance of label distribution learning can be significantly improved by selecting discriminative features.

4.
An extensive empirical comparison of our method with other state-of-the-art feature selection algorithms is conducted on fifteen benchmark datasets using six evaluation metrics. The experimental results demonstrate the superiority of our method.
This paper is structured as follows. Section 2 introduces the work related to this paper. The proposed approach is detailed in Section 3. In Section 4, the experimental results obtained by the proposed approach are analyzed. In Section 5, the paper is concluded and future work is discussed.
2 Related work
2.1 Label distribution learning
Label Distribution Learning (LDL), which has demonstrated a superior ability to extract complex semantics from data compared to multi-label learning, has broad applications in areas such as gene selection and scene recognition [35, 41, 44]. Label distribution learning approaches can be categorized into three types: problem transformation, algorithm adaptation, and specialized algorithms. Problem transformation is the simplest approach, where the LDL problem is reformulated into existing learning paradigms. [6] introduced two typical problem transformation algorithms, which treated each label as an independent binary classification problem. The algorithm adaptation method attempts to modify existing learning models to handle label distribution directly. For example, LogitBoost [38] proposed an additive weighted function regression to fit the LDL paradigm. [6] adapted the kNN algorithm to predict the label distribution.
Unlike the indirect strategies of problem transformation and algorithm adaptation, the specialized algorithms are tailor-made for label distribution learning problems. Most specialized LDL algorithms adopt the maximum entropy model [3] as the output model, for example SA-IIS [6] and SA-BFGS [6]. Both of these approaches use the KL (Kullback-Leibler) divergence as the objective function to quantify the loss between the predicted and actual label distributions. Meanwhile, some LDL algorithms use other similarity measures as the objective function. For example, [7] opted for the Jeffrey divergence as the optimization objective in head pose estimation. Akbari et al. [1] proposed the \(\gamma \)-Wasserstein loss, which leverages the specific geometric structure of the age label space to enhance the impact of highly correlated ages. Additionally, LDL algorithms differ in their optimization methods. LDLSP [34] and LGGME [36] both utilize ADMM (Alternating Direction Method of Multipliers) as the optimization method. LDL4C [37] employs the BFGS method for optimization.
According to relevant studies and experiments, the specialized algorithms outperform those based on the problem transformation and algorithm adaptation strategies, and have therefore received much attention from researchers. However, label distribution learning, like other conventional learning paradigms, is challenged by the curse of dimensionality. In this paper, we propose a specialized algorithm to reduce the dimensionality of label distribution datasets, which utilizes label-specific features to direct feature selection.
2.2 Label-specific features learning
Label-specific features is a processing strategy that assumes different features for each label rather than using the same original features for all labels. This strategy stems from multi-label learning and seeks to obtain features that are highly relevant to each label to improve model performance. Based on the learning process, label-specific features learning can be classified into two types. The first generates label-specific features in a transformed feature space. For example, LIFT [47] is the pioneering multi-label learning method based on label-specific features, which uses cluster analysis on positive and negative instances for each label to generate the corresponding label-specific features. However, LIFT is unable to distinguish the features that are strongly relevant to each label and does not take into consideration the correlation information among labels. Hang and Zhang [10] proposed a label-specific features learning approach based on deep neural networks, which uses label semantics to select label-specific features. Zhang et al. [48] presented an effective BiLabel-specific feature learning approach, which generates label-specific features for label pairs through heuristic prototype selection and embedding. Methods based on the first strategy improve the performance of multi-label learning, but they are often not intuitive and are hard to interpret.
The second generates label-specific features directly from the original feature space. Specifically, these approaches typically integrate the process of label-specific feature selection into the classification model, thereby facilitating a mutual enhancement between the two [17, 39, 45]. Most of these methods employ the \(l_{1}\)-norm to learn label-specific features. For example, [14] proposed a novel method for learning label-specific features, which incorporates the learned high-order label correlations with missing labels. Since learning label-specific features is independent of the classification model in such methods, the performance of the classification model may be impaired. To handle this issue, [43] introduced a comprehensive learning strategy that simultaneously generates label-specific features and induces a classification model. However, these methods ignore some common features that are discriminative across all labels. Ma et al. [25] performed sample-specific and label-specific classification, which combines inter-label and inter-instance correlations. [40] proposed a new two-stage partial multi-label feature selection method, which fuses label correlations into the feature selection process. Moreover, some approaches generate label-specific features by involving instance correlations. Li et al. [18] developed a novel method of learning common and label-specific features, which uses the correlation information from labels and instances in multi-label classification. Zhang et al. [46] proposed a novel label-specific feature selection framework, which simultaneously considers label-group and instance-group correlations. Peng et al. [27] proposed a joint label-common and label-specific features selection model, which can achieve semi-supervised cross-session Electroencephalogram emotion recognition.
The above research has shown that label-specific features are an efficient processing strategy for multi-label data. However, current label-specific features methods are usually designed for multi-label learning and do not consider feature correlation. To deal with this issue, an approach based on label-specific features is proposed that involves label correlation and feature correlation, and can select distinctive features for label distribution data.
3 Proposed approach
This section proposes a novel feature selection approach based on label distribution, which simultaneously considers label-specific features for each label and common features for the label set. First, the approach of learning label-specific features is proposed. Second, the Pearson coefficient is utilized to investigate the correlation among labels, thereby improving the performance of the model. Finally, feature relevance is also considered through mutual information theory.
3.1 Problem formulation
Let \(X=[x_{1},x_{2},\ldots ,x_{i},\ldots ,x_{n}]\in R^{n\times m}\) denote the instance matrix of the training data in the m-dimensional feature space, where \(x_{i}\) is the ith instance and n is the number of instances. Let \(Y=[y_{1},y_{2},\ldots ,y_{n}]\in R^{n\times l}\) denote the label space with l class labels. The training set is \(S=\{(x_{1},D_{1})\), \((x_{2},D_{2}),\) \(\ldots ,(x_{n},D_{n})\}\), where \(D_{i}=\{d_{i1},d_{i2},\ldots ,d_{il}\}\) is the label distribution associated with \(x_{i}\), and \(d_{ij}\) represents the significance of label \(y_{j}\) to instance \(x_{i}\). In addition, the label distribution satisfies \(d_{ij}\in [0,1]\) and \(\sum _{j=1}^{l}d_{ij}=1\).
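The two constraints on a label distribution can be checked mechanically. The following is a minimal sketch, not part of the proposed method; the helper names `is_label_distribution` and `normalize` and the tolerance value are assumptions introduced here for illustration.

```python
import math


def is_label_distribution(d, tol=1e-9):
    """Return True if every entry of d lies in [0, 1] and the entries sum to 1."""
    in_range = all(0.0 <= v <= 1.0 for v in d)
    return in_range and math.isclose(sum(d), 1.0, abs_tol=tol)


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1 (hypothetical helper)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


# Description degrees of one instance over l = 3 labels.
D_i = normalize([3.0, 1.0, 1.0])
print(D_i, is_label_distribution(D_i))
```

A row that fails this check (for example, logical 0/1 labels with several ones) would first need label enhancement or normalization before it can be treated as a label distribution.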
3.2 Feature selection model based on labelspecific features
For label distribution feature selection, leveraging common features of the label set to improve model performance is an effective and popular method. In general, some common features exhibit comparable discriminatory power across all class labels, thereby enhancing the efficacy of selecting features from the original feature set. Nevertheless, class labels may have distinct subsets of label-specific features that are particularly relevant and discriminatory for their corresponding label. Thus, the selected subset of features may be suboptimal when the label-specific features of each class label are ignored. In this paper, we adopt an optimization learning framework to learn accurate label-specific features and common features from label distribution data. The overall framework of this paper is shown in Fig. 2.
As depicted in Fig. 2, our proposed method, FSLSF, is composed of several components: data preparation, learning common and label-specific features, exploring label correlation and feature relevance, and feature selection. In the data preparation stage, we transform multi-label data into label distribution data through a label enhancement algorithm. In the second stage, we learn both common and label-specific features, assuming a linear relationship between the feature space and the label space. This relationship can be represented as \(X\varvec{W}=\hat{D}\). Here, the matrix \(\varvec{W}=[w_{1},w_{2},\ldots ,w_{l}]\in R^{m\times l}\) indicates the correlations among different labels and features, where \(w_{l}\) is a coefficient vector containing the discriminative power of all features for the lth label. The optimization function is developed to minimize the empirical loss between the modeling outputs \(X\,\varvec{W}\) and the predicted label distribution \(\hat{D}\). Specifically, we use the sparsity of the \(l_{1}\) regularizer to obtain the label-specific features matrix \(\varvec{Q}\), and use the \(l_{2,1}\) regularizer to retain common features. Furthermore, the correlation of labels in the label space is explored to constrain the label-specific features matrix \(\varvec{Q}\). The relevance between features and labels is incorporated to eliminate redundant features. During the feature selection process, the final feature significance matrix \(\varvec{W}\) is obtained by fusing \(\varvec{V}\) and \(\varvec{Q}\). With the feature importance matrix, we can rank the features and select the top-ranked ones.
The corresponding optimization function for learning can be formulated as follows.
where the regularizers \(\varPhi (\cdot )\) and \(\varPsi (\cdot )\) are employed to promote stability in learning the label-specific features matrix \(\varvec{Q}\) and the common features matrix \(\varvec{V}\), respectively, and \(\varvec{W}\) is composed of \(\varvec{Q}\) and \(\varvec{V}\). Besides, \(\alpha \) and \(\beta \) are the regularization parameters that control the sparsity of the label-specific features matrix \(\varvec{Q}\) and the common features matrix \(\varvec{V}\), respectively.
Since each label is determined by several specific features of its own, the label-specific features matrix \(\varvec{Q}\) is enforced to be sparse. Regularization using the \(l_{1}\) norm can induce sparsity among all elements in \(\varvec{Q}\), leading to certain parameters being shrunk to zero. Therefore, in order to extract the label-specific features of each label, the \(l_{1}\) norm regularizer is utilized to make full use of this sparsity. To regularize the common features matrix \(\varvec{V}\), we use the \(l_{2,1}\) norm, which promotes a sparse representation with a few nonzero rows. This ensures that discriminative features that are common across all labels are selected. To sum up, the optimization function can be reformulated as follows:
For the label-specific features matrix \(\varvec{Q}\), a nonzero value of \(q_{ij}\) signifies that the ith feature distinguishes the jth label, thereby qualifying as a label-specific feature for that label. Conversely, a zero value of \(q_{ij}\) implies that the corresponding feature is not informative for the jth label. For the common features matrix \(\varvec{V}\), if \(v_{i}\ne 0\), it indicates that the ith feature exhibits strong discriminatory power for the label set.
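To make the roles of the two regularizers concrete, the sketch below evaluates a sparse objective of the kind described above and reads the selected features off \(\varvec{Q}\) and \(\varvec{V}\). It assumes that \(\varvec{W}=\varvec{Q}+\varvec{V}\) and that the loss is half the squared Frobenius norm of \(X\varvec{W}-D\); both choices, as well as all function names and the toy matrices, are assumptions made for illustration rather than the exact formulation of the paper.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def l1_norm(Q):
    """Sum of absolute values of all entries (promotes element-wise sparsity)."""
    return sum(abs(q) for row in Q for q in row)


def l21_norm(V):
    """Sum of the l2 norms of the rows (promotes row-wise sparsity)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in V)


def objective(X, D, Q, V, alpha, beta):
    """0.5 * ||X(Q + V) - D||_F^2 + alpha * ||Q||_1 + beta * ||V||_{2,1} (assumed form)."""
    W = [[q + v for q, v in zip(q_row, v_row)] for q_row, v_row in zip(Q, V)]
    P = matmul(X, W)
    loss = 0.5 * sum((p - d) ** 2 for p_row, d_row in zip(P, D) for p, d in zip(p_row, d_row))
    return loss + alpha * l1_norm(Q) + beta * l21_norm(V)


def label_specific_features(Q, tol=1e-12):
    """For each label j, list the feature indices i whose q_ij is nonzero."""
    return [[i for i, row in enumerate(Q) if abs(row[j]) > tol] for j in range(len(Q[0]))]


def common_features(V, tol=1e-12):
    """List the feature indices whose row in V is nonzero."""
    return [i for i, row in enumerate(V) if math.sqrt(sum(v * v for v in row)) > tol]


# Toy example with m = 3 features, l = 2 labels, n = 2 instances.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
D = [[0.5, 0.5], [0.4, 0.6]]
Q = [[0.5, 0.0], [0.0, 0.0], [0.0, -0.3]]
V = [[0.0, 0.0], [0.4, 0.3], [0.0, 0.0]]
print(label_specific_features(Q), common_features(V), objective(X, D, Q, V, 1.0, 1.0))
```

In this toy case feature 0 is specific to the first label, feature 2 is specific to the second, and feature 1 is common to both, which mirrors the element-wise versus row-wise sparsity patterns that the two norms encourage.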
3.3 Label correlation
In label distribution learning, each sample is assigned to interrelated label distributions. Therefore, taking label correlation into account proves advantageous in enhancing the weights of discriminative features. In addition, the label correlation constraint has been demonstrated to provide substantial performance improvements in label distribution learning. Leveraging the labels linked with a specific label allows for the adjustment of that label’s distribution value. If a robust connection exists between two labels, their distribution values ought to be alike. Otherwise, if \(l_{p}\) and \(l_{q}\) indicate the distribution of discriminative features, these should differ. As a result, the objective function can be reformulated as follows:
where \(r_{pq}=\frac{1}{s_{pq}1}\), and \(s_{pq}\) denotes the correlation coefficient between the distributions of the pth label \(y_{p}\) and the qth label \(y_{q}\). \(\gamma \) is the balance factor. In this paper, the Pearson correlation coefficient is employed to compute the label correlation \(s_{pq}\), as outlined below:
where \(\mu _{p}\) and \(\mu _{q}\) denote the means of the pth and qth label distributions over all instances in the label distribution matrix D, respectively. Since \(\sum _{p=1}^{c}\sum _{q=1}^{c}r_{pq}q_{p}^{T}q_{q}=tr(RQ^{T}Q)\), the optimization objective function can be further rewritten in the form:
where tr(.) is the trace of a matrix, and \(\varvec{R}=[r_{pq}]_{l\times l}\) denotes the correlation matrix of the labels. By considering label correlation through Pearson’s coefficient, the optimization function can more efficiently select features with distinguishing ability.
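The Pearson coefficient \(s_{pq}\) between two label columns of D can be computed as in the sketch below. It covers only \(s_{pq}\); the subsequent mapping to \(r_{pq}\) is not reproduced here. The function name and the convention of returning 0 for a constant column are assumptions of this illustration.

```python
import math


def pearson_label_correlation(D):
    """Return S where S[p][q] is the Pearson correlation of label columns p and q of D."""
    n = len(D)
    columns = list(zip(*D))
    means = [sum(col) / n for col in columns]
    centered = [[v - mu for v in col] for col, mu in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    l = len(columns)
    S = [[0.0] * l for _ in range(l)]
    for p in range(l):
        for q in range(l):
            denom = norms[p] * norms[q]
            if denom > 0:
                S[p][q] = sum(a * b for a, b in zip(centered[p], centered[q])) / denom
            # A constant column has zero variance, so its correlation is undefined;
            # this sketch leaves the entry at 0.0 as a convention.
    return S


# Three instances, three labels; each row is a label distribution.
D = [[0.7, 0.2, 0.1],
     [0.5, 0.3, 0.2],
     [0.3, 0.4, 0.3]]
for row in pearson_label_correlation(D):
    print([round(v, 3) for v in row])
```

In this toy matrix the second and third labels rise together while the first falls, so the sketch reports a correlation close to 1 between labels 2 and 3 and close to -1 between label 1 and the others.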
3.4 Feature relevance
In label distribution learning, utilizing information theory to develop feature evaluation criteria is a common and effective approach. Generally speaking, information theory is a traditional and useful measure for determining the correlation among random variables. Formally, the mutual information shared by two random variables, X and Y, can be delineated as follows:
Consider Z as a discrete random variable. The conditional mutual information can also be articulated using information entropy as follows:
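In their standard entropy-based forms, \(I(X;Y)=H(X)+H(Y)-H(X,Y)\) and \(I(X;Y|Z)=H(X,Z)+H(Y,Z)-H(X,Y,Z)-H(Z)\). The sketch below estimates both quantities from discrete samples using empirical frequencies; the function names are illustrative, and continuous features would first need to be discretized, which is not shown here.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), estimated from samples."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]  # z is the XOR of x and y
print(mutual_information(x, y), conditional_mutual_information(x, y, z))
```

The XOR example shows why conditional terms matter for feature relevance: x and y share no information on their own, yet become fully informative about each other once z is known.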
In feature selection approaches based on information theory, the construction of feature relevance terms is typically dependent on the quantity of information that selected or candidate features contribute to the label set. Hu et al. [13] proposed a new feature relevance term, which considered both the altered ratio of the undetermined information quantity and the modified ratio of the established information quantity. The corresponding evaluation function is outlined below:
where F represents the whole set of features, S denotes the subset of selected features and L signifies the label set, with \(f_{k}\in F-S\), \(f_{j}\in S\), and \(l_{i}\in L\). The function \(J(f_{k};f_{j};l_{i})\) concurrently takes into account the altered ratio of the undetermined information quantity and the modified ratio of the established information quantity.
We inherit this evaluation method to assess feature importance, and propose the following optimization objective function:
where \(J\left( f_{k};f_{F-k};l_{j}\right) \) signifies the relevance between the feature \(f_{k}\) and the label \(l_{j}\), while \(w_{kj}\) represents the importance of the feature \(f_{k}\) for the label \(l_{j}\). To simplify, the similarity between them can be directly computed as follows:
where p is set as 2.
Then, the final objective function arrives at:
where \(\delta \) is the trade-off parameter. In summary, the proposed feature selection optimization framework uses the first term as a loss function, which calculates the difference between the predicted label distribution and the actual one. The second and third terms extract label-specific features for the corresponding class label and find common features for all labels, respectively. In addition, the coefficient matrix \(\varvec{W}\) is involved in the fourth and fifth terms, so that the optimized feature selection result is affected by label correlation and feature relevance simultaneously.
3.5 Optimization
The final objective function of our approach is convex, yet it is not a smooth optimization problem owing to the presence of the \(l_{2,1}\) and \(l_{1}\) norm regularizers. To tackle this issue, the optimization problem can be reformulated in the following manner:
In this context, \(\Vert \varvec{V}\Vert _{2,1}\) is directly relaxed by \(tr(\varvec{V}^{T}\varvec{A}\varvec{V})\), with \(\varvec{A}\) representing the \(m\times m\) diagonal matrix whose ith diagonal value is \(A_{ii}=\frac{1}{2\Vert \varvec{V}_{i}\Vert _{2}}\), where \(\varvec{V}_{i}\) is the ith row of \(\varvec{V}\). Furthermore, the optimization problem can be efficiently addressed by optimizing \(\varvec{Q}\) and \(\varvec{V}\) in an alternating manner.
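The reweighting matrix can be illustrated with a few lines of code. The sketch below builds the diagonal of \(\varvec{A}\) from the current \(\varvec{V}\) and checks numerically that, with \(\varvec{A}\) computed from that same \(\varvec{V}\), \(tr(\varvec{V}^{T}\varvec{A}\varvec{V})\) equals half of \(\Vert \varvec{V}\Vert _{2,1}\). The small constant `eps` that guards all-zero rows is an assumption added here and is not stated in the text.

```python
import math


def row_norms(V):
    """l2 norm of every row of V."""
    return [math.sqrt(sum(v * v for v in row)) for row in V]


def reweighting_diagonal(V, eps=1e-8):
    """Diagonal entries A_ii = 1 / (2 * ||V_i||_2); eps avoids division by zero."""
    return [1.0 / (2.0 * max(norm, eps)) for norm in row_norms(V)]


def trace_vt_a_v(V, a_diag):
    """tr(V^T A V) for diagonal A, which reduces to sum_i A_ii * ||V_i||_2^2."""
    return sum(a * sum(v * v for v in row) for a, row in zip(a_diag, V))


V = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
A = reweighting_diagonal(V)
print(trace_vt_a_v(V, A), 0.5 * sum(row_norms(V)))
```

Because \(\varvec{A}\) depends on \(\varvec{V}\), it is typically recomputed after each update of \(\varvec{V}\), so that the smooth surrogate keeps tracking the non-smooth norm.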
Fix \(\varvec{Q}\), Optimize \(\varvec{V}\): With \(\varvec{Q}\) held constant, the optimization problem in relation to \(\varvec{V}\) can be restructured as follows:
where \(\varvec{E}=D-X\varvec{Q}\). Here, matrices \(\varvec{A}\) and \(\varvec{R}\) are positive semidefinite, and \(\varvec{V}^{T}\varvec{A}\varvec{V}\ge 0\) and \(\varvec{R}\varvec{V}^{T}\varvec{V}\ge 0\) hold for any nonzero \(\varvec{V}\). Thus, we can get \(\frac{\partial tr(\varvec{V}^{T}\varvec{A}\varvec{V})}{\partial \varvec{V}}=(\varvec{A}^{T}+\varvec{A})\varvec{V}\) and \(\frac{\partial tr(\varvec{R}\varvec{V}^{T}\varvec{V})}{\partial \varvec{V}}=\varvec{V}(\varvec{R}^{T}+\varvec{R}).\) (13) is differentiable and has a closed-form solution. Therefore, we can transform (13) into the following one:
By setting the derivative of (14) with respect to \(\varvec{V}\) to 0, we can get the solution for \(\varvec{V}\) as follows:
Fix \(\varvec{V}\), Optimize \(\varvec{Q}\): With \(\varvec{V}\) held constant, the optimization problem in relation to \(\varvec{Q}\) can be restructured as follows:
where \(\varvec{E}=D-X\varvec{V}\).
Employing the accelerated proximal gradient strategy, we address the convex optimization problem, which is expressed as follows:
Here, H represents the real Hilbert space. According to (15) and (17), \(f(\varvec{Q})\) and \(g(\varvec{Q})\) can be derived as follows:
\(f(\varvec{Q})\) is both convex and smooth, while \(g(\varvec{Q})\), although convex, is typically not smooth.
For \(\nabla f(\varvec{Q})\), it can be obtained by taking the derivative of (18) with respect to \(\varvec{Q}\). That is
According to the existing studies [18], below we propose to minimize the separable quadratic approximation sequence of \(f(\varvec{Q})\) by the proximal gradient algorithm rather than minimizing it directly, which is expressed as
where \(L_{f}\) is termed the Lipschitz constant and \(O^{(t)}=\varvec{Q}^{(t)}-\frac{1}{L_{f}}\nabla f(\varvec{Q}^{(t)})\).
Proposition 1
If H is a Euclidean space endowed with the Frobenius norm \(\Vert \cdot \Vert _{F}\) and the \(l_{1}\) norm \(\Vert \cdot \Vert _{1}\), \(\varvec{Q}\) can be updated by the soft-shrinkage operator as follows:
According to Proposition 1, the soft-thresholding operator \(\zeta [O^{(t)}]\) and the closed-form solution for \(\varvec{Q}\) can be collectively expressed as follows:
where \(1\le i\le m\) and \(1\le j\le l\). This operator can be extended to vectors and matrices by applying it element-wise.
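A single proximal gradient update can be sketched as follows: take a gradient step of size \(1/L_{f}\) to obtain \(O^{(t)}\), then apply soft-thresholding to every entry. The threshold \(\alpha /L_{f}\) used here is the standard choice for an \(l_{1}\) penalty weighted by \(\alpha \) and is an assumption of this sketch, as are the function names; the gradient is passed in as data rather than derived from the objective.

```python
def soft_threshold(x, tau):
    """Scalar soft-thresholding: sign(x) * max(|x| - tau, 0)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def proximal_step(Q, grad, lipschitz, alpha):
    """One update: O = Q - grad / L_f, then element-wise shrinkage by alpha / L_f."""
    tau = alpha / lipschitz
    return [
        [soft_threshold(q - g / lipschitz, tau) for q, g in zip(q_row, g_row)]
        for q_row, g_row in zip(Q, grad)
    ]


Q_t = [[1.0, -1.0]]
grad_t = [[2.0, 0.0]]  # hypothetical value of the gradient of f at Q_t
print(proximal_step(Q_t, grad_t, lipschitz=2.0, alpha=1.0))
```

Entries whose magnitude after the gradient step falls below the threshold are set exactly to zero, which is how the update produces the sparse pattern that identifies label-specific features.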
Given \(\varvec{Q}_{1}\) and \(\varvec{Q}_{2}\), the Lipschitz constant \(L_{f}\) satisfies the inequality. Thus, the Lipschitz constant \(L_{f}\) can be obtained by the following equation:
The Lipschitz constant \(L_{f}\) can be set as
On the basis of the above optimization process, a label distribution feature selection algorithm based on label-specific features is proposed. The flow chart of this algorithm is depicted in Fig. 3.
As shown in Fig. 3, the feature matrix is initialized and the correlation coefficient of the optimization function is calculated. Then, the label-specific feature matrix \(Q^{(t)}\) and the common feature matrix \(V^{(t)}\) are progressively updated until the termination condition is met. Finally, the top-ranked features are obtained through the feature importance matrix.
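The final ranking step can be sketched as follows. The sketch assumes that the two matrices are fused by addition, \(\varvec{W}=\varvec{Q}+\varvec{V}\), and that each feature is scored by the \(l_{2}\) norm of its row in \(\varvec{W}\); both the fusion rule and the scoring rule are assumptions made for illustration, since the text only states that \(\varvec{W}\) is obtained by fusing \(\varvec{V}\) and \(\varvec{Q}\).

```python
import math


def rank_features(Q, V, k):
    """Score feature i by the l2 norm of row i of W = Q + V and return the top-k indices."""
    scores = [
        math.sqrt(sum((q + v) ** 2 for q, v in zip(q_row, v_row)))
        for q_row, v_row in zip(Q, V)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


# m = 4 features, l = 2 labels.
Q = [[0.5, 0.0], [0.0, 0.0], [0.0, -0.3], [0.0, 0.0]]
V = [[0.0, 0.0], [0.6, 0.8], [0.0, 0.0], [0.0, 0.0]]
selected, scores = rank_features(Q, V, k=2)
print(selected, [round(s, 3) for s in scores])
```

A feature whose rows in both matrices are zero receives a score of zero and is never selected ahead of a feature with any nonzero weight.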
Analysis of algorithm FSLSF. The time complexity of the proposed method FSLSF is composed of three components: initialization, computation of the correlation matrix and Lipschitz constant, and feature selection. First, the time complexity of initializing the variables \(\varvec{V}\) and \(\varvec{Q}\) is \(O(m^{2}n+m^{3}+mnl)\), where n, m, l denote the number of instances, features, and labels, respectively. Then, the feature relevance matrix \(\varvec{C}\) is computed with a time complexity of O(mnl) and the label correlation matrix \(\varvec{R}\) is computed with a time complexity of \(O(ml^{2})\). Computing the Lipschitz constant \(L_{f}\) takes \(O(m^{3}+l^{3})\). Finally, the feature selection process has a time complexity of \(O(t(m^{2}l+ml^{2}+m^{2}n+m^{3}+mnl))\), where t indicates the number of iterations. Therefore, the total time complexity of the proposed algorithm is \(O(l^{3}+t(m^{2}l+ml^{2}+m^{2}n+m^{3}+mnl))\).
4 Experiments
In this section, we delve into the performance and feasibility of our algorithm via comparative experiments. Initially, the experimental datasets are outlined in Section 4.1. Subsequently, Section 4.2 introduces the widely used evaluation metrics. Following this, Section 4.3 establishes the settings for the parameters and the count of selected features for the experimental algorithms. Next, the effectiveness of FSLSF is confirmed through an analysis of experimental results in Section 4.4. Finally, we conduct an analysis of parameter sensitivity in Section 4.5.
4.1 Datasets
To assess the reliability of the proposed FSLSF approach, we have conducted extensive comparative experiments for feature selection tasks across sixteen benchmark datasets. The specifics of these datasets are summarized in Table 1, where ‘#instance’, ‘#feature’, and ‘#label’ denote the number of instances, features, and labels, respectively. All sixteen are label distribution datasets; the first fifteen can be downloaded from the associated website^{Footnote 1}. In addition, RAF-ML is a facial expression dataset that can be found in [19].
4.2 Evaluation metrics
In this study, six evaluation metrics are utilized to measure the performance of FSLSF. Consider the testing dataset \(\bar{S}=\{(\bar{x_{i}},\bar{D_{i}})\mid 1\le i\le n\}\), where \(\bar{D_{i}}=\{\bar{d_{1}},\ldots ,\bar{d_{j}},\ldots ,\bar{d_{l}}\}\) represents the actual label distribution of the instance \(\bar{x_{i}}\). The predicted label distribution for \(\bar{x_{i}}\), denoted as \(\hat{D_{i}}=\{\hat{d_{1}},\ldots ,\hat{d_{j}},\ldots ,\hat{d_{l}}\}\), is produced by the label distribution learning algorithm SA-BFGS. The six evaluation metrics are categorized into two types: distance-based metrics (Chebyshev, Clark, Canberra, and Kullback–Leibler divergence) and similarity-based metrics (Cosine and Intersection). For the distance-based metrics, a lower value signifies better performance, indicated by the downward arrow (\(\downarrow \)). Conversely, for the similarity-based metrics, a higher value represents better performance, indicated by the upward arrow (\(\uparrow \)). Detailed explanations of these evaluation metrics are provided in Table 2.
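For a single test instance, the six metrics can be computed as in the sketch below, which follows the definitions commonly used in the label distribution learning literature; Table 2 remains the authoritative reference for the exact forms used in the experiments. The sketch assumes strictly positive predicted values so that the Clark, Canberra, and Kullback–Leibler terms are well defined, and the function names are illustrative.

```python
import math


def chebyshev(d, p):
    """Largest absolute difference between actual d and predicted p (lower is better)."""
    return max(abs(a - b) for a, b in zip(d, p))


def clark(d, p):
    """Square root of the summed squared relative differences (lower is better)."""
    return math.sqrt(sum((a - b) ** 2 / (a + b) ** 2 for a, b in zip(d, p)))


def canberra(d, p):
    """Sum of absolute relative differences (lower is better)."""
    return sum(abs(a - b) / (a + b) for a, b in zip(d, p))


def kl_divergence(d, p):
    """Sum of d_j * ln(d_j / p_j); terms with d_j = 0 contribute 0 (lower is better)."""
    return sum(a * math.log(a / b) for a, b in zip(d, p) if a > 0)


def cosine(d, p):
    """Cosine of the angle between the two distributions (higher is better)."""
    dot = sum(a * b for a, b in zip(d, p))
    return dot / (math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in p)))


def intersection(d, p):
    """Sum of element-wise minima (higher is better)."""
    return sum(min(a, b) for a, b in zip(d, p))


actual = [0.5, 0.5]
predicted = [0.25, 0.75]
for metric in (chebyshev, clark, canberra, kl_divergence, cosine, intersection):
    print(metric.__name__, round(metric(actual, predicted), 4))
```

Reported results would normally average each metric over all test instances; the averaging loop is omitted here for brevity.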
4.3 Experimental settings
In this subsection, we describe the experimental settings. Our proposed method, FSLSF, is evaluated against five other state-of-the-art feature selection methods: RLFC, G3WI, LRLSF, FSFL, and CLEMLFS. For FSLSF, the regularization parameters \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \) are each tuned within \(\{10^{-3},10^{-2},\ldots ,10^{2}\}\). The compared methods and their parameters are configured as follows:

RLFC [31] is a relevance-based feature selection approach for label distribution data, which takes both feature relevance and label correlation into account. The parameters \(\alpha \), \(\beta \), and \(\gamma \) are searched in \(\{10^{-3},10^{-2},\ldots ,10^{2}\}\).

G3WI [50] is a gradient-based multi-label feature selection approach, which integrates three-way variable interactions into a comprehensive global optimization objective. The parameters \(\alpha \) and \(\beta \) are tuned within \(\{0,10^{-3},10^{-2},\ldots ,10^{1}\}\).

LRLSF [42] is a multi-label feature selection method that relies on stable label relevance and label-specific features. It integrates both global and local label relevance to learn the features specific to each label. The trade-off parameters \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), and \(\lambda _{4}\) are searched in \(\{10^{-3},10^{-2},\ldots ,10^{3}\}\).

FSFL [30] is a feature selection approach for label distribution data, which considers feature weight fusion and local label correlations. The number of clusters k is set to 5, while \(\alpha \) and \(\beta \) are tuned within \(\{10^{-3},10^{-2},\ldots ,10^{2}\}\).

CLEMLFS [12] is a feature selection algorithm for multi-label data, which uses a label enhancement algorithm to convert traditional logical labels into label distributions. The balance factors \(\lambda _{1}\) and \(\lambda _{2}\) are tuned within \(\{10^{-2},10^{-1},\ldots ,10^{2}\}\).
Since the multi-label feature selection methods cannot directly handle real-world label distribution datasets, we employ an equal-width strategy to discretize the label distribution data before applying them. Furthermore, all of the algorithms mentioned above rank features and allow the size of the selected feature subset to be controlled. To ensure a fair comparison, each experimental algorithm must select the same number of features on a given dataset. Consequently, feature spaces of different dimensionality are reduced to the sizes recommended in [15], as outlined below:

When \(m\le 100\), \(u=40\%\), that is, select the top 40% \(\times \) m features.

When \(100<m\le 500\), \(u=30\%\), that is, select the top 30% \(\times \) m features.

When \(500<m\le 1000\), \(u=20\%\), that is, select the top 20% \(\times \) m features.

When \(m>1000\), \(u=10\%\), that is, select the top 10% \(\times \) m features.
Here, m represents the number of original features. We adopt a 10-fold cross-validation strategy for conducting our experiments. The SABFGS algorithm [6] is then used to train the learner on the sixteen datasets after feature selection, and its output on the test data provides the predictive performance.
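The two preprocessing rules of this subsection can be sketched as follows. This is only an illustration under stated assumptions: the number of bins `n_bins` is hypothetical (the text does not report it), description degrees are assumed to lie in [0, 1], a boundary value of m (100, 500, or 1000) is assumed to fall in the lower bracket, and the product u × m is rounded down. The function names are our own.

```python
def equal_width_discretize(values, n_bins, lo=0.0, hi=1.0):
    """Map each description degree to the index of its equal-width bin.

    n_bins is a hypothetical setting; [lo, hi] is split into n_bins
    intervals of equal width, and the top edge is placed in the last bin.
    """
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def selection_ratio_percent(m):
    """Percentage u of features to keep for m original features."""
    if m <= 100:
        return 40
    if m <= 500:
        return 30
    if m <= 1000:
        return 20
    return 10

def num_selected_features(m):
    """Size of the selected subset, rounded down (an assumption)."""
    return m * selection_ratio_percent(m) // 100
```

For example, under these assumptions a dataset with 243 original features would keep its 72 top-ranked features for every compared algorithm.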
4.4 Experimental results
The purpose of this subsection is to demonstrate the effectiveness of FSLSF on the sixteen label distribution datasets in terms of Chebyshev, Clark, Canberra, KL, Cosine, and Intersection. All algorithms employ the same cross-validation method for performance evaluation in our experiments. The experimental results of the six algorithms on the test set for the selected feature subsets, under the six evaluation criteria, are presented in Tables 3, 4, 5, 6, 7 and 8. Each table provides a performance overview of the experimental algorithms under the corresponding metric. The second-to-last row in each table, labeled “Avg. Ranking”, records the average rankings of the experimental algorithms. Furthermore, the Win/Tie/Loss counts comparing FSLSF to each of the other algorithms on the sixteen datasets are displayed in the final row. The best performance among the six feature selection algorithms is highlighted in bold.
From Tables 3 to 8, it is clear that our proposed method, FSLSF, outperforms the compared algorithms in most cases. Its average rankings under Chebyshev distance, Clark distance, Canberra distance, Kullback–Leibler divergence, cosine similarity, and intersection similarity are 1.25, 1.31, 1.50, 1.38, 1.56, and 1.50, respectively. Moreover, detailed comparisons of the algorithms reveal the following:

1. With respect to Chebyshev in Table 3, FSLSF exhibits superior performance on 13 datasets. Taking the dataset Natural Scene as an example, FSLSF achieves the best performance of 0.2977, and its Chebyshev distance is 11.31% to 31.20% lower than those of the RLFC, G3WI, LRLSF, FSFL, and CLEMLFS algorithms. Furthermore, the “Win/Tie/Loss” statistics indicate that FSLSF performs better than the RLFC, G3WI, and LRLSF algorithms on all datasets.

2. As shown in Table 4 for the Clark metric, FSLSF delivers the best performance on all datasets except SBU_3DFE, Yeastheat, and Yeastcold. Notably, the performance gain on the SJAFFE and Movie datasets is quite significant. For example, on dataset SJAFFE, the proposed method FSLSF is 5.46%, 18.31%, 21.74%, 36.90% and 12.34% better than RLFC, G3WI, LRLSF, FSFL, and CLEMLFS, respectively. In addition, FSLSF attains near-optimal performance on the remaining 3 datasets, falling marginally short of FSFL on SBU_3DFE, Yeastheat, and Yeastcold.

3. According to the Canberra metric presented in Table 5, FSLSF achieves the highest or near-optimal performance on 13 datasets. In particular, FSLSF outperforms the second-ranked algorithms by 14.16% and 28.91% on Yeastalpha and Yeastcdc, respectively. FSLSF achieves a moderate level of performance on 3 datasets: it is slightly inferior to CLEMLFS and FSFL on datasets Yeastheat and Yeastcold, and slightly inferior to CLEMLFS and G3WI on dataset RAFML. Despite not always achieving the best performance, FSLSF outperforms RLFC and LRLSF on all datasets and delivers the best average performance. This underscores the effectiveness and versatility of the proposed model.

4. For the Kullback–Leibler metric, shown in Table 6, FSLSF obtains the best performance on 12 datasets, the exceptions being SBU_3DFE, Yeastheat, Yeastcold and RAFML. In detail, on the dataset Yeastdiau, the proposed method obtains a Kullback–Leibler value of 0.0086, which is 23.72% to 46.13% better than RLFC, G3WI, LRLSF, FSFL, and CLEMLFS. In addition, on the dataset SBU_3DFE, FSFL ranks second and is 2.42% inferior to RLFC.

5. With respect to the Cosine metric in Table 7, FSLSF exhibits the highest performance on 12 datasets. In detail, on the dataset Movie, the proposed method outperforms RLFC, G3WI, LRLSF, FSFL and CLEMLFS by up to 13.63%, 4.95%, 22.52%, 9.88% and 15.36%, respectively. In addition, FSLSF performs slightly worse than RLFC and CLEMLFS on dataset SBU_3DFE and is slightly inferior to FSFL and CLEMLFS on dataset Yeastcold.

6. In relation to the Intersection metric, as shown in Table 8, FSLSF achieves superior performance on 12 datasets. The “Win/Tie/Loss” statistics indicate that FSLSF consistently outperforms RLFC. Furthermore, when FSLSF ranks first, it surpasses the second-ranked algorithms by 1.50%, 1.36%, 0.69%, 5.39% and 4.94% on datasets SJAFFE, Human Gene, SBU_3DFE, Natural Scene and Movie, respectively. When FSLSF is behind, it is 0.27%, 0.26%, 1.26% and 0.05% inferior to the first-ranked algorithm on datasets Yeastelu, Yeastheat, Yeastcold and RAFML, respectively. These statistics indicate that FSLSF’s performance gains outweigh its losses, thereby confirming FSLSF’s effectiveness.
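The summary statistics used in the comparisons above ("Avg. Ranking", Win/Tie/Loss, and relative differences in percent) can be computed along the following lines. This is a hedged sketch rather than the exact procedure behind Tables 3 to 8: it assumes that tied scores share the mean of their rank positions and that a relative difference is measured against the competitor's score. All function names are our own.

```python
def rank_row(scores, lower_is_better=True):
    """Rank the algorithms on one dataset (1 = best); ties share a mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def average_ranking(table, lower_is_better=True):
    """table[i][j] is the score of algorithm j on dataset i."""
    per_row = [rank_row(row, lower_is_better) for row in table]
    return [sum(col) / len(table) for col in zip(*per_row)]

def win_tie_loss(ours, other, lower_is_better=True):
    """Count the datasets on which `ours` beats, ties, or loses to `other`."""
    win = tie = loss = 0
    for a, b in zip(ours, other):
        if a == b:
            tie += 1
        elif (a < b) == lower_is_better:
            win += 1
        else:
            loss += 1
    return win, tie, loss

def relative_difference_percent(ours, other):
    """Absolute gap relative to the competitor's score (an assumed convention)."""
    return abs(other - ours) / other * 100
```

For distance-based metrics, keep `lower_is_better=True`; for Cosine and Intersection, pass `lower_is_better=False`.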
To provide a comprehensive comparison between our proposed algorithm FSLSF and the other algorithms under consideration, we conducted a series of experiments to illustrate the performance trends for varying numbers of selected features. The changes in the six evaluation metrics under the different methods are depicted in Fig. 4. Each panel shows how performance under a specific evaluation metric changes as the number of features varies. The x-coordinate represents the number of selected features, increased in steps of three, and the y-coordinate represents the performance after feature selection.
As seen in Fig. 4, our proposed algorithm FSLSF surpasses the other five state-of-the-art algorithms in most cases, and the advantage is more pronounced when the number of features is greater than 70. Taking the Cosine similarity as an example, FSLSF outperforms RLFC, G3WI, LRLSF, FSFL, and CLEMLFS by up to 40.25%, 16.44%, 2.51%, 17.10%, and 23.03%, respectively. Moreover, it is important to note that there is no universally optimal number of selected features. For instance, the best Chebyshev value is reached at 50 selected features, whereas the best Intersection value is reached at 88 selected features.
In summary, our proposed algorithm outperforms both the multi-label feature selection algorithms and the label distribution feature selection algorithms. Unlike multi-label feature selection, which is trained on discretized label distribution data, our algorithm fully exploits the label distribution information to guide the selection of a more discriminative feature subset. Its advantage over the label distribution feature selection algorithms can be attributed to its ability to learn common and label-specific features simultaneously.
4.5 Parameter analysis
In this subsection, we examine the impact of parameters on the performance of the FSLSF algorithm, conducting a parameter analysis based on the Chebyshev, Clark, Canberra, KL, Cosine, and Intersection metrics. The proposed FSLSF method has four parameters: \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \). Here, \(\alpha \) is the regularization parameter for the label-specific features, and \(\beta \) is the regularization parameter for the common features. In addition, \(\gamma \) modulates the contribution of label correlation, and \(\delta \) governs the influence of feature relevance. In the experiments, we fix two parameters while varying the other two.
First, we set \(\gamma \) = 0.001 and \(\delta \) = 0.001, and find the optimal \(\alpha \) and \(\beta \) through parameter comparison experiments. For dataset Yeastdtt, both \(\alpha \) and \(\beta \) vary from 0.001 to 100. Figure 5 shows how the evaluation metrics change as \(\alpha \) and \(\beta \) grow, and reveals the distinct roles that these two parameters play across the evaluation metrics. For Chebyshev, Clark, Canberra, and Kullback–Leibler, FSLSF’s performance fluctuates, rising and falling as \(\alpha \) and \(\beta \) increase. Conversely, under the Cosine and Intersection metrics, FSLSF’s performance remains notably stable, with no significant changes observed. In addition, as \(\alpha \) increases, the influence of \(\beta \) on the algorithm decreases, and FSLSF becomes stable once \(\alpha \) reaches 10. Similarly, as \(\beta \) increases, there is no obvious variation in algorithm performance. When \(\beta \) is 0.001, FSLSF achieves better performance in most cases. Then, we fix \(\alpha \) = 0.001 and \(\beta \) = 0.001 and investigate the impact of the parameters \(\gamma \) and \(\delta \). Figure 6 illustrates that, with the exception of the Kullback–Leibler metric, the performance of FSLSF remains consistent across nearly all other metrics. In Fig. 6(d), the Kullback–Leibler metric attains its optimal value when \(\gamma =1\) and \(\delta =0.001\). In general, these results demonstrate that FSLSF’s performance remains largely stable despite changes in its parameters.
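The fix-two, vary-two protocol described above can be sketched as a small grid sweep. The sketch assumes the search grid of 0.001 to 100 in powers of ten stated in the text; `evaluate` is a hypothetical callable that runs feature selection with the given parameters and returns one metric value, and is not part of any published implementation.

```python
from itertools import product

# Search grid 0.001, 0.01, ..., 100 (powers of ten).
GRID = [10.0 ** k for k in range(-3, 3)]

def sweep_pair(evaluate, fixed, vary):
    """Evaluate every combination of the two varied parameters.

    evaluate: hypothetical callable taking alpha, beta, gamma, delta as
              keyword arguments and returning a metric value.
    fixed:    dict holding the two parameters kept constant.
    vary:     names of the two parameters swept over GRID.
    """
    results = {}
    for first, second in product(GRID, repeat=2):
        params = dict(fixed)
        params[vary[0]] = first
        params[vary[1]] = second
        results[(first, second)] = evaluate(**params)
    return results

def best_setting(results, lower_is_better=True):
    """Return the (first, second) pair with the best metric value."""
    pick = min if lower_is_better else max
    return pick(results, key=results.get)
```

The first experiment corresponds to `sweep_pair(evaluate, {"gamma": 0.001, "delta": 0.001}, ("alpha", "beta"))`, and the second to fixing `alpha` and `beta` at 0.001 while sweeping `gamma` and `delta`.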
4.6 The comparison of computational time
This subsection assesses the time efficiency of the FSLSF algorithm. Table 9 presents the running times of our method alongside the other feature selection algorithms. The row labeled ‘Average’ in this table gives the mean computation time across all datasets. All times are reported in seconds.
As shown in Table 9, FSLSF exhibits a moderate average computational time, somewhat longer than that of the G3WI and LRLSF algorithms. Since our proposed algorithm takes into account not only the label correlation but also the relevance between each feature and label based on information theory, it requires more run time. In addition, the G3WI algorithm benefits from precomputing the feature correlation matrix and the feature interaction matrix, which significantly reduces its computational time. Although FSLSF is not optimal in terms of computational efficiency, it performs better than the G3WI and LRLSF algorithms on the six performance metrics. Overall, compared with the other five algorithms, the proposed feature selection algorithm FSLSF remains highly competitive.
5 Conclusion and future work
Most existing label distribution feature selection methods select a subset of common features that are shared by all class labels. However, in label distribution learning, an instance is linked with multiple class labels at once, and each class label may be influenced by its own specific features. Traditional feature selection approaches therefore ignore label-specific features that are relevant and discriminative for each class label. To tackle this issue, we develop a feature selection approach that uses label-specific features to select highly informative features for label distribution data. Specifically, the label-specific features are extracted through \(l_{1}\)-regularization, while common features are learned through \(l_{2,1}\)-regularization. Then, the correlation between labels and the feature relevance are exploited to guide the feature selection process toward a better feature subset. Comprehensive experiments demonstrate that our method achieves highly competitive performance against other well-established feature selection algorithms. However, the proposed method does not directly address the complexities of incomplete label distribution data and is sensitive to noisy data. In future work, we will focus on the challenges of partial label distribution learning and propose corresponding robust feature selection algorithms.
Data Availability
Data sharing not applicable.
Code Availability
Code availability not applicable.
References
Akbari A, Awais M, Fatemifar S, Khalid SS, Kittler J (2021) A novel ground metric for optimal transportbased chronological age estimation. IEEE Trans Cybern 52(10):9986–9999
AlFahdawi S, AlWaisy AS, Zeebaree DQ, Qahwaji R, Natiq H, Mohammed MA, Nedoma J, Martinek R, Deveci M (2024) Fundusdeepnet: multilabel deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images. Inf Fusion 102:102059
Berger A, Della Pietra SA, Della Pietra VJ (1996) A maximum entropy approach to natural language processing. Comput linguist 22(1):39–71
Fan Y, Chen B, Huang W, Liu J, Weng W, Lan W (2022) Multilabel feature selection based on label correlations and feature redundancy. KnowlBased Syst 241:108256
Fan Y, Liu J, Tang J, Liu P, Lin Y, Du Y (2024) Learning correlation information for multilabel feature selection. Pattern Recognit 145:109899
Geng X (2016) Label distribution learning. IEEE Trans Knowl Data Eng 28(7):1734–1748
Geng X, Xia Y (2014) Head pose estimation based on multivariate label distribution. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1837–1842
Gupta A, Narayan S, Khan S, Khan FS, Shao L, van de Weijer J (2023) Generative multilabel zeroshot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence
Han Q, Hu L, Gao W (2024) Feature relevance and redundancy coefficients for multiview multilabel feature selection. Inf Sci 652:119747
Hang JY, Zhang ML (2021) Collaborative learning of label semantics and deep labelspecific features for multilabel classification. IEEE Trans Pattern Anal Mach Intell 44(12):9860–9871
Hao P, Hu L, Gao W (2023) Partial multilabel feature selection via subspace optimization. Inf Sci 648:119556
He Z, Lin Y, Wang C, Guo L, Ding W (2023) Multilabel feature selection based on correlation label enhancement. Inf Sci 647:119526
Hu L, Gao L, Li Y, Zhang P, Gao W (2022) Featurespecific mutual information variation for multilabel feature selection. Inf Sci 593:449–471
Huang J, Qin F, Zheng X, Cheng Z, Yuan Z, Zhang W, Huang Q (2019) Improving multilabel classification with missing labels by learning labelspecific features. Inf Sci 492:124–146
Kashef S, Nezamabadipour H, Nikpour B (2018) Multilabel feature selection: a comprehensive review and guiding experiments. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(2):e1240
Li GL, Zhang HR, Min F, Lu YN (2023) Twostage label distribution learning with labelindependent prediction based on labelspecific features. KnowlBased Syst 267:110426
Li J, Zhang C, Zhou JT, Fu H, Xia S, Hu Q (2021) Deeplift: deep labelspecific feature learning for image annotation. IEEE Trans Cybern 52(8):7732–7741
Li J, Li P, Hu X, Yu K (2022a) Learning common and labelspecific features for multilabel classification with correlation information. Pattern Recognit 121:108259
Li S, Deng W (2019) Blended emotion inthewild: multilabel facial expression recognition using crowdsourced annotations and deep locality feature learning. Int J Comput Vis 127(6–7):884–906
Li W, Chen J, Lu Y, Huang Z (2022b) Filling missing labels in label distribution learning by exploiting labelspecific feature selection. In: 2022 International joint conference on neural networks (IJCNN), IEEE, pp 1–8
Lin Y, Liu H, Zhao H, Hu Q, Zhu X, Wu X (2022) Hierarchical feature selection based on label distribution learning. IEEE Transactions on Knowledge and Data Engineering
Liu H, Lin Y, Wang C, Guo L, Chen J (2023a) Semanticgaporiented feature selection in hierarchical classification learning. Inf Sci 642:119241
Liu K, Li T, Yang X, Chen H, Wang J, Deng Z (2023b) Semifree: semisupervised feature selection with fuzzy relevance and redundancy. IEEE Transactions on Fuzzy Systems
Lu Y, Li W, Li H, Jia X (2023) Predicting label distribution from tieallowed multilabel ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence
Ma J, Chow TW, Zhang H (2020) Semanticgaporiented feature selection and classifier construction in multilabel learning. IEEE Trans Cybern 52(1):101–115
Paul D, Bardhan S, Saha S, Mathew J (2023) Mlknockoffgan: deep online feature selection for multilabel learning. KnowlBased Syst 271:110548
Peng Y, Liu H, Li J, Huang J, Lu BL, Kong W (2022) Crosssession emotion recognition by joint labelcommon and labelspecific eeg features exploration. IEEE Trans Neural Syst Rehabil Eng 31:759–768
Qian W, Xiong C, Qian Y, Wang Y (2022a) Label enhancementbased feature selection via fuzzy neighborhood discrimination index. KnowlBased Syst 250:109119
Qian W, Xiong Y, Yang J, Shu W (2022b) Feature selection for label distribution learning via feature similarity and label correlation. Inf Sci 582:38–59
Qian W, Ye Q, Li Y, Dai S (2022c) Label distribution feature selection with feature weights fusion and local label correlations. KnowlBased Syst 256:109778
Qian W, Ye Q, Li Y, Huang J, Dai S (2022d) Relevancebased label distribution feature selection via convex optimization. Inf Sci 607:322–345
Qian W, Xu F, Huang J, Qian J (2023) A novel granular ball computingbased fuzzy rough set for feature selection in label distribution learning. KnowlBased Syst 278:110898
Qian W, Xiong Y, Ding W, Huang J, Vong CM (2024) Label correlationsbased multilabel feature selection with label enhancement. Eng Appl Artif Intell 127:107310
Ren T, Jia X, Li W, Chen L, Li Z (2019) Label distribution learning with labelspecific features. In: IJCAI, pp 3318–3324
SharifiNoghabi H, Harjandi PA, Zolotareva O, Collins CC, Ester M (2021) Outofdistribution generalization from labelled and unlabelled gene expression data for drug response prediction. Nat Mach Intell 3(11):962–972
Su Y, Zhao W, Jing P, Nie L (2022) Exploiting lowrank latent gaussian graphical model estimation for visual sentiment distributions. IEEE Trans Multimed 25:1243–1255
Wang J, Geng X (2019) Classification with label distribution learning. In: Proceedings of the twentyeighth international joint conference on artificial intelligence, pp 3712–3718
Xing C, Geng X, Xue H (2016) Logistic boosting regression for label distribution learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4489–4497
Xu P, Xiao L, Liu B, Lu S, Jing L, Yu J (2023a) Labelspecific feature augmentation for longtailed multilabel text classification. In: Proceedings of the AAAI conference on artificial intelligence, vol 37, pp 10602–10610
Xu T, Xu Y, Yang S, Li B, Zhang W (2023b) Learning accurate labelspecific features from partially multilabeled data. IEEE Transactions on Neural Networks and Learning Systems
Yang L, Li M, Shen C, Hu Q, Wen J, Xu S (2020) Discriminative transfer learning for driving pattern recognition in unlabeled scenes. IEEE Trans Cybern 52(3):1429–1442
Yang Y, Chen H, Mi Y, Luo C, Horng SJ, Li T (2023) Multilabel feature selection based on stable label relevance and labelspecific features. Inf Sci 648:119525
Yu ZB, Zhang ML (2021) Multilabel classification with labelspecific feature generation: a wrapped approach. IEEE Trans Pattern Anal Mach Intell 44(9):5199–5210
Zeng Q, Geng J, Jiang W, Huang K, Wang Z (2021) Idln: iterative distribution learning network for fewshot remote sensing image scene classification. IEEE Geosci Remote Sens Lett 19:1–5
Zhang J, Liu K, Yang X, Ju H, Xu S (2023a) Multilabel learning with reliefbased labelspecific feature selection. Appl Intell 53(15):18517–18530
Zhang J, Wu H, Jiang M, Liu J, Li S, Tang Y, Long J (2023b) Grouppreserving labelspecific feature selection for multilabel learning. Expert Syst Appl 213:118861
Zhang ML, Wu L (2014) Lift: multilabel learning with labelspecific features. IEEE Trans Pattern Anal Mach Intell 37(1):107–120
Zhang ML, Fang JP, Wang YB (2021) Bilabelspecific features for multilabel classification. ACM Transactions on Knowledge Discovery from Data 16(1):1–23
Zhang Q, Tsang EC, He Q, Guo Y (2023c) Ensemble of kernel extreme learning machine based elimination optimization for multilabel classification. KnowlBased Syst 278:110817
Zou Y, Hu X, Li P (2024) Gradientbased multilabel feature selection considering threeway variable interaction. Pattern Recognit 145:109900
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62266018 and 62366019) and the Natural Science Foundation of Jiangxi Province (20202BABL202037 and 20232BAB202052).
Author information
Contributions
Wenhao Shu: Conceptualization, Methodology, Visualization, Writing - original draft. Qiang Xia: Data curation, Software, Validation, Formal analysis, Writing - original draft. Wenbin Qian: Investigation, Supervision, Writing - review and editing.
Ethics declarations
Conflicts of Interest
All the authors do not have any possible conflicts of interest.
About this article
Cite this article
Shu, W., Xia, Q. & Qian, W. Label distribution feature selection based on labelspecific features. Appl Intell 54, 9195–9212 (2024). https://doi.org/10.1007/s10489024056688