1 Introduction

In recent years, single-label learning and multi-label learning have been two representative paradigms that are widely used to solve label ambiguity problems. Specifically, single-label learning assumes that each instance is associated with a single label and that labels are mutually exclusive. However, in real-world applications, an instance may simultaneously embody multiple semantic meanings [26, 49]. For instance, in facial emotional expressions, an expression may include a variety of complex emotions such as happiness, anger, and sadness. Similarly, in text annotation, a news item may be relevant to several topics such as politics, economics, and entertainment. In multi-label learning [2, 8], each instance is related to multiple labels, and each label is regarded as equally important. Both single-label learning and multi-label learning essentially treat the relationship between an instance and a label as binary: the label is either pertinent to the instance or not. In some situations, however, a task must consider not only whether instances are associated with labels, but also the different degrees of importance of each label [24]. Consequently, [6] proposed a more generalized learning paradigm called label distribution learning, which addresses label ambiguity explicitly by depicting how relevant each label is to the instance.

Owing to its advantage of learning richer semantics from data compared with multi-label learning, label distribution learning has been widely used in various real-world applications, which can be mainly divided into the following areas: head posture recognition, scene classification, emotion analysis, and indoor crowd counting. Taking scene classification as an example, a scene topic is generally composed of various basic words, such as ‘road’, ‘water’, and ‘veg’, each of which contributes in a unique way to the formation of the scene class. However, as data from videos, images, and text continues to grow, the label distribution data in the above applications often suffers from high dimensionality. High-dimensional label distribution data not only increases computational and storage costs, but also reduces the quality of learning models. Feature selection, an effective data preprocessing technique, reduces the data dimension and improves algorithm performance. Until now, a variety of feature selection approaches have been studied to select relevant features from label distribution data [21, 23, 28, 32]. Qian et al. [29] proposed a specialized feature selection algorithm that takes into account both feature similarity and label correlation, and can deal with high-dimensional label distribution data efficiently. [22] proposed a hierarchical feature selection method to deal with the semantic gap, which uses label enhancement to obtain a label distribution. [33] developed a multi-label feature selection algorithm for label distribution data, which utilizes label enhancement based on the deep forest to transform logical labels. These methods aim to learn a single subset of features for discriminating all labels. However, this strategy may not be optimal, as each label may require its own specific features for decision-making. Therefore, it is essential to consider label-specific features explicitly, thereby enhancing the performance of label distribution learning in the feature selection process.

Recently, the strategy of label-specific features has been integrated with several tasks in label distribution learning, such as exploiting label correlation [34] and completing missing labels [20]. Ren et al. [34] developed a novel label distribution learning algorithm by leveraging label-specific features, which simultaneously selects label-specific features and common features. [20] proposed a label distribution learning approach for incomplete data by exploiting label-specific feature selection, which can fill in missing label information. In general, these methods select label-specific features for each label through a coefficient matrix W. For example, as shown in Fig. 1, consider 3 instances with a 6-dimensional feature space {\(f_{1},f_{2},\ldots ,f_{6}\)} and their corresponding labels \(y_{1}\) and \(y_{2}\). Upon training, we obtain a coefficient matrix W comprising two vectors {\(w_{1},w_{2}\)} corresponding to the two labels. The non-zero elements of {\(w_{1},w_{2}\)} allow us to identify the specific features associated with labels \(y_{1}\) and \(y_{2}\). It is worth noting that two correlated labels may have different label-specific features; therefore, these methods employ label correlation to ensure that two strongly correlated labels yield similar outputs. Previous studies on label-specific features have considered exploiting label correlation or completing missing labels, but they have rarely taken into account feature relevance along with the generation of label-specific features.

Fig. 1 Illustration of specific features on label distribution data

In light of these observations, the strategy of label-specific features has been widely adopted in data classification. However, most existing models deal with multi-label data, which discards part of the label information contained in label distribution data [4, 5, 9, 11]. Moreover, many existing label-specific feature works ignore the relationship between features and labels [16, 18, 40]. To address this issue, this paper proposes a novel feature selection approach that exploits label-specific features for label distribution data. First, the optimization framework is formulated via sparse learning, which exploits specific features with \(l_{1}\)-norm regularization and common features with \(l_{2,1}\)-norm regularization. Second, the Pearson correlation coefficient is applied to measure the correlation information among labels, which assists the optimization process and enhances the generalization ability of the proposed framework. Furthermore, mutual information is incorporated into the proposed optimization function to examine the relevance between features and labels, which enhances the discriminatory ability of the feature selection model. Finally, comprehensive experiments are conducted to verify the effectiveness of our proposed method.

The major contributions of this work are summarized in the following four points:

  1. A novel feature selection approach for label distribution data is proposed, which simultaneously considers common and label-specific features. Moreover, the proposed method further improves performance by exploiting label-specific features and sparse learning.

  2. Label correlation and feature relevance are simultaneously incorporated into the feature selection process, thereby effectively enhancing the generalization ability of the learning model and reducing feature redundancy.

  3. The proposed method can directly deal with label distribution data without discretizing the label information. Thus, the performance of label distribution learning can be significantly improved by selecting discriminative features.

  4. Extensive empirical comparisons of our method with other state-of-the-art feature selection algorithms are conducted on fifteen benchmark datasets using six evaluation metrics. The experimental results demonstrate the superiority of our method.

This paper is structured as follows. Section 2 introduces the work related to this paper. The proposed approach is detailed in Section 3. In Section 4, the experimental results obtained by the proposed approach are analyzed. In Section 5, the paper is concluded and future work is discussed.

2 Related work

2.1 Label distribution learning

Label Distribution Learning (LDL), which has demonstrated a superior ability to extract complex semantics from data compared to multi-label learning, has broad applications in areas such as gene selection and scene recognition [35, 41, 44]. Existing label distribution learning approaches can be categorized into three types: problem transformation, algorithm adaptation, and specialized algorithms. Problem transformation is the simplest approach, where the LDL problem is reformulated into existing learning paradigms; [6] introduced two typical problem transformation algorithms, which treat each label as an independent binary classification problem. The algorithm adaptation strategy modifies existing learning models to handle label distributions directly. For example, [38] adapted LogitBoost with an additive weighted function regression to fit the LDL paradigm, and [6] adopted the kNN algorithm to predict the label distribution.

Unlike the indirect strategies of problem transformation and algorithm adaptation, specialized algorithms are tailor-made for label distribution learning problems. Most specialized LDL algorithms adopt the maximum entropy model [3] as the output model, including SA-IIS [6] and SA-BFGS [6]. Both of these approaches use the Kullback–Leibler (K-L) divergence as the objective function to quantify the loss between the predicted and actual label values. Meanwhile, some LDL algorithms use other similarity measures as the objective function. For example, [7] opted for the Jeffrey divergence as the optimization objective in the realm of head pose estimation. Akbari et al. [1] proposed the \(\gamma \)-Wasserstein loss, which leverages the specific geometric structure of the age label space to enhance the impact of highly correlated ages. Additionally, there are also differences in the optimization methods of LDL algorithms: LDLSP [34] and LGGME [36] both utilize the Alternating Direction Method of Multipliers (ADMM), while LDL4C [37] employs the BFGS method.

Relevant studies and experiments show that specialized algorithms outperform those based on the problem transformation and algorithm adaptation strategies, and they have therefore received much attention from researchers. However, label distribution learning, like other conventional learning paradigms, is challenged by the curse of dimensionality. In this paper, we propose a specialized algorithm that reduces the dimensionality of label distribution datasets by utilizing label-specific features to direct feature selection.

2.2 Label-specific features learning

Label-specific feature learning is a processing strategy that assumes each label has its own discriminative features rather than sharing the same original features across all labels. This strategy stems from multi-label learning and seeks to obtain features that are highly relevant to each label in order to improve model performance. Based on the learning process, label-specific feature learning can be classified into two types. The first generates label-specific features in a transformed feature space. For example, LIFT [47] is the pioneering multi-label learning method based on label-specific features, which applies cluster analysis to the positive and negative instances of each label to generate the corresponding label-specific features. However, LIFT is unable to distinguish the features that are strongly relevant to each label and does not take the correlation information among labels into consideration. Hang and Zhang [10] proposed a label-specific feature learning approach based on deep neural networks, which uses label semantics to select label-specific features. Zhang et al. [48] presented an effective BiLabel-specific feature learning approach, which generates label-specific features for label pairs through heuristic prototype selection and embedding. These methods based on the first strategy improve the performance of multi-label learning, but they are often neither intuitive nor interpretable.

The second generates label-specific features from the original feature space directly. Specifically, these approaches typically integrate the process of label-specific feature selection into the classification model, thereby facilitating a mutual enhancement between the two [17, 39, 45]. Among the various methods, most employ the \(l_{1}\)-norm to learn label-specific features. For example, [14] proposed a novel method for learning label-specific features, which incorporates the learned high-order label correlations with missing labels. Since learning label-specific features independently of the classification model may impact classification performance, [43] introduced a comprehensive learning strategy that simultaneously generates label-specific features and induces a classification model. However, these methods ignore common features that are discriminative across all labels. Ma et al. [25] performed sample-specific and label-specific classification, which combines inter-label and inter-instance correlations. [40] proposed a new two-stage partial multi-label feature selection method, which fuses label correlations into the feature selection process. Moreover, some approaches generate label-specific features by involving instance correlations. Li et al. [18] developed a novel method for learning common and label-specific features, which uses the correlation information from labels and instances in multi-label classification. Zhang et al. [46] proposed a novel label-specific feature selection framework, which simultaneously considers label-group and instance-group correlations. Peng et al. [27] proposed a joint label-common and label-specific feature selection model, which achieves semi-supervised cross-session electroencephalogram (EEG) emotion recognition.

Fig. 2 The overall framework of FSLSF

The above research has shown that label-specific feature learning is an efficient processing strategy for multi-label data. However, current label-specific feature methods are usually designed for multi-label learning and do not consider feature relevance. To deal with this issue, we propose an approach based on label-specific features that involves both label correlation and feature relevance, and can select distinctive features for label distribution data.

3 Proposed approach

This section proposes a novel feature selection approach based on label distributions, which simultaneously considers label-specific features for each label and common features for the whole label set. First, the approach for learning label-specific features is presented. Second, the Pearson correlation coefficient is utilized to investigate the correlation among labels, thereby improving the performance of the model. Finally, feature relevance is considered through mutual information theory.

3.1 Problem formulation

Let \(X=[x_{1},x_{2},\ldots ,x_{i},\ldots ,x_{n}]\in R^{n\times m}\) denote the instance matrix of the training data in the m-dimensional feature space, where \(x_{i}\) is the i-th instance and n is the number of instances. Let \(Y=\{y_{1},y_{2},\ldots ,y_{l}\}\) denote the label space with l class labels. Given a training set \(S=\{(x_{1},D_{1})\), \((x_{2},D_{2}),\) \(\ldots ,(x_{n},D_{n})\}\), where \(D_{i}=\{d_{i1},d_{i2},\ldots ,d_{il}\}\) is the label distribution associated with \(x_{i}\), and \(d_{ij}\) represents the significance of label \(y_{j}\) to instance \(x_{i}\). In addition, the label distribution satisfies \(d_{ij}\in [0,1]\) and \(\sum _{j=1}^{l}d_{ij}=1\).
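To make the notation concrete, the following minimal numpy sketch instantiates a toy training set with assumed sizes; the variable names and values are illustrative only and are not part of the proposed method.

```python
import numpy as np

# Toy sizes (assumed): n instances, m features, l labels
n, m, l = 4, 6, 3
rng = np.random.default_rng(0)

X = rng.standard_normal((n, m))      # instance matrix, one row per instance x_i

# Label distribution matrix D: each row D_i lies in [0, 1] and sums to 1
D = rng.random((n, l))
D = D / D.sum(axis=1, keepdims=True)

assert np.allclose(D.sum(axis=1), 1.0) and (D >= 0).all()
```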

3.2 Feature selection model based on label-specific features

For label distribution feature selection, leveraging common features of the label set to improve model performance is an effective and popular method. In general, some common features exhibit comparable discriminatory power across all class labels, thereby enhancing the efficacy of selecting features from the original feature set. Nevertheless, each class label may have its own distinct subset of label-specific features that are particularly relevant and discriminative for that label. Thus, the selected feature subset may be suboptimal if label-specific features for each class label are ignored. In this paper, we adopt an optimization learning framework to learn accurate label-specific features and common features from label distribution data. The overall framework of this paper is shown in Fig. 2.

As depicted in Fig. 2, our proposed method, FSLSF, is composed of several components: data preparation, learning common and label-specific features, exploring label correlation and feature relevance, and feature selection. In the data preparation stage, we transform multi-label data into label distribution data through a label enhancement algorithm. In the second stage, we learn both common and label-specific features, assuming a linear relationship between the feature space and the label space. This relationship can be represented as \(X\varvec{W}=\hat{D}\). Here, the matrix \(\varvec{W}=[w_{1},w_{2},...,w_{l}]\in R^{m\times l}\) indicates the correlations among different labels and features, where \(w_{l}\) is a coefficient vector that contains the discriminative power of all features for the l-th label. The optimization function is developed to minimize the empirical loss between the model output \(X\varvec{W}\) and the predicted label distribution \(\hat{D}\). Specifically, we use the sparsity of the \(l_{1}\) regularizer to obtain the label-specific features matrix \(\varvec{Q}\), and use the \(l_{2,1}\) regularizer to retain common features. Furthermore, the correlation of labels in the label space is explored to constrain the label-specific features matrix \(\varvec{Q}\), and the relevance between features and labels is incorporated to eliminate redundant features. During the feature selection process, the final feature importance matrix \(\varvec{W}\) is obtained by fusing \(\varvec{V}\) and \(\varvec{Q}\). With the feature importance matrix, we can rank the features and select the top-ranked ones.

The corresponding optimization function for learning can be formulated as follows.

$$ \min _{\varvec{W},\varvec{Q},\varvec{V}}\frac{1}{2}||X\varvec{W}-D||_{F}^{2}+\alpha \varPhi (\varvec{Q})+\beta \varPsi (\varvec{V}), $$
$$\begin{aligned} s.t.\,\varvec{W}=\varvec{Q}+\varvec{V}, \end{aligned}$$
(1)

where the regularizers \(\varPhi (.)\) and \(\varPsi (.)\) are employed to promote stability in learning the label-specific features matrix \(\varvec{Q}\) and the common features matrix \(\varvec{V}\), respectively, and \(\varvec{W}\) consists of \(\varvec{Q}\) and \(\varvec{V}\). Besides, \(\alpha \) and \(\beta \) are regularization parameters that control the sparsity of the label-specific features matrix \(\varvec{Q}\) and the common features matrix \(\varvec{V}\), respectively.

Since each label is determined by several specific features of its own, the label-specific features matrix \(\varvec{Q}\) is enforced to be sparse. Meanwhile, regularization using \(l_{1}\) norm can induce sparsity among all elements in \(\varvec{Q}\), leading to certain parameters being shrunk to zero. Therefore, in order to extract label-specific features of each label, \(l_{1}\) norm regularizer is utilized to make full use of this sparsity. To regularize the common features matrix \(\varvec{V}\), we use the \(l_{2,1}\) norm, which promotes a sparse representation with a few non-zero rows. This ensures that discriminative features that are common across all labels are selected. To sum up, the optimization function can be reformulated as follows:

$$ \min _{\mathbf {\varvec{W},\varvec{Q},\varvec{V}}}\frac{1}{2}||X\varvec{W}-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}, $$
$$\begin{aligned} s.t.\,\varvec{W}=\varvec{Q}+\varvec{V}, \end{aligned}$$
(2)

For the label-specific features matrix \(\varvec{Q}\), a non-zero value of \(q_{ij}\) signifies that the i-th feature distinguishes the j-th label, thereby qualifying as a label-specific feature for that label. Conversely, a zero value of \(q_{ij}\) implies that the corresponding feature is not informative for the j-th label. For the common features matrix \(\varvec{V}\), if the i-th row \(v_{i}\ne 0\), the i-th feature exhibits strong discriminatory power for the whole label set.
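As a reading aid, the sketch below evaluates objective (2) for given matrices; it is not the optimization procedure itself, and the argument names (`X`, `D`, `Q`, `V`, `alpha`, `beta`) simply mirror the notation above.

```python
import numpy as np

def objective_eq2(X, D, Q, V, alpha, beta):
    """Value of objective (2): squared loss + l1 penalty on Q + l2,1 penalty on V."""
    W = Q + V                                   # constraint W = Q + V
    loss = 0.5 * np.linalg.norm(X @ W - D, "fro") ** 2
    l1_Q = np.abs(Q).sum()                      # ||Q||_1: element-wise sparsity (label-specific)
    l21_V = np.linalg.norm(V, axis=1).sum()     # ||V||_{2,1}: row sparsity (common features)
    return loss + alpha * l1_Q + beta * l21_V
```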

3.3 Label correlation

In label distribution learning, each sample is associated with an interrelated label distribution. Therefore, taking label correlation into account proves advantageous in enhancing the weights of discriminative features. In addition, the label correlation constraint has been demonstrated to provide substantial performance improvements in label distribution learning. Leveraging the labels linked with a specific label allows for the adjustment of that label's distribution value. If a robust connection exists between two labels, their distribution values ought to be alike; otherwise, if labels \(l_{p}\) and \(l_{q}\) are weakly correlated, their distributions of discriminative features should differ. As a result, the objective function can be reformulated as follows:

$$\begin{aligned} \min _{\mathbf {\varvec{W},\varvec{Q},\varvec{V}}}\frac{1}{2}||X\varvec{W}-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\gamma \sum _{p=1}^{l}\sum _{q=1}^{l}r_{pq}\varvec{Q}_{p}^{T}\varvec{Q}_{q}, \end{aligned}$$
$$\begin{aligned} s.t.\,\varvec{W}=\varvec{Q}+\varvec{V}, \end{aligned}$$
(3)

where \(r_{pq}=\frac{1}{s_{pq}-1}\), and \(s_{pq}\) denotes the correlation coefficient between the distributions of the p-th label \(y_{p}\) and the q-th label \(y_{q}\). \(\gamma \) is the balance factor. In this paper, the Pearson correlation function is employed to compute each correlation coefficient \(s_{pq}\), as outlined below:

$$\begin{aligned} s_{pq}=\frac{\sum _{i=1}^{n}(d_{ip}-\mu _{p})(d_{iq}-\mu _{q})}{\sqrt{\sum _{i=1}^{n}(d_{ip}-\mu _{p})^{2}}\sqrt{\sum _{i=1}^{n}(d_{iq}-\mu _{q})^{2}}}, \end{aligned}$$
(4)

where \(\mu _{p}\) and \(\mu _{q}\) denote the means of the p-th and q-th label distributions over all instances in the label distribution matrix D, respectively. Given that \(\sum _{p=1}^{l}\sum _{q=1}^{l}r_{pq}\varvec{Q}_{p}^{T}\varvec{Q}_{q}=tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})\), the optimization objective function can be further rewritten in the following form:

$$ \min _{\mathbf {\varvec{W},\varvec{Q},\varvec{V}}}\frac{1}{2}||X\varvec{W}-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\gamma tr(\varvec{R}\varvec{Q}^{T}\varvec{Q}), $$
$$\begin{aligned} s.t.\,\varvec{W}=\varvec{Q}+\varvec{V}, \end{aligned}$$
(5)

where tr(.) is the trace of a matrix, and \(\varvec{R}=[r_{pq}]_{l\times l}\) denotes the correlation matrix of the labels. By considering label correlation through Pearson’s coefficient, the optimization function can more efficiently select features with distinguishing ability.
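A small sketch of how the Pearson correlations of (4) can be computed from the columns of the label distribution matrix `D`; `np.corrcoef` applies the same formula column-wise. Building \(\varvec{R}\) from these coefficients via \(r_{pq}=1/(s_{pq}-1)\) follows the definition above and is left out here, since the text does not specify how the diagonal entries (where \(s_{pp}=1\)) are handled.

```python
import numpy as np

def label_correlation_matrix(D):
    """Pearson correlation s_pq between every pair of label columns of D, as in (4)."""
    return np.corrcoef(D, rowvar=False)          # l x l matrix of s_pq values

def pearson_pq(D, p, q):
    """Explicit form of (4) for one label pair (p, q), useful as a cross-check."""
    dp = D[:, p] - D[:, p].mean()
    dq = D[:, q] - D[:, q].mean()
    return (dp * dq).sum() / (np.sqrt((dp ** 2).sum()) * np.sqrt((dq ** 2).sum()))
```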

3.4 Feature relevance

In label distribution learning, utilizing information theory to develop feature evaluation criteria is a common and effective approach. Generally speaking, mutual information is a classical and useful measure for quantifying the dependence among random variables. Formally, the mutual information shared by two random variables, X and Y, can be defined as follows:

$$\begin{aligned} I(X;Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X). \end{aligned}$$
(6)

Consider Z as a discrete random variable. The conditional mutual information can also be articulated using information entropy as follows:

$$\begin{aligned} I(X;Y|Z)=H(X\mid Z)-H(X\mid Y,Z). \end{aligned}$$
(7)
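For discrete (or discretized) variables, the quantities in (6) and (7) can be estimated from empirical joint frequencies. The plug-in estimator below is a generic sketch, not part of the proposed method; it simply rewrites (6) and (7) in terms of joint entropies.

```python
import numpy as np

def entropy(*vars_):
    """Plug-in joint entropy of one or more equal-length discrete sequences."""
    _, counts = np.unique(np.column_stack(vars_), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), an equivalent form of (6)
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mutual_information(x, y, z):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), an equivalent form of (7)
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)
```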

In feature selection approaches based on information theory, the construction of feature relevance terms typically depends on the quantity of information that selected or candidate features contribute to the label set. Hu et al. [13] proposed a new feature relevance term, which considers both the altered ratio of the undetermined information quantity and the modified ratio of the established information quantity. The corresponding evaluation function is outlined below:

$$\begin{aligned} J(f_{k};f_{j};l_{i})=\sum _{l_{i}\in L}\left\{ I\left( f_{k};l_{i}\right) \sum _{f_{j}\in S}\left[ \frac{I\left( f_{k};l_{i}\mid f_{j}\right) }{I\left( f_{k};l_{i}\right) }+\frac{I\left( f_{j};l_{i}\mid f_{k}\right) }{I\left( f_{j};l_{i}\right) }\right] \right\} , \end{aligned}$$
(8)

where F represents the whole feature set, S denotes the subset of selected features, and L signifies the label set, with \(f_{k}\in F-S\), \(f_{j}\in S\), and \(l_{i}\in L\). The function \(J(f_{k};f_{j};l_{i})\) concurrently takes into account the altered ratio of the undetermined information quantity and the modified ratio of the established information quantity.

We inherit this evaluation method to assess feature importance and propose the following optimization objective function:

$$\begin{aligned} \max _{\mathbf {\varvec{W}}}\sum _{k=1}^{m}\sum _{j=1}^{l}J\left( f_{k};f_{F-k};l_{j}\right) w_{kj}, \end{aligned}$$
(9)

where \(J\left( f_{k};f_{F-k};l_{j}\right) \) signifies the relevance between the feature \(f_{k}\) and the label \(l_{j}\), while \(w_{kj}\) represents the importance of the feature \(f_{k}\) for the label \(l_{j}\). Collecting these relevance values into a matrix \(\varvec{C}=[c_{kj}]_{m\times l}\) with \(c_{kj}=J\left( f_{k};f_{F-k};l_{j}\right) \), the distance between \(\varvec{W}\) and \(\varvec{C}\) can be directly computed as follows:

$$\begin{aligned} dis(w,c)=\root p \of {\sum _{k=1}^{m}\sum _{j=1}^{l}\left| w_{kj}-c_{kj}\right| ^{p}}, \end{aligned}$$
(10)

where p is set as 2.

Then, the final objective function arrives at:

$$\begin{aligned} \min _{\mathbf {\varvec{W},\varvec{Q},\varvec{V}}}\frac{1}{2}||X\varvec{W}-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\gamma tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\delta ||\varvec{W}-\varvec{C}||_{2}^{2}, \end{aligned}$$
$$\begin{aligned} s.t.\,\varvec{W}=\varvec{Q}+\varvec{V}, \end{aligned}$$
(11)

where \(\delta \) is the tradeoff parameter. In summary, the first term of the proposed optimization framework is a loss function that measures the difference between the predicted and actual label distributions. The second and third terms extract label-specific features for each class label and common features for all labels, respectively. Besides, the coefficient matrix \(\varvec{W}\) is involved in the fourth and fifth terms, so the optimized feature selection result is affected by label correlation and feature relevance simultaneously.

3.5 Optimization

Our approach's ultimate objective function is convex, yet it does not constitute a smooth optimization problem owing to the presence of the \(l_{2,1}\)- and \(l_{1}\)-norm regularizers. To tackle this issue, the optimization problem can be reformulated in the following manner:

$$\begin{aligned} \min _{\mathbf {\varvec{Q},\varvec{V}}}\frac{1}{2}||X\varvec{W}-D||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\beta ||\varvec{V}||_{2,1}+\gamma tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\delta ||\varvec{W}-\varvec{C}||_{2}^{2}. \end{aligned}$$
(12)

In this context, \(||V||_{2,1}\) is directly relaxed by \(tr(V^{T}AV)\), with \(\varvec{A}\) representing an \(m\times m\) diagonal matrix whose i-th diagonal entry is \(A_{ii}=\frac{1}{2||V_{i}||_{2}}\), where \(V_{i}\) is the i-th row of \(\varvec{V}\). Furthermore, the optimization problem can be efficiently addressed by optimizing \(\varvec{Q}\) and \(\varvec{V}\) in an alternating manner.
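A minimal sketch of the relaxation: `A` is recomputed from the current \(\varvec{V}\) in every outer iteration, and the small `eps` that guards against zero rows is an implementation detail assumed here, not specified in the text.

```python
import numpy as np

def diag_A(V, eps=1e-8):
    """Diagonal matrix A with A_ii = 1 / (2 * ||V_i||_2), so ||V||_{2,1} is relaxed to tr(V^T A V)."""
    row_norms = np.linalg.norm(V, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
```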

Fix \(\varvec{Q}\), Optimize \(\varvec{V}\): With \(\varvec{Q}\) held constant, the optimization problem in relation to \(\varvec{V}\) can be restructured as follows:

$$\begin{aligned} \min _{\mathbf {\varvec{V}}}\frac{1}{2}||X\varvec{V}-\varvec{E}||_{F}^{2}+\beta tr(\varvec{V}^{T}\varvec{A}\varvec{V})+\delta ||\varvec{V}-\varvec{C}||_{2}^{2}, \end{aligned}$$
(13)

where \(\varvec{E}=D-X\varvec{Q}\). Here, matrices \(\varvec{A}\) and \(\varvec{R}\) are positive semidefinite, so \(tr(\varvec{V}^{T}\varvec{A}\varvec{V})\ge 0\) and \(tr(\varvec{R}\varvec{V}^{T}\varvec{V})\ge 0\) hold for any nonzero \(\varvec{V}\). Thus, we can get \(\frac{\partial tr(\varvec{V}^{T}\varvec{A}\varvec{V})}{\partial \varvec{V}}=(\varvec{A}^{T}+\varvec{A})\varvec{V}\) and \(\frac{\partial tr(\varvec{R}\varvec{V}^{T}\varvec{V})}{\partial \varvec{V}}=\varvec{V}(\varvec{R}^{T}+\varvec{R})\). Since (13) is differentiable, setting its derivative with respect to \(\varvec{V}\) to zero yields:

$$\begin{aligned} X^{T}(X\varvec{V}-\varvec{E})+\beta (\varvec{A}^{T}+\varvec{A})\varvec{V}+2\delta (\varvec{V}-\varvec{C})=0. \end{aligned}$$
(14)

Solving (14) for \(\varvec{V}\), where \(\varvec{I}\) denotes the \(m\times m\) identity matrix, we obtain the closed-form solution:

$$\begin{aligned} \varvec{V}=(X^{T}X+2\beta \varvec{A}+2\delta \varvec{I})^{-1}(X^{T}\varvec{E}+2\delta \varvec{C}). \end{aligned}$$
(15)
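A sketch of the V-step as derived in (14)-(15); the matrices `X`, `D`, `Q`, `A`, `C` and the scalars `beta`, `delta` follow the notation of this subsection, and `np.linalg.solve` is used instead of an explicit inverse for numerical stability.

```python
import numpy as np

def update_V(X, D, Q, A, beta, delta, C):
    """Closed-form V update of (15): V = (X^T X + 2*beta*A + 2*delta*I)^{-1} (X^T E + 2*delta*C)."""
    m = X.shape[1]
    E = D - X @ Q                                   # E = D - X Q
    lhs = X.T @ X + 2.0 * beta * A + 2.0 * delta * np.eye(m)
    rhs = X.T @ E + 2.0 * delta * C
    return np.linalg.solve(lhs, rhs)
```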

Fix \(\varvec{V}\), Optimize \(\varvec{Q}\): With \(\varvec{V}\) held constant, the optimization problem in relation to \(\varvec{Q}\) can be restructured as follows:

$$\begin{aligned} \min _{\mathbf {\varvec{Q}}}\frac{1}{2}||X\varvec{Q}-\varvec{E}||_{F}^{2}+\alpha ||\varvec{Q}||_{1}+\gamma tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\delta ||\varvec{Q}-\varvec{C}||_{2}^{2}, \end{aligned}$$
(16)

where \(\varvec{E}=D-X\varvec{V}\).

Employing the accelerated proximal gradient strategy, we address the convex optimization problem, which is expressed as follows:

$$\begin{aligned} \min _{\mathbf {\varvec{Q}}}F(\varvec{Q})=f(\varvec{Q})+g(\varvec{Q}). \end{aligned}$$
(17)

Here, the problem is posed over a real Hilbert space H. According to (16) and (17), \(f(\varvec{Q})\) and \(g(\varvec{Q})\) can be derived as follows:

$$\begin{aligned} f(\varvec{Q})=\frac{1}{2}||X\varvec{Q}-\varvec{E}||_{F}^{2}+\gamma tr(\varvec{R}\varvec{Q}^{T}\varvec{Q})+\delta ||\varvec{Q}-\varvec{C}||_{2}^{2}, \end{aligned}$$
(18)
$$\begin{aligned} g(\varvec{Q})=\alpha ||\varvec{Q}||_{1}. \end{aligned}$$
(19)

\(f(\varvec{Q})\) is both convex and smooth, while g(Q), although convex, is typically not smooth.

For \(\nabla f(\varvec{Q})\), it can be obtained by taking the derivative of (18) with respect to \(\varvec{Q}\). That is

$$\begin{aligned} \nabla f(\varvec{Q})=X^{T}X\varvec{Q}-X^{T}\varvec{E}+2\gamma \varvec{Q}\varvec{R}+2\delta (\varvec{Q}-\varvec{C}). \end{aligned}$$
(20)

Following existing studies [18], rather than minimizing \(f(\varvec{Q})\) directly, we minimize a sequence of separable quadratic approximations of it via the proximal gradient algorithm, which is expressed as

$$\begin{aligned} \varvec{Q}=arg~{\min }_{\varvec{Q}}\alpha ||\varvec{Q}||_{1}+\frac{L_{f}}{2}||\varvec{Q}-O^{(t)}||_{2}^{2}, \end{aligned}$$
(21)

where \(L_{f}\) is the Lipschitz constant and \(O^{(t)}=\varvec{Q}^{(t)}-\frac{1}{L_{f}}\nabla f(\varvec{Q}^{(t)})\).

Proposition 1

If H is a Euclidean space endowed with the Frobenius norm \(||\cdot ||_{F}\) and the \(l_{1}\) norm \(||\cdot ||_{1}\), \(\varvec{Q}\) can be updated by soft-shrinkage operator as follows:

$$\begin{aligned} \varvec{Q}^{(t+1)}=\zeta [O^{(t)}]=arg~{\min }_{\varvec{Q}}\alpha ||\varvec{Q}||_{1}+\frac{L_{f}}{2}||\varvec{Q}-O^{(t)}||_{2}^{2}. \end{aligned}$$
(22)

According to Proposition 1, the soft-thresholding operator \(\zeta [O^{(t)}]\) and the closed-form solution for \(\varvec{Q}\) can be collectively expressed as follows:

$$\begin{aligned} q_{ij}=\zeta [o_{ij}]={\left\{ \begin{array}{ll} o_{ij}-\frac{\alpha }{L_{f}}, & if\,o_{ij}>\frac{\alpha }{L_{f}},\\ o_{ij}+\frac{\alpha }{L_{f}}, & if\,o_{ij}<-\frac{\alpha }{L_{f}},\\ 0, & otherwise. \end{array}\right. } \end{aligned}$$
(23)

where \(1\le i\le m\) and \(1\le j\le l\). This operator can be extended to vectors and matrices by applying it element-wise.
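Combining (20)-(23), one proximal gradient step for \(\varvec{Q}\) can be sketched as follows; the function names and the single-step structure are assumptions for illustration (in practice the step is repeated, possibly with the acceleration scheme mentioned above).

```python
import numpy as np

def soft_threshold(O, tau):
    """Element-wise soft-thresholding operator of (23) with threshold tau = alpha / L_f."""
    return np.sign(O) * np.maximum(np.abs(O) - tau, 0.0)

def update_Q(X, D, V, Q, R, C, alpha, gamma, delta, L_f):
    """One proximal gradient step for Q: gradient (20), point O^(t), then soft-thresholding (21)-(23)."""
    E = D - X @ V                                                       # E = D - X V
    grad = X.T @ (X @ Q) - X.T @ E + 2.0 * gamma * Q @ R + 2.0 * delta * (Q - C)
    O = Q - grad / L_f
    return soft_threshold(O, alpha / L_f)
```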

Given \(\varvec{Q}_{1}\) and \(\varvec{Q}_{2}\), the Lipschitz constant \(L_{f}\) must satisfy \(||\nabla f(\varvec{Q}_{1})-\nabla f(\varvec{Q}_{2})||_{2}\le L_{f}||\varvec{Q}_{1}-\varvec{Q}_{2}||_{2}\). Writing \({\varDelta } \varvec{Q}=\varvec{Q}_{1}-\varvec{Q}_{2}\), the Lipschitz constant \(L_{f}\) can be obtained from the following derivation:

$$\begin{aligned} {\begin{matrix} & ||\nabla f(\varvec{Q}_{1})-\nabla f(\varvec{Q}_{2})||_{2}^{2}=||X^{T}X{\varDelta } \varvec{Q}+2\gamma \varvec{R}{\varDelta } \varvec{Q}+2\delta {\varDelta } \varvec{Q}||_{2}^{2}\\ & \le (||X^{T}X{\varDelta } \varvec{Q}||+||2\gamma \varvec{R}{\varDelta } \varvec{Q}||+||2\delta {\varDelta } \varvec{Q}||)^{2}=||X^{T}X{\varDelta } \varvec{Q}||^{2}+\\ & ||2\gamma \varvec{R}{\varDelta } \varvec{Q}||^{2}+||2\delta {\varDelta } \varvec{Q}||^{2}+2||X^{T}X{\varDelta } \varvec{Q}||\cdot ||2\gamma \varvec{R}{\varDelta } \varvec{Q}||+\\ & 2||2\gamma \varvec{R}{\varDelta } \varvec{Q}||\cdot ||2\delta {\varDelta } \varvec{Q}||+2||X^{T}X{\varDelta } \varvec{Q}|| \cdot ||2\delta {\varDelta } \varvec{Q}||\\ & \le 3(||X^{T}X{\varDelta } \varvec{Q}||^{2}+||2\gamma \varvec{R}{\varDelta } \varvec{Q}||^{2}+||2\delta {\varDelta } \varvec{Q}||^{2}) \\ & =3(||X^{T}X||^{2}+||2\gamma \varvec{R}||^{2}+||2\delta ||^{2})||{\varDelta } \varvec{Q}||^{2}. \end{matrix}} \end{aligned}$$
(24)

The Lipschitz constant \(L_{f}\) can be set as

$$\begin{aligned} L_{f}=\sqrt{3(||X^{T}X||_{2}^{2}+||2\gamma \varvec{R}||_{2}^{2}+||2\delta ||_{2}^{2})}. \end{aligned}$$
(25)
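A direct transcription of (25); taking \(||\cdot ||_{2}\) as the spectral norm is an assumption, since the text does not state which matrix norm is intended.

```python
import numpy as np

def lipschitz_constant(X, R, gamma, delta):
    """Lipschitz constant L_f of (25), with ||.||_2 interpreted as the spectral norm."""
    return np.sqrt(3.0 * (np.linalg.norm(X.T @ X, 2) ** 2
                          + np.linalg.norm(2.0 * gamma * R, 2) ** 2
                          + (2.0 * delta) ** 2))
```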

On the basis of the above optimization process, a label distribution feature selection algorithm based on label-specific features is proposed. The flow chart of this algorithm is depicted in Fig. 3.

As shown in Fig. 3, the feature matrix is initialized and the correlation coefficient of the optimization function is calculated. Then, the label-specific feature matrix \(Q^{(t)}\) and the common feature matrix \(V^{(t)}\) are progressively updated until the termination condition is met. Finally, the top-ranked features are obtained through the feature importance matrix.
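Putting the pieces together, the sketch below assembles the helper sketches given earlier (`diag_A`, `update_V`, `update_Q`, `lipschitz_constant`) into one alternating loop and then ranks the features. The zero initialization, the fixed iteration count, and the row-norm ranking rule are assumptions: the text only states that the feature importance matrix \(\varvec{W}\) is obtained by fusing \(\varvec{V}\) and \(\varvec{Q}\).

```python
import numpy as np

def fslsf_rank_features(X, D, R, C, alpha, beta, gamma, delta, n_iter=50):
    """Alternate the V and Q updates, then rank features by the row-wise l2 norm of W = Q + V."""
    m, l = X.shape[1], D.shape[1]
    Q = np.zeros((m, l))
    V = np.zeros((m, l))
    L_f = lipschitz_constant(X, R, gamma, delta)
    for _ in range(n_iter):                      # a fixed iteration count stands in for the
        A = diag_A(V)                            # unspecified termination condition
        V = update_V(X, D, Q, A, beta, delta, C)
        Q = update_Q(X, D, V, Q, R, C, alpha, gamma, delta, L_f)
    W = Q + V
    importance = np.linalg.norm(W, axis=1)       # one importance score per feature
    return np.argsort(-importance)               # feature indices, most important first
```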

Fig. 3 Flow chart of Algorithm FSLSF

Algorithm 1 Label distribution feature selection algorithm based on label-specific features (Algorithm FSLSF)

Analysis of algorithm FSLSF. The time complexity of the proposed method FSLSF is composed of three components: initialization, computation of the correlation matrix and the Lipschitz constant, and feature selection. Initially, the time complexity of initializing the variables \(\varvec{V}\) and \(\varvec{Q}\) is \(O(m^{2}n+m^{3}+mnl)\), where n, m, and l denote the numbers of instances, features, and labels, respectively. Then, the feature relevance matrix \(\varvec{C}\) is computed with a time complexity of O(mnl), and the label correlation matrix \(\varvec{R}\) is computed with a time complexity of \(O(ml^{2})\). Computing the Lipschitz constant \(L_{f}\) costs \(O(m^{3}+l^{3})\). Finally, the feature selection process has a time complexity of \(O(t(m^{2}l+ml^{2}+m^{2}n+m^{3}+mnl))\), where t indicates the number of iterations. Therefore, the total time complexity of the proposed algorithm is \(O(l^{3}+t(m^{2}l+ml^{2}+m^{2}n+m^{3}+mnl))\).

4 Experiments

In this section, we delve into the performance and feasibility of our algorithm via comparative experiments. The experimental datasets are outlined in Section 4.1, and Section 4.2 introduces the widely used evaluation metrics. Section 4.3 establishes the parameter settings and the number of selected features for the experimental algorithms. The effectiveness of FSLSF is then confirmed through an analysis of the experimental results in Section 4.4. Finally, we analyze parameter sensitivity in Section 4.5 and compare computational time in Section 4.6.

4.1 Datasets

To assess the reliability of the proposed FSLSF approach, we have conducted extensive comparative experiments on feature selection tasks across sixteen benchmark datasets. The specifics of these datasets are summarized in Table 1, where ‘#instance’, ‘#feature’, and ‘#label’ denote the number of instances, features, and labels, respectively. All sixteen datasets are label distribution datasets; the first fifteen are available for download from the associated website, and RAF-ML is a facial expression dataset that can be found in [19].

Table 1 Data sets information descriptions

4.2 Evaluation metrics

In this study, six evaluation metrics are utilized to measure the performance of FSLSF. Consider the testing dataset \(\bar{S}=\{\bar{x_{i}};\bar{D_{i}}|1\le i\le n\}\), where \(\bar{D_{i}}=\{\bar{d_{1}},\ldots ,\bar{d_{j}},\ldots ,\bar{d_{l}}\}\) represents the actual label distribution of the instance \(\bar{x_{i}}\), and \(\hat{D_{i}}=\{\hat{d_{1}},\ldots ,\hat{d_{j}},\ldots ,\hat{d_{l}}\}\) denotes the label distribution predicted for \(\bar{x_{i}}\) by the label distribution algorithm SA-BFGS. The six evaluation metrics are categorized into two types: distance-based metrics (Chebyshev, Clark, Canberra, and Kullback–Leibler divergence) and similarity-based metrics (Cosine and Intersection). For the distance-based metrics, a lower value signifies better performance, indicated by the downward arrow (\(\downarrow \)); conversely, for the similarity-based metrics, a higher value represents better performance, indicated by the upward arrow (\(\uparrow \)). Detailed definitions of these evaluation metrics are provided in Table 2.

Table 2 Evaluation metrics for label distribution learning
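For reference, the six metrics can be computed as below. The formulas are the standard LDL definitions reported in the literature and should be checked against Table 2; the small `eps` used to guard zero entries is an implementation assumption.

```python
import numpy as np

def ldl_metrics(D_true, D_pred, eps=1e-12):
    """Six common LDL metrics, computed per instance and then averaged."""
    T = np.clip(D_true, eps, None)
    P = np.clip(D_pred, eps, None)
    chebyshev    = np.abs(T - P).max(axis=1).mean()
    clark        = np.sqrt((((T - P) ** 2) / ((T + P) ** 2)).sum(axis=1)).mean()
    canberra     = (np.abs(T - P) / (T + P)).sum(axis=1).mean()
    kl           = (T * np.log(T / P)).sum(axis=1).mean()
    cosine       = ((T * P).sum(axis=1)
                    / (np.linalg.norm(T, axis=1) * np.linalg.norm(P, axis=1))).mean()
    intersection = np.minimum(T, P).sum(axis=1).mean()
    return {"Chebyshev": chebyshev, "Clark": clark, "Canberra": canberra,
            "KL": kl, "Cosine": cosine, "Intersection": intersection}
```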

4.3 Experimental settings

In this subsection, we discuss the experimental settings. Our proposed method, FSLSF, is evaluated against five other state-of-the-art feature selection methods: RLFC, G3WI, LRLSF, FSFL, and CLEMLFS. For FSLSF, the regularization parameters \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \) are adjusted within the range of \(\{10^{-3},10^{-2},...,10^{2}\}\). The parameters associated with the compared methods are configured as follows:

  • RLFC [31] is a relevance-based feature selection approach for label distribution data, which takes both feature relevance and label correlation into account. The parameters \(\alpha \), \(\beta \), and \(\gamma \) are searched in \(\{10^{-3},10^{-2},...,10^{2}\}\).

  • G3WI [50] is a novel gradient-based multi-label feature selection approach, which integrates three-way variable interactions into a comprehensive global optimization objective. The parameters \(\alpha \) and \(\beta \) are tuned in the range of \(\{0,10^{-3},10^{-2},\ldots ,10^{1}\}\).

  • LRLSF [42] is a method for multi-label feature selection that relies on stable label relevance and label-specific features. It integrates both global and local label relevance to learn the features specific to each label. The trade-off parameters \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), and \(\lambda _{4}\) are searched in \(\{10^{-3},10^{-2},...,10^{3}\}\).

  • FSFL [30] is a feature selection approach for label distribution data, which considers feature weight fusion and local label correlations. The number of clusters k is set to 5, while \(\alpha \) and \(\beta \) are tuned in \(\{10^{-3},10^{-2},...,10^{2}\}\).

  • CLEMLFS [12] is a feature selection algorithm for multi-label data, which uses a label enhancement algorithm to convert traditional logical labels into label distributions. The balance factors \(\lambda _{1}\) and \(\lambda _{2}\) are tuned in the range of \(\{10^{-2},10^{-1},...,10^{2}\}\).

Since multi-label feature selection methods cannot directly deal with real-world label distribution datasets, we employ an equal-width strategy to discretize the label distribution datasets prior to performing dimensionality reduction. Furthermore, it is important to highlight that the algorithms mentioned above are capable of ranking features and controlling the length of the selected feature subset on each dataset. To ensure a fair comparison, each experimental algorithm selects an equal number of features on a given dataset. Consequently, feature spaces of varying dimensionality are reduced to the sizes recommended in [15]. The specifics of this process are outlined below:

  • When \(m\le 100\), \(u=40\%\), that is, select the top 40% \(\times \) m features.

  • When \(100<m\le 500\), \(u=30\%\), that is, select the top 30% \(\times \) m features.

  • When \(500<m\le 1000\), \(u=20\%\), that is, select the top 20% \(\times \) m features.

  • When \(m>1000\), \(u=10\%\), that is, select the top 10% \(\times \) m features.

Here, m represents the number of original features. We adopt a 10-fold cross-validation strategy for conducting our experiments. Subsequently, the SA-BFGS algorithm [6] is utilized to train the learner on the aforementioned datasets after feature selection. This process yields the output and provides the predictive performance during testing.
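The rule above amounts to a simple mapping from the original dimensionality m to the number of selected features; the helper below is a hypothetical transcription (rounding to the nearest integer is an assumption).

```python
def num_selected_features(m):
    """Number of features to keep for a dataset with m original features."""
    if m <= 100:
        u = 0.40
    elif m <= 500:
        u = 0.30
    elif m <= 1000:
        u = 0.20
    else:
        u = 0.10
    return int(round(u * m))
```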

Table 3 Comparison results of different feature selection algorithms on real-world datasets measured by Chebyshev(\(\downarrow \))

4.4 Experimental results

The purpose of this subsection is to demonstrate the efficiency of FSLSF on the label distribution datasets in terms of Chebyshev, Clark, Canberra, KL, Cosine, and Intersection. All algorithms employ the same cross-validation method for performance evaluation in our experiments. The experimental results of the six algorithms on the test sets for the selected feature subsets, under the six evaluation criteria, are presented in Tables 3, 4, 5, 6, 7 and 8. Each table provides a performance overview of the experimental algorithms, evaluated with the corresponding metric. The second-to-last row in each table, labeled “Avg. Ranking”, records the average rankings of these experimental algorithms. Furthermore, the Win/Tie/Loss counts comparing FSLSF to the other algorithms are displayed in the final row. The best performance across the six feature selection algorithms is highlighted in bold.

From Tables 3-8, it is clear that our proposed method, FSLSF, consistently outperforms the other compared algorithms in most instances. The average rankings of FSLSF under the evaluation metrics Chebyshev distance, Clark distance, Canberra distance, Kullback–Leibler divergence, cosine similarity, and intersection similarity are 1.25, 1.31, 1.50, 1.38, 1.56, and 1.50, respectively. Moreover, detailed comparisons of the algorithms reveal the following:

  1. With respect to Chebyshev in Table 3, FSLSF exhibits superior performance on 13 datasets. Taking the dataset Natural Scene as an example, FSLSF achieves the best value of 0.2977, a Chebyshev distance 11.31% to 31.20% lower than those of the RLFC, G3WI, LRLSF, FSFL, and CLEMLFS algorithms. Furthermore, the “Win\Tie\Loss” statistics indicate that FSLSF performs better than the RLFC, G3WI, and LRLSF algorithms on all datasets.

  2. As depicted in Table 4 for the Clark metric, FSLSF delivers the best performance on almost all datasets, except for SBU_3DFE, Yeast-heat, and Yeast-cold. Notably, the performance enhancement on the SJAFFE and Movie datasets is quite significant. For example, on the SJAFFE dataset, the proposed method FSLSF is 5.46%, 18.31%, 21.74%, 36.90% and 12.34% better than RLFC, G3WI, LRLSF, FSFL, and CLEMLFS, respectively. In addition, FSLSF attains near-optimal performance on the remaining 3 datasets, marginally falling short of FSFL on SBU_3DFE, Yeast-heat, and Yeast-cold.

    Table 4 Comparison results of different feature selection algorithms on real-world datasets measured by Clark(\(\downarrow \))
  3. According to the Canberra metric presented in Table 5, FSLSF achieves the highest or near-optimal performance on 13 datasets. In particular, FSLSF outperforms the second-ranked algorithms by 14.16% and 28.91% on Yeast-alpha and Yeast-cdc, respectively. FSLSF achieves a moderate level of performance on 3 datasets, being slightly inferior to CLEMLFS and FSFL on the Yeast-heat and Yeast-cold datasets, and slightly inferior to CLEMLFS and G3WI on the RAF-ML dataset. Despite not always achieving the best performance, FSLSF consistently outperforms RLFC and LRLSF on all datasets and delivers the best average performance. This underscores the effectiveness and versatility of the proposed model.

    Table 5 Comparison results of different feature selection algorithms on real-world datasets measured by Canberra(\(\downarrow \))
    Table 6 Comparison results of different feature selection algorithms on real-world datasets measured by Kullback–Leibler(\(\downarrow \))
  4. For the Kullback–Leibler metric, shown in Table 6, it is clear that FSLSF obtains the highest performance on 12 datasets, excluding SBU_3DFE, Yeast-heat, Yeast-cold, and RAF-ML. In detail, on the dataset Yeast-diau, the proposed method obtains a Kullback–Leibler value of 0.0086, which is 23.72%-46.13% better than RLFC, G3WI, LRLSF, FSFL, and CLEMLFS. In addition, on the dataset SBU_3DFE, FSFL ranks second and is 2.42% inferior to RLFC.

  5. With respect to the Cosine metric in Table 7, FSLSF exhibits the highest performance on 12 datasets. In detail, on the dataset Movie, the proposed method outperforms RLFC, G3WI, LRLSF, FSFL, and CLEMLFS by up to 13.63%, 4.95%, 22.52%, 9.88% and 15.36%, respectively. In addition, FSLSF performs slightly worse than RLFC and CLEMLFS on the SBU_3DFE dataset and slightly worse than FSFL and CLEMLFS on the Yeast-cold dataset.

    Table 7 Comparison results of different feature selection algorithms on real-world datasets measured by Cosine(\(\uparrow \))
    Table 8 Comparison results of different feature selection algorithms on real-world datasets measured by Intersection(\(\uparrow \))
  6. In relation to the Intersection metric, as shown in Table 8, FSLSF achieves superior performance on 12 datasets. The “Win\Tie\Loss” statistics indicate that FSLSF consistently outperforms RLFC. Furthermore, when FSLSF ranks first, it surpasses the second-ranked algorithms by 1.50%, 1.36%, 0.69%, 5.39% and 4.94% on the SJAFFE, Human Gene, SBU_3DFE, Natural Scene and Movie datasets, respectively. When FSLSF falls behind, it is 0.27%, 0.26%, 1.26% and 0.05% inferior to the first-ranked algorithm on the Yeast-elu, Yeast-heat, Yeast-cold and RAF-ML datasets, respectively. These statistics underscore that FSLSF's performance improvements outweigh its reductions, thereby confirming its effectiveness.

Fig. 4 Six metrics performance of SA-BFGS on the dataset Natural Scene

To provide a comprehensive comparison between our proposed algorithm FSLSF and the other algorithms under consideration, we conducted a series of experiments to illustrate the performance trends for varying numbers of selected features. The fluctuations of the six evaluation metrics under the different experimental methods are depicted in Fig. 4. Each subfigure shows the performance change under a specific evaluation metric as the number of features varies: the x-axis represents the number of selected features, increasing with a step size of three, while the y-axis shows the performance after feature selection.

As seen in Fig. 4, our proposed algorithm FSLSF consistently surpasses the other five state-of-the-art algorithms in most cases, and the advantage is more pronounced when the number of features is greater than 70. Taking the Cosine similarity as an example, FSLSF outperforms RLFC, G3WI, LRLSF, FSFL, and CLEMLFS by up to 40.25%, 16.44%, 2.51%, 17.10%, and 23.03%, respectively. Moreover, it is important to note that there is no universally optimal value for the number of selected features: for the Chebyshev metric the optimum is reached at the 50th feature, whereas for the Intersection metric it is reached at the 88th feature.

In summary, our proposed algorithm outperforms both the multi-label feature selection algorithms and the label distribution feature selection algorithms. Unlike the training process of multi-label feature selection, which relies on discretized label distribution data, our algorithm fully capitalizes on the label distribution information to guide the selection of a more discriminative feature subset. Furthermore, our algorithm surpasses the label distribution feature selection algorithms, an improvement that can be attributed to its ability to simultaneously learn common and label-specific features.

4.5 Parameter analysis

In this section, we examine the impact of the parameters on the performance of the FSLSF algorithm, conducting a parameter analysis based on the Chebyshev, Clark, Canberra, KL, Cosine, and Intersection metrics. The proposed FSLSF method involves four parameters: \(\alpha \), \(\beta \), \(\gamma \), and \(\delta \). Here, \(\alpha \) is the regularization parameter for the label-specific features, and \(\beta \) is the regularization parameter for the common features. In addition, \(\gamma \) modulates the contribution of label correlation, and \(\delta \) governs the influence of feature relevance. In the experiments, we fix two parameters while varying the other two.

Fig. 5 Performance of FSLSF on Yeast-dtt datasets w.r.t \(\alpha \) and \(\beta \)

Fig. 6 Performance of FSLSF on Yeast-dtt datasets w.r.t \(\gamma \) and \(\delta \)

Table 9 Comparison of computational time for the six algorithms

First, we set \(\gamma =0.001\) and \(\delta =0.001\), and search for the optimal \(\alpha \) and \(\beta \) through parameter comparison experiments. For the dataset Yeast-dtt, the parameters \(\alpha \) and \(\beta \) each vary from 0.001 to 100. Figure 5 shows how the evaluation metrics change as \(\alpha \) and \(\beta \) grow, and reveals the distinct roles the two parameters play across the metrics. For Chebyshev, Clark, Canberra, and Kullback–Leibler, FSLSF's performance fluctuates, increasing and decreasing as \(\alpha \) and \(\beta \) rise. Conversely, under the Cosine and Intersection metrics, FSLSF's performance remains notably stable, with no significant changes observed. In addition, as \(\alpha \) increases, the influence of parameter \(\beta \) on the algorithm decreases, and FSLSF reaches stability when \(\alpha \) reaches 10. Similarly, as \(\beta \) increases, there is no obvious variation in algorithm performance; when \(\beta \) is 0.001, FSLSF achieves better performance in most cases. Then, we fixed \(\alpha =0.001\) and \(\beta =0.001\) and investigated the impacts of the parameters \(\gamma \) and \(\delta \). Figure 6 illustrates that, with the exception of the Kullback–Leibler metric, the performance of FSLSF remains consistent across nearly all metrics. In Fig. 6(d), the Kullback–Leibler metric obtains its optimal value when \(\gamma =1\) and \(\delta =0.001\). In general, these results demonstrate that FSLSF's performance remains stable despite changes in its parameters.

4.6 The comparison of computational time

This subsection is dedicated to assessing the time efficiency of the FSLSF algorithm. Table 9 presents the computation time of our method alongside the other feature selection algorithms. The row labeled ‘Average’ signifies the mean computation time across all datasets, and all times are reported in seconds.

As shown in Table 9, FSLSF exhibits a moderate average computational time, somewhat longer than those of the G3WI and LRLSF algorithms. Since our proposed algorithm takes into account not only label correlation but also the information-theoretic relevance between each feature and label, it requires more run time. In addition, the G3WI algorithm benefits from pre-computing the feature correlation and feature interaction matrices, which significantly reduces its computational time. Although FSLSF is not optimal in terms of computational efficiency, its performance on the six evaluation metrics is better than that of G3WI and LRLSF. In general, compared with the other five algorithms, the proposed feature selection algorithm FSLSF remains highly competitive.

5 Conclusion and future work

Some existing label distribution feature selection methods predominantly select a subset of common features that are shared by all class labels. However, in label distribution learning, an instance is linked with multiple class labels at once, and each class label may be influenced by its own specific features. Traditional feature selection approaches therefore ignore label-specific features that are relevant and discriminative for each class label during the feature selection process. To tackle this issue, we develop a feature selection approach that utilizes label-specific features to select highly informative features for label distribution data. Specifically, the label-specific features are extracted through \(l_{1}\)-regularization, while common features are learned through \(l_{2,1}\)-regularization. Then, the correlation information among labels and the relevance between features and labels are exploited to guide the feature selection process and facilitate the generation of an optimal feature subset. Comprehensive experiments demonstrate that our method achieves highly competitive performance against other well-established feature selection algorithms. However, the proposed method does not directly address the complexities of incomplete label distribution data and exhibits sensitivity to noisy data. In future work, we will focus on the challenges of partial label distribution learning and propose corresponding robust feature selection algorithms.