1 Introduction

Dysarthria is a speech disorder that impairs an individual's ability to produce clear speech, often resulting in reduced speech intelligibility [11, 20, 29]. Dysarthria can arise from various factors, including neurological conditions such as Parkinson’s disease, cerebral palsy, muscular dystrophy, Wilson’s disease, and brain injury or trauma [31, 40]. Depending on the severity of the condition, the speech of a person with dysarthria can range from nearly normal to completely unintelligible, particularly for unfamiliar listeners. Conventionally, the intelligibility level of speech is evaluated through subjective assessments that are time-consuming, expensive, and susceptible to bias arising from experts’ prior familiarity with the specific medical condition [17]. Recently, automated objective assessment methods have gained popularity due to their accessibility and reliable performance, making them a promising option for early-stage diagnosis of dysarthria [10].

The speech of an individual with dysarthria differs from that of a normal speaker in voice stability, speaking rate, loudness level, fundamental frequency, and voice breaks [4, 42]. To represent these aspects of speech production, several acoustic features in the spectral, temporal, and spectro-temporal domains have been investigated for the assessment of dysarthric speech. Spectral features such as the spectral centroid (SC), spectral bandwidth, spectral flatness (SF), spectral tilt (ST), spectral roll-off (SRoll), spectral flux, second-order spectral flux, the ratio of SF and SC, and the ratio of ST and SRoll have been used for this purpose. Cepstral features such as Mel-frequency cepstral coefficients (MFCC) [6, 19], perceptual linear prediction cepstral coefficients [29], multitaper MFCC [37], constant Q cepstral coefficients [7], linear prediction cepstral coefficients [33], and cepstral coefficients computed through frequency-domain linear prediction [14] have also been considered. Cepstral coefficients obtained from the single-frequency filtered signal and its analytic phase have likewise been reported for the assessment of dysarthric speech [14].

Excitation and prosodic features such as pitch, intonation, number of syllables, number of pauses, speaking rate [15], jitter and shimmer, harmonics-to-noise ratio, degree of voice break, articulation rate, vocalic duration [18], and short-term and long-term temporal dynamics [12, 13] have been reported for the assessment of dysarthric speech. Auditory characteristics concerning the fundamental frequency and formants [46] have also been examined for the evaluation of dysarthric speech. An automatic speech recognition (ASR) system trained on non-impaired speech yields a significantly lower word recognition rate for dysarthric speech [17, 27]. In [30], an ASR system is used to generate features that capture a speaker’s pronunciation characteristics; these features are then partitioned into subsets for the intelligibility evaluation of dysarthric speech. Model-based features such as state-level log-likelihood ratios, log-likelihoods, and word recognition rates have been used for the assessment of dysarthric speech. The fine-level excitation source features and gross-level vocal tract resonance structures carry different cues for dysarthria [38, 39]. Several reported works show that a combination of excitation, prosodic, and spectral features further improves the performance of an automated dysarthric assessment system [15, 32, 38].

The use of spectro-temporal features in convolutional neural networks (CNN) has recently received attention for the assessment of dysarthric speech [2, 8, 9, 22]. Time-frequency representations, such as the short-term Fourier transform (STFT) spectrogram (STFT-SPEC), the Mel spectrogram (MEL-SPEC), and the spectrogram computed from the single-frequency filtered signal, are generally used as the input to the CNN. The performance of a CNN-based classifier using temporal, spectral, and spectro-temporal representations of speech signals has been explored for the task of dysarthric-level assessment [9]. In addition, first-order and second-order derivatives of the MEL-SPEC have also been studied. This work shows that the spectro-temporal input yields better performance than the temporal and spectral representations. Other spectro-temporal representations of speech signals, such as two-dimensional discrete cosine transform coefficients and bivariate polynomial coefficients, have also been studied for the assessment of dysarthric speech [8]. As noted above, the fine-level excitation source features and gross-level vocal-tract resonance structures carry different cues for dysarthria [22, 38, 39]. A single-channel CNN employing such representations does not ensure that the source and system information are captured simultaneously, because the filtering operation uses a fixed-size filter [22]. Consequently, better modeling can be achieved by separating the fine-level features from the gross-level variations before the filtering operation in the CNN. Building upon these observations, this paper presents an approach for capturing fine-level features and slow-varying spectral structures through a multi-channel CNN. The major contributions of this paper include:

  • A comparative study employing a single-channel CNN with STFT-SPEC, MEL-SPEC, and MEL-SPEC appended with delta (\(\varDelta \)) and delta-delta (\(\varDelta \varDelta \)) coefficients (MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \)) as inputs, to demonstrate the difficulties of dysarthria assessment in speaker-independent and text-independent mode.

  • An experimental study to demonstrate the impact of convolution filter size on the performance of a CNN-based dysarthric assessment system employing STFT-SPEC.

  • An approach to separate the source and system features present in the STFT-SPEC using discrete wavelet transform (DWT).

  • Simultaneously capturing the source and system features from the DWT-decomposed STFT-SPEC through a multi-channel CNN to improve the performance of the dysarthric assessment system.

The findings from the experiments conducted on the Universal Access (UA)-speech corpus [21] highlight the practical significance of these contributions. The proposed multi-channel CNN dysarthric assessment system outperforms the single-channel CNN using various time-frequency representations, such as STFT-SPEC, MEL-SPEC, and MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \).

The remainder of this paper is structured as follows: Sect. 2 briefly explains the two-dimensional DWT of the STFT-SPEC and CNN. Section 3 outlines the experimental setup for the development of a dysarthric-level assessment system. The findings from the experiments are detailed in Sect. 4. Finally, Sect. 5 concludes this work.

2 Materials and Methodology

The present study employs DWT to decompose the input STFT-SPEC. This decomposition facilitates the separation of source and system features, contributing to enhanced modeling of dysarthric information through a multi-channel CNN. This section first presents the two-dimensional (2-D) DWT and its significance in separating the source and system features present in STFT-SPEC. Next, the CNN is presented briefly.

2.1 Decomposition of STFT-SPEC Using DWT

The DWT is a powerful tool for representing the time and frequency correlation of a non-stationary signal [45]. DWT-based signal decomposition and reconstruction have been successfully applied to different types of data, such as biomedical signals [3, 41], images [25, 26, 28], and speech [23, 24, 36, 45], for analysis, noise suppression, and feature representation. For a time-varying signal like speech, the STFT magnitude spectra (STFT-MS) computed over short analysis frames in an overlapping manner are used to form the STFT-SPEC representation. Generally, the duration of the analysis frame and the overlap between two adjacent frames remain fixed for the entire utterance. The slow-varying envelope of the STFT-MS signifies the resonance structure of the vocal tract system, whereas the rapidly varying fine structure in the STFT-MS is due to pitch and breathiness. Like the resonance structure, the nature of the fine-level structure also depends on the speaker and the underlying sound units. Consequently, both the slow-varying envelope and the fine structure contain features related to dysarthria [22]. The main objective of this research is to simultaneously capture the source and system information to improve the effectiveness of the CNN-based dysarthric assessment system, which can be achieved by decomposing the STFT-SPEC using DWT.

Fig. 1

Block diagram representing two-dimensional DWT decomposition of the STFT-SPEC. LL, LH, HL, and HH represent the approximation coefficient, horizontal detail coefficient, vertical detail coefficient, and diagonal detail coefficient, respectively

As illustrated in Fig. 1, the 2-D DWT of the STFT-SPEC is performed by first applying the 1-D DWT to the rows (frequency) of the STFT-SPEC, as follows:

$$\begin{aligned} y_{H}[k] = \sum _{n} y[n]\cdot g[2k -n] \end{aligned}$$
(1)
$$\begin{aligned} y_{L}[k] = \sum _{n} y[n]\cdot h[2k -n] \end{aligned}$$
(2)

where g and h represent the high-pass filter (HPF) and low-pass filter (LPF), respectively, and y[n], \(y_{H}[k]\), and \(y_{L}[k]\) denote the original, high-pass filtered, and low-pass filtered rows of the STFT-SPEC, respectively. Since \(y_{H}[k]\) and \(y_{L}[k]\) together contain redundant information, both outputs are down-sampled by a factor of 2. The resulting sub-band STFT-SPEC is then decomposed along the columns (time) by applying the LPF and HPF and again down-sampling by a factor of 2. For the sake of simplicity and expedited decomposition, the Haar wavelet [43] is employed as the basis function in this study:

$$\begin{aligned} \psi (t) = \left\{ \begin{array}{ll} 1 &{} \quad 0 \le t< 1/2 \\ -1 &{} \quad 1/2 \le t < 1\\ 0 &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$
(3)

where \(\psi (t)\) denotes the Haar mother wavelet.

A single-level decomposition of the STFT-SPEC results in four sub-bands, namely approximation coefficient, horizontal detail coefficient, vertical detail coefficient, and diagonal detail coefficient [28]. These four sub-bands cover all the frequency components of the original STFT-SPEC.
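As a minimal sketch of the decomposition described above, the single-level 2-D Haar DWT of a spectrogram can be computed with the PyWavelets package; the function name and the placeholder spectrogram dimensions below are illustrative assumptions, not part of the original implementation.

```python
import numpy as np
import pywt  # PyWavelets


def decompose_spectrogram(stft_spec):
    """Single-level 2-D Haar DWT of an STFT magnitude spectrogram.

    Rows (frequency) are filtered and down-sampled first, then columns (time),
    yielding the approximation (LL), horizontal detail (LH), vertical detail (HL),
    and diagonal detail (HH) sub-bands described in Sect. 2.1.
    """
    LL, (LH, HL, HH) = pywt.dwt2(stft_spec, wavelet='haar')
    return LL, LH, HL, HH


# Illustrative usage with a placeholder spectrogram (frequency bins x frames).
spec = np.abs(np.random.randn(262, 206))
LL, LH, HL, HH = decompose_spectrogram(spec)
print(LL.shape, LH.shape, HL.shape, HH.shape)  # each dimension roughly halved
```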

2.2 Significance of DWT in Separating the Source and System Features

Figure 2 shows the single-level decomposition of an input STFT-SPEC using the Haar wavelet function. The speech signal, devoid of silence, for the word dissatisfaction sourced from the UA database [21] and uttered by female speakers of high, mid, low, and very low intelligibility is given in Fig. 2A–D, respectively. The corresponding STFT-SPEC, computed by analyzing the input speech using a fixed Hamming window of 25 ms duration with a frameshift of 10 ms, is shown in Fig. 2E–H, respectively. For this analysis, the STFT is computed using 512 bins. For each intelligibility level, Fig. 2I–X shows the approximation, horizontal detail, vertical detail, and diagonal detail coefficients, respectively. Comparing the approximation and vertical detail coefficients shows that the approximation coefficients mostly contain the formant structure, whereas the vertical detail coefficients contain the fine detail of the pitch contour. On the other hand, the horizontal detail coefficients contain the frequency distribution at the onset and offset of sound units. Furthermore, the spectro-temporal variation in each sub-band is quite different for each dysarthric level. Consequently, an improved intelligibility assessment system can be developed by exploiting the distinct nature of the sub-band spectrograms using a multi-channel CNN.

Fig. 2

Figure illustrating the separation of source and system features using a one-level DWT of the spectrogram. A–D show the silence-removed speech signal of the word dissatisfaction uttered by female dysarthric speakers of high, mid, low, and very low intelligibility levels, respectively. E–H show the corresponding original spectrograms. I–L, M–P, Q–T, and U–X show the approximation coefficients, horizontal detail coefficients, vertical detail coefficients, and diagonal detail coefficients, respectively

2.3 Convolutional Neural Network

A CNN is a neural network variant designed for processing and analyzing grid-like data using convolutional operations. Convolution combines a filter with a feature map, where both the feature map and the filter are matrices. The filter is moved systematically across the feature map, from the top-left corner to the bottom-right corner, and each step of this movement is called a stride. The region of the feature map covered at each step is known as the receptive field, and it has the same size as the filter [34].

Every step involves a mathematical interaction between the filter and the receptive field that produces a single output value. Together, these outputs produce a new feature map:

$$\begin{aligned} z^{(l)} = \sigma \left( f^{(l)} *z^{(l-1)} + b^{(l)} \right) \end{aligned}$$
(4)

In the given expression, \(z^{(l-1)}\) and \(z^{(l)}\) correspond to feature maps in successive layers. The convolution operation, represented as \(*\), is executed using the filter \(f^{(l)}\) on the feature map \(z^{(l-1)}\). Subsequently, the bias \(b^{(l)}\) is added, and finally the activation function \(\sigma (\cdot )\), commonly chosen from options such as the Rectified Linear Unit (ReLU), leaky ReLU, softmax, tanh, and sigmoid [1], is applied to produce the output of the convolutional layer. This layer is followed by a batch normalization layer, which normalizes the input of each layer within a mini-batch, reducing internal covariate shift and making training more efficient and stable, and by a pooling layer, which reduces the spatial dimensions of the feature map for more efficient processing. Afterwards, the processed feature map is flattened into a one-dimensional array, preparing it for the fully connected layers. The fully connected layers connect neurons, allowing the model to learn patterns and associations, and the final output layer produces the network’s prediction. CNNs are widely used in speech processing, image processing, object detection, medical image analysis, denoising [44], and natural language processing.
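As a minimal sketch of one such convolutional stage (convolution, batch normalization, ReLU activation, and max pooling, as in Eq. (4)), the following tensorflow.keras helper is illustrative; the function name and default filter settings are assumptions rather than part of the original implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers


def conv_block(x, n_filters, kernel_size=(5, 5)):
    """One convolutional stage: convolution with filters f and bias b as in Eq. (4),
    followed by batch normalization, ReLU activation, and 2 x 2 max pooling."""
    x = layers.Conv2D(n_filters, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    return x
```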

3 Experimental Setup

The following subsections describe the details of the experimental data and the architecture of the CNN-based dysarthric-level assessment systems.

3.1 Speech Corpus and Selection of Experimental Data

In this study, all the experimental evaluations are performed on the UA-speech corpus [21]. This openly accessible speech repository comprises speech recordings of 15 dysarthric speakers (4 females and 11 males). Each speaker has uttered 765 isolated words, encompassing 300 unique or less-frequent words as well as 3 instances each of radio alphabet entries, digits, computer commands, and commonly used words. The UA-speech corpus includes monosyllables (digits and common words), bisyllables (computer commands and radio alphabet letters), and polysyllables (uncommon words), resulting in increased phoneme sequence diversity. Each speaker’s speech intelligibility is determined using the mean score of a subjective assessment by five native speakers. In this database, all the speakers are grouped into four intelligibility levels, namely very low (\(0-25\)%), low (\(26-50\)%), mid (\(51-75\)%), and high (\(76-100\)%). All the speech utterances are recorded at a 48 kHz sampling rate using an array of eight microphones.

Prior studies have shown that the comprehensibility of dysarthric speech depends on the context and length of the spoken words [5]. The speech data, along with the severity level of dysarthria, also contain speaker-specific and sensor-specific information. Therefore, the effectiveness of an automatic intelligibility assessment system varies depending on how well the text and speakers in the test data match those in the training data [8]. To ensure practical applicability, the automatic assessment system must function in speaker-independent and text-independent (SI-TI) mode. For better insight into the impact of speaker and text variability on the effectiveness of the assessment system, the UA-speech corpus is partitioned into the following training and test sets (a minimal sketch of the speaker-level split is given after the list):

  • Training set: This set comprises speech recordings of one female and one male speaker from each dysarthric level, i.e., very low (M04 and F03), low (M07 and F02), mid (M05 and F04), and high (M08 and F05).

  • Speaker-dependent and text-dependent (SD-TD) test set: This test set contains 25% of the training speech data equally distributed among the speakers present in the training set. In this case, the speaker as well as the text has already been seen by the system during training.

  • Speaker-dependent and text-independent (SD-TI) test set: This test set contains speech recordings of the training speakers excluding the words used in the training set. In this case, the speakers are seen but the text is unseen by the system during training.

  • Speaker-independent and text-dependent (SI-TD) test set: This test set contains common words from the training set spoken by the remaining seven speakers. In this case, the speakers are unseen and the text is seen by the system during the training.

  • Speaker-independent and text-independent (SI-TI) test set: This test set contains uncommon words from the remaining seven speakers who are not present in the training set. Here both the speakers and text are unseen by the system during training.
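The following minimal sketch records the speaker-level split described above; the variable name is illustrative, and the word-level selection logic for the four test sets is only indicated in comments.

```python
# Speaker-level partition of the UA-Speech corpus used in this study.
# The variable name is illustrative; the word-level splits (SD-TD, SD-TI,
# SI-TD, SI-TI) are built on top of this split by selecting seen/unseen words.
TRAIN_SPEAKERS = {
    'very_low': ['M04', 'F03'],
    'low':      ['M07', 'F02'],
    'mid':      ['M05', 'F04'],
    'high':     ['M08', 'F05'],
}
# The remaining seven dysarthric speakers of the corpus are reserved for the
# speaker-independent (SI-TD and SI-TI) test sets.
```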

3.2 Spectro-Temporal Representation of Speech Data

The speech recordings in the UA-speech corpus contain long silence regions, particularly at the beginning and end of each utterance; long silence regions also occur within the utterances. An energy-based voice activity detection [35] is employed to eliminate these silent segments before feature computation, in which speech frames with energy below a predefined threshold (0.1 times the average energy of the entire utterance) are treated as silence. The silence-removed speech data undergoes pre-emphasis with a filter coefficient of 0.97 to enhance high-frequency components. Short-term analysis frames are then generated using an overlapping Hamming window of 25 ms duration at a frameshift of 5 ms. For every analysis frame, the STFT-MS is calculated using 512 bins. These STFT-MS are constructed by maintaining symmetry around the central points, forming the STFT-SPEC. Subsequently, the STFT-MS are processed through a 40-channel Mel-filterbank to calculate the Mel-filterbank energies, and the energies computed across all analysis frames are assembled to represent the MEL-SPEC. The \(\varDelta \) and \(\varDelta \varDelta \) coefficients computed from the MEL-SPEC capture long-term features, such as the temporal dynamics present in the speech signals, and thus add information not captured by the MEL-SPEC itself. In this study, the \(\varDelta \) and \(\varDelta \varDelta \) coefficients are computed by considering two preceding and two succeeding frames and are appended to the MEL-SPEC to form another spectro-temporal representation, MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \). A uniform-sized STFT-SPEC or MEL-SPEC is constructed from the variable-length speech utterances by truncation or zero-padding. Specifically, if the pre-processed speech data exceeds one second in duration, it is truncated from a random starting point to reduce its duration to one second; conversely, shorter utterances are zero-padded to extend their duration to one second.
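A minimal librosa-based sketch of this front end is given below. The sampling rate, the non-overlapping framing used for voice activity detection, and the function structure are illustrative assumptions; only the parameter values quoted above (0.1 energy threshold, 0.97 pre-emphasis, 25 ms window, 5 ms shift, 512 bins, 40 Mel channels, two-frame delta context, one-second length) come from the text.

```python
import numpy as np
import librosa


def spectro_temporal_features(y, sr=16000):
    """Sketch of the spectro-temporal front end of Sect. 3.2 (assumptions noted above)."""
    frame_len, hop = int(0.025 * sr), int(0.005 * sr)

    # Energy-based VAD: keep frames with energy >= 0.1 x the utterance average
    # (non-overlapping frames are used here for simplicity).
    vad_frames = librosa.util.frame(y, frame_length=frame_len, hop_length=frame_len)
    energy = np.sum(vad_frames ** 2, axis=0)
    y = vad_frames[:, energy >= 0.1 * energy.mean()].T.reshape(-1)

    # Pre-emphasis with coefficient 0.97 to enhance high-frequency components.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])

    # Fixed one-second duration: truncate from a random start or zero-pad.
    if len(y) > sr:
        start = np.random.randint(0, len(y) - sr + 1)
        y = y[start:start + sr]
    else:
        y = np.pad(y, (0, sr - len(y)))

    # STFT-SPEC: 25 ms Hamming window, 5 ms shift, 512 bins.
    stft_spec = np.abs(librosa.stft(y, n_fft=512, hop_length=hop,
                                    win_length=frame_len, window='hamming'))

    # MEL-SPEC: 40-channel Mel-filterbank energies.
    mel_spec = librosa.feature.melspectrogram(S=stft_spec ** 2, sr=sr, n_mels=40)

    # Delta and delta-delta over two preceding and two succeeding frames (width = 5).
    delta = librosa.feature.delta(mel_spec, width=5, order=1)
    delta2 = librosa.feature.delta(mel_spec, width=5, order=2)
    mel_spec_d_dd = np.vstack([mel_spec, delta, delta2])

    return stft_spec, mel_spec, mel_spec_d_dd
```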

3.3 Architecture of the Baseline Dysarthric Assessment System

The baseline system for the assessment of dysarthric level utilizes a single-channel CNN. The detailed architecture of the CNN classifier is shown in Fig. 3. As shown in Fig. 3, the CNN classifier comprises three convolutional and two fully connected layers, where each convolutional layer includes a convolutional, batch normalization, activation, and max pooling layer. The first convolutional layer uses 32 filters, while the remaining convolutional layers use 64 filters, each of size \(5 \times 5\). The activation layer utilizes the ReLU activation function, and the pooling layer employs the max pooling function with a size of \(2 \times 2\). The fully connected layers consist of 256 neurons in the initial layer and 16 neurons in the subsequent layer, and each fully connected layer utilizes the ReLU activation function and batch normalization. The extracted features from the previous layer are fed to the output layer, which employs the softmax function to classify the input data into four dysarthric levels. The CNN models underwent training and validation for 100 epochs with a batch size of 8, maintaining a fixed initial learning rate of 0.01. The loss function was optimized using the Adam optimizer, and categorical cross-entropy was employed for loss calculation.
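The baseline just described can be sketched as follows with tensorflow.keras; the input shape and the application of the \(5 \times 5\) filter size to all three convolutional layers are assumptions, while the layer counts, filter counts, fully connected sizes, and training settings follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers


def build_single_channel_cnn(input_shape=(257, 201, 1), n_classes=4):
    """Baseline single-channel CNN of Fig. 3 (input_shape is an illustrative assumption)."""
    inp = layers.Input(shape=input_shape)
    x = inp
    # Three convolutional stages: 32, 64, 64 filters of size 5 x 5,
    # each with batch normalization, ReLU, and 2 x 2 max pooling.
    for n_filters in (32, 64, 64):
        x = layers.Conv2D(n_filters, (5, 5), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    # Two fully connected layers (256 and 16 units) with batch norm and ReLU.
    for units in (256, 16):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    out = layers.Dense(n_classes, activation='softmax')(x)

    model = models.Model(inp, out)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model


# Training settings from the text: 100 epochs, batch size 8.
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, batch_size=8)
```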

Fig. 3

Architecture of the single-channel CNN for assessment of dysarthric level. In the figure, m and n represent the frame index and frequency bin, respectively

4 Experimental Results and Discussions

The dysarthric speech utterances are categorized into four intelligibility levels: high (76–100%), mid (51–75%), low (26–50%), and very low (0–25%). This section examines and discusses the trends found in the experimental results. Throughout all experimental evaluations, accuracy and F1 score are used as the primary performance metrics. The performance of the proposed methodology is further validated using additional metrics such as precision, recall, and the receiver operating characteristic area under the curve (ROC-AUC) [16]. These metrics are computed as follows (a brief scikit-learn-based sketch is given after the list):

  • Accuracy: Accuracy measures the proportion of predictions that match the true classes.

    $$\begin{aligned} \text {Accuracy} = \frac{T_P + T_N}{T_P + T_N + F_P +F_N} \end{aligned}$$
    (5)

    where \(T_P, T_N, F_P\) and \(F_N\) represent the total number of true positives, true negatives, false positives and false negatives, respectively.

  • Specificity: Specificity gauges the classifier’s capacity to accurately identify negative instances.

    $$\begin{aligned} \text {Specificity} = \frac{T_N}{T_N + F_P} \end{aligned}$$
    (6)
  • Recall (Sensitivity): Recall evaluates the ability of a classifier to accurately identify positive instances.

    $$\begin{aligned} \text {Recall} = \frac{T_P}{T_P + F_N} \end{aligned}$$
    (7)
  • Precision: Precision assesses the classifier’s capacity to identify true positive classes.

    $$\begin{aligned} \text {Precision} = \frac{T_P}{T_P + F_P} \end{aligned}$$
    (8)
  • F1 Score: F1 score is computed as the harmonic mean of precision and recall, an integrated metric that balances the trade-off between precision and recall in a single assessment.

    $$\begin{aligned} \text {F1 Score} = 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
    (9)
  • ROC-AUC: The ROC-AUC summarizes the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) at varying threshold settings.

    $$\begin{aligned} \text {ROC-AUC} = \int _{0}^{1} T_{PR} \, d (F_{PR}) \end{aligned}$$
    (10)

    where \(T_{PR}\) is the true positive rate and \(d(F_{PR})\) is the differential (infinitesimal change) in the false positive rate.
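The following sketch shows how these metrics can be obtained for the four-class task with scikit-learn; macro averaging across the four intelligibility levels is an assumption, as the text does not state the averaging scheme.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def evaluate(y_true, y_prob):
    """Metrics of Eqs. (5)-(10). y_true: integer labels (0-3); y_prob: softmax outputs (N, 4)."""
    y_pred = np.argmax(y_prob, axis=1)

    # Per-class specificity from the confusion matrix (Eq. 6), then macro-averaged.
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    specificity = np.mean(tn / (tn + fp))

    return {
        'accuracy':    accuracy_score(y_true, y_pred),
        'specificity': specificity,
        'recall':      recall_score(y_true, y_pred, average='macro'),
        'precision':   precision_score(y_true, y_pred, average='macro'),
        'f1':          f1_score(y_true, y_pred, average='macro'),
        'roc_auc':     roc_auc_score(y_true, y_prob, multi_class='ovr', average='macro'),
    }
```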

4.1 Performance of the Single-Channel CNN Classifier

The performance of the single-channel CNN classifier employing STFT-SPEC, MEL-SPEC, and MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \) as the input spectro-temporal representation is given in Table 1, which reports the accuracy and F1 score for the SD-TD, SD-TI, SI-TD, and SI-TI test sets. The input dysarthric speech data, along with dysarthric features, also contains speaker-specific information, and the dysarthric features captured by the CNN models depend on the context of the spoken utterances. Consequently, the performance of the CNN-based automated dysarthria assessment system depends on the similarity of the speakers and utterances present in the training and test speech data. In this investigation, the system’s performance is examined by varying the speaker and spoken utterance across the training and test speech data. For all spectro-temporal representations, the best accuracy and F1 score are observed when both the text and the speakers are seen by the system (SD-TD test set). On the other hand, degraded performance is observed when both the speaker and the text are unseen by the system (SI-TI test set). For practical application, an automated dysarthric assessment system needs to operate in speaker-independent mode (speakers unseen by the system); therefore, this work focuses on performance improvements for the SI-TD and SI-TI test sets. For these two test sets, the MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \) provides better performance than the STFT-SPEC and MEL-SPEC. These results show that the long-term features captured through the \(\varDelta \)-\(\varDelta \varDelta \) coefficients contain additional dysarthric information. The observed accuracy trend is very similar to that of the CNN-based classifier reported for a similar task in [9]. The MEL-SPEC is derived by passing the STFT-MS through a Mel-filterbank, which applies different levels of smoothing to the frequency bands of the STFT-MS and mostly captures the spectral dynamics. The deployment of multiple feature representations at the input therefore serves a specific objective: obtaining a comprehensive view of dysarthric speech in which both the spectral dynamics and the characteristics of the speaker and spoken utterance are captured.

Table 1 Dysarthric level assessment performance of the single-channel CNN classifier employing STFT-SPEC, MEL-SPEC, and MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \) as the spectro-temporal representation of the input speech data

4.2 Impact of Convolution Filter Size on the Performance of CNN Classifier

In a CNN framework, the long-term features and spectral dynamics of the STFT-SPEC can be captured through careful selection of the convolution filter size. Expanding the filter size in the horizontal direction averages the frequency bins of the STFT-SPEC over a larger number of adjacent frames, whereas increasing the filter size in the vertical direction averages the frequency bins of the STFT-SPEC over a wider frequency range. It is worth mentioning that the MEL-SPEC is already averaged over the frequency bins by the Mel-filterbank, so increased convolutional filtering further smooths its features. Motivated by these observations, this study examines the effectiveness of the single-channel CNN classifier employing STFT-SPEC by varying the size of the convolution filter in the horizontal and vertical directions. The dysarthric-level assessment performance of the single-channel CNN for different convolution filter sizes is summarized in Table 2. Initially, the size of the convolution filter is varied in the horizontal direction while keeping the vertical size fixed. Increasing the size improves the accuracy and F1 score for both the SI-TD and SI-TI test sets, with the best performance observed for a filter size of \(5 \times 7\). Next, the performance is evaluated by varying the filter size in the vertical direction while keeping the horizontal size fixed. The accuracy and F1 score improve further with increasing vertical filter size, and for the SI-TD and SI-TI test sets the best performance is again observed for a filter size of \(5 \times 7\). Furthermore, the observed accuracy and F1 score are better than those of the MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \) presented in Table 1. These experimental results show that long-term features and spectral dynamics can be captured more effectively by properly selecting the size of the convolution filter. However, increasing the filter size in the vertical direction smooths out the fine-level features in the STFT-SPEC. Therefore, the spectral dynamics and fine-level features can be better captured by separating them through spectral decomposition. The next section explores the significance of fine-level features for crafting an automated dysarthric assessment system utilizing a multi-channel CNN.

Table 2 Dysarthric level assessment performance of the single-channel CNN classifier employing STFT-SPEC for the varied size of the convolution filter

4.3 Performance of the Multi-channel CNN Employing DWT of STFT-SPEC

The design of the multi-channel CNN for the assessment of dysarthric speech using DWT of STFT-SPEC is shown in Fig. 4. As discussed earlier, in the MEL-SPEC, the fine-level features are smoothed out by the Mel-filters. Consequently, for modeling the source and system features through the multi-channel CNN, the STFT-SPEC is used as the spectro-temporal input in this study. As depicted in Fig. 4, the STFT-SPEC is subjected to a single-level DWT for extracting the approximation coefficient (LL), horizontal detail coefficient (LH), vertical detail coefficient (HL), and diagonal detail coefficient (HH). The LL, LH, HL, and HH coefficients are then independently given as the input to the four channels of the CNN. The LL and HL coefficients mostly contain the formant structure and pitch contour, respectively. The LH and HH coefficients contain the fine-level features in the lower and higher frequency bands. Consequently, each channel of the CNN captures a different type of information or feature present in the input data.

The multi-channel CNN comprises five convolutional layers and two fully connected layers. Each convolutional layer consists of a convolutional, batch normalization, activation, and max pooling layer; no max pooling is used in the fully connected layers. The initial convolutional layer employs 32 filters, while the subsequent convolutional layers employ 64 filters. To represent the complementary features, different filter sizes are used for each channel, as mentioned in Table 3. The ReLU activation function is used in the activation layer, and a max pooling function of size \(2 \times 2\) is used in the pooling layer. The fully connected layers employ 256 neurons in the initial layer and 64 neurons in the next layer. The hyperparameters of the CNN model are the same as described in Sect. 3.3. The wavelet coefficients are fed to the CNN model through multiple channels, with each channel carrying one coefficient of size \(131 \times 103\), and different convolution filter sizes are used for the four channels to capture the spectral dynamics and fine-level features. Each channel produces intermediate feature maps of sizes \(65 \times 51 \times 32\) and \(32 \times 25 \times 64\), which are then merged to generate features of size \(128 \times 25 \times 64\). These merged features are passed through three additional convolutional layers, resulting in a feature map of size \(16 \times 3 \times 64\). The features are then flattened at the flattening layer into a vector of size 3072, which is subsequently forwarded to the fully connected layers for classification, as illustrated in Fig. 4.
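A minimal tensorflow.keras sketch of this four-channel architecture is given below. The per-channel filter sizes follow the best setting reported in Table 3 (\(5 \times 7\) for LL and LH, \(5 \times 5\) for HL and HH), and the branches are concatenated along the first spatial axis so that the intermediate sizes quoted above (\(128 \times 25 \times 64\), \(16 \times 3 \times 64\), and 3072 flattened features) are reproduced; the function names and padding choice are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers


def conv_stage(x, n_filters, kernel_size):
    # Convolution + batch normalization + ReLU + 2 x 2 max pooling.
    x = layers.Conv2D(n_filters, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    return layers.MaxPooling2D((2, 2))(x)


def build_multi_channel_cnn(sub_band_shape=(131, 103, 1), n_classes=4):
    """Four-channel CNN of Fig. 4 operating on the DWT sub-bands of the STFT-SPEC."""
    kernel_sizes = {'LL': (5, 7), 'LH': (5, 7), 'HL': (5, 5), 'HH': (5, 5)}
    inputs, branches = [], []
    for name, ksize in kernel_sizes.items():
        inp = layers.Input(shape=sub_band_shape, name=name)
        x = conv_stage(inp, 32, ksize)   # 131 x 103 -> 65 x 51 x 32
        x = conv_stage(x, 64, ksize)     # 65 x 51   -> 32 x 25 x 64
        inputs.append(inp)
        branches.append(x)

    # Merge along the first spatial axis: 4 x (32 x 25 x 64) -> 128 x 25 x 64.
    x = layers.Concatenate(axis=1)(branches)
    for _ in range(3):                   # 128 x 25 -> 64 x 12 -> 32 x 6 -> 16 x 3
        x = conv_stage(x, 64, (5, 5))

    x = layers.Flatten()(x)              # 16 * 3 * 64 = 3072 features
    for units in (256, 64):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    out = layers.Dense(n_classes, activation='softmax')(x)

    model = models.Model(inputs, out)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```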

The dysarthric assessment performance of the multi-channel CNN employing the DWT of the STFT-SPEC, in terms of accuracy and F1 score, is given in Table 3. Examination of Table 3 reveals that the highest accuracy and F1 scores are achieved when the lower frequency band coefficients (LL and LH) are convolved with a filter size of \(5 \times 7\) and the higher frequency band coefficients (HL and HH) are convolved with a filter size of \(5 \times 5\). This is mainly because the lower frequency band contains the formant dynamics, while the other coefficients contain the fine structure of the STFT-MS. These experimental results show that the fine structure of the STFT-MS contains additional dysarthric features; consequently, modeling these features together with the vocal tract resonance structure improves the performance of an automated dysarthric-level assessment system. To enhance the dependability of the findings, the performance of the multi-channel CNN is further evaluated on the SI-TI test set using various metrics. As highlighted in Sect. 4.1, the primary focus of this study is to enhance the model’s performance in the SI-TI scenario. All the metrics used to evaluate the SI-TI performance are presented in Table 4, from which it can be observed that the best performance is again obtained when the lower frequency band coefficients are convolved with a filter size of \(5 \times 7\) and the higher frequency band coefficients with a filter size of \(5 \times 5\), which validates the claims made in this study.

Fig. 4

Block diagram representing the architecture of multi-channel CNN for the classification of dysarthric speech using the DWT of STFT-SPEC

Table 3 Dysarthric level assessment performance of the multi-channel CNN classifier employing DWT of STFT-SPEC
Table 4 Dysarthric level assessment performance of the multi-channel CNN classifier employing DWT of STFT-SPEC for SI-TI test set

5 Conclusion

This paper presents a single-channel CNN-based dysarthric assessment system employing spectro-temporal representations such as STFT-SPEC, MEL-SPEC, and MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \) in both speaker-dependent and speaker-independent modes. The significance of the convolution filter size in capturing the long-term features and spectral dynamics from the STFT-SPEC is then presented. Experimental findings indicate that, as with MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \), the long-term features and spectral dynamics can be captured more effectively from the STFT-SPEC by performing convolution operations over a relatively larger number of analysis frames and frequency bins. Next, to independently capture the spectral dynamics and fine-level features present in the STFT-SPEC, it is subjected to a single-level DWT. The resulting approximation, horizontal detail, vertical detail, and diagonal detail coefficients are fed as independent inputs to a four-channel CNN. Each channel of the CNN is convolved with a different convolution filter size to capture an independent set of features. These features are finally merged and flattened to represent a feature vector containing vocal-tract and excitation-based features. The experimental findings on the UA-speech corpus indicate that the proposed multi-channel CNN approach provides improved accuracy and F1 score compared to the single-channel CNN employing STFT-SPEC, MEL-SPEC, and MEL-SPEC-\(\varDelta \)-\(\varDelta \varDelta \) for the assessment of dysarthric speech in speaker-independent and text-independent mode.

The current research can be further extended by exploring various speech-embedding techniques in the multiple channels of the CNN. Long short-term memory networks may also be explored to capture the sequential nature of the speech signal. Other signal processing techniques may be explored to capture better features for dysarthric speech, and optimization algorithms may be investigated to refine the features extracted for dysarthric speech from various sources.