1 Introduction

Symptom comorbidity between neurodevelopmental disorders has been increasing over the past years [1]. Autism spectrum disorder (ASD) and attention deficit/hyperactivity disorder (ADHD) show considerable symptomatology overlap [2], with comorbidity rates ranging between 50 and 70% [3]. ASD is characterized by persistent deficits in social communication and social interaction across multiple contexts, alongside restricted and repetitive patterns of behavior, interests, or activities [4], while ADHD is characterized by altered functioning in the areas of attention, hyperactivity, and impulsivity. The prevalence of both disorders has increased over the past decades [5]; both are more common in boys than in girls and cause significant disruption in children’s lives, such as difficulties in school performance or in social interactions with other children and adults [6].

Children and adolescents diagnosed with comorbid ASD/ADHD often present poorer behavioral, cognitive, and socio-emotional outcomes than children and adolescents with ASD alone, suggesting that ASD/ADHD comorbidity may constitute a distinctive phenotype characterized by marked difficulties compared to ASD alone (see [7] for a detailed review). Indeed, those with comorbid ASD/ADHD present more severe autistic symptoms, particularly marked impairments in social interaction abilities, i.e., social responsiveness [8, 9], and show greater difficulties in adaptive skills [10] and poorer quality of life [11] than children and adolescents with ASD alone. In addition, compared to children with ASD only, children with both ASD and ADHD display lower cognitive functioning [12], especially problems in inhibitory control, attention abilities [13] and Theory of Mind [2]. Evidence also shows that children with ASD/ADHD exhibit more stereotypic and repetitive behaviors, such as mannerisms [12], tantrum behaviors and conduct problems [14], than those diagnosed with ASD alone. These increased difficulties frequently lead to anxiety, externalizing, and mood problems [15].

These findings have relevant implications for clinical practice, as they evidence the negative impact that the co-occurrence of ASD and ADHD symptoms has on multiple functioning outcomes in children. Thus, the early identification of ASD/ADHD comorbidity is important not only to help establish an earlier and more accurate diagnosis, but also to determine and optimize intervention programs for these children. The early identification of comorbidities facilitates the implementation of specific interventions for ASD and/or ADHD, positively influencing children’s adherence to treatment and functioning outcomes [16].

Altered sensory abilities are frequently co-observed in ASD and ADHD. Thus, a possible approach to the early identification of ASD/ADHD comorbidity may rely on sensory processing abnormalities. Difficulties in sensory processing are among the earliest and most common clinical observations in children with neurodevelopmental disorders, especially in those with ASD and behavioral problems [17], compared to typically developing children [18]. ASD is most often associated with impairments in sensory processing [19], but ADHD also presents atypical sensory processing [20]. Differences in the unusual responses to sensory experiences exist between children with ASD alone, with ADHD alone, and with ASD/ADHD [21]. Children with ASD seem to have greater difficulties in oral processing (i.e., taste) than children with ADHD, who in turn present more difficulties in visual processing [17] and proprioception [22]. Compared to children with ASD or ADHD alone, children with ASD/ADHD present greater overall sensory difficulties, e.g., visual, auditory or balance and motion [22]. However, a further characterization of sensory processing impairments in children with ASD/ADHD, ASD and ADHD is still lacking. Thus, clarifying the specificity of sensory processing difficulties in children with ASD and/or ADHD may provide important information for an early differential diagnosis between these neurodevelopmental conditions, with important implications for intervention programs.

Importantly, machine learning methods are a very promising avenue towards more successful clinical decision making and diagnosis [23]. The systematic review of Song et al. [24] showed that deep learning improved the accuracy of ASD and/or ADHD diagnosis by extracting clinical information from standardized diagnostic evaluations of neurodevelopmental disorders. The recent work of [25] showed good overall sensitivity and specificity of machine learning methods in the diagnostic classification of ASD, particularly in children with less severe symptoms, for whom establishing a diagnosis is frequently more difficult. Also, a previous study from our research group showed that machine learning techniques may be useful in predicting behavioral problems from altered sensory processing in children and adolescents with ASD [26].

Duda et al. [27] classified 2,925 ASD and ADHD participants (150 with ADHD) using six machine learning classifiers (decision trees, linear discriminant analysis, random forest, support vector machines, logistic regression and categorical LASSO) and forward feature selection on Social Responsiveness Scale (SRS) score sheets [28], achieving an area under the curve of 96% with only 5 of the 65 scores. Wall et al. [29] used 16 classifiers to diagnose autism using the Autism Diagnostic Observation Schedule, ADOS [30], for 612 individuals, achieving a perfect classification with decision trees. A deep learning approach [31] achieved 70% accuracy in the diagnosis of ASD using images of brain activation patterns from 1,035 patients. Abdullah et al. [32] used the Autism Spectrum Quotient (AQ) questionnaire [33] to diagnose autism using random forest, k-nearest neighbors and logistic regression classifiers with three feature selection methods (chi-square, LASSO and random forest), achieving 97% accuracy on a dataset of 704 persons, 189 of them with ASD.

Hence, applying machine learning methods to atypical sensory processing could be an effective approach to the early detection of ASD/ADHD comorbidity. The current study uses classification techniques to detect comorbid ADHD symptomatology in children and adolescents diagnosed with ASD. This is a binary classification problem with classes ASD and ASD/ADHD. The class prediction is performed using data about specific altered sensory processing abilities, described in section 2. It is expected that, in accordance with previous studies using machine learning, persons with comorbid ASD/ADHD will present greater sensory processing difficulties than those with ASD alone. Given the low performance achieved by the standard classifiers (section 3), we propose alternative approaches specifically tailored to this problem in sections 4 and 5, which increased the classification performance. Since the data are high-dimensional and exhibit class unbalance, we also tried a wide set of feature selection and unbalanced classification methods in subsections 6.1 and 6.2, respectively. The results are discussed in subsection 6.3. Section 7 compiles the main findings of this work.

2 Dataset description

We used a dataset that includes 52 children and adolescents, aged from 6 to 14 years (mean\(=\)8.25; standard deviation\(=\)2.70), diagnosed with ASD. Of these, 14 children and adolescents also present comorbid ADHD (26.9% of the total sample). All the diagnoses were made by qualified professionals in neurodevelopmental disorders and mental health based on: 1) the ASD criteria established by the Diagnostic and Statistical Manual of Mental Disorders [4] in its revised fourth version, DSM-IV-TR, or fifth version, DSM-5; or 2) specific diagnostic instruments such as the Autism Diagnostic Interview-Revised, ADI-R [34], and the Autism Diagnostic Observation Schedule, ADOS [30]. The ADHD comorbidity diagnosis was also established by clinicians and mental health professionals based on the ADHD criteria of the DSM-IV-TR or DSM-5. All parents who agreed to participate in this research gave written informed consent. Informed consent was obtained in accordance with the Declaration of Helsinki. This study was approved in 2012 by the Galician ethics committee for clinical research, with ethics board protocol number 2012/098.

Each participant in this study is described by a vector of features, where each feature is the numeric answer to a question of the Sensory Profile-2 questionnaire (SP2), Spanish version 3:0-14:11 [35]. The SP2 is a standardized tool to assess children and adolescents’ sensory processing patterns in the context of everyday life. This questionnaire helps to determine how sensory processing difficulties in children and adolescents with neurodevelopmental disorders may be contributing to or interfering with their participation in activities and influencing their behavior. The questionnaire forms for children aged 3-14 years were completed by the caregivers. The SP2 measures children and adolescents’ sensory processing patterns across different sensory modalities (Auditory Processing, Visual Processing, Touch Processing, Movement, Body Position and Oral Processing) through 86 questions scored on a five-point Likert-type scale (1\(=\)“Almost Never”, 2\(=\)“Occasionally”, 3\(=\)“Half of the time”, 4\(=\)“Frequently”, 5\(=\)“Almost Always”). Table 1 reports, for each group of questions in the SP2 questionnaire, the number, nature and list of the questions included. These scores characterize the sensory processing response into four quadrants: seeking, avoiding, sensitivity and registration. In this study, these four quadrants, alongside the total score (all the items in the questionnaire) and the touch processing sensory section score, were used. The latter was included due to the evidence suggesting that tactile impairments are related to the onset of ASD, and because it is the most common sensory alteration in this population [36, 37].

Table 1 List of the 6 datasets created from the SP2 questionnaire: the four SP2 quadrants avoiding, registration, seeking and sensitivity, alongside touch and total

The machine learning classifiers use the feature vector, standardizing each feature to zero mean and unit standard deviation, as input data in order to predict the class of each participant: ASD alone, or ASD/ADHD. Each SP2 group defines a different dataset; e.g., avoiding has the 20 questions listed in the corresponding row of Table 1, while total includes the 86 questions of the SP2. The populations of both classes are unbalanced, because there are almost three times more participants with ASD alone (38 of 52) than participants with ASD/ADHD (only 14).

Fig. 1

(a) Two feature vectors of dataset avoiding of ASD and ASD/ADHD classes that have the lowest distance between them. The “Minimum distance” printed is the number of features that differ between them. (b) Two nearest vectors of class ASD. (c) Two nearest vectors of class ASD/ADHD. (d) Histograms of distances between vectors of the same and different class

Panel (a) of Fig. 1 shows the two avoiding feature vectors, one of class ASD (continuous line with squares) and one of class ASD/ADHD (dashed line with circles), that have the lowest distance between them over all the training patterns of different classes. This distance is the number of features in which they differ. In panel (a), the minimum distance is 7. Considering only vectors of class ASD, the lowest distance is 9, see panel (b). Within class ASD/ADHD, the lowest distance is 7 again, see panel (c). These distances are equal to or higher than the lowest distance between vectors of different classes, as in panel (a). Thus, similarities between feature vectors of participants with ASD and ASD/ADHD are comparable to differences between vectors of the same class. Panel (d) shows the histograms of distances between vectors of the same and different classes (continuous and dashed lines, respectively), confirming that within- and between-class distances follow similar distributions and have similar ranges. This suggests that: 1) classifiers based on distances that weight all features equally might be unable to adequately discriminate between classes; and 2) differences in certain features may be more important for classification than others. Therefore, methods that weight each feature differently might be able to detect differences between classes in just one or a few features and, perhaps, achieve better classification performance.
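For illustration, the distance analysis behind Fig. 1 can be reproduced with a short script. This is a minimal sketch assuming the SP2 answers are stored in an integer matrix `X` (participants by questions) and the labels in a vector `y` (0 for ASD, 1 for ASD/ADHD); the names are illustrative and the answer values below are synthetic.

```python
import numpy as np

def count_distance(a, b):
    """Number of questions whose answers differ between two participants."""
    return int(np.sum(a != b))

def distance_histograms(X, y):
    """Collect within-class and between-class distances, as in Fig. 1(d)."""
    within, between = [], []
    n = len(y)
    for p in range(n):
        for q in range(p + 1, n):
            d = count_distance(X[p], X[q])
            (within if y[p] == y[q] else between).append(d)
    return np.array(within), np.array(between)

# Synthetic Likert answers (values 1..5): 52 participants, 20 avoiding questions
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(52, 20))
y = np.array([0] * 38 + [1] * 14)
within, between = distance_histograms(X, y)
print("min within-class:", within.min(), "min between-class:", between.min())
```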

3 Standard classification methods

The classification of participants into ASD or ASD/ADHD according to their SP2 features was performed using a large collection of 42 standard state-of-the-art classifiers (Table 1 in the supplementary material) implemented in four programming languages: Matlab, Octave, Python (using the scikit-learn module) and R, using the different packages listed in the table. Several versions of the most common classifiers were executed, such as classification trees (ctree) in Matlab, Octave (own implementation), Python and R. The recursive partitioning classification tree (rpart) was also executed in R. Multi-layer perceptron neural networks were executed in Matlab, Python and R with the packages nnet and neuralnet. Other neural networks executed were the extreme learning machine (elm) in Octave (own implementation) and R, and the probabilistic neural network (pnn) in Matlab. Nearest neighbor classifiers were executed in Matlab, Python and R. Linear, diagonal linear, quadratic, flexible and mixture discriminant analysis were executed in Matlab, Python and R. The classifier ensembles executed in this work include: adaboost and bagging, both in Matlab and R; gradient boosted machine (gbm) in Python and R; and random forest (rforest) in Matlab, Python and R. The support vector machine was executed in Matlab, in Octave using the LibSVM library, in Python and in R. Other standard classifiers are: naive Bayes (awnb, manb, nb, nbDiscrete and nbSearch) in R; k-nearest neighbors with Euclidean distance (knn) in Matlab and R; multivariate adaptive regression splines (earth); and Gaussian process regression (gpr) and learning vector quantization (lvq), both implemented in R. Table 2 of the supplementary material lists details about hyper-parameter tuning.
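As an illustration, a subset of the standard classifiers named above can be instantiated in Python with scikit-learn as follows; the hyper-parameter values are placeholders, not the tuned settings reported in Table 2 of the supplementary material.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# A few of the standard classifiers, keyed by the short names used in the text
standard_classifiers = {
    "ctree": DecisionTreeClassifier(),
    "mlp": MLPClassifier(max_iter=2000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
    "rforest": RandomForestClassifier(n_estimators=200),
    "adaboost": AdaBoostClassifier(),
    "gbm": GradientBoostingClassifier(),
    "svm": SVC(kernel="rbf", C=1.0),
    "nb": GaussianNB(),
}
```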

This is a very diverse collection including many different classifiers, so there is no a priori reason to deem them inadequate for automatic ASD/ADHD classification. However, their performance was poor for this problem, as reported in the results section below. In order to achieve better results, the following sections propose alternative classification methods that exploit the characteristics of the problem of ASD/ADHD prediction from questions of the SP2 questionnaire.

4 Proposed nearest neighbor classifiers

We also studied the use of classifiers designed specifically for these data, which take advantage of their particular properties. As explained above, the variability in the feature vectors is not higher between participants of different classes than between participants of the same class, so standard distance-based classifiers such as support vector machines and nearest neighbor methods might not achieve good performance. This suggested replacing the standard Euclidean distance between vectors in neighbor classifiers by a distance measure specifically suited to this type of problem. A detailed analysis of vectors with the same and different class labels shows that the non-coincidence in a certain feature is often more relevant for classification than the specific value of the difference. Therefore, we used a distance measure that simply counts the number of features that differ between vectors, as explained in the following paragraph.

Since the features are answers to questions of the SP2 questionnaire, they share the same set of possible values \(\mathcal {J}=\{ 1,\ldots ,J \}\), with \(J\) \(=\)5, so each feature vector can be defined as \(\textbf{x}=(j_1,\ldots ,j_I)\), where \(j_i \in \mathcal {J}\) is the value of the i-th feature, for \(i=1,\ldots ,I\), with I the number of features (e.g., I=20 for dataset avoiding, see Table 1). Let N be the number of training vectors, and \(\textbf{x}_n=(j_{n1},\ldots ,j_{nI})\) the n-th training vector, with \(n=1,\ldots ,N\). Let \(c_n \in \{ 1, \ldots , C\}\) be the class label of \(\textbf{x}_n\) and \(C\) \(=\)2 the number of classes, where \(c_n\) \(=\)1 or \(c_n\) \(=\)2 for each ASD or ASD/ADHD participant, respectively. We propose to use the following distance measure:

$$\begin{aligned} d_n(\textbf{x})=d(\textbf{x},\textbf{x}_n)=I-\sum _{i=1}^I \delta (j_i,j_{ni}) \end{aligned}$$
(1)

between a test vector \(\textbf{x}\) and a training vector \(\textbf{x}_n\), where \(\delta (i,j)=1\) when \(i=j\) and zero otherwise. Using this distance, we considered the following two nearest neighbor classifiers, alternatives to the standard k-nearest neighbor (knn) classifier:

1. Nearest neighbor (nn classifier in Table 1 of the supplementary material): the standard K-nearest neighbor classifier with the distance \(d_n(\textbf{x})\) defined above, tuning the number K of neighbors with values from 1 to 19 in steps of 2.

2. Nearest neighbors by class (nnc). The high variability between ASD vectors, together with the population unbalance favouring this class, often causes ASD/ADHD vectors to find ASD nearest neighbors, biasing the prediction towards that class. The proposed nnc method tries to compensate this bias by scaling and translating the distances between a test vector and the training vectors of each class, using the mean and standard deviation of the within-class nearest-neighbor distances, in order to make a proper selection of nearest neighbors. For \(n=1,\ldots ,N\), let \(u_n\) be the lowest distance \(d_n(\textbf{x}_p)\) over the training vectors \(\textbf{x}_p\) with \(c_p=c_n\) and \(p \ne n\), with \(d_n\) defined above. For each class \(k=1,\ldots ,C\), let \(\mu _k\) and \(\sigma _k\) be the mean and standard deviation of \(u_n\) for \(c_n=k\). Then, for class k the lowest distance \(D_k(\textbf{x})\) between its training vectors \(\{ \textbf{x}_n \}_{c_n=k}\) and the test vector \(\textbf{x}\) is evaluated. This distance \(\displaystyle D_k(\textbf{x})=\min _{c_n=k} \{ d_n(\textbf{x}) \}\) is translated and scaled by \(\mu _k\) and \(\sigma _k\), respectively, into \(\displaystyle D_k'(\textbf{x})=\frac{D_k(\textbf{x})-\mu _k}{\sigma _k}\), so that it is evaluated relative to the typical distances between the training vectors of class k. The class prediction \(\displaystyle Y(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,min}\,}}}\ \{ D_k'(\textbf{x}) \}\) is the class with the lowest distance \(D_k'(\textbf{x})\); a sketch of this procedure is given after this list.
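A minimal sketch of the proposed nnc rule under the notation above, using the count-based distance of (1); the class labels and array layout are illustrative, and each class is assumed to have at least two training vectors.

```python
import numpy as np

def count_distance(a, b):
    # Distance (1): number of features where the answers differ
    return int(np.sum(a != b))

class NNCClassifier:
    """Nearest neighbors by class (nnc): per-class standardized nearest distances."""

    def fit(self, X, y):
        self.X_, self.y_ = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(self.y_)
        self.mu_, self.sigma_ = {}, {}
        for k in self.classes_:
            idx = np.flatnonzero(self.y_ == k)
            # u_n: distance from each training vector of class k to its nearest
            # same-class neighbor (excluding itself)
            u = [min(count_distance(self.X_[n], self.X_[p])
                     for p in idx if p != n) for n in idx]
            self.mu_[k] = float(np.mean(u))
            self.sigma_[k] = float(np.std(u)) or 1.0  # guard against zero deviation
        return self

    def predict_one(self, x):
        scores = {}
        for k in self.classes_:
            idx = np.flatnonzero(self.y_ == k)
            D_k = min(count_distance(x, self.X_[n]) for n in idx)
            scores[k] = (D_k - self.mu_[k]) / self.sigma_[k]  # D'_k(x)
        return min(scores, key=scores.get)                    # class with lowest D'_k
```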

5 Proposed feature population classifiers

The features are answers, with discrete values from 1 to 5, to questions of the SP2 questionnaire. Evaluating differences between participants over all the features jointly might not allow a reliable discrimination between the two classes. On the contrary, some features might be much more relevant than others. The right prediction for each test vector might even depend on just one or a few features, which, however, might not be the same for all test vectors. This suggested evaluating each feature separately, based on its value and on the number of training vectors of each class that have this value. The objectives are: 1) to perform a class prediction for each feature; 2) to evaluate the relevance of each feature for classification; and 3) to use these class predictions and feature relevances to make a global class prediction, i.e., over all the features. Following this idea, we propose several classifiers based on the population of training vectors of each class that have each value of each feature. In order to make the final prediction, these classifiers use different criteria, described in the following paragraphs.

Let \(N_k\) be the number of training vectors of class k. For each feature (or input) \(i=1,\ldots ,I\), value \(j=1,\ldots ,J\), and class \(k=1,\ldots ,C\), let \(N_{ijk}\) be the number of training vectors \(\textbf{x}_n=(j_{n1},\ldots ,j_{nI})\) with \(c_n=k\) and \(j_{ni}=j\). Let \(P_{ijk}\) be the relative population of training vectors of class k whose i-th feature has value j, calculated as the ratio between \(N_{ijk}\) and \(N_k\):

$$\begin{aligned} P_{ijk}=\frac{N_{ijk}}{N_k}, \quad i=1,\ldots ,I; j=1,\ldots ,J; k=1,\ldots ,C \end{aligned}$$
(2)

For a test vector \(\textbf{x}=(j_1,\ldots ,j_I)\), let \(P_{ik}(\textbf{x})\) be the relative population of input i, value \(j_i\) and class k, so that \(P_{ik}(\textbf{x})=P_{ij_{i}k}\) for \(i=1,\ldots ,I\) and \(k=1,\ldots ,C\). According to this distribution of the training vector population among feature values and classes, the prediction \(y_i(\textbf{x})\) issued by feature i should be the class k that maximizes \(P_{ik}(\textbf{x})\):

$$\begin{aligned} y_i(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ P_{ik}(\textbf{x}) \right\} , \quad i=1,\ldots ,I \end{aligned}$$
(3)

Based on this, we can define the relevance measurement \(R_i\) of feature i as the average accuracy of this feature in the class prediction over the whole training set. For the training vector \(\textbf{x}_n=(j_{n1},\ldots ,j_{nI})\), feature i predicts class \(y_i(\textbf{x}_n)\), obtained by replacing \(\textbf{x}\) with \(\textbf{x}_n\) in (3), so that the relevance \(R_i\) of feature i can be defined as:

$$\begin{aligned} R_i=\frac{1}{N} \sum _{n=1}^N \delta (c_n,y_i(\textbf{x}_n)) \end{aligned}$$
(4)
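A minimal sketch of how the relative populations \(P_{ijk}\) of (2), the per-feature predictions (3) and the relevances \(R_i\) of (4) can be computed; the array names and the 0-based indexing are illustrative.

```python
import numpy as np

def fit_populations(X, y, J=5):
    """P[i, j, k] = fraction of class-k training vectors whose feature i has value j+1 (eq. 2)."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    I = X.shape[1]
    P = np.zeros((I, J, len(classes)))
    for k, c in enumerate(classes):
        Xk = X[y == c]
        for i in range(I):
            for j in range(J):
                P[i, j, k] = np.mean(Xk[:, i] == j + 1)
    return P, classes

def feature_predictions(P, classes, x):
    """y_i(x): class predicted by each feature of the test vector x (eq. 3)."""
    return np.array([classes[np.argmax(P[i, x[i] - 1, :])] for i in range(len(x))])

def feature_relevances(P, classes, X, y):
    """R_i: training accuracy of each feature's individual prediction (eq. 4)."""
    preds = np.array([feature_predictions(P, classes, xn) for xn in X])  # shape (N, I)
    return np.mean(preds == np.asarray(y)[:, None], axis=0)
```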

In order to perform a class prediction \(Y(\textbf{x})\) for a test vector \(\textbf{x}=(j_1,\ldots ,j_I)\) based on the predictions \(y_i(\textbf{x})\) and relevances \(R_i\) of the I features, several criteria were considered, leading to the following population-based classifiers:

1. Feature population voting (fpv). The predicted class \(Y_1(\textbf{x})\) is the most voted over the I features: \(\displaystyle Y_1(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ \sum _{i=1}^I \delta (y_i(\textbf{x}),k) \right\} \), where \(\delta (i,j)\) is defined in section 4. This approach was selected due to its simplicity, although the lack of an averaging procedure over features might lead to low classification performance. In case of a tie, the predicted class k is the one with the largest sum of relevances over the features that predict class k, so that \(\displaystyle Y_1(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ \sum _{i \in \mathcal {A}_k(\textbf{x})} R_i \right\} \), where \(\mathcal {A}_k(\textbf{x})=\{ i=1,\ldots ,I: y_i(\textbf{x})=k \}\).

2. Feature population sum (fps). The predicted class \(Y_2(\textbf{x})\) is the one that maximizes the sum of relative populations \(P_{ik}(\textbf{x})\) for those features i with \(y_i(\textbf{x})=k\), so that \(\displaystyle Y_2(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ \sum _{i \in \mathcal {A}_k(\textbf{x})} P_{ik}(\textbf{x}) \right\} \). This sum over features that predict the same class performs averaging and might compensate some classification errors compared to the crisp decision used by fpv.

3. Maximum relevance feature (mrf). The prediction \(Y_3(\textbf{x})\) is the class k predicted by the feature with the highest relevance \(R_i\), so that \(\displaystyle Y_3(\textbf{x})=y_{p}(\textbf{x})\), where \(\displaystyle p = \underset{i=1,\ldots ,I}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ R_i \right\} \). The exclusive use of the most relevant feature p might lead to low performance due to the lack of averaging or smoothing over features.

4. Voting among the 5 most relevant features (mrf5). The predicted class \(Y_4(\textbf{x})\) is selected by voting among the 5 features with the highest relevances, so that \(\displaystyle Y_4(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ \sum _{i \in \mathcal {B}} \delta (k,y_i(\textbf{x})) \right\} \), where \(\mathcal {B}\) is the set of the 5 most relevant features. The voting over the 5 most relevant features might provide an averaging procedure to compensate errors that is missing in mrf.

5. Feature relevance times population (frp). The predicted class \(Y_5(\textbf{x})\) is the one that maximizes the sum of \(P_{ik}(\textbf{x})\) multiplied by the feature relevance \(R_i\) over the features \(i \in \mathcal {A}_k(\textbf{x})\), i.e., those that predict class k. Thus, \(\displaystyle Y_5(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ \sum _{i \in \mathcal {A}_k(\textbf{x})} R_i P_{ik}(\textbf{x}) \right\} \). The relevance-weighted sum over features might also provide an averaging procedure that avoids errors caused by unreliable features.

6. Feature nearest neighbor (fnn). Instead of using the populations \(P_{ik}(\textbf{x})\) as the previous criteria do, this method stores in the set \(\mathcal {D}_{ijk}\) the indices n of the training vectors \(\textbf{x}_n=(j_{n1},\ldots ,j_{nI})\) of each class k for each value j of each feature i, so that \(\mathcal {D}_{ijk}=\left\{ n=1,\ldots ,N: c_n=k, j_{ni}=j \right\} \) for \(i=1,\ldots ,I; \quad j=1,\ldots ,J; \quad k=1,\ldots ,C\). For a test vector \(\textbf{x}=(j_1,\ldots ,j_I)\), let \(\displaystyle d_{li}(\textbf{x})=\min _{n \in \mathcal {D}_{ij_{i}l}} \{ d_n(\textbf{x}) \}\), for \(l=1,\ldots ,C; i=1,\ldots ,I\), be the lowest distance from \(\textbf{x}\) to the training vectors of class l with the same value \(j_i\) of feature i as \(\textbf{x}\), whose indices are included in \(\mathcal {D}_{ij_{i}l}\). The prediction \(y_i(\textbf{x})\) issued by fnn on feature i is the class l with the lowest distance \(d_{li}(\textbf{x})\) to the training vectors in \(\mathcal {D}_{ij_{i}l}\).

    $$\begin{aligned} y_i(\textbf{x})=\underset{l=1,\ldots ,C}{{{\,\mathrm{arg\,min}\,}}}\ \{ d_{li}(\textbf{x}) \}, \quad i=1,\ldots ,I \end{aligned}$$
    (5)

    The global class prediction \(Y_6(\textbf{x})\) of fnn is selected by voting among features:

    $$\begin{aligned} Y_6(\textbf{x})=\underset{k=1,\ldots ,C}{{{\,\mathrm{arg\,max}\,}}}\ \left\{ \sum _{i=1}^I \delta (k,y_i(\textbf{x})) \right\} \end{aligned}$$
    (6)

The pseudo-code of the proposed population-based classifiers is included in algorithms 1 and 2 of the supplementary material.
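Building on the populations \(P_{ijk}\) above, a minimal sketch of the fps decision rule follows; `P` is the array produced by the illustrative `fit_populations` helper of the previous section (features by values by classes) and `classes` the corresponding class labels.

```python
import numpy as np

def fps_predict(P, classes, x):
    """Feature population sum (fps): predict the class k that maximizes the sum of
    P_ik(x) over the features i whose individual prediction y_i(x) is k."""
    I = len(x)
    pops = np.array([P[i, x[i] - 1, :] for i in range(I)])   # P_ik(x), shape (I, C)
    preds = np.argmax(pops, axis=1)                          # class index of y_i(x), eq. (3)
    sums = np.array([pops[preds == c, c].sum() for c in range(len(classes))])
    return classes[np.argmax(sums)]
```

The same scaffolding extends to fpv (counting votes instead of summing populations) and to frp (weighting each term by the relevance \(R_i\)).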

6 Results and discussion

Table 2 Best kappa and accuracy (in %), best classifier and its language for each dataset
Fig. 2

Box plot of kappa (in %) for each dataset, sorted by decreasing medians

We performed experiments comparing the standard and proposed classifiers (sections 3 to 5) on the available datasets (section 2). Given the reduced number (52) of participants, or feature vectors, we used leave-one-out (LOO) cross validation [38]. The classification performance metrics used are the Cohen kappa statistic [39] and the accuracy, both in %. Since this is a binary classification problem, other measurements such as sensitivity, specificity, positive predictivity, F1 and area under the curve, among others, were also calculated (see their definitions in Appendix E of the supplementary material).
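A minimal sketch of the leave-one-out evaluation loop with the Cohen kappa statistic using scikit-learn; `clf` stands for any classifier with the usual fit/predict interface, and `X`, `y` for one of the six datasets (illustrative names, integer class labels assumed).

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import cohen_kappa_score, accuracy_score

def loo_evaluate(clf, X, y):
    """Leave-one-out cross validation: one test participant per fold."""
    X, y = np.asarray(X), np.asarray(y)
    y_pred = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        y_pred[test_idx] = model.predict(X[test_idx])
    kappa = 100 * cohen_kappa_score(y, y_pred)
    acc = 100 * accuracy_score(y, y_pred)
    return kappa, acc
```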

Table 6 of Appendix H in the supplementary material reports the whole results for all the classifiers and datasets. Table 2 compiles, for each dataset, the best kappa and accuracy achieved over all the classifiers considered. The best result (kappa of 60.5%) is achieved by fps on dataset avoiding, which seems to be the best suited for classification. This kappa value can be considered, according to [40], as a “substantial agreement between true and predicted class labelings”. In the datasets sensitivity, seeking and registration, the performance is about 15-20 points below avoiding, while results on total and especially on touch are fairly poor.

These results by dataset are confirmed by Fig. 2, which shows the box plot of the kappa values achieved by all the classifiers on each dataset. Besides achieving the highest kappa, dataset avoiding also has the largest upper box limit, well above the other datasets, and the largest median, although only slightly above sensitivity and registration. The total, touch and seeking datasets have even lower kappa medians (near 0%).

Table 3 Best kappa and accuracy (both in %) achieved by each classifier, sorted decreasingly, and dataset where the best result was achieved

Table 3 reports the best kappa and accuracy achieved by each classifier over all the datasets, sorted by decreasing performance, and the dataset where the best kappa was achieved. Although we tried 50 classifiers, 42 standard and 8 proposed, the table contains 33 rows because many classifiers have several implementations, of which only the best is reported. Eleven of the 20 highest kappa values are achieved using dataset avoiding, which is clearly the best one. The low kappa values achieved by the majority of the standard classifiers in Table 3 show that this is a hard classification problem. These values range from 4.6% to 60.5%, while accuracy ranges from 63.5% to 84.6%. Due to the class unbalance, relatively high values of accuracy can indeed correspond to very low kappa. In some cases, a higher kappa corresponds to a lower accuracy, e.g., in positions 4 and 5, where nbDiscrete achieves a kappa of 55.2% and an accuracy of 80.8%, while fnn achieves a lower kappa (54.8%) but a higher accuracy (84.6%). Since accuracy is biased by the class unbalance, kappa is a more meaningful metric than accuracy.

The standard classifier that achieves the best kappa is discrete naive Bayes (nbDiscrete, 55.2%). This good result of Bayesian classification confirms the usefulness of probabilistic methods, as with fps, for this problem. Both the multivariate adaptive regression splines (earth) and the rpart classification tree achieve a kappa of 49.6%. The naive Bayes (nb) and kernel Fisher discriminant (kfd) also perform relatively well, with kappa of 41.4% and 34.5%, respectively. The remaining standard classifiers have lower kappa values. Neural networks: multi-layer perceptron (32.5%), extreme learning machine (32.2%) and probabilistic neural network (4.6%). Ensembles: random forest (32.2%), gradient boosted machine (14.2%), adaboost (21.8%) and bagging (31.9%). Support vector machine (24.8%). Classification tree (25.2%). Discriminant analysis classifiers: dlda, fda, lda, mda and qda (between 11% and 31%). The remaining Bayes classifiers are: model averaged naive Bayes (manb, 31.9%), semi-naive structure learning wrapper (nbSearch, 21.4%) and attribute weighting naive Bayes (awnb, 22.2%). Other classical methods with low performance are learning vector quantization (lvq, 16.6%) and Gaussian process regression (gpr, 6.2%).

Table 4 Best result (kappa and accuracy, in %) achieved by the best classifier (upper part) and by the best classifier with feature selection (lower part) for each dataset

Considering the proposed methods fpv, fps, mrf, mrf5, frp, nn, nnc and fnn, all of them are within the 13 highest kappa values in Table 3, outperforming the vast majority of the standard classifiers except nbDiscrete, earth, rpart and nb. Specifically, fps, fpv and frp achieve the three highest kappa values, and fnn is in the fifth position. Indeed, the 10 best classifiers include 6 proposed by us. The good result of fps (feature population sum, kappa of 60.5%) reveals that classification based on the populations of training vectors with each feature value is effective for this problem. The sum of populations in fps outperforms both the voting in fpv and the relevance weighting in frp (feature relevance times population). This may happen because fps first sums the feature contributions (relative populations) for each class and then makes a prediction, while fpv first makes a class prediction for each feature and then sums the votes to make the final prediction. Thus, fps is able to compensate among features, being less prone to mistakes than the voting in fpv. The frp uses class populations and feature relevances, achieving a lower kappa than fps, so the relevances may be confounding the classification. This is also confirmed by the lower positions, 9 and 12, achieved by mrf and mrf5, respectively. Besides, the voting among the 5 most reliable features in mrf5 reduces performance compared to mrf, which uses only the feature with the highest reliability. Appendix F in the supplementary material presents, for the avoiding dataset, the feature relevance values defined in section 5 and by other methods in the literature. These feature relevances are relatively coherent in identifying the features that most influence classification.

The proposed fnn (feature nearest neighbor) also achieves a good (5th) position, with a kappa of 54.8%, outperforming nn and nnc. This fact suggests that information about class population helps the neighbor selection in this problem. The remaining proposed neighbor classifiers (nn and nnc, 47.4% and 40.5%, respectively) also outperform the standard neighbor method (knn, 28.4%), so the distance \(d_n(\textbf{x})\) proposed in section 4 proves more effective than the standard Euclidean distance.

6.1 Feature selection

We also used 22 popular feature selection (FS) methods, listed in Table 3 of the supplementary material and including the feature relevance in (4), in order to find out whether some selected group of relevant features might achieve better performance than the whole set of features. To avoid an excessively large number of experiments, FS was applied with the best classifier, fps, and six other classifiers selected by their good performance: nbDiscrete, earth and rpart in R (positions 4, 6 and 7 in Table 3, respectively); kfd and mlp in Python (positions 13 and 14 in that table); and rforest in Matlab (position 15). In each LOO trial, the FS method is applied to the training set in order to rank the features by decreasing importance. Backward and forward sequential feature selection (bsfs and fsfs, respectively) are well-known FS methods [41] that sequentially search for a pre-specified number of features in order to maximize the performance of a classifier, which in the Matlab implementation used by us is a Bayesian classifier. The forward version (fsfs) starts from scratch and iteratively adds the features that raise the classifier performance until the specified number is reached. The backward version (bsfs) starts with all the I features and iteratively removes features when doing so raises the performance, stopping at the desired number of features. Both are executed for \(i=1,\ldots ,I\), selecting i features in iteration i. The importance of a feature is given by the number of times it is selected.
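As an illustration, scikit-learn offers a sequential feature selector analogous to the fsfs/bsfs procedure described above (the original experiments used the Matlab implementation with a Bayesian base classifier); the estimator, the number of features and the variable names below are placeholders.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

# Backward sequential selection of 17 of the 20 avoiding features,
# scored by the cross-validated accuracy of a Bayesian base classifier
bsfs = SequentialFeatureSelector(
    GaussianNB(),
    n_features_to_select=17,
    direction="backward",
    cv=5,
)
# X_train, y_train: training split of the avoiding dataset (illustrative names)
# bsfs.fit(X_train, y_train)
# selected = bsfs.get_support(indices=True)   # indices of the retained questions
```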

Table 5 Confusion matrix achieved by fps-bsfs on dataset avoiding
Table 6 Performance metrics (in %) achieved by fps-bsfs on dataset avoiding (see text for details)

Once the feature ranking is created, the classifier is evaluated using the first i features of the ranking, for \(i=1,\ldots ,I\), and the performance is recorded for each value of i. Some FS methods (e.g., sfprc2/sfprfc, sfdrc2/sfdrfc and sfwec2/sfwefc) do not accept a number of features to be selected. Instead, they require another parameter, such as a threshold \(\alpha \) on the p-values of a chi-squared test or ANOVA F-value, so that only features with p-value \(>\alpha \) are selected. The select scoring percentile (sp) method selects a percentile of the features, with values from 10% to 100% in steps of 10.
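A minimal sketch of the percentile-based selection mentioned above, using scikit-learn's chi-squared scoring; the percentile shown is just one of the grid values tried, and the variable names are illustrative.

```python
from sklearn.feature_selection import SelectPercentile, chi2

# Keep the top 30% of features ranked by a chi-squared test against the class label.
# SP2 answers are non-negative integers, so chi2 scoring can be applied directly.
sp = SelectPercentile(score_func=chi2, percentile=30)
# X_sel = sp.fit_transform(X_train, y_train)   # X_train, y_train: illustrative names
```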

Table 7 of Appendix H in the supplementary material reports the whole results achieved with the selected classifiers and feature selection methods. Table 4 compiles, for each dataset, the kappa and accuracy achieved by the best classifiers without FS (upper part, same values as in Table 2 above). The lower part of the table reports the kappa and accuracy achieved using the best combination of classifier and FS method. The last row specifies the implementation language. FS increases the kappa from 60.5% to 68% on dataset avoiding using bsfs (backward sequential feature selection) selecting the 17 most relevant features, and the accuracy rises from 82.7% to 86.5%. The FS methods also overcome the performance of the best classifier without FS in the remaining datasets. Figure 3 in the supplementary material shows the box plot comparing the kappa achieved by the 7 selected classifiers without and with the best FS method over the 6 datasets. Note that feature selection increases the kappa values for all the classifiers, although only fps-bsfs exceeds 60%.

Table 5 reports the confusion matrix corresponding to the best result (kappa\(=\)68%, accuracy\(=\)86.5%), achieved by fps-bsfs on dataset avoiding. According to the scale defined by [40], this corresponds to a good agreement between true and predicted values. The ASD/ADHD class is considered as positive, so that TN (no. of true negatives)\(=\)33, FP (false positives)\(=\)5, FN (false negatives)\(=\)2 and TP (true positives)\(=\)12. This matrix reflects a reliable classification because the numbers outside the main diagonal are much lower than the numbers on the main diagonal. Specifically, FP (with value 5) represents 41% of TP and 15% of TN, while FN (with value 2) represents 17% of TP and 6% of TN.

Table 6 reports other binary classification performance metrics achieved by fps-bsfs: recall, a.k.a. true positive rate (TPR) and sensitivity; precision or positive predictive value (PPV); specificity; balanced accuracy; false positive rate (FPR); false negative rate (FNR); F1, equal to the F-score for \(\beta \) \(=\)1; area under the ROC curve (AUC); Youden index; and Matthews correlation coefficient (MCC), also known as Yule \(\phi \). Appendix E of the supplementary material defines these metrics.

The fps-bsfs correctly classifies 33\(+\)12\(=\)45 of the 52 participants, with an accuracy of 86.5% (Table 6) and a balanced accuracy of 86.3%. The prediction is wrong for the remaining 7 participants due to: 1) the inherent complexity of the classification problem, where the similarity between feature vectors of both classes is large; and 2) the fact that machine learning classifiers can only learn a classification problem approximately and can always make prediction errors. Recall is high (85.7%) due to the low FN (2) compared to TP (12) in the confusion matrix (Table 5). Precision is lower (70.6%) because FP (5) is higher than FN. However, the specificity is higher (86.8%) because FP\(=\)5 is low compared to TN (33). This low FP also leads to a low FPR (13.2%). Since FN\(=\)2, only 2 participants of class ASD/ADHD are erroneously classified as ASD, giving a low FNR (14.3%). The F1, which combines recall and precision, is also high (77.4%). The Youden index and MCC are relatively high, 72.6% and 68.6%, respectively.
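The metrics of Table 6 follow directly from the confusion matrix of Table 5; a short check with the reported counts (TN=33, FP=5, FN=2, TP=12):

```python
TN, FP, FN, TP = 33, 5, 2, 12

recall = TP / (TP + FN)                 # sensitivity / TPR
precision = TP / (TP + FP)              # PPV
specificity = TN / (TN + FP)
fpr = FP / (FP + TN)
fnr = FN / (FN + TP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)
youden = recall + specificity - 1
mcc = (TP * TN - FP * FN) / ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f} "
      f"specificity={specificity:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```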

Fig. 3

Receiver operating characteristic curve (ROC, left panel; the red square is the working point) and lift chart (right panel) of classifier fps-bsfs on dataset avoiding

Figure 3 (left panel) plots the receiver operating characteristic (ROC) curve of fps-bsfs on dataset avoiding. The curve contains points fairly near the upper left corner, which identifies the perfect classification. In fact, the area under the curve (AUC) in Table 6 is also high (84.2%). The red square is the working point corresponding to the confusion matrix of Table 5, located fairly near the upper left corner. The right panel of Fig. 3 shows the lift chart, see [42] p. 265, which plots the cumulative percentage of ASD/ADHD vectors that are correctly labeled versus the percentage of samples tested. The perfect (resp. random) classification corresponds to the left (resp. right) side of the shaded triangle, whose upper side has length equal to the percentage of vectors of the minority class (ASD/ADHD). The classifier fps-bsfs is represented by the black line inside this triangle, which is fairly near the left side, except at the higher percentages of tested samples, because 5 ASD participants, representing 9.6% of the total, are erroneously predicted as ASD/ADHD.

Table 7 Classification of FPS with 17 features selected by BSFS on avoiding

Table 7 shows a case example of a prediction made by fps-bsfs for a test vector. The participant is described by a vector x composed of the I=17 features selected by bsfs out of the 20 features of dataset avoiding. The names of the selected features are listed in the first column of Table 7, and their values in the second column. First, for each input the probabilities \(P_{ik}=P_{ij_i k}\) defined in (2) and calculated during training are selected, with k=1 for class ASD and k=2 for class ASD/ADHD, where \(j_i\) is the i-th feature of x. The values of \(P_{ik}\) for \(i=1,\ldots ,I\) and \(k=1,2\) are reported in columns 3 and 4. Then, each feature i predicts the class \(y_i(\textbf{x}) \in \{1,2\}\) with the highest \(P_{ik}\), as in (3). The prediction of each feature is listed in column 5; e.g., features SP2 and SP15 predict ASD/ADHD and ASD, respectively. Note that feature SP66 gives no prediction because no training vector had the value 3, so for this test vector SP66 provides no information about the class populations. For each class k, the probabilities \(P_{ik}\) are summed over the features i that predict class k. These sums are listed in the row “Probability sum” of the table, with values 1.29 and 5.31 for classes 1 and 2 (only the bold values are summed for each class). The class probabilities \(P_k\) are \(P_1/(P_1+P_2)\) and \(P_2/(P_1+P_2)\) for \(k\) \(=\)1,2, with values 0.19 and 0.81. Finally, the predicted class is the one with the largest \(P_k\), with probability or “certainty” 0.81 (last row of the table).

Consequently, the participant with feature vector x was predicted as ASD/ADHD because the values of SP2 questions 2, 5, 58, 65, 67, 70, 71, 72, 74, 75 and 81 are, in the training set, more probable for participants of class ASD/ADHD than of class ASD, while only the values of SP2 questions 15, 61, 63, 64 and 68 are more probable for participants of class ASD than of class ASD/ADHD. Summing the probabilities of the features that predict each class, ASD and ASD/ADHD obtain probabilities of 0.19 and 0.81, respectively, so the class with the highest sum (ASD/ADHD) was predicted with probability 0.81.

Table 8 Best kappa and accuracy (both in %) and classifier for each dataset: without FS nor unbalanced classification (upper part); with a FS method (middle part); with an unbalanced classification method (lower part)

6.2 Unbalanced classification

Given the class unbalance (ASD has 2.71 times more vectors than ASD/ADHD), we applied 19 methods for unbalanced classification (see Table 5 of the supplementary material) in order to explore whether they increase performance with respect to the classifiers alone and to the classifiers with feature selection. Some of the unbalanced methods are classifiers specifically devoted to unbalanced classification, while others are sampling methods that were used with the 7 classifiers selected in the previous sections: fps in Octave; nbDiscrete, earth and rpart in R; kfd and mlp in Python; and rforest in Matlab. The three Octave classifiers in the upper part of Table 5 of the supplementary material are designed for unbalanced classification [43]. The Matlab methods include random undersampling boosting (RUSBoost), an ensemble that tunes the number of base classifiers with prior probabilities determined from the class populations [44], and the synthetic minority over-sampling technique, SMOTE [45], a well-known sampling method for unbalanced classification. In Python, we used the 7 classifiers combined with three SMOTE versions implemented by the package imbalanced-learn [46]. We also used the SVM with class weights (named wsvm in Table 5 of the supplementary material) inversely proportional to the class populations, in order to take the class unbalance into account. In R, the 7 classifiers were combined with several sampling methods implemented by the ROSE package [47], including under- and oversampling, randomly oversampling examples (ROSE) [48] and several SMOTE variants.
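A minimal sketch of how one of the sampling approaches can be combined with a classifier through the imbalanced-learn package; the base classifier and the SMOTE parameters are illustrative, not the settings used in the experiments.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority ASD/ADHD class inside each training fold only,
# so that the synthetic vectors never leak into the leave-one-out test participant.
pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=5, random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
# kappa, acc = loo_evaluate(pipe, X, y)   # reusing the illustrative LOO helper above
```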

Table 8 in Appendix H of the supplementary material reports the kappa of all the unbalanced classifiers and datasets. The lower part of Table 8 compiles the best results achieved on each dataset by an unbalanced classifier or by the selected classifiers combined with a sampling method. For comparison, the middle part of the table reports the best values achieved by these classifiers with feature selection (already listed in Table 4), and the upper part reports the best classification without feature selection or unbalanced classification (already listed in Table 2). The performance using unbalanced classification is slightly higher than classification alone (upper part of the table), except on dataset seeking where it is lower, and it is clearly lower than using feature selection on all the datasets. Although unbalanced classification also achieves the best kappa (60.5%, equal to the upper part) on dataset avoiding with the same classifier (fps) as in the upper and middle parts, it does not overcome the globally best kappa (68%), achieved by fps with feature selection (bsfs). The reason might be the low degree of unbalance (the ASD class is less than three times larger than the ASD/ADHD class). The sampling methods smote and rose seem better than the others, each achieving the best results in two datasets. Considering languages, four of the six datasets achieve their best results with methods implemented in R. The classifiers designed for unbalanced classification, such as adaboostM1, samme, adaboostNC, rusboost and wsvm, did not outperform the original classification (upper part of Table 4) nor achieve the best result in any dataset.

6.3 Clinical discussion

Evidence supports an overlap of sensory processing difficulties in ASD and ADHD [17]. The early identification of the co-occurrence of these difficulties in these disorders may be useful to predict ADHD comorbidity in children and adolescents with ASD. This is even more relevant considering the greater negative impact of comorbid ASD/ADHD on children’s functioning outcomes compared to ASD or ADHD alone, see [8,9,10]. Using distinct machine learning methods, the current study aimed at predicting ADHD comorbidity in children and adolescents diagnosed with ASD from atypical sensory processing. The experiments showed that machine learning methods can be an important and useful approach contributing to the early detection of ADHD comorbidity with ASD based on sensory processing. Comorbid ADHD in ASD tends to be detected after the ASD diagnosis [49]. Since ASD is commonly detected around 3.5 years of age [50], a comorbid ADHD diagnosis may take years, being established between 4.8 [50] and 8 years of age [51]. In this study, we used the SP2 questionnaire, which can be applied from the age of three. We found that the SP2 was not only able to predict ASD/ADHD comorbidity, but also to do so several years before the diagnosis is typically made.

The results showed that the sensory avoiding, registration and sensitivity quadrants had greater ability to predict ASD/ADHD. These findings are in line with previous studies demonstrating that children with ASD/ADHD present greater overall sensory processing difficulties than children with ASD or ADHD alone [21, 22]. Indeed, these three sensory quadrants reflect a child’s increased difficulty in dealing with environmental stimuli, which is in accordance with other findings [52, 53]. In particular, sensory avoidance is indicative of children who retreat from situations and/or limit environmental sensory experiences (e.g., a child may refuse to be cuddled). Our results showed that questions from the avoiding quadrant of the SP2 questionnaire, filled in by parents at early ages and selected by backward sequential feature selection, allow the feature population sum classifier to achieve an early ASD/ADHD comorbidity prediction with high accuracy (86.5%) and low false positive and false negative rates (below 15%). The registration quadrant refers to difficulties related to delayed reactions to sensory experiences (e.g., taking longer to wake up in the morning). The sensitivity quadrant is associated with more intense reactions or responses to sensory stimuli than in neurotypical children (e.g., being easily startled by sounds).

Atypical touch processing and the sensory seeking quadrant were the two sensory outcomes that were not able to discriminate between ASD/ADHD and ASD participants. This may be explained by the fact that atypical tactile processing is one of the earliest and most common sensory alterations observed in ASD [52], and thus shared by both groups. Besides, sensory seeking (i.e., the need for additional sensory input from the environment, e.g., enjoying loud music) is more frequently reported in ADHD [54]. It may be possible that these two sensory dimensions are more characteristic of ASD and ADHD alone, respectively, and accordingly not indicative of ADHD comorbidity in children and adolescents with ASD.

Our study suggests that atypical sensory processing is an important aspect to consider when characterizing and/or differentiating children with ASD, ADHD, and ASD/ADHD. Knowing that greater overall sensory processing difficulties cascade into greater behavioral and adaptive problems, especially for children and adolescents with comorbid ASD/ADHD [7, 26], the distinct pattern of sensory processing difficulties in these disorders could assist an early and differential diagnosis. Because the SP2 questionnaire can be administered from early ages (i.e., starting from 3 years of age), atypical sensory processing can be used as an early indicator of comorbidity between these neurodevelopmental disorders, as it carries a possible predictive value of ADHD comorbidity with ASD. As a result, important implications for intervention programs can be drawn. For example, altered sensory processing abilities could be considered in interventional approaches that include adapted sensory-based therapies for children with ASD, ADHD or ASD/ADHD. This can contribute to improving their adaptive functioning in environmental situations. The current study has limitations imposed by the small sample size. As well, possible errors in the answers to the SP2 questionnaire might bias the prediction. However, the impact of such bias on our experiments is expected to be low because: 1) the experimental methodology employed, leave-one-out cross validation, is specifically designed for small-sample problems like this one and reliably measures the robustness of the classifiers; and 2) errors in the answers are unlikely, because the questionnaire has been previously validated, and, if present, they would affect both the training and test data, so the evaluation of the classifier performance using leave-one-out cross validation already takes them into account.

7 Conclusion

In this study we have developed an algorithm to predict the presence of comorbid ADHD in children and adolescents with ASD, from as young as 3 years of age, based on their sensory processing. To do so, we used the scores obtained in the questions of the Sensory Profile-2 (SP2) questionnaire. These questions are independent of one another and have discrete integer values between 1 and 5. This kind of classification problem appears frequently in healthcare and other domains. Features were grouped into five standard combinations, of which four are sensory processing quadrants (avoiding, registration, seeking and sensitivity) and the other is a sensory modality (touch). Since the diagnosis of ASD/ADHD comorbidity is complex and usually performed at older ages, the number of participants enrolled in our study is small.

This prediction was performed using 42 standard classifiers of different families. We also proposed eight other classifiers, which make a prediction for a test pattern based on the class and feature values (assumed to be discrete and numeric) of the training data. Several of the proposed methods achieve the best performance on ADHD classification. Specifically, the feature population sum (fps) classifier achieved the highest kappa (60.5%) and accuracy (82.7%) using the avoiding quadrant. The best kappa achieved by the standard classifiers is 55.2% (accuracy of 80.8%) using discrete naive Bayes, and 49.6% (78.8% accuracy) using recursive partitioning trees (rpart) and multivariate adaptive regression splines (earth). Given the high number (86) of features used, and the interest in knowing which are the most relevant, we also used a collection of 22 feature selection methods combined with the best classifier (fps) and six other standard classifiers selected by their good performance. Using backward sequential feature selection (bsfs), fps raised its performance to a kappa of 68% and an accuracy of 86.5%, a good agreement according to [40], with low rates of false positives (13.2%) and false negatives (14.3%). We also used 19 methods of unbalanced classification, but they did not further increase the performance, probably due to the low level of class unbalance (about three to one). For a future full-scale practical application, the best classification model (fps) must be trained using only the 17 questions selected by the best feature selection method (bsfs) from the avoiding quadrant of the SP2 questionnaire. Once trained, the classifier is ready to make a prediction for new participants using their scores on these questions.

From a clinical perspective, the SP2 questionnaire is a useful tool, as it is completed by caregivers, who are in the strongest position to report the child’s responses to the sensory interactions that occur in everyday life, from as early as three years of age. This information can assist earlier and more accurate ASD and/or ADHD diagnoses established by qualified professionals, and provides valuable input for a comprehensive assessment of the child and adolescent’s sensory strengths and challenges and, consequently, for developing effective treatment plans, interventions, and everyday remediation strategies to support children’s participation in all contexts (e.g., school). Future studies should validate the proposed models with a larger sample of participants. Likewise, these models can be applied to other behavioral dimensions (e.g., social difficulties) where the features are also discrete numeric answers to questionnaires, to further support more accurate diagnoses.

The co-occurrence of ASD and ADHD is well documented. The overlap of symptoms between these disorders (e.g., inattention) poses challenges for an early and accurate diagnosis. Children and adolescents with comorbid ASD/ADHD experience greater difficulties in social skills [8], adaptive functioning [10] and quality of life [11] than children and adolescents with only one of these disorders, which entails long-term negative consequences [15]. This study contributes to the literature by showing that the comorbidity between ASD and ADHD involves greater sensory processing difficulties. In addition, it showed that atypical sensory responses may be an important differentiating factor between these disorders at early ages.

Supplementary information

The supplementary material includes the classifier list (Appendix A), hyper-parameter tuning of standard classifiers (Appendix B), pseudo-code of population-based methods (Appendix C), list of feature selection methods used (Appendix D), list of performance metrics for binary classification (Appendix E), feature ranking (Appendix F), list of unbalanced classification methods (Appendix G) and detailed results of the experimental work (Appendix H).