Introduction

East Africa has emerged as a leading oil and gas exploration hub, drawing global attention due to its extensive hydrocarbon reserves and promising geological formations. The region’s strategic significance is underscored by substantial discoveries in sedimentary basins such as the Mandawa Basin (Tanzania), Lake Albert Rift Basin (Uganda, Kenya), and the Rovuma Basin (Tanzania, Mozambique), as detailed by Zongying et al. (2013) and Purcell (2014). This surge in exploration activity, fueled by advancements in technology and the attraction of vast reserves, has positioned East Africa as a frontier for innovative exploration techniques and international investment. Major oil companies are actively assessing the region’s hydrocarbon potential, with estimates indicating reserves of up to 2.8 billion barrels (Footnote 1) of oil and 2.2 billion barrels of natural gas liquids from Tanzania’s basin alone (Brownfield, 2016). This influx of exploration activity underscores East Africa’s emergence as a vital player in meeting global energy demands, emphasizing the need for innovative exploration techniques and efficient resource assessment strategies.

Recently, many researchers have focused on utilizing new technology to minimize the time and cost of exploration due to the expected rise in the future energy demand. According to the International Energy Agency (IEA), energy demand is projected to increase by 25% by 2040, with fossil fuels remaining the dominant energy source and accounting for 75% of the total energy mix (IEA, 2021). This has led to the emergence of unconventional resources as a substitute. Basin modeling, a numerical simulation technique, plays a vital role in deciphering the complex subsurface geological processes that govern the formation and distribution of these valuable resources (Abdel-Fattah et al., 2017; Ehsan et al., 2023; Feng et al., 2023b). Thermal maturity (Tmax) prediction is a vital component of basin modeling as it assesses the extent to which organic matter has been converted inside a source rock, hence influencing its capacity to produce oil and gas (Farouk et al., 2023).

Accurate Tmax estimation is essential in assessing and evaluating any unconventional hydrocarbon resource, and Tmax is usually measured from core samples using geochemical analysis (Huijun et al., 2020; Kibria et al., 2020; Stokes et al., 2023). Maturity indices, such as maturity temperature (Tmax), are widely used for assessing the thermal maturity of a source rock (Zhang and Li, 2018; Yang and Horsfield, 2020). Tmax is a crucial thermal maturity index that can be used to estimate the maximum temperature reached by a source rock during the burial history of a basin (Albriki et al., 2022; Wu et al., 2023). Tmax is obtained through pyrolysis and corresponds to the S2 peak (the remaining hydrocarbon generation potential), which results from the thermal breakdown of kerogen during temperature-programmed pyrolysis at temperatures between 300 and 600 °C (Synnott et al., 2018; Yang and Horsfield, 2020; Thana’Ani et al., 2022; Osukuku et al., 2023).

When estimating the maturity of drilled wells, geochemical methods like pyrolysis have long been thought to be the most reliable and accurate. However, this method has several drawbacks: it is time-consuming, it has high operating costs, and it cannot cover an extensive depth range (Wood, 2018; İnan, 2023). Moreover, several studies have reported that pyrolysis methods may expose samples to air for an extended period, which increases the likelihood of free organic matter oxidizing and escaping and can therefore make measurements inaccurate (Dembicki, 2022). As an alternative to geochemical methods, wireline logs offer an accessible and affordable data source and have become increasingly popular recently (Zhao et al., 2019; Malki et al., 2023).

Numerous researchers have shown that the evaluation of Tmax has been a primary concern in oil and gas exploration, and various conventional techniques have been implemented by different researchers to assess it (Gu et al., 2022; Hackley et al., 2022; Singh et al., 2022; Feng et al., 2023a; Thankan et al., 2023; Wu et al., 2023). These techniques include the bitumen reflectance (Hackley and Lünsdorf, 2018; Jubb et al., 2020; Adeyilola et al., 2022), thermal alteration index (Craddock et al., 2018; Deaf et al., 2022), Rock–Eval pyrolysis (Cheshire et al., 2017; Chen et al., 2019; Pang et al., 2020; Arysanto et al., 2022; Farouk et al., 2023; Sohail et al., 2024), and fluid inclusion analysis (Petersen et al., 2022). While conventional approaches to basin modeling and Tmax prediction have proven valuable, they have significant limitations as they provide discrete data that can be rigorous and lead to poor evaluation of source rock (İnan et al., 2017; Katz and Lin, 2021; Lohr and Hackley, 2021; Sadeghtabaghi et al., 2021; Safaei-Farouji and Kadkhodaie, 2022a). However, irrespective of the conventional solver adopted for Tmax computation, the procedure typically involves significant computational overheads and consumes time.

Recently, due to the advancement of technology, various machine learning techniques have become the focal point of researchers and have been adopted to predict the Tmax of source rock (Abdizadeh et al., 2017; AlSinan et al., 2020; Ehsan and Gu, 2020; Shalaby et al., 2020; Tariq et al., 2020; Amosu et al., 2021; Barham et al., 2021; Aliakbardoust et al., 2024; Li et al., 2024). Hybrid methods have been reported to be more accurate in predicting different source rock parameters (Ahangari et al., 2022; Safaei-Farouji and Kadkhodaie, 2022b; Saporetti et al., 2022; Mkono et al., 2023). However, a group method of data handling (GMDH) method was suggested by Mulashani et al. (2021) as an alternative method for predicting total organic carbon (TOC) from well logs. The methods include input factors such as neutron porosity, spontaneous potential, gamma-ray, resistivity log, sonic travel time, and bulk density. Compared to ANN and log R, the results demonstrated that the methods accurately estimate TOC from log data. However, the study of organic matter Tmax estimation was not presented in detail.

In addition, some studies have predicted Tmax using hybrid methods (Tariq et al., 2020; Barham et al., 2021). Tariq et al. (2020) used a hybrid of an artificial neural network and particle swarm optimization (PSO–ANN) to predict Tmax from well logs. Barham et al. (2021) applied an ANN coupled with principal component analysis (ANN–PCA) to predict Tmax from geophysical well logs. These methods have shown some limitations that lead to inaccurate estimation of Tmax (Table 1). Therefore, a hybrid of the group method of data handling (GMDH) neural network and the differential evolution (DE) algorithm (i.e., GMDH–DE) is proposed in this study to overcome the drawbacks of the hybrid methods previously used to predict Tmax.

Table 1 Summary of previous hybrid methods’ strength and limitation in predicting Tmax

This study presents, for the first time, an integral technique of basin modeling with a novel hybrid GMDH–DE method to enhance the computational process, assess the kerogen type, and simplify the estimation of Tmax in source rocks. The GMDH–DE serves as an improved neural network model for estimating Tmax as a maturity index using geophysical well logs. During the training phase, the GMDH–DE exhibits a remarkable self-organizing characteristic that automatically adjusts model parameters and generates an ideal model structure. Unlike previous hybrid machine learning models for predicting Tmax, the GMDH–DE eliminates the need to manually adjust learning parameters to achieve optimal results. Hence, the performance of the proposed GMDH–DE model in forecasting Tmax is an adequate improvement compared to that of previously employed hybrid machine learning algorithms, namely PCA–ANN and PSO–ANN. Moreover, this study performed a sensitivity analysis to determine how much each input parameter affected the suggested GMDH–DE model in the estimation of Tmax. The results of this study ranked the GMDH–DE model as a reasonably new computational intelligent learning model for the reliable estimation of Tmax. This research contributes to advancing exploration techniques and efficient resource assessment strategies, making it a valuable asset for the oil and gas industry’s ongoing efforts to meet global energy demands.

Geological Setting

The Mandawa Basin is a sedimentary basin in southeastern Tanzania. It is bounded to the north by the Rufiji Trough, to the south by the Ruvuma Basin, to the west by the metamorphic basement, and to the east by offshore basins (Fig. 1). The basin covers an area of approximately 16,000 km2 (Fossum, 2020; Abay et al., 2021). The basin was formed during the Permo–Triassic rifting event, which resulted in the separation of East Gondwana from West Gondwana (Hudson, 2011; Hudson and Nicholas, 2014; Godfray and Seetharamaiah, 2019). The rifting event began in the Late Permian and continued into the Early Triassic. The sediments deposited during this time are known as the Pindiro Group, which comprises various lithologies, including sandstones, mudstones, and limestones (Gama & Schwark, 2022, 2023). The rifting event ended in late Triassic, and the Mandawa Basin entered a period of relative stability. During this time, the basin was filled with shallow marine sediments known as the Mavuji Group, which is composed of sandstones, mudstones, and limestones. In the Late Cretaceous, the Mandawa Basin was uplifted and eroded (Hou, 2015; McCabe, 2021). This uplift resulted in the formation of a series of cuestas and valleys. The cuestas are long, sloping ridges that are formed by resistant sandstones. The valleys are low-lying areas formed by softer mudstones and limestones (Einvik-Heitmann, 2016).

Figure 1

Location of the Mandawa Basin and position of the Mbate, Mbuo, and Mita Gamma Wells (map modified from Hudson (2011) and Barth et al. (2016))

Moreover, according to the geological setting of the Mandawa Basin, it contained and exposed the Kilwa Group, a succession of Late Cretaceous to Paleogene age. The group comprises four formations: the Nangurukuru, Kivinje, Masoko, and Pande Formations (Nicholas et al., 2006). The Nangurukuru Formation seems to be the oldest. It is composed of variably lithified sandstones, mudstones, and shales, while the Kivinje Formation is composed of marine shales and mudstones that contain abundant fossils of planktonic foraminifera (McCabe et al., 2023). The Masoko Formation comprises shallow marine sandstones and mudstones with abundant fossils of benthic foraminifera and ostracods (Fossum et al., 2019). The youngest Pande Formation is composed of fluvial sandstones and mudstones with abundant fossils of land plants (Zhou et al., 2013). The source rock of the Mandawa Basin is the Mbuo Claystone in the late Triassic Pindiro Group and Nondwa evaporites in the early Jurassic Pindiro Group (Maganza, 2014) (Fig. 2).

Figure 2

Lithostratigraphic details of the Pindiro Group in the Mandawa Basin (modified from Gama & Schwark, 2022)

Methodology

Data Description and Pre-processing

This study was conducted on three wells in the Mandawa Basin. The source rocks in the study area are composed mainly of Jurassic shale and Triassic claystone, both from the Nondwa and Mbuo Formations of the Pindiro Group (Mshiu et al., 2022; Gama & Schwark, 2023). The input variables used to develop the model included the well log suite of deep lateral resistivity log (LLD), neutron porosity (NPHI), sonic travel time (DT), gamma-ray (GR), spontaneous potential log (SP), and bulk density log (RHOB). The holdout validation method was used to split the dataset into two parts: 70% of the data were used to train the model (data from Mbate Well and Mbuo Well), while 30% were used to validate the model’s performance (data from Mita Gamma Well).
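The well-based holdout split described above can be sketched as follows. This is only an illustration: the record layout, the depth values, and the helper name `split_by_well` are assumptions introduced here, not the authors' data or code. The point it shows is that whole wells, rather than random rows, are assigned to each side, so the test well is entirely unseen during training.

```python
# Hypothetical core-sample records: (well name, depth in metres).
# The depths are illustrative placeholders, not measured data.
samples = [
    ("Mbate", 1058.0), ("Mbate", 2135.0),
    ("Mbuo", 1661.0), ("Mbuo", 3145.0),
    ("Mita Gamma", 1630.0), ("Mita Gamma", 2150.0),
]

TRAIN_WELLS = {"Mbate", "Mbuo"}   # wells used for training in the text
TEST_WELLS = {"Mita Gamma"}       # well held out for validation


def split_by_well(records):
    """Return (train, test) lists, assigning whole wells to each side."""
    train = [r for r in records if r[0] in TRAIN_WELLS]
    test = [r for r in records if r[0] in TEST_WELLS]
    return train, test


train, test = split_by_well(samples)
print(len(train), len(test))
```

Splitting by well rather than by row avoids leaking depth-adjacent samples from the same borehole into both subsets.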

During data processing, feature selection was conducted to remove outliers that have the potential to compromise the accuracy of an estimating model or diminish its predictive performance. The relative impact of the input parameters was evaluated using the Pearson correlation coefficient (R), thus:

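$$R_{a,b} = \frac{{\sum\nolimits_{i = 1}^{n} {\left( {a_{i} - \overline{a}} \right)\left( {b_{i} - \overline{b}} \right)} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {\left( {a_{i} - \overline{a}} \right)^{2} } } \sqrt {\sum\nolimits_{i = 1}^{n} {\left( {b_{i} - \overline{b}} \right)^{2} } } }}$$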
(1)

where \(R_{a,b}\) is the Pearson correlation coefficient of variables \(a\) and \(b\), \(\overline{a} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} a_{i}\) is the mean of \(a\), \(\overline{b} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} b_{i}\) is the mean of \(b\), and \(n\) is the number of samples. \(R_{a,b}\) measures the linear association between two variables. It ranges from −1 to 1, with −1 indicating a perfect negative correlation and 1 indicating a perfect positive correlation. The coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. This calculation can be done for both normal and binary responses and can also be extended to fuzzy numbers (Cohen et al., 2009; Zhou et al., 2016).
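As a minimal sketch of the Pearson coefficient described above (covariance divided by the product of the standard deviations), the following function computes R for two equal-length sequences. The function name `pearson_r` and the sample values are illustrative assumptions.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of cross-deviations (numerator) and sums of squared deviations.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cross / math.sqrt(ss_a * ss_b)


# Illustrative values only: a perfectly linear pair and its mirror image.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(pearson_r([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))
```

In a screening step, each log (for example GR or DT) would be passed as `a` and the measured Tmax as `b`.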

The measured Tmax and well log data were normalized to a scale from 0 to 1 to reduce redundancy and improve data integrity. The normalization was performed as:

$$X_{{{\text{NORM}}}} = \frac{{x - x_{{{\text{min}}}} }}{{x_{{{\text{max}}}} - x_{{{\text{min}}}} }}$$
(2)

where \(x\) is the original value, \(X_{{{\text{NORM}}}}\) represents the normalized value in a dataset, \(x_{{{\text{max}}}}\) is the maximum value, and \(x_{{{\text{min}}}}\) is the minimum value. Table 2 presents the statistical features for the three well suites: Mbate, Mbuo, and Mita Gamma.
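A short sketch of min-max scaling to the range 0 to 1, using the symbols defined above, is given below. The function name and the sample Tmax-like values are assumptions for illustration only.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot normalize a constant series")
    return [(x - x_min) / (x_max - x_min) for x in values]


# Illustrative values only (not measured data).
print(min_max_normalize([430.0, 440.0, 450.0]))
```

When a separate test well is used, it is common practice to take `x_min` and `x_max` from the training data only; whether that was done here is not stated in the text.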

Table 2 Statistical features of the input datasets

Geochemical Analysis

Eighty-three core samples from the Mandawa Basin were collected for geochemical analysis. Fifty-nine core samples from Mbuo and Mbate Wells were used as training data, and the remaining 24 core samples from Mita Gamma Well were used as testing data. The depth intervals of the core samples were 1058–2135 m in Mbate Well, 1661–3145 m in Mbuo Well, and 1630–2150 m in Mita Gamma Well (Fig. 3). The samples were collected and brought to a laboratory for examination. Each sample weighed 67.5 g and was crushed, sieved, and then extracted and analyzed using Rock–Eval pyrolysis with a temperature schedule of 25 °C/min, pyrolysis oven temperatures exceeding 750 °C, and oxidation oven temperatures above 800 °C. The Rock–Eval pyrolysis results for Tmax and hydrogen index (HI) are presented in Table 3.

Figure 3

Lithostratigraphic column of the Mbuo Well

Table 3 Rock pyrolysis data of Tmax and HI of the Mandawa Basin

Basin Modeling

Basin modeling is a tool used to predict maturation, hydrocarbon generation, expulsion, and migration in exploration geology (Abdel-Fattah et al., 2017). It involves building 1D models to analyze burial and temperature histories, as well as maturity and hydrocarbon generation and expulsion. This study utilized 1D basin modeling through PetroMod software (version 2012) to analyze the Tmax of the Pindiro Group. Essential parameters, including formation names, depths, thicknesses, and deposition ages, were used as inputs (Allen & Allen, 2013; Ahmed et al., 2019). Tmax and vitrinite reflectance were measured for model calibration. A constant heat flow of 64 mW/m2 was employed as described by Wygrala (1989), while the Burnham and Sweeney (1989) kinetic model was utilized due to the oil/gas-prone nature of the source rock. Relative petroleum system elements were assigned to each formation, along with TOC and HI parameters as input.

Back-Propagation Neural Network (BPNN)

Wang et al. (2019) reported the back-propagation approach as a supervised learning algorithm commonly used in neural networks. This approach adjusts the network weights to minimize the error between estimated and actual output (Titus et al., 2022). It is based on the gradient descent method, which calculates the gradient of the error with respect to each weight using the chain rule (Wu & Tong, 2022). It has been shown that the bias concept often works as a set of weights where the signals are sent in opposite directions during the back-propagation learning phase (Sun et al., 2021; Dai, 2023). BPNN was built as a way to solve the multilayer perceptron training problem. Its two main improvements were the addition of a differentiable function at each node and the internal network weight change due to back-propagation error after each training epoch (Che Nordin et al., 2021).
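To make the chain-rule weight update concrete, the sketch below trains the smallest possible network (one input, one tanh hidden neuron, one linear output) by gradient descent on a single sample. The architecture, initial weights, learning rate, and target are all assumptions chosen for illustration; they are not the BPNN configuration used in this study.

```python
import math


def forward(params, x):
    """Forward pass: hidden activation h and network output y."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def backprop_step(params, x, target, lr=0.1):
    """One gradient-descent update for the loss L = 0.5 * (y - target)^2."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    err = y - target                   # dL/dy
    grad_w2 = err * h                  # dL/dw2 = dL/dy * dy/dw2
    grad_b2 = err                      # dL/db2
    grad_z = err * w2 * (1.0 - h * h)  # chain rule through tanh
    grad_w1 = grad_z * x               # dL/dw1
    grad_b1 = grad_z                   # dL/db1
    updated = (w1 - lr * grad_w1, b1 - lr * grad_b1,
               w2 - lr * grad_w2, b2 - lr * grad_b2)
    return updated, 0.5 * err * err


params = (0.5, 0.0, 0.5, 0.0)  # arbitrary starting weights
losses = []
for _ in range(200):
    params, loss = backprop_step(params, x=1.0, target=0.8)
    losses.append(loss)
print(losses[0] > losses[-1])
```

Each gradient is the product of the output error and the local derivative along the path from that weight to the output, which is the chain-rule computation the text refers to.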

Group Method of Data Handling (GMDH)

GMDH is a family of multilayer algorithms that generate a network of layers and nodes from several inputs in the analyzed data stream (MolaAbasi et al., 2021). It includes probabilistic, analogues complexing, parametric, rebinarization, and clusterization techniques. Modeling of complex processes, function approximation, nonlinear regression, and pattern recognition are the core applications of GMDH (Lal and Datta, 2021). The self-organizing inductive propagation algorithm is a technique that can solve complex problems (Roshani et al., 2020; Lv et al., 2023). In addition, it is possible to derive a mathematical model from data samples, which can then be used for pattern recognition and identification.

Most GMDH algorithms employ polynomial reference functions. The Kolmogorov–Gabor polynomial, the discrete analog of the Volterra functional series, can describe a generic input–output relation (Nelles, 2020):

$$u = a + \mathop \sum \limits_{i = 1}^{n} b_{i} x_{i} + \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{n} c_{ij} x_{i} x_{j} + \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{n} \mathop \sum \limits_{k = 1}^{n} d_{ijk} x_{i} x_{j} x_{k} + ...$$
(3)

where \(\left\{ {x_{1} ,x_{2} ,x_{3} ...} \right\}\) represents the inputs, \(\left\{ {a, b, c, d...} \right\}\) are the coefficients of the polynomials, and \(u\) is the output node.
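In practice, GMDH networks commonly truncate this polynomial to a quadratic in two inputs for each neuron and build candidate neurons from pairs of inputs. The sketch below assumes that standard two-input quadratic form; the coefficient values are arbitrary, and the list of log names simply repeats the six inputs used in this study.

```python
from itertools import combinations


def polynomial_neuron(coeffs, xi, xj):
    """Two-input quadratic neuron: a truncated Kolmogorov-Gabor polynomial."""
    a0, a1, a2, a3, a4, a5 = coeffs
    return a0 + a1 * xi + a2 * xj + a3 * xi * xj + a4 * xi * xi + a5 * xj * xj


# Candidate neurons in a first layer are formed from every pair of inputs.
inputs = ["LLD", "NPHI", "DT", "GR", "SP", "RHOB"]
pairs = list(combinations(inputs, 2))
print(len(pairs))

# Arbitrary illustrative coefficients and inputs.
print(polynomial_neuron((1.0, 2.0, 3.0, 4.0, 5.0, 6.0), 1.0, 2.0))
```

A GMDH layer would fit the six coefficients of each candidate pair, keep the best-performing neurons according to an external criterion, and feed their outputs to the next layer.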

GMDH Optimized by Differential Evolution (DE)

The GMDH–DE method is a novel approach for solving optimization problems, particularly those involving nonlinear systems. This method combines the strengths of the GMDH algorithm and the DE algorithm to produce efficient and reliable solutions (Onwubolu, 2008). The GMDH algorithm is a self-organizing method that generates a hierarchical structure of models, starting with simple linear models and gradually building up to more complex nonlinear models (Aljarrah et al., 2022). The DE, on the other hand, is a parallel direct search optimization technique introduced by Storn and Price (1995). Each generation uses a population of NP parameter vectors. Although the DE was initially intended for continuous domains, it was later reworked to address permutation-based problems (Storn & Price, 1995; Pourghasemi et al., 2020). A DE configuration is usually written in the form DE/x/y/z, where x specifies the vector to be perturbed, y is the number of difference vectors used to perturb x, and z denotes the recombination operator, such as exp for exponential and bin for binomial. The GMDH–DE can effectively handle complex nonlinear relationships and improve predictive performance. The basic equation for the method is:

$$x^{*} = \mathop {\arg \min }\limits_{x} F\left( x \right)$$
(4)

where x* is the optimal solution, F(x) is the objective function, and x is a candidate solution drawn from the population.

In the GMDH–DE, the process starts with the creation of an initial population of candidate models using the GMDH algorithm. These models are then evaluated based on their fitness, typically using a performance metric like mean squared error or correlation coefficient, to determine their effectiveness in predicting Tmax. The DE algorithm is then applied to evolve and optimize the parameters of the candidate models, such as the coefficients of the polynomial regression, to further improve accuracy. This iterative process continues until a satisfactory model with optimized parameters is obtained, providing a reliable prediction of Tmax in geological formations. Table 4 summarizes the workflow steps followed by the GMDH–DE to predict Tmax.
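The coefficient-optimization step described above can be sketched with a DE/rand/1/bin loop that minimizes the mean squared error of a single quadratic neuron. This is a simplified illustration, not the authors' implementation: the synthetic data, the search bounds, the population size, the number of generations, and the crossover rate are assumptions, and only the mutation factor of 0.7 echoes a value reported in this study.

```python
import random


def neuron(coeffs, xi, xj):
    """Quadratic two-input polynomial neuron."""
    a0, a1, a2, a3, a4, a5 = coeffs
    return a0 + a1 * xi + a2 * xj + a3 * xi * xj + a4 * xi * xi + a5 * xj * xj


def mse(coeffs, data):
    """Mean squared error over rows of (xi, xj, target)."""
    return sum((neuron(coeffs, xi, xj) - t) ** 2 for xi, xj, t in data) / len(data)


def differential_evolution(objective, dim, pop_size=30, f=0.7, cr=0.9,
                           generations=200, bounds=(-2.0, 2.0), seed=0):
    """DE/rand/1/bin minimizer; returns (best_vector, best_score)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(p) for p in pop]
    for _ in range(generations):
        new_pop, new_scores = [], []
        for i in range(pop_size):
            # rand/1 mutation: three distinct members, none equal to i.
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            mutant = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d]) for d in range(dim)]
            # Binomial crossover; j_rand guarantees one gene from the mutant.
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (d == j_rand or rng.random() < cr) else pop[i][d]
                     for d in range(dim)]
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                new_pop.append(trial)
                new_scores.append(trial_score)
            else:
                new_pop.append(pop[i])
                new_scores.append(scores[i])
        pop, scores = new_pop, new_scores
    best = min(range(pop_size), key=lambda k: scores[k])
    return pop[best], scores[best]


# Synthetic data from a known polynomial (illustrative only).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
data = [(xi, xj, 0.2 + 0.5 * xi - 0.3 * xj + 0.4 * xi * xj)
        for xi in grid for xj in grid]
best_coeffs, best_mse = differential_evolution(lambda c: mse(c, data), dim=6)
baseline_mse = mse([0.0] * 6, data)
print(best_mse < baseline_mse)
```

In a full GMDH–DE workflow, a loop of this kind would be run for the candidate neurons of each layer, with the fitness evaluated on held-out data rather than on the fitting data.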

Table 4 Workflow of the GMDH–DE algorithm

Results and Discussion

1D Basin Modeling Analysis

A change in heat flow ranging from 50 to 70 mW/m2 leads to a depth difference of approximately 1 km for the 100 °C isotherm, often considered as the lower limit for the oil generation window. This variation in heat flow across a basin can significantly influence the maturity stages of potential source rocks, resulting in considerable differences in their thermal evolution. It is essential to calibrate the thermal and maturity history in basin modeling by utilizing borehole temperature data and the vitrinite reflectance (Ro) measurements of source rocks (Hantschel & Kauerauf, 2009). According to 1D burial profiles, the source rock from the early Jurassic period shows a maximum burial depth that is comparable to the current day (2001–2287 m). As shown in Figure 4a, modeling and calibration data agreed well. Based on the Sweeney and Burnham (1990) classification, the analysis revealed that the Mbuo Formation’s source rock displayed vitrinite reflectance levels of 0.50–0.68% Ro (Fig. 4a), indicative of temperatures ranging from 90 to 103.3 °C (Fig. 4b). This suggests that the source rock spans the immature to mature stages of hydrocarbon generation, specifically within the gas–oil window.
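The sensitivity of the 100 °C isotherm depth to heat flow can be checked with a back-of-the-envelope calculation. The sketch below assumes steady-state one-dimensional conduction with no radiogenic heat production, a uniform thermal conductivity of 2.5 W/(m·K), and a surface temperature of 20 °C; these values are illustrative assumptions and are not taken from the basin model in this study.

```python
def isotherm_depth_m(target_temp_c, heat_flow_mw_m2,
                     conductivity_w_mk=2.5, surface_temp_c=20.0):
    """Depth (m) at which a target temperature is reached under Fourier's law."""
    heat_flow_w_m2 = heat_flow_mw_m2 / 1000.0
    gradient_c_per_m = heat_flow_w_m2 / conductivity_w_mk  # geothermal gradient
    return (target_temp_c - surface_temp_c) / gradient_c_per_m


depth_low_q = isotherm_depth_m(100.0, 50.0)   # heat flow of 50 mW/m2
depth_high_q = isotherm_depth_m(100.0, 70.0)  # heat flow of 70 mW/m2
print(round(depth_low_q), round(depth_high_q), round(depth_low_q - depth_high_q))
```

Under these assumptions the isotherm sits near 4000 m for 50 mW/m2 and near 2857 m for 70 mW/m2, a difference of about 1.1 km, which is consistent with the order of magnitude quoted above.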

Figure 4

Thermal maturity history showing (a) the fit between calibrated and measured Ro; (b) temperature vs. depth

Evidence from the Mbuo Well indicates that the Mbuo Formation’s base has been heated to 103.3 °C throughout 0.17 Ma, having descended to a depth of 2287.69 m (Fig. 5). Hydrocarbon generation started during the late Triassic to the early Jurassic in both the Mbuo Claystone and the Mbuo Sandstone and continues to the present. Other overlying formations, such as the Nondwa evaporites (intercalated with shales) and minor claystone in the Mihambia Formation, are immature based on the modeling result. The measured data for Ro and temperature are provided in the Appendix along with the different inputs used for basin modeling. The beginning of the immature stage (0.50% Ro) of the Mbuo Formation was noted in the Mbuo Well at a depth of 1728 m during the Paleogene period (62 Ma). At a depth below 1965 m, during the middle Paleogene (42 Ma), the source rock of interest entered its early oil window (0.56% Ro). In the Neogene period (0.69 Ma), at a depth of 2287.7 m, the primary oil window began (Fig. 6).

Figure 5

Burial depth history with temperature variation curve for the Triassic source rock in the Mandawa Basin

Figure 6

Burial depth history with calibrated vitrinite reflectance for the Triassic source rocks in the Mandawa Basin

GMDH–DE Model Development

The GMDH–DE model comprised six input neurons and two hidden layers, namely, h1, h2, h3, and h4 in the first layer and v1 and v2 in the second layer. The output of the model was represented as y.

Figure 7 presents a neural network structure of the proposed model in predicting Tmax. The equations for the layer of the neural network model needed to provide the Tmax estimation are presented in Table 5.

Figure 7

The GMDH–DE network structure

Table 5 Proposed equations for Tmax estimation

Performance Indicators

The GMDH–DE, GMDH, and BPNN models were coded and implemented in MATLAB R2022a on an AMD Ryzen 5 5600U with Radeon Graphics 2.30 GHz running Windows 10 operating system. The coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE) were the evaluation metrics used to assess the performance of the estimation models. The values of R2 vary between 0 and 1; a model is more effective when its R2 value is higher, and when a model’s R2 score is higher than 0.8 and close to 1, it is regarded as effective (Chicco et al., 2021; Mulashani et al., 2022). At the same time, RMSE is a measure of the differences between predicted values and observed values. Excellent model accuracy is defined by RMSE of < 10%, good model accuracy by RMSE of 10–20%, fair model accuracy by RMSE of 20–30%, and poor model accuracy by RMSE of > 30% (Yao et al., 2021; Hussain et al., 2023). Moreover, MAE is a metric used to measure the average magnitude of errors in a regression model; a lower MAE indicates better performance, as it represents a smaller average error between estimated and actual values (Ali et al., 2023). The R2, RMSE, and MAE mathematical expressions can be presented, respectively, as follows (Chong et al., 2022; Ramos et al., 2023):

$$R^{2} = \left( {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {a_{i} - \overline{a}} \right)\left( {P_{i} - \overline{P}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {a_{i} - \overline{a}} \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {P_{i} - \overline{P}} \right)^{2} } }}} \right)^{2}$$
(5)
$${\text{RMSE}} = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {a_{i} - P_{i} } \right)^{2} }}{n}}$$
(6)
$${\text{MAE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\frac{{a_{i} - P_{i} }}{{a_{i} }}} \right|$$
(7)

where \({P}_{i}\) is the estimated Tmax value from each model, \({a}_{i}\) represents the actual value of Tmax measured from core samples, \(\overline{a}\) and \(\overline{P}\) are the true mean and estimated mean values for Tmax, respectively, and n represents the number of samples.
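The three metrics can be sketched directly from Eqs. 5 to 7. The function names and sample values below are illustrative assumptions. Note that the third function follows Eq. 7 as written, in which each absolute error is divided by the actual value before averaging.

```python
import math


def r_squared(actual, predicted):
    """Squared Pearson correlation between actual and predicted values (Eq. 5)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    num = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    den = (math.sqrt(sum((a - mean_a) ** 2 for a in actual))
           * math.sqrt(sum((p - mean_p) ** 2 for p in predicted)))
    return (num / den) ** 2


def rmse(actual, predicted):
    """Root mean square error (Eq. 6)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae_eq7(actual, predicted):
    """Mean of |(a - P) / a|, following Eq. 7 as written."""
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


# Illustrative values only (not measured data).
actual = [430.0, 440.0, 450.0]
predicted = [432.0, 440.0, 448.0]
print(r_squared(actual, predicted), rmse(actual, predicted), mae_eq7(actual, predicted))
```

Because Eq. 7 divides by the actual value, it is undefined when any actual value is zero, which matters if the metric is applied to data normalized to the range 0 to 1.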

Hyperparameters Tuning

Hyperparameter tuning refers to adjusting the parameters of a machine learning model to optimize its performance. These parameters, known as hyperparameters, are set before training and are not learned during training (Pravin et al., 2022). Hyperparameter tuning is essential because the performance of a machine learning model can be significantly affected by the values of its hyperparameters (Yang & Shami, 2020). It involves experimenting with different combinations of hyperparameters to find optimal values that produce the best performance. In this study, hyperparameter optimization was done using the DE algorithm, a parallel direct search technique for determining the structure of polynomial neurons in the GMDH. The DE generates improved values for the hyperparameters in each model loop and then inputs them to the GMDH to assess the model’s performance on the testing data. The performance of the GMDH was re-evaluated in the following phase. If it was satisfactory, the optimization was terminated; otherwise, the process continued until the stopping criteria were met, and the optimal hyperparameters were obtained. The hyperparameter configuration yielding optimal results comprised a population size of 70 individuals, a mutation rate of 0.7, and a cross-over rate of 13. This configuration fostered a balanced exploration of the solution space, facilitating the discovery of robust models. The choice of two layers with 17 neurons each effectively captured complex relationships within the data, enhancing the model’s capacity to discern patterns indicative of Tmax. By setting a stopping criterion of 300 iterations, the model achieved convergence while mitigating the risk of overfitting. Additionally, employing a moderate learning rate of 0.1 ensured stable and efficient training, allowing the model to effectively learn from the data without diverging or stagnating. Overall, the setting optimally balanced exploration and exploitation, thereby facilitating accurate estimation of the Tmax index in the GMDH–DE model. The optimal hyperparameter setting for the GMDH–DE estimation model is shown in Table 6.

Table 6 Hyperparameter settings for the GMDH–DE model

Estimation of Tmax during Training Performance

The training results summarized in Table 7 showcase the superior performance of the GMDH–DE model in estimating Tmax compared to traditional models like GMDH and BPNN. This is evidenced by its notably higher R2 of 0.995 (Fig. 8), indicating strong correlation between estimated and observed values, coupled with lower error metrics including RMSE (0.004) and MAE (0.006) (Fig. 9) and a substantially shorter computational time of 0.2 seconds. The success of the GMDH–DE model can be attributed to its adeptness in mitigating overfitting, a common challenge in model training. The choice of a moderate learning rate (0.1) and a stopping criterion of 300 iterations effectively regulated the model’s complexity and prevented it from overfitting noise in the data, as discussed above in the context of the hyperparameter settings. Moreover, in the geological setting of the Mandawa Basin, where variations in Tmax are influenced by complex interactions of organic matter and geological processes, the GMDH–DE model’s capacity to capture intricate patterns and nonlinear relationships between input variables and Tmax is particularly advantageous. Its ability to adaptively select features and construct hierarchical models aligns well with the complex nature of geological systems, thereby facilitating more accurate and robust predictions compared to the simpler architectures of GMDH and BPNN. Consequently, the GMDH–DE model emerges as the preferred choice for Tmax estimation in such a geological environment, offering superior predictive performance and computational efficiency.

Table 7 Value of R2 for the fitted equations per model during the training phase
Figure 8

Cross-plots of actual Tmax vs. Tmax predicted by (a) GMDH–DE, (b) GMDH, and (c) BPNN during training

Figure 9

RMSE and MAE of GMDH, GMDH–DE, and BPNN models for Tmax estimation during training

Estimation of Tmax During Testing Performance

The results of the Tmax estimation during model testing (Table 8) reveal notable performance differences among the models considered. The GMDH–DE model exhibited superior performance, achieving R2 of 0.970 (Fig. 10), low RMSE of 0.017, and minimal MAE of 0.025 (Fig. 11), all within a remarkably short computational time of 0.5 seconds. Contrastingly, the traditional GMDH and BPNN models demonstrated inferior performance, with lower R2 values accompanied by higher RMSE and MAE, and longer computational times. The efficacy of the GMDH–DE model can be attributed to its hyperparameter settings, notably the learning rate and stopping criterion, which contributed to the model’s generalizability. By employing a learning rate of 0.1 and a stopping criterion of 300 iterations, the GMDH–DE model effectively balanced the trade-off between model complexity and overfitting, allowing it to generalize well to unseen data. This is particularly crucial in geological settings such as the Mandawa Basin, characterized by diverse and intricate geological processes influencing Tmax. Additionally, the GMDH–DE model’s adaptive and self-organizing nature enabled it to capture complex nonlinear relationships inherent in geological data, thereby outperforming traditional models like GMDH and BPNN. Moreover, the success of the GMDH–DE model in Tmax estimation is linked to the capacity to reveal the hierarchical connections within the data through the multilayer structure and evolutionary optimization process. This enables the model to effectively leverage the geological features specific to the Mandawa Basin, thus enhancing its predictive accuracy. Overall, the GMDH–DE model’s robust hyperparameter settings, coupled with its adaptive nature and ability to capture complex geological processes, culminated in its superior performance compared to traditional models in estimating Tmax in the Mandawa Basin.

Table 8 Value of R2 for the fitted equations per model during the testing phase
Figure 10

Cross-plots of actual Tmax vs. Tmax predicted by (a) GMDH–DE, (b) GMDH, and (c) BPNN models during the testing

Figure 11

RMSE and MAE of GMDH, GMDH–DE, and BPNN models for Tmax estimation during testing

Comparison with Previous Studies

The results of the proposed GMDH–DE model were further compared with those of the previously developed hybrid PSO–ANN and PCA–ANN models used to estimate Tmax (Table 9). During training, the GMDH–DE model outperformed the models proposed by Tariq et al. (2020) and Barham et al. (2021), obtaining an R2 of 0.995 compared with 0.917 for PSO–ANN and 0.88 for PCA–ANN. Likewise, during testing, the GMDH–DE model obtained the highest R2 of 0.9703, followed by PSO–ANN with 0.918 and PCA–ANN with 0.8518 (Fig. 12).

Table 9 Comparison of the proposed model with the previously developed hybrid models for predicting Tmax
Figure 12 Comparison of models from the present study, Tariq et al. (2020) and Barham et al. (2021), for predicting Tmax in terms of R2

SHAP (SHapley Additive exPlanations)

In this study, the GMDH–DE model estimated Tmax, and SHapley Additive exPlanations (SHAP) values provided insight into feature relevance. The SHAP value of a feature is its average marginal contribution to the model’s prediction across all possible combinations of features (Kannangara et al., 2022; Zhao et al., 2022). The SHAP parameter importance in Fig. 13 highlights the substantial impact of the SP parameter on the GMDH–DE model’s Tmax estimation, with a mean SHAP value of 4.29. The DT, GR, RHOB, and LLD parameters had a moderate impact on Tmax estimation, as indicated by their mean SHAP values of 1.76, 1.26, 1.04, and 0.91, respectively, reflecting their role in conveying information about clay content in the Wangkwar Formation. NPHI contributed the least to Tmax estimation, with a mean SHAP value of 0.35. Moreover, Fig. 14 shows that increases in SP, DT, and GR led to an increase in estimated Tmax, whereas higher values of RHOB, LLD, and NPHI resulted in a decrease.

Figure 13 Input parameter importance for Tmax estimation

Figure 14 Input parameter influence for Tmax estimation
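The averaging of marginal contributions that underlies SHAP values can be illustrated with an exact Shapley computation on a small stand-in model. The sketch below is not the GMDH–DE model or the workflow used in this study: the linear function `toy_tmax`, its coefficients, the three-feature input, and the sample values are all hypothetical, and features outside a coalition are replaced by their sample mean as a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features that are not in a coalition take their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_tmax(features):
    """Hypothetical linear stand-in; coefficients are illustrative only."""
    sp, dt, rhob = features
    return 430.0 + 2.0 * sp + 1.0 * dt - 0.5 * rhob


# Hypothetical (SP, DT, RHOB) samples in arbitrary scaled units.
samples = [[1.0, 2.0, 3.0], [4.0, 0.0, 1.0], [1.0, 4.0, 2.0]]
baseline = [sum(col) / len(samples) for col in zip(*samples)]

shap_per_sample = [shapley_values(toy_tmax, s, baseline) for s in samples]

# Global importance: mean absolute Shapley value per feature.
mean_abs_shap = [
    sum(abs(row[i]) for row in shap_per_sample) / len(samples)
    for i in range(len(baseline))
]
print(dict(zip(["SP", "DT", "RHOB"], mean_abs_shap)))
```

The mean absolute value per feature corresponds to the kind of importance ranking shown in Fig. 13, while the sign of each per-sample value indicates whether that feature pushed the estimate up or down, as in Fig. 14. Exact enumeration scales exponentially with the number of features, so practical tools rely on approximations.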

Assessment of Kerogen Type and Maturity Stage

Based on the Tmax values estimated by GMDH–DE, kerogen classification diagrams were constructed using the HI vs. Tmax plot used by earlier researchers to determine the maturity stage and kerogen type (Fig. 15). Overall, the findings indicate that most of the analyzed samples from the Triassic–Jurassic source rocks in the Mbate and Mbuo Wells plot in the immature zone of Types I to III kerogens, with Tmax below 435 °C, the threshold of the gas–oil generation window, signifying the incapability of the rocks to generate hydrocarbons (Tissot & Welte, 2013; Al-Areeq et al., 2018). Their HI values, in the range of 52–1017 mg HC/g TOC, support this interpretation. These findings agree with those obtained from basin modeling in PetroMod. Moreover, very few samples from the Mbate and Mbuo Wells plot in the mature zone of Types II to III kerogens, as indicated by their Tmax of 435–460 °C and HI of 13–285 mg HC/g TOC. In Fig. 15, most samples from the Mita Gamma Well plot in the mature zone of the Types II to III kerogen field, as indicated by their higher Tmax of 440–460 °C, which is in line with the classification suggested by Peters (1986) and Peters and Cassa (1994). In contrast, only a few samples plot in the immature zone of Types I to III kerogens, as indicated by Tmax of < 435 °C (Fig. 15).

Figure 15 Cross-plots of HI vs. Tmax estimated by GMDH–DE model to show the basin’s maturity stage and kerogen type
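The Tmax cut-offs quoted above (immature below 435 °C, mature at 435–460 °C) can be applied to estimated values as in the following minimal sketch. The function name `maturity_stage`, the well names, and the sample values are hypothetical; the label used for values above 460 °C is an assumption, since that range is not discussed here.

```python
from collections import Counter

IMMATURE_BELOW_C = 435.0  # Tmax below this value plots in the immature zone
MATURE_UPTO_C = 460.0     # upper end of the mature range quoted in the text


def maturity_stage(tmax_c):
    """Classify a Tmax value (deg C) using the cut-offs quoted in the text."""
    if tmax_c < IMMATURE_BELOW_C:
        return "immature"
    if tmax_c <= MATURE_UPTO_C:
        return "mature"
    # Assumption: the text gives no label for values above 460 deg C.
    return "above mature range"


# Hypothetical estimated Tmax values per well, for illustration only.
estimated_tmax = {
    "Well A": [428.0, 432.0, 441.0, 434.0],
    "Well B": [445.0, 452.0, 433.0, 458.0],
}

stage_counts = {
    well: Counter(maturity_stage(t) for t in values)
    for well, values in estimated_tmax.items()
}
for well, counts in stage_counts.items():
    print(well, dict(counts))
```

Tallying stages per well in this way mirrors the reading of Fig. 15, where a well is described by the zone in which most of its samples plot.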

Conclusions

This study has shown that the proposed GMDH–DE Tmax model, as an independent novel approach, can be adopted for rapid, real-time assessment of the Tmax of organic matter in source rocks. Based on this study, the following conclusions can be drawn:

  1. One-dimensional Tmax modeling suggests that the lower Jurassic Mbuo source rocks entered the gas–oil window in late Triassic times and reached the expulsion onset during the early Jurassic. The Tmax thermal maturation and basin modeling vitrinite reflectance showed that Mbuo source rocks are immature to mature.

  2. The Tmax estimation using the GMDH–DE model was compared to that using GMDH and BPNN. The GMDH and BPNN models underestimated the Tmax values significantly. In contrast, the GMDH–DE model outperformed the other models, with estimates very close to the measured values. Therefore, the model estimated the Tmax of the organic matter in the source rock accurately and precisely and can be applied in other case studies.

  3. The results of the proposed GMDH–DE model were compared with those of the previously developed hybrid PCA–ANN and PSO–ANN models; the GMDH–DE model performed better than both. The SHAP sensitivity analysis showed that SP made the largest contribution to the GMDH–DE Tmax estimates, followed by DT, GR, RHOB, and LLD, with NPHI contributing the least.

  4. The GMDH–DE model outperformed the GMDH, BPNN, PSO–ANN, and PCA–ANN models in predicting Tmax values from well logs. Therefore, exploration and development of oil and gas resources might be significantly facilitated using the proposed hybrid GMDH–DE technique.

  5. Source rock analysis showed that the Mbate, Mbuo, and Mita Gamma Wells have fair-to-good generation potential, as indicated by HI and Tmax values. The HI values characterize kerogen Types II and III. The wells lie in the immature-to-mature zone on the HI vs. Tmax plot. Therefore, the wells may have generated oil and gas.