1 Introduction

Mathematical modeling allows researchers to describe the distribution of pollutants. One effective method for assessing the spatial distribution of pollutants is the land use regression (LUR) method, which constructs a mathematical model of that distribution from experimental data and geographic information system (GIS) data. A GIS is an information system designed for collecting, storing, analyzing, and graphically visualizing spatial data and related information about the objects represented. The resulting models are used to estimate pollution levels at points where measurements were not carried out.

The LUR method was originally called regression mapping [1]; researchers now call it land use regression. Different researchers apply the approach in different ways, and the method continues to develop and change, but every study shares the same basic stages of work [1,2,3,4,5,6,7,8,9,10,11,12,13]. Sampling locations are selected so that monitoring captures the entire spectrum of concentrations likely to occur in the city. Next, for each measurement location, a number of geographic variables are calculated that are expected to be associated with the distribution of contaminants; these describe the location of the measurement points, land use type, building density, traffic intensity, and more. Regression analysis is then performed to determine the relationships between the measured concentrations and the geographic variables. The result is a regression equation that can be used to estimate contaminant concentrations anywhere in the city.

Unlike interpolation methods, the LUR method uses not only measured concentrations but also various available GIS data to construct the pollutant distribution surface; including these data in the analysis significantly improves the model and the spatial resolution of the modeling results, which allows the researcher to make do with a small number of measurements. The choice of the number of sampling sites is limited by the physical and material capabilities of the researcher and the availability of the necessary measuring equipment. No rigorous methodology has been developed for determining the number of sampling sites; in previous studies it varied from 20 to 150 and should depend on local conditions, the expected variation in measured concentrations, and the size of the study area. To model the distribution of pollutants in a large city, it is reasonable to use 40 to 80 sampling sites [7]; this number is determined by the amount of data required for statistical analysis and does not depend on the size of the city. For example, in Canada [8], sampling locations were determined with an algorithm that took into account information about the transport network and the places of residence of the population being studied. The idea is to place more sampling points in areas with presumably greater spatial heterogeneity in contaminant concentrations (as measured by road density) and in areas of the city where the largest number of study participants live.

For the city under study, a geoinformation model is created: a geographic database that includes data on the road network, various types of land use, topographic parameters of the area, population density, and more. Using GIS technologies, predictor variables describing the measurement locations are calculated from this database.

Typically, researchers are able to convert the collected information into 50–150 different geographic variables, while the final LUR model usually includes between 3 and 8 of them. Since it is not known a priori which variables are most strongly related to pollution levels in a given locality, or what size the predictor buffer zones should be to best describe that relationship, a large set of potential predictor variables is calculated initially. For example, study [4] included 55 potential predictors and study [8] included 140.

The “classic” LUR uses regression as the interpolator. In this work, we propose using models based on artificial neural networks (ANNs) to construct the spatial distribution of a feature. By combining the land use (LU) ideology with machine learning, we hope to obtain more accurate predictions of a feature's spatial distribution.

The ANN most commonly used in environmental research is the multilayer perceptron (MLP). Owing to its widespread use, this type of network is well developed and has demonstrated high performance. The structure of an MLP is described by the number of neurons in each of its layers: input, hidden, and output. Perceptrons are widely used in studies of the distribution of chemical elements in soil [14,15,16,17,18,19,20]. Many researchers have used MLPs for digital soil mapping [21,22,23] and have demonstrated the superiority of MLP models over geostatistics.

The convolutional neural network (CNN) is a special artificial neural network architecture proposed by Yann LeCun and colleagues in 1989 for effective pattern recognition, and it is part of deep learning technology (LeCun et al., 1989). Such networks are widely used to detect patterns, process large amounts of data, solve classification problems, and more [24,25,26,27,28,29,30,31,32,33,34].

The purpose of this study is to build accurate models of the spatial distribution of dust concentration in the snow cover of an urbanized area, using predictors based on the land use ideology. We compared the accuracy of models based on traditional regression and on artificial neural networks, and from the model results we constructed maps of the spatial distribution of dust in the snow cover of the urban area.

2 Materials and methods

2.1 Data

For forecasting, we used data on the dust content in the snow cover (mg/l) in the city of Novy Urengoy (southern part), Yamalo-Nenets Autonomous Okrug, Russia (Fig. 1).

Fig. 1

Schematic map of the sampling site (Google Earth)

The total number of collected samples was 150. The data were divided into a training set (120 samples), used to obtain the regression coefficients and to train the neural networks, and a test set (30 samples), used exclusively to test the quality of the models' predictions. The actual location of each sampling point was determined directly on site, based on the need to sample areas with (visually) undisturbed snow cover. Snow samples were taken in accordance with current regulatory documents.
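As a minimal sketch of the split described above (the exact shuffling procedure is not specified in the text, so the random seed and use of a single random permutation are our illustrative assumptions), the 150 samples can be partitioned into 120 training and 30 test indices:

```python
import numpy as np

# Hypothetical reproduction of the 120/30 split: 150 sample indices are
# shuffled once and divided into a training and a test portion.
rng = np.random.default_rng(42)   # seed chosen for reproducibility (our choice)

n_samples, n_train = 150, 120
indices = rng.permutation(n_samples)
train_idx, test_idx = indices[:n_train], indices[n_train:]
```

The key property is that the two index sets are disjoint, so the test set never influences model fitting.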

The entire vertical section of the snow cover was analyzed, except for the lower 2–3 cm (to avoid contaminating the samples with soil particles). Samples were taken using the square-envelope method with a side of 2 m (four cores at the corners of the envelope and one in the center); the dimensions of the envelope could vary slightly depending on the size of the available area of undisturbed snow. The sample mass was brought to a value sufficient for chemical analysis (at least 3 kg) by adding snow cores collected inside the envelope. The samples were packed in double plastic bags, each marked with a sample number according to the sampling scheme and recorded in a field journal together with the situational features of the sampling location. To protect the samples from melting in the event of rising air temperature, they were tightly packed in corrugated cardboard boxes, which were then sealed, labeled, and submitted for chemical analysis to an accredited analytical laboratory. The dust concentrations in the snow samples were used for modeling.

2.2 Methods

In this work, we applied the predictor-selection ideology of the LUR method, on the basis of which six predictive models were created.

1. LUR—the standard prediction method based on a multiple regression equation whose predictors characterize the roads falling within buffer zones around each sampling point.

2. LUR + XY—geographic coordinates of points have been added to the standard predictors.

3. MLP (Multilayer perceptron) with predictors selected for the LUR method.

4. MLP + XY—adding coordinates to predictors.

5. CNN (Convolutional neural network) with predictors selected for the LUR method.

6. CNN + XY—adding coordinates to predictors.

Figure 2 shows the flow diagram of the study.

Fig. 2

Study flowchart

2.2.1 Land use regression

The ArcGIS geographic information system was used to construct buffer zones around each point. Figure 3 shows a fragment of a city map.

Fig. 3

Buffer zones around geolocations

It shows two measurement locations. To characterize the density of roads of various types, the area of parks, industrial and residential areas, and other parameters, so-called circular buffer zones—circles of different radii—are constructed around the coordinates of the measurement locations. Using GIS, the length of roads falling inside each circular buffer zone, the area of parkland, and so on are then calculated; modern GIS technologies thus make it easy to obtain many variables for buffer zones of any size. The choice of zone sizes should be based on information about the likely distribution of pollutants. When there are no clear data on the buffer zone sizes required to characterize certain parameters, a large number of variables with different buffer zone sizes can be calculated, from very small (several meters, as far as the spatial resolution of the input data allows) to very large (about 1 km; the substantive meaning of variables with a larger radius is lost).

To eliminate the influence of duplicate data falling into buffer zones of different sizes, we propose using disjoint rings rather than circles. In total, five buffer rings with radii of 0–100, 100–200, 200–300, 300–400 and 400–500 m were built around each geoposition (Fig. 3). In the Novy Urengoy territory we identified two types of roads: main (roads) and secondary (subroads). For each type of road, its intersection with every buffer ring was obtained and the area of these intersections was calculated (Fig. 4).
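The ring predictors above were computed in ArcGIS; a simplified NumPy sketch of the same idea is shown below, under the assumption that a road layer has been rasterized into cell centers of known area (the 5 m cell size, the function name, and the toy road geometry are all illustrative, not part of the study):

```python
import numpy as np

def ring_predictors(point, road_xy, edges=(0, 100, 200, 300, 400, 500)):
    """Area of road cells falling into each disjoint ring around `point`.

    `road_xy` is an (n, 2) array of rasterized road-cell centers; each cell
    is assumed to contribute `cell_area` square metres (hypothetical 5 m grid).
    The rings are defined by consecutive pairs of radii in `edges`.
    """
    cell_area = 25.0  # 5 m x 5 m raster cells -- an illustrative assumption
    d = np.hypot(road_xy[:, 0] - point[0], road_xy[:, 1] - point[1])
    counts, _ = np.histogram(d, bins=edges)  # cells per 100 m distance band
    return counts * cell_area

# toy example: a straight road running east of the sampling point
point = np.array([0.0, 0.0])
road = np.column_stack([np.arange(5, 500, 5.0), np.zeros(99)])
areas = ring_predictors(point, road)  # one value per ring, r0_100 ... r400_500
```

Because the rings are disjoint, each road cell contributes to exactly one predictor, which is the motivation for using rings instead of nested circles.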

Fig. 4

Construction of buffer zones in the form of non-intersecting rings using the example of geolocation 90

As a result, 10 predictors were obtained, 5 for each type of road (for example, the predictor r100_200 is the area of intersection of main roads with the ring of inner radius 100 m and outer radius 200 m for all 120 geopositions of the training set; the predictor sr400_500 is the area of intersection of secondary roads with the ring of inner radius 400 m and outer radius 500 m for the same geopositions). From the obtained predictors, a correlation matrix was constructed, from which redundant variables were preliminarily identified (see Table 1).

Table 1 Correlation matrix for determining redundant variables

The variance inflation factor (VIF) is a measure of the degree of multicollinearity in regression analysis. Multicollinearity exists when several independent variables in a multiple regression model are correlated with each other, which can negatively affect the regression results. The variance inflation factor estimates how much the variance of a regression coefficient is inflated due to multicollinearity.

$${\text{VIF}}_{i} = \frac{1}{{1 - R_{i}^{2} }},$$
(1)

where \(R_{i}^{2}\) is the coefficient of determination obtained by regressing the i-th variable on all the other predictors. The higher the VIF, the higher the likelihood that multicollinearity exists and further investigation is needed (VIF equal to 1: the variables are not correlated; VIF between 1 and 5: moderately correlated; VIF greater than 5: highly correlated).
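Equation (1) can be implemented directly: for each predictor, fit an auxiliary regression of that column on the remaining columns (with an intercept) and apply VIF = 1/(1 − R²). A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i
    of the predictor matrix X on all remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()   # R^2 of the auxiliary regression
        out[i] = 1.0 / (1.0 - r2)
    return out
```

For independent predictors the values stay near 1; a predictor that nearly duplicates another drives both VIFs far above the threshold of 5 used in the text.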

Based on the calculations, we compiled a table of variance inflation factors (Table 2).

Table 2 Variance inflation factors

Based on Table 2, four predictors (highlighted in bold) were retained for further work: r0-100, r200-300, r400-500, sr0-100.

2.2.2 Multilayer perceptron

As the first type of ANN, we used a multilayer feed-forward perceptron trained with the Levenberg–Marquardt method. The network structure was determined through computer simulation. The input layer receives the predictor values for a sampling point, the hidden layer consists of several neurons, and the output layer represents the dust concentration in the corresponding sample. The number of neurons in the hidden layer was selected by minimizing the mean square error of the prediction on the training set. The optimal MLP model turned out to be a network with 8 neurons in the hidden layer, Levenberg–Marquardt training, and the hyperbolic tangent activation function.

The best network for the MLP + XY model had nine neurons in the hidden layer, with the Levenberg–Marquardt training method and the hyperbolic tangent activation function.
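A sketch of an MLP of the shape described above (8 tanh neurons in one hidden layer, four predictor inputs, one output) is shown below using scikit-learn. Note that scikit-learn does not implement Levenberg–Marquardt, so the `lbfgs` solver is used here as a stand-in; the synthetic data and coefficients are purely illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))                 # stand-in for the four ring predictors
y = X @ np.array([3.0, 1.0, 0.5, 2.0]) + rng.normal(0.0, 0.05, 120)

# 4 inputs -> 8 tanh hidden neurons -> 1 output, as in the study's MLP model;
# inputs are standardized, which is common practice (our assumption here)
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                 solver="lbfgs", max_iter=2000, random_state=0),
)
model.fit(X, y)
```

Selecting the hidden-layer size then amounts to repeating this fit for several candidate sizes and keeping the one with the lowest training MSE, as the text describes.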

2.2.3 Convolutional neural network

The CNN is a deep learning architecture commonly used in computer vision, the branch of artificial intelligence that allows a computer to understand and interpret images and other visual data. A CNN is an extension of the ANN designed primarily to extract features from grid-like data, such as images or video, where spatial patterns play an important role. The convolutional layers apply filters to the input to extract features, the pooling layers downsample the feature maps to reduce computation, and the fully connected layers make the final prediction. The network is trained through backpropagation and gradient descent.

Advantages of CNNs: they are good at detecting patterns and features in images, video, and audio signals; they are robust to translation, rotation, and scaling; and they can process large amounts of data and achieve high accuracy. Disadvantages: training is computationally expensive and memory intensive; the networks are prone to overfitting when too little data is used; and their interpretability is limited, making it difficult to understand what the network has learned.

The best network for the CNN model had two convolutional layers, one fully connected layer, and a regression layer at the output. The activation function in all cases was ReLU. The first convolutional layer had 30 filters of size 2 × 2; the second had 10 filters of size 1 × 1. The fully connected layer had 50 neurons.

The optimal network for the CNN + XY model contained two convolutional layers, a fully connected layer, and a regression layer. The first convolutional layer had 20 filters of size 2 × 2; the second had 10 filters of size 2 × 2. The fully connected layer had 150 neurons.
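To make concrete what a 2 × 2 convolutional filter followed by ReLU computes, here is a plain NumPy sketch of the single-channel, stride-1, no-padding case (the actual networks were presumably built in a deep learning framework; the function names, toy input, and filter weights below are ours):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation, as in a CNN convolutional layer
    (no padding, stride 1): each output value is the dot product of the
    kernel with the image patch directly under it."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# toy 3x3 input and one illustrative 2x2 filter: a 3x3 image with a 2x2
# kernel yields a 2x2 feature map
img = np.arange(9.0).reshape(3, 3)
feat = relu(conv2d_valid(img, np.array([[-1.0, 0.0], [0.0, 1.0]])))
```

A layer with "30 filters of size 2 × 2" simply stacks 30 such feature maps, each with its own learned kernel.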

2.3 Accuracy assessment

The accuracy of the forecast was evaluated with the following indicators: mean absolute error (MAE), root mean square error (RMSE), root mean square relative error (RMSRE), normalized root mean squared error (NRMSE), Willmott's agreement indices (IA1, IA2), and the standard deviation. The indices IA1 and IA2 are standardized measures of model prediction error ranging from 0 to 1, where 1 indicates a complete match and 0 indicates a complete lack of agreement [35, 36].

$${\text{MAE}} = \frac{\sum_{i = 1}^{n} \left| P(t_{i}) - O(t_{i}) \right|}{n},$$
(2)
$${\text{RMSE}} = \sqrt{\frac{\sum_{i = 1}^{n} \left( P(t_{i}) - O(t_{i}) \right)^{2}}{n}},$$
(3)
$${\text{RMSRE}} = \sqrt{\frac{1}{n}\sum_{i = 1}^{n} \left( \frac{P(t_{i}) - O(t_{i})}{O(t_{i})} \right)^{2}},$$
(4)
$${\text{NRMSE}} = \sqrt{\frac{\sum_{i = 1}^{n} \left( O(t_{i}) - P(t_{i}) \right)^{2}}{\sum_{i = 1}^{n} O(t_{i})^{2}}},$$
(5)
$${\text{IA}}1 = 1 - \frac{\sum_{i = 1}^{n} \left| P(t_{i}) - O(t_{i}) \right|}{\sum_{i = 1}^{n} \left( \left| P(t_{i}) - \overline{O} \right| + \left| O(t_{i}) - \overline{O} \right| \right)},$$
(6)
$${\text{IA}}2 = 1 - \frac{\sum_{i = 1}^{n} \left( P(t_{i}) - O(t_{i}) \right)^{2}}{\sum_{i = 1}^{n} \left( \left| P(t_{i}) - \overline{O} \right| + \left| O(t_{i}) - \overline{O} \right| \right)^{2}},$$
(7)

where \(O({t}_{i})\) and \(P({t}_{i})\) are the measured and predicted values at geolocation \({t}_{i}\), respectively, \(\overline{O}\) is the mean measured value, and \(n\) is the number of measurements.
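The six error measures of Eqs. (2)–(7) translate directly into NumPy; a compact sketch (the function name and return format are ours) is:

```python
import numpy as np

def metrics(O, P):
    """Accuracy indices from Eqs. (2)-(7); O = observed, P = predicted."""
    O, P = np.asarray(O, float), np.asarray(P, float)
    Obar = O.mean()
    mae = np.abs(P - O).mean()                               # Eq. (2)
    rmse = np.sqrt(((P - O) ** 2).mean())                    # Eq. (3)
    rmsre = np.sqrt((((P - O) / O) ** 2).mean())             # Eq. (4)
    nrmse = np.sqrt(((O - P) ** 2).sum() / (O ** 2).sum())   # Eq. (5)
    denom = np.abs(P - Obar) + np.abs(O - Obar)
    ia1 = 1.0 - np.abs(P - O).sum() / denom.sum()            # Eq. (6)
    ia2 = 1.0 - ((P - O) ** 2).sum() / (denom ** 2).sum()    # Eq. (7)
    return {"MAE": mae, "RMSE": rmse, "RMSRE": rmsre,
            "NRMSE": nrmse, "IA1": ia1, "IA2": ia2}
```

For a perfect prediction (P = O), all four error measures are zero and both agreement indices equal 1, matching the interpretation given above.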

3 Results and discussion

Table 3 shows the descriptive statistics on dust concentration in snow cover.

Table 3 Descriptive statistics on dust concentration in snow cover (mg/l)

Table 4 shows the accuracy of the models. The best values are in bold.

Table 4 Summary table of model performance indicators

The summary table shows that the ANN-based models were more accurate than the regression-based models (classic LUR): they have the smallest errors, the best model quality indices (IA1 and IA2), and the highest correlation coefficient. The best ANN-based model had an MAE 26% lower than that of the best regression model; its RMSE, RMSRE, and NRMSE were lower by 19%, 12.5%, and 19%, respectively, while its IA1 and IA2 indices were higher by 3% and 14%, and its correlation coefficient was higher by 4%.

Figure 5 shows the predicted dust concentrations plotted against the measured values for all six compared models.

Fig. 5

Dependence of the predicted dust concentration on the measured value for the compared models (confidence bounds—95%)

A Taylor diagram [37] was used to evaluate the agreement between modeled and observed snow dust concentrations; it allows us to assess how accurately the proposed forecasting methods reproduce the observed data. The diagram combines the centered root mean square difference, the correlation coefficient, and the standard deviation of all forecast models. A marker for each model is plotted in polar coordinates: the radial distance from the origin represents the standard deviation, the azimuthal position represents the correlation coefficient between measured and predicted distributions, and concentric circles represent the centered RMS difference. The Taylor diagram distinguishes the models more clearly than a conventional forecast plot (Fig. 6).
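The three quantities plotted on a Taylor diagram can be computed as follows (a sketch with synthetic data; the function name is ours). They are linked by a law-of-cosines identity, which is what makes the polar layout of the diagram possible:

```python
import numpy as np

def taylor_stats(O, P):
    """Quantities a Taylor diagram plots for one model: the standard
    deviations of observed and predicted series, their correlation, and
    the centered RMS difference between them."""
    O, P = np.asarray(O, float), np.asarray(P, float)
    so, sp = O.std(), P.std()
    r = np.corrcoef(O, P)[0, 1]
    crms = np.sqrt((((P - P.mean()) - (O - O.mean())) ** 2).mean())
    return so, sp, r, crms

# synthetic observed/predicted pair for illustration
rng = np.random.default_rng(3)
O = rng.normal(size=100)
P = 0.8 * O + rng.normal(scale=0.3, size=100)
so, sp, r, crms = taylor_stats(O, P)
```

The identity crms² = σ_O² + σ_P² − 2·σ_O·σ_P·r holds exactly, so placing each model at radius σ_P and angle arccos(r) automatically encodes its centered RMS difference as the distance to the observation point.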

Fig. 6

Taylor diagram

The diagram shows that all the models are close in accuracy. The most accurate were MLP and CNN + XY, which had the smallest errors and the largest correlation coefficients, with standard deviations very close to that of the measured values.

To obtain a more detailed forecast, a regular grid with a spacing of 50 m was constructed, spanning the coordinate range of the original data set. A forecast was made on this grid with each trained model, and from the results, maps of the spatial distribution of dust in the snow cover of the study area were constructed (Fig. 7).
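Constructing such a prediction grid is a one-liner with `np.meshgrid`; the bounding-box coordinates below are hypothetical placeholders for the actual extent of the sampling points:

```python
import numpy as np

# bounding box of the sampling points (illustrative coordinates, metres)
xmin, xmax, ymin, ymax = 0.0, 2000.0, 0.0, 3000.0
step = 50.0  # grid spacing from the text

xs = np.arange(xmin, xmax + step, step)
ys = np.arange(ymin, ymax + step, step)
gx, gy = np.meshgrid(xs, ys)
grid_points = np.column_stack([gx.ravel(), gy.ravel()])  # (n_points, 2)
```

Each trained model is then evaluated at every row of `grid_points` (after computing the ring predictors for that location), and the predictions are rendered as the concentration maps in Fig. 7.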

Fig. 7

Maps of dust concentration distribution in snow cover for different models

The figure shows that the highest dust concentrations in the snow cover occur at the intersections of large roads, which is evidently associated with the larger number of vehicles passing through them per unit time. The maps also show how the dust concentration changes with distance from the roadway: after about 100 m the concentration drops by a factor of 2, and at a distance of about 500 m it reaches the background level.

4 Conclusion

In this study, we proposed an improved land use methodology that constructs ring-shaped spatial variables for modeling and mapping the spatial distribution of dust in the snow cover of an urbanized area. The land use regression method consists of building a mathematical model from experimental data and geographic information system (GIS) data. Using the LU approach, we created two models based on classical regression and four based on artificial neural networks (two multilayer perceptrons and two convolutional neural networks). The data for the study were obtained by screening the snow cover in the city of Novy Urengoy (Russia). To assess the quality and accuracy of the models, we used eight indices and a Taylor diagram to visualize the results. The ANN-based models were more accurate than the regression-based models (classic LUR): the best ANN-based model had an MAE 26% lower than that of the best regression model; its RMSE, RMSRE, and NRMSE were lower by 19%, 12.5%, and 19%, respectively, while its IA1 and IA2 indices were higher by 3% and 14%, and its correlation coefficient was higher by 4%. From the model results, we constructed detailed maps of the spatial distribution of dust in the snow cover of the study area. The maps clearly show the road intersections with the most severe dust pollution, the detailed spatial distribution of dust in the snow cover, and the zone of influence of roads on the dust concentration.