1 Introduction

Images captured in real-world hazy scenes often suffer from degraded visibility due to the presence of dust, smoke, or other atmospheric particles. These visual artifacts pose formidable challenges to high-level computer vision tasks, such as object detection and tracking [1, 2], person re-identification [3, 4], and semantic segmentation [5, 6]. Consequently, image dehazing has moved to the forefront of research in recent years.

The haze effect is commonly represented by the atmospheric scattering model [7]:

$$\begin{aligned} I(x)=J(x)t(x)+A(1-t(x)), \end{aligned}$$
(1)

where I(x) represents the observed hazy image, J(x) the clean background, t(x) the transmission map, and A the global atmospheric light. Based on this atmospheric scattering model, early methods rely on hand-crafted assumptions or condition priors to recover haze-free backgrounds [7,8,9,10,11,12,13]. For instance, the Dark Channel Prior (DCP) method [7] builds on the observation that haze-free patches contain low-intensity values in at least one color channel. Despite their theoretical appeal, prior-based methods are less effective in practical scenarios because the assumed priors often deviate substantially from real data.

The era of big data has ushered in a profound transformation in the field of image dehazing. Various large-capacity models with complex network structures have been proposed in pursuit of continuous performance improvement [14,15,16,17,18,19]. However, these large-capacity networks come with their own set of challenges. Their heavy demands on memory and computational resources make them impractical for deployment on resource-constrained platforms, such as mobile terminals and embedded devices. This limitation has led to the emergence of lightweight dehazing models, which fall into two categories. The first category, termed “parameter-first” models [20,21,22], operates with a small parameter count, which is essential for memory-constrained environments. However, with so few learnable parameters, their feature extraction ability is weakened, which inevitably degrades dehazing performance. The second category pursues faster inference speed [23]; these models generally adopt a symmetric network structure to improve real-time efficiency. Although the symmetric structure is amenable to hardware acceleration, such models still carry a large number of parameters.

It is worth noting that although these two lines of lightweight models follow different technical routes, both overlook two common and critical problems. First, the distinct levels of haze concentration in real-world scenes cause varying degrees of image detail degradation; popular convolutional blocks or encoder-decoder components struggle to extract image features comprehensively within a limited parameter budget. Second, different features in hazy images carry entirely different weighted information, yet current lightweight dehazing models tend to process features and pixels uniformly, lacking the flexibility to handle different levels of information.

To address these issues, the key to lightweight image dehazing lies in learning discriminative features. We believe such features need to be omni-scale, because richer and more complex features spanning multiple scales help exploit the maximum potential of the model, especially for lightweight neural networks. To this end, this paper presents a novel Lightweight Omni-Scale Dehaze Network (LOS-DehazeNet). LOS-DehazeNet implements omni-scale feature learning through two core architectural components: the Omni-Scale Feature Aggregation (OSFA) block and the Omni-Scale Hybrid Attention (OSHAtt) block. The OSFA block produces multi-scale convolutional streams and leverages two novel mapping operations to achieve rich feature fusion and information interaction. The first operation, cascading mapping, processes feature streams sequentially, providing more flexibility for information flow between branches and dramatically enlarging the receptive field of the OSFA block. The second operation, parallel mapping, dynamically integrates scale-independent features through a novel shared aggregation gate module. Together, these operations adaptively fuse omni-scale features, enabling a thorough feature parsing of hazy images.

Besides the OSFA block, the other core component of LOS-DehazeNet is the Omni-Scale Hybrid Attention (OSHAtt) block. The OSHAtt block is built on a dual residual architecture equipped with three levels of attention: channel, pixel, and spatial. This design lets the model weight different types of information according to their importance, thereby attending to the critical features hidden in hazy images. Meanwhile, the dual residual connection increases the potential information flow between the OSFA block and the OSHAtt block, which benefits the performance of our lightweight model.

Extensive experiments are conducted on benchmark datasets to validate the effectiveness and superiority of our LOS-DehazeNet. As shown in Fig. 1, our model achieves state-of-the-art metric values and the fastest inference speed. Notably, LOS-DehazeNet has only 110k parameters, a 5\(\times \) compression compared with the latest lightweight dehazing model GRFANet [23].

Fig. 1

Comparison results (i.e., PSNR vs. Parameters and PSNR vs. Time) on the challenging SOTS(Indoor) dataset. One can see that our LOS-DehazeNet achieves state-of-the-art metric values and the fastest inference speed

The contributions of this work are summarized from three aspects:

- We propose a novel lightweight dehazing network LOS-DehazeNet, which achieves state-of-the-art metric values and the fastest inference speed. LOS-DehazeNet has a 5\(\times \) smaller memory footprint compared to the latest lightweight model GRFANet, making it a better choice for resource-constrained environments.

- To the best of our knowledge, we are the first to systematically explore the concept of omni-scale feature representations in image dehazing. This innovative approach offers new perspectives on lightweight dehazing network design, providing a promising avenue for future research in the field.

- We present two novel architectural blocks: the Omni-Scale Feature Aggregation (OSFA) block and the Omni-Scale Hybrid Attention (OSHAtt) block. These architectural innovations are critical to balancing network capacity against model performance.

The rest of this paper is organized as follows. Section 2 provides a brief overview of single image dehazing, lightweight network design, and attention mechanisms. Section 3 describes the details of our LOS-DehazeNet. Section 4 presents comprehensive experimental details and comparisons. Finally, Section 5 concludes with a summary and future work.

2 Related work

2.1 Prior-based dehazing methods

Prior knowledge plays a crucial role in image processing. For instance, NSS-based QMC [24] explores the potential causality between nonlocal self-similarity and the low-rank property of color images, effectively solving the color image inpainting problem. Unlike such low-rank approximation, image dehazing methods generally use statistical knowledge to model the haze representation. For instance, He et al. [7] observe that most local patches in haze-free images contain pixels with very low intensity in at least one color channel, and introduce the concept of the dark channel prior. However, the dark channel prior is often unreliable when scene objects resemble the atmospheric light. To address this limitation, Xie et al. [11] propose a multi-scale retinex algorithm in the YCbCr color space. Zhu et al. [9] design a color attenuation prior based on a linear model of scene depth. Berman et al. [12] introduce a non-local method operating on individual pixels of the hazy image. Another technical route restores hazy images through information fusion strategies. Specifically, Ancuti et al. [8] apply white balance and a contrast-enhancing procedure to the original hazy images. Lin et al. [13] combine a fuzzy inference system with a neural network filter to eliminate the haze effect. Singh et al. [25] estimate depth maps from hazy images using a gradient profile prior. Nonetheless, due to the complexity of real scenes, hand-crafted priors often prove inaccurate in real-world scenarios, making these methods unsuitable for practical applications.

2.2 Learning-based dehazing networks

In the era of big data, learning-based networks have become a research hotspot in academia and industry. Cai et al. [14] first construct a shallow CNN, DehazeNet, which facilitates the development of dehazing networks. Subsequently, researchers introduce a variety of neural networks with different structures to enhance dehazing performance. Ren et al. [16] propose a gated fusion network that adopts a fusion-based strategy. Chen et al. [17] introduce smoothed dilation techniques to remove gridding artifacts and leverage a gated sub-network to fuse deep features. Liu et al. [26] incorporate dual residual connections to enhance model representation. Beyond network structure design, some novel mechanisms and operations have also been introduced. For instance, Liu et al. [27] propose GridDehazeNet, which considers three aspects: pre-processing, backbone, and post-processing. Huang et al. [28] introduce a self-filtering block and a self-supporting module to remove redundant features and recover image details. Dong et al. [18] incorporate a strengthen-operate-subtract boosting strategy to restore the haze-free image progressively. Although these deep networks achieve strong dehazing performance, they require high memory and computational cost. To address this challenge, lightweight models have been proposed, falling into two categories. The first category, termed “parameter-first” models [20,21,22], operates with a small parameter count, which is essential for memory-constrained environments. For instance, Li et al. [20] propose an all-in-one dehazing network by re-formulating the atmospheric scattering model. Zhang et al. [21] design a model comprising encoders at three scales and a fusion module to directly learn the haze-free image. Ullah et al. [22] propose LightDehazeNet, which estimates the transmission map and the atmospheric light. However, with so few learnable parameters, the feature extraction ability of these models is weakened, which inevitably degrades dehazing performance. The second category pursues faster inference speed [23]. Specifically, Yi et al. [23] design an encoder-decoder structure similar to U-Net; this symmetric structure enables fast inference, but the model still carries a large number of parameters.

Fig. 2

The whole architecture of our LOS-DehazeNet. LOS-DehazeNet is composed of two \(3\times 3\) convolution layers, a feature extraction module, and an attention enhancement module. Among them, the feature extraction module consists of several omni-scale feature aggregation blocks, and the attention enhancement module contains four omni-scale hybrid attention blocks

2.3 Lightweight network design

The growing demand for practical applications has driven the development of lightweight neural networks. SqueezeNet [29] designs a fire module; compared with AlexNet [30], it reduces parameters by 50\(\times \) while achieving similar accuracy on ImageNet. Xception [31] splits the input channels and introduces depth-wise and point-wise convolutions to compress model parameters. MobileNets [32, 33] introduce an inverted residual bottleneck and utilize neural architecture search to construct lightweight models for mobile devices. ShuffleNets [34, 35] design a channel shuffle operation for deep feature fusion. GhostNet [36] exploits the redundancy of features, designing a linear operation to replace standard convolution layers. VGNetG [37] offers several novel design guidelines and proposes a down-sampling block and a half-identity block. However, although these networks are successful in various computer vision tasks, they are often ill-suited and overly complex for image dehazing. In contrast, this paper adopts a simplified approach in the form of omni-scale feature aggregation blocks, removing operations and layers that are unnecessary for dehazing, such as down-sampling, batch normalization, and channel splitting, to improve efficiency.

2.4 Attention mechanism

Attention mechanisms play a key role in enhancing feature representations within networks by focusing computational resources on the critical components of a signal. For instance, SENet [38] adaptively re-calibrates channel-wise features by modeling the dependencies between channels. SKNet [39] adopts a dynamic selection mechanism to integrate multi-scale input information. CBAM [40] proposes a general module that fuses channel and spatial attention. SANet [41] designs a channel shuffle operation to enable information exchange between different sub-features. SCA-CNN [42] considers the limitations of spatial attention and incorporates spatial and channel-wise attention for the image captioning task. PDANet [43] incorporates both spatial and channel-wise attention into a CNN for visual emotion regression, jointly considering the local spatial connectivity patterns along each channel and the interdependency between different channels. SCConv [44] designs a spatial reconstruction unit and a channel reconstruction unit, whose combination effectively alleviates feature redundancy. While these mechanisms have proven effective, they still exhibit high spatial and computational complexity; consequently, they are not suitable for directly constructing lightweight dehazing models. Furthermore, image dehazing is a pixel-by-pixel restoration process. Most existing attention mechanisms focus primarily on the interplay between channel and spatial information, overlooking the intrinsic connections between individual pixels. Introducing pixel-level information into lightweight models is therefore crucial, as it significantly contributes to maximizing model performance.

3 Methodology

In this section, we first describe the design principles and overall structure of our LOS-DehazeNet. Then, we present the implementation details of the two core blocks, namely the omni-scale feature aggregation block and the omni-scale hybrid attention block. Finally, we discuss our choice of loss function.

3.1 Architecture overview

Based on (1), we first rewrite the expression for the dehazed image:

$$\begin{aligned} J(x)=I(x)\times \frac{1}{t(x)}-A\times \frac{1}{t(x)}+A, \end{aligned}$$
(2)
Table 1 Detailed configurations of the proposed LOS-DehazeNet
Fig. 3

The architecture of the proposed OSFA block. The OSFA block is a residual bottleneck, equipped with different depthwise separable convolutions and a shared aggregation gate. The green arrow represents the proposed cascading mapping operation, and the orange arrow represents the parallel mapping operation

This equation clearly shows that estimating the transmission map t(x) and the atmospheric light A are the two key steps of the dehazing process. However, these steps require dedicated sub-networks for their respective sub-tasks, which runs counter to the goal of designing lightweight models. We therefore simplify the haze model as follows:

$$\begin{aligned} J(x)=I(x)-H(x), \end{aligned}$$
(3)

where J(x) represents the dehazed image, I(x) represents the observed hazy image, and H(x) represents all haze noise. Based on (3), we construct an end-to-end lightweight model LOS-DehazeNet to directly learn the residual mapping between J(x) and I(x), as shown in Fig. 2.

LOS-DehazeNet adopts a cascaded network architecture containing two \(3\times 3\) convolution layers, a feature extraction module, and an attention enhancement module. The first \(3\times 3\) convolution layer extracts shallow features from the input image, and the feature extraction module then processes these features in an omni-scale manner. Notably, we abandon the upsampling/downsampling operations commonly used in dehazing networks [23], because they lead to imbalanced channel widths and inconsistent feature map sizes, which hinders the optimization of capacity and inference speed in lightweight models [34]. The generated feature maps are therefore kept at a constant size. Next, the attention enhancement module focuses on valuable features; its dual residual architecture and three levels of attention add flexibility when dealing with different types of information. Table 1 lists the detailed configurations of our LOS-DehazeNet. The following sub-sections present the implementation details of the OSFA block and the OSHAtt block.
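To make the residual design in (3) concrete, the following is a minimal PyTorch sketch of the top-level architecture. The channel width, block counts, and the placeholder convolutions standing in for the OSFA and OSHAtt blocks are illustrative assumptions, not the exact configuration of Table 1.

```python
import torch
import torch.nn as nn

class LOSDehazeNetSketch(nn.Module):
    """Top-level skeleton: predict the haze component H(x), then subtract it."""
    def __init__(self, channels=32, num_osfa=6, num_oshatt=4):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)   # shallow feature extraction
        # Stand-ins for the OSFA and OSHAtt blocks described in Sections 3.2-3.3.
        self.features = nn.Sequential(
            *[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_osfa)])
        self.attention = nn.Sequential(
            *[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_oshatt)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)   # predicts the haze map H(x)

    def forward(self, hazy):
        feats = self.attention(self.features(self.head(hazy)))
        return hazy - self.tail(feats)                     # J(x) = I(x) - H(x), as in (3)
```

Note that every feature map keeps the input resolution, matching the decision above to avoid upsampling/downsampling.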

3.2 Omni-scale feature aggregation block

Previous dehazing networks are generally constructed from ordinary convolutional blocks or encoder-decoder components. They must stack many blocks for sufficient feature extraction, resulting in high memory and computational cost. To handle this issue, we propose a novel Omni-Scale Feature Aggregation (OSFA) block, as shown in Fig. 3.

The macro structure of our OSFA block follows the design principles of lightweight blocks [32, 35]: a residual bottleneck equipped with depthwise separable convolutions. Given an input, we first use pointwise convolution to adjust the feature dimension. Then, we apply depthwise convolutions with various kernel sizes (ranging from \(3\times 3\) to \(9\times 9\)) to extract reliable texture features and comprehensive semantic information. Note that we do not use larger kernels (e.g., \(11\times 11\)) because they cannot properly balance model performance and parameter count [45, 46].

The OSFA block also includes two new mapping operations, cascading mapping and parallel mapping, indicated by the green and orange arrows in Fig. 3. These two operations achieve rich feature fusion and information interaction within the OSFA block. Cascading mapping sequentially chains the feature maps of the different depthwise convolutional layers, with the connection order proceeding from small to large kernels. The benefits of this design are two-fold.

First, the structural redundancy problem of multi-branch blocks is mitigated. Simply stacking ordinary multi-branch blocks produces multiple “similar scales” in the feature maps [47], whereas cascading mapping provides more flexibility for information flow between branches, so more multi-scale information from the previous feature map is preserved. Second, cascading mapping greatly enlarges the receptive field of the OSFA block, which helps the network learn global image features. Specifically, an ordinary multi-branch module typically either uses a large \(9\times 9\) convolution kernel directly or stacks four \(3\times 3\) layers; either way, the maximum receptive field within the module is only \(9\times 9\). For our OSFA block, the receptive field R can be formulated as:

$$\begin{aligned} R=l_n\times l_n, \quad s.t.\quad l_k=l_{k-1}+\left( (f_k-1)\times \prod _{i=1}^{k-1} {s_i}\right) \end{aligned}$$
(4)

where \(l_k\) denotes the receptive field of layer k (with \(l_0=1\)), \(f_k\) denotes the kernel size, and \(s_i\) denotes the stride. In the OSFA block, we set \(f_1=3\), \(f_2=5\), \(f_3=7\), and \(f_4=9\); for simplicity, we assume \(s_i=1\). The ultimate receptive field of the OSFA block is then \(21\times 21\). Compared with the ordinary multi-branch module, the receptive field is expanded by 5.4 times.
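The recursion in (4) is easy to verify numerically. The short sketch below, with \(l_0=1\) and unit strides, reproduces the \(21\times 21\) figure for the kernel sequence 3, 5, 7, 9.

```python
def receptive_field(kernels, strides=None):
    """Receptive-field recursion from (4): l_k = l_{k-1} + (f_k - 1) * prod(s_i)."""
    strides = strides or [1] * len(kernels)
    l, jump = 1, 1                       # l_0 = 1: a single pixel
    for f, s in zip(kernels, strides):
        l += (f - 1) * jump              # grow by (kernel - 1), scaled by accumulated stride
        jump *= s
    return l

print(receptive_field([3, 5, 7, 9]))     # -> 21, i.e., a 21x21 receptive field
print(receptive_field([3, 3, 3, 3]))     # -> 9, the ordinary multi-branch baseline
```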

In addition to the cascading mapping operation, we further design a parallel mapping operation, which dynamically integrates the scale-independent features generated by convolution kernels of different sizes. As shown in Fig. 3, parallel mapping embeds a novel shared aggregation gate (SAG) module. Aggregation gates have proven effective in other computer vision tasks, such as person re-identification [48]: a multi-layer perceptron dynamically combines the outputs of different streams, so that different weights can be assigned to different scales depending on the input image. Our SAG module is a tiny network containing two pointwise convolutions, which implement the compression and expansion of the channel dimension, respectively. Note that there are clear differences between our SAG module and the gate in OSNet [48]. First, the input to parallel mapping comes from the cascaded, integrated features, which can be viewed as a special iterator over different scale features. Second, compared with fully connected layers, pointwise convolutions further reduce the parameter count of our OSFA block. This design makes a better trade-off between parameter count and model performance, which is particularly beneficial for lightweight neural networks.
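The sketch below shows one way to realize the OSFA block in PyTorch under the description above: cascaded depthwise streams (3→5→7→9) whose outputs are fused by a shared gate of two pointwise convolutions. The bottleneck width, gate reduction ratio, and activation functions are assumptions; the wiring follows our reading of Fig. 3.

```python
import torch.nn as nn

class SharedAggregationGate(nn.Module):
    """Tiny SAG: two pointwise convolutions that compress, then expand channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)          # input-conditioned weights for each stream

class OSFABlock(nn.Module):
    def __init__(self, channels=32, mid=16, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, 1)               # pointwise: adjust dimension
        self.streams = nn.ModuleList(
            nn.Conv2d(mid, mid, k, padding=k // 2, groups=mid)  # depthwise conv per scale
            for k in kernels)
        self.sag = SharedAggregationGate(mid)                   # one gate shared by all scales
        self.expand = nn.Conv2d(mid, channels, 1)               # pointwise: restore dimension

    def forward(self, x):
        y, fused = self.reduce(x), 0
        for conv in self.streams:
            y = conv(y)                  # cascading mapping: each scale feeds the next
            fused = fused + self.sag(y)  # parallel mapping: gated fusion of every scale
        return x + self.expand(fused)    # residual bottleneck
```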

3.3 Omni-scale hybrid attention block

Lightweight neural networks generally suffer performance degradation due to their limited number of learnable parameters. In this case, incorporating attention mechanisms is a good way to improve performance. Existing methods generally use channel attention, spatial attention, or both. However, these attention mechanisms still exhibit high spatial and computational complexity, making them unsuitable for directly constructing lightweight dehazing models. Furthermore, image dehazing involves a pixel-by-pixel restoration process, and most existing attention mechanisms focus primarily on the interplay between channel and spatial information, overlooking the intrinsic connections between individual pixels. To this end, we design a novel Omni-Scale Hybrid Attention (OSHAtt) block, as shown in Fig. 4. The OSHAtt block makes a dual connection between compact convolutional layers and omni-scale attention mechanisms (i.e., channel attention, pixel attention, and spatial attention). This combination increases the potential information flow: less critical information, such as thin haze regions, can bypass the block through the dual residual connections, while the subsequent attention modules learn more critical features along different dimensions. As a result, our OSHAtt block alleviates the performance degradation of lightweight dehazing models, allowing their performance to approach that of deep models while maintaining a low parameter count.

Fig. 4

The overall architecture of the proposed OSHAtt block. The OSHAtt block is a dual residual bottleneck equipped with the \(3\times 3\) depthwise convolution and the attention modules (i.e., channel attention, pixel attention, and spatial attention)

Channel Attention (CA) The role of our channel attention mechanism is to assign varying weights to different channel features, since the deep features generated by the feature extraction module hold varying degrees of importance along the channel dimension. As shown in Fig. 5, we first employ an average pooling layer to condense channel-wise global information. A pointwise convolution, followed by a LeakyReLU activation function, then generates an intermediate feature map. Finally, this intermediate feature map passes through a second pointwise convolution and the Sigmoid function to compute the weights. The channel attention process can be summarized as follows:

$$\begin{aligned} CA(x)=\delta (PConv(NA(PConv(Avg(x))))) \end{aligned}$$
(5)
$$\begin{aligned} F_{CA}=x\otimes {CA(x)} \end{aligned}$$
(6)

where x represents an input feature map, Avg stands for the average pooling function, PConv denotes the pointwise convolution operation, NA represents the LeakyReLU non-linear activation function, \(\delta \) signifies the Sigmoid function, \(F_{CA}\) corresponds to the output of channel attention, and \(\otimes \) indicates the element-wise product.
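A direct transcription of (5)-(6) into PyTorch is given below; the channel reduction ratio inside the two pointwise convolutions is our assumption, as it is not specified here.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # Avg: condense global channel info
            nn.Conv2d(channels, channels // reduction, 1),  # PConv (compress)
            nn.LeakyReLU(0.2, inplace=True),                # NA
            nn.Conv2d(channels // reduction, channels, 1),  # PConv (expand)
            nn.Sigmoid())                                   # delta

    def forward(self, x):
        return x * self.ca(x)                               # F_CA = x (x) CA(x)
```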

Fig. 5

The architecture of the Channel Attention (CA) module. The CA module includes an average pooling operation, several pointwise convolution layers, activation functions, and element-wise product operations

Pixel Attention (PA) In addition to channel attention, pixel attention is another vital component of our OSHAtt block. Hazy images often exhibit complex backgrounds, making it challenging to extract crucial features of small objects. In such scenarios, pixel-level attention is beneficial for grasping fine-grained details and information in hazy images. As shown in Fig. 6, we take the input \(F_{CA}\) (the output of the channel attention mechanism) and pass it through two consecutive pointwise convolution layers, with LeakyReLU and Sigmoid as activation functions. The output of the pixel attention (PA) is computed as follows:

$$\begin{aligned} PA(F_{CA})=\delta (PConv(NA(PConv(F_{CA})))) \end{aligned}$$
(7)
$$\begin{aligned} F_{PA}=F_{CA}\otimes {PA(F_{CA})} \end{aligned}$$
(8)

where \(F_{CA}\) signifies the input to the pixel attention module, PConv represents the pointwise convolution operation, NA stands for the LeakyReLU non-linear activation function, \(\delta \) denotes the Sigmoid function, \(F_{PA}\) corresponds to the output of the pixel attention, and \(\otimes \) indicates the element-wise product.
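The pixel attention of (7)-(8) can be sketched as follows. Producing a single-channel map, so that every pixel receives its own weight, is our assumption about how the pixel-level attention is realized.

```python
import torch.nn as nn

class PixelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pa = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # PConv
            nn.LeakyReLU(0.2, inplace=True),                # NA
            nn.Conv2d(channels // reduction, 1, 1),         # PConv: one weight per pixel
            nn.Sigmoid())                                   # delta

    def forward(self, f_ca):
        return f_ca * self.pa(f_ca)                         # F_PA = F_CA (x) PA(F_CA)
```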

Fig. 6

The architecture of the Pixel Attention (PA) module. The PA module only includes several pointwise convolution layers, activation functions, and element-wise product operations

Spatial Attention (SA) For hazy images, the distribution of haze across the spatial domain is often non-uniform, and the presence of high-frequency components or dense haze regions increases the complexity of image reconstruction. To address this challenge, we design a spatial attention mechanism that accounts for the varying importance of different spatial locations, serving as an auxiliary component to enhance the performance of our LOS-DehazeNet. As shown in Fig. 7, we take \(F_{PA}\) as the input to the spatial attention module (SA). We then concatenate the results of average pooling and maximum pooling along the channel dimension. Finally, we apply a pointwise convolution and the Sigmoid activation function to generate the attention map. The output of the spatial attention mechanism (SA) is expressed as follows:

$$\begin{aligned} SA(F_{PA})=\delta (PConv(Concat(Avg(F_{PA});Max(F_{PA})))) \end{aligned}$$
(9)
$$\begin{aligned} F_{SA}=F_{PA}\otimes {SA(F_{PA})} \end{aligned}$$
(10)

where \(F_{PA}\) represents the input to the spatial attention module, Avg and Max denote the average pooling and maximum pooling functions, respectively. Concat represents the concatenate operation, PConv represents the pointwise convolution operation, \(\delta \) denotes the Sigmoid function, \(F_{SA}\) corresponds to the output of spatial attention, and \(\otimes \) indicates the element-wise product.
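A sketch of (9)-(10): pooling along the channel dimension yields two \(H\times W\) maps, which a pointwise convolution fuses into one spatial weight map.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(2, 1, 1), nn.Sigmoid())  # PConv, then delta

    def forward(self, f_pa):
        avg = torch.mean(f_pa, dim=1, keepdim=True)         # Avg over channels
        mx, _ = torch.max(f_pa, dim=1, keepdim=True)        # Max over channels
        return f_pa * self.fuse(torch.cat([avg, mx], 1))    # F_SA = F_PA (x) SA(F_PA)
```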

Fig. 7

The architecture of the Spatial Attention (SA) module. The SA module includes two pooling operations (average and maximum), a pointwise convolution layer, the sigmoid function, a concatenate operation, and element-wise product operations
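Putting the pieces together, the following hedged composition reuses the CA, PA, and SA sketches above inside a dual residual wrapper with a \(3\times 3\) depthwise convolution. The exact placement of the two residual links is our reading of Fig. 4.

```python
import torch.nn as nn

class OSHAttBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.ca = ChannelAttention(channels)      # sketches from Figs. 5-7 above
        self.pa = PixelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        y = self.dwconv(x) + x                    # first residual connection
        return self.sa(self.pa(self.ca(y))) + x   # second residual connection
```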

3.4 Loss function

Generally, dehazing networks that rely on sub-tasks require hybrid loss functions [21], which tends to increase the complexity of hyper-parameter tuning. In contrast, our LOS-DehazeNet relies on direct supervision between hazy images and clean backgrounds, ensuring effective yet simple network training. We therefore default to the simple L1 loss: compared with the L2 loss, the L1 loss better recovers image details, which is crucial for lightweight networks [49]. In summary, the loss function is formulated as follows:

$$\begin{aligned} Loss=\Vert Y_{pre}-Y_{gt} \Vert _1 \end{aligned}$$
(11)

where \(Y_{pre}\) represents the image predicted by our LOS-DehazeNet, and \(Y_{gt}\) stands for the ground truth.

4 Experiment

4.1 Dataset and metrics

For a fair comparison, we select the dehazing benchmark dataset RESIDE [50] for our experiments. RESIDE comprises synthetic hazy images from indoor and outdoor scenarios. The Indoor Training Set (ITS) contains 1,399 clean backgrounds and 13,990 hazy images; the Outdoor Training Set (OTS) includes 2,061 clean backgrounds and 72,135 hazy images. During testing, we use the Synthetic Objective Testing Set (SOTS), which contains 500 image pairs each for indoor and outdoor scenarios. Furthermore, we assess the model’s generalization ability on the real-world dataset HSTS, and we conduct subjective visual evaluations on real-world hazy images. For objective indicators, we use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) to evaluate network performance quantitatively.
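For reference, PSNR for images normalized to [0, 1] can be computed as below; SSIM is typically taken from an existing library such as scikit-image rather than hand-rolled.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB for tensors in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```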

4.2 Experimental settings

In our experimental analysis, we evaluate the performance of LOS-DehazeNet against 12 representative methods: DCP [7] (TPAMI’10), DehazeNet [14] (TIP’16), AODNet [20] (ICCV’17), GFN [16] (CVPR’18), GCA [17] (WACV’19), GDN [27] (ICCV’19), DuRN [26] (CVPR’19), MSBDN [18] (CVPR’20), FAMEDNet [21] (TIP’20), SSDN [28] (NEUCOM’21), LightDehaze [22] (TIP’21), and GRFANet [23] (APIN’22). Among them, AODNet [20], FAMEDNet [21], LightDehaze [22], and GRFANet [23] are lightweight models. During training, we augment the image pairs with random rotations of 90, 180, and 270 degrees and horizontal flips. The patch size is set to \(240\times 240\), the batch size to 12, and the number of iterations to \(5\times 10^5\). We employ the ADAM optimizer with an initial learning rate of \(1\times 10^{-3}\), which is subsequently adjusted by a cosine annealing strategy. All code is implemented in Python 3.6 with the PyTorch toolbox [51], and an NVIDIA RTX 2080Ti GPU is used for computation acceleration.
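The optimization recipe above maps onto standard PyTorch primitives as in the sketch below. The model class is the hypothetical skeleton from Section 3.1, and the dataloader is a stand-in for ITS patch batches; both are placeholders rather than the released training code.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LOSDehazeNetSketch().to(device)                  # sketch class from Section 3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5 * 10**5)
criterion = nn.L1Loss()                                  # the L1 objective from (11)

# Stand-in for an ITS dataloader yielding batches of twelve 240x240 patches.
loader = [(torch.rand(12, 3, 240, 240), torch.rand(12, 3, 240, 240))]

for hazy, clean in loader:
    pred = model(hazy.to(device))
    loss = criterion(pred, clean.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                     # cosine annealing per iteration
```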

Table 2 Average PSNR and SSIM comparison evaluated on the SOTS benchmark dataset

4.3 Results on synthetic datasets

The experimental results are listed in Table 2. Our LOS-DehazeNet achieves either the best or second-best metric values compared with other state-of-the-art models. Notably, for indoor scenes, LOS-DehazeNet outperforms GRFANet [23] by a substantial margin, with a PSNR improvement of over 2.6 dB. This remarkable improvement can be attributed to the omni-scale feature learning mechanism embedded in our network. Even in more complex outdoor scenarios, our model still performs well, demonstrating its stability and reliability. It is worth highlighting that LOS-DehazeNet has a modest parameter count of only 110k, a \(5 \times \) reduction compared with GRFANet. Other lightweight dehazing networks, such as AODNet [20], FAMEDNet [21], and LightDehaze [22], have fewer parameters, but their performance drops significantly. For subjective evaluation, Fig. 8 compares the visual quality of dehazed images; clearly, our LOS-DehazeNet is better at generating realistic image details and maintaining color fidelity. Overall, LOS-DehazeNet is more effective than existing lightweight models and makes a better trade-off between dehazing performance and network parameters.

Fig. 8

Visual quality comparison of different models for SOTS dataset. From (a)-(h): (a) the haze images, and the dehazing results of (b) GCANet [17], (c) DuRN [26], (d) AODNet [20], (e) LightDehaze [22], (f) GRFANet [23], (g) our LOS-DehazeNet respectively, and (h) the ground-truth

Fig. 9

Visual quality comparison of different models for real-world haze images. From (a)-(g): (a) the real-world haze images, and the dehazing results of (b) GCANet [17], (c) DuRN [26], (d) AODNet [20], (e) LightDehaze [22], (f) GRFANet [23], and (g) our LOS-DehazeNet, respectively

4.4 Results on real haze images

In addition to synthetic datasets, we leverage the real-world dataset HSTS and several real-world images to validate the generalization ability of our LOS-DehazeNet. To ensure a fair comparison, we choose three representative lightweight models for quantitative evaluation, namely AODNet [20], LightDehaze [22], and GRFANet [23]. These models are retrained on the ITS dataset and tested directly on the HSTS dataset. The results are listed in Table 3: our LOS-DehazeNet achieves the best quantitative metrics, gaining over 3 dB in PSNR compared with GRFANet [23] and improving SSIM from 0.94 to 0.97. We also use several real-world hazy images for subjective visual evaluation, as shown in Fig. 9. Compared with other models, LOS-DehazeNet better removes haze artifacts and avoids color distortions. These results collectively highlight the strong generalization ability of our LOS-DehazeNet.

Table 3 Generalization evaluation on real-world haze dataset HSTS
Fig. 10

Inference time comparison of different state-of-the-art dehazing networks. Among them, AODNet [20], FAMEDNet [21], and GRFANet [23] are lightweight models

4.5 Inference time analysis

Last, we evaluate the inference time of different models. For a fair comparison, we normalize the resolution of test images to \(620\times 460\), the most common resolution in dehazing datasets. The results are presented in Fig. 10. On average, our LOS-DehazeNet requires only 0.02 s to process a single hazy image, the fastest inference speed among both deep and lightweight networks. This result illustrates the efficiency and promising practical value of our LOS-DehazeNet.

4.6 Ablation study

In this section, we conduct ablation studies to further demonstrate the effectiveness of our LOS-DehazeNet. We first dissect the structures of the OSFA block and the OSHAtt block individually, and then explore network width adjustments to accommodate the requirements of specific scenarios. For a fair and focused comparison, all ablation studies are carried out on the SOTS (Indoor) dataset.

4.6.1 OSFA block configuration analysis

Kernel Configuration In this part, we discuss the influence of different kernel configurations. To assess this, we construct four LOS-DehazeNet variants by introducing ordinary convolutional layers, single-scale depthwise separable convolution, and multi-scale depthwise separable convolution. The experimental results are presented in Table 4. First, the ordinary convolutional modules, despite having the most parameters, do not yield the highest PSNR and SSIM values. This finding underscores that ordinary convolution modules often contain redundant parameters, making them unsuitable for constructing lightweight dehazing networks.

Table 4 Performance comparison of different kernel configurations

Moreover, multi-scale convolution kernels outperform their single-scale counterparts. Going one step further, both LOS-DehazeNet\(_{1\_3\_5\_7}\) and LOS-DehazeNet\(_{5\_7\_9\_11}\) exhibit inferior performance compared with LOS-DehazeNet\(_{3\_5\_7\_9}\). This suggests that convolution kernels that are either too large or too small are not conducive to image restoration: larger kernels struggle to capture intricate image details, while smaller kernels struggle to grasp global information adequately. These results highlight the importance of diverse feature extraction for lightweight networks. At the same time, kernel size is a crucial consideration, as it directly affects the balance between model performance and parameter count.

Feature Fusion Strategy Essentially, the cascading and parallel mapping operations within our OSFA block serve as specific feature fusion strategies. Addition and concatenation operations are widely employed for integrating deep features of different scales, so we first introduce these two operations into our OSFA block as baselines. We then verify the effectiveness of combining the cascading and parallel mapping operations. The experimental results are listed in Table 5. As listed, our feature fusion strategy demonstrates a significant performance advantage: compared with the addition and concatenation operations, it improves PSNR by more than 2 dB. Furthermore, combining the cascading and parallel mapping operations proves more beneficial than using either operation independently. These results illustrate that an effective feature fusion strategy is important for lightweight networks, because it enables stronger information interaction and thereby unlocks the greater potential of lightweight networks.

Table 5 Performance comparison of different feature fusion strategies

Lightweight Blocks Analysis In the process of designing lightweight networks, building efficient blocks has become a mainstream approach. In this subsection, we compare our OSFA block against mature lightweight modules. We choose three modules that share a similar design philosophy with our OSFA block: the shuffle unit, MixConv, and the Xception block. Specifically, the shuffle unit employs channel shuffling to fuse features, MixConv explores multi-scale convolutions, and the Xception block incorporates a multi-branch architecture. Building upon these three modules, we create three variants of LOS-DehazeNet, namely ShuffleNet-like, MixNet-like, and Xception-like, in which the feature extraction modules are replaced with the shuffle unit, MixConv, and the Xception block, respectively, while the rest of the network structure is kept unchanged. The experimental results are listed in Table 6. Compared with these three variants, our LOS-DehazeNet achieves a PSNR improvement of over 4.8 dB and raises SSIM from 0.95 to 0.98. These results demonstrate that our OSFA block has stronger feature extraction ability for the image dehazing task. Moreover, the omni-scale feature learning mechanism embedded in the OSFA block may serve as valuable guidance for designing future lightweight dehazing networks.

Table 6 Performance comparison of different lightweight modules

4.6.2 OSHAtt block configuration analysis

The attention block is another core component of our LOS-DehazeNet. We first decompose the complete scheme of our OSHAtt block into the following parts:

- w/o: without any attention mechanism.

- CA, PA, SA: only a single attention mechanism, i.e., channel attention, pixel attention, or spatial attention.

- CA&PA, CA&SA, PA&SA: a combination of two different levels of attention.

- OSHAtt: our omni-scale hybrid attention block.

Fig. 11

Intuitive analysis of the omni-scale hybrid attention block

As depicted in Fig. 11, introducing any form of attention mechanism proves beneficial for network performance. Nonetheless, different attention mechanisms do not contribute uniformly. For instance, spatial attention appears more effective at enhancing the PSNR of restored images, while pixel attention contributes more to the improvement of SSIM. This result illustrates the limitations of single-level attention mechanisms. Conversely, our OSHAtt block, which integrates multiple levels of attention, yields substantial performance gains: compared with the baseline without any attention mechanism, it enhances PSNR by 0.87 dB and SSIM by 0.0034. These results and analyses confirm the importance of integrating different attention mechanisms to improve dehazing performance, especially in the design of lightweight neural networks.

Furthermore, we compare our OSHAtt block with other mature attention blocks, including SE [38], CBAM [40], SA [41], and SK [39]. The results are presented in Table 7: our OSHAtt block achieves the highest performance, with SK ranking second. Specifically, OSHAtt outperforms SK by more than 0.5 dB while reducing the number of learnable parameters by nearly 20k. These results demonstrate the effectiveness and superiority of our OSHAtt block.

Table 7 Performance comparison of other mature attention mechanisms

4.6.3 Width multiplier analysis

Although the architecture provided in Table 1 already guarantees high performance and low network capacity, some scenarios call for even smaller and faster models. To tailor LOS-DehazeNet to such requirements, we introduce a scaling factor \(\alpha \), referred to as the “width multiplier”, which uniformly scales the number of channels in each block. In this section, we compare our LOS-DehazeNet with FAMEDNet [21] and LightDehaze [22]. For a fair comparison, we set \(\alpha \) to 0.25 and 0.375, ensuring that the overall parameter counts of LOS-DehazeNet and the comparison models remain nearly identical. Table 8 lists the results. Compared with FAMEDNet, our LOS-DehazeNet (\(\alpha =0.25\)) gains nearly 2 dB in PSNR, and SSIM improves from 0.91 to 0.95. Against LightDehaze, our LOS-DehazeNet (\(\alpha =0.375\)) also achieves very competitive performance with almost the same number of parameters. These results demonstrate that LOS-DehazeNet has great potential for efficient deployment on resource-constrained devices.
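A width multiplier is typically applied by scaling every channel count before the model is built; a minimal sketch, reusing the hypothetical skeleton from Section 3.1 with an assumed base width of 32, is:

```python
def scaled_width(base_channels, alpha):
    """Uniformly scale a channel count by the width multiplier alpha."""
    return max(1, int(base_channels * alpha))

small = LOSDehazeNetSketch(channels=scaled_width(32, 0.25))    # the alpha = 0.25 variant
medium = LOSDehazeNetSketch(channels=scaled_width(32, 0.375))  # the alpha = 0.375 variant
```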

Table 8 Performance comparison of different width multipliers
Fig. 12

Visual display of some bad dehazing examples. From (a)-(c): (a) the haze images, (b) bad dehazing examples, and (c) the ground-truth

4.7 Some bad dehazing examples

The previous experimental results and analysis verify the effectiveness and efficiency of our LOS-DehazeNet. In this subsection, we present some dehazing failures, which we believe can inspire future research in this field. As shown in Fig. 12, these failure cases share common characteristics: in light-colored areas such as walls and floors, our model tends to produce slight artifacts in the dehazed images. We attribute this phenomenon to two possible causes. First, the light color of haze itself resembles the pixel distribution of light-colored areas in the image, which makes it harder for the model to detect haze accurately. Second, uneven haze affects light-colored areas more strongly, leading to excessive dehazing. For future research on lightweight dehazing networks, we advocate exploring adaptive learning methods based on neural architecture search, which may provide a more effective solution.

5 Conclusion and future work

This paper proposes a novel end-to-end lightweight network, LOS-DehazeNet, for single image dehazing. We introduce omni-scale feature learning to construct two novel blocks: the omni-scale feature aggregation block and the omni-scale hybrid attention block. Extensive experimental results verify the effectiveness and efficiency of LOS-DehazeNet. In particular, it outperforms the latest lightweight dehazing models in both metrics and inference speed, while achieving a significant 5\(\times \) reduction in network parameters. In the future, we plan to design more efficient dehazing networks by leveraging neural architecture search techniques.