Introduction

In recent years, deep learning has developed rapidly, and computer vision has been widely applied in aquaculture, for example in seedling counting (Xue et al. 2023), aquatic product measurement, and fish behavior detection (Hong et al. 2014). Semantic segmentation is one of the key research directions in deep learning and has been widely used in underwater vision tasks. For example, fish body length can be measured by segmenting the image to obtain the fish outline and then extracting the center line of the fish body (Zhao et al. 2022); Mask R-CNN has been used to segment fish in images and obtain a more accurate boundary estimate (Garcia et al. 2020); and shrimp length has been assessed with an image segmentation-based technique (Harbitz 2007). However, these measurement methods obtain the outline of the aquatic product with general-purpose segmentation networks and then estimate length pixel by pixel. The models were not designed for underwater scenes, and the segmented boundaries of the aquatic products are blurred, which leads to significant numerical errors in the subsequent measurements.

As image segmentation is applied in ever more scenarios, many scholars have proposed semantic segmentation models for underwater scenes. An approach based on image contours and a joint loss function achieves high accuracy on public underwater datasets (Chicchon et al. 2023). A U-Net-based semantic segmentation method for underwater rubbish images provides localization information for underwater robots (Wei et al. 2022). Based on the DeepLabv3 + model, the Unsupervised Colour Correction Method (UCM) has been incorporated to improve image quality and obtain sharper boundary features (Liu and Fang 2020). To address the lack of underwater semantic segmentation datasets, the semantic segmentation of underwater images (SUIM) dataset and the accompanying SUIM-Net model were proposed (Islam et al. 2020). Hambarde et al. (2021) proposed an end-to-end underwater generative adversarial network (UW-GAN) for depth estimation from a single underwater image. Dudhane et al. (2020) proposed an end-to-end deep network for underwater image restoration using a channel-wise color feature extraction module, a dense residual feature extraction module, and a custom loss function. Liu et al. (2022) proposed an underwater enhancement method based on object-guided dual-adversarial contrastive learning to address the problem that enhanced images may not benefit detection. Patil et al. (2019) proposed an unpaired motion saliency learning method for foreground segmentation in underwater videos, estimating motion saliency frame by frame from several initial video frames and their corresponding frames. However, although Chicchon et al. (2023) focus on object contour accuracy, their work targets semantic segmentation of high-resolution images. Wei et al. (2022) focus mainly on improving the segmentation accuracy of the network on the target. Liu and Fang (2020) improve boundary features through unsupervised color correction, but the network has too many parameters. Hambarde et al. (2021), Dudhane et al. (2020), Liu et al. (2022), and Patil et al. (2019) improve detection accuracy through underwater image enhancement or restoration, which does not sharpen the object contours produced by underwater semantic segmentation. None of these methods solves the boundary blurring in segmented images of aquatic products, which leads to poor measurement accuracy. Therefore, the semantic segmentation of aquatic product images requires a network that achieves high accuracy together with clear and precise boundaries.

To address boundary blurring and segmentation accuracy in the semantic segmentation of underwater images of aquatic products, we propose an Underwater Image Semantic Segmentation Network (UISS-Net). First, the backbone is paired with an auxiliary feature extraction network to improve feature extraction capability. Second, up-sampling feature fusion is performed with the multi-scale feature fusion network (MSFFN) proposed in this paper to strengthen the contribution of shallow semantic information. Then, a channel attention mechanism is added to the feature fusion. Finally, a combination of the cross-entropy loss and the dice loss is used as the loss function. The model is validated on the publicly available SUIM and Deep Fish (Laradji et al. 2020) datasets. The experimental results show that the proposed model achieves higher detection accuracy and boundary accuracy on underwater images.

The rest of the paper is organized as follows: the “Materials and methods” section presents the datasets used and the structure of the proposed UISS-Net model; the “Analysis of experimental results” section reports the results on the SUIM and Deep Fish public datasets; the “Conclusion” section concludes the paper.

Materials and methods

Materials

Image datasets

In this paper, model performance is validated on the SUIM dataset for semantic segmentation of underwater images proposed by Islam et al. (2020) and the Deep Fish dataset for underwater scenes proposed by Saleh et al. (2020). The SUIM dataset contains 1525 natural underwater images with ground-truth semantic labels and a test set of 110 images, collected during ocean exploration and human–robot cooperation experiments. The dataset is annotated at the pixel level for eight object classes: background (waterbody) (BW), human divers (HD), aquatic plants and sea-grass (PF), wrecks or ruins (WR), robots (RO), reefs and invertebrates (RI), fish and vertebrates (FV), and sea-floor and rocks (SR). The Deep Fish dataset consists of approximately 40,000 underwater images collected from 20 habitats in tropical Australian marine environments. The dataset initially contained only classification labels; 300 semantic segmentation labels for fish were added later. In this paper, the training and test sets are split at a ratio of 9:1. Sample images from the two datasets are shown in Fig. 1.

Fig. 1
figure 1

Dataset images. The first two rows of data are from SUIM dataset images and the last two rows of data are from Deep Fish dataset images

Environment setup

The experiments were run on an NVIDIA GeForce RTX 4060 Ti graphics card with PyTorch 1.7, training for 100 epochs with the Adam optimizer. Transfer learning was used with weights pre-trained on the ImageNet-1K dataset; the pre-trained weights were not frozen during the first five training epochs. The batch size is set to 6, the initial learning rate to 0.0004, the weight decay to 0.0001, and the momentum to 0.5; the learning rate follows a cosine (“COS”) schedule that decays gradually with the number of iterations.
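For illustration, a minimal PyTorch sketch of this training configuration is given below. The placeholder model, the data-loading loop, and the interpretation of “momentum 0.5” as Adam’s first beta are assumptions; this is not the authors’ exact training script.

```python
# Sketch of the training setup described above (assumed, not the authors' script).
import torch
from torch import nn, optim

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # placeholder standing in for UISS-Net

EPOCHS, INIT_LR, WEIGHT_DECAY = 100, 4e-4, 1e-4

# Adam optimizer; "momentum 0.5" is interpreted here as beta1 = 0.5 (assumption)
optimizer = optim.Adam(model.parameters(), lr=INIT_LR,
                       betas=(0.5, 0.999), weight_decay=WEIGHT_DECAY)

# Cosine ("COS") schedule: the learning rate decays gradually over the 100 epochs
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # ... one pass over the SUIM training set with batch size 6 goes here ...
    scheduler.step()
```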

Experimental evaluation criteria

The semantic segmentation evaluation metrics are all calculated based on the confusion matrix as shown in Table 1.

Table 1 Classification confusion matrix results

In order to comprehensively evaluate the network architecture performance, we use mean intersection over union (mIoU), mean pixel accuracy (mPA), and accuracy as evaluation metrics. The equations for accuracy and mIoU are as follows:

$${\text{Accuracy}}=\frac{\text{TP+TN}}{\text{TP+FP+FN+TN}}$$
(1)
$${\text{mIoU}}=\frac{1}{N}\times \sum \left(\frac{\text{TP}}{\text{TP+FP+FN}}\right)$$
(2)

where N denotes the number of categories, TP denotes true-positive cases (the number of pixels correctly predicted by the model as positive), FP denotes false-positive cases (the number of pixels incorrectly predicted as positive), and FN denotes false-negative cases (the number of pixels incorrectly predicted as negative).

The mPA is obtained by computing, for each class separately, the proportion of pixels of that class that are correctly classified, and then averaging these proportions. Let P denote the per-class pixel accuracy; the mPA is given as follows:

$${\text{mPA}}=\frac{{\text{Sum}}(P)}{N}$$
(3)
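To make Eqs. 1–3 concrete, the following short sketch computes accuracy, mIoU, and mPA from a per-class confusion matrix. The function name and the example matrix are illustrative only; accuracy is computed in its multi-class form (correct pixels over all pixels).

```python
# Illustrative computation of Eqs. 1-3 from a confusion matrix (NumPy).
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(conf)              # true positives per class
    fp = conf.sum(axis=0) - tp      # false positives per class
    fn = conf.sum(axis=1) - tp      # false negatives per class

    accuracy = tp.sum() / conf.sum()              # Eq. 1 (multi-class form)
    miou = np.mean(tp / (tp + fp + fn))           # Eq. 2
    mpa = np.mean(tp / conf.sum(axis=1))          # Eq. 3: per-class PA, then averaged
    return accuracy, miou, mpa

# Example with a 3-class confusion matrix
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 2, 46]])
print(segmentation_metrics(conf))
```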

Methods

In semantic segmentation, the backbone network is the foundation of the model; it is responsible for extracting features (e.g., edges, texture, and shape) from the input image. A strong backbone improves the model’s ability to generalize to different datasets, making it adaptable to different semantic segmentation tasks. Cross-entropy loss measures the gap between the model prediction and the true labels, prompting the model to classify each pixel more accurately. Dice loss emphasizes the overlap between the predicted and real regions rather than simple per-pixel classification correctness, which helps the model capture the full region of interest while maintaining accuracy. Therefore, a good backbone feature extraction network and a well-chosen loss function can greatly improve the boundary contour accuracy of semantic segmentation.

The overall architecture of UISS-Net follows the U-shaped framework of U-Net (Ronneberger et al. 2015) and is shown in Fig. 2.

Fig. 2
figure 2

UISS-Net network architecture

In Fig. 2, the encoder performs feature extraction with the help of an auxiliary network to obtain richer semantic information, and the decoder performs attention-based feature fusion on the resulting feature maps.

Encoder

For tasks such as object detection or semantic segmentation in underwater scenes, low light and turbid water are the main reasons for poor feature extraction, and the gap between high-level and low-level semantic information is significant. Therefore, the encoder of UISS-Net consists of a main backbone network and an auxiliary network.

The main backbone network uses ResNet50 as the base model, with weights pre-trained on the ImageNet-1K dataset so that the network converges faster. The auxiliary network uses the lightweight GhostConv (Han et al. 2020), which generates additional feature maps through cheap operations. GhostConv first performs a 1 × 1 convolution to aggregate information across channels and then uses grouped convolutions to generate new feature maps. To reduce computation, GhostConv splits the traditional convolution into two steps: first, an ordinary convolution generates feature maps with fewer channels; then, a cheap operation applied to these maps generates new feature maps, and the two groups of feature maps are concatenated to obtain the final output. The traditional convolution operation here is a combination of convolution, batch normalization (BN), and nonlinear activation, whereas the linear transformation (cheap operation) is an ordinary convolution without batch normalization or nonlinear activation. The structure of the ghost module is shown in Fig. 3.

Fig. 3
figure 3

Ghost module structure
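The following is a minimal PyTorch sketch of a Ghost-style convolution as described above: a 1 × 1 convolution produces a reduced set of feature maps, a cheap grouped convolution generates the remaining maps without BN or activation, and the two groups are concatenated. The channel ratio and the kernel size of the cheap operation are illustrative assumptions.

```python
# Ghost-style convolution sketch following the description above (assumed hyperparameters).
import torch
from torch import nn

class GhostConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2, cheap_kernel: int = 3):
        super().__init__()
        primary_ch = out_ch // ratio            # channels from the ordinary convolution
        cheap_ch = out_ch - primary_ch          # channels from the cheap operation
        # Primary path: 1x1 convolution with BN and nonlinear activation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        # Cheap operation: grouped convolution without BN or activation, per the text
        self.cheap = nn.Conv2d(primary_ch, cheap_ch, kernel_size=cheap_kernel,
                               padding=cheap_kernel // 2, groups=primary_ch, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # splice the two groups of maps

# Example usage
x = torch.randn(1, 64, 32, 32)
print(GhostConv(64, 128)(x).shape)   # torch.Size([1, 128, 32, 32])
```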

The auxiliary network enhances the robustness of the model and makes it better adapted to the input data of underwater scenes. When the main backbone lacks feature extraction capability, the auxiliary backbone can compensate for the missing information and provides a multi-scale feature extraction effect. It can also learn feature representations that differ from those of the main backbone, resulting in better generalization. The overall structure of UISS-Net is shown in Table 2.

Table 2 UISS-Net overall structure

Decoder

The UISS-Net network structure is a symmetric U-shape, and five bilinear interpolation up-sampling steps are performed in the decoder. High-level feature maps carry richer contextual information, whereas low-level feature maps carry richer spatial detail. Simple feature fusion methods ignore this diversity between high-level and low-level features, leading to lower segmentation accuracy. Therefore, this paper proposes the MSFFN feature fusion method, derived from the FPN (Lin et al. 2017) and PANet feature fusion architectures, as shown in Fig. 4.

Fig. 4
figure 4

Convergence of network architecture features

The feature fusion method used in this paper is shown as (d) in Fig. 4. During up-sampling, the features of the current layer are up-sampled by bilinear interpolation and fused with the encoder output features of the next layer. Fusing features before and after up-sampling expands the receptive field so that the model can better exploit the contextual information of the input feature map.
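A sketch of one such fusion step is shown below: the deeper decoder feature map is up-sampled by bilinear interpolation and combined with the corresponding encoder output before the next decoding stage. The use of concatenation followed by a 3 × 3 convolution, and the channel sizes, are assumptions for illustration rather than the exact MSFFN layer configuration.

```python
# Illustrative single MSFFN-style fusion step (assumed layer configuration).
import torch
from torch import nn
import torch.nn.functional as F

class FusionStep(nn.Module):
    def __init__(self, decoder_ch: int, encoder_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(decoder_ch + encoder_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
        # Bilinear up-sampling of the deeper decoder feature map
        up = F.interpolate(decoder_feat, size=encoder_feat.shape[2:],
                           mode="bilinear", align_corners=False)
        # Fuse with the shallower encoder feature map
        return self.conv(torch.cat([up, encoder_feat], dim=1))

# Example: fuse a 16x16 decoder map with a 32x32 encoder map
dec = torch.randn(1, 256, 16, 16)
enc = torch.randn(1, 128, 32, 32)
print(FusionStep(256, 128, 128)(dec, enc).shape)  # torch.Size([1, 128, 32, 32])
```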

To obtain better semantic information from the features, this paper incorporates the SE channel attention mechanism into the feature fusion. During fusion, a weight is computed for each channel of the feature map, and this weight is then used to re-weight each feature channel, allowing the network to focus on specific channels: channels useful for feature fusion are boosted, and channels that contribute little to the current task are suppressed.
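A standard squeeze-and-excitation (SE) block of this kind is sketched below; the reduction ratio is an assumed hyperparameter.

```python
# Standard SE channel attention block; reduction ratio assumed.
import torch
from torch import nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling per channel
        w = self.fc(w).view(b, c, 1, 1)  # excitation: learn a weight per channel
        return x * w                     # boost useful channels, suppress the rest

# Example usage on a fused feature map
feat = torch.randn(1, 128, 32, 32)
print(SEBlock(128)(feat).shape)   # torch.Size([1, 128, 32, 32])
```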

Loss function

The loss function of UISS-Net combines the cross-entropy loss and the dice loss (Milletari et al. 2016). Equation 4 is the cross-entropy loss (CE_Loss), and Eq. 5 is the dice edge segmentation loss. The cross-entropy function handles multi-class problems well and effectively measures the difference between the model output and the actual labels; it also helps the model converge faster and achieve better performance with good robustness. However, cross-entropy loss requires high model stability and is sensitive to the class balance of the samples. The dice loss performs well when positive and negative samples are severely imbalanced and focuses more on mining the foreground region during training. Therefore, we use a combination of the cross-entropy loss and the dice loss, which improves the convergence speed of the model while mitigating the effect of dataset imbalance.

$${L}_{CE}=-\sum_{i=0}^{N}{y}_{i}{\text{ln}}\left(\sigma \left({x}_{i}\right)\right)$$
(4)
$$\begin{array}{l}\mathrm{Dice}\,\mathrm{coefficient}=\frac{2TP}{2TP+FP+FN}\\L_{\mathrm{dice\,loss}}=1-\mathrm{Dice}\,\mathrm{coefficient}\end{array}$$
(5)

The inputs to the cross-entropy loss function used in this paper are the model training output and the segmentation labels, as shown in Eq. 6.

$${L}_{1}={L}_{\text{CE}}\left(\text{GT,output}\left(x\right)\right)$$
(6)

where L1 is the primary network loss, GT is the ground-truth semantic label, output(x) is the obtained segmentation result, and LCE denotes the cross-entropy loss computation.

The inputs to the segmentation (edge) loss used in this paper are the model training output and the pre-processed ground-truth boundary image, as in Eq. 7.

$${L}_{2}={L}_{\mathrm{dice\, loss}}\left(G{T}_{({\mathrm{seg}})},{\mathrm{output}}\left({x}_{\mathrm{seg}}\right)\right)$$
(7)

where L2 is the edge loss, GT(seg) is the edge label obtained from the ground-truth semantic labels, output(x(seg)) is the edge extracted from the segmentation result, and Ldice loss denotes the dice loss computation.

UISS-Net’s loss function is shown as follows:

$$L={L}_{1}+{L}_{2}$$
(8)
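The following is a minimal sketch of the combined loss in Eqs. 4–8. For simplicity it applies the dice term to the soft class masks directly; the edge extraction used for GT(seg) and output(x(seg)) in the paper is not reproduced here, so this is an assumption rather than the exact implementation.

```python
# Combined cross-entropy + dice loss sketch (edge extraction omitted; assumed simplification).
import torch
from torch import nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    def __init__(self, smooth: float = 1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.smooth = smooth

    def dice_loss(self, probs: torch.Tensor, target_onehot: torch.Tensor) -> torch.Tensor:
        # Dice coefficient = 2*TP / (2*TP + FP + FN); loss = 1 - coefficient (Eq. 5)
        dims = (0, 2, 3)
        intersection = (probs * target_onehot).sum(dims)
        union = probs.sum(dims) + target_onehot.sum(dims)
        dice = (2 * intersection + self.smooth) / (union + self.smooth)
        return 1 - dice.mean()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        num_classes = logits.shape[1]
        probs = F.softmax(logits, dim=1)
        target_onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
        l1 = self.ce(logits, target)               # Eq. 6
        l2 = self.dice_loss(probs, target_onehot)  # Eq. 7 (without edge extraction)
        return l1 + l2                             # Eq. 8

# Example: 8-class prediction on a 4x4 image
logits = torch.randn(1, 8, 4, 4)
labels = torch.randint(0, 8, (1, 4, 4))
print(CombinedLoss()(logits, labels))
```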

Analysis of experimental results

To validate the performance of different main feature extraction networks in UISS-Net, ablation experiments were conducted on the validation set. Because the output dimensions of the VGG backbone differ from those of the auxiliary network, the VGG outputs are adapted to match before fusion. The experimental results are shown in Table 3.

Table 3 Ablation experiments with different main feature extraction networks

Although ResNet50 is more complex and requires more computational resources than models such as ResNet18, training efficiency has improved greatly with increased computational power and better optimization algorithms. Compared with the VGG network, ResNet50 captures more complex and subtle image features through its deep residual structure. Therefore, ResNet50 is used as the main feature extraction network of UISS-Net.

To verify the effectiveness of the improved loss function, ablation experiments are carried out with different loss function combinations; the results are shown in Table 4.

Table 4 Ablation experiments with different loss functions

The ablation experiments in Table 4 show that combining the cross-entropy loss with the dice loss reduces the model’s dependence on a single loss function and improves its robustness and generalization ability.

To verify the method proposed in this paper, U-Net with its backbone replaced by ResNet50 is used as the baseline model. On the SUIM dataset, the auxiliary feature extraction module (AFEM), the multi-scale feature fusion network (MSFFN) proposed in this paper, the SE attention mechanism (SEA), and the boundary loss function (BLF) are added to the baseline model step by step. The results of the ablation experiments are shown in Table 5.

Table 5 UISS-Net ablation experiments on the SUIM dataset

Table 5 shows that the auxiliary feature extraction network significantly improves the model’s performance, and each of the three subsequent modules brings a further improvement. In the end, mIoU, mPA, and accuracy reach 72.09%, 80.37%, and 86.93%, which are 9.68, 7.63, and 0.98 percentage points higher than the baseline model. Figure 5 illustrates the training loss of the UISS-Net model on the SUIM dataset.

Fig. 5
figure 5

Comparison of training loss between the U-Net (ResNet50) and UISS-Net models

Figure 5 shows that the proposed model converges faster than U-Net. Figure 6 shows the mIoU and mPA for each class of the SUIM dataset.

Fig. 6
figure 6

mIoU and mPA of the UISS-Net model on the SUIM validation set

To further analyze the performance of the proposed method, we compare it with U-Net, FCN, SegNet, Deeplab, and other semantic segmentation models on the SUIM and Deep Fish datasets.

Table 6 gives the mIoU on the SUIM dataset and the intersection over union (IoU) for each of the eight categories: background (waterbody) (BW), human divers (HD), aquatic plants and sea-grass (PF), wrecks or ruins (WR), robots (RO), reefs and invertebrates (RI), fish and vertebrates (FV), and sea-floor and rocks (SR).

Table 6 Comparison between the SUIM dataset and current mainstream models

Table 6 shows that UISS-Net achieves an mIoU of 72.09% on the SUIM test set, a significant advantage over classical networks such as U-Net, SegNet, and PSPNet. Compared with the U-Net and Deeplab variants with replaced backbones, the proposed network has the highest accuracy, and its mIoU is well ahead of the current mainstream networks.

Table 7 compares UISS-Net with current mainstream semantic segmentation models on the Deep Fish dataset. The proposed UISS-Net model achieves the best segmentation results with an mIoU of 95.05%, an improvement of 12.3% over the traditional FCN semantic segmentation model.

Table 7 Comparison with current mainstream models in the Deep Fish dataset

We run the U-Net and UISS-Net models on the validation set of the SUIM dataset; the predictions are shown in Fig. 7. As can be seen, UISS-Net segments the underwater scenes with high detection and recognition accuracy, and the boundary precision and accuracy of the segmented underwater objects are higher than those of the traditional algorithms.

Fig. 7
figure 7

Prediction results of different models on the SUIM validation set

Conclusion

This paper proposes an Underwater Image Semantic Segmentation Network (UISS-Net) for underwater scenes. First, an auxiliary feature extraction network is introduced to address the difficulty of feature extraction caused by turbidity and insufficient light in underwater scenes. Second, an inverted multi-scale feature fusion approach is proposed to bridge the difference in semantic information between the higher and lower layers of the network. Then, a channel attention mechanism is added to the feature fusion to prevent the loss of important feature information. Finally, a combined loss function is used to improve accuracy under sample imbalance. Compared with the methods in the literature, the proposed network provides UISS-Net with an efficient backbone feature extraction module, improving its ability to extract semantic information. The proposed MSFFN feature fusion approach combines features from different levels, enhancing the model’s ability to understand the data and make decisions. Combining the loss functions improves generalization and reduces the risk of overfitting.

The proposed UISS-Net model achieves 72.09% mIoU and 80.37% mPA on the SUIM dataset and 95.05% mIoU on the Deep Fish dataset. Because additional modules are embedded in the backbone and feature fusion stages of UISS-Net, the model has a large number of parameters and is less efficient for mobile deployment devices (e.g., underwater robots). Nevertheless, the proposed network addresses the poor performance and rough segmentation boundaries of existing semantic segmentation models in underwater scenes, which is of great significance for the study of semantic segmentation of underwater images.