1 Introduction

3D object detection is a fundamental task in autonomous driving perception. To achieve accurate environmental perception, current sensor suites primarily consist of LiDAR and cameras, each with distinct data characteristics. LiDAR data is sparse but offers precise 3D spatial information about the surroundings. In contrast, pixel-level image data provides rich semantic and texture information but lacks precise 3D spatial information. Effectively integrating these two complementary modalities is essential for accurate and reliable perception.

Current LiDAR and camera fusion methods fall into two categories: query-based and LSS-based methods. Query-based methods include FUTR3D [1], DeepFusion [2] and TransFusion [3]. FUTR3D [1] and TransFusion [3] used LiDAR data to initialize bounding box predictions and then applied attention mechanisms to fuse object queries with relevant image features, achieving multi-modal fusion. DeepFusion [2] projected LiDAR features and image features into multiple query features and applied cross-modal attention between the two modalities.

Compared with query-based methods, LSS-based methods mainly rely on convolutional operations. LSS-based methods include BEVFusion [4, 5] and UVTR [6]. The image branches of LSS-based models all follow the Lift-Splat-Shoot (LSS) [7] approach. The LSS approach employs a dedicated network to predict the depth distribution of each pixel in the image, establishes a mapping between 2D and 3D space, and ultimately obtains the BEV feature through an outer product and pooling operations. BEVFusion [4, 5] employed a dual-branch structure to combine the point cloud BEV feature with the image BEV feature. It treated foreground and background, geometric and semantic information equally during fusion, which is simple and easy to understand. However, it inadequately integrated the depth and semantic features of the point cloud and the image, which limits the performance of the image branch. To enhance the image branch's performance, methods such as UniDistill [8] and UVTR [6] employed distillation techniques to fuse point cloud and image features. Similarly, BEVFusion4D [9] introduced an attention-based camera view transform module, which improves camera BEV features under the supervision of point cloud depth. The core concept of UniDistill [8], BEVFusion4D [9], and UVTR [6] is to use the point cloud as a supervisory signal to achieve spatial fusion in the image branch.

In summary, the methods mentioned above fuse multi-modal BEV features either through a dual-branch structure or by utilizing attention mechanisms. However, their spatial fusion remains insufficient and indirect. In addition, most LSS-based models rely solely on single-frame image features to predict the depth distribution and conduct multi-modal spatial fusion only on BEV features. To effectively enhance the depth and semantic information of image features in LSS-based models, we propose a view transform module called DFLSS, inspired by BEVDepth [10]. DFLSS fuses LiDAR depth into image features to improve BEV features. It achieves multi-modal fusion by incorporating the depth features of the point cloud when predicting the depth distribution, thereby alleviating the sparsity of the BEV feature and enhancing its positional information. The BEV feature obtained by LSS is depicted in picture (a) of Fig. 1, while the BEV feature obtained by DFLSS is shown in picture (c) of Fig. 1. It is evident that DFLSS yields a richer and denser BEV feature than LSS.

In recent research on BEV perception, temporal fusion strategies have attracted increasing attention for their potential to improve model accuracy, continuity, and robustness, particularly in scenarios with occlusions. However, few methods [9, 11, 12] achieve temporal fusion on multi-modal data. LIFT [11] marked an initial attempt to exploit spatiotemporal information in a fusion framework: it aggregated multi-frame multi-modal data through a classical transformer structure, but suffers from heavy computation due to its global self-attention mechanism. BEVFusion4D [9] also used an attention mechanism for the temporal fusion of multi-frame BEV features, yet without significantly increasing the computational load of the model. Another notable approach is 4D-Net [12], which introduced spatiotemporal convolutional layers to explicitly model temporal dependencies within a 4D tensor, striking a balance between capturing temporal dynamics and computational efficiency. Despite these advances, there remains considerable room for further research in temporal fusion strategies designed for multi-modal BEV perception tasks.

In this paper, we introduce a straightforward and adjustable multi-modal temporal fusion module called 3DMTF to improve the spatial information of BEV features and incorporate temporal details. 3DMTF not only expands the receptive field of BEV features but also enhances their richness. The general process of 3DMTF is as follows: First, we obtain multi-frame spatially-fused BEV features using DFLSS. Second, these multi-frame BEV features are aligned in both space and time by leveraging the self-motion information of the vehicle. Finally, the aligned multi-frame BEV features are fed into 3DMTF to generate a multi-modal temporally-fused BEV feature. 3DMTF has several advantages. First, it ensures alignment in space and time, preserving important spatial relationships and capturing temporal connections. Second, it is conceptually simpler to implement and adapt than complex transformer architectures, making it easier to integrate into existing systems. Third, 3DMTF does not require an additional preprocessing module, which keeps the module simple. The BEV feature obtained by 3DMTF is shown in picture (d) of Fig. 1. It can be observed that picture (d) has more pronounced and denser features than picture (c).

The main contributions of this paper can be summarized in three aspects:

  1. (1)

    We proposed DMFusion, a framework designed for 3D object detection in bird's-eye view. DMFusion leverages multi-frame multi-modal data and integrates it efficiently in both the spatial and temporal dimensions.

  2. (2)

    We proposed DFLSS, a view transform module that incorporates point cloud depth and fuses the semantic information of the image with the spatial positional information of the point cloud. DFLSS enriches the BEV features and enhances their positional information, thereby improving the accuracy of the detection results. Moreover, DFLSS is applicable to all LSS-based BEV detection frameworks and exhibits good versatility.

  3. (3)

    We designed a temporal fusion module called 3DMTF to fuse multi-frame multi-modal information. 3DMTF enhances the BEV feature representation after spatial fusion, fills the gaps in the original BEV and incorporates temporal information. Our flexible temporal fusion module can be integrated into other single-frame-based BEV detection frameworks.

Fig. 1
figure 1

Visualization of BEV features. Picture (a) depicts the BEV feature derived from image data using the LSS method. Picture (b) depicts the BEV feature derived from point cloud data. Picture (c) illustrates the BEV feature obtained through multi-modal data and the DFLSS. Picture (d) showcases the BEV feature obtained by utilizing multi-frame features and the temporal fusion module 3DMTF

2 Related works

2.1 3D object detection with single modality

3D object detection with single modality can be categorized into LiDAR-based and camera-based approaches. LiDAR-based methods are primarily categorized into two types: Point-based methods and Voxel-based methods.

2.1.1 LiDAR-based methods

Point-based methods: These methods usually take the raw point cloud as input and then use multiple network layers to extract features. PointRCNN [13] utilized PointNets [14] as the point cloud encoder, generated proposals based on the extracted semantic and geometric features, and refined these coarse proposals through a 3D RoI pooling operation. Point-GNN [15] designed a graph neural network to detect 3D objects, encoding the point cloud as a graph whose edges connect points within a fixed radius. These point-based approaches are usually computationally expensive due to the disordered nature of point clouds.

Voxel-based methods: These methods first convert the point cloud into voxels and utilize voxel encoding layers to extract voxel features. SECOND [16] proposed a new sparse convolutional layer to replace the original computationally intensive 3D convolutional layer. PointPillars [17] converted the point cloud into pseudo-images and applied a 2D CNN to generate the final detection results. The recent CT3D [18] designed a channel-wise Transformer architecture to establish a 3D object detection framework with minimal manual design.

In addition, some works [19,20,21] projected the point cloud onto the range view (RV) and employed algorithms similar to 2D object detection to detect objects in the range view. This approach differs from the mainstream methods that utilize the bird's-eye view (BEV): RV-based methods can be implemented with ordinary 2D convolutions, whereas BEV-based methods often require complex voxelization operations and sparse convolution.

2.1.2 Camera-based methods

There are three groups of camera-based methods for 3D object detection. The first performs 3D detection directly from the perspective view (PV); FCOS3D [22] used an additional 3D regression branch to extend the capabilities of the camera-based detector. The second group, inspired by DETR [23], includes DETR3D [24], PETR [25], and PETRv2 [26]. These methods devised a query-based mechanism that implicitly learns 3D position information, establishing a connection between the image and 3D space, and designed a corresponding query-based detection head capable of matching queries with detected objects. The third group of methods, inspired by LiDAR-based detectors, includes LSS [7] and CaDDN [27]; they explicitly estimated the depth distribution of 2D images and employed a view transform module to convert camera features from the perspective view to the bird's-eye view. These methods extended OFT [28] to 3D object detection. Additionally, CaDDN [27] added explicit depth estimation supervision to the view transform module. Recent research [29, 30] also employed multi-head attention in the view transform module.

2.2 3D Object detection with multi-modalities

The features generated by LiDAR and cameras are complementary. Therefore, integrating information from both has garnered significant attention in the field of 3D object detection. An earlier method, PointPainting [31], associated image semantics with input points, facilitating interaction between modalities. This point-level fusion approach was adopted by other works, such as PointAugmenting [32] and EPNet [33]. Recent methods like FUTR3D [1] and TransFusion [3] explored DETR-like structures, utilizing attention mechanisms to enforce adaptive feature relationships. However, such query-based fusion methods focus solely on objects and are not easily applicable to tasks like BEV map segmentation. DeepInteraction [34] introduced a bilateral representation to achieve bidirectional information fusion; this approach learns and maintains representations of the individual modalities, thereby fully utilizing their unique features in the object detection process. Additionally, works like BEVFusion [4, 5] and UVTR [6] converted images into 3D space using view transform modules and employed effective frameworks for modality fusion. Our proposed fusion method builds upon these works, enhancing their straightforward and efficient fusion techniques.

2.3 Temporal fusion in 3D object detection

In recent years, temporal fusion strategies have received growing attention in the field of bird's-eye-view perception. BEVFormer [29] utilized transformer mechanisms to generate BEV features and incorporated deformable attention for temporal fusion. PolarDETR [35] employed a similar temporal fusion approach but placed greater emphasis on fusing the query features that represent the target objects. Additionally, the method in [36] optimized BEVFormer [29] through a dynamic deformable attention mechanism; its BEV queries use attention with learnable parameters to extract spatial features from the views of multiple cameras. Furthermore, UniFusion [37] unified the fusion of multi-frame temporal information and multi-view spatial information, adaptively learning the weight of each view, but it cannot exploit features from high-resolution images. In contrast to these transformer-based strategies, BEVDet4D [38] and the method in [39] employed CNN-based temporal fusion. BEVDet4D [38] generated multi-frame BEV features using LSS [7], aligned them, and then performed cross-dimensional temporal fusion to obtain the final BEV feature. Besides, BEVStereo [40] and SGM3D [41] adopted stereo-based image feature fusion: leveraging consecutive frames with pose information, they construct a binocular view for improved depth estimation, thus enhancing detection performance.

3 Method

This section describes DMFusion, our novel end-to-end framework for 3D object detection. We first formulate the problem and describe a baseline method in Sect. 3.1. We then provide an overview of DMFusion in Sect. 3.2 and introduce its components in detail, including feature extraction, the depth-integrating view transform module (DFLSS), the multi-frame temporal fusion module (3DMTF), the LiDAR-temporal fusion module, and the loss function, in Sects. 3.3-3.7, respectively.

3.1 Problem formulation and baseline

Our goal is to develop a framework that takes multi-modal data as input and predicts the bounding boxes of objects. The input data include multi-view RGB camera data \(I_{\text {camera}} \in {\mathbb {R}}^{N_c\times {H}\times {W}\times 3}\) and point cloud data \(I_{\text {LiDAR}} \in {\mathbb {R}}^{P\times {5}}\), where \(N_c\), \(H\), and \(W\) denote the number of cameras, image height, and image width, and \(P\) denotes the number of points. Each point consists of its 3D coordinates, reflectivity, and ring index. The output of the detection framework mainly includes the category of each object, the position of its bounding box, and its orientation and velocity.

We first establish a baseline method based on BEVFusion [5]. As shown in Fig. 2, the previous multi-modal fusion detection framework BEVFusion [5] extracts camera and LiDAR features separately through a dual-branch fusion architecture, adopts BEV features as a unified representation, and uses a dynamic fusion module to combine them, which simplifies the fusion procedure and enhances robustness. However, this insufficient spatial fusion limits the performance of the image branch, leading to issues such as low accuracy and false positives in practical detection scenarios. The key to solving this problem lies in leveraging the precise distance (depth) information inherent in the point cloud and the rich semantic information in images. Effectively integrating the semantic content of images with the depth information of the point cloud is imperative for achieving precise spatial fusion.

Fig. 2
figure 2

The fusion framework of baseline model. The C stands for concatenate operation

Fig. 3
figure 3

The overall pipeline of our spatiotemporal fusion framework. It consists of feature extraction, spatial fusion, and temporal fusion stages. For spatial fusion, we proposed a depth fusion view transformer (DFLSS) that fuses LiDAR depth information into the BEV features of the camera branch. For temporal fusion, we proposed a temporal fusion module (3DMTF) to enhance the position and temporal information. The multi-frame spatial-fused BEV features are aggregated through the 3DMTF to achieve spatiotemporal feature interaction. The C stands for concatenate operation

3.2 DMFusion overview

Our approach, DMFusion, is illustrated in Fig. 3. The process of DMFusion unfolds as follows: First, features are extracted from the LiDAR and cameras using two dedicated feature extraction networks. Next, spatial and temporal information is fused using the depth fusion view transform module and the multi-frame temporal fusion module, resulting in a temporally-fused BEV feature. Subsequently, the BEV feature of the LiDAR branch is concatenated with the temporally-fused BEV feature, and a convolution-based fusion module is used to obtain the LiDAR-temporal fused BEV feature. Finally, popular detection head modules from earlier works, including the anchor-based head [42] and the anchor-free head [3], are applied to the LiDAR-temporal fused BEV feature.

3.3 Features extraction

In the feature extraction phase, we adopt a dual-branch approach inspired by the previous method BEVFusion [4] to process multi-modal data. For image feature extraction, multi-view RGB camera data \(I_{\text {camera}}\) with dimensions \({N_c\times {H}\times {W}\times {3}}\) is used as input. We start with a basic two-dimensional backbone network and integrate a Feature Pyramid Network (FPN) to obtain multi-scale features. This process yields the multi-view image features, denoted as \(F_{\text {img}}\) with dimensions \({N_c \times C \times {h} \times {w}}\). Here, \(N_c\) represents the number of cameras, C the feature dimension, and \({h}\) and \({w}\) the height and width of the image features. For point cloud feature extraction, LiDAR data \(I_{\text {LiDAR}}\) with dimensions \({P\times {5}}\) is used as input. We follow the VoxelNet [43] processing method to partition the three-dimensional point cloud into a fixed number of voxels. After random point sampling and normalization, voxel-wise features are obtained by applying multiple Voxel Feature Encoding (VFE) layers for local feature extraction. Subsequently, a network composed of sparse three-dimensional convolutions generates the features in the BEV space. The size of the BEV features is uniformly set to \(180 \times 180\).
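To make the tensor shapes concrete, the following sketch lists the inputs and outputs of the two branches described above; the channel width C and the 1/8 feature stride are illustrative assumptions, and the backbone, FPN, and voxel encoder themselves are omitted.

```python
import torch

# Dual-branch feature extraction: shape walk-through only.
N_c, H, W = 6, 448, 800                   # cameras, image height/width (Sect. 4.2)
C, h, w = 256, H // 8, W // 8             # assumed feature dim and 1/8 stride

I_camera = torch.rand(N_c, 3, H, W)       # multi-view RGB input
F_img = torch.rand(N_c, C, h, w)          # after the 2D backbone + FPN

P = 30000                                 # example number of LiDAR points
I_LiDAR = torch.rand(P, 5)                # (x, y, z, reflectivity, ring index)
F_lidar_bev = torch.rand(1, C, 180, 180)  # after VFE layers + sparse 3D convolutions
```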

Fig. 4
figure 4

The 2D-3D view transform module in baseline model

Fig. 5
figure 5

Visualization of projection results. Picture (a) shows the image data from a certain camera, while picture (b) depicts the corresponding point cloud depth projection result. Pixels of different colors represent different depth values

3.4 View transform module integrating depth

Recent approaches [4,5,6] utilized a 2D-3D view transform module to convert image features into BEV features, then placed the LiDAR and camera BEV features in the same spatial dimension and fused them via concatenation and convolution. We improve the original view transform module to achieve spatial fusion. The original view transform module is shown in Fig. 4. Specifically, it takes multi-view image features as input and predicts the depth distribution and context feature of each pixel of the image features based on the camera's extrinsic parameters. Subsequently, it computes the outer product of the depth distribution and the context features. This outer product is then mapped into a predefined point cloud for feature rendering, which results in pseudo-voxels denoted as \(V \in {\mathbb {R}^{X \times Y \times Z \times C}}\). Finally, the BEV features are obtained by pooling these pseudo-voxels.
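The lift step of this original view transform module can be summarized by the following PyTorch sketch; the tensor sizes are illustrative, and the geometric scatter of frustum features into voxels is replaced by a placeholder reduction.

```python
import torch

# Sketch of the LSS-style lift step (shapes only).
Nc, C, d, h, w = 6, 80, 118, 32, 88                   # illustrative sizes; d = depth bins

depth_dist = torch.rand(Nc, d, h, w).softmax(dim=1)   # per-pixel depth distribution
context    = torch.rand(Nc, C, h, w)                  # per-pixel context features

# Outer product: each pixel's context feature is spread along its depth bins,
# producing a frustum of features of shape (Nc, C, d, h, w).
frustum = depth_dist.unsqueeze(1) * context.unsqueeze(2)

# In the real module, every frustum point is mapped to a voxel of V using the
# camera intrinsics/extrinsics and then pooled to the BEV plane; the mean below
# is only a stand-in for that voxel pooling step.
bev_placeholder = frustum.mean(dim=2)                 # (Nc, C, h, w)
```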

In BEVFusion [4, 5], the only supervision to the depth prediction network of view transform module comes from the detection loss. However, considering the complexity of monocular depth estimation, relying solely on detection loss is inadequate for effectively guiding the depth prediction network. To better leverage the depth features of point clouds and achieve enhanced spatial fusion, we not only utilize point cloud depth information as supervisory features but also integrate point cloud depth features as inputs to our view transform module.

Depth Integration Before inputting the point cloud data into the view transform module of the image branch, coordinate transformation and downsampling operations are required. Specifically, we first project the point cloud data \(P \in {\mathbb {R}^{N \times D}}\) into the image coordinate system: using the cameras' intrinsic and extrinsic parameters, \(P\) is projected into the six cameras, resulting in the point cloud depth information \({I_{depth}} \in {\mathbb {R}^{N_C \times D \times H \times W}}\). Here, \({N_C}\), D, H, and W represent the number of cameras (equal to 6), the depth obtained after projecting the point cloud into the camera view, and the image height and width, respectively. The projection process is shown in Eq. (1), and the visualization of the projection result is shown in Fig. 5.

$$\begin{aligned} {[u,v,d,{\textbf {1}}]^T} = I * Q_{g2c} * Q_{e2g} * Q_{l2e} * {[X_{LiDAR}, Y_{LiDAR}, Z_{LiDAR}, {\textbf {1}}]^T} \end{aligned}$$
(1)

where \(u\) and \(v\) represent the pixel coordinates within the projected depth image, and \(d\) is the depth value associated with the pixel at position (\(u\), \(v\)). \(I\) denotes the camera intrinsic matrix. \(Q_{g2c}\), \(Q_{e2g}\), and \(Q_{l2e}\) represent the rigid transformations (composed of quaternion rotations and translations) from the global coordinate system to the camera coordinate system, from the ego coordinate system to the global coordinate system, and from the LiDAR coordinate system to the ego coordinate system, respectively. \({X_{LiDAR}}\), \({Y_{LiDAR}}\), and \({Z_{LiDAR}}\) represent the coordinates of a point in the point cloud, and \(*\) denotes matrix multiplication.

Next, a downsampling network is applied to the point cloud depth information \({I_{depth}}\) to match the size of the image features, resulting in downsampled depth features \({F_{depth}} \in {\mathbb {R}^{Nc \times C_d \times h \times w}}\), where \(C_d\) denotes the number of channels. Finally, the downsampled depth features \({F_{depth}}\) and the image features \({F_{img}}\) serve as the two inputs to DFLSS.
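A minimal sketch of this projection step (Eq. (1)) for a single camera is given below; `lidar2cam` denotes the composed transform \(Q_{g2c} Q_{e2g} Q_{l2e}\), and all function and variable names are illustrative rather than taken from our implementation.

```python
import torch

def project_points_to_depth_map(pts_lidar, lidar2cam, intrinsics, H, W):
    """Sketch of Eq. (1): project LiDAR points into one camera and rasterize a
    sparse depth map. lidar2cam is the composed 4x4 transform (Q_g2c Q_e2g Q_l2e);
    intrinsics is the 3x3 camera matrix I."""
    N = pts_lidar.shape[0]
    pts_h = torch.cat([pts_lidar[:, :3], torch.ones(N, 1)], dim=1)   # homogeneous coords
    pts_cam = (lidar2cam @ pts_h.T).T                                 # to camera frame
    d = pts_cam[:, 2]
    keep = d > 1e-3                                                   # points in front of camera
    pts_cam, d = pts_cam[keep], d[keep]

    uv = (intrinsics @ pts_cam[:, :3].T).T
    u = (uv[:, 0] / d).long()
    v = (uv[:, 1] / d).long()
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    depth_map = torch.zeros(H, W)
    depth_map[v[inside], u[inside]] = d[inside]    # last write wins on pixel collisions
    return depth_map
```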

Depth Supervision The depth supervision method of DFLSS follows the approach of BEVDepth [10], where the point cloud depth is converted into one-hot vectors to supervise the predicted depth distribution on the image features. Note that since the dataset we utilize includes key frames and non-key frames, our DFLSS module only performs depth supervision on key frames; non-key frames and key frames share the same DFLSS module. Fig. 6 depicts our designed depth fusion view transform module DFLSS.
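A sketch of the BEVDepth-style conversion of ground-truth depth into one-hot supervision targets is shown below, using the depth-bin configuration stated in Fig. 6 (1 to 60 m with a 0.5 m bin size); the function name and masking convention are illustrative.

```python
import torch
import torch.nn.functional as F

def depth_to_onehot(gt_depth, d_min=1.0, d_max=60.0, bin_size=0.5):
    """Turn a sparse ground-truth depth map (H, W) into one-hot targets (d, H, W)
    plus a foreground mask of pixels that actually received a LiDAR hit."""
    num_bins = int((d_max - d_min) / bin_size)                # 118 bins
    valid = (gt_depth >= d_min) & (gt_depth < d_max)          # pixels with a usable depth
    bin_idx = ((gt_depth - d_min) / bin_size).long().clamp(0, num_bins - 1)
    onehot = F.one_hot(bin_idx, num_bins).permute(2, 0, 1).float()
    return onehot * valid.unsqueeze(0), valid                 # zero out unsupervised pixels
```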

Fig. 6
figure 6

Depth fusion view transform module. It comprises two depth prediction networks and a depth fusion network. Voxel Pooling unifies all point features into a single coordinate system and projects them onto the BEV feature map. The dimensions of the depth prediction features, depth offset features, and final depth features are all \({Nc \times d \times h \times w}\). \({N_C}\) represents the number of cameras, d represents the number of bins in the depth distribution. In DMFusion, the configuration of depth distribution is set to: 1 to 60 meters with a bin size of 0.5 meters

Fig. 7
figure 7

The framework of our temporal fusion module 3DMTF. 3DMTF mainly includes a 3D convolutional temporal fusion module that extracts temporal features, as well as an encoder backbone module and an encoder FPN module. 3DMTF utilizes the BEV features from the previous n frames as input, where A represents the alignment operation in Eq. 2

DFLSS first applies a depth prediction network to the image features \({F_{img}}\) to obtain context features \({F_{context}} \in {\mathbb {R}^{Nc \times C_d \times h \times w}}\) and depth prediction features \({F_{pre}} \in {\mathbb {R}^{Nc \times d \times h \times w}}\). During training, the depth prediction features \({F_{pre}}\) are supervised by the one-hot vectors derived from the point cloud depth. Then, the image features \({F_{img}}\) are concatenated with the downsampled point cloud depth features \({F_{depth}}\) and fed into a depth offset network to predict a depth offset feature \({F_{offset}}\). Subsequently, the depth offset feature \({F_{offset}}\) is fused with the depth prediction feature \({F_{pre}}\) to obtain the final depth feature \({F_{final}}\). We then compute the outer product of the final depth feature \({F_{final}}\) and the context feature \({F_{context}}\). Finally, we use voxel pooling to obtain the spatially-fused BEV feature with dimensions \({X \times Y \times C}\), where \({X}\), \({Y}\), and \({C}\) represent the size of the BEV grid and the feature dimension.
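The data flow of DFLSS can be summarized by the following sketch. The two sub-networks are reduced to single 1×1 convolutions, the channel widths are placeholders, and both the additive fusion of \(F_{pre}\) and \(F_{offset}\) and the omitted voxel pooling are illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

class DFLSSSketch(nn.Module):
    """Illustrative sketch of the DFLSS data flow described above."""
    def __init__(self, c_img=256, c_ctx=80, c_depth=64, d_bins=118):
        super().__init__()
        self.depth_pred_net = nn.Conv2d(c_img, d_bins + c_ctx, 1)      # -> F_pre, F_context
        self.depth_offset_net = nn.Conv2d(c_img + c_depth, d_bins, 1)  # -> F_offset
        self.d_bins = d_bins

    def forward(self, f_img, f_depth):
        # f_img: (Nc, c_img, h, w); f_depth: (Nc, c_depth, h, w) downsampled LiDAR depth
        pred = self.depth_pred_net(f_img)
        f_pre, f_context = pred[:, :self.d_bins], pred[:, self.d_bins:]
        f_offset = self.depth_offset_net(torch.cat([f_img, f_depth], dim=1))
        f_final = (f_pre + f_offset).softmax(dim=1)          # fused depth distribution (assumed additive)
        # Outer product; voxel pooling to (X, Y, C) is omitted for brevity.
        frustum = f_final.unsqueeze(1) * f_context.unsqueeze(2)
        return frustum                                        # (Nc, c_ctx, d_bins, h, w)
```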

3.5 Multi-frame temporal fusion module

In autonomous driving scenarios, historical frames often contain object features and information similar to the current frame. Aggregating multi-frame information enhances and enriches the BEV feature, which improves the algorithm's ability to estimate the target's speed. Moreover, multi-frame temporal fusion can alleviate the occlusion problem and enhance the robustness of the algorithm. Currently, few studies have exploited temporal cues in multi-modality-based 3D object detection, and existing paradigms [4, 5] exhibit relatively poor performance in predicting time-related quantities such as velocity. To enhance and enrich the BEV feature representation of individual frames, thereby improving the model's performance on temporal tasks and its adaptability, we designed a temporal fusion module based on 3D convolution, termed 3DMTF. The structure of 3DMTF is depicted in Fig. 7. It includes a 3D convolutional temporal fusion module and a BEV encoder module.

Fig. 8
figure 8

The framework of our 3D convolutional fusion module. It consists of 3D convolutional layers, and the C stands for concatenate operation

Our temporal fusion algorithm is performed after acquiring the BEV features through DFLSS. Due to the continuous movement of the vehicle, the coordinates of different timestamps are misaligned. MGTANet [44] utilized deformable convolution to align the features of multiple timestamps; this method is adaptive, but it treats dynamic and static objects equally and adds extra parameters to the model. Therefore, we use the calibration and ego-pose parameters for self-motion compensation. First, we obtain the multi-frame BEV features and the self-motion information of each frame through DFLSS. Second, we use the self-motion information to align the historical frames to the current frame; the alignment process is shown in Eq. (2). This alignment maps static objects to their current-frame positions, while dynamic objects remain unaligned for subsequent feature learning by the temporal fusion module. Finally, the aligned BEV features are processed by the 3D convolutional fusion module for further fusion.

$$\begin{aligned} {F^\prime }(T - 1,{P^{e(T)}}) = T_{e(T-1)}^{e(T)}F(T - 1,P^{e{(T-1)}}) \end{aligned}$$
(2)

where \(F(T - 1,{P^{e(T-1)}})\) denotes the BEV feature at time T-1 with the ego position \({P^{e(T-1)}}\), and \(T_{e(T-1)}^{e(T)}\) denotes the self-motion transformation, which encompasses rotation and translation information and converts the ego position of the vehicle at time T-1 to the ego position at time T. After this conversion, we obtain \(F^\prime (T - 1,{P^{e(T)}})\), which is the BEV feature at time T-1 expressed at the ego position \({P^{e(T)}}\) of time T.
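A compact sketch of this alignment using a rigid bird's-eye-view warp is given below. It assumes a square BEV grid centered on the ego vehicle spanning \([-R, R]\) meters with the x axis along the grid width; the axis conventions, grid range, and helper name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def align_prev_bev(bev_prev, ego_prev_to_curr, bev_range=54.0):
    """Sketch of Eq. (2): warp the BEV feature of frame T-1 into the ego frame of T.
    bev_prev: (1, C, Hb, Wb); ego_prev_to_curr: 4x4 transform T_{e(T-1)}^{e(T)}."""
    # grid_sample needs the mapping from current-frame coordinates back to the
    # previous frame, i.e. the inverse of the supplied self-motion transform.
    curr_to_prev = torch.linalg.inv(ego_prev_to_curr)
    rot = curr_to_prev[:2, :2]                  # planar rotation
    trans = curr_to_prev[:2, 3] / bev_range     # translation in normalized grid units
    theta = torch.cat([rot, trans.unsqueeze(1)], dim=1).unsqueeze(0)  # (1, 2, 3)

    grid = F.affine_grid(theta, bev_prev.shape, align_corners=False)
    return F.grid_sample(bev_prev, grid, align_corners=False)
```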

The 3D convolutional fusion module is shown in Fig. 8. It adopts a structure similar to a residual network and mainly consists of three 3D convolutional layers. These layers dynamically learn the receptive field to capture the salient features of moving objects within the aligned BEV features and to extract temporal information across multi-frame features. Initially, we concatenate the multi-frame BEV features to obtain \({F^{combine}}\). Subsequently, multiple 3D convolutional layers are employed to extract the correlation and temporal information between multi-frame features. Finally, we concatenate the temporal information with \({F^{combine}}\) to obtain \({F^{update}}\), which is then fed into the subsequent BEV encoder module. Our BEV encoder module consists of an encoder backbone module and an encoder FPN module. Unlike BEVDet4D [38], since DFLSS has already performed spatial fusion on the BEV features, we do not need the additional preprocessing module used in BEVDet4D [38] to optimize the BEV features.
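The following sketch illustrates the structure of this module; the kernel sizes, channel width, and the exact residual-style wiring are assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class TemporalFusion3DSketch(nn.Module):
    """Sketch of the 3D convolutional fusion module in Fig. 8 (illustrative layout)."""
    def __init__(self, c=256, n_frames=3):
        super().__init__()
        self.temporal_convs = nn.Sequential(
            nn.Conv3d(c, c, kernel_size=(n_frames, 3, 3), padding=(0, 1, 1)),  # collapse the time axis
            nn.ReLU(inplace=True),
            nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        )

    def forward(self, bev_frames):
        # bev_frames: list of n_frames aligned BEV features, each (B, C, X, Y)
        f_combine = torch.stack(bev_frames, dim=2)            # (B, C, T, X, Y)
        temporal = self.temporal_convs(f_combine)             # (B, C, 1, X, Y)
        # Residual-style concatenation of the temporal cues with the stacked input.
        f_update = torch.cat([f_combine.flatten(1, 2),
                              temporal.flatten(1, 2)], dim=1) # (B, C*(T+1), X, Y)
        return f_update                                        # fed to the BEV encoder
```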

3.6 LiDAR-temporal fusion module

In the nuScenes dataset, the camera scanning frequency is 12 Hz, the LiDAR scanning frequency is 20 Hz, and the number of LiDAR beams is 32. Therefore, the current mainstream practice is to load the LiDAR data of the 10 historical sweeps and process them to obtain the LiDAR BEV. Following the method adopted in BEVFusion [4], after temporal fusion it is still necessary to fully utilize the BEV features from the point cloud branch. Therefore, we fuse the BEV feature from the LiDAR branch with the BEV feature from 3DMTF. To optimize the fusion process and prioritize critical features, we employ the Squeeze-and-Excitation (SE) block, a component designed to model inter-channel dependencies in convolutional neural networks. The block functions as follows: it first uses adaptive global average pooling to capture global channel-wise information, followed by a 1×1 convolutional layer to learn channel-specific weights; a sigmoid activation then scales the output to form channel-wise activation weights. During the forward pass, these weights multiply the input feature map, introducing a channel attention mechanism. This enables the network to emphasize informative channels and improve the feature representation. Our fusion module is described by Eq. (3) and shown in Fig. 9.

$$\begin{aligned} F_{fused} = {f_{seblock}}({f_{conv}}([{F_{Temporal}} ,{F_{LiDAR}}])) \end{aligned}$$
(3)

where \(F_{Temporal}\) represents the BEV features after temporal fusion, \(F_{LiDAR}\) represents the BEV features of the LiDAR branch, \(f_{conv}\) denotes the 2D convolutional operation, and \(f_{seblock}\) denotes the Squeeze-and-Excitation operation. Finally, \(F_{fused}\) represents the final fused BEV features.
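Equation (3) can be sketched as follows; the convolution kernel size is an assumption, while the SE block follows the description above (global average pooling, a 1×1 convolution, and a sigmoid).

```python
import torch
import torch.nn as nn

class LiDARTemporalFusionSketch(nn.Module):
    """Sketch of Eq. (3): concatenation, 2D convolution, then channel attention."""
    def __init__(self, c=256):
        super().__init__()
        self.conv = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)   # f_conv in Eq. (3)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # global average pooling over the BEV grid
            nn.Conv2d(c, c, 1),        # learn channel-specific weights
            nn.Sigmoid(),              # channel-wise activation weights
        )

    def forward(self, f_temporal, f_lidar):
        # f_temporal, f_lidar: (B, C, X, Y) BEV features from 3DMTF and the LiDAR branch
        x = self.conv(torch.cat([f_temporal, f_lidar], dim=1))
        return x * self.se(x)          # F_fused
```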

Fig. 9
figure 9

The framework of the LiDAR-Temporal fusion module. It includes a simple convolutional layer and channel attention module, and the C stands for concatenate operation

3.7 Loss function

We define the training cost as the weighted sum of the classification loss, regression loss, 3D IoU loss, and depth supervision loss:

$$\begin{aligned} \text {Loss} =&\{ \xi _1 L_{cls}(p,\hat{p}) + \xi _2 L_{reg}(b,\hat{b}) \nonumber \\&+ \xi _3 L_{IoU}(b,\hat{b}) + \xi _4 L_{\text {depth}}(y,\hat{y}) \} \end{aligned}$$
(4)

where \(L_{cls}\) denotes the focal loss [45], \(L_{reg}\) denotes the L1 loss between the predicted boxes' centers, scale, angle, and velocity and those of the ground-truth boxes, and \(L_{IoU}\) represents the IoU loss between the predicted boxes and the ground-truth boxes. The depth estimation loss \(L_{\text {depth}}\) is defined in Eq. (5). It is calculated as the sum of the binary cross-entropy losses between the predicted depth values \(\hat{y}_i\) and the ground-truth labels \(y_i\), normalized by the number of foreground pixels (and weighted by \(\xi _4\) in Eq. (4)). The \(\log \) function computes the natural logarithm, and \(1 - y_i\) represents the complement of the true depth labels. \(\max (1.0, \mathrm{fg}_{mask})\) is based on the count of foreground pixels and prevents division by zero. \(\xi _1\), \(\xi _2\), \(\xi _3\), and \(\xi _4\) denote the coefficients of the individual cost terms.

$$\begin{aligned} L_{\text {depth}} =&\frac{1}{\max (1.0, \mathrm{fg}_{mask})} \sum _{i=1}^{N} \left( -y_i \cdot \log (\hat{y}_i) \right. \nonumber \\&\left. - (1 - y_i) \cdot \log (1 - \hat{y}_i) \right) \end{aligned}$$
(5)
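A sketch of the depth term in Eq. (5) is given below; the tensor layouts and the clamping constant are illustrative.

```python
import torch
import torch.nn.functional as F

def depth_bce_loss(depth_pred, depth_onehot, fg_mask):
    """Sketch of Eq. (5): binary cross-entropy between the predicted depth
    distribution and the one-hot LiDAR depth targets, normalized by the
    number of foreground (LiDAR-hit) pixels."""
    # depth_pred, depth_onehot: (Nc, d, h, w); fg_mask: (Nc, h, w) boolean
    bce = F.binary_cross_entropy(depth_pred.clamp(1e-6, 1 - 1e-6),
                                 depth_onehot, reduction='none')
    bce = bce * fg_mask.unsqueeze(1)                       # only supervise hit pixels
    return bce.sum() / max(1.0, fg_mask.float().sum().item())
```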

4 Experiments

In this section, we begin by presenting our experimental setups and the details of implementation. Subsequently, we showcase the performance of DMFusion to demonstrate the effectiveness of the proposed framework. Finally, we conduct an ablation study to illustrate the advantages of our two proposed modules.

4.1 Experimental setup

We conducted comprehensive experiments on nuScenes [46], a large-scale autonomous driving dataset for 3D detection. The dataset comprises 1000 scenes, with 700, 150, and 150 scenes allocated for training, validation, and testing, respectively. Each frame includes six surround-view images and one LiDAR point cloud. The dataset contains up to 1.4 million annotated 3D bounding boxes across 10 categories. For the 3D detection task, we report several performance metrics, including the nuScenes Detection Score (NDS), mean Average Precision (mAP), and five True Positive (TP) metrics: mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE), and mean Average Attribute Error (mAAE).

4.2 Implementation details

Table 1 Evaluation results on the nuScenes validation set. We did not use BEV data augmentation when training DMFusion, so when comparing with previous methods we also report the results of other models [4] that did not use BEV data augmentation
Table 2 Performance comparison on the 10 class of the nuScenes test set

Our network is implemented in PyTorch [47] with the assistance of the open-source MMDetection3D [48] framework. We use Dual-Swin-Tiny [49] as the 2D backbone of the image-view encoder. CenterPoint [42] and TransFusion-L [3] are selected as our LiDAR stream and 3D detection heads. To facilitate data augmentation and feature extraction, and to maintain the same baseline comparison with other models, we set the image size to \(448\times 800\) and follow the official LiDAR setting with a voxel size of (0.075 m, 0.075 m, 0.2 m). Our training consists of two stages: i) we first train the LiDAR stream and the camera stream with LiDAR point cloud input and multi-view image input, respectively, following the official configurations in MMDetection3D; ii) we then train DMFusion for another 6 epochs, inheriting the weights of the two trained streams. The initial learning rate is 1e-4, the learning rate schedule is "step", and the optimizer is AdamW. Our experiments were conducted on 4 NVIDIA A100 GPUs. Note that when multi-view image inputs are involved, no data augmentation (i.e., flipping, rotation, or CBGS) is applied, and we freeze the weights of the camera branch during training. During testing, we follow the MMDetection3D setup of the LiDAR-only detector without any additional post-processing.
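For reference, a minimal MMDetection3D-style configuration fragment consistent with the settings above might look as follows; the weight decay, step milestones, checkpoint path, and the freezing flag are hypothetical placeholders rather than values taken from our released configuration.

```python
# Hypothetical MMDetection3D-style schedule fragment matching the stated settings.
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.01)   # weight decay assumed
lr_config = dict(policy='step', step=[4, 5])                 # milestones assumed, 6-epoch run
runner = dict(type='EpochBasedRunner', max_epochs=6)
load_from = 'work_dirs/pretrained_streams.pth'               # hypothetical checkpoint path
freeze_img_branch = True                                     # hypothetical flag for stage ii)
```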

4.3 Comparing with previous methods

Table 1 and Table 2 show that our results are competitive with previous methods on the validation and test sets, respectively. Our model without temporal fusion (DMFusionS) outperforms previous methods, achieving 68.93% mAP and 71.04% NDS on the validation set and 68.54% mAP and 71.14% NDS on the test set. Our model with DFLSS and temporal fusion (DMFusion) achieves 69.32% mAP and 71.21% NDS on the validation set and 68.96% mAP and 71.32% NDS on the test set. Moreover, since our multi-frame fusion module is implemented in the ego coordinate system, both our image branch and LiDAR branch are trained and evaluated in the ego system, which causes the LiDAR branch detection model trained in the ego system to suffer a certain degree of performance degradation compared to the model trained in the LiDAR system. Specifically, the TransFusion-L model we reproduced in the ego system achieves 62.34% mAP and 68.06% NDS; compared with the officially released model, this corresponds to drops of 2.56% in mAP and 1.84% in NDS, respectively. Despite the limited performance of the LiDAR branch, our approach still achieves a higher mAP than current methods.

Table 3 Performance and Generalization of our view transform module against the previous approach on nuScenes val set

4.4 Ablation Studies

To illustrate the effectiveness of each module in DMFusion, we perform an ablation study on them.

Table 4 Comparison of the true positive (TP) metrics of our view transform module with the previous LSS [7]-based methods on the nuScenes val set
Table 5 Comparison of category-specific detection accuracy between our DMFusionS and the baseline model BEVFusion\(\dag \) on the nuScenes val set
Fig. 10
figure 10

Comparison of the detection results on images using model with LSS and the model with DFLSS. The blue bounding box is the ground truth and the green bounding box is the prediction. The picture (a) is the visualization result using LSS, and the picture (b) is the visualization result using DFLSS

Table 6 Comparison of the true positive (TP) metrics of our temporal fusion module with different numbers of frames on the nuScenes val set

To understand the effect of our proposed DFLSS module, we choose two methods for comparison: the commonly used view transform module LSS in BEVFusion [4], and LSS with the depth loss supervision of BEVDepth [10] added on top (\(\text {LSS+Depthloss}\)); the study is conducted on the two detection heads CenterPoint and TransFusion-L. As shown in Table 3, our results show consistent improvements across all settings, improving over LSS by 3.24% mAP and 1.86% NDS with CenterPoint and by 1.92% mAP and 1.42% NDS with TransFusion-L. DFLSS improves over \(\text {LSS+Depthloss}\) by 1.82% mAP and 0.49% NDS with CenterPoint and by 0.77% mAP and 0.87% NDS with TransFusion-L. In terms of the true positive error metrics, as Table 4 shows, our DFLSS module achieves lower errors than LSS and \(\text {LSS+Depthloss}\); the reduction is most noticeable in the mean Average Orientation Error (mAOE) and mean Average Velocity Error (mAVE). This indicates that our method generalizes well across common configurations and shows superiority over the previous LSS owing to the full integration of LiDAR depth.

In addition, we have provided a detailed comparison of detection performance metrics for 10 types of objects in the nuScenes dataset in Table 5. Notably, upon implementing DFLSS, it is evident that, among these 10 object types, most of the performance metrics for our model without temporal fusion (DMFusionS) surpass those of \(\mathrm{{BEVFusio}}{\mathrm{{n}}^\dag }\).

We visualized the detection results on images using DFLSS and LSS respectively in Fig. 10. In the detection results obtained using the LSS method, some erroneous detection boxes and inaccurate detection of bus boxes are observed due to the insufficient fusion of semantic information between point cloud and image features. However, with the DFLSS method, this situation is improved. By incorporating depth information and semantic features from the point cloud during image depth prediction, the fusion process is enhanced, resulting in more accurate detection.

Results of the proposed temporal fusion module 3DMTF are illustrated in Table 6. We evaluate it with different heads. In the experiments with CenterPoint as the head, our baseline without temporal fusion obtains 63.6% mAP and 68.17% NDS; gradually increasing the number of input frames leads to gains of up to 2.64% in mAP and 2.1% in NDS. With TransFusion-L, our baseline without temporal fusion achieves 68.16% mAP and 70.17% NDS; by gradually increasing the number of input frames, we observe gains of up to 0.6% in mAP and 0.41% in NDS. This indicates that our method is capable of aggregating historical information over a long time span, thereby improving the performance of the model. Particularly notable are the improvements in time-related indicators: as shown in Table 6, our multi-frame fusion method significantly reduces the model's mean Average Velocity Error (mAVE).

In addition, we conducted ablation experiments with different numbers of frames after adding the DFLSS module, as shown in Table 6. After introducing DFLSS and temporal fusion, our model with the CenterPoint head reaches 66.49% mAP and 70.41% NDS, and our model with the TransFusion-L head reaches 69.32% mAP and 71.21% NDS.

Time cost is crucial for autonomous driving technology. To illustrate the impact of the image size and the number of input frames on the time cost, we present the corresponding results in Table 7.

We visualize the detection results on the image using the single-frame and multi-frame models in Fig. 11. It can be seen that using multiple frames improves the position and orientation estimation of distant objects and, at the same time, handles the occlusion problem better (see the two cars in the picture).

Table 7 Impact on time cost when the image size and the number of frames change
Fig. 11
figure 11

Comparison of the detection results on images using the single-frame model and the multi-frame model with 3DMTF. The blue bounding box is the ground truth and the green bounding box is the prediction. The picture (a) is the visualization result using a single frame, and the picture (b) is the visualization result using 3DMTF

4.5 Qualitative results

Figure 12 shows qualitative results of multi-class object detection for multiple scenes in the nuScenes dataset. Our model outputs accurate 3D bounding boxes in most cases. Furthermore, to demonstrate the effectiveness of our method, we also visualize the detection results of DMFusion, DMFusionS, and BEVFusion on the nuScenes validation set; all three use the CenterPoint [42] head as the detection head. As shown in Fig. 13, DMFusion and DMFusionS improve the orientation angle of the prediction boxes and simultaneously improve the detection accuracy, which proves the effectiveness of our proposed framework.

Fig. 12
figure 12

Example performance results for six view images from the nuScenes validation set. The blue bounding box is the ground truth and the green bounding box is the prediction

Fig. 13
figure 13

Qualitative comparison of detection results for nuScenes val set. We visualize them via BEV maps. The blue bounding box is the ground truth and the green bounding box is the prediction. The first row is the result of our model DMFusion, the second row is the result of DMFusionS (model without temporal fusion), and the third row is the result of BEVFusion. The orange circle shows our better detection performance than BEVFusion

4.6 Limitations

In this paper, the performance of the point cloud branch in the ego coordinate system is not as good as that in the LiDAR coordinate system, and this discrepancy has a noticeable impact on the overall performance of DMFusion. Furthermore, our multi-frame temporal fusion module relies on convolutional techniques; introducing an attention mechanism into the network architecture could potentially capture information from historical frames more effectively. At the same time, our approach endeavors to enhance the perceptual capabilities of vehicles in autonomous driving scenarios and exhibits cross-domain applicability, making it applicable to other systems that require autonomous navigation and environmental perception.

5 Conclusion

In this paper, a novel spatiotemporal fusion framework, namely DMFusion, is proposed for the 3D object detection task. Within this framework, a view transform module called DFLSS is proposed to integrate depth information from point clouds. DFLSS addresses the sparsity of BEV features and optimizes their spatial information, thereby enhancing object detection. For temporal fusion, a straightforward and easily adaptable module called 3DMTF is introduced, which effectively aligns and integrates multi-frame features. 3DMTF enriches the BEV feature in both positional and temporal aspects, leading to an improvement in the model's performance on time-related metrics. Extensive experiments demonstrate the effectiveness of these two modules. On the nuScenes test benchmark, DMFusion achieves a notable improvement of 1.42% in mAP and 1.26% in NDS over the baseline model. It is anticipated that DMFusion can serve as a source of inspiration for future research in multi-sensor fusion.