1 Introduction

Multi-Object Tracking (MOT) plays an essential role in video understanding. It aims to detect and track all objects of specific classes frame by frame, while maintaining a unique identifier for each object. MOT is fundamental to numerous applications, including but not limited to autonomous driving [1], video surveillance [2], sports and dance analysis [3, 4], and animal surveys [5]. Recent works [6,7,8] indicate that the tracking-by-detection (TBD) paradigm remains the optimal solution for the tracking task. The TBD algorithm mainly consists of a detector and a tracker. The advantage of the TBD paradigm is that the detector and tracker can be effectively decoupled, allowing research to focus on the data association algorithm. Typically, an off-the-shelf detector is used for detection, followed by a specially designed tracker that associates the detection boxes. Although the MOT studies [7,8,9,10,11,12] have greatly advanced with the development of sophisticated object detectors [13,14,15,16], existing TBD algorithms still exhibit significant limitations: (1) Ineffectiveness in distinguishing objects with similar appearances. (2) Difficulty in effectively tracking objects exhibiting large-range motion, diverse gestures, and other irregular movements. (3) Frequent ID switching during frequent crossovers.

Motivated by these challenges, we propose WDTtrack to track multiple objects with irregular motions and indistinguishable appearances, following the tracking-by-detection paradigm. First, Yolov7-w6 is employed to provide reliable detection boxes. Then, a Kalman filter, as modified in the BoT-SORT algorithm, is used to better exploit historical trajectory information, with the state represented by an eight-tuple instead of the traditional seven-tuple. Furthermore, the Centroid Triplet Loss (CTL) Re-ID model is used as a feature extractor to discern differences in appearance embeddings between objects. In addition to appearance embeddings, we also consider motion cues. The Wider Bounding Box (W-BBox) is designed to extend the horizontal search space, enhancing tracking performance during fast and non-linear object movements. Additionally, we introduce the Direction Bank (DB), which robustly represents the velocity direction of objects with non-linear motion. In these ways, WDTtrack not only exploits appearance embeddings but also leverages spatial proximity by measuring both velocity directions and the Intersection over Union between W-BBoxes (W-IoU). Finally, the Tracklet Recovery Mechanism (TRM) is presented to alleviate ID switching caused by occlusion and to support long-term tracking by associating new tracklets with lost (or removed) ones.

To conclude, the key contributions of our work to the object tracking community are as follows:

(1) To address occlusion and motion blur, the Wider Bounding Box (W-BBox) is proposed to expand the horizontal matching space of high-score bounding boxes, alleviating the ID switching caused by these factors.

(2) To tackle frequent ID switching driven by similar appearances, non-linear motion, and frequent crossovers, the Direction Bank (DB) is introduced to extract velocity direction features. This enhances the consistency of velocity directions and strengthens the discriminative features of objects with similar appearances.

(3) To alleviate both tracking loss and ID switching caused by occlusion, the Tracklet Recovery Mechanism (TRM) is designed to help recover the trajectories of occluded objects, thereby facilitating long-term tracking.

(4) We provide comprehensive and detailed experiments to demonstrate the effectiveness of each module and discover more cues for developing MOT trackers.

The manuscript is organized as follows: Section 2 reviews the development of MOT research and the literature closely related to this paper. In Section 3, we explain the proposed methodology, including a detailed exploration of the W-BBox, DB and TRM modules. We emphasize each module’s role in the tracking task and demonstrate how these modules provide more complex, reliable and diverse tracking cues, thus enhancing the model’s discriminative capabilities. Our experiments and results are described in Section 4. Finally, the conclusions are detailed in Section 5.

2 Related work

In recent years, MOT algorithms have boomed, as depicted in Fig. 1. These algorithms can be roughly divided into three categories based on their implementation principles. The first category is the classic tracking-by-detection paradigm, in which detection boxes are usually obtained from off-the-shelf detectors and then associated by data association algorithms. The second category comprises algorithms that join detection and tracking in a single network, pursuing a one-step implementation. The last category is based on Transformers and achieves detection and tracking by designing specific queries. The general architectures of these three categories are shown in Fig. 2.

Fig. 1 The development history of MOT algorithms. Yellow boxes denote tracking-by-detection methods, green boxes denote joint detection and tracking algorithms, and blue boxes denote tracking-by-query algorithms

Fig. 2 General framework of TBD, JDT and TBQ multi-object tracking algorithms

Tracking-By-Detection (TBD) With the recent advancements in object detection [13, 16, 17], tracking-by-detection methods have seen significant improvements. In this framework, an off-the-shelf object detector first predicts the object bounding boxes for each frame. These boxes are subsequently associated across adjacent frames using an association algorithm, with the overall architecture shown in Fig. 2(a). Consequently, many studies have focused on data association, aiming to more effectively exploit appearance embeddings [9, 10, 18] or motion cues [19, 20]. For example, SORT [19] utilizes a Kalman Filter [21] for each tracked instance and associates each instance with its highest overlapping detection in the current frame using bipartite matching. DeepSORT [18] employs a separate deep network to extract the appearance embeddings of the instances and augments the overlap-based association cost in SORT with these appearance embeddings. ByteTrack [22] divides the detection boxes into high-score and low-score boxes and associates every detection box instead of only the high-score ones. Building on ByteTrack, OC-SORT [11] introduces observation-centric online smoothing, observation-centric momentum, and observation-centric recovery to improve tracking robustness in non-linear motion scenarios. BoT-SORT [23] further refines ByteTrack by implementing an improved Kalman Filter, camera-motion compensation and ReID feature fusion. To clarify the effectiveness of the tracking-by-detection paradigm, StrongSORT [6] enhances DeepSORT and introduces an appearance-free link model and Gaussian-smoothed interpolation, which associate short tracklets into complete trajectories and compensate for missing detections, respectively. While these TBD methods have shown promising results in the MOT task, they often fall short in cases involving non-linear motion and similar appearances, leaving significant room for improvement in multi-object tracking.

Joint Detection and Tracking (JDT) Some studies focus on converting existing detectors into trackers by integrating both tasks within the same framework, as shown in Fig. 2(b). To enable simultaneous output of detections and corresponding embeddings, Wang et al. [24] incorporate the embedding model into a single-shot detector. CenterTrack [25] localizes objects and predicts their associations with the previous frame by applying a detection model to a pair of images and detections from the prior frame. FairMOT [26] designs two homogeneous branches within a single model to predict pixel-wise objectness scores and re-ID features, thereby avoiding unfairness between detection and embedding feature extraction during training. Unicorn [27] further integrates an appearance extractor, an object detector and a feature interaction module into a single network. Despite the simplicity of its architecture, this paradigm exhibits a drawback: the object detection task requires deep and abstract features to infer object categories and locations, whereas embedding feature extraction requires shallow features to differentiate objects within the same category.

Table 1 List of symbols and abbreviations
Fig. 3 Overview of our WDTtrack. \(T_{remain}^{t-1}\) and \(D^{t}_{high-remain}\) indicate the tracklets and detection boxes left unmatched in the first association, respectively. \(T_{re-remain}^{t-1}\) denotes the tracklets left unmatched in the second association, and \(T_{match}\) refers to the matched tracklets

Fig. 4 The pipeline of WDTtracker

Fig. 5 Illustration of the Wider Bounding Box (W-BBox). The W-BBox only expands the original bounding box horizontally. It does not change the center location or the height

Tracking-By-Query (TBQ) Recently, with the introduction of Transformers [28] to MOT, a new tracking paradigm called tracking-by-query has attracted attention from several researchers. The core idea of the paradigm is to extend query-based object detectors [14, 15, 29] for tracking purposes and leverage Transformers to learn deep representations from both visual information and object trajectories, as illustrated in Fig. 2(c). TransTrack [30] propagates track queries once to locate the object in the following frame. MOTR [31] and TrackFormer [32], which extend Deformable DETR [15], predict the object bounding boxes and update the tracking query to detect the same instances in subsequent frames. MeMOTR [33] leverages a long-term memory-augmented Transformer to stabilize the same object’s track embedding and make it more distinguishable. MOTRv2 [34] improves MOTR by incorporating an extra object detector. Although MOTRv2 achieves state-of-the-art performance, Transformer-based algorithms require a substantial amount of high-quality labeled data for training due to their large parameter counts; otherwise they are prone to underfitting. This poses significant challenges in specific domains with only small-scale datasets. Furthermore, the Transformer model employs a self-attention mechanism to achieve contextual awareness, while the capture of dependencies between different positions within this mechanism relies on local context information. Consequently, when dealing with longer video sequences, the learning capacity of the model becomes constrained.

Compared with JDT and TBQ algorithms, TBD algorithms leverage the capabilities of existing powerful detectors while focusing more intensely on the tracking task. They can achieve superior performance through the strategic design, addition, or removal of various modules and techniques, making them both convenient and effective. Therefore, our approach adopts the tracking-by-detection pipeline to track objects with similar appearances and irregular movements, utilizing the advanced object detector Yolov7 to provide location information. The proposed method significantly outperforms existing tracking-by-detection based methods in handling complex motions.

3 WDTtrack

In this section, we present the details of WDTtrack. All symbols and abbreviations related to WDTtrack are collected in Table 1.

3.1 Overview

An overview of WDTtrack is shown in Fig. 3. It primarily consists of the Yolov7 detector and the proposed WDTtracker. WDTtracker extracts high-quality tracking cues by combining discriminative appearance information with powerful motion information to overcome challenges associated with occlusion, rapid movement, and similar appearances. Specifically, to address the issue of fast-moving objects, we design the Wider Bounding Box (W-BBox) module to horizontally expand the matching space of detections and tracklets. Furthermore, we introduce the Direction Bank (DB) module and use the momentum of OC-SORT [11] to robustly model the complex motion of objects. For Re-ID, recognizing that the appearance of low-score detection boxes is less reliable, we employ the CTL [35] Re-ID model to extract appearance embeddings of high-score detection boxes only. Additionally, we develop the Tracklet Recovery Mechanism (TRM) to relocate lost objects. For a video sequence of length N, denoted \(F=\{F^1,F^2,...,F^N\}\), we use an off-the-shelf object detector (e.g. Yolov7 [16]) to detect objects in each frame and produce bounding boxes. The proposed tracker then processes these bounding boxes and outputs the trajectories \(T=\{T^1,T^2,...,T^N\}\) for all frames F, with \(T^t=\{trk_1^t,trk_2^t,...\}\) representing the trajectories in frame t, where \(trk_1^t\) is the trajectory with ID 1 in frame t. For matching between trajectories of different frames, we design a cascade matching strategy with two association stages. For more details on the WDTtracker pipeline, please see Section 3.2.

Fig. 6 Illustration of the direction cue. (a) shows the velocity direction representation \(V^t\) of tracklets \(T^t\), and (b) denotes the velocity direction matrix \(M^t\) between tracklets \(T^{t-1}\) and detections \(D^t\)

3.2 Tracking pipeline

The pipeline of WDTtracker is illustrated in Fig. 4. At step t, the frame \(F^t\) is fed into the WDTtrack framework, and Yolov7 generates K detection boxes \(D^t=\{d_1^t,d_2^t,...,d_K^t\}\). Inspired by ByteTrack [22], we divide the detection boxes into two parts, \(D_{high}^t\) and \(D_{low}^t\), according to the confidence thresholds \(\tau _{high}\) and \(\tau _{low}\). Then the high-score detections \(D_{high}^t\) and low-score detections \(D_{low}^t\) are fed into WDTtracker. For \(D_{high}^t\), we apply the CTL [35] Re-ID model to extract the appearance embeddings \(AE^t_{high}\) and calculate their wider bounding boxes \(WB^t_{high}\). Then, BoT-SORT’s Kalman filter [23] is used to predict the positions of the tracklets \(T^{t-1}\) from the previous step \(t-1\) and to update their wider bounding boxes \(WB_{trk}^t\). In addition, we estimate the velocity directions \(V^{t-1}\) of \(T^{t-1}\) using a direction bank. Utilizing the wider bounding boxes \(WB^t_{high}\) and \(WB_{trk}^t\), the velocity directions \(V^{t-1}\), and the appearance features \(AE^t_{high}\) and \(AE^{t-1}_{trk}\), we perform the first association between the high-score detections \(D_{high}^t\) and \(T^{t-1}\). The cost matrix for the first association is calculated as follows:

$$\begin{aligned} C_{first}^t=\gamma (C_{W\text{- }IoU}+\lambda _{high}C_{Dir})+(1-\gamma )C_{AE} \end{aligned}$$
(1)

where \(\gamma \) and \(\lambda _{high}\) are the weights of the motion and velocity direction cost matrices, respectively. Here, \(C_{W-IoU}\), \(C_{Dir}\) and \(C_{AE}\) are the cost matrices of the IoU between \(WB_{trk}^t\) and \(WB_{high}^t\), the difference in velocity direction between tracklet and detection, and the appearance similarity, respectively. The equations are as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} C_{W\text{- }IoU}=1-IoU(WB_{trk}^t,WB_{high}^t) \\ C_{Dir} = Dir(T^{t-1},D_{high}^t,V^{t-1}) \\ C_{AE}=Cosine(AE_{trk}^{t-1},AE_{high}^t) \end{array}\right. } \end{aligned}$$
(2)

Here, \(IoU(\cdot ,\cdot )\) computes the Intersection over Union between its inputs, and \(Dir(\cdot ,\cdot ,\cdot )\) calculates the velocity direction similarity, which is derived from OC-SORT [11]. \(Cosine(\cdot ,\cdot )\) is the cosine similarity function, calculated as follows:

$$\begin{aligned} Cosine(AE_{trk}^{t-1},AE_{high}^t)=\dfrac{AE_{trk}^{t-1} \cdot AE_{high}^t}{\parallel AE_{trk}^{t-1}\parallel \times \parallel AE_{high}^t \parallel } \end{aligned}$$
(3)
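The fusion in (1) and the cosine similarity in (3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the default weights are the values reported in Section 4.1, and converting the similarity into a cost as one minus the similarity is an assumption (Eq. (2) writes \(C_{AE}\) directly as the cosine term).

```python
import numpy as np

def cosine_similarity_matrix(track_emb, det_emb, eps=1e-12):
    """Pairwise cosine similarity (Eq. 3) between rows of two embedding arrays."""
    t = track_emb / (np.linalg.norm(track_emb, axis=1, keepdims=True) + eps)
    d = det_emb / (np.linalg.norm(det_emb, axis=1, keepdims=True) + eps)
    return t @ d.T  # shape: (num_tracklets, num_detections)

def first_association_cost(c_wiou, c_dir, c_ae, gamma=0.3, lambda_high=0.7):
    """Eq. (1): gamma * (C_W-IoU + lambda_high * C_Dir) + (1 - gamma) * C_AE."""
    return gamma * (c_wiou + lambda_high * c_dir) + (1.0 - gamma) * c_ae

# Toy example with two tracklets and two high-score detections.
track_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
det_emb = np.array([[2.0, 0.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(track_emb, det_emb)
c_ae = 1.0 - sim                            # assumption: similarity turned into a cost
c_wiou = np.array([[0.1, 0.6], [0.9, 0.5]])  # illustrative 1 - W-IoU values
c_dir = np.zeros((2, 2))                     # illustrative direction cost
cost = first_association_cost(c_wiou, c_dir, c_ae)
```

The resulting matrix has one row per tracklet and one column per detection, so it can be passed to any assignment solver.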

Analogously, based on IoU and the direction vector, the second association is performed between the low-score detections \(D_{low}^t\) and the tracklets \(T_{remain}^{t-1}\) that remain after the first association. The cost matrix for the second association is as follows:

$$\begin{aligned} {\begin{matrix} C_{second}^t=(1-IoU(T_{remain}^{t-1},D_{low}^t)) \\ +\lambda _{low}Dir(T_{remain}^{t-1},D_{low}^t,V^{t-1}) \end{matrix}} \end{aligned}$$
(4)

where \(\lambda _{low}\) is the weight of the velocity direction cost matrix. Next, according to the Tracklet Recovery Mechanism (TRM, see Section 3.5 for details), we first initialize new tracklets \(T_{new}^t\) from the high-score detections \(D_{high-remain}^t\) that remain after the first association and whose scores are higher than \(\tau _{track}\), and then determine whether the new tracklets \(T_{new}^t\) should be connected with the tracklets \(T_{re-remain}^{t-1}\) (the tracklets left unmatched in the second association) or the removed tracklets \(T_{remove}\). After the tracklet recovery, any tracklet failing to match for more than \(\tau _{remove}\) consecutive frames is removed. Finally, the states (e.g. Direction Bank, Kalman filter and appearance embeddings) of the matched tracklets and the new tracklets \(T_{new}^t\) are updated, and they are taken as the tracklets \(T^t\) at time step t. Please refer to Algorithm 1 for the pseudo-code of the tracking pipeline.

Algorithm 1 Pseudo-code of tracking pipeline.

Fig. 7 Illustration of the Tracklet Recovery Mechanism. \(D_{high-remain}^t\), \(T_{re-remain}^{t-1}\) and \(T_{remove}\) denote the remaining high-score detections after the first association, the remaining tracklets after the second association and the removed tracklets, respectively. \(\tau _{track}\), \(\tau _{TRM}\) and \(\tau _{remove}\) represent the tracking time threshold, the recovery threshold and the removing threshold, respectively. \(size(\cdot )\) and \(len(\cdot )\) indicate the size of a set and the tracking time of a tracklet, respectively

3.3 Wider bounding box (W-BBox)

According to the imaging principle, objects within a camera’s field of view appear larger when near and smaller when far (the “large near and small far” characteristic). As a result, in video frames, when an object moves a fixed distance, its horizontal movement amplitude will be greater than its vertical movement amplitude. Furthermore, overlapping regions are expected to occur more frequently in the vertical direction. Drawing inspiration from this observation, we expand the horizontal matching space to remedy the motion blur caused by fast horizontal movement. Consequently, we devise the Wider Bounding Box (W-BBox, Fig. 5) and refer to the intersection over union (IoU) between these expanded boxes as wider IoU (W-IoU). Let \(d=(c_x,c_y,w,h)\) represent the original detection, where \((c_x,c_y,w,h)\) are the center coordinates, width, and height of the box, respectively, and let \(\delta \) denote the wider scale rate. The formula for the wider bounding box wb is:

$$\begin{aligned} wb=W\text{- }BBox(d,\delta )=(c_x,c_y,(1+\delta )w,h) \end{aligned}$$
(5)

and W-IoU is calculated by:

$$\begin{aligned} W\text{- }IoU=\dfrac{wb_1\cap wb_2}{wb_1\cup wb_2} \end{aligned}$$
(6)
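Equations (5) and (6) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, boxes are assumed to be `(c_x, c_y, w, h)` tuples, and the default \(\delta = 0.4\) is the value reported in Section 4.1.

```python
def wider_bbox(box, delta=0.4):
    """Eq. (5): widen a (cx, cy, w, h) box horizontally; center and height are unchanged."""
    cx, cy, w, h = box
    return (cx, cy, (1.0 + delta) * w, h)

def iou_cxcywh(a, b):
    """Plain IoU between two (cx, cy, w, h) boxes."""
    ax1, ax2 = a[0] - a[2] / 2.0, a[0] + a[2] / 2.0
    ay1, ay2 = a[1] - a[3] / 2.0, a[1] + a[3] / 2.0
    bx1, bx2 = b[0] - b[2] / 2.0, b[0] + b[2] / 2.0
    by1, by2 = b[1] - b[3] / 2.0, b[1] + b[3] / 2.0
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def wider_iou(a, b, delta=0.4):
    """Eq. (6): IoU computed between the widened versions of both boxes."""
    return iou_cxcywh(wider_bbox(a, delta), wider_bbox(b, delta))

# Two boxes separated by a 2-pixel horizontal gap.
track_box = (0.0, 0.0, 10.0, 10.0)
det_box = (12.0, 0.0, 10.0, 10.0)
plain = iou_cxcywh(track_box, det_box)  # no overlap
wide = wider_iou(track_box, det_box)    # widened boxes overlap by a 2 x 10 strip
```

In this toy case the plain IoU is zero, so an IoU-gated matcher would reject the pair, while the widened boxes overlap and keep the pair as a candidate.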

Additionally, it is important to note that W-IoU calculations are performed exclusively for high-score detection boxes, while low-score detection boxes are evaluated using regular IoU. This distinction is made because low-score detection boxes typically correspond to backgrounds or objects that are significantly occluded. Direct expansion of their horizontal search spaces could lead to potential misalignments with other tracked objects. In this way, WDTtrack not only benefits from a wider matching area but also minimizes negative effects from low-score detections.

3.4 Motion model

BoT-SORT Kalman Filter (BTK) In multi-object tracking tasks, the motion of objects on the image plane is typically modeled by a discrete Kalman filter with a constant velocity model. BoT-SORT [23] has demonstrated that directly estimating the width and height of the bounding box leads to better Kalman filter performance, and refined the filter accordingly. Therefore, we directly employ the improved BoT-SORT Kalman filter to predict the location of the tracklet. However, the Kalman filter operates under the assumption of linear movement and becomes significantly less effective when dealing with non-linear and irregular motion patterns. To address this, we propose an additional feature metric, the Direction Bank (DB), to assess the object’s motion state, thereby improving the predictive accuracy of the model in scenarios involving non-linear motion.
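To make the constant velocity assumption concrete, the following sketch shows only the mean-prediction step for an eight-dimensional state holding the box center, width, height and their velocities. It is a simplified illustration under stated assumptions: the function names are hypothetical, and covariance propagation, process noise and the measurement update are omitted.

```python
import numpy as np

def constant_velocity_transition(dt=1.0):
    """8x8 transition matrix for the state [cx, cy, w, h, v_cx, v_cy, v_w, v_h]."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)  # each position/size term advances by its velocity
    return F

def predict_mean(state, dt=1.0):
    """Propagate the state mean by one step (noise and covariance are not modeled here)."""
    return constant_velocity_transition(dt) @ state

state = np.array([100.0, 50.0, 20.0, 40.0, 2.0, -1.0, 0.5, 0.0])
predicted = predict_mean(state)
# predicted position/size = previous value + velocity; velocities stay constant
```

Because the velocities are carried over unchanged, any abrupt change of direction is only absorbed gradually through later updates, which is the limitation the Direction Bank is meant to compensate for.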

Direction Bank (DB) The Kalman filter’s assumption of linear motion presupposes a consistent velocity direction. However, due to the non-linear motion of objects and state noise, this assumption often does not hold. We can only approximate motion as linear for a relatively short duration, and even then, noise still affects the consistency of the velocity direction. To mitigate the side effects of inconsistent velocity directions, we propose a Direction Bank (DB, as sketched in Fig. 6(a)), which enhances the reliability of velocity direction consistency and thus improves the descriptive power for non-linear motion states. At step t, we suppose that there are n tracklets in \(T^t\), \(T^t=\{trk_1^t,trk_2^t,...,trk_n^t\}\), where \(trk_i^t\) is the tracklet whose ID is i in frame t. The Direction Bank \(DB^t\) can be represented as \(DB^t=\{db_1^t,db_2^t,...,db_n^t\}\), with \(db_i^t=\{(x_i^{t-l_{DB}+1},y_i^{t-l_{DB}+1}),(x_i^{t-l_{DB}+2},y_i^{t-l_{DB}+2}),...,(x_i^{t},y_i^{t})\}\) denoting the direction bank of \(trk_i^t\) in both the horizontal and vertical directions for the past \(l_{DB}\) frames. Here, \((x_i^t,y_i^t)\) is the direction vector of \(trk_i^t\) at step t in the horizontal and vertical directions, which is estimated as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_i^t=\dfrac{c_{xi}^t-c_{xi}^{t-1}}{\sqrt{(c_{xi}^t-c_{xi}^{t-1})^2+(c_{yi}^t-c_{yi}^{t-1})^2}+\theta } \\ y_i^t=\dfrac{c_{yi}^t-c_{yi}^{t-1}}{\sqrt{(c_{xi}^t-c_{xi}^{t-1})^2+(c_{yi}^t-c_{yi}^{t-1})^2}+\theta } \end{array}\right. } \end{aligned}$$
(7)

where \((c_{xi}^t,c_{yi}^t)\) represents the center coordinate of \(trk_i^t\) and \(\theta \) is a small positive number introduced to prevent the denominator from reaching 0 (\(\theta =10^{-6}\)). Next, we average the \(db_i^t\) to acquire the velocity direction \(v_i^t=\{x_{i\text{- }avg}^t, y_{i\text{- }avg}^t\}\) of \(trk_i^t\). And the set \(V^t=\{v_1^t,v_2^t,...,v_n^t\}\) is the velocity directions of all tracklets \(T^t\) at time t.

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{i\text{- }avg}^t=\dfrac{x_i^{t-l_{DB}+1}+x_i^{t-l_{DB}+2}+...+x_i^{t}}{l_{DB}} \\ y_{i\text{- }avg}^t=\dfrac{y_i^{t-l_{DB}+1}+y_i^{t-l_{DB}+2}+...+y_i^{t}}{l_{DB}} \end{array}\right. } \end{aligned}$$
(8)
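A per-tracklet direction bank following (7) and (8) can be sketched with a fixed-length queue. The class and method names are hypothetical, and one detail is an assumption: while the bank holds fewer than \(l_{DB}\) entries, the average is taken over the entries available rather than over \(l_{DB}\).

```python
import math
from collections import deque

class DirectionBank:
    """Stores the last l_DB unit direction vectors of one tracklet."""

    def __init__(self, length=10, theta=1e-6):
        self.bank = deque(maxlen=length)  # oldest entries are dropped automatically
        self.theta = theta                # keeps the denominator in Eq. (7) non-zero
        self.prev_center = None

    def update(self, center):
        """Append the normalized displacement between consecutive centers (Eq. 7)."""
        if self.prev_center is not None:
            dx = center[0] - self.prev_center[0]
            dy = center[1] - self.prev_center[1]
            norm = math.hypot(dx, dy) + self.theta
            self.bank.append((dx / norm, dy / norm))
        self.prev_center = center

    def velocity_direction(self):
        """Average of the stored direction vectors (Eq. 8)."""
        if not self.bank:
            return (0.0, 0.0)
        n = len(self.bank)  # assumption: average over available entries
        return (sum(x for x, _ in self.bank) / n,
                sum(y for _, y in self.bank) / n)

db = DirectionBank(length=10)
for center in [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]:
    db.update(center)
vx, vy = db.velocity_direction()  # close to (0.6, 0.8)
```

Averaging unit vectors over several frames damps frame-to-frame jitter, so a single noisy displacement does not flip the estimated direction.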

For the velocity direction matrix \(M^t\) relating to the detection box \(D^t\) and tracklet \(T^{t-1}\), pairwise calculations are made between the bounding box matched by each tracklet in the previous frame and the detection box in current frame, as prescribed in (7) and illustrated in Fig. 6(b). Subsequently, the direction vector \(v_i^{t-1}\) of \(trk_i^{t-1}\) and the velocity direction matrix \(M^t\) are utilized to compute the velocity direction similarity.
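The pairwise computation of \(M^t\) can be vectorized as below. This is a sketch with hypothetical names. The final cosine comparison between \(v_i^{t-1}\) and each row of \(M^t\) is only an illustrative stand-in, since the exact OC-SORT-derived similarity used by the method is not reproduced here.

```python
import numpy as np

def direction_matrix(track_centers, det_centers, theta=1e-6):
    """M[i, j]: unit vector from tracklet i's last matched center to detection j's center."""
    diff = det_centers[None, :, :] - track_centers[:, None, :]  # (n, K, 2)
    norm = np.linalg.norm(diff, axis=2, keepdims=True) + theta
    return diff / norm

track_centers = np.array([[0.0, 0.0]])            # one tracklet
det_centers = np.array([[10.0, 0.0], [0.0, -5.0]])  # two detections
M = direction_matrix(track_centers, det_centers)

# Illustrative comparison (assumption): cosine between the bank direction and M[i, j].
v = np.array([[1.0, 0.0]])            # velocity direction of the tracklet
cos = np.einsum("ij,ikj->ik", v, M)   # shape (n, K)
```

Here the first detection lies along the tracklet's direction of travel and the second is perpendicular to it, which the cosine values reflect.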

3.5 Tracklet Recovery Mechanism (TRM)

In this section, we detail the design principle of the Tracklet Recovery Mechanism (TRM), which is illustrated in Fig. 7 and given as pseudo-code in Algorithm 2. This mechanism is simple and convenient, yet effective in mitigating tracking loss and ID switches caused by occlusion, thereby facilitating long-term tracking continuity. TRM is performed between the newly initialized tracklets \(T_{new}^t\) and the tracklets \(T_{re-remain}^{t-1}\) (the tracklets at step \(t-1\) that remain unmatched after both associations) or the removed tracklets \(T_{remove}\). First, TRM generates new tracklets \(T_{new}^t\) from the remaining high-score detections \(D_{high-remain}^t\) that exceed the score threshold \(\tau _{track}\) after the first association. Subsequently, it assesses whether the following three conditions are met:

  1. There is only one new tracklet in \(T_{new}^t\).

  2. There is only one tracklet among the unmatched tracklets \(T_{re-remain}^{t-1}\), and it has been successfully tracked for at least \(\tau _{TRM}\) frames.

  3. There is only one tracklet among the removed tracklets \(T_{remove}\), it has been successfully tracked for at least \(\tau _{TRM}\) frames, and its removal time \(\tau _{remove}\) is shorter than its tracking time.

If condition 1 is met, the mechanism then checks condition 2. If condition 2 is fulfilled, TRM directly associates the new tracklet \(T_{new}^t\) with the unmatched tracklet \(T_1'\) to form \(T_{recovery}^t\), bypassing condition 3. If condition 2 is not met, condition 3 is evaluated; if it is met, the new tracklet \(T_{new}^t\) and the removed tracklet \(T_2'\) are joined to form \(T_{recovery}^t\). If neither condition is satisfied, no action is taken. In this way, both the long-term-lost tracklet in \(T_{remove}\) and the short-term-lost tracklet in \(T_{re-remain}^{t-1}\) have the opportunity to be recovered. We thus not only recover tracklets lost due to occlusion, but also decrease ID switching.
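The decision logic of the three conditions can be sketched as follows. This is an interpretation rather than the paper's Algorithm 2: the `Tracklet` fields and the function name are hypothetical, and condition 3 is read as "the number of frames since removal is smaller than the number of frames the tracklet was tracked".

```python
from dataclasses import dataclass

@dataclass
class Tracklet:
    track_id: int
    tracked_frames: int            # len(.) in Fig. 7: how long the tracklet was tracked
    frames_since_removed: int = 0  # only meaningful for removed tracklets (assumption)

def tracklet_recovery(new_tracklets, unmatched, removed, tau_trm=30):
    """Return the ID the single new tracklet should inherit, or None if TRM does not fire."""
    # Condition 1: exactly one newly initialized tracklet.
    if len(new_tracklets) != 1:
        return None
    # Condition 2: exactly one unmatched tracklet, tracked for at least tau_trm frames.
    if len(unmatched) == 1 and unmatched[0].tracked_frames >= tau_trm:
        return unmatched[0].track_id
    # Condition 3: exactly one removed tracklet, tracked long enough and removed recently.
    if len(removed) == 1:
        cand = removed[0]
        if (cand.tracked_frames >= tau_trm
                and cand.frames_since_removed < cand.tracked_frames):
            return cand.track_id
    return None

new = [Tracklet(track_id=9, tracked_frames=1)]
short_term = tracklet_recovery(new, [Tracklet(3, 45)], [])
long_term = tracklet_recovery(new, [], [Tracklet(5, 60, frames_since_removed=20)])
ambiguous = tracklet_recovery(new + [Tracklet(10, 1)], [Tracklet(3, 45)], [])
```

Requiring exactly one candidate on each side keeps the recovery unambiguous: when several new or lost tracklets coexist, the mechanism does nothing rather than risk a wrong link.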

Algorithm 2 Pseudo-code of Tracklet Recovery Mechanism (TRM).

4 Experiments

4.1 Experiments settings

Datasets We evaluate the effectiveness of our approach on the DanceTrack [4] and SportsMOT [36] datasets, following the “private detection” protocol. Each dataset presents unique characteristics and challenges. The DanceTrack dataset comprises 100 videos (105,855 frames), predominantly featuring group dances sourced from the Internet. It is divided into 40 training videos, 25 validation videos, and 35 test videos. DanceTrack is known for difficulties such as irregular motion, diverse object poses, and object appearances that are similar to one another. In contrast, SportsMOT is a large dataset consisting of 240 video sequences (150,379 frames) collected from basketball, volleyball and football scenes, which are split into 45, 45 and 150 video sequences for the training, validation and testing subsets, respectively. SportsMOT is characterized by fast, variable-speed motion and similar yet distinguishable appearances. Compared to SportsMOT, DanceTrack exhibits richer and more complex motions, as well as more challenging object appearances, enabling a comprehensive analysis of the proposed tracking method.

Metrics The Higher Order Tracking Accuracy (HOTA) metric [37], Identity F1 Score (IDF1) [38], Association Accuracy (AssA) [37], Detection Accuracy (DetA)[37] and Multi-Object Tracking Accuracy (MOTA) [39] are used to evaluate different aspects of tracking performance in our experiments. And the calculation formulas are as follows:

$$\begin{aligned} \textrm{HOTA}=\sqrt{\textrm{DetA}\cdot \textrm{AssA}} \end{aligned}$$
(9)
$$\begin{aligned} \textrm{IDF1}=\frac{|\textrm{IDTP}|}{|\textrm{IDTP}|+0.5|\textrm{IDFN}|+0.5|\textrm{IDFP}|} \end{aligned}$$
(10)
$$\begin{aligned} \textrm{AssA}=\frac{1}{|\textrm{TP}|}\sum _{c\in \{\textrm{TP}\}}\frac{|\textrm{TPA}(c)|}{|\textrm{TPA}(c)|+|\textrm{FNA}(c)|+|\textrm{FPA}(c)|} \end{aligned}$$
(11)
$$\begin{aligned} \textrm{DetA}=\frac{|\textrm{TP}|}{|\textrm{TP}|+|\textrm{FN}|+|\textrm{FP}|} \end{aligned}$$
(12)
$$\begin{aligned} \textrm{MOTA}=1-\frac{|\textrm{FN}|+|\textrm{FP}|+|\textrm{IDSW}|}{|\textrm{gtDet}|} \end{aligned}$$
(13)

where TP, FN and FP represent the true positives, false negatives, and false positives, respectively. Similarly, IDTP, IDFN and IDFP denote the identity true positives, identity false negatives, and identity false positives, respectively. Additionally, TPA, FNA and FPA stand for the true positive associations, false negative associations, and false positive associations, respectively. For further details, please refer to [37]. IDSW and gtDet represent the number of identity switches and the total number of ground truth detections, respectively.
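As a quick numerical illustration of (9), (10), (12) and (13), the count-based metrics can be computed as below. The counts are made up for the example and are not results from the paper. AssA (11) is omitted because it needs per-detection association sets rather than global counts.

```python
import math

def deta(tp, fn, fp):
    """Eq. (12): detection accuracy."""
    return tp / (tp + fn + fp)

def hota(det_a, ass_a):
    """Eq. (9): geometric mean of detection and association accuracy."""
    return math.sqrt(det_a * ass_a)

def idf1(idtp, idfn, idfp):
    """Eq. (10): identity F1 score."""
    return idtp / (idtp + 0.5 * idfn + 0.5 * idfp)

def mota(fn, fp, idsw, gt_det):
    """Eq. (13): multi-object tracking accuracy."""
    return 1.0 - (fn + fp + idsw) / gt_det

# Illustrative counts only.
det_a = deta(tp=80, fn=10, fp=10)          # 0.8
h = hota(det_a, ass_a=0.5)                 # sqrt(0.8 * 0.5), about 0.632
f = idf1(idtp=70, idfn=20, idfp=20)        # 70 / 90, about 0.778
m = mota(fn=10, fp=10, idsw=5, gt_det=90)  # 1 - 25 / 90, about 0.722
```

The example shows the balancing effect of the geometric mean: with DetA at 0.8 and AssA at 0.5, HOTA lands between the two and is pulled down by the weaker association score.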

Table 2 Results on DanceTrack testing set

From the equations, it is evident that MOTA primarily measures the accuracy of detection, with matching conducted at the detection level. A bijective (one-to-one) mapping is established between predicted detections and ground truth detections in each frame. Conversely, IDF1 focuses on the performance of association, calculating a bijective mapping between sets of predicted tracklets and ground truth tracklets at the tracklet level. The HOTA metric explicitly balances the accuracy of detection, association, and localization. Notably, HOTA is the geometric mean of a detection score and an association score. This formulation ensures that both detection and association are evenly balanced, unlike many other tracking metrics, and that the final score is somewhere between the two. Considering that this article emphasizes the performance of object association over detection, we employ HOTA, IDF1, and AssA as the primary metrics. However, since most relevant studies utilize MOTA and DetA metrics, we also retain these for comparison purposes.

Implementation details All experiments are implemented using the Python programming language and PyTorch. The training phase ran on a device with a 10th Gen Intel(R) Core(TM) i9-10980XE@3.00GHz CPU and 4 NVIDIA GeForce RTX 3090 GPUs, while the inference phase is performed on a laptop with a 12th Gen Intel(R) Core(TM) i5-12500H@2.50 GHz CPU and an NVIDIA GeForce RTX 3060 GPU. For detection, the default high detection score threshold is \(\tau _{high}=0.5\) and the low threshold is \(\tau _{low}=0.2\). Unless otherwise stated, the same tracker parameters are used throughout the experiments, including the W-BBox scale rate \(\delta =0.4\), the DB length \(l_{DB}=10\), the TRM hyper-parameter \(\tau _{TRM}=30\), the tracklet initialization threshold \(\tau _{track}=0.8\), the directional cost matrix weights \((\lambda _{high},\lambda _{low})=(0.7,0.9)\), and the motion cost matrix weight \(\gamma =0.3\). Lost tracklets are kept for \(\tau _{remove}\) frames in case they appear again, where \(\tau _{remove}=90\).
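For reference, the default hyper-parameters listed above can be gathered in a single configuration object, together with the score-based split of detections. This is a sketch: the class, field and function names are hypothetical, and the boundary handling of the split (strictly above \(\tau _{high}\) for high-score boxes, between \(\tau _{low}\) and \(\tau _{high}\) for low-score boxes) is an assumption in the style of ByteTrack.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WDTtrackConfig:
    tau_high: float = 0.5     # high detection score threshold
    tau_low: float = 0.2      # low detection score threshold
    delta: float = 0.4        # W-BBox scale rate
    l_db: int = 10            # Direction Bank length
    tau_trm: int = 30         # TRM tracking-time requirement (frames)
    tau_track: float = 0.8    # tracklet initialization score threshold
    lambda_high: float = 0.7  # direction weight, first association
    lambda_low: float = 0.9   # direction weight, second association
    gamma: float = 0.3        # motion cost weight in Eq. (1)
    tau_remove: int = 90      # frames a lost tracklet is kept

def split_detections(scores, cfg):
    """Return the indices of high-score and low-score detections."""
    high = [i for i, s in enumerate(scores) if s > cfg.tau_high]
    low = [i for i, s in enumerate(scores) if cfg.tau_low < s <= cfg.tau_high]
    return high, low

cfg = WDTtrackConfig()
high_idx, low_idx = split_detections([0.9, 0.3, 0.1, 0.5], cfg)
```

Detections scoring at or below \(\tau _{low}\) fall into neither group and are discarded under this reading.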

Fig. 8 The influence of the training set size on MOTRv2 and WDTtrack

Fig. 9 Examples in which SORT, DeepSORT, ByteTrack, OC-SORT and Deep OC-SORT suffer from misses, ID switches, new IDs and false positives caused by occlusion or non-linear motion, whereas WDTtrack does not. The errors are highlighted with red rectangular boxes. Specifically, miss and new ID issues occur with SORT at (a) #122 \(\rightarrow \) #124 of dancetrack0021; miss and false positive issues occur with DeepSORT at (c) #469 \(\rightarrow \) #471 of dancetrack0003; ID switch issues occur with ByteTrack at (e) #327 \(\rightarrow \) #345 of dancetrack0059; miss and new ID issues occur with OC-SORT at (g) #70 \(\rightarrow \) #180 of dancetrack0038; ID switch issues occur with Deep OC-SORT at (i) #670 \(\rightarrow \) #690 of dancetrack0031

4.2 Benchmark results

Table 2 compares WDTtrack with mainstream MOT methods on the DanceTrack testing set. Each score is either sourced from previous studies (e.g. Deep OC-SORT [7]) or obtained by submitting results to the host test server with hidden annotations and leaderboards. We use the default configuration given in Section 4.1. The proposed WDTtrack ranks first among all non-Transformer-based trackers on the DanceTrack dataset. In particular, WDTtrack outperforms other non-Transformer-based methods by a large margin in terms of HOTA, IDF1 and AssA, achieving scores of 66.8% in HOTA, 72.8% in IDF1, and 55.9% in AssA. Compared to the second-best non-Transformer-based method, Deep OC-SORT [7], which improves OC-SORT [11] by incorporating dynamic appearance and adaptive weighting, WDTtrack increases HOTA by 5.5%, IDF1 by 11.3%, and AssA by 10.1%, establishing a new state of the art (Fig. 8). These comparisons demonstrate the effectiveness of WDTtrack on the DanceTrack dataset, which is known for its complex motions and indistinguishable appearances. Further, we visualize the tracking results of SORT, DeepSORT, ByteTrack, OC-SORT, and Deep OC-SORT on DanceTrack in Fig. 9, illustrating the trajectory comparisons between these methods and WDTtrack.

Although the Transformer-based methods (e.g. CO-MOT [40], Me-MOTR [33], MOTRv2 [34]) generate comparable results on the DanceTrack testing set, they have two critical drawbacks that significantly limit their applicability. Firstly, these Transformer-based methods require substantial computational resources, which reduces their training and inference efficiency. Specifically, we evaluate the Transformer-based methods MOTR, MOTRv2, CO-MOT, and Me-MOTR on a device with an NVIDIA GeForce RTX 3090 GPU. These approaches achieve frame rates of 11.63, 13.98, 14.30 and 12.03 FPS (frames per second) on the DanceTrack testing set, respectively. In comparison, WDTtrack runs at a higher frame rate of 19.79 FPS under the same experimental conditions. It is noteworthy that MOTRv2 introduces a pretrained YoloX detector, which imposes an additional computational burden and poses challenges for deployment. Moreover, the official inference code does not report runtime statistics for YoloX, so the 13.98 FPS figure excludes the detector. In other words, the actual speed of MOTRv2 falls significantly below 13.98 FPS. Secondly, the Transformer-based algorithms require a substantial amount of high-quality labeled data for training due to their large parameter counts, and they can underfit if this requirement is not met. This poses significant challenges in domains where only small-scale datasets are available. In contrast, WDTtrack requires only a small amount of data for effective training. Notably, tracking-by-query methods are often trained with additional datasets, such as CrowdHuman. For a fair comparison, we randomly extract 100%, 90%, 80%, 70%, 60%, and 50% of the video sequences from the DanceTrack training set to build six training subsets. These subsets are used to analyze the influence of the amount of training data on model performance. We train the best-performing MOTRv2 and WDTtrack on these subsets and validate them on the DanceTrack validation set. As illustrated in Fig. 8, WDTtrack performs significantly better than MOTRv2 when the same amount of training data is used. The figure also shows that the amount of training data has a substantial impact on MOTRv2: as the amount of data drawn from the DanceTrack training set is reduced, the performance of MOTRv2 declines accordingly. For instance, when the data volume is reduced to 60%, MOTRv2’s HOTA decreases by 6.34%, IDF1 by 7.53%, AssA by 7.53%, DetA by 3.62%, and MOTA by 5.63%. In contrast, the same reduction in data volume results in only a 1.12% drop in WDTtrack’s HOTA, 0.98% in IDF1, 0.56% in AssA, 1.78% in DetA, and 1.50% in MOTA. Therefore, in domains with limited small-scale datasets, WDTtrack consistently achieves better performance. Additionally, as shown in Table 2, WDTtrack substantially outperforms several other classical Transformer-based MOT algorithms, including MotionTrack [20], MOTR [31], and TransTrack [30]. In summary, when application scenarios demand high real-time performance and high-quality annotated training data is limited, traditional tracking-by-detection methods (e.g. WDTtrack) outperform tracking-by-query methods (e.g. MOTRv2).
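The construction of the six training subsets can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `build_training_subsets`, the fixed random seed, the placeholder sequence names, and the choice to sample each fraction independently (rather than nesting the subsets) are all assumptions.

```python
import random


def build_training_subsets(sequences, fractions=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5), seed=0):
    """Randomly draw the given fraction of video sequences for each subset."""
    rng = random.Random(seed)  # fixed seed (assumption) keeps the split reproducible
    subsets = {}
    for frac in fractions:
        k = round(len(sequences) * frac)  # number of sequences kept for this fraction
        subsets[frac] = sorted(rng.sample(sequences, k))  # sampling without replacement
    return subsets


# Placeholder sequence names (hypothetical); a real run would list the training videos.
sequences = [f"seq{i:04d}" for i in range(40)]
subsets = build_training_subsets(sequences)
for frac, names in subsets.items():
    print(f"{int(frac * 100)}% -> {len(names)} sequences")
```

Sampling whole sequences instead of individual frames keeps every retained video intact, so the temporal context needed by a tracker is preserved in each subset.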

Table 3 Results on SportsMOT testing set
Fig. 10 Example tracking results of WDTtrack on the SportsMOT testing set. Each row displays sampled frames of one video sequence in chronological order. Bounding boxes and identities are marked in the images, with different colors representing different identities. Best viewed in color

Table 4 Ablation study on DanceTrack testing set

To assess the performance of WDTtrack in challenging scenarios involving fast and variable-speed motion, we present results on the SportsMOT dataset in Table 3. WDTtrack achieves the highest IDF1 and AssA scores among all algorithms, demonstrating its ability to maintain object identities accurately during tracking. Furthermore, WDTtrack ranks second in HOTA, only 0.3% behind the top-ranked algorithm. With comparable detection accuracy (DetA), WDTtrack outperforms CenterTrack [25] and TransTrack [30] by a significant margin in HOTA, IDF1, and AssA. Although OC-SORT [11] achieves a higher DetA, it falls short of WDTtrack in HOTA, IDF1, and AssA. In essence, these findings show that WDTtrack significantly improves association quality. In addition, Fig. 10 visualizes several tracking results of WDTtrack on the SportsMOT testing set.

Table 5 Ablation study of applying W-BBox to different score detections

4.3 Ablation study

Components analysis To demonstrate the effectiveness of the proposed modules and quantify their contribution to our bag-of-tricks for MOT, we perform an ablation study on the DanceTrack testing set. The same tracking parameters are used for all experiments, as described in the implementation details. Starting from a ByteTrack baseline, we add the following modules one by one: the BoT-SORT improved Kalman filter (BTK), appearance embeddings with a fixed EMA extracted by the CTL ReID model (ReID), the Wider Bounding Box (W-BBox), the Direction Bank (DB), and the Tracklet Recovery Mechanism (TRM). Table 4 summarizes the path from the original ByteTrack to our tracker. The first row represents our re-implemented ByteTrack without any additional modules. As shown in Table 4, these modules significantly improve tracking performance, although they also reduce speed. Adding BTK yields a 7.69% increase in HOTA, a 2.68% increase in IDF1, and a 5.98% increase in AssA, with an almost negligible reduction in FPS. Integrating the CTL ReID module improves HOTA by 6.9%, IDF1 by 9.68%, and AssA by 10.21%, but reduces speed considerably, from 35.02 FPS to 22.27 FPS. Adding the W-BBox module yields substantial improvements, with HOTA rising to 62.79%, IDF1 to 68.16%, and AssA to 49.48%, while the speed remains nearly unchanged. Moreover, incorporating the DB module has a minimal impact on speed (a decrease from 22.21 FPS to 21.96 FPS), while HOTA increases by 3.9%, IDF1 by 4.43%, and AssA by 6.28%. Finally, including TRM gives WDTtrack its best performance at only a slight cost in speed (a reduction of 2.18 FPS), with HOTA reaching 66.79%, IDF1 72.82%, and AssA 55.93%.

Effect of W-BBox Apart from verifying the contribution of W-BBox to the overall framework, we also compare the performance of W-BBox across different score detections (e.g. \(D_{high}\), \(D_{low}\) and \(D_{high}+D_{low}\)) and various wider scale rates \(\delta \). As shown in Tables 5 and 6, the best performance is achieved when W-BBox is applied to the high-score detections \(D_{high}\) with a wider scale rate of \(\delta = 0.4\). Notably, W-BBox increases HOTA from 63.69% to 66.79%, IDF1 from 69.30% to 72.82%, and AssA from 50.83% to 55.93%, underscoring the importance of expanding the search space for high-score detections (rows 1 and 2 in Table 5). However, Table 5 also clearly shows that applying W-BBox to the low-score detections \(D_{low}\) degrades tracking performance. In other words, expanding the search space of \(D_{low}\) in the horizontal direction is harmful (Fig. 11). As shown in Fig. 12(a) and (b), without W-BBox an ID switch occurs between the objects with ID 6 and ID 7 at frame 452, which indicates that ID switches are likely during rapid horizontal movements in occlusion scenarios. Applying the W-BBox module prevents the ID switch in such cases.
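The effect of widening the search space can be illustrated with a small sketch. The exact W-BBox definition is given in the method section; here we assume, for illustration only, that the box width grows by a factor of \(1+\delta \) about the box center while the height is unchanged. The helper names `widen_box` and `w_iou` are hypothetical.

```python
def widen_box(box, delta=0.4):
    """Expand a box (x1, y1, x2, y2) horizontally; the height is unchanged.

    Assumption: the width grows by a factor of (1 + delta) about the center.
    """
    x1, y1, x2, y2 = box
    pad = 0.5 * delta * (x2 - x1)  # half of the extra width goes to each side
    return (x1 - pad, y1, x2 + pad, y2)


def iou(a, b):
    """Standard intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def w_iou(a, b, delta=0.4):
    """IoU computed on the horizontally widened versions of both boxes."""
    return iou(widen_box(a, delta), widen_box(b, delta))


# Toy example: the detection has jumped sideways by more than one box width.
track_box = (0.0, 0.0, 10.0, 20.0)
det_box = (11.0, 0.0, 21.0, 20.0)
print(iou(track_box, det_box))    # plain boxes do not overlap
print(w_iou(track_box, det_box))  # widened boxes overlap, so association stays possible
```

In this toy case the plain IoU is zero, so an IoU-gated association would reject the match, whereas the widened boxes still overlap.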

Table 6 Ablation study of wider scale rate \(\delta \), number \(l_{DB}\) of direction vectors, TRM recovery threshold \(\tau _{TRM}\) and tracklet’s survival time \(\tau _{remove}\)
Fig. 11 Ablation study of the directional cost matrix weights \(\lambda _{high}\), \(\lambda _{low}\) and the motion cost matrix weight \(\gamma \)

Fig. 12 Visualization of tracking results with and without W-BBox, DB and TRM. Errors are highlighted with red rectangular boxes

Effect of DB Similarly, we assess the effectiveness of the Direction Bank (DB) with different score detections. As shown in Table 7, applying DB to both \(D_{high}\) and \(D_{low}\) proves highly advantageous, demonstrating the ability of DB to leverage short-term motion cues to mitigate the ID switches caused by rapid, irregular motion. Additionally, we compare two ways of determining the velocity direction of a tracklet: an Exponential Moving Average (EMA) of the frame-by-frame direction vectors with a weighting factor of \(\alpha =0.9\), and the average of the \(l_{DB}\) direction vectors stored in DB. The results in Table 6 show that averaging 10 direction vectors in DB achieves the best HOTA. This indicates that EMA, although effective for appearance embeddings, is not suitable for describing the motion cues of objects with rapid, irregular movement. Simply averaging short-term motion cues turns out to be the most effective. In rows (c) and (d) of Fig. 12, an ID switch is observed when the object with ID 2 passes in front of the object with ID 1. However, no ID switch occurs when the DB module is used, owing to the significant difference in motion between the objects with ID 2 and ID 1.
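The difference between the two direction estimates can be sketched with a toy trajectory. This is an illustrative sketch under stated assumptions, not the authors' implementation: direction vectors are taken as unit vectors of the center displacement between consecutive frames, and the names `DirectionBank`, `unit`, and `ema_direction` are hypothetical.

```python
import math
from collections import deque


def unit(vx, vy):
    """Normalize a 2D vector; a zero vector is returned unchanged."""
    n = math.hypot(vx, vy)
    return (vx / n, vy / n) if n > 0 else (0.0, 0.0)


class DirectionBank:
    """Keeps the last l_db direction vectors and returns their average direction."""

    def __init__(self, l_db=10):
        self.bank = deque(maxlen=l_db)  # oldest vectors are dropped automatically

    def update(self, prev_center, center):
        self.bank.append(unit(center[0] - prev_center[0], center[1] - prev_center[1]))

    def direction(self):
        if not self.bank:
            return (0.0, 0.0)
        mx = sum(d[0] for d in self.bank) / len(self.bank)
        my = sum(d[1] for d in self.bank) / len(self.bank)
        return unit(mx, my)


def ema_direction(directions, alpha=0.9):
    """Exponential moving average of direction vectors (alpha weights the history)."""
    ex, ey = directions[0]
    for dx, dy in directions[1:]:
        ex = alpha * ex + (1 - alpha) * dx
        ey = alpha * ey + (1 - alpha) * dy
    return unit(ex, ey)


# Toy trajectory: 30 steps to the right, then an abrupt turn and 10 steps upward.
centers = [(float(i), 0.0) for i in range(31)] + [(30.0, float(j)) for j in range(1, 11)]
bank = DirectionBank(l_db=10)
directions = []
for prev, cur in zip(centers[:-1], centers[1:]):
    bank.update(prev, cur)
    directions.append(unit(cur[0] - prev[0], cur[1] - prev[1]))

bank_dir = bank.direction()
ema_dir = ema_direction(directions, alpha=0.9)
print("bank:", bank_dir)
print("ema: ", ema_dir)
```

After the turn, the bank average points straight up because it only contains the last 10 vectors, while the EMA with \(\alpha =0.9\) still carries a sizable component of the old direction.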

Effect of TRM As shown in Table 4, introducing the Tracklet Recovery Mechanism (TRM) increases HOTA from 66.69% to 66.79%, IDF1 from 72.59% to 72.82%, and AssA from 55.76% to 55.93%. Furthermore, we analyze the moment at which lost tracklets are reconnected, and the results are reported in Table 6. The best performance, a HOTA of 66.79%, is achieved with \(\tau _{TRM} = 30\). The effectiveness of the TRM module is also illustrated in Fig. 12(e) and (f). Without TRM, when the object with ID 3 reappears in frame 768 after a prolonged occlusion (frames 623 to 764), it would typically be assigned a new ID and a newly initialized trajectory. With TRM, the trajectory of the object with ID 3 is restored instead of a new one being started.

Tracklet buffer The survival time (\(\tau _{remove}\)) of a tracklet is also crucial: it influences the robustness and reliability of the trajectory and directly affects TRM. Therefore, we compare different tracklet survival times, and the results are shown in Table 6. The results show that tracklets are most reliable when \(\tau _{remove}=90\).
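The role of \(\tau _{remove}\) can be sketched as a simple buffer rule. This is a minimal sketch under assumptions: a lost tracklet is moved to the removed pool once it has been unmatched for more than \(\tau _{remove}\) frames, the dictionary layout is hypothetical, and whether the comparison is strict is not specified in the text.

```python
def prune_lost_tracklets(lost, frame_id, tau_remove=90):
    """Split lost tracklets into those still buffered and those removed.

    Assumption: each tracklet is a dict with a "last_frame" entry holding the
    index of the last frame in which it was matched to a detection.
    """
    kept, removed = [], []
    for track in lost:
        frames_lost = frame_id - track["last_frame"]
        if frames_lost > tau_remove:  # strict comparison is an assumption
            removed.append(track)
        else:
            kept.append(track)
    return kept, removed


# Toy example at frame 200 with hypothetical tracklets.
lost = [{"id": 1, "last_frame": 100}, {"id": 2, "last_frame": 180}]
kept, removed = prune_lost_tracklets(lost, frame_id=200)
print([t["id"] for t in kept], [t["id"] for t in removed])
```

A larger \(\tau _{remove}\) keeps occluded objects available for re-association longer, at the risk of matching stale tracklets to unrelated detections, which is why this value is tuned in Table 6.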

Table 7 Ablation study of applying DB to different score detections
Table 8 Ablation study of IoU and Kalman Filter

Weights of cost matrix As depicted in Fig. 11, we provide a detailed analysis of the weights (\(\lambda _{high}\), \(\lambda _{low}\) and \(\gamma \)) of each component in the final cost matrix (see Eqs. (1) and (4)). Setting \(\lambda _{high}=0.7\), \(\lambda _{low}=0.9\) and \(\gamma =0.30\) yields the best tracking performance for WDTtrack.

Effect of bag-of-tricks Apart from analyzing the contributions of the individual components (e.g. W-BBox, DB, TRM) of the proposed tracker, we also evaluate different types of IoU calculation (W-IoU, W-GIoU, W-DIoU and W-CIoU) and various Kalman filters (the default Kalman filter, StrongSORT’s Kalman filter, and BoT-SORT’s Kalman filter) to identify the most suitable combination for WDTtrack. As shown in Table 8, the default IoU and BoT-SORT’s Kalman filter emerge as the best choices. Note that W-IoU, W-GIoU, W-DIoU and W-CIoU denote the IoU, GIoU [45], DIoU [46] and CIoU [46] calculations performed on W-BBoxes, respectively.
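For reference, the standard GIoU and DIoU definitions can be sketched as below. The code follows the usual formulas, GIoU = IoU - (|C| - |A ∪ B|)/|C| and DIoU = IoU - d²/c², where C is the smallest enclosing box, d is the distance between box centers, and c is the diagonal of C. The function names are illustrative, boxes are assumed to have positive area, and the W- variants in Table 8 would apply these functions to W-BBoxes rather than to the raw boxes used here.

```python
def _inter_union(a, b):
    """Intersection and union areas of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter, union


def _enclosing_size(a, b):
    """Width and height of the smallest box enclosing both a and b."""
    return max(a[2], b[2]) - min(a[0], b[0]), max(a[3], b[3]) - min(a[1], b[1])


def iou(a, b):
    inter, union = _inter_union(a, b)
    return inter / union


def giou(a, b):
    """Generalized IoU: penalizes the empty part of the enclosing box."""
    inter, union = _inter_union(a, b)
    cw, ch = _enclosing_size(a, b)
    c_area = cw * ch
    return inter / union - (c_area - union) / c_area


def diou(a, b):
    """Distance IoU: penalizes the normalized squared distance between centers."""
    inter, union = _inter_union(a, b)
    cw, ch = _enclosing_size(a, b)
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    return inter / union - (dx * dx + dy * dy) / (cw * cw + ch * ch)


box_a = (0.0, 0.0, 2.0, 2.0)
box_b = (1.0, 1.0, 3.0, 3.0)
print(iou(box_a, box_b), giou(box_a, box_b), diou(box_a, box_b))
```

Unlike plain IoU, both variants remain informative for non-overlapping boxes, where they take negative values that grow in magnitude as the boxes move apart.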

5 Conclusion

In this paper, we introduce WDTtrack, an enhanced multi-object tracker that employs a bag-of-tricks approach to effectively track multiple objects with indistinguishable appearances and irregular motions. Extensive experiments on the DanceTrack and SportsMOT datasets demonstrate that WDTtrack outperforms existing tracking-by-detection methods by a large margin on DanceTrack and is highly competitive on SportsMOT. More specifically, WDTtrack achieves a HOTA of 66.79%, an IDF1 of 72.82%, an AssA of 55.93%, and a MOTA of 91.50% on the DanceTrack testing set, surpassing the nearest non-Transformer-based competitor by margins of 5.5%, 11.3%, and 10.1% in HOTA, IDF1, and AssA, respectively. In addition, we conduct detailed ablation studies on the DanceTrack dataset to validate the effectiveness of WDTtrack and each of its components (such as W-BBox, DB and TRM) for the multi-object tracking (MOT) task in challenging scenarios characterized by irregular motions and similar appearances. The proposed components can be easily integrated into other tracking-by-detection trackers. We hope this work will advance the field of multi-object tracking, particularly for targets exhibiting dramatic non-linear motion and similar appearances.

The primary limitation of our method lies in its dependence on the hyperparameters of each module. If these hyperparameters are not tuned for the target domain, the tracker may generalize poorly to that domain’s data, leading to suboptimal tracking results. In future work, we will investigate additional techniques to further improve the accuracy and generalization of the model across diverse datasets. For instance, we plan to develop adaptive strategies that dynamically determine the weights of motion and appearance features based on their demonstrated effectiveness, assessing the validity and relevance of each type of tracking cue when setting its weight. Moreover, we will design and implement more powerful yet lightweight ReID models to improve both appearance discrimination and computational efficiency.