Abstract
An Advanced Driving Assistance System (ADAS) can predict a pedestrian’s trajectory in order to avoid traffic accidents and guarantee driving safety. A few current pedestrian trajectory prediction methods use only the pedestrian’s historical motion to predict the future trajectory, but the pedestrian’s trajectory is also affected by the vehicle that uses the ADAS for prediction (the target vehicle). Other studies predict the pedestrian’s and vehicle’s trajectories separately and use the latter to adjust the former, but their interaction is a continuous process and should be considered during prediction rather than after it. Therefore, we propose PVII, a pedestrian-vehicle interactive and iterative prediction framework for the pedestrian’s trajectory. It makes the prediction for one iteration based on the results of the previous iteration, which essentially models the vehicle-pedestrian interaction. In this iterative framework, to avoid the accumulation of prediction errors as the number of iterations increases, we design a bilayer Bayesian en/decoder. In each iteration, it uses not only the inaccurate results from the previous iteration but also the accurate historical data, and it calculates Bayesian uncertainty values to evaluate the results. In addition, the pedestrian’s trajectory is affected by both the target vehicle and the other vehicles around it (surrounding vehicles), so we include in the framework a pretrained speed estimation module for surrounding vehicles (SE module). It estimates their speed from the pedestrian’s motion, and we collect data from the pedestrian’s view to train it. In experiments, PVII achieves higher prediction accuracy than the current methods it is compared with.
1 Introduction
The ADAS is installed in the target vehicle. It senses the driving environment and predicts possible risks to assist the driver, improving the safety of not only the target vehicle but also pedestrians and surrounding vehicles. It has become one of the fastest-developing applications in vehicles [1, 2]. A fundamental task of the ADAS is to predict a pedestrian’s trajectory and, based on the prediction results, to estimate the probability of collision. If the probability exceeds a threshold, the system alerts the driver to avoid a possible accident [3, 4]. Predicting the pedestrian’s trajectory is challenging because a pedestrian’s motion is highly flexible and affected by many factors [5,6,7]. That is, compared to a vehicle, a pedestrian’s motion can change suddenly and is thus hard to predict. The problem becomes more severe in complicated city traffic, where the pedestrian interacts with not only the target vehicle but also surrounding vehicles.
A few current studies of pedestrian trajectory prediction use only a pedestrian’s historical motion to predict the future trajectory, and do not consider the interaction between the target vehicle and the pedestrian [8,9,10,11,12]. Other studies do consider the vehicle-pedestrian interaction, but they simply predict the pedestrian’s and vehicle’s trajectories separately and use the latter to adjust the former [13,14,15,16,17,18]. However, in real-world scenarios, the vehicle and pedestrian affect each other continuously, and such interaction should be considered during prediction rather than after it. Figure 1 illustrates the pedestrian-vehicle interaction. A pedestrian is crossing the road while a target vehicle is approaching. At time \(t_0\), the pedestrian and target vehicle are far apart, so both move at normal speed. At time \(t_1\), the pedestrian notices that the vehicle is not decelerating, and so begins to slow down. At time \(t_2\), the target vehicle notices the pedestrian moving slowly, and realizes it is time to decelerate. At time \(t_3\), since the target vehicle is moving slowly, the pedestrian begins to accelerate to pass the vehicle quickly. In addition, the current studies focus only on the target vehicle and do not consider surrounding vehicles. However, a pedestrian’s motion is affected not only by the target vehicle, but also by all the other vehicles on the road. Figure 1 also shows the impact of a surrounding vehicle on the pedestrian. At time \(t_3\), the pedestrian begins to accelerate, but the speed is affected because the surrounding vehicle is approaching at normal speed.
Therefore, in order to consider the continuous vehicle-pedestrian interaction during prediction, we propose PVII, a pedestrian-vehicle interactive and iterative prediction framework for the pedestrian’s trajectory. PVII makes predictions iteratively [19,20,21]: in one iteration, it predicts the motions of both the target vehicle and the pedestrian for only a few frames, and in the following iteration, it makes a further prediction based on the combined motions. This essentially models the vehicle-pedestrian interaction. In addition, in order to avoid the accumulation of prediction errors as the number of iterations increases, we design a bilayer Bayesian en/decoder. In each iteration, it makes the prediction using both the historical data and the predicted motions from the previous iteration, rather than using only the predicted motions, which may contain errors. It also calculates Bayesian uncertainty values to evaluate the prediction results, and these values can be used in the following iteration.
Besides the iterative framework, in order to take the impact of surrounding vehicles into consideration, we design a pretrained SE module. This module establishes a relationship between a pedestrian’s motion and the average speed of surrounding vehicles, using the former to estimate the latter. Since it is not easy to capture the motions of the surrounding vehicles from the target vehicle’s view, the data used to train this module needs to be collected from the pedestrian’s view, so we record both our own motion and the surrounding vehicles while crossing roads.
In summary, this paper has the following contributions.

We propose a pedestrian-vehicle interactive and iterative prediction framework for the pedestrian’s trajectory. It models the continuous vehicle-pedestrian interaction iteratively during prediction.

In this framework, we design a bilayer Bayesian en/decoder. In each iteration, it uses not only the inaccurate results from the previous iteration but also the accurate historical data for prediction. It also evaluates the Bayesian uncertainty of the results to avoid the accumulation of errors.

We also include a pretrained SE module. It estimates the speed of surrounding vehicles from the pedestrian’s motion, so that the impact of surrounding vehicles on the pedestrian’s trajectory can be taken into consideration. To train the module, we collect data from the pedestrian’s view rather than the target vehicle’s.
2 Related work
2.1 Prediction methods for pedestrian’s trajectory
To predict a pedestrian’s trajectory, position is the most fundamental feature, so [13, 17] predict the pedestrian’s trajectory based on historical positions, and [8, 9] combine velocity information with the positions. Besides positions, some studies consider the pedestrian’s intention, gesture or action in prediction: [14, 18] extract the pedestrian’s intention to cross the road, [11, 16] use the pedestrian’s head angle and body turn, and [15] focuses on the pedestrian’s actions such as walking and looking at a phone. Additionally, [10, 12] extract the pedestrian’s destination information for accurate prediction.
2.2 Prediction methods for target vehicle’s motion
Prediction of the target vehicle’s motion is important in predicting pedestrian’s trajectory [22]. [23] uses sensors to obtain vehicle’s velocity and predicts future motion, [13, 14] combine vehicle’s steering angle with the velocity, and [24] includes images from vehicle’s view. More recent studies extract traffic condition information from the images: [15] analyzes trajectories of surrounding vehicles, and [16] uses distance between vehicle and pedestrian to adjust predicted vehicle motion. In addition, [17, 18] do depth estimation for the images.
2.3 Pedestrian-vehicle interactive prediction methods
To predict a pedestrian’s trajectory, studies considering the pedestrian-vehicle interaction usually predict the pedestrian’s and vehicle’s motions separately, and then use the latter to adjust the former [13,14,15,16,17,18]. Among these studies, [13,14,15, 17, 18] use data from the vehicle’s view, and [16] additionally uses data from the pedestrian’s view for accurate prediction.
PVII differs from [13,14,15,16,17,18] in modeling the pedestrian-vehicle interaction during prediction, rather than considering the two separately. Furthermore, PVII also uses data from the pedestrian’s view, but differs from [16] in forming a relationship between the pedestrian’s motion and the speed of surrounding vehicles, so that the impact of the surrounding vehicles can be included for more accurate prediction.
3 Preliminaries
Historical data: a stream of T continuous frames, recorded in the target vehicle, of the pedestrian’s motion \(Q = \{ q_{1}, q_{2},\dots ,q_{T} \}\), together with the target vehicle’s motion in the matching frames \(U = \{ u_{1}, u_{2},\dots ,u_{T} \}\). They are combined to form the initial input of PVII, \(X = \{ x_{1}, x_{2},\dots ,x_{T} \}\).
Pedestrian’s trajectory: the predicted positions of the pedestrian in T continuous frames, \(Y = \{ y_{1}, y_{2},\dots ,y_{T} \}\). This is the output of PVII.
Pedestrian/vehicle’s motion: not only the positions of the pedestrian/vehicle in continuous frames, but also a few other features such as velocity. These are intermediate data of PVII.
Problem definition: use X to predict Y and minimize \(\Vert Y - Y_0 \Vert \), where \(Y_0\) is the ground truth of the pedestrian’s trajectory in the T frames.
4 Method
We explain the iterative prediction framework, bilayer Bayesian en/decoder and SE module in detail.
4.1 Iterative prediction framework
As discussed in Section 1, the pedestrian-vehicle interaction is the following process: the pedestrian adjusts his or her motion every few frames according to the vehicle’s motion, and the vehicle also adjusts its motion according to the pedestrian’s motion. To model this process in prediction, we iteratively predict the pedestrian’s and vehicle’s motions for a few frames, and then use the predicted motions to make a further prediction for the next few frames. Each iteration thus corresponds to one pedestrian-vehicle interaction. Therefore, we design an iterative prediction framework.
The framework has two parallel modules. One uses the pedestrian’s historical motion combined with the vehicle’s motion to predict the pedestrian’s future motion, and the other uses the vehicle’s historical motion combined with the pedestrian’s motion to predict the vehicle’s future motion. The outputs of the two modules are fed back into both modules iteratively, so that each further prediction references the previous prediction results. Combining the vehicle’s and pedestrian’s motions in each iteration incorporates short-term interaction information between the pedestrian and vehicle, while the iterative processing that references previous results incorporates long-term interaction information. As a result, the pedestrian-vehicle interaction is modeled with both kinds of information, which is what allows the framework to produce accurate predictions.
The initial input of the framework is historical data including a pedestrian’s motion in frames, as well as the target vehicle’s motion. The pedestrian’s frames are processed to extract features such as position, velocity, head turn and distance from the vehicle, and the target vehicle’s motion includes features such as position, velocity and direction. According to Section 3, we represent the pedestrian’s motion in T frames as \(Q = \{ q_{1}, q_{2},\dots ,q_{T} \}\), and the vehicle’s motion as \(U = \{ u_{1}, u_{2},\dots ,u_{T} \}\). Combined with Bayesian uncertainty values (the initial values are all 0; see Section 4.2 for details), the pedestrian’s motion in T frames is \(P=\{ p_{1}, p_{2},\dots ,p_{T}\}=\{( q_{1},\sigma _{1}),(q_{2},\sigma _{2}),\dots ,(q_{T},\sigma _{T} ) \},\) and the vehicle’s motion is \(V=\{ v_{1}, v_{2},\dots ,v_{T}\}=\{ ( u_{1},{\sigma ^{'}} _{1} ) ,( u_{2}, {\sigma ^{'}}_{2} ),\dots ,( u_{T},{\sigma ^{'}}_{T})\}.\) Therefore, the initial input of the framework is represented as \(X=\{ x_{1}, x_{2},\dots ,x_{T} \}=\{ ( p_{1}, v_{1} ), ( p_{2},v_{2} ),\dots ,(p_{T},v_{T}) \}.\) The output of the framework is the pedestrian’s trajectory \(Y = \{ y_{1}, y_{2},\dots ,y_{T} \}\). The left and right parts of Fig. 2 show the framework’s input and output, respectively.
The framework is mainly composed of a pedestrian motion prediction module (PP module) and a vehicle motion prediction module (VP module), which predict the pedestrian’s and vehicle’s motions, respectively. In the first iteration, the input of the PP module is X, and the output is the pedestrian’s motion in the following T frames \(P^{1} = \{ p_{1}^{1}, p_{2}^{1},\dots ,p_{T}^{1} \} = \{ ( q_{1}^{1},\sigma _{1}^{1} ) , ( q_{2}^{1},\sigma _{2}^{1} ),\dots , ( q_{T}^{1},\sigma _{T}^{1}) \}.\) Similarly, the input of the VP module is also X, and the output is the vehicle’s motion in the following T frames \(V^{1} = \{ v_{1}^{1}, v_{2}^{1},\dots ,v_{T}^{1} \} = \{ ( u_{1}^{1}, {\sigma ^{'}} _{1}^{1} ) ,( u_{2}^{1},{\sigma ^{'}}_{2}^{1} ),\dots ,( u_{T}^{1},{\sigma ^{'}}_{T}^{1}) \}.\) Then we combine the outputs and obtain \(X^{1} = \{ x_{1}^{1}, x_{2}^{1},\dots , x_{T}^{1} \} = \{ ( p_{1}^{1}, v_{1}^{1} ) ,( p_{2}^{1},v_{2}^{1} ), \dots , ( p_{T}^{1},v_{T}^{1} ) \}\) for the second iteration. In the ith iteration, the input of the PP module is X together with the results obtained from the \((i-1)\)th iteration, that is \(X^{i-1} = \{ x_{1}^{i-1}, x_{2}^{i-1},\dots ,x_{T}^{i-1} \} = \{ ( p_{1}^{i-1}, v_{1}^{i-1} ) , ( p_{2}^{i-1},v_{2}^{i-1} ),\dots , ( p_{T}^{i-1}, v_{T}^{i-1} ) \},\) and the output is \(P^{i} = \{ p_{1}^{i}, p_{2}^{i},\dots ,p_{T}^{i} \}.\) Similarly, the input of the VP module is X together with \(X^{i-1}\), and the output is \(V^{i} = \{ v_{1}^{i}, v_{2}^{i},\dots ,v_{T}^{i} \}.\) Then we combine the outputs and obtain \(X^{i} = \{ x_{1}^{i}, x_{2}^{i},\dots ,x_{T}^{i} \} = \{ ( p_{1}^{i}, v_{1}^{i} ) , ( p_{2}^{i},v_{2}^{i} ),\dots , ( p_{T}^{i},v_{T}^{i} ) \}\) for the \((i+1)\)th iteration. The middle part of Fig. 2 shows the framework’s iterative processing.
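The iteration described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the authors’ implementation: `pp_module` and `vp_module` are hypothetical stand-ins (constant-velocity extrapolators with a placeholder uncertainty) for the bilayer Bayesian en/decoders, each motion is reduced to a 1-D position, and `pvii_rollout` is a name introduced here.

```python
from typing import Callable, List, Tuple

Motion = Tuple[float, float]   # (feature q or u, uncertainty sigma); 1-D for brevity
Frame = Tuple[Motion, Motion]  # x_j = (p_j, v_j)
Module = Callable[[List[Frame], List[Frame]], List[Motion]]


def extrapolate(track: List[Motion]) -> List[Motion]:
    """Hypothetical stand-in predictor: constant-velocity extrapolation."""
    step = track[-1][0] - track[-2][0]
    last = track[-1][0]
    # The uncertainty is a placeholder that simply grows with the horizon.
    return [(last + step * (k + 1), 0.1 * (k + 1)) for k in range(len(track))]


def pp_module(x_hist: List[Frame], x_prev: List[Frame]) -> List[Motion]:
    """Stand-in PP module. A real module would use both X and the previous
    result; this toy version only extrapolates the pedestrian part."""
    return extrapolate([p for p, _ in x_prev])


def vp_module(x_hist: List[Frame], x_prev: List[Frame]) -> List[Motion]:
    """Stand-in VP module, analogous to pp_module for the vehicle part."""
    return extrapolate([v for _, v in x_prev])


def pvii_rollout(x_hist: List[Frame], n_iter: int,
                 pp: Module = pp_module, vp: Module = vp_module) -> List[float]:
    """Run n_iter iterations; each one predicts T more frames for both agents."""
    x_prev = x_hist  # in the first iteration only X is available
    trajectory: List[float] = []
    for _ in range(n_iter):
        p_i = pp(x_hist, x_prev)       # pedestrian motion for this iteration
        v_i = vp(x_hist, x_prev)       # vehicle motion for this iteration
        x_prev = list(zip(p_i, v_i))   # combined result feeds the next iteration
        trajectory.extend(q for q, _ in p_i)
    return trajectory
```

Both modules receive the original history as well as the previous combined result in every iteration, which is the property the framework relies on to limit error accumulation.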
Both the PP and VP modules are essentially bilayer Bayesian en/decoders, which avoid the accumulation of errors over iterations (see Section 4.2 for details). In addition, the framework includes a pretrained SE module, whose input is X and whose output is fed into both the PP and VP modules to improve prediction accuracy (see Section 4.3 for details).
4.2 Bilayer Bayesian en/decoder
As discussed in Section 4.1, the bilayer Bayesian en/decoder is used in the PP and VP modules. In the ith iteration, the en/decoder’s input is the X initially inputted to the framework and the \(X^{i-1}\) obtained from the \((i-1)\)th iteration, and the output is \(P^{i}\) or \(V^{i}\) for the PP and VP modules, respectively. For simplicity, we explain the bilayer Bayesian en/decoder of the PP module, that is, the en/decoder whose output is \(P^{i}\).
In order to avoid the accumulation of prediction errors over iterations, we design the bilayer Bayesian en/decoder with a bilayer encoder and a Bayesian decoder. The former guarantees that the initial input X is considered in prediction, rather than only the predicted \(X^{i-1}\), which may contain errors; the latter provides a Bayesian uncertainty value for the predicted motion in each frame, which tells the following iteration how accurate its input is. The bilayer encoder and Bayesian decoder are explained in detail below.
4.2.1 Bilayer encoder
The bilayer encoder is composed of two layers of LSTM cells, and the number of cells in each layer is T. For one layer, we input \(X^{i-1}\) directly, and for the other layer, we calculate the Euclidean distance \(\Delta X^{i-1}\) between \(X^{i-1}\) and X and input \(\Delta X^{i-1}\). Finally, we combine the outputs of the two layers with that of the SE module (see Section 4.3) as the encoding result. The left part of Fig. 3 shows the bilayer encoder.
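The two encoder inputs can be illustrated with a few lines of Python. This sketch covers only the input preparation, not the LSTM layers themselves, and it assumes that the distance is taken frame by frame between matching feature vectors; the text does not spell out that detail, and `encoder_inputs` is a name introduced here.

```python
import math
from typing import List, Sequence, Tuple


def encoder_inputs(
    x_hist: Sequence[Sequence[float]],
    x_prev: Sequence[Sequence[float]],
) -> Tuple[List[Sequence[float]], List[float]]:
    """Return (input of layer 1, input of layer 2) for one iteration.

    Layer 1 receives the previous result unchanged. Layer 2 receives the
    per-frame Euclidean distance between the previous result and the
    initial input X (the per-frame pairing is an assumption).
    """
    if len(x_hist) != len(x_prev):
        raise ValueError("both sequences must contain T frames")
    delta = [math.dist(prev, hist) for prev, hist in zip(x_prev, x_hist)]
    return list(x_prev), delta
```

Feeding the distance to X into a separate layer keeps the accurate history visible to the encoder even after many iterations.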
4.2.2 Bayesian decoder
The Bayesian decoder is composed of a fully connected dropout network followed by two fully connected networks A and B (FCA and FCB, respectively), using the Bayesian uncertainty evaluation method proposed in [25,26,27,28]. Unlike conventional dropout, which is applied only in training, the dropout network also randomly drops network parameters in prediction, so that each decoding produces a different output from FCA and from FCB. Given the encoding result from the bilayer encoder, we do M decodings to obtain M results from FCA and M results from FCB, and use them to calculate the final decoding result with Bayesian uncertainty values.
Specifically, for the jth frame in the ith iteration, in order to obtain the pedestrian’s motion \(q_{j}^{i}\) with Bayesian uncertainty value \(\sigma _{j}^{i}\), we do M decodings and thus obtain M values of \(\mu _{j}^{i}\) from FCA and M values of \(\lambda _{j}^{i}\) from FCB, that is, \(\{ ( \mu _{j}^{i} ) _{1} , ( \mu _{j}^{i} ) _{2},\dots , ( \mu _{j}^{i} ) _{M} \}\) from FCA and \(\{ ( \lambda _{j}^{i} ) _{1} , ( \lambda _{j}^{i} ) _{2},\dots , ( \lambda _{j}^{i} ) _{M} \}\) from FCB. Then we use equations (1) and (2) to calculate \(q_{j}^{i}\) and \(\sigma _{j}^{i}\). The right part of Fig. 3 shows the Bayesian decoder.
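Equations (1) and (2) are not reproduced in this text, so the following is only a sketch of one common way to aggregate M dropout decodings, not necessarily the authors’ formulas. It assumes that the prediction is the mean of the FCA outputs, that the FCB outputs are predicted variances, and that the uncertainty is the spread of the FCA outputs plus the mean of the FCB outputs. The function name `aggregate_decodings` is hypothetical.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def aggregate_decodings(
    mu_samples: Sequence[float],
    lam_samples: Sequence[float],
) -> Tuple[float, float]:
    """Combine M stochastic decodings into (q, sigma) for one frame.

    ASSUMPTION: this follows a common Monte Carlo dropout recipe and is
    not taken from the paper's equations (1) and (2).
    """
    if not mu_samples or len(mu_samples) != len(lam_samples):
        raise ValueError("need M >= 1 paired samples from FCA and FCB")
    q = fmean(mu_samples)                    # predicted motion
    spread = pvariance(mu_samples, mu=q)     # disagreement between decodings
    noise = fmean(lam_samples)               # average predicted variance
    return q, spread + noise
```

With `mu_samples = [1.0, 2.0, 3.0]` and `lam_samples = [0.5, 0.5, 0.5]`, the prediction is 2.0 and the uncertainty is 2/3 + 0.5.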
4.3 SE module
As discussed in Section 4.1, the SE module is a pretrained module that predicts the average speed of surrounding vehicles from the pedestrian’s motion, so that the impact of surrounding vehicles on trajectory prediction is included in the framework. The input of the SE module is the initial X, but only the P from X is used, and the output is a value \(\theta \) indicating the speed level. The average speed of surrounding vehicles is divided into six levels: fast, relatively fast, medium, relatively slow, slow and stop, so \(\theta \) is a value from 1 to 6. Between the input and output, we use a traditional LSTM model to make the prediction. Details of the data collection, preprocessing and model training are given below.
4.3.1 Data collection
We need to collect data containing both the pedestrian’s motion and the surrounding vehicles from the pedestrian’s view, where the latter can be processed to calculate the average speed used to annotate the former. To obtain such data, we use a motion recorder (Insta360 ONE X2) with a cellphone, where the former records the pedestrian’s speed and the latter provides more accurate position information. We fix the motion recorder to the side of the body and cross various streets in Beijing. We follow these rules: obeying traffic regulations, guaranteeing safety and including as many moving vehicles as possible. In total, we collect 59 datasets.
4.3.2 Preprocessing
First, we match the frames obtained from the motion recorder with the GPS information from the cellphone. Then for each T continuous frames P, we calculate the speed level of the surrounding vehicles. Specifically, we calculate the total areas \(S_{1}\) and \(S_{T}\) of all surrounding vehicles in the 1st and Tth frames, respectively, and use the rate of area change \(\phi = ( S_{T} - S_{1} ) / S_{1}\) to determine the speed level \(\theta \). For example, when \(T = 15\), the correspondence between \(\phi \) and \(\theta \) is: \(\theta = 1, \phi > 3\); \(\theta = 2, 2< \phi < 3\); \(\theta = 3, 1< \phi < 2\); \(\theta = 4, 0.5< \phi < 1\); \(\theta = 5, 0< \phi < 0.5\); \(\theta = 6, \phi < 0\). In this way, we obtain the training data in the form \(( P,\theta )\). Note that we do not use a complex preprocessing algorithm and simply divide the average speed of surrounding vehicles into six levels, because this makes it easier to annotate a large amount of data and is also sufficient for relatively accurate prediction. Figure 4 shows the workflow of the preprocessing.
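The mapping from area change to speed level for T = 15 can be written as a short function. This is a sketch: the thresholds come from the text, but the text uses strict inequalities and does not say where exact boundary values fall, so assigning them to the slower level is an assumption made here. The function names are hypothetical.

```python
def area_change_rate(s_first: float, s_last: float) -> float:
    """Rate of area change between the 1st and the Tth frame."""
    if s_first <= 0:
        raise ValueError("the area in the first frame must be positive")
    return (s_last - s_first) / s_first


def speed_level(phi: float) -> int:
    """Map the rate of area change to a speed level from 1 (fast) to 6 (stop).

    ASSUMPTION: boundary values (phi equal to 3, 2, 1, 0.5 or 0) are
    assigned to the slower of the two neighbouring levels.
    """
    if phi > 3:
        return 1
    if phi > 2:
        return 2
    if phi > 1:
        return 3
    if phi > 0.5:
        return 4
    if phi > 0:
        return 5
    return 6
```

For instance, a total vehicle area that grows from 100 to 450 pixels gives a rate of 3.5 and therefore level 1 (fast).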
4.3.3 Training
We set the size of the LSTM cells to 64 and T to 15, and divide the data in a 3:1:1 ratio into training, validation and testing sets, respectively. During training, we use the RMSprop optimizer and set the learning rate to 1e-4.
4.4 Training of the framework
All the training data obtained so far are accurate, so their Bayesian uncertainty values are 0, but the framework must also be trained on data whose Bayesian uncertainty values are non-zero. To obtain such data, we first train the framework with the existing training data (uncertainty values 0), so that it can use the pedestrian’s and vehicle’s motions in the first T frames to predict the motions in the second T frames; these prediction results have non-zero Bayesian uncertainty values. To annotate these results, the motions in the third T frames can be obtained directly and used as labels, since the frames are continuous. In this way, we obtain training data with non-zero Bayesian uncertainty values.
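The construction of the two kinds of training pairs can be sketched as follows. This is an illustration under assumptions: the sequence is cut into non-overlapping windows of T frames, and `predict` is a hypothetical stand-in for the framework after its first round of training; neither detail is specified in the text.

```python
from typing import Callable, List, Sequence, Tuple

Window = List[float]


def consecutive_windows(frames: Sequence[float], t: int) -> List[Window]:
    """Cut a continuous sequence into non-overlapping windows of t frames."""
    return [list(frames[k:k + t]) for k in range(0, len(frames) - t + 1, t)]


def training_pairs(
    frames: Sequence[float],
    t: int,
    predict: Callable[[Window], Window],
) -> Tuple[List[Tuple[Window, Window]], List[Tuple[Window, Window]]]:
    """Build (input, label) pairs from one continuous recording.

    clean: real window k      -> label is real window k + 1
    noisy: predicted window   -> label is real window k + 2
    The noisy inputs are model predictions made from window k, so they
    are the samples that carry non-zero uncertainty.
    """
    wins = consecutive_windows(frames, t)
    clean = [(wins[k], wins[k + 1]) for k in range(len(wins) - 1)]
    noisy = [(predict(wins[k]), wins[k + 2]) for k in range(len(wins) - 2)]
    return clean, noisy
```

A recording of 3T frames therefore yields two clean pairs and one noisy pair.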
Though the framework is iterative, we only need to guarantee that it can make accurate predictions for T frames, so we simply use the training data with both zero and non-zero Bayesian uncertainty values, and do not need iterative training methods such as those proposed in [19,20,21]. We set the size of the bilayer Bayesian LSTM cells to 128 and T to 15, and divide the experimental data in a 3:1:1 ratio into training, validation and testing sets, respectively. During training, we use the RMSprop optimizer and set the learning rate to 1e-4.
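A 3:1:1 split can be produced with a few lines of standard-library Python. This is a sketch only: whether the samples are shuffled before splitting, and with which seed, is not stated in the text and is assumed here, and `split_3_1_1` is a name introduced for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def split_3_1_1(
    samples: Sequence[S], seed: int = 0
) -> Tuple[List[S], List[S], List[S]]:
    """Split samples into training, validation and testing sets (3:1:1).

    ASSUMPTION: samples are shuffled with a fixed seed before splitting.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 3 // 5
    n_val = len(items) // 5
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no sample is lost
    return train, val, test
```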
5 Experiments
5.1 Data and baseline methods
The data used in our experiments is PIE, collected in downtown Toronto by Rasouli et al. [14]. It contains 1824 datasets of pedestrians in various traffic conditions from the vehicle’s view and is fully annotated. According to the description of the PIE data, it is the largest publicly available dataset for studying pedestrian behavior in traffic. To match the purpose of our study, we use only the datasets of moving pedestrians and discard those of standing ones. As discussed in Section 4.3, for the SE module, we collect data in various streets of Beijing. Though the locations of data collection differ between [14] and ours, the scenarios for data collection are kept as consistent as possible. Since the SE module is pretrained, a small inconsistency should be acceptable. Besides the PIE data, we also use the JAAD data collected from several locations in North America and Eastern Europe by Rasouli et al. [29]. This data contains 346 short video clips from 240 hours of driving footage. Combining JAAD with the data collected in Beijing, we observe results in the experiments similar to those on the PIE data. Therefore, we present only the results on the PIE data in the following sections.
We choose the Kalman filter, LSTM, B-LSTM [13], PIE_traj [14], Bitrap-NP [10] and SGNet-ED [12] as the baseline methods. The Kalman filter and LSTM are classic baseline methods, and B-LSTM, PIE_traj, Bitrap-NP and SGNet-ED are representative recent methods for predicting the pedestrian’s trajectory. There are several additional recent methods [15,16,17,18], but their source code is not provided, so we have not included them.
5.2 Comparison with the baseline methods
We compare PVII with the baseline methods discussed in Section 5.1. Following [13, 14], the input of each method is a pedestrian’s motion in 15 frames, and the output is the predicted trajectory in 45 frames. To evaluate the prediction accuracy, we calculate the MSE between the pedestrian’s bounding boxes from prediction and ground truth in the 15th frame, and also in the 30th and 45th frames (MSE15, MSE30 and MSE45, respectively). We also calculate the MSE between the pedestrian’s centers from prediction and ground truth in the 45th frame, and also over all 45 frames (CMSE and CFMSE, respectively). All the testing data are used to calculate the MSE.
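The five metrics, as described above, can be computed for a single predicted track as follows. This sketch assumes that boxes are stored as (x1, y1, x2, y2) corner coordinates and that the squared error is averaged over coordinates; neither convention is stated in the text, and in practice the values would be averaged over all test samples. All function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # ASSUMED layout: (x1, y1, x2, y2)


def box_mse(pred: Box, truth: Box) -> float:
    """Mean squared error over the four box coordinates."""
    return sum((a - b) ** 2 for a, b in zip(pred, truth)) / 4


def center(box: Box) -> Tuple[float, float]:
    """Center point of a corner-format box."""
    return (box[0] + box[2]) / 2, (box[1] + box[3]) / 2


def center_mse(pred: Box, truth: Box) -> float:
    """Mean squared error between the two box centers."""
    (px, py), (tx, ty) = center(pred), center(truth)
    return ((px - tx) ** 2 + (py - ty) ** 2) / 2


def evaluate(pred: Sequence[Box], truth: Sequence[Box]) -> Dict[str, float]:
    """Metrics for one sample with a 45-frame prediction horizon."""
    if len(pred) != 45 or len(truth) != 45:
        raise ValueError("expected 45 predicted and 45 ground-truth frames")
    return {
        "MSE15": box_mse(pred[14], truth[14]),
        "MSE30": box_mse(pred[29], truth[29]),
        "MSE45": box_mse(pred[44], truth[44]),
        "CMSE": center_mse(pred[44], truth[44]),
        "CFMSE": sum(center_mse(p, t) for p, t in zip(pred, truth)) / 45,
    }
```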
The testing results are listed in Table 1. PVII achieves the smallest MSE, that is, the highest accuracy, among all the methods. In addition, we show the prediction results of PIE_traj, PVII and SGNet-ED in four scenarios in Fig. 5, where the predictions of PVII are closer to the ground truth. Note that the testing results may differ slightly from those listed in [17], because we use and test only on datasets of moving pedestrians.
5.3 Test with different parameter values
PVII includes LSTM cells in the PP, VP and SE modules, and an important parameter is the LSTM cell size. Therefore, for the PP and VP modules, we use the LSTM cell sizes 32, 64, 128, 256 and 512 and test PVII following Section 5.2. To evaluate the results, we calculate the MSEs from the 1st frame to the 45th. For the SE module, since it essentially performs multi-class classification, we test it separately using the LSTM cell sizes 16, 32, 64, 128 and 256. To evaluate the results, we calculate the accuracy and the F1-score, where the former is the number of correctly predicted vehicle speed levels over the number of predictions, and the latter is a function of each speed level’s prediction precision and recall.
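The two classification metrics for the SE module can be sketched as below. The text does not say how the per-level F1 values are combined, so the unweighted (macro) average over the levels that occur is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of speed levels that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    labels: Iterable[int] = range(1, 7),
) -> float:
    """Unweighted mean of per-level F1 (ASSUMED averaging scheme)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        if tp + fp + fn == 0:
            continue  # level appears neither in the labels nor the predictions
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```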
The testing results are plotted in Fig. 6. PVII and the SE module achieve their highest accuracy when their LSTM cell sizes are 128 and 64, respectively. This is consistent with our parameter settings discussed in Section 4. When the LSTM cell size is too large or too small, the model’s generalization is affected.
5.4 Test with various input features
As discussed in Section 4.1, in the inputted historical data, features of pedestrian’s motion include position, velocity, head turn and distance from the vehicle, and features of vehicle’s motion include position, velocity and direction. In order to study necessity of using the multiple features, we remove one of the features and test PVII following Section 5.2. Specifically, we input these data as listed below.

Data 1: removing pedestrian’s velocity feature from the historical data;

Data 2: removing pedestrian’s head turn feature from the historical data;

Data 3: removing pedestrian’s distance feature from the historical data;

Data 4: removing vehicle’s position feature from the historical data;

Data 5: removing vehicle’s velocity feature from the historical data;

Data 6: removing vehicle’s direction feature from the historical data.
The testing results are listed in Table 2. Compared with Data 1-6, using historical data with all the features results in the highest accuracy. This is consistent with the features we use as discussed in Section 4.1.
5.5 Ablation test
PVII mainly includes the PP, VP and SE modules. The first two use the bilayer Bayesian en/decoder with a bilayer encoder and a Bayesian decoder, and the last is pretrained and added separately. In order to study the necessity of combining these techniques, we remove one or several of them, and test PVII following Section 5.2. Specifically, we compare the variants listed below.

Variant 1: replacing the bilayer encoder by a single-layer encoder not using the initial input X;

Variant 2: replacing the Bayesian decoder by a fully connected network;

Variant 3: replacing the bilayer Bayesian en/decoder by an LSTM;

Variant 4: removing the connection between the SE and PP modules;

Variant 5: removing the connection between the SE and VP modules;

Variant 6: removing the SE module from PVII;

Variant 7: replacing the bilayer Bayesian en/decoder by an LSTM and removing the SE module from PVII.
The testing results are listed in Table 3. Compared with Variants 1-7, using all the techniques results in the highest prediction accuracy. This is consistent with our overall design as discussed in Section 4.
5.6 Accuracy of Bayesian uncertainty evaluation
PVII uses Bayesian uncertainty evaluation to assess the accuracy of the decoded (predicted) motion. In order to test how accurately the Bayesian uncertainty values reflect prediction accuracy, we record the values and the corresponding CFMSEs in the previous tests.
The correspondence between the values and the corresponding CFMSEs is plotted in Fig. 7. The two are positively related, indicating that the Bayesian uncertainty evaluation reflects prediction accuracy sufficiently well.
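One way to quantify such a relationship, beyond inspecting the plot, is the Pearson correlation coefficient between the recorded uncertainty values and the CFMSEs. The paper reports only the plot, so this check is an addition made here for illustration, and the numbers in the usage note are made up rather than taken from the experiments.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)
```

For example, `pearson([0.1, 0.2, 0.3], [10.0, 20.0, 30.0])` is close to 1, and a value well above 0 on real data would support the positive relationship described above.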
6 Conclusion and discussion
We propose PVII, a pedestrian-vehicle interactive and iterative prediction framework for the pedestrian’s trajectory. Because the pedestrian-vehicle interaction is a continuous process, we use an iterative prediction method to model the interaction during prediction. To avoid the accumulation of errors as the number of iterations increases, we design a bilayer Bayesian en/decoder, which combines the inaccurate results from the previous iteration with the accurate historical data for prediction, and evaluates the Bayesian uncertainty of the results. In addition, in order to incorporate the impact of surrounding vehicles into the framework, we design a pretrained SE module that estimates the speed of the surrounding vehicles from the pedestrian’s motion, and collect its training data from the pedestrian’s view. Experimental results indicate that PVII improves the prediction accuracy of the pedestrian’s trajectory. As a result, PVII can be used in the ADAS for more accurate estimation of collision probability.
In the future, the most important way to further improve PVII is to update the training data. In this study, we train and test PVII using the PIE and JAAD data. Though the former is a large dataset from downtown Toronto and the latter is from several locations in North America and Eastern Europe, there could still be limitations in generalizing PVII to diverse scenarios in various locations. In addition, we pretrain the SE module using data collected from the pedestrian’s view in Beijing. Though we take into consideration the consistency of the data collection scenarios between Toronto and Beijing, it would be better to collect data from both the vehicle’s and pedestrian’s views in the same location. Therefore, we will gather data from both views in various locations and further improve our work.
Code and data
References
Louwerse WJR, Hoogendoorn SP (2004) Adas safety impacts on rural and urban highways. In: IEEE Intell Veh Symp 2004, IEEE, pp 887–890
Masello L, Castignani G, Sheehan B, Murphy F, McDonnell K (2022) On the road safety benefits of advanced driver assistance systems in different driving contexts. Trans Res Interdiscip Perspect 15:100670
Zhang L, Yuan K, Chu H, Huang Y, Ding H, Yuan J, Chen H (2021) Pedestrian collision risk assessment based on state estimation and motion prediction. IEEE Trans Veh Technol 71(1):98–111
Pan R, Jie L, Zhang X, Pang S, Wang H, Wei Z (2022) A v2p collision risk warning method based on lstm in iov. Secur and Commun Netw 2022
Rasouli A, Kotseruba I, Tsotsos JK (2017) Agreeing to cross: How drivers and pedestrians communicate. In: 2017 IEEE Intell Veh Symp (IV), IEEE, pp 264–269
Rasouli A, Tsotsos JK (2019) Autonomous vehicles that interact with pedestrians: A survey of theory and practice. IEEE Trans Intell Trans Syst 21(3):900–918
Camara F, Bellotto N, Cosar S, Weber F, Nathanael D, Althoff M, Wu J, Ruenz J, Dietrich A, Markkula G et al (2020) Pedestrian models for autonomous driving part ii: high-level models of human behavior. IEEE Trans Intell Trans Syst 22(9):5453–5472
Bouhsain SA, Saadatnejad S, Alahi A (2020) Pedestrian intention prediction: A multi-task perspective. Tech Rep
Sui Z, Zhou Y, Zhao X, Chen A, Ni Y (2021) Joint intention and trajectory prediction based on transformer. In: 2021 IEEE/RSJ International conference on intelligent robots and systems (IROS), IEEE, pp 7082–7088
Yao Y, Atkins E, JohnsonRoberson M, Vasudevan R, Du X (2021) Bitrap: Bidirectional pedestrian trajectory prediction with multimodal goal estimation. IEEE Robot Autom Lett 6(2):1463–1470
Czech P, Braun M, Kreßel U, Yang B (2022) On-board pedestrian trajectory prediction using behavioral features. In: 2022 21st IEEE International conference on machine learning and applications (ICMLA), IEEE, pp 437–443
Wang C, Wang Y, Xu M, Crandall DJ (2022) Stepwise goal-driven networks for trajectory prediction. IEEE Robot Autom Lett 7(2):2716–2723
Bhattacharyya A, Fritz M, Schiele B (2018) Long-term on-board prediction of people in traffic scenes under uncertainty. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4194–4202
Rasouli A, Kotseruba I, Kunic T, Tsotsos JK (2019) Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6262–6271
Malla S, Dariush B, Choi C (2020) Titan: Future forecast using action priors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11186–11196
Kim K, Lee YK, Ahn H, Hahn S, Oh S (2020) Pedestrian intention prediction for autonomous driving using a multiple stakeholder perspective model. In: 2020 IEEE/RSJ International conference on intelligent robots and systems (IROS), IEEE, pp 7957–7962
Neumann L, Vedaldi A (2021) Pedestrian and ego-vehicle trajectory prediction from monocular camera. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10204–10212
Quan R, Zhu L, Wu Y, Yang Y (2021) Holistic lstm for pedestrian trajectory prediction. IEEE Trans Image Process 30:3229–3239
Bengio S, Vinyals O, Jaitly N, Shazeer N (2015) Scheduled sampling for sequence prediction with recurrent neural networks. Adv Neural Inform Process Syst 28
Lamb AM, Goyal AGAP, Zhang Y, Zhang S, Courville AC, Bengio Y (2016) Professor forcing: A new algorithm for training recurrent networks. Adv Neural Inform Process Syst 29
Zhang W, Feng Y, Liu Q (2021) Bridging the gap between training and inference for neural machine translation. In: Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pp 4790–4794
Schmidt S, Faerber B (2009) Pedestrians at the kerb – recognising the action intentions of humans. Transport Res F: Traffic Psychol Behav 12(4):300–310
Alvarez WM, Moreno FM, Sipele O, Smirnov N, Olaverri-Monreal C (2020) Autonomous driving: Framework for pedestrian intention estimation in a real world scenario. In: 2020 IEEE Intelligent vehicles symposium (IV), IEEE, pp 39–44
Yao Y, Xu M, Choi C, Crandall DJ, Atkins EM, Dariush B (2019) Egocentric vision-based future vehicle localization for intelligent driving assistance systems. In: 2019 International conference on robotics and automation (ICRA), IEEE, pp 9711–9717
Kendall A, Gal Y (2017) What uncertainties do we need in bayesian deep learning for computer vision? Adv Neural Inform Process Syst 30
Costante G, Mancini M (2020) Uncertainty estimation for data-driven visual odometry. IEEE Trans Robot 36(6):1738–1757
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: A review for statisticians. J Am Stat Assoc 112(518):859–877
Gal Y, Ghahramani Z (2016) Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: International conference on machine learning, PMLR, pp 1050–1059
Rasouli A, Kotseruba I, Tsotsos JK (2017) Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior. In: Proceedings of the IEEE international conference on computer vision workshops, pp 206–213
Funding
This work was funded by the National Natural Science Foundation of China (No. 62172028).
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Shen, Q., Huang, S., Sun, B. et al. PVII: A pedestrian-vehicle interactive and iterative prediction framework for pedestrian’s trajectory. Appl Intell 54, 9881–9891 (2024). https://doi.org/10.1007/s10489-024-05595-8
DOI: https://doi.org/10.1007/s10489-024-05595-8