1 Introduction

A dynamic graph can be described as a temporal graph built from timestamped interactions. The key elements of a dynamic graph are the source node, the target node, the timestamp, and the edge features. The primary task of dynamic graph embedding is to learn node representations [1,2,3,4] that encode temporal and structural dependencies and support various downstream tasks. Traditional graph representation learning methods mostly rely on static graphs. Handling nodes and edges in real-world scenarios may seem straightforward, but they are in fact dynamic and constantly evolving. As a result, understanding the dynamic evolution patterns of nodes within graphs has emerged as an important area in graph data analysis and mining [5].

Fig. 1

(a) Continuous-time dynamic graph, (b) Static graph ignoring nodes and directions, (c) Discrete-time dynamic graph divided into snapshots

With the advancement of deep learning and graph neural network research, the study of dynamic graphs has developed rapidly. The basic structure of a dynamic graph is shown in Fig. 1(a). Existing research has typically applied static graph models to dynamic graphs, but this approach only produces static node representations and discards a considerable amount of valuable dynamic information. In particular, as shown in Fig. 1(b), when time information is ignored and the dynamic graph is represented as a static graph, the sequence \(U2\rightarrow I1\rightarrow U5\) is treated as a valid interaction path; yet this representation fails to capture temporal dynamics, because each node has only one embedding and cannot encode temporal information. Consequently, time-aware prediction tasks in real-world scenarios cannot be effectively accomplished with static graph neural network models.

In real-world scenarios, the evolution of graphs is constantly changing over time, with nodes and edges appearing or disappearing as time progresses. For example, in recommendation systems, if a user shows significant interest in a particular item at time t and clicks on it (creating an edge), similar items (nodes) may appear at \(t+1\). To address such real-world problems, a significant body of research has used discrete dynamic graph approaches to capture episodic temporal information [6]. These methods represent dynamic graphs as a series of discrete static snapshots, and are often combined with RNN structures to model temporal sequences, capture dynamic properties on the graph, and learn temporal representations.

However, the episodic temporal information inevitably leads to the loss of node connectivity details, as shown in Fig. 1(c). Due to the segmentation of snapshots, certain effective pathways (e.g. \(U5\rightarrow I1\leftarrow U2\)) are not considered.

To solve the above problems, researchers have proposed continuous-time dynamic graphs [2, 3, 7, 8]. Chang et al. [9] proposed a selection mechanism to obtain the neighbours of interacting nodes, which uses the information of interaction edges and dependent edges to capture local information while combining node information from different time periods to capture global information. Zhang et al. [4] proposed a random walk strategy on temporal networks to capture the dynamically changing information of temporal network motifs and learn dynamically changing representations. Huang et al. [10] modelled the edge formation process with a Hawkes process and predicted links by jointly considering dynamic events and dynamic nodes.

Historical data and past events are essential for future forecasting. Continuous-time dynamic graph models significantly enhance predictive accuracy by capturing temporal dependencies in the data [11]. Various domains, including weather forecasting and stock market prediction, therefore benefit significantly from such models [4, 12]. The memory module, a crucial element in the majority of continuous-time dynamic graph models, is used to maintain past node state representations. The dynamic graph is modelled as a sequence of time-stamped events, together with techniques for updating and storing temporal information. For example, Rossi et al. [2] proposed a memory-unit update strategy that fuses temporal information into node embeddings; Xu et al. [3] proposed a k-hop temporal neighbour sampling strategy and a learned temporal encoding to preserve temporal information during message passing; Wang et al. [13] proposed an asynchronous propagation attention graph neural network algorithm, which decouples model inference from graph computation so that graph computation can be executed independently without affecting the speed of model inference.

Despite the progress made by the aforementioned studies in advancing this field, two significant challenges still persist:

Challenge 1: Complex Dependencies in Dynamic Graph Models. Models such as TGN [2] and TGAT [3] suffer from over-smoothing and excessive time consumption when the number of layers used in graph query operations grows, which results in models with complex dependencies.

Challenge 2: Stability Problems in Dynamic Graph Models. Models such as EvolveGCN [14] and MERIT [15] rely on RNN structures [16, 17], which suffer from the long-term dependency problem common to recurrent neural networks; moreover, when applied to dynamic graph neural networks they cannot ensure that events are processed in timestamp order, resulting in unstable model performance.

To address the aforementioned problems, we propose a novel Dynamic Graph Neural Network via Memory regenerate and Neighbor propagation (DGNN-MN). In this model, we first feed the dynamic graph into the memory regeneration module to generate memory vectors that fuse temporal information. Then, we propose a novel propagation strategy along temporal edges to acquire neighbour information and generate node vectors that fuse it. A multi-head attention module fuses the node vectors containing temporal information with the node vectors containing neighbour information. Finally, a feedback strategy is used to iteratively update the node representations.

The contributions of the present work are summarized as follows.

  • We propose a memory update strategy for gathering node information and optimizing node representations. The captured original information is combined with data from neighbour nodes through an information aggregation module.

  • We propose a method for retrieving neighbour information using a strategy called Neighbour Propagation. It stores a large amount of high-order neighbour data in an independent neighbour information propagation module, which allows real-time model inference by directly reading vector lists and thereby improves execution efficiency.

  • We conducted extensive experiments on five real-world dynamic graph datasets, comparing DGNN-MN with representative baseline methods on link prediction and node classification. The results demonstrate that DGNN-MN outperforms the existing baseline models. In addition, the effectiveness of the memory regenerate and neighbour propagation strategies is demonstrated.

The rest of the paper is organized as follows. Section 2 briefly introduces related work. Section 3 provides the basic definition of the problem. In Section 4, we describe the overall framework of DGNN-MN and the introduction of each component. Section 5 presents the experiments and analysis. Finally, Section 6 summarizes the conclusions of this study and discusses the future work.

2 Related work

In this section, we provide an overview of research relevant to this paper, covering Graph Neural Networks (GNNs) and Dynamic Graph Neural Networks (DGNNs).

2.1 Graph neural networks

Graph Neural Networks (GNNs), as discussed in reference [18], represent a class of deep learning models designed for graph data. They have the capability to learn intricate nonlinear relationships within graphs and effectively integrate both structural and feature-based information of nodes.

Graph neural networks were first proposed by Gori et al. [19], and aim to map nodes, edges, and even entire graphs into a low-dimensional vector space while preserving the structure and properties of the graph itself, using deep learning methods. Later, Kipf et al. [20] proposed graph convolutional networks, which learn node representations through convolutional layers, an approach that has been remarkably successful in tasks such as node classification and link prediction. Hamilton et al. [21] introduced the GraphSAGE model, which performs information aggregation directly on fixed-size node neighborhoods. Veličković et al. [22] proposed combining node features with attention mechanisms to assign different weights to different nodes, yielding more discriminative representation vectors. Bhatti et al. [23] summarised the applications of graph neural networks in various research areas, analysed the current state of research, and outlined future research directions. Representative related work also includes LEFTNet [24].

Graph neural networks have also been studied on heterogeneous and dynamic graphs, showing strong academic and application potential in recent years [25, 26]. Wang et al. [27] proposed learning heterogeneous graph node representations by aggregating the feature information of neighbour nodes under different meta-paths according to importance coefficients. Fu et al. [28] proposed capturing intermediate heterogeneous node information along meta-paths using a relationship-aware approach, together with a knowledge distillation method for capturing relevant semantics. Li et al. [29] devised attribute-completion and attribute-enhancement methods to learn information in heterogeneous graphs and replaced the original graph-attention approach with a Transformer-based one. Yu et al. [30] proposed a new transformer-based architecture for dynamic graph learning that enables the model to benefit from a longer history. Saxena et al. [31] proposed a method based on network centrality combined with Graph Convolutional Networks (GCNs) to predict connections between network nodes.

As graph neural networks mature, they are beginning to be applied to a wider range of domains, e.g., Chen et al. [32] applied graph neural networks to predicting ocean surface temperatures by proposing a spatio-temporal attention graph convolutional network and introducing an attention mechanism to capture spatial dependence and temporal correlation. Such applications extend the domain of graph neural networks and make them a versatile tool for multiple domains.

2.2 Dynamic graph neural networks

Dynamic Graph Neural Network (DGNN) models allow us to model the evolution of nodes and edges in time-series data, enabling a better understanding and prediction of temporal phenomena [33, 34].

Nguyen et al. [35] proposed generating node embeddings by simulating temporal random walks. Zhang et al. [6, 36, 37] modelled temporal dependence by extracting node-induced sequences with a coupled prediction network. However, these methods cannot generalise to new nodes and edges and do not depict the evolving nature of dynamic graphs. Trivedi et al. [38] introduced an attention layer to aggregate node neighbourhood information for node embeddings. Rossi et al. [2] proposed a generalised inductive framework that operates on a continuous-time dynamic graph represented as a sequence of events. Xu et al. [3] proposed a memoryless approach in which temporal information is incorporated into the Transformer model through Bochner's theorem, with self-attention operations used to obtain node embeddings. Wang et al. [13] proposed an asynchronous propagation attention mechanism so that temporal graph aggregation operations can be performed locally during online inference. Li et al. [39] proposed parametric metrics to estimate the impact scores of nodes in an evolving graph and further developed an efficient single-scan algorithm to ensure the validity of graph query results.

Although various models have developed embedding methods that effectively learn the dynamic behaviour of nodes, current research is focused on the design of appropriate models for the propagation and interaction of temporal information.

3 Preliminaries

In this section, we formally define several essential concepts and problem definitions related to Dynamic Graph Neural Networks.

Definition 1

Static Graph. A static graph is composed of a vertex set V(G) and an edge set E(G), denoted as \(G = (V, E)\). Here, E(G) is a finite set of edges, where edges are unordered pairs (for undirected graphs) or ordered pairs (for directed graphs) of vertices.

Definition 2

Continuous Time Dynamic Graph. A continuous-time dynamic graph captures the complete temporal information of a dynamic network and is represented as \(G=\{(v_i,v_j,t,\Delta t,e_{ij}(t));\, i,j=1,2,\dots ,n\}\), where \(v_i\) and \(v_j\) represent nodes in the network, t and \(\Delta t\) denote the appearance time and duration of a connection, and \(e_{ij}(t)\) represents the corresponding edge.
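For illustration, such a graph can be stored as a chronologically ordered stream of interaction events. The following minimal Python sketch (the class and field names are ours, introduced only for illustration) mirrors the tuple \((v_i,v_j,t,\Delta t,e_{ij}(t))\):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    """One timestamped event (v_i, v_j, t, dt, e_ij(t)) of a continuous-time dynamic graph."""
    src: int            # source node v_i
    dst: int            # target node v_j
    t: float            # appearance time of the connection
    dt: float           # duration of the connection
    edge_feat: list     # edge features e_ij(t)

def as_event_stream(events: List[Interaction]) -> List[Interaction]:
    # A continuous-time dynamic graph is processed as the event list sorted by timestamp.
    return sorted(events, key=lambda ev: ev.t)
```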

Neural networks used for supervised learning in dynamic graph scenarios can be viewed as a combination of an encoder and a decoder. Solving dynamic graph problems can be considered as finding an appropriate function that maps dynamic graph information into node embeddings, which is the role of the encoder. The decoder's function is to take one or more node embeddings as input and make predictions for specific tasks, such as link prediction or node classification. The problem of dynamic graph embedding is defined as follows.

Problem 1

Dynamic Graph Embedding. Given a dynamic graph G, there exists a mapping function \(f:V\rightarrow R^d\). Dynamic graph embedding aims to learn a mapping function f that projects nodes from V into d-dimensional vectors. These low-dimensional vectors should capture heterogeneous information and high-order information within G.

The key notations used in this study are listed in Table 1.

Table 1 Notation

4 DGNN-MN model

In this section, we introduce a novel DGNN-MN model designed for representation learning on continuous-time dynamic graphs with timestamped event sequences. For each time t, it generates graph node embeddings \(Z(t)=(z_1 (t),\dots ,z_{n(t)} (t))\).

Fig. 2

Overall architecture of proposed DGNN-MN

The general framework (Fig. 2) consists of three key modules: the Memory Regenerate Module (A), the Multi-Head Attention Module (B), and the Neighbor Information Propagation Module (C). Specifically, Module (A) captures the initial information of nodes through a node information acquisition strategy and subsequently captures their temporal information via a node information aggregation and update method. Module (C) first acquires neighbor information based on a time-edge propagation strategy and then stores it in a parallelized manner. Module (B) builds the current node embeddings on top of Modules (A) and (C) using multi-head attention operations. Finally, the obtained node embeddings are decoded by a multi-layer perceptron for downstream tasks such as link prediction and node classification.

4.1 Memory regenerate module

An essential component of DGNN-MN is the Memory Regenerate Module, which plays a crucial role in the acquisition of temporal interaction information related to nodes. The module consists of three main units: Node Information Capture, Information Aggregation and Memory Regenerate. The model efficiently performs parallel processing while learning node representations from the temporal characteristics of the data. The specific workflow is illustrated in Fig. 3.

Fig. 3

Memory regenerate module

4.1.1 Node information capture

When an event occurs in the form of an interaction between two nodes, a message must be computed for each node involved in the event and its memory updated. At time t, when the source node i interacts with the target node j through event \(e_{ij}(t)\), two messages can be computed:

$$\begin{aligned} U_i(t)=msg(s_i (t^- ),s_j (t^- ),\Delta t,e_{ij} (t)),\end{aligned}$$
(1)
$$\begin{aligned} U_j(t)=msg(s_j (t^- ),s_i (t^- ),\Delta t,e_{ij} (t)), \end{aligned}$$
(2)

Similarly, if it is an event at the node level, denoted as \(v_i (t)\), a single message can be computed for the nodes involved in the event:

$$\begin{aligned} U_i(t)=msg(s_i (t^- ),t,v_i (t^- )), \end{aligned}$$
(3)

where \(s_i(t^- )\) represents the memory of node i before time t (containing all the information about node i before the last interaction). msg is a learnable message function implemented using MLPs. This module also supports edge deletion events.

In the edge deletion event \((i,j,t',t)\), the edge created between nodes i and j at time \(t'\) is deleted at time t. To handle this, the node and its incoming and outgoing edges are simply removed from the spatio-temporal graph. This ensures that the node is not utilized in the graph while calculating embeddings for other nodes.
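Returning to the message computation of Eqs. (1)-(2), the sketch below shows one possible realisation of the learnable msg function as an MLP in PyTorch; the layer sizes and the concatenation order are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class MessageFunction(nn.Module):
    """Illustrative msg(.) of Eqs. (1)-(2): an MLP over the concatenation of the two
    node memories, the time gap and the edge features."""
    def __init__(self, mem_dim: int, edge_dim: int, msg_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * mem_dim + 1 + edge_dim, msg_dim),
            nn.ReLU(),
            nn.Linear(msg_dim, msg_dim),
        )

    def forward(self, s_i, s_j, delta_t, e_ij):
        # s_i, s_j: [batch, mem_dim]; delta_t: [batch, 1]; e_ij: [batch, edge_dim]
        return self.mlp(torch.cat([s_i, s_j, delta_t, e_ij], dim=-1))
```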

4.1.2 Node information aggregation

Since a single node may interact with multiple other nodes before time t, i.e. a node may generate multiple events before time t, this module uses an average aggregation mechanism, for computational efficiency, to aggregate the messages \(U_i(t_1),\dots ,U_i(t_n)\) with \(t_1,\dots ,t_n \le t\):

$$\begin{aligned} m_i(t)= agg(U_i(t_1),\dots ,U_i(t_n)) = \frac{U_i(t_1)+U_i(t_2)+\cdots +U_i(t_n)}{n}, \end{aligned}$$
(4)

where agg represents an average aggregation function.

This function can also be implemented in a learnable manner, for instance using RNNs or graph attention mechanisms, but doing so would increase computational complexity.

4.1.3 Node memory update

The memory of a node is updated for every event involving the node itself. For interaction events involving two nodes, i and j, the memory of both nodes is updated after the event occurs. For node-level events, only the memories of the relevant nodes are updated. The specific implementation process is as follows:

$$\begin{aligned} S_i(t)=mem(m_i(t),s_i (t^- )), \end{aligned}$$
(5)

where mem is a learnable memory regenerate function; in this case, we use a recurrent neural network, specifically an LSTM.
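A minimal sketch of Eqs. (4)-(5) is given below; it averages the messages a node has accumulated and regenerates its memory with an LSTM cell (the additional cell state and all dimensions are implementation assumptions on our part).

```python
import torch
import torch.nn as nn

class MemoryRegenerate(nn.Module):
    """Sketch of Eqs. (4)-(5): average aggregation of messages followed by an LSTM-based
    memory update mem(m_i(t), s_i(t^-))."""
    def __init__(self, msg_dim: int, mem_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(msg_dim, mem_dim)

    def forward(self, messages, s_prev, c_prev):
        # messages: [num_events, msg_dim], all messages U_i(t_1), ..., U_i(t_n) of one node
        m_i = messages.mean(dim=0, keepdim=True)           # Eq. (4): average aggregation
        s_new, c_new = self.cell(m_i, (s_prev, c_prev))    # Eq. (5): regenerate the memory
        return s_new, c_new
```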

4.2 Neighbour information propagation module

The Memory Regenerate Module only considers the temporal dependencies between nodes. To address the resulting incompleteness of higher-order information, we introduce the Neighbour Information Propagation Module. In addition to storing a considerable amount of high-order neighbour information, this module enables the selection of important neighbour information. The process of two-hop neighbour information propagation is illustrated in Fig. 4. When the Memory Regenerate Module generates a vector for a real-time interaction between two nodes, a propagator first generates an interaction vector, which is then propagated through feedback operations to the vectors of the k-hop neighbours. DGNN-MN does not need to query the information of neighbouring nodes in the graph, because this module stores neighbour information independently; during real-time model inference, it simply reads the stored information from the vector lists of neighbouring nodes.

Fig. 4

Neighbor information propagation module

4.2.1 Neighbour messaging module

In Fig. 4, the red and blue nodes interact at time t. The information of the two nodes, together with the features of their edge, can be jointly represented as an interaction tuple \((z_{i}(t),z_{j}(t),e_{ij}(t))\). Here, \(z_i(t)\) and \(z_j(t)\) are created by the attention encoder. The neighbor propagation process can be described by the following two equations:

$$\begin{aligned} neighbor(t)= & \varphi (z_i (t),e_{ij} (t),z_j (t)),\end{aligned}$$
(6)
$$\begin{aligned} N(t)= & \psi (N(t^- ),\rho (f(neighbor(t))_{N_{ij}^k })), \end{aligned}$$
(7)

where \(\varphi (\cdot )\) is referred to as the neighbor generation function, responsible for inductively summarizing neighbor information; \(N_{ij}^k\) represents the transmission of neighbor information between the interacting nodes; \(f(\cdot )\) is used for reasonable attenuation or mapping of messages; \(\rho (\cdot )\) is a dimension reduction function, used to aggregate multiple incoming neighbor information into one; \(\psi (\cdot )\) updates the information of past nodes \(N(t^-)\). The specific explanations for the equations are as follows:

  1. 1)

    The neighbor information is generated through the generation function \(\varvec{\varphi }\). To record what happens when a node is involved in an interaction, the elements of the interaction tuple are summed, i.e. \(neighbour(t)=z_i(t)+e_{ij}(t)+z_j(t)\), and timestamped. Note that direct summation is used instead of concatenation, which limits the vector dimensionality and saves vector-list memory.

  2. 2)

    Sampling subgraphs follows a temporal pattern for neighbor information propagation \(\varvec{N_{ij}^k}\). After neighbour information is generated in the first step, it needs to be disseminated to other nodes. However, propagating a single neighbour's information to all neighbours is inefficient. Therefore, to model rapidly changing trends and update node embeddings, we employ a recency-based neighbour sampling strategy, which is better suited to capturing time-varying information.

  3. 3)

    The propagation function f applies appropriate attenuation or mapping to the messages. After the neighboring nodes are sampled, their information is strengthened or weakened; here an identity mapping is used, which ensures that the propagated information remains strictly related to the structure of the original node and its neighbouring nodes.

  4. 4)

    At a particular time step, a node may receive numerous neighbor messages. We use an averaging strategy, denoted \(\rho \), to deal with the imbalance and maintain fairness. Nodes typically receive multiple vectors from their neighbours during the information propagation phase. To address the imbalance between active (high-degree) and inactive (low-degree) nodes, we use a reduction function that consolidates multiple messages into one, so that each node receives only one message per batch.

  5. 5)

    After receiving information, nodes summarize the historical states of their neighbors using a self-update function \(\varvec{\psi }\). To update the vector list, we use a queue data structure with a First-In-First-Out (FIFO) mechanism, which ensures that the latest information remains in the vector list and obsolete information is discarded; a minimal sketch of this vector-list update is given after this list.
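The sketch below illustrates steps 1) and 5) under simple assumptions: each node keeps a fixed-length FIFO vector list, the neighbour message is the direct sum of the interaction tuple, and the propagation function f is the identity mapping; the list length and data layout are illustrative.

```python
from collections import defaultdict, deque

class NeighborVectorList:
    """Per-node vector list: a fixed-length FIFO queue holding the most recent
    propagated neighbour messages (sizes are illustrative)."""
    def __init__(self, max_len: int = 10):
        self.lists = defaultdict(lambda: deque(maxlen=max_len))

    def propagate(self, node_i, node_j, z_i, z_j, e_ij, t):
        # Eq. (6): neighbour message as the sum of the interaction tuple, timestamped.
        message = z_i + z_j + e_ij
        # Push the message to both endpoints; the FIFO queue discards the oldest
        # entry automatically, realising the update of Eq. (7) with identity f.
        self.lists[node_i].append((t, message))
        self.lists[node_j].append((t, message))

    def read(self, node):
        # Model inference only reads the stored list; no graph query is required.
        return list(self.lists[node])
```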

4.2.2 Vector list storage strategy

Because each neighbor's information arrives in a different order, positional encoding must be applied to each neighbor's information, while the maximum number of stored neighboring nodes is fixed. The positional information is converted into a one-hot format and then fed into the embedding lookup layer using the absolute positional encoding method.

For a given input sequence of length n, let t represent the arrival order of neighbors, and let \(p_n \in R^d \) denote the vector corresponding to position t, where d is the dimension of the vector. The function \(f:N \rightarrow R^d\) that generates the positional vector \(p_n\) is defined as follows:

$$\begin{aligned} p_n^i=f(t)^i=\left\{ \begin{aligned} sin(w_k \cdot t) \quad i&= 2k\\ cos(w_k \cdot t) \quad i&= 2k+1\\ \end{aligned} \right. \end{aligned}$$
(8)

The frequency \(\omega _k\) is defined as follows:

$$\begin{aligned} w_k=\frac{1}{1000^{\frac{2k}{d}}}. \end{aligned}$$
(9)

From the function definition, it can be observed that the frequency decreases along the vector dimension. As a result, it forms a geometric series in terms of wavelength, ranging from \(2\pi \) to \(1000 \cdot 2\pi \). The positional encoding \(p_n\) is a combination of cosine and sine functions for each frequency, and d is divisible by 2.

$$\begin{aligned} p_n=\left\{ \begin{aligned}&sin(w_1 \cdot t) \\&cos(w_1 \cdot t) \\&sin(w_2 \cdot t) \\&cos(w_2 \cdot t) \\&\cdots \\&cos(w_{\frac{d}{2}} \cdot t) \\ \end{aligned} \right. \end{aligned}$$
(10)

For a node’s information \(N(t)=(neighbor_1, neighbor_2,\cdots , neighbor_m)\), the positional encoding layer combines the positional information into the original vector list as follows:

$$\begin{aligned} \widehat{N(t)} = N(t)+P(t) = [neighbor_1 + p_1, neighbor_2 + p_2,\cdots , neighbor_m + p_m], \end{aligned}$$
(11)

where \(\widehat{N(t)}, N(t), P(t) \in R^{n \times d}\), with n being the maximum length of the vector list and d representing the vector dimension.
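The positional encoding of Eqs. (8)-(10) can be computed as in the following sketch; the base 1000 follows Eq. (9) (the classic Transformer uses 10000), and the tensor shapes are our assumptions.

```python
import torch

def positional_encoding(n: int, d: int, base: float = 1000.0) -> torch.Tensor:
    """Absolute positional encoding of Eqs. (8)-(10) for n arrival positions and dimension d."""
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # arrival order t
    two_k = torch.arange(0, d, 2, dtype=torch.float32)             # even indices 2k
    w = 1.0 / base ** (two_k / d)                                  # frequencies w_k = 1000^{-2k/d}
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(position * w)                          # even dimensions
    pe[:, 1::2] = torch.cos(position * w)                          # odd dimensions
    return pe

# Eq. (11): N_hat = N + positional_encoding(n, d) for a vector list N of shape [n, d].
```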

Fig. 5

Multi-head attention module

4.3 Multi-head attention process

We employ a multi-head attention mechanism to compel the model to learn different aspects of knowledge. The multi-head attention achieves this by utilizing multiple independent attention heads, which compute attention weights separately, and then concatenate or weight-sum their results to obtain richer representations, as illustrated in Fig. 5.

The primary computational process is represented through the following equations:

$$\begin{aligned} Attention(Q,K,V)= & softmax(\frac{QK^T}{\sqrt{d}})V,\end{aligned}$$
(12)
$$\begin{aligned} Q= & z(t^-)W_Q,\end{aligned}$$
(13)
$$\begin{aligned} K= & \widehat{N(t)} W_K,\end{aligned}$$
(14)
$$\begin{aligned} V= & \widehat{N(t)} W_V, \end{aligned}$$
(15)

where Q represents the Query, which can be interpreted as whether there is an association between the current node and other nodes; K represents the Key, understood as the key information of a node when matched against Q; V represents the Value, interpretable as the node's important features. The dot product between Q and K determines the contribution of the corresponding value to the final output: the larger the dot product, the greater the contribution. \(W_Q, W_K, W_V \in R^{d \times d_h}\) are learnable weight matrices used to capture the interaction between the node's own information and the position-encoded information of its neighboring nodes, where \(d_h\) is the output dimension. \(\sqrt{d}\) serves as a scaling factor for normalization. Through this mechanism, the attention module can determine how to update node embeddings based on both the node itself and its neighboring information.

Since attention outputs may differ between nodes, a normalisation strategy is needed to constrain the mean and variance of the outputs. As complex attention mechanisms can disrupt the statistical distribution of the hidden representations, this paper uses layer normalisation in its calculations. Layer normalization computes the mean and variance as follows:

$$\begin{aligned} a= & MultiHead(Q,K,V)+z(t^-),\end{aligned}$$
(16)
$$\begin{aligned} \mu= & \frac{1}{d}\sum \limits _{i=1}^{d}a_i ,\end{aligned}$$
(17)
$$\begin{aligned} \sigma= & \sqrt{ \frac{1}{d}\sum \limits _{i=1}^{d}(a_i-\mu )^2} ,\end{aligned}$$
(18)
$$\begin{aligned} \overline{a}= & f(\frac{g}{\sqrt{\sigma ^2}}\odot (a-\mu )+b) , \end{aligned}$$
(19)

where d represents the dimension of the input to layer normalization, and \(\mu \) and \(\sigma \) represent the mean and standard deviation shared by all hidden units in the layer. \(\odot \) denotes element-wise multiplication between two vectors. The learnable parameters b and g are defined as bias and gain to ensure that the normalization operation does not affect the original information.
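The attention and normalisation steps of Eqs. (12)-(19) can be sketched as follows; we rely on PyTorch's built-in multi-head attention and layer normalisation, so the internal projections W_Q, W_K, W_V and the gain/bias g, b are handled by those layers, and the activation f of Eq. (19) is omitted. Head count and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of Eqs. (12)-(19): the node state z(t^-) attends over its position-encoded
    neighbour list, followed by a residual connection and layer normalisation."""
    def __init__(self, d: int, n_heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, z_prev, neighbor_list):
        # z_prev: [batch, d] query; neighbor_list: [batch, n, d] keys/values (N_hat)
        q = z_prev.unsqueeze(1)
        out, _ = self.attn(q, neighbor_list, neighbor_list)   # Eqs. (12)-(15)
        a = out.squeeze(1) + z_prev                           # Eq. (16): residual connection
        return self.norm(a)                                   # Eqs. (17)-(19)
```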

4.4 Model training

For different downstream tasks, different loss functions and metrics must be chosen to evaluate model performance. We propose a time-varying negative sampling strategy to construct positive-negative sample pairs and train the model with a cross-entropy loss:

$$\begin{aligned} l= & \sum \limits _{(v_i,v_j,e_{ij},t)\in G} -log(\sigma (-z_i(t)^Tz_j(t)))\nonumber \\ & -E_{v_n\sim p_n}log(\sigma (z_i(t)^Tz_n(t))), \end{aligned}$$
(20)

where the summation runs over the interaction events observed during training, \(\sigma \) represents the sigmoid activation function, and \(p_n (v)\) is the negative sampling distribution from which the negative node \(v_n\) is drawn. The overall flow of DGNN-MN is presented in Algorithm 1.
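As an illustration, a common binary cross-entropy realisation of this positive-versus-negative objective is sketched below; it is an illustrative variant rather than a literal transcription of Eq. (20), and assumes one sampled negative destination per observed interaction.

```python
import torch

def link_loss(z_i, z_j, z_neg):
    """Negative-sampling link loss sketch: observed pairs (z_i, z_j) are positives,
    (z_i, z_neg) with v_n ~ p_n(v) are negatives."""
    pos_score = (z_i * z_j).sum(dim=-1)        # z_i(t)^T z_j(t)
    neg_score = (z_i * z_neg).sum(dim=-1)      # z_i(t)^T z_n(t)
    pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
    neg_loss = -torch.log(1.0 - torch.sigmoid(neg_score) + 1e-10)
    return (pos_loss + neg_loss).mean()
```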

Algorithm 1

Overall DGNN-MN learning algorithm.

5 Experiments

In this section, we introduce the datasets used in the experiments, the comparison algorithms, and the evaluation metrics. Then the effectiveness of the proposed model is verified through link prediction, node classification, ablation experiments, hyperparameter analysis, and time complexity analysis.

5.1 Datasets

In this paper, we utilize five real-world dynamic graph datasets to evaluate the effectiveness of the model. Table 2 presents the detailed information about these datasets.

  • Reddit: This dataset records posts made by users on subreddits within one month. Its nodes correspond to users and posts, and its edges indicate posting requests with timestamps.

  • Wikipedia: This dataset covers edits made to Wikipedia pages within one month. Its nodes represent human editors and Wikipedia pages, and its edges represent timestamped edits.

  • Enron: This dataset is a communication network whose edges represent email communication between core employees over several years.

  • MOOC: This dataset involves online courses, where nodes denote students and course content units, such as videos and problem sets, and edges indicate students' access to specific units.

  • Meituan: This dataset is a collection of food preference data from the Meituan platform, with nodes representing users and dishes, and edges representing user click behavior.

Table 2 Dataset statistics

5.2 Baselines

We compare the proposed method with nine dynamic graph representation learning models, categorized by how they handle the data: a discrete dynamic graph representation learning model, EvolveGCN [14], and continuous-time dynamic graph representation learning models, namely JODIE [1], TGAT [3], Dyrep [38], TGN [2], APAN [13], MERIT [15], EdgeBank [40], and DyGformer [30].

  • JODIE(2019) [1]: Two recurrent neural networks are used to update the embedding of users and items at each interaction. A novel projection operator is introduced that learns and estimates the user’s embedding at any time in the future, and therefore models the future embedding trajectories of the user and item.

  • Dyrep(2019) [38]: Considering the changes in network topology and node interactions, the dynamic network can be divided into two processes: the association process and the communication process. This suggests that learning can act as a mediator that effectively connects the two processes.

  • EvolveGCN(2020) [14]: A graph convolutional network model is employed in the time dimension without the aid of node embedding. The method evolves the GCN parameters by using RNN to capture the dynamics of the graph sequence.

  • TGAT(2020) [3]: Node embeddings should incorporate both static node features and dynamic topological features. Based on the self-attention mechanism and leveraging the classic Bochner theorem from harmonic analysis, TGAT develops a novel temporal encoding technique.

  • TGN(2020) [2]: A dynamic graph model that is both general and efficient has been proposed. The model incorporates memory modules and graph convolution operations, enabling it to learn from the data order while maintaining efficient parallel processing.

  • APAN(2021) [13]: It introduces an asynchronous propagation attention graph neural network algorithm that decouples model inference from graph computations, enabling independent execution of graph computations without impeding the speed of model inference.

  • MERIT(2022) [15]: A context-aware attention mechanism sequentially locates the most relevant neighbors in temporal order, jointly capturing contextual information at multiple levels on the temporal graph. Multiple aggregation and propagation steps are then carried out to explore and leverage high-level structural information for downstream tasks.

  • EdgeBank(2022) [40]: Two visualisation techniques are proposed to understand recurring patterns of edges over time, together with a more rigorous evaluation procedure for link prediction on dynamic graphs.

  • DyGformer(2023) [30]: Extensible coding interfaces and comprehensive evaluation protocols are proposed, demonstrating their effectiveness in capturing node correlation and long-term temporal dependencies.

  • DGNN-MN (ours): We introduce an approach that employs memory regeneration alongside neighbour propagation. The memory regeneration mechanism first captures node temporal features; neighbour propagation then independently stores high-order neighbour information; finally, multi-head attention selects critical information to improve node representations.

5.3 Evaluation metrics

In this paper, we employ Average Precision (AP) and Area Under Curve (AUC) as the evaluation metrics.

Among these, AP can be used to assess detection performance, specifically the ability to correctly predict targets across different thresholds on the test sample. It is derived from the Precision-Recall (PR) curve, whose precision and recall are computed as follows:

$$\begin{aligned} Precision= & \frac{TP}{TP+FP} ,\end{aligned}$$
(21)
$$\begin{aligned} Recall= & \frac{TP}{TP+FN} , \end{aligned}$$
(22)

By integrating the area under the PR curve, one can obtain the AP value for that particular category, as expressed by the following formula:

$$\begin{aligned} AP = \int _{0}^{1} P \, dR , \end{aligned}$$
(23)

In this context, TP stands for true positives, FP represents false positives, FN denotes false negatives, P corresponds to precision, and R signifies recall.

AUC is employed to assess the performance of binary classification models. It is derived from the Receiver Operating Characteristic (ROC) curve, whose coordinates are calculated as follows:

$$\begin{aligned} FPR= & \frac{FP}{FP+TN},\end{aligned}$$
(24)
$$\begin{aligned} TPR= & \frac{TP}{TP+FN}, \end{aligned}$$
(25)

By integrating the area under the Receiver Operating Characteristic (ROC) curve, one can obtain the AUC value for that particular category, as expressed by the following formula:

$$\begin{aligned} AUC = \int _{0}^{1} T \, dF , \end{aligned}$$
(26)

In this context, FP represents false positives, TN denotes true negatives, TP stands for true positives, FN signifies false negatives, F corresponds to false positive rate (FPR), and T represents true positive rate (TPR).
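In practice, both metrics can be computed directly from the predicted link probabilities; the sketch below uses scikit-learn with made-up scores purely for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# y_true: 1 for observed (positive) links, 0 for sampled negatives;
# y_score: the decoder's predicted link probabilities (illustrative values).
y_true = np.array([1, 1, 0, 1, 0, 0])
y_score = np.array([0.91, 0.78, 0.35, 0.66, 0.52, 0.08])

ap = average_precision_score(y_true, y_score)   # area under the PR curve
auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
print(f"AP = {ap:.4f}, AUC = {auc:.4f}")
```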

5.4 Experimental settings

For the five datasets mentioned above, we used the Adam optimizer with a learning rate of 0.0001 for training. The data was split into training, validation, and testing sets with a ratio of 70\(\%\), 15\(\%\), and 15\(\%\), respectively. A batch size of 200 was used during training. To prevent overfitting, a dropout rate of 0.1 was applied, and early stopping was used with a patience of 5. The number of attention heads was set to 2, and the number of message-passing layers was set to 2. In both the encoder and decoder, a two-layer feedforward neural network (MLP) was used, and no extensive hyperparameter tuning was conducted for the reported results.
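For reference, the reported training settings can be collected as follows (the dictionary is only an illustrative way of grouping the values stated above):

```python
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "split": (0.70, 0.15, 0.15),     # train / validation / test
    "batch_size": 200,
    "dropout": 0.1,
    "early_stopping_patience": 5,
    "attention_heads": 2,
    "message_passing_layers": 2,
    "mlp_layers": 2,                 # encoder and decoder feedforward depth
}
```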

Table 3 Transductive learning results (%) compared to baselines
Table 4 Inductive learning results (%) compared to baselines

5.5 Link prediction

We employed AP and AUC as evaluation metrics to assess the model’s performance in link prediction tasks. In this task, we estimated the standard deviation of the samples (across more than 10 random seeds) and reported the average AP and AUC results as percentages.

We treat future link prediction on observed nodes as a transductive learning task and future link prediction on unseen nodes as an inductive learning task. The transductive learning results are presented in Table 3, and the inductive learning results are shown in Table 4. The best results are indicated in bold, while the second-best results are underlined. Since the source code for the MERIT model is not provided, only a portion of its experimental results are shown.

The results in Tables 3 and 4 demonstrate the superior performance of our approach in inductive and transductive learning tasks. In the transductive learning task, the proposed DGNN-MN model consistently outperforms all baseline models in link prediction. Compared to models like APAN that do not require memory regenerate, DGNN-MN achieves a maximum improvement of 6.63\(\%\) and 4.01\(\%\) in AP and AUC metrics, respectively. This result underscores the significant enhancement that the construction of memory regenerate units brings to DGNN-MN by effectively preserving node information.

In the inductive learning task, DGNN-MN demonstrates improvements in both AP and AUC, surpassing most models and significantly outperforming classical models like TGAT and TGN, with gains of at least around 5\(\%\) on both metrics. This highlights the effectiveness of the position encoding in the neighbor information propagation module and the design of vector-list queries in compensating for the limited neighbor information acquired through memory regeneration. Comparing DGNN-MN with DyGformer (2023) and EdgeBank (2022), we find that our model overall outperforms these newer models. For the MOOC and Enron datasets, DGNN-MN achieves a maximum improvement of 8.59\(\%\) in AP and 14.08\(\%\) in AUC. This underscores the significant advantages of our model in scenarios with fewer nodes but complex interactions among them, validating the effectiveness of DGNN-MN.

Fig. 6

Ablation experiment

In summary, the DGNN-MN model demonstrates outstanding performance in both inductive and transductive learning tasks, specifically in managing node information through memory regeneration and in situations with fewer nodes but intricate connections. These findings validate the effectiveness of the model and its potential for a broad range of applications.

5.6 Node classification

We also conducted specific experiments on the downstream task of node classification, where the dataset was divided into training, validation, and test sets with a ratio of 70\(\%\), 15\(\%\), and 15\(\%\). For the node classification task, we estimated the standard deviation of the samples (across more than 10 random seeds) and reported the AUC results as percentages. The results are presented in Table 5, with the best results indicated in bold and the second-best results underscored.

Table 5 Node classification results (%) compared to baselines

As we can see from Table 5, DGNN-MN performs well in node classification tasks. For the Reddit dataset, which is larger and has richer edge features, the MERIT model uses its context-aware attention mechanism to capture multi-level context effectively, resulting in better node classification performance, about 5.34\(\%\) higher than DGNN-MN, which employs only multi-head attention. However, for the Wikipedia and Meituan datasets, DGNN-MN achieves improvements of up to 1.93\(\%\) and 4.08\(\%\) in AUC, respectively. This suggests that the embeddings learned through memory regeneration and neighbor propagation perform well in node classification tasks.

Table 6 Ablation experiment variants

To summarise, the DGNN-MN model performs well in node classification tasks, particularly on datasets where the embeddings learned via memory regeneration and neighbour propagation add value, giving it an advantage over other models. Nevertheless, on datasets such as Reddit, other models may outperform it depending on the characteristics of the dataset and task.

Fig. 7

Parameter analysis

5.7 Ablation experiments

To assess the contribution of each module to the final model performance, we conducted ablation experiments by removing specific modules. In this section, we designed three variants to evaluate the extent of performance improvement contributed by different modules of DGNN-MN. The symbols used in these experiments are detailed in Table 6, and the results are presented in Fig. 6.

As shown in Fig. 6, the performance advantage of the DGNN-MN model across four different datasets, especially in comparison to its variant models, is evident. This indicates that different components play varying roles in improving the overall model performance. It also implies the need to comprehensively consider the effects of each component when constructing and optimizing dynamic graph models to better leverage information within dynamic graphs.

Due to the unique structure of the MOOC dataset, which has fewer nodes but extensive interactions between them, the neighbor propagation module plays a crucial role in improving the performance on this dataset. This underscores the idea that the model may exhibit different advantages on different datasets due to variations in data structure.

In conclusion, the results of the ablation experiments highlight the excellent performance of the DGNN-MN model across multiple datasets, while emphasizing the importance of considering dataset characteristics and the interplay of different components in the model’s design and optimization.

Fig. 8

DGNN-MN time consumption compared with baseline model

5.8 Parameter analysis

In this section, we investigated the hyperparameter of neighbor sampling. We set the neighbor sampling size to 0, 10, 20, and 30 to explore the specific impact of this hyperparameter on the model. The results are presented in Fig. 7.

As shown in Fig. 7, once the number of sampled neighbors exceeds 10, the experimental results on the Wikipedia and MOOC datasets remain almost constant, with very stable curves, indicating minimal impact on the model's robustness. However, for the Enron dataset, the results show a clear upward trend. This suggests that the setting of the neighbor sampling size is of great significance to the overall model.

5.9 Model efficiency analysis

In the link prediction task, we measured the memory consumption and per-epoch training time on the Wikipedia dataset. Each batch contained 200 interactions, and the model used two network layers. The horizontal axis represents the average training time per epoch, and the vertical axis indicates the memory required for training; models positioned closer to the lower-left corner consume less time and memory. To ensure the fairness of the experiments, all experiments were conducted on a Windows system with an RTX-3060 GPU. The experimental results are depicted in Fig. 8.

As shown in Fig. 8, it is evident that among all models, DGNN-MN has the shortest training time and relatively lower memory consumption. This is because DGNN-MN does not need to query neighboring node information in the graph; it simply reads information from the stored vector lists, significantly reducing computational overhead. Additionally, TGAT is the most competitive model among those based on continuous dynamic graphs, but due to its extensive attention coefficient calculations between nodes, the graph querying process is more time-consuming.

6 Conclusion

In this paper, we present the DGNN-MN framework, a dynamic graph neural network approach that addresses the challenges of intricate dependencies and long-term dependencies in dynamic graph data. In response to the first challenge, we propose a memory update strategy for gathering node information and optimizing node representations. For the second challenge, we propose a Neighbour Propagation strategy for retrieving neighbour information, in which an independent neighbour information propagation module stores large amounts of high-order neighbour data. Finally, a multi-head attention mechanism enables fast learning on dynamic graphs. Comprehensive experiments on real datasets show that DGNN-MN outperforms current state-of-the-art methods.

In future work, we will conduct in-depth research into the scalability of the model to provide more comprehensive evidence supporting the scalability of the proposed method, and we will explore more practical application scenarios, especially in dynamic network analysis, social network mining, and recommender systems, to validate the applicability and scalability of the method in different settings. In addition, developing visualisation tools and interpretability measures to help end-users understand dynamic graph models is an important focus for further research and application.