Abstract
The order of interactions in a sequence plays a crucial role in modeling the dynamic evolution of users’ interests. Sequential recommendation models have improved significantly with the introduction of neural networks, offering users more personalized experiences. However, most models rely on a single type of behavior data and optimize a single target for that type. They overlook the Click-Favorite-Purchase process that precedes a user’s final interaction with an item, even though the different behaviors in this process significantly affect the user’s interests. In this paper, we propose behavior sessions and time-aware for multi-targets sequential recommendation model (BTMT), which captures users’ interest changes from various behavior information. BTMT learns the influence factors of different behavior sessions as weights for each behavior. These weights introduce behavior information into a temporal attention network, which dynamically models users’ interests in conjunction with time information. Furthermore, we distinguish the prediction of users’ different behaviors and perform multi-target joint optimization. Extensive experiments on four datasets demonstrate that BTMT’s prediction performance for each behavior target significantly outperforms various sequence models. These results validate the effectiveness of distinguishing behavior information in improving recommendation performance.
1 Introduction
As information overload becomes increasingly severe, filtering information for users and making recommendations is crucial. Because users’ interests evolve, accurately predicting the next item a user will interact with remains a goal worth continued exploration. By analyzing the order of users’ historical behavior, we can better understand how their interests evolve and what their behavior patterns are, thereby enabling more accurate recommendations.
To accurately model users’ dynamic interests, extensive research has been conducted on users’ historical interaction sequences [1,2,3,4]. Early classical methods modeled users’ behavior sequences using Markov Chains (MC) [3, 5, 6], simplifying raw data to reduce computational complexity. However, these methods could not capture more complex scenarios and dynamic changes. To cope with these demands, machine learning and deep learning methods have been widely applied to behavior sequential modeling. These methods can automatically learn patterns and regularities in data to better predict users’ next interaction [7,8,9,10]. Currently, with the introduction of Transformer [11], the self-attention mechanism has become a favored approach. This mechanism can automatically focus on crucial segments and excels at handling long sequence data, establishing long-distance dependencies [12,13,14,15]. Although these deep learning models possess powerful representation capabilities, they require large-scale data to capture the subtle differences between users’ interests and interactions.
In real life, the process of a user purchasing an item consists of a series of actions, each corresponding to different behavior types. For example, suppose a user wants to buy a laptop. In that case, he/she might start by clicking on various products in the browsing interface, then select several laptops that meet his/her requirements to add to favorites, and finally, purchase one of them. In the process, the action of adding to favorites represents the user’s interests more strongly than clicking, and the action of purchasing represents his/her interests more strongly than adding to favorites. Clearly, when a user eventually makes a purchase, his/her preference for that item is the highest. In other words, different behavior types of users reflect their interests to varying degrees and influence their next interaction. Furthermore, different time intervals between adjacent items also affect users’ interests, making it crucial to consider the time relationship. Most traditional recommendation models rely on a single type of behavior data and perform single-class target optimization on that type, and they do not consider sequentiality [16, 17]. On the other hand, typical sequential recommendation models only consider whether users interact with items and treat all actions as the same type, making it impossible to distinguish the impact of different behavior types on their preferences.
In this paper, we propose behavior sessions and time-aware for multi-targets sequential recommendation model (BTMT). This model introduces the behavior information of interaction from users’ historical sequences to investigate the impact of each behavior type on users’ interest changes. BTMT learns attention weights that can reflect the importance of each behavior type while capturing the impact of different time intervals under the same sequential relationship and dynamically models the changing process of users’ interests through the combination of behavior information and time information. In addition, we are not limited to predicting a single behavior of users but utilize associations between different behaviors to predict and jointly optimize for each behavior. This multi-category target prediction method can pay more comprehensive attention to each interaction of users and try the exploratory recommendation, providing more accurate and effective support for the personalized recommendation. Our main contributions are as follows:
- We consider attention weights of different behaviors and input them as correction factors into the temporal attention network, modeling both behavior and time information.
- We treat each behavior type as a target and make predictions for each target separately. Furthermore, we jointly train and optimize these targets, achieving parameter sharing.
- We conducted extensive experiments on datasets from four distinct platforms, and the results reveal that our model outperforms the latest baselines across all targets.
2 Related work
Conventional recommendation methods [18,19,20,21] use explicit or implicit interactions to model users’ preferences. They primarily aim to capture users’ long-term preferences, which are usually static or change slowly over time. However, these methods ignore the order of users’ interactions with items, which significantly limits their ability to capture users’ dynamic preferences. To address this issue, sequential recommendation focuses on predicting the user’s next interaction or interest based on the user’s interaction sequence within a period of time. Early traditional sequential recommendation methods were mainly based on sequential pattern mining [22, 23] and Markov chains [3, 5, 6]. Sequential pattern mining methods often require multiple passes through the entire sequence dataset and produce a large number of redundant patterns. Hidasi et al. [24] pioneered the use of RNN (Recurrent Neural Network) in sequential recommendation, controlling the information transfer in user sequences through the propagation of the GRU (Gated Recurrent Unit) network’s hidden state, facilitating the fusion of long-term and timely information. Inspired by image recognition, CNN (Convolutional Neural Networks) [4, 8, 25] and GNN (Graph Neural Networks) [26,27,28] have also been introduced to sequential recommendation. Caser [4] used CNN to extract sequential patterns of point-level, union-level, and skip behaviors. Wu et al. [26] constructed session graphs and used the adjacency matrix as input to the GRU model, allowing the learning of node representations. Recently, FMLP-Rec [29] simplified increasingly complex model structures, adopting an MLP (MultiLayer Perceptron) architecture in the frequency domain to simulate filtering mechanisms, reducing model complexity while improving performance on sparse datasets. 
However, although the simpler FMLP-Rec can mitigate overfitting issues of complex models on sparse datasets, it still has limitations on dense datasets.
The attention mechanism mimics how humans focus on the more critical aspects of what they observe, allowing limited computational resources to be allocated to more critical information and thus improving the model’s ability to handle long sequences. Consequently, the attention mechanism is widely employed in sequential recommendation [12, 30,31,32]. Ren et al. [33] considered the phenomenon of repeat purchases in daily consumption and integrated the attention mechanism into RNN. Li et al. [30] constructed an Encoder-Decoder framework that combines the attention mechanism and RNN to model users’ preferences from sequences. With the rise of the Transformer, the self-attention mechanism replaced traditional models (such as CNN and RNN) and achieved state-of-the-art performance in sequential recommendation [12,13,14,15]. These self-attention-based models excel in handling long sequences and exhibit better generalization performance. Kang et al. [12] were the first to use the self-attention mechanism to extract users’ historical behavior information, assigning attention weights to past items automatically at each time step. Zhang et al. [14] combined the self-attention mechanism and the metric learning framework to model relationships between items, considering both a user’s short-term and long-term preferences. These models utilize the position of each item as input, thereby capturing the sequential relationships in users’ historical interactions. Some studies [13, 34, 35] extended this approach by introducing time intervals between items, distinguishing the importance of two interactions at different time intervals under the same sequence. To address data sparsity, various item features were incorporated into the model, allowing the extraction of more available information within limited interaction sequences [36, 37]. 
Recently, the Temporal Lag aware model TLSRec [38] took into account both local fluctuations and global stability of users’ preferences, employing the hierarchical self-attention network to learn a user’s short and long-term preferences. However, most of these methods generally overlook the influence of different types of behavior on users’ interests and only optimize for a single target. Our work preserves time intervals between interactions and introduces behavior types of each interaction of users to distinguish the importance of different behaviors. Furthermore, we treat each behavior type as a target, predict it separately, and then perform joint optimization to achieve information sharing between targets.
3 Methodology
We propose a sequential recommendation model, BTMT, that integrates users’ multi-behavior information and temporal information. The model consists of an embedding layer, a behavior weight layer, a temporal attention network layer, and a multi-target prediction layer. The architecture of BTMT is illustrated in Fig. 1, and its construction is elaborated in this section.
3.1 Data preprocessing
In this section’s sequential recommendation task, let \(\mathcal {U}\), \(\mathcal {I}\), and \(\mathcal {B}\) represent the user set, item set, and behavior types set, respectively. The behavior types set follows the order of users’ actions from discovery to purchasing an item (e.g., Click-Favorites-Purchase). We provide a user’s interaction sequence \(L^u = (l_1^u, l_2^u, \cdots , l_{|L^u |}^u)\), where \(u \in \mathcal {U}\), \(l_i^u \in \mathcal {I}\), the behavior types sequence \(B^u = (b_1^u, b_2^u, \cdots , b_{|B^u |}^u)\) corresponding to each interaction, where \(b_i^u \in \mathcal {B}\), and the timestamp sequence \(T^u = (t_1^u, t_2^u, \cdots ,t_{|T^u |}^u)\) corresponding to each interaction, where \(|L^u |=|B^u |=|T^u |\).
This task aims at different behavior types and performs multi-target prediction on the next potential interaction items for the user. Therefore, we take the item sequence \((l_1^u, l_2^u, \cdots , l_{|L^u |-1}^u)\) (excluding the last item) in the interaction sequence, the corresponding behavior types sequence \((b_1^u, b_2^u, \cdots , b_{|B^u |-1}^u)\), and the corresponding timestamp sequence \((t_1^u, t_2^u, \cdots ,t_{|T^u |-1}^u)\) as inputs to the model. Additionally, we consider the next item sequence \((l_2^u, l_3^u, \cdots , l_{|L^u |}^u)\) of the interaction corresponding to each time step as the expected prediction output. It is worth noting that when making predictions, we divide them according to the behavior types corresponding to the different expected output items and perform multi-target prediction and optimization. Our notation is summarized in Table 1.
To enable the input of interaction sequences with different lengths into the model, we set a fixed length \(n \in \mathbb {N}\) for each user’s interaction sequence. We assume that once the number of a user’s interactions surpasses this value, the impact of the earliest interacted items on predicting the next item can be considered negligible. For a user u, we fix the length of his/her historical interaction sequence \((l_1^u, l_2^u, \cdots , l_{|L^u |-1}^u)\) to form a new input sequence \(l = (l_1, l_2, \cdots , l_n)\). If \(|L^u |- 1 > n\), we only select the most recent n interactions; if \(|L^u |- 1 < n\), we pad the left side of the sequence with zeros until its length reaches n. At the same time, the corresponding behavior types sequence \((b_1^u, b_2^u, \cdots , b_{|B^u |-1}^u)\) and timestamp sequence \((t_1^u, t_2^u, \cdots ,t_{|T^u |-1}^u)\) undergo similar transformations, resulting in sequences \(b = (b_1, b_2, \cdots , b_n)\) and \(t = (t_1, t_2, \cdots , t_n)\), respectively. Unlike the input sequence, the padding items for the behavior types sequence are set to the first behavior type in \(\mathcal {B}\). For the timestamp sequence, the padding items are set to the timestamp corresponding to the earliest interaction item \(l_i\) in l. Furthermore, we utilize the behavior types sequence \((b_1^u, b_2^u, \cdots , b_{|B^u |-1}^u)\) of user u to extract the interaction items corresponding to each behavior type from the user’s historical interaction sequence \((l_1^u, l_2^u, \cdots , l_{|L^u |-1}^u)\), and we reorganize the items into m new sequences \(\{ L_1^u,L_2^u,\cdots ,L_m^u \}\) according to behavior type, where the items in each subsequence are arranged in time order. 
For the interaction sequence \(L_i^u\) of the i-th behavior, we fix its sequence length to \(n_i \in \mathbb {N}\), forming a behavior session \(l^i=(l_1^i,l_2^i,\cdots ,l_{n_i}^i)\), which serves as the input to the behavior weight layer after passing through the embedding layer. Similarly, when \(|L_i^u |- 1 > n_i\), we only select the most recent \(n_i\) interactions of that behavior type. When \(|L_i^u |- 1 < n_i\), we add padding items of zero to the left of the sequence until its length is \(n_i\).
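The truncation and padding rules described above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of item ID 0 as the padding item, and the string behavior labels are assumptions made for the example.

```python
def fix_length(seq, n, pad):
    """Keep the most recent n entries; left-pad shorter sequences with `pad`."""
    seq = list(seq)[-n:]
    return [pad] * (n - len(seq)) + seq


def build_inputs(items, behaviors, times, n, first_behavior):
    """Build the fixed-length sequences l, b, and t for one user (n >= 1)."""
    l = fix_length(items, n, 0)                   # assumed: item ID 0 is the padding item
    b = fix_length(behaviors, n, first_behavior)  # pad with the first type in B
    earliest_kept = list(times)[-n:][0]           # timestamp of the earliest kept item
    t = fix_length(times, n, earliest_kept)
    return l, b, t


def behavior_sessions(items, behaviors, session_lengths):
    """Split items by behavior type (time order preserved) and fix each length n_i."""
    sessions = {}
    for behavior, n_i in session_lengths.items():
        own = [item for item, bh in zip(items, behaviors) if bh == behavior]
        sessions[behavior] = fix_length(own, n_i, 0)
    return sessions
```

For example, `build_inputs([5, 7, 9], ["click", "fav", "buy"], [10, 20, 30], 5, "click")` left-pads all three sequences to length 5, using 0 for items, `"click"` for behaviors, and 10 for timestamps.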
Similar to [13], we introduce the absolute time relationship to model the impact of users’ interaction behaviors at different time intervals on predicting the next item. We denote the absolute time relationship between two interactions as their relative time interval. For the fixed-length time sequence \(t = (t_1, t_2, \cdots , t_n)\), the absolute time relationship between item i and item j can be expressed as:
Where \(t_{\text {min}} \in \mathbb {N}\) represents the minimum value of users’ absolute time intervals, and \(r_{ij} \in \mathbb {N}\). We give a maximum time interval \(t_{\text {max}}\), so that when the relative time interval becomes excessively large, it is limited to \(t_{\text {max}}\). After this constraint, the absolute time relationships between all interactions form users’ absolute time relationship matrix \(R \in \mathbb {N}^{n \times n}\):
where the elements on the main diagonal are all set to zero.
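One way to build such a clipped time relationship matrix is sketched below. The sketch follows the convention of [13] as we understand it: each interval is divided by the user's smallest non-zero interval, rounded down, and clipped at \(t_{\text {max}}\). The exact scaling is an assumption, as are the function name and the use of integer timestamps.

```python
def time_relation_matrix(t, t_max):
    """Return the n x n matrix R of scaled, clipped time intervals for one user.

    Assumes integer timestamps. t_min is taken to be the smallest non-zero
    interval between any two interactions (an assumption following [13]).
    """
    n = len(t)
    gaps = [abs(t[i] - t[j]) for i in range(n) for j in range(n) if t[i] != t[j]]
    t_min = min(gaps) if gaps else 1  # all timestamps equal: every interval is 0
    return [
        [min(abs(t[i] - t[j]) // t_min, t_max) for j in range(n)]
        for i in range(n)
    ]
```

The main diagonal is zero by construction, because the interval between an interaction and itself is zero.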
3.2 Embedding layer
We establish an item embedding matrix \(I \in \mathbb {R}^{|\mathcal {I} |\times d}\) and a behavior embedding matrix \(B \in \mathbb {R}^{|\mathcal {B} |\times d}\), retrieving item embeddings and behavior embeddings of users’ input sequence, where d is the latent vector dimension of the model. To better capture users’ intent under different behaviors, we introduce the behavior types of the next item corresponding to each time step in the sequence as a known condition into the input sequence, forming an input embedding matrix \(E \in \mathbb {R}^{n \times d}\). Specifically, we introduce the behavior information of the next item using an element-wise addition, which helps avoid the large number of parameters associated with other feature cross methods (e.g., concatenation):
In addition to the absolute time relationships, we still employ learnable relative position embeddings to model the sequentiality of the input sequence. For the keys and values of the temporal attention network layer, we denote their relative positions as \(P^K \in \mathbb {R}^{n \times d}\) and \(P^V \in \mathbb {R}^{n \times d}\), respectively:
To distinguish the extracted interaction sequences for each behavior and the overall historical interaction sequence of users, ensuring that they do not influence each other, we established a new item embedding matrix \(I^B \in \mathbb {R}^{|\mathcal {I} |\times d_b}\), where \(d_b\) is the latent vector dimension of multi behaviors. The item embeddings for each session in \(\{ l^1,l^2,\cdots ,l^m \}\) are retrieved from \(I^B\), obtaining the behavior session embedding matrix, which serves as the input to the behavior weight layer:
Where \(E_i^B \in \mathbb {R}^{n_i \times d_b}\) is the embedding matrix for the i-th behavior session. As the behavior weight layer cannot capture the position of each item of the sequence, we introduce different learnable relative position embeddings to each behavior session embedding:
Where \(p_j^i \in \mathbb {R}^{d_b}\) is the relative position embedding for the j-th item in the i-th behavior interaction sequence. Similar to the model’s relative position embeddings, we use \(R^K \in \mathbb {R}^{n \times n \times d}\) and \(R^V \in \mathbb {R}^{n \times n \times d}\) to represent the absolute time relationships for the keys and values of the temporal attention network layer:
Where \(r_{ij}^K \in \mathbb {R}^d\), \(r_{ij}^V \in \mathbb {R}^d\).
3.3 Behavior weight layer
We employ a self-attention mechanism [11] to capture users’ attention to different behavior types, where the computed dot-product attention is defined as:
Where Q represents the query, K represents the key, V represents the value, and the scaling factor \(\sqrt{d}\) is used to scale the dot-product attention to avoid excessively large values.
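For reference, the standard scaled dot-product attention of [11], \(\text{softmax}(QK^{T}/\sqrt{d})V\), can be written in a few lines of plain Python. The sketch uses nested lists instead of a tensor library so that it stays self-contained; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out
```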
To prevent interference between parameters of different behaviors, we obtain attention weights for each behavior type separately by applying different attention modules to different behavior session embeddings. Specifically, for the i-th behavior session embedding \(\hat{E}_i^B\), it is converted into three matrices through linear projection, corresponding to Q, K, and V of \(Attention(\cdot )\), respectively:
Where \(W_i^Q, W_i^K, W_i^V \in \mathbb {R}^{d_b \times d_b}\) are the projection matrices of queries, keys, and values. The attention module consists of a self-attention layer followed by a two-layer MLP, and the output of the self-attention layer for the i-th behavior is:
After the self-attention layer, we use two MLP layers to model the nonlinear relationship between items and obtain the attention weight of the i-th behavior:
Where \(W_i^{MLP} \in \mathbb {R}^{d_b \times 1}\), \(W_i^{att} \in \mathbb {R}^{n_i \times 1}\),\(b_i^{MLP} \in \mathbb {R}^{n_i \times 1}\) and \(b_i^{att} \in \mathbb {R}\) are learnable parameters, \(\alpha _i\) is users’ attention weight of the i-th behavior. We use Tanh as the activation function. Compared to the softmax function, Tanh can map \(\alpha _i\) to the interval (-1,1), providing a more intuitive reflection of whether the behavior should be emphasized or ignored.
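The following sketch shows one computation that is consistent with the parameter shapes listed above: the first MLP layer reduces each of the \(n_i\) item vectors to a scalar, and the second layer reduces the \(n_i\) scalars to a single value that Tanh maps into (-1, 1). The exact composition, including whether an activation sits between the two layers, is an assumption inferred from the shapes and not a statement of the authors' formula.

```python
import math


def behavior_weight(S, w_mlp, b_mlp, w_att, b_att):
    """Map a self-attention output S (n_i x d_b) to one scalar weight in (-1, 1).

    w_mlp: length d_b, b_mlp: length n_i, w_att: length n_i, b_att: scalar.
    Assumed form: tanh(w_att . (S w_mlp + b_mlp) + b_att).
    """
    # First layer: one scalar per item in the behavior session.
    hidden = [sum(x * w for x, w in zip(row, w_mlp)) + b for row, b in zip(S, b_mlp)]
    # Second layer: collapse the session to a single score, then squash.
    return math.tanh(sum(h * w for h, w in zip(hidden, w_att)) + b_att)
```

A negative output would indicate a behavior the model learns to de-emphasize, which is the property the text attributes to Tanh over softmax.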
3.4 Temporal attention network layer
In other traditional sequential recommendation methods, each item in the model’s input sequence can be considered multiplied by a coefficient that is always 1, thereby blurring the impact of different behavior types. Therefore, we introduce behavior information by modifying this coefficient. To ensure that the user’s historical interaction sequence carries behavior information without affecting the model’s ability to capture the sequentiality of historical interactions, we extract the attention weights of each behavior type in the fixed-length behavior types sequence \((b_1, b_2, \cdots , b_n)\), getting a new behavioral attention sequence \(A = (\alpha ^1, \alpha ^2, \cdots , \alpha ^n)\) where \(\alpha ^i\) represents the attention weight for behavior types corresponding to the i-th interaction.
We use attention weights as correction factors, applying them to the original inputs to obtain a new input embedding matrix \(\hat{E} \in \mathbb {R}^{n \times d}\):
It is worth noting that the input matrix now carries item information, the preference degree of the corresponding behavior, and the next item’s behavior type.
We utilize the self-attention mechanism to model the order relationships of the user’s historical interaction behaviors. When calculating scaled dot-product attention [11], drawing inspiration from [13], we take into account both the relative position of the input sequence and the absolute time relationship between each item of the sequence.
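As a small illustration of the correction step, the sketch below looks up the learned weight of each position's behavior type and multiplies the corresponding input embedding by it. Multiplication is assumed here because the text describes replacing a coefficient that is otherwise always 1; the function name and the dictionary of weights are hypothetical.

```python
def apply_behavior_weights(E, b, alpha):
    """Scale each row of E (n x d) by the weight of that position's behavior type.

    b is the fixed-length behavior types sequence; alpha maps a behavior
    type to its learned attention weight.
    """
    A = [alpha[behavior] for behavior in b]  # behavioral attention sequence
    return [[a * x for x in row] for a, row in zip(A, E)]
```

With all weights equal to 1, the function returns the original embeddings, which corresponds to the behavior-agnostic setting described above.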
Temporal attention layer
For a given input embedding \(\hat{E}\), a new sequence \(S = (S_1,S_2,\cdots ,S_n)\) is computed using self-attention weights, where each element \(S_i \in \mathbb {R}^{d}\) is the weighted sum of the linearly transformed input element, the relative position embedding, and the absolute time relationship embedding:
Where \(W^V \in \mathbb {R}^{d \times d}\) is the projection matrix of values, \(m_{ij}\) is self-attention weight, defined as:
Where \(W^Q \in \mathbb {R}^{d \times d}\) is the projection matrix of the query and \(W^K \in \mathbb {R}^{d \times d}\) is the projection matrix of the key.
To prevent information leakage, i.e., the output at time step t containing the impact of subsequent items on predicting the next item, we employ a mask to disable subsequent keys from being connected to the current query at time step t, except for the behavior type of the next item.
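The sketch below illustrates a time-aware, causally masked attention step in the style of [13]: position and time-relationship embeddings are added to the keys before scoring and to the values before the weighted sum, and each query only attends to positions at or before its own. It takes already projected queries, keys, and values as input, and the additive form is an assumption based on [13] and on the description above, not a confirmed reproduction of the model.

```python
import math


def temporal_attention(q, k, v, p_k, p_v, r_k, r_v):
    """q, k, v, p_k, p_v are n x d lists; r_k and r_v are n x n x d lists."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Causal mask: query i only sees keys j <= i.
        scores = []
        for j in range(i + 1):
            key = [k[j][c] + r_k[i][j][c] + p_k[j][c] for c in range(d)]
            scores.append(sum(a * b for a, b in zip(q[i], key)) / math.sqrt(d))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of values enriched with time and position embeddings.
        row = [0.0] * d
        for j, w in enumerate(weights):
            for c in range(d):
                row[c] += w * (v[j][c] + r_v[i][j][c] + p_v[j][c])
        out.append(row)
    return out
```

The behavior type of the next item does not need special handling in the mask here, because it is already part of the input embedding at each step.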
Point-wise feed-forward network
In order to introduce nonlinearity to the model, we incorporate a point-wise two-layer feed-forward network following the self-attention layer, incorporating the ReLU in the middle:
Where \(W_1 \in \mathbb {R}^{d \times d}\) and \(W_2 \in \mathbb {R}^{d \times d}\) are weight matrices, and \(b_1 \in \mathbb {R}^{d}\) and \(b_2 \in \mathbb {R}^{d}\) are bias vectors.
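A point-wise two-layer feed-forward network with a ReLU in the middle, applied to a single position, can be sketched as follows. The form \(\text{ReLU}(S_i W_1 + b_1) W_2 + b_2\) is the usual one from [12] and is assumed here; the function name is illustrative.

```python
def feed_forward(s, W1, b1, W2, b2):
    """Apply ReLU(s W1 + b1) W2 + b2 to one position vector s of length d."""

    def affine(x, W, b):
        # Row vector times matrix, plus bias.
        return [sum(x[r] * W[r][c] for r in range(len(x))) + b[c] for c in range(len(b))]

    hidden = [max(0.0, h) for h in affine(s, W1, b1)]  # ReLU
    return affine(hidden, W2, b2)
```

Because the same weights are applied to every position independently, the network adds nonlinearity without mixing information across time steps.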
Stacking blocks
Drawing inspiration from [12], we alternately stack self-attention layers and feed-forward network layers to learn more complex item transformations. To counter problems such as overfitting and vanishing gradients, which are common in deep stacks, we apply dropout, layer normalization, and residual connections to each layer, following the approach in [12]:
Where g(x) represents the self-attention layer or the point-wise feed-forward network layer.
3.5 Multi-target prediction layer
After passing through c self-attention layers and feed-forward network layers, we obtain a user preference representation that incorporates behavior information and time intervals. At time step t, we distinguish the \((t+1)\)-th behavior type of different users for multi-target prediction and calculate the user’s prediction score for item \(l_i\) when the predicted behavior type is \(b_{t+1}\):
Where \(s_{i, t}^{b_{t+1}}\) represents the probability that the next item is \(l_i\) when the target type is \(b_{t+1}\). We generate a list of K (e.g., 10) items with the highest scores for each behavior type and recommend them to the user.
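The top-K selection can be illustrated with the sketch below. It assumes that the score of an item is the dot product between the user representation at step t and the item embedding, as in [12]; the scoring function, the function name, and the integer item IDs are assumptions for the example.

```python
def top_k_items(user_state, item_embeddings, k=10):
    """Return the k item IDs with the highest dot-product scores.

    user_state: vector of length d produced for one behavior target.
    item_embeddings: dict mapping item ID to its embedding vector.
    """
    scores = {
        item: sum(u * e for u, e in zip(user_state, emb))
        for item, emb in item_embeddings.items()
    }
    # Sort by descending score; break ties by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]
```

Because the behavior type of the next item is part of the input, running the model once per behavior type would yield a different `user_state`, and therefore a separate recommendation list, for each target.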
3.6 Model training
We distinguish each prediction target and conduct joint training for different behavior types inspired by multi-task learning [39, 40]. At time step t, we set the behavior type \(b_{t+1}\) of the next item as the current prediction target. Specifically, for the fixed-length interaction sequence \(l = (l_1, l_2, \cdots , l_n)\), our expected output sequence is \(O=(o_1, o_2, \cdots , o_n)\), where the expected output at time step t is defined as:
Where \(<pad>\) represents the padding item. At time step t, we perform negative sampling for the positive sample \(o_t\) to obtain the negative sample \(f_t \notin L^u\). To distinguish different behavior targets, we introduce the attention weight \(\alpha ^{t+1}\) corresponding to \(b_{t+1}\) for weighting and use binary cross-entropy as the objective function for joint training:
Where \(\sigma (\cdot )\) is sigmoid function. Joint training allows parameter sharing among multiple targets while balancing the importance of different targets in a weighted manner. We optimize the model with the ADAM optimizer [41], which is a variant of stochastic gradient descent (SGD) incorporating adaptive moment estimation.
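A weighted binary cross-entropy objective of this kind can be sketched as follows. The sketch assumes that each step contributes \(-\alpha ^{t+1}[\log \sigma (s^{+}) + \log (1-\sigma (s^{-}))]\) and that padded steps are skipped; where exactly the weight enters the loss is an assumption, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_bce(pos_scores, neg_scores, weights, targets, pad=0):
    """Sum the weighted BCE over time steps, ignoring padded targets.

    weights[t] is the attention weight of the behavior type of target t.
    """
    loss = 0.0
    for s_pos, s_neg, w, o in zip(pos_scores, neg_scores, weights, targets):
        if o == pad:
            continue  # padded step: no prediction target
        # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
        loss -= w * (log_sigmoid(s_pos) + log_sigmoid(-s_neg))
    return loss
```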
4 Experiments
In this section, we conduct extensive evaluations on different types of datasets to verify the performances of our model. Our experiments aim to address the following questions:
RQ1: Can our model outperform state-of-the-art baselines?
RQ2: Can our model capture the level of attention of different behavior types of users, and how will establishing a connection between the current item and the next behavior type during the training phase affect the model’s performance?
RQ3: Can the multi-target joint training approach outperform single-target recommendation methods?
4.1 Experimental settings
Datasets
We selected four datasets in real scenarios to conduct sufficient experiments. These datasets contain behavior information and time information of each interaction:
- Games: A sub-dataset of the Amazon dataset collected by McAuley et al. [42], which contains users’ purchases and ratings generated on the Amazon e-commerce platform.
- MovieLens: A collection sourced from the GroupLens research team at the University of Minnesota and categorized into four versions based on user count. We used one of these versions, "ML-1M".
- Tmall: A dataset from the Tmall marketplace, containing six months of shopping records. It includes four types of user behavior: click, add-to-favorite, add-to-cart, and purchase. We removed interaction records corresponding to the add-to-cart behavior to control the number of parameters.
- YooChoose: A dataset released for the RecSys Challenge 2015 competition, which includes six months of users’ clickstreams on an e-commerce website. It consists of two types of user behavior: click and purchase.
For all datasets, we sorted interactions by timestamp and ensured that click and add-to-favorites behaviors for the same item did not occur after a purchase of that item. Rating levels resemble behavior types in that they reflect different degrees of user preference, so for the Games and ML-1M datasets, which contain only rating data, we mapped ratings to behavior types: interactions with a rating of 5 (the maximum rating) were treated as purchases, interactions with a rating of 4 as favorites, and all other interactions as clicks. We filtered out users with fewer than five interactions, together with their interactions, in all datasets. On all datasets, we distinguish the behavior types corresponding to the last two items of each user’s interaction sequence, use the last two items as the validation set and test set for the corresponding behavior targets, and use all the remaining items as the training set. Table 2 summarizes the statistics of the datasets.
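The rating-to-behavior mapping and the per-user split can be sketched as below. The behavior labels and function names are illustrative, and assigning the second-to-last item to validation and the last item to test is an assumption based on common practice, since the text does not state the order explicitly.

```python
def rating_to_behavior(rating):
    """Map a rating to a pseudo behavior type (used for Games and ML-1M)."""
    if rating == 5:
        return "purchase"
    if rating == 4:
        return "favorite"
    return "click"


def split_user(interactions):
    """Split one user's (item, rating, timestamp) records into train/valid/test.

    Assumes the user has at least five interactions, as enforced by filtering.
    """
    ordered = sorted(interactions, key=lambda record: record[2])
    labeled = [(item, rating_to_behavior(r), ts) for item, r, ts in ordered]
    return labeled[:-2], labeled[-2], labeled[-1]
```

The behavior type attached to the validation and test items determines which behavior target each of them is evaluated under.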
Evaluation metrics
To evaluate the performance of each model, we employed the metrics of Hit Rate (HR@N) and Normalized Discounted Cumulative Gain (NDCG@N) as evaluation indicators [20]. These metrics are defined as follows:
Where N represents the length of the recommendation list, M represents the number of users, and hits(i) indicates whether the target item of user i (i.e., the real next interacted item) is recommended: \(hits(i)=1\) if it is, and \(hits(i)=0\) otherwise. Additionally, \(p_i\) represents the position of the target item of user i in the recommendation list; the further down the list the item appears, the larger \(p_i\) is. If the item cannot be retrieved from the list, \(p_i \rightarrow \infty \). Notably, for BTMT, we evaluate each behavior target using the above indicators (i.e., distinguishing the behavior types of real interactive items for evaluation). In our experiments, we set N to 10. For each user, we add 100 randomly chosen negative samples [43] to the target item and calculate the ranking of these 101 items in the recommendation list, thus obtaining HR@10 and NDCG@10. Higher scores on these two metrics indicate better model performance.
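The evaluation protocol can be sketched as follows, assuming the common definitions \(HR@N = \frac{1}{M}\sum_{i} hits(i)\) and \(NDCG@N = \frac{1}{M}\sum_{i} \frac{1}{\log _2(p_i+1)}\), where a target ranked beyond position N contributes zero. These formulas are the standard ones for a single target item per user and are assumed to match the definitions used here; the function names are illustrative.

```python
import math


def rank_of_target(target_score, negative_scores):
    """1-based position of the target among the target plus sampled negatives."""
    return 1 + sum(1 for s in negative_scores if s > target_score)


def hr_ndcg_at_n(ranks, n=10):
    """Average HR@n and NDCG@n over users, given each user's target rank."""
    hits = [1 if r <= n else 0 for r in ranks]
    gains = [1 / math.log2(r + 1) if r <= n else 0.0 for r in ranks]
    return sum(hits) / len(ranks), sum(gains) / len(ranks)
```

In the setting described above, `negative_scores` would hold the scores of the 100 sampled negatives, so each rank lies between 1 and 101.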
Baselines
We conducted several experiments, comparing BTMT with the following classical methods and state-of-the-art methods as baselines to validate the effectiveness of our approach:
- BPR [44]: It employs latent vectors to represent users and items, learning the preference order of items for users instead of traditional rating prediction, making it more suitable for implicit feedback data.
- FMC/FPMC [6]: FMC factorizes a first-order Markov chain over item transitions. FPMC extends FMC by combining matrix factorization with the Markov chain to model users’ personalized sequential interaction behavior.
- TransRec [45]: It models relationships between users and items using translation operations, utilizing the shared embedding space to capture the relative position relationships between users and items.
- Caser [4]: It uses CNN to learn feature representations in sequences, capturing relationships between items at different distances.
- SASRec [12]: It employs a unidirectional self-attention mechanism to capture the order relationships of users’ interaction behavior, adaptively learning weights of different positions in the sequence.
- SSE-PT [15]: It introduces personalized attention mechanisms and uses SSE-SE [46] regularization to mitigate overfitting caused by the inclusion of user embedding vectors.
- TiSASRec [13]: It introduces a time interval-aware module to capture users’ interest changes under different time intervals.
- FMLP-Rec [29]: It utilizes an MLP structure and introduces a filtering enhancement mechanism to handle noise and redundancy in users’ historical behavior sequences.
- TLSRec [38]: It learns users’ long-term and short-term preferences by dividing sessions and using a hierarchical self-attention network, achieving the fine-grained fusion of long-term and short-term preferences through the neural time gate.
Implementation details
In our experiments, we select the model’s latent dimension d and the multi-behavior embedding dimension \(d_b\) from {10, 20, 30, 40, 50}. We choose the regularization parameter \(l_2\) from {0.0001, 0.001, 0.01, 0.1, 1} and consider learning rates from {0.1, 0.01, 0.001, 0.0001}. For other hyperparameters, we follow the recommendations of the respective method authors and employ their suggested initialization strategies. Furthermore, we closely monitor validation performance during training and adjust hyperparameters based on the validation results. If validation performance does not improve within 20 epochs, training is terminated.
Our model is implemented in PyTorch and optimized with Adam [41], using a learning rate of 0.001 and a batch size of 128. In the time-attention network layer, we stack two blocks of self-attention and feed-forward network layers, i.e., \(c=2\). We determine the maximum sequence length \(n\) based on the average sequence length of each dataset and set the behavior session lengths (\(n_{click}\), \(n_{favorite}\) and \(n_{purchase}\)) based on the number of interactions of each behavior type. Specifically, for the ML-1M dataset, we set \(n\) to 200, the click session length \(n_{click}\) to 100, the adding to favorites session length \(n_{favorite}\) to 50, and the purchase session length \(n_{purchase}\) to 50. For the Games dataset, we set \(n\) to 50, \(n_{click}\) to 30, \(n_{favorite}\) to 10, and \(n_{purchase}\) to 10. For the Tmall dataset, \(n\) is set to 100, \(n_{click}\) to 50, \(n_{favorite}\) to 30, and \(n_{purchase}\) to 20. For the YooChoose dataset, \(n\) is set to 50, \(n_{click}\) to 30, and \(n_{purchase}\) to 20. Additionally, to accommodate the different time-interval characteristics of the datasets, we set the maximum time interval \(t_{\text {max}}\) to 512 for the ML-1M dataset and 256 for the remaining datasets. The dropout rate is set to 0.5 for all datasets.
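For reference, the per-dataset settings listed above can be collected in one configuration table. The sketch below only restates the reported numbers; the `DatasetConfig` class, the `SHARED` dictionary and the helper `session_budget_matches` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DatasetConfig:
    n: int                           # maximum sequence length
    session_lengths: Dict[str, int]  # behavior session length per behavior type
    t_max: int                       # maximum time interval


# Values as reported in the implementation details.
CONFIGS: Dict[str, DatasetConfig] = {
    "ML-1M": DatasetConfig(200, {"click": 100, "favorite": 50, "purchase": 50}, 512),
    "Games": DatasetConfig(50, {"click": 30, "favorite": 10, "purchase": 10}, 256),
    "Tmall": DatasetConfig(100, {"click": 50, "favorite": 30, "purchase": 20}, 256),
    "YooChoose": DatasetConfig(50, {"click": 30, "purchase": 20}, 256),
}

# Settings shared by all datasets.
SHARED = {"optimizer": "Adam", "lr": 0.001, "batch_size": 128, "c": 2, "dropout": 0.5}


def session_budget_matches(cfg: DatasetConfig) -> bool:
    """Check whether the behavior session lengths add up to the maximum sequence length."""
    return sum(cfg.session_lengths.values()) == cfg.n
```

In all four reported configurations the behavior session lengths sum exactly to \(n\) (for example, 100 + 50 + 50 = 200 on ML-1M), so `session_budget_matches` returns `True` for every entry.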
4.2 Performance comparison (RQ1)
Table 3 reports the primary results of BTMT and all baseline methods on the four platform datasets. In this table, we evaluate BTMT for each behavior target prediction, computing the scores of the click, adding to favorites, and purchase targets separately. On all datasets, the scores of every behavior target exceed those of all baselines, indicating that our proposed BTMT outperforms these state-of-the-art methods. Among the baselines, the differences mainly stem from the fact that Markov chain-based models focus more on relationships between items, whereas neural networks are better at capturing long sequences. However, the order in some neural network models is usually set to a small value and cannot be adjusted dynamically. Consequently, Caser exhibits inferior performance compared to SASRec, which adjusts adaptively. FMLP-Rec performs well on sparse datasets but poorly on dense datasets such as ML-1M, where it falls behind TiSASRec and earlier self-attention models. This might be because FMLP-Rec uses a frequency-domain MLP layer instead of the more complex Transformer architecture. This simplification mitigates the overfitting that arises on sparse data when the amount of available information is small relative to the number of model parameters. TLSRec is the best-performing baseline, suggesting that considering both order relationships between items and dependency relationships between sessions helps to learn users’ interests.
For Games and Tmall, BTMT improved NDCG@10 for all behavior targets by over 4%, and the largest gains were observed for the click target, which improved by 9% and 6%, respectively. One possible reason is that our model introduces behavior information, allowing it to capture the impact of different behaviors on users’ interests while increasing the available information for sparse data. Another is that we train the behavior targets jointly, which facilitates information sharing among them and leads to more effective use of the available data. On YooChoose, BTMT increased NDCG@10 for all behavior targets by over 2%, and the performance of the click target and the purchase target is similar. Even on the relatively dense ML-1M dataset, BTMT increased NDCG@10 for the most critical purchase target by over 2%. This indicates that, on both sparse and dense data, BTMT can exploit more useful information while reducing overfitting. The experimental results demonstrate that introducing behavior information to explore the impact of different behaviors on users’ interests, together with multi-target joint training to connect the behavior targets, is valuable for personalized recommendation.
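As a reminder of what these numbers measure, the sketch below computes NDCG@10 for a single test case and a relative gain between two scores. It assumes the common sequential-recommendation protocol with exactly one ground-truth item per test case, and it assumes the percentages above are relative improvements; neither assumption is stated in this section. The function names are illustrative.

```python
import math


def ndcg_at_k(rank: int, k: int = 10) -> float:
    """NDCG@k for one test case with a single relevant item.

    `rank` is the 0-based position of the ground-truth item in the ranked list.
    With one relevant item the ideal DCG is 1, so NDCG equals the discounted gain.
    """
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of `new_score` over `baseline_score`, e.g. 0.09 for a 9% gain."""
    return (new_score - baseline_score) / baseline_score
```

For example, a ground-truth item ranked first scores 1.0, one ranked second scores \(1/\log_2 3 \approx 0.63\), and one ranked outside the top 10 scores 0.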
4.3 Behavior information fusion (RQ2)
In real-life scenarios, users often perform many click actions before a smaller number of adding to favorites actions, and eventually purchase the item they like the most. Therefore, users’ preferences for items they click should be weaker than for items they add to favorites, and much weaker than for items they purchase. In other words, users’ attention towards purchased items should be the strongest. To verify whether BTMT captures this distinction, we visualize the attention weights for different behaviors learned through behavior sessions. Figure 2 shows the attention weights that the behavior weight layer computes for several users’ clicking, adding to favorites, and purchasing behaviors on the ML-1M and Tmall datasets; color depth represents the magnitude of the attention weight. We observe that on both the sparse Tmall and the dense ML-1M datasets, all users exhibit the highest attention weights for purchase behavior, followed by adding to favorites behavior, and the lowest weights for click behavior. This means that introducing behavior information helps the model learn how important each behavior is to users, leading to more accurate modeling of users’ interests. For ML-1M, the gaps between the learned weights of the three behaviors are similar. For Tmall, however, the weight of the adding to favorites behavior learned by BTMT is closer to that of the purchase behavior. A possible reason is that treating rating data as behavior types makes the behavior types in ML-1M less correlated (each user has only one score for each movie), so the weights of the behaviors reflect the rating magnitude and show similar gaps. In comparison, Tmall has more items on which the same user has performed more than two types of interactions. Consequently, the behavior types are more strongly interrelated, and users are highly likely to purchase after adding to favorites.
In real interactive scenarios, different orders of behaviors often reflect different user intentions. To better distinguish users’ intentions under different behaviors, we introduce the next behavior corresponding to each time step as a known condition during the training phase. To explore how establishing a connection between the current item and the next behavior type affects performance, we remove the target behavior embedding from the training phase of BTMT to obtain a new model, BTMT_mask, and compare it with BTMT. The results in Fig. 3 show that, across all datasets, BTMT outperforms BTMT_mask on every behavior target. This demonstrates that incorporating the next behavior type as a known condition helps the model better establish order relationships between different behaviors.
Because we need to introduce the target behavior embedding in the prediction phase to distinguish the results of different targets, introducing the target type corresponding to each time step in the training phase can enable the model to more fully learn the connection between the historical interaction sequence and the target behavior, thereby capturing users’ preferences under different intentions.
4.4 Multi-target joint (RQ3)
The purpose of BTMT is to recommend more accurately the items that users may interact with through any behavior, so we use users’ historical behavior types to capture their future intentions and make recommendations based on different behavior intentions. In practice, however, platforms mostly want to know which items users are likely to purchase, so they focus on optimizing the purchase target while ignoring the click and adding to favorites targets. Although the ultimate target of users is to purchase, there is a strong connection between different behavior types. Therefore, the click and adding to favorites targets affect the prediction results of the purchase target. For the three behavior targets, we consider multiple optimization approaches: 1) Ignoring the true behavior type of the target item, setting all predicted behavior types in the prediction layer to purchase, and optimizing only for the purchase target. We denote this model as BTMT_all_purchase. 2) Distinguishing the behavior targets and training and optimizing each of them separately, unaffected by the other targets. This approach ignores the input data of all targets other than the current target during the prediction phase. We denote the behavior targets of this model as BTMT_o_click, BTMT_o_favorite, and BTMT_o_purchase. 3) Removing the weights of the different behavior types from the loss function during joint optimization, so that every behavior type has the same importance. We denote the behavior targets of this model as BTMT_s_click, BTMT_s_favorite, and BTMT_s_purchase.
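The difference between these optimization approaches can be illustrated with a toy loss function. The sketch below is an assumption-laden illustration, not the loss used by BTMT: it uses a binary cross-entropy over one positive and one negative score per training sample, and the names `Sample`, `bce_pair`, `joint_loss` and `relabel_all` are hypothetical. Passing `weights` corresponds to weighted joint optimization, omitting them to the equal-importance variant (BTMT_s), `only` to separate single-target optimization (BTMT_o), and `relabel_all` to treating every target as a purchase (BTMT_all_purchase).

```python
import math
from typing import Dict, List, Optional, Tuple

# (target behavior, score of the positive item, score of a sampled negative item)
Sample = Tuple[str, float, float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_pair(pos_score: float, neg_score: float) -> float:
    """Binary cross-entropy for one positive and one negative item."""
    return -math.log(sigmoid(pos_score)) - math.log(1.0 - sigmoid(neg_score))


def joint_loss(
    samples: List[Sample],
    weights: Optional[Dict[str, float]] = None,
    only: Optional[str] = None,
) -> float:
    """Sum per-sample losses, optionally weighted per behavior or restricted to one target."""
    total = 0.0
    for behavior, pos_score, neg_score in samples:
        if only is not None and behavior != only:
            continue  # single-target training ignores the other targets
        weight = 1.0 if weights is None else weights[behavior]
        total += weight * bce_pair(pos_score, neg_score)
    return total


def relabel_all(samples: List[Sample], behavior: str) -> List[Sample]:
    """Treat every target as the given behavior, ignoring its true type."""
    return [(behavior, pos, neg) for _, pos, neg in samples]
```

With equal weights the joint loss is simply the sum over all targets, so the only thing the weighted variant changes is how strongly each behavior type pulls on the shared parameters.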
Table 4 compares BTMT with the above models and with TLSRec, the best single-target baseline. We conducted comparisons on all datasets, revealing that all optimization approaches of BTMT outperform TLSRec. Specifically: 1) BTMT_all_purchase is essentially a single-target recommendation model, as it predicts and optimizes only for the purchase target. Even so, BTMT_all_purchase remains superior to TLSRec, which is also a single-target recommendation model. This indicates that introducing behavior information and distinguishing the importance of different behaviors enables more accurate prediction of users’ preferences. 2) BTMT_o yields lower results than BTMT_all_purchase for the purchase target on the sparse Games, Tmall, and YooChoose datasets, but its performance on ML-1M is nearly equivalent. This discrepancy may arise because optimizing the behavior targets separately reduces the amount of data for each target and weakens the correlation between data, which has a large impact on sparse datasets. For relatively dense datasets, however, appropriately reducing the amount of data can mitigate overfitting caused by excessive data. 3) Compared with BTMT_o, BTMT_s exhibits significantly improved performance, surpassing BTMT_all_purchase for the purchase target on all datasets. This suggests that distinguishing multiple behavior targets and jointly optimizing them allows the targets to share information and enhances the model’s generalization performance. 4) Our final BTMT achieves the best performance on all behavior targets, indicating that multi-target joint training combined with distinguishing the importance of different behaviors during training captures user preferences more accurately.
4.5 Ablation study
The impact of model’s latent dimensions d
Figure 4 illustrates the variation of NDCG@10 for some neural network-based baseline models and BTMT across different latent dimensions \(d \in \{ 10,20,30,40,50 \}\). BTMT outperforms the baseline models on all targets at every dimension value, and performance improves as the dimension increases. All models achieve their best performance at \(d=50\).
The impact of maximum sequence length n
The length of average interaction sequences varies across different datasets. For Games and YooChoose, we consider different values \(n \in \{ 10,20,30,40,50 \}\); for ML-1M, we use \(n \in \{ 50,100,150,200,250 \}\); and for Tmall, we use \(n \in \{ 40,70,100,130,160 \}\). Figure 5 illustrates the variation of NDCG@10 for the mentioned approaches across different maximum sequence lengths. We observe that, for datasets with relatively small average sequence lengths like Games and YooChoose, BTMT performs best at \(n=50\). For ML-1M and Tmall, the optimal performance is achieved when \(n=200\) and \(n=100\), respectively, with a decline in performance when \(n>200\) and \(n>100\), respectively. This may be because when the gap between the maximum sequence length and the average interaction sequence length is too large, irrelevant information will be introduced, thus interfering with the learning of the model.
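The role of \(n\) can be made concrete with a small helper that fixes every interaction sequence to length \(n\). This is a minimal sketch assuming the convention common in self-attentive recommenders: keep the most recent \(n\) items and left-pad shorter sequences with a padding id of 0. The section does not state which convention BTMT uses, and `pad_or_truncate` is a hypothetical name.

```python
from typing import List


def pad_or_truncate(sequence: List[int], n: int, pad_id: int = 0) -> List[int]:
    """Keep the most recent `n` items and left-pad shorter sequences with `pad_id`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    recent = sequence[-n:]  # the last n interactions (or all of them if shorter)
    return [pad_id] * (n - len(recent)) + recent
```

When \(n\) is much larger than the typical sequence length, most positions hold padding, while for long sequences a large \(n\) admits old interactions; both cases are consistent with the interference the authors describe.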
The impact of the number of stacked self-attention blocks c
Figure 6 illustrates the variation of NDCG@10 for some specific baseline models and BTMT across different numbers of stacked blocks \(c \in \{ 1,2,3,4,5,6 \}\). For all models and datasets, optimal performance is achieved when \(c=2\). When \(c>2\), the performance of all models begins to decline. This is likely because adding attention blocks enhances the model’s fitting capacity but also increases the risk of overfitting.
The impact of the maximum time interval \(t_{\text {max}}\)
We selected different magnitudes of \(t_{\text {max}}\) for different datasets to investigate its influence on model performance. Figure 7 illustrates the variation of NDCG@10 for the baseline model TiSASRec and BTMT with different values of \(t_{\text {max}}\) on ML-1M and Tmall. The results indicate that \(t_{\text {max}}=512\) is the optimal value for all behavior targets of TiSASRec and BTMT on ML-1M, while both models perform best at \(t_{\text {max}}=256\) on Tmall. When the maximum time interval exceeds the optimal value, performance begins to decline. This may be because an overly large \(t_{\text {max}}\) lets items with otherwise low relevance interfere with the learning of the model.
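One way to read \(t_{\text {max}}\) is as a cap on the pairwise time intervals fed to the time-aware attention. The sketch below builds such a clipped interval matrix from integer timestamps; it is an illustration under that assumption, any per-user rescaling of intervals is omitted, and `clipped_interval_matrix` is a hypothetical name.

```python
from typing import List


def clipped_interval_matrix(timestamps: List[int], t_max: int) -> List[List[int]]:
    """Pairwise absolute time intervals between interactions, capped at `t_max`."""
    return [
        [min(abs(t_i - t_j), t_max) for t_j in timestamps]
        for t_i in timestamps
    ]
```

With timestamps `[0, 100, 1000]` and `t_max = 256`, the intervals 900 and 1000 are both mapped to 256, so all sufficiently distant pairs share one interval value; a larger cap keeps such pairs distinguishable.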
The impact of behavior session lengths \(n_{click}\), \(n_{favorite}\) and \(n_{purchase}\)
Similar to the maximum sequence length, we set the session length of each behavior based on the number of interactions of that behavior. We compared the performance of BTMT with different behavior session lengths on all datasets. The results in Table 5 indicate that, for the Games and ML-1M datasets, in which the numbers of adding to favorites and purchase interactions are close, BTMT performs best when \(n_{favorite}=n_{purchase}\). For Tmall, the model performs best when \(n_{click}=50\), \(n_{favorite}=30\) and \(n_{purchase}=20\), and worst when \(n_{favorite}=n_{purchase}\). For YooChoose, the model performs best when \(n_{click}=30\) and \(n_{purchase}=20\). These results are consistent with the number of interactions of each behavior in each dataset: the more interactions a behavior has, the longer the session length it requires.
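To show how the session lengths act as per-behavior budgets, the sketch below groups a user’s interactions by behavior type and keeps only the most recent \(n_{b}\) items of each behavior. How BTMT actually builds its behavior sessions is defined elsewhere in the paper, so this grouping rule is an assumption, and `split_into_behavior_sessions` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def split_into_behavior_sessions(
    interactions: List[Tuple[int, str]],
    session_lengths: Dict[str, int],
) -> Dict[str, List[int]]:
    """Group (item, behavior) pairs by behavior and keep the latest items of each.

    `interactions` is assumed to be in chronological order, and every value in
    `session_lengths` is assumed to be a positive integer.
    """
    sessions: Dict[str, List[int]] = {behavior: [] for behavior in session_lengths}
    for item, behavior in interactions:
        if behavior in sessions:
            sessions[behavior].append(item)
    return {
        behavior: items[-session_lengths[behavior]:]
        for behavior, items in sessions.items()
    }
```

Under this reading, a behavior with many interactions (typically clicks) loses history quickly if its budget is small, which matches the observation that more frequent behaviors need longer sessions.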
5 Conclusion
In this work, we propose BTMT, a novel behavior session and time-aware model for multi-target sequential recommendation. Our model introduces behavior information, divides sessions based on different behaviors, and computes weights for each behavior. This design allows us to capture users’ attention to each behavior, thereby discerning users’ intent. Furthermore, we treat each behavior type as a target and perform joint training and optimization for these targets. This approach facilitates information sharing among different targets, enabling more precise predictions for each target. Extensive experiments demonstrate that BTMT outperforms state-of-the-art baselines across all targets. Additionally, our multi-target joint optimization approach significantly enhances recommendation performance compared to single-target methods.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Wang S, Cao L, Wang Y, Sheng QZ, Orgun MA, Lian D (2021) A survey on session-based recommender systems. ACM Comput Surv 54(7):1–38
Chen X, Xu H, Zhang Y, Tang J, Cao Y, Qin Z, Zha H (2018) Sequential recommendation with user memory networks. In: Proceedings of the eleventh ACM international conference on web search and data mining. pp 108–116
Shani G, Heckerman D, Brafman RI, Boutilier C (2005) An mdp-based recommender system. J Mach Learn Res 6(9)
Tang J, Wang K (2018) Personalized top-n sequential recommendation via convolutional sequence embedding. In: Proceedings of the eleventh ACM international conference on web search and data mining. pp 565–573
Hosseinzadeh Aghdam M, Hariri N, Mobasher B, Burke R (2015) Adapting recommendations to contextual changes using hierarchical hidden markov models. In: Proceedings of the 9th ACM conference on recommender systems. pp 241–244
Rendle S, Freudenthaler C, Schmidt-Thieme L (2010) Factorizing personalized markov chains for next-basket recommendation. In: Proceedings of the 19th international conference on world wide web. pp 811–820
Shin Y, Choi J, Wi H, Park N (2024) An attentive inductive bias for sequential recommendation beyond the self-attention. Proc AAAI Conf Artif Intell 38:8984–8992
Yuan F, He X, Jiang H, Guo G, Xiong J, Xu Z, Xiong Y (2020) Future data helps training: modeling future contexts for session-based recommendation. In: Proceedings of the web conference 2020. pp 303–313
Lin G, Gao C, Zheng Y, Chang J, Niu Y, Song Y, Gai K, Li Z, Jin D, Li Y et al (2024) Mixed attention network for cross-domain sequential recommendation. In: Proceedings of the 17th ACM international conference on web search and data mining. pp 405–413
Yue Z, Wang Y, He Z, Zeng H, McAuley J, Wang D (2024) Linear recurrent units for sequential recommendation. In: Proceedings of the 17th ACM international conference on web search and data mining. pp 930–938
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Kang W-C, McAuley J (2018) Self-attentive sequential recommendation. In: 2018 IEEE International Conference on Data Mining (ICDM). IEEE, pp 197–206
Li J, Wang Y, McAuley J (2020) Time interval aware self-attention for sequential recommendation. In: Proceedings of the 13th international conference on web search and data mining. pp 322–330
Zhang S, Tay Y, Yao L, Sun A, An J (2019) Next item recommendation with self-attentive metric learning. In: Thirty-Third AAAI conference on artificial intelligence, vol. 9
Wu L, Li S, Hsieh C-J, Sharpnack J (2020) Sse-pt: sequential recommendation via personalized transformer. In: Fourteenth ACM conference on recommender systems. pp 328–337
Qin J, Zhang W, Wu X, Jin J, Fang Y, Yu Y (2020) User behavior retrieval for click-through rate prediction. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. pp 2347–2356
Li Z, Zhao H, Liu Q, Huang Z, Mei T, Chen E (2018) Learning from history and present: next-item recommendation via discriminatively exploiting user behaviors. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp 1734–1743
Linden G, Smith B, York J (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput 7(1):76–80
Koren Y, Rendle S, Bell R (2022) Advances in collaborative filtering. Recommender systems handbook. pp 91–142
He X, Liao L, Zhang H, Nie L, Hu X, Chua T-S (2017) Neural collaborative filtering. In: Proceedings of the 26th international conference on world wide web. pp 173–182
Chen J, Yu J, Lu W, Qian Y, Li P (2021) Ir-rec: an interpretive rules-guided recommendation over knowledge graph. Inf Sci 563:326–341
Mooney RJ, Roy L (2000) Content-based book recommending using learning for text categorization. In: Proceedings of the fifth ACM conference on digital libraries. pp 195–204
Yap G-E, Li X-L, Yu PS (2012) Effective next-items recommendation via personalized sequential pattern mining. In: Database systems for advanced applications: 17th international conference, DASFAA 2012, Busan, South Korea, April 15-19, 2012, Proceedings, Part II 17. Springer, pp 48–64
Hidasi B, Karatzoglou A, Baltrunas L, Tikk D (2015) Session-based recommendations with recurrent neural networks. arXiv:1511.06939
Tuan TX, Phuong TM (2017) 3d convolutional networks for session-based recommendation with content features. In: Proceedings of the eleventh ACM conference on recommender systems. pp 138–146
Wu S, Tang Y, Zhu Y, Wang L, Xie X, Tan T (2019) Session-based recommendation with graph neural networks. Proc AAAI Conf Artif Intell 33:346–353
Wang Z, Wei W, Cong G, Li X-L, Mao X-L, Qiu M (2020) Global context enhanced graph neural networks for session-based recommendation. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. pp 169–178
Song W, Xiao Z, Wang Y, Charlin L, Zhang M, Tang J (2019) Session-based social recommendation via dynamic graph attention networks. In: Proceedings of the twelfth ACM international conference on web search and data mining. pp 555–563
Zhou K, Yu H, Zhao WX, Wen J-R (2022) Filter-enhanced mlp is all you need for sequential recommendation. In: Proceedings of the ACM web conference 2022. pp 2388–2399
Li J, Ren P, Chen Z, Ren Z, Lian T, Ma J (2017) Neural attentive session-based recommendation. In: Proceedings of the 2017 ACM on conference on information and knowledge management. pp 1419–1428
Liu Q, Zeng Y, Mokhosi R, Zhang H (2018) Stamp: short-term attention/memory priority model for session-based recommendation. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp 1831–1839
Santana MR, Soares A (2021) Hybrid model with time modeling for sequential recommender systems. arXiv:2103.06138
Ren P, Chen Z, Li J, Ren Z, Ma J, De Rijke M (2019) Repeatnet: a repeat aware neural recommendation machine for session-based recommendation. In: Proceedings of the AAAI conference on artificial intelligence, vol 33. pp 4806–4813
Cho SM, Park E, Yoo S (2020) Meantime: mixture of attention mechanisms with multi-temporal embeddings for sequential recommendation. In: Fourteenth ACM conference on recommender systems. pp 515–520
Fan Z, Liu Z, Zhang J, Xiong Y, Zheng L, Yu PS (2021) Continuous-time sequential recommendation with temporal graph collaborative transformer. In: Proceedings of the 30th ACM international conference on information & knowledge management. pp 433–442
Singer U, Roitman H, Eshel Y, Nus A, Guy I, Levi O, Hasson I, Kiperwasser E (2022) Sequential modeling with multiple attributes for watchlist recommendation in e-commerce. In: Proceedings of the fifteenth ACM international conference on web search and data mining. pp 937–946
Zhang Y, Chen R, Hu J, Zhang G, Zhu J, Liao W (2023) Multi-aspect features of items for time-ordered sequential recommendation. J Intell Fuzzy Syst 1–17 (Preprint)
Chen L, Yang N, Yu PS (2022) Time lag aware sequential recommendation. In: Proceedings of the 31st ACM international conference on information & knowledge management. pp 212–221
Tang H, Liu J, Zhao M, Gong X (2020) Progressive layered extraction (ple): a novel multi-task learning (mtl) model for personalized recommendations. In: Proceedings of the 14th ACM conference on recommender systems. pp 269–278
Vandenhende S, Georgoulis S, Proesmans M, Dai D, Van Gool L (2020) Revisiting multi-task learning in the deep learning era. 2(3). arXiv:2004.13379
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
McAuley J, Targett C, Shi Q, Van Den Hengel A (2015) Image-based recommendations on styles and substitutes. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. pp 43–52
Koren Y (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. pp 426–434
Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L (2012) Bpr: Bayesian personalized ranking from implicit feedback. arXiv:1205.2618
He R, Kang W-C, McAuley J (2017) Translation-based recommendation. In: Proceedings of the eleventh ACM conference on recommender systems. pp 161–169
Wu L, Li S, Hsieh C-J, Sharpnack JL (2019) Stochastic shared embeddings: data-driven regularization of embedding layers. Adv Neural Inf Process Syst 32
Funding
The work is supported by the National Natural Science Foundation of China (No. 61702063, 72161005), the Natural Science Foundation Project of Chongqing (No. CSTB2023NSCQ-MSX0343), the Science and Technology Research Project of Chongqing Municipal Education Commission (No. KJZD-K202101105), the Humanities and Social Sciences Research Project of Chongqing Municipal Education Commission (No. 22SKGH302), the Chongqing Municipal Entrepreneurship and Innovation Support Project for Returned Overseas (No. cx2021087), the Research Projects of the Science and Technology Plan of Guizhou Province (grant no. QianKeHeJiChu-ZK [2022] General 184), and the Action Plan for High-Quality Development of Graduate Education of Chongqing University of Technology (No. gzlcx20232104).
Ethics declarations
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Ethical Approval
This article does not contain any studies with human participants performed by any of the authors.
Informed Consent
Informed consent was obtained from all individual participants included in the study.
About this article
Cite this article
Chen, R., Zhang, Y., Hu, J. et al. Behavior sessions and time-aware for multi-target sequential recommendation. Appl Intell 54, 9830–9847 (2024). https://doi.org/10.1007/s10489-024-05678-6
DOI: https://doi.org/10.1007/s10489-024-05678-6