Abstract
Experience replay (ER) used in (deep) reinforcement learning is considered to be applicable only to off-policy algorithms. However, there have been some cases in which ER has been applied to on-policy algorithms, suggesting that off-policyness might be a sufficient condition for applying ER. This paper reconsiders stricter “experience replayable conditions” (ERC) and proposes a way of modifying existing algorithms to satisfy ERC. In light of this, it is postulated that the instability of policy improvements is a pivotal factor in ERC. The instability factors are revealed from the viewpoint of metric learning as i) repulsive forces from negative samples and ii) replays of inappropriate experiences. Accordingly, the corresponding stabilization tricks are derived. As a result, numerical simulations confirm that the proposed stabilization tricks make ER applicable to advantage actor-critic, an on-policy algorithm. Moreover, its learning performance is comparable to that of soft actor-critic, a state-of-the-art off-policy algorithm.
1 Introduction
In the past decade, (deep) reinforcement learning (RL) [51] has made significant progress and has gained attention in various application fields, including game AI [39, 57], robot control [22, 60], autonomous driving [7, 10], and even finance [19]. RL agents can optimize their action policies for unknown environments in a trial-and-error manner. However, the trial-and-error process is computationally expensive, and the low sample efficiency of RL remains a challenge.
To improve sample efficiency, many RL algorithms utilize experience replay (ER) [34]. ER stores the empirical data obtained during the trial-and-error process in a replay buffer instead of immediately streaming them. By randomly and repeatedly replaying the stored empirical data for learning, the contribution of each piece of empirical data to learning, i.e. sample efficiency, can be increased. Although ER is a simple technique, it is highly useful due to its compatibility with recent deep learning libraries and abundant computational resources. However, it is claimed that ER is only applicable to off-policy algorithms [13]. For example, proximal policy optimization (PPO), an on-policy algorithm, learns after collecting a certain amount of empirical data but generally does not reuse them [47].
Nevertheless, a survey of past cases suggests the efficacy of ER for on-policy algorithms. For example, previous studies [3, 65] reported that ER was successfully applied to state–action–reward–state–action (SARSA) [51], a typical on-policy algorithm.^{Footnote 1} SARSA is also used to learn value functions in soft actor-critic (SAC) [17] and twin delayed deep deterministic policy gradient (TD3) [14], both of which are off-policy algorithms.^{Footnote 2} In addition, the author has successfully applied ER to PPO and its variants without any learning breakdowns [28] (also see Appendix B).
From the above suggestive cases, it is expected that being an off-policy algorithm corresponds to a sufficient condition for “experience replayable conditions” (ERC). Then, what are the necessary and sufficient conditions for ERC? This study hypothesizes that i) policy improvement algorithms in RL have corresponding sets of empirical data that are acceptable for learning, and that ii) ER can be applied to them if only the empirical data belonging to the respective sets are replayed. With these hypotheses, one can expect that off-policy algorithms naturally satisfy ERC because the acceptable set for them coincides with the whole set of empirical data. As a remark, based on the past cases, it is assumed that there is no ERC in the context of learning value functions (this assumption will become clear in the numerical verification of this paper).
To demonstrate the validity of these hypotheses, this paper first reveals the instability factors that destabilize learning with ER. Specifically, one can focus on the fact revealed in [26] that the standard policy improvement algorithms can be derived as a triplet loss in metric learning [56], from the viewpoint of control as inference (CaI) [32]. That is, the instability factors of the triplet loss, i.e. i) repulsive forces from negative samples (the so-called hard negative) [8, 61] and ii) replays of inappropriate experiences (the so-called distribution shift) [45, 62], are inherited by the policy improvement algorithms. Since the way of sampling data activates the instability factors, ER also inherits them by randomly selecting empirical data, causing learning breakdowns.
To alleviate the identified instability factors, the two corresponding stabilization tricks, a counteraction and a mining, are proposed respectively based on an experience discriminator. For the hypothesized ERC, i) the counteraction is responsible for expanding the acceptable set of empirical data, and ii) the mining is responsible for selectively replaying the empirical data belonging to it (see Fig. 1). Hence, ERC should hold by setting both of the counteraction and mining appropriately.
With the proposed stabilization tricks, it is verified whether ER becomes applicable to advantage actor-critic (A2C) [37], an on-policy algorithm, by satisfying ERC. First, a rough grid search of hyperparameters is performed to identify the range of the counteraction and mining in which ERC holds. Second, ablation tests for the counteraction and mining with the found hyperparameters show the need for both. Finally, the A2C with the proposed stabilization tricks is evaluated on more practical multi-dimensional tasks, suggesting learning performance comparable to that of SAC [17], a state-of-the-art off-policy algorithm.
In summary, the contributions of this paper are threefold:

1. It is revealed that ER is limited to off-policy algorithms because of the two instability factors associated with the triplet loss hidden in the policy improvement algorithms.

2. Two stabilization tricks, the counteraction and the mining, are developed to alleviate the corresponding instability factors.

3. With the two stabilization tricks, A2C, an on-policy algorithm, can satisfy ERC and achieve learning performance comparable to SAC.
2 Preliminaries
2.1 Basics of reinforcement learning
Let’s briefly introduce the problem statement of RL [51]. RL aims to maximize the sum of future rewards, the so-called return, under a Markov decision process (MDP). Mathematically, an MDP is defined with the tuple \((\mathcal {S}, \mathcal {A}, p_e, r)\): \(\mathcal {S}\) and \(\mathcal {A}\) denote the state and action spaces,^{Footnote 3} respectively; \(p_e: \mathcal {S} \times \mathcal {A} \times \mathcal {S} \mapsto [0, \infty )\) gives the stochastic state transition function; and \(r: \mathcal {S} \times \mathcal {A} \mapsto \mathcal {R}\) defines the reward function for the current situation with \(\mathcal {R} \subset \mathbb {R}\) the set of rewards.
Under the above MDP, an agent encounters the current state \(s_t \in \mathcal {S}\) of an (unknown) environment at the current time step \(t \in \mathbb {N}\). The agent (stochastically) decides its action \(a_t \in \mathcal {A}\) according to its learnable policy \(\pi (a_t \mid s_t): \mathcal {S} \times \mathcal {A} \mapsto [0, \infty )\). By interacting with the environment by \(a_t\), the agent enters the next state \(s_{t+1} =: s_t^\prime \in \mathcal {S}\) according to \(p_e(s_t^\prime \mid s_t, a_t)\). At the same time, the reward \(r_t = r(s_t, a_t)\) is obtained from the environment.
By repeating the above transition, the agent obtains the return, \(R_t = \sum _{k=0}^\infty \gamma ^k r_{t+k}\), with \(\gamma \in [0, 1)\) the discount factor. RL wants to maximize the expected return from t by optimizing \(\pi (a_t \mid s_t)\) as follows:
where \(\rho ^\pi \) denotes the probability to generate the state-action trajectory, \(\tau _t = [s_t^\prime , a_{t+1}, \ldots ]\), with the combination of \(\pi \) and \(p_e\). Here, the expected return can be defined as the following learnable value functions.
where \(V(s_t) = \mathbb {E}_{a_t \sim \pi (a_t \mid s_t)}[Q(s_t, a_t)]\) holds. The value functions are useful in acquiring \(\pi ^*\), and therefore, most RL algorithms learn them.
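As a small illustration of the return defined above, the following sketch computes \(R_t = \sum _k \gamma ^k r_{t+k}\) for a finite reward sequence. The function name `discounted_return` and the truncation to a finite horizon are assumptions made for illustration only; they are not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    Hypothetical helper: the paper defines the return over an infinite
    horizon; here the sequence is simply truncated.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    ret = 0.0
    # Accumulate backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

With three unit rewards and \(\gamma = 0.5\), the recursion gives \(1 + 0.5 \cdot (1 + 0.5 \cdot 1) = 1.75\).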
2.2 RL algorithms with experience replay
ER stores the empirical data, \((s, a, s^\prime , r, b)\), in a replay buffer \(\mathcal {B}\) with finite size \(|\mathcal {B}|\) [34], where \(b = \pi (a \mid s) \in \mathbb {R}_+\) is the policy likelihood of generating a, which is used by the proposed method. For the sake of simplicity, this paper focuses on ER for learning algorithms that use single transitions, although ER for long-term sequences has also been developed [20, 23]. That is, the loss function w.r.t. the learnable functions in RL (i.e. mainly \(\pi \), V, and/or Q) to be minimized must be computable from each single piece of empirical data. The following minimization problem is therefore given under ER.
where f indicates the function(s) to be optimized and \(\ell _{f}^\textrm{alg}\) denotes the algorithm-dependent loss function for f (even if other functions are used for computing it, they are not optimized through minimizing it). Although various variants have been proposed as attention to the practicality of ER has increased (e.g. prioritizing important empirical data with high replay frequency [43, 44] and ignoring inappropriate empirical data [38, 48]), this paper employs the most standard implementation with a FIFO-type replay buffer and uniform sampling of empirical data.
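The standard setting described above (a FIFO-type buffer with uniform sampling over tuples \((s, a, s^\prime , r, b)\)) can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the class name `ReplayBuffer` and its method names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """FIFO replay buffer with uniform sampling (illustrative sketch)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.data = deque(maxlen=capacity)

    def store(self, s, a, s_next, r, b):
        # b is the policy likelihood of a at the time the data was collected.
        self.data.append((s, a, s_next, r, b))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement within a single batch.
        k = min(batch_size, len(self.data))
        return rng.sample(list(self.data), k)

    def __len__(self):
        return len(self.data)
```

Storing the likelihood b alongside each transition is what later allows the current policy \(\pi \) to be compared with the policy that generated the data.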
2.2.1 Soft actor-critic
Here, the two algorithms mainly used in this paper are introduced briefly. The first one is SAC [17], a representative off-policy algorithm, which serves as the comparison. SAC learns two functions, \(\pi \) and Q, through minimization of the following loss functions.^{Footnote 4}
where \(\bar{Q}\) denotes the target value (without computational graph) and \(\alpha \ge 0\) denotes the magnitude of policy entropy, which can be auto-tuned by [29]. The two expectations w.r.t. \(\pi \) are roughly approximated by a one-sample Monte Carlo method.
As for the optimization of Q, the expected SARSA [55], which can be an off-policy algorithm, is employed. Indeed, the target value of Q is computed with \(\pi \), which is different from the policy used when obtaining the empirical data (the so-called behavior policy). In addition, the optimization of \(\pi \) is off-policy as well, since Q in \(\ell _{\pi }\) can depend on arbitrary policies (as in [33]). In this way, SAC as a whole is an off-policy algorithm. As a result, SAC can optimize \(\pi \) and Q with arbitrary empirical data, making it possible to use ER freely.
2.2.2 Advantage actor-critic
Next, A2C [37], an on-policy algorithm, is introduced as the baseline for the proposed method. In this paper, the advantage function is approximated by the temporal difference (TD) error \(\delta \) for simplicity. In other words, A2C in this paper learns \(\pi \) and V by minimizing the following loss functions.
where \(\bar{V}\) denotes the target value, like \(\bar{Q}\).
As V necessarily depends on \(\pi \) by definition, the above learning rule for V requires \(\pi \)-dependent r (and \(s^\prime \)). The learning rule for \(\pi \) is also derived based on the policy gradient theorem, and the dependence of V on \(\pi \) is again assumed. In this way, A2C as a whole is an on-policy algorithm. As a result, A2C should theoretically not be able to learn with ER, which stores empirical data independent of \(\pi \). Note that the policy gradient algorithm can theoretically be converted to an off-policy algorithm by utilizing importance sampling [11], but due to computational instability and distribution shift, various countermeasures must be implemented [12, 58, 64] (see Section 6.1 for details).
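For intuition, the per-sample quantities of a TD-error-based actor-critic can be sketched as below. This assumes the commonly used forms \(\delta = r + \gamma \bar{V}(s^\prime ) - V(s)\), a policy loss \(-\delta \ln \pi (a \mid s)\), and a value loss \(\delta ^2 / 2\); the exact expressions used in the paper are not reproduced here, and the function names are hypothetical. The sketch only evaluates scalar values and does not perform automatic differentiation.

```python
import math


def td_error(r, v_s, v_next_target, gamma):
    # One-step TD error; v_next_target plays the role of the target value V-bar(s').
    return r + gamma * v_next_target - v_s


def a2c_losses(log_pi, r, v_s, v_next_target, gamma):
    """Return (policy loss, value loss) for one transition (assumed standard forms)."""
    delta = td_error(r, v_s, v_next_target, gamma)
    # In the policy loss, delta acts as a constant weight on the log-likelihood.
    loss_pi = -delta * log_pi
    loss_v = 0.5 * delta ** 2
    return loss_pi, loss_v
```

A positive \(\delta \) raises the likelihood of the sampled action when the policy loss is minimized, whereas a negative \(\delta \) lowers it, which is the attraction and repulsion discussed later in terms of the triplet loss.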
2.3 Metric learning with triplet loss
Metric learning [5] is a methodology for extracting an embedding space, in which the similarity of the input data \(x \in \mathcal {X}\) (e.g. image) can be measured. The triplet loss relevant to this study [56], which is one of the loss functions to extract the embedding space \(\mathcal {Z}\), is briefly introduced. Three types of the input data, the anchor, positive, and negative data (x, \(x^+\), and \(x^-\) respectively), are required to compute it. These are fed into the common networks f, outputting the corresponding features in \(\mathcal {Z}\). A distance function \(d: \mathcal {Z} \times \mathcal {Z} \mapsto \mathbb {R}_+\) is then prepared to learn the desired distance relationship in the inputs. That is, x should be close to \(x^+\) and away from \(x^-\), as shown in Fig. 2.
By minimizing the following triplet loss, this relationship can be acquired.
where \(m \ge 0\) denotes the margin between the positive and negative clusters. The max operator is employed to preclude the divergence into negative infinity. As a result, the anchor data can be embedded near the positive data while being away from the negative data to some extent.
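The role of the margin and the max operator can be illustrated with the widely used hinge form of the triplet loss, \(\max (0, d(z, z^+) - d(z, z^-) + m)\), computed on already-embedded features. The Euclidean distance and the function names below are assumptions for illustration; any distance function d could be substituted.

```python
import math


def euclidean(u, v):
    # Distance between two embedded feature vectors given as sequences of floats.
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def triplet_loss(z, z_pos, z_neg, margin=0.0, dist=euclidean):
    """Hinge-type triplet loss on embedded features (assumed standard form)."""
    # The max operator stops the loss from decreasing without bound
    # once the negative is farther away than the positive by the margin.
    return max(0.0, dist(z, z_pos) - dist(z, z_neg) + margin)
```

When the negative is already far enough from the anchor, the loss is zero and no repulsion is applied; when the positive is farther than the negative, the loss is positive.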
3 Reasons for instabilization by experience replay
3.1 Policy improvements via control as inference
CaI has been proposed for a new interpretation of RL [32]. To indicate that the policy improvements (e.g. A2C) can be regarded as a kind of triplet loss, CaI is utilized as below (see [26] for the details of derivation).
In this concept, an optimality variable \(O \in \{0, 1\}\) is introduced. The probability mass functions for it, \(p(O=1 \mid s)\) and \(p(O=1 \mid s, a)\), can be defined with V and Q, respectively.
where \(\beta > 0\) denotes the inverse temperature parameter. That is, if \(\beta \) is small, these probabilities are likely to be near 1/2, increasing ambiguity; and if \(\beta \) is large, they often converge to 0 or 1, making them deterministic. C is an unknown value introduced for convenience so that \(V(s) - C \le 0\) and \(Q(s, a) - C \le 0\) hold, which makes the above equations satisfy the definition of probability; it corresponds to the maximum value of the value function. Since numerical computation is impossible with C remaining, C must be excluded from the learning rule that is eventually derived. Note that since O is binary, the probability of \(O=0\) can be given as \(p(O=0 \mid s) = 1 - p_V(s)\) and \(p(O=0 \mid s, a) = 1 - p_Q(s, a)\).
Using these probabilities, the optimal and non-optimal policies are inferred using Bayes' theorem. That is, \(\pi (a \mid s, O)\) is given as follows:
where \(b(a \mid s)\) denotes the behavior policy to sample a, which can be different from the current learnable policy \(\pi \).
Now, to optimize \(\pi \), the following minimization problem is solved.
where \(\textrm{KL}(\cdot \mid \cdot )\) denotes the Kullback-Leibler (KL) divergence between two probability distributions. The gradient of these two terms w.r.t. \(\pi \), \(g_\pi \), is derived as follows:
Here, by assuming \(\beta \rightarrow \infty \), this gradient can be simplified. In addition, \(Q  V = A\) the advantage function can be approximated by TD error \(\delta \) (see Section 2.2.2). As mentioned above, when \(\beta \) is large, \(p(O=1)\) deterministically takes 0 or 1, and with \(\beta \rightarrow \infty \), \(O=1\) is not obtained unless \(V = Q = C\) (i.e. the value function is maximized by the optimized policy, which is consistent with (1)).
The minimization of (13) using the surrogated gradient is consistent with the minimization of the loss function for A2C, \(\ell _\pi ^\textrm{A2C}\), by (stochastic) gradient descent. The original minimization problem can be interpreted as trying to move the anchor data \(\pi \) closer to the positive data \(\pi ^+\) and away from the negative data \(\pi ^-\) by employing KL divergence as the distance function d.^{Footnote 5} Compared to (9), the margin m and the max operator are not used, but this is because \(m=0\) and \(\pi ^{+,-}\) are centered on b, preventing the divergence to infinity.
3.2 Instability factors hidden in triplet loss
As the policy improvements correspond to minimizing a triplet loss, their behavior during learning should inherit the characteristics of triplet loss minimization. Ideally, the anchor data should approach the positive data, resulting in \(\pi \rightarrow \pi ^+\). On the other hand, the minimization of a triplet loss is different from simple supervised learning problems, and several factors that destabilize learning have been reported. These instability factors are influenced by the way triplets are selected from the dataset. In the policy improvements, therefore, they are activated by randomly sampling empirical data from ER (and using them for learning). Under this assumption, the following two instability factors are raised, along with their solution guidelines noted in the context of metric learning, which should be useful for RL combined with ER.
First, selecting inappropriate anchor, positive, and negative data might cause \(\textrm{KL}(\pi \mid \pi ^+) > \textrm{KL}(\pi \mid \pi ^-)\). In this case, the repulsion from \(\pi ^-\) is stronger than the attraction to \(\pi ^+\) and the optimal solution cannot be found, so \(\pi \) is updated in a divergent manner. This is known as the hard negative [8, 61]. To alleviate this instability factor, the exclusion of hard-negative triplets and/or regularization that suppresses the repulsion would be required.
Second, of all the triplets that can be constructed, only a few can be used for optimization. In fact, \(\pi ^+\) and \(\pi ^-\) are linked via the behavior policy b in the policy improvements, and arbitrary triplets cannot be constructed. This might cause distribution shift, inducing a large bias in learning [45, 62]. To alleviate this instability factor, it is desirable to eliminate triplets that are prone to bias and/or to regularize the distribution of selected triplets.
4 Tricks for experience replayable conditions
4.1 Experience discriminator
The instability factors induced by the above triplet loss have to be suppressed to satisfy ERC. To this end, two stabilization tricks, counteraction and mining, are heuristically designed (see Fig. 3). As a common module for them, an experience discriminator is first introduced. It estimates whether the empirical data in ER are suitable for triplet construction.
Specifically, it is required to judge whether the state-action pair (s, a) in the buffer \(\mathcal {B}\) can be regarded as one generated by the current policy \(\pi \). According to the density trick, the following density ratio d satisfies this requirement.
where \(\sigma (\cdot ) \in [0, 1]\) is the sigmoid function. Note that d should be equal to or less than 0.5 since (s, a) is actually generated from the behavior policy with its likelihood b.
In addition, judgements should be robust w.r.t. the stochasticity of actions and the non-stationarity of policies. For this purpose, \(D:\mathcal {S} \mapsto [0, 1]\), which marginalizes d over a and has only s as input, is defined as a learnable model. As D corresponds to the probability parameter of a Bernoulli distribution, it can be optimized by minimizing its negative log-likelihood.
where the above d is employed as the supervised signal.
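The two quantities above can be sketched numerically as follows. The sketch assumes that the density ratio takes the form \(d = \sigma (\ln \pi (a \mid s) - \ln b) = \pi / (\pi + b)\), which equals 0.5 when \(\pi = b\), and that D is trained with a Bernoulli negative log-likelihood using d as a soft label. Both forms are assumptions inferred from the surrounding description, not a reproduction of the paper's equations, and the function names and the `clip` and `eps` parameters are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def density_ratio(log_pi, log_b, clip=0.5):
    """Assumed form d = sigmoid(ln pi - ln b) = pi / (pi + b), clipped at 0.5."""
    d = sigmoid(log_pi - log_b)
    # The data were generated with likelihood b, so d is capped at 0.5.
    return min(d, clip)


def discriminator_nll(D, d, eps=1e-8):
    """Bernoulli negative log-likelihood of D with d as the soft target."""
    D = min(max(D, eps), 1.0 - eps)  # keep the logarithms finite
    return -(d * math.log(D) + (1.0 - d) * math.log(1.0 - D))
```

Under these assumptions, the loss is smallest when D matches d, so a state whose stored actions are still likely under the current policy yields D close to 0.5.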
4.2 Counteraction of deviations from non-optimal policies
The first trick, the so-called counteraction, is proposed mainly as a countermeasure against the hard negative. In the previous work [8], regularization that changes the ratio of the two terms in (9) was proposed to counteract the repulsion from the negative data. This concept can be reproduced in (13) as follows:
where \(\lambda \ge 0\) denotes the gain for imbalancing the positive and negative terms.^{Footnote 6}
Now, how this extension works in theory is shown. As in the original minimization problem, the gradient for the added regularization is derived as below.
Now, the second term can be decomposed to (14) by using the definitions of \(p_V\) and \(p_Q\) in (10) and (11), respectively.
That is, the second term can be absorbed into the original loss.
The added regularization, therefore, only has the role of constraining \(\pi \rightarrow b\). As b should be located between \(\pi ^+\) and \(\pi ^-\), this regularization is expected to avoid the hard negative and stabilize \(\pi \rightarrow \pi ^+\). However, to implement this regularization directly, the behavior policy \(b(a \mid s)\) at each experience must be retained, which is costly. As an alternative regularization way, the adversarial learning to (16) is implemented in this paper.
That is, it takes advantage of the fact that a higher misidentification rate of D means \(\pi \sim b\), which reduces the risk of the hard negative (see Fig. 3a). Since d has the computational graph w.r.t. \(\pi \), the following loss function can be given as the counteraction trick for regularizing \(\pi \) to b.
where \(\omega \ge 0\) denotes the gain, which is designed below. One can find that the term related to D also behaves as a gain and is larger when \(D \ll 1\) (i.e. no misidentification). In addition, if d is clipped to 0.5 (i.e. \(\pi \simeq b\) holds), the gradient w.r.t. \(\pi \) is zero. Note that, in the actual implementation, the gradient reversal layer [15] is useful to lump (16) and (20) together.
As for \(\omega \), unlike \(D \ll 1\), \(d \ll 1\) yields only a small gradient due to the sigmoid function. To compensate for this and converge to \(\pi \rightarrow b\) faster even in such a case, \(\omega \) is designed in the manner of PI control, referring to the literature [50].
where \(\hat{d}\) means d without the computational graph, and \(\eta _C \ge 0\) denotes the hyperparameter for this counteraction. \(I \ge 0\) is the integral term with saturation by multiplying 0.5 (its initial value must be zero).
4.3 Mining of indistinguishable experiences
The second trick, the so-called mining, is proposed mainly to mitigate the effect of distribution shift. In the previous studies, semi-hard triplets, in which \(d(x, x^+) + m < d(x, x^-)\) holds, are considered useful [45]. Since the optimization problem in this study sets \(m=0\), \(\textrm{KL}(\pi \mid \pi ^+) < \textrm{KL}(\pi \mid \pi ^-)\) seems better. However, this is also regarded as hard triplets, which would induce the selection bias. Of course, \(\textrm{KL}(\pi \mid \pi ^+) > \textrm{KL}(\pi \mid \pi ^-)\) is still not suitable for learning because of the hard negative relationship described above. Hence, the triplets with \(\textrm{KL}(\pi \mid \pi ^+) \simeq \textrm{KL}(\pi \mid \pi ^-)\) are desired to be mined.
Specifically, it takes advantage of the fact that, in this study, the anchor data is determined by the current policy \(\pi \), and the positive and negative data, \(\pi ^+\) and \(\pi ^-\), are located around the behavior policy b. In other words, the indistinguishable empirical data with \(\pi \simeq b\) is likely to achieve the desired relationship (see Fig. 3b). The mining trick is therefore given as the following stochastic dropout [49].
where \(\mathcal {U}(l, u)\) is the uniform distribution within [l, u], and \(\eta _M \ge 0\) denotes the hyperparameter for this mining. When sampling the empirical data from the replay buffer \(\mathcal {B}\), each piece of data is screened by the mining: if \(M=1\), it is used for the optimization; if \(M=0\), it is excluded.
5 Simulations
5.1 Overview
Here, it is verified that ERC holds with the two stabilization tricks. To do so, A2C [37], which is an on-policy algorithm generally considered incompatible with ER, is employed as a baseline (see Appendix A for detailed settings). Learning is conducted only with the empirical data replayed from ER at the end of each episode (without online learning).
The following three steps are performed for the step-by-step verification. With them, it is shown that alleviating the instability factors hidden in the triplet loss is effective not only to satisfy ERC, but also to learn the optimal policy with sample efficiency comparable to SAC [17].

1. The effective ranges of the hyperparameters of the respective stabilization tricks, \(\eta _{C,M} \ge 0\), are roughly examined in a toy problem, determining the values to be used in the subsequent verification.

2. It is confirmed through ablation tests on three major tasks implemented in MuJoCo [53] that the two stabilization tricks are complementary and both essential to satisfy ERC and to learn the optimal policy.

3. Finally, the A2C with the proposed stabilization tricks is evaluated on two tasks of relatively large problem scale in dm_control [54] in comparison to SAC, a state-of-the-art off-policy algorithm.
Note that training for each task and condition is conducted 12 times with different random seeds in order to evaluate the performance statistically. Through the above steps, it is demonstrated that the stabilization tricks make ER applicable to A2C with excellent learning performance. In addition, the application of the two stabilization tricks to another on-policy algorithm, i.e. PPO [47], and their quantitative contribution to the learning performance are investigated in Appendix B.
5.2 Effective range of hyperparameters
A toy problem named DoublePendulum, where a pendulum agent with two passive joints tries to balance in a standing position by moving its base, is solved. This has only a one-dimensional action space and a state space limited by a terminal condition (i.e. excessive tilting). Thus, as this is a relatively simple problem and many empirical data should naturally satisfy ERC, the effect of \(\eta _{C,M}\) should behave in a Gaussian manner, enabling statistical adjustment.
To roughly check the effective parameter ranges of \(\eta _{C,M}\), \(5 \times 5 = 25\) conditions with \(\eta _{C,M} = \{0.1, 0.5, 1.0, 5.0, 10.0\}\) are compared. The test results, which are evaluated after learning, are summarized in Fig. 4. The results suggest the following two points.
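The experimental design described here (a \(5 \times 5\) grid over \(\eta _C\) and \(\eta _M\), each condition trained 12 times) can be enumerated as in the following sketch. The dictionary keys, the function name `grid_conditions`, and the use of seeds 0 to 11 are assumptions for illustration; the actual seeds and training code are not specified in the text.

```python
from itertools import product

# Candidate values shared by eta_C and eta_M, as listed in the text.
ETA_VALUES = (0.1, 0.5, 1.0, 5.0, 10.0)
NUM_SEEDS = 12  # each condition is trained 12 times with different seeds


def grid_conditions(values=ETA_VALUES):
    """Enumerate every (eta_C, eta_M) pair of the rough grid search."""
    return [{"eta_C": c, "eta_M": m} for c, m in product(values, repeat=2)]


conditions = grid_conditions()
# One training run per (condition, seed) pair; seeds 0..11 are hypothetical.
runs = [(cond, seed) for cond in conditions for seed in range(NUM_SEEDS)]
```

This yields 25 conditions and 300 training runs in total for the toy problem.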
First, \(\eta _M\) must be of a certain scale to activate the selection of empirical data, or else the performance can be significantly degraded. However, an excessively large \(\eta _M\) would result in too little empirical data being replayed. Therefore, \(\eta _M\) is considered reasonable in the vicinity of 1.
Second, \(\eta _C\) seems to have an appropriate range depending on \(\eta _M\). In other words, if \(\eta _M\) is excessively large and the replayable empirical data is too limited, it is desirable to increase them by increasing \(\eta _C\). On the other hand, if the empirical data are moderately selected around \(\eta _M \simeq 1\), it seems essential to relax the counteraction to some extent as \(\eta _C \le 1\).
Based on the above two trends, and as a natural setting, \((\eta _C, \eta _M)\) are set to (0.5, 2.0) for the remaining experiments. The learning performance with this setting is also shown in Fig. 4 (the black bar), and is higher than the rough grid search results (i.e. \(\sim 416\) while the others did not exceed 400).
5.3 Ablation tests
Next, the need for the two stabilization tricks is demonstrated through ablation tests. Specifically, the presence or absence of the two stabilization tricks (labeled with their initials ‘C’ and ‘M’) is switched by setting \(\eta _{C,M} = 0\). The four resulting combinations are compared in the following three tasks in MuJoCo [53]: Reacher; Hopper; and HalfCheetah.
The learning curves for these returns are shown in Fig. 5. As can be seen from the results, Vanilla (i.e. the standard A2C) without the two stabilization tricks did not learn any task at all, probably because it does not satisfy ERC. Adding the counteraction trick alone did not improve learning either, whereas adding the mining trick alone improved it somewhat. Nevertheless, neither of them alone seems to satisfy ERC or learn the optimal policy. On the other hand, only the case using the two stabilization tricks was able to learn all tasks. From these results, it can be concluded that both of the two stabilization tricks are necessary (and sufficient) to satisfy ERC and learn the optimal policy.
Now let’s see why both of the stabilization tricks are required. To this end, the internal parameters for the respective stabilization tricks, \(\omega (s, a, b)\) and \(p(M \mid s, a, b)\), are depicted in Figs. 6 and 7. One can find that the cases with only one of the stabilization tricks saturated the respective internal parameters immediately. That is, if using the counteraction trick only, its regularization to \(\pi \rightarrow b\) is not enough for satisfying ERC; and if using the mining trick only, its selection is too strict for learning the optimal policy. In contrast, by using both, the internal parameters converged to task-specific values without saturation. That is why the two stabilization tricks need to be employed together to increase the amount of replayable empirical data to reach a level where the optimal policy can be learned while still satisfying ERC.
5.4 Performance comparison
Finally, the learning performance of the A2C with the proposed stabilization tricks (labeled ERC) is compared to the state-of-the-art off-policy algorithm, SAC [17]. Two practical tasks with more than 10-dimensional action space, QuadrupedWalk (with \(|\mathcal {A}| = 12\)) and Swimmer15D (with \(|\mathcal {A}| = 14\)) in dm_control [54], are solved.
The learning curves and test results after learning are depicted in Fig. 8. As can be seen, while SAC solved Quadruped at a high level, it did not solve Swimmer consistently. On the other hand, ERC sometimes delayed convergence on Quadruped, but showed high performance on Swimmer. In addition, the trends of the learning curves suggested that SAC learns faster, while ERC learns more stably. Thus, it can be said that ERC has a learning performance comparable to that of SAC, although they have different strengths in different tasks.
As a remark, the difference in performance between ERC and SAC might be attributed to the difference in the policy gradients used. Actually, ERC learns the policy with likelihood-ratio gradients, while SAC does so with reparameterization gradients [41]. The former obtains the gradients over the entire (multi-dimensional) policy, which cannot optimize each component of actions independently. As a result, this is useful to learn tasks that require synchronization of actions. The latter, on the other hand, obtains the gradients for each component of actions, and thus can quickly learn tasks that do not require precise synchronization. These learning characteristics correspond to the two types of tasks solved in this study. Hence, it is suggested that SAC performed well for Quadruped, where synchronization at each leg is sufficient, and ERC performed well for Swimmer, where synchronization at all joints is needed. Note that policy improvements by mixing them have been proposed, which would be expected to be useful [16]. The proposed stabilization tricks allow for the policy improvements combined with ER regardless of which gradient is used. That would facilitate the development of such hybrid algorithms and improve learning performance in a complementary manner.
6 Discussions
6.1 Relations to conventional algorithms
As demonstrated in the above section, ERC hypothesized in this study can be achieved by limiting the replayed empirical data to ones acceptable by the applied RL algorithm. At the same time, the improvement in learning efficiency due to ER can be increased by expanding the acceptable set of empirical data. These were achieved by the two stabilization tricks proposed, with \(\pi \simeq b\) being the key to both. This key corresponds to the concept of on-policyness, where the empirical data contained in the replay buffer can be regarded as generated from the current policy \(\pi \) [13]. In other words, the stabilization tricks aimed to stabilize the triplet loss and consequently increased the on-policyness. Conversely, the usefulness of increasing the on-policyness in the previous studies [38, 48] can be explained in terms of mitigating the instability factors hidden in the triplet loss.
In fact, ER becomes applicable once A2C is made offpolicy [11]. To this end, the most important trick is to change the sampler by importance sampling, as described above. This trick involves weighting the original loss function by \(\pi /b\), which is unstable when \(b \simeq 0\) (i.e. rare actions are sampled). Several heuristics have been proposed to resolve this instability [12, 58, 64]. Specifically, the most famous offpolicy actorcritic algorithm with ER, ACER [58], modifies the update rules of not only the policy function but also the value function, whereas the proposed tricks require no modification of the value function. Although GeoffPAC [64] and P3O [12] only modify the policy improvements, the former is tedious and lacks the intuitive explanatory capability of ERC, while the latter is inappropriate for confirming ERC because it uses the freshly collected empirical data for learning separately from those sampled from ER.
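The instability of the weight \(\pi /b\) can be seen in a small numerical sketch. The Gaussian parameters and the truncation constant `c` below are assumptions chosen for illustration, and the plain truncation is a generic heuristic rather than the exact rule of ACER, GeoffPAC, or P3O.

```python
import math

def gaussian_pdf(a, mean, std):
    # Density of a one-dimensional Gaussian at action a.
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_weight(a, pi_params, b_params):
    # Weight pi(a) / b(a) used to correct for sampling from b instead of pi.
    return gaussian_pdf(a, *pi_params) / gaussian_pdf(a, *b_params)

def truncated_weight(w, c=10.0):
    # Generic heuristic: cap the weight at a hypothetical constant c.
    return min(w, c)

pi_params = (1.0, 1.0)  # assumed (mean, std) of the current policy pi
b_params = (0.0, 0.5)   # assumed (mean, std) of the behavior policy b

w_typical = importance_weight(0.2, pi_params, b_params)  # action likely under b
w_rare = importance_weight(3.0, pi_params, b_params)     # action rare under b
print(w_typical, w_rare, truncated_weight(w_rare))
```

The action that is rare under b receives a weight several orders of magnitude larger than the typical one, so a single such sample can dominate a minibatch unless the weight is bounded or the sample is excluded.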
Thus, previous studies have focused only on offpolicyness and on improving the learning performance of the algorithms, without discussing why ER is applicable. In contrast, this study explored ERC and implicitly obtained a utility similar to making the target algorithm offpolicy. Indeed, the proposed stabilization tricks selectively replay the empirical data with \(\pi \simeq b\) while increasing the number of acceptable ones. As a result, the weight in the importance sampling can be ignored because \(\pi /b \simeq 1\). That is why A2C with the proposed stabilization tricks, although not explicitly offpolicy, made ER available as if it were offpolicy.
As a remark, the report that PPO (and its variants), which is regarded as onpolicy [47], could empirically utilize ER [28] is related to the above considerations and the proposed stabilization tricks. That is, PPO applies regularization aiming at \(\pi \simeq b\), and if \(\pi \) deviates from b, it clips the weight of the importance sampling, which results in excluding that data. Thus, PPO can increase the onpolicyness of empirical data and exclude unacceptable data from replays, so ER is actually applicable (see the empirical results in Appendix B).
Finally, another interpretation is given for why DDPG [33], SAC [17], and TD3 [14], which are considered offpolicy, can stably learn the optimal policy using ER. These algorithms optimize the current policy \(\pi \) by using reparameterization gradients through the actions generated from \(\pi \) and fed into the action value function Q. As mentioned in the SAC paper, this can be derived from the minimization of KL divergence between \(\pi \) and the optimal policy based on Q only. In other words, unlike algorithms that use likelihoodratio gradients such as A2C, they do not deal with the triplet loss. Therefore, it can be considered that ER is applicable since there are no instability factors induced by the triplet loss in the first place.
6.2 Limitations of stabilization tricks
To make ERC hold, this paper proposed the two stabilization tricks, the contributions of which were evaluated in the above section. However, these naturally leave room for improvement, probably because they were designed heuristically. First, using them requires a discriminator D learned through (16), which incurs an extra computational cost. While it is possible to implement the two stabilization tricks using only d in (15) without D, d is an unstable variable, so a lightweight stabilizer will be needed to replace D.
It is also important to note that the decision on D and d is based on the likelihoods \(\pi \) and b. That is, when the policy deals with a highdimensional action space, even a small deviation in action may be harshly judged as \(\pi \ne b\). In fact, when the RL benchmark with a musculoskeletal model, i.e. myosuite [6], was tested with a 39-dimensional action space \(\mathcal {A}\), \(p(M \mid s, a, b)\) often converged to one depending on the conditions. This means that the empirical data were rarely replayed. To avoid this excessive exclusion of empirical data, it would be useful either to determine the final exclusion of empirical data by summarizing the judgements made separately on each action dimension, or to optimize the policy by masking only the inappropriate action dimensions.
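The effect of dimensionality can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that the action dimensions are independent and that each dimension has the same likelihood ratio of 0.9 between \(\pi \) and b; neither assumption is taken from the experiments.

```python
import math

dim = 39                  # action dimension of the myosuite task mentioned above
per_dim_ratio = 0.9       # assumed ratio pi_i / b_i in every single dimension

# With independent dimensions, the joint log-ratio is the sum over dimensions.
per_dim_log_ratios = [math.log(per_dim_ratio)] * dim
joint_ratio = math.exp(sum(per_dim_log_ratios))

# A per-dimension summary (here the geometric mean) stays at the single-dimension level.
mean_ratio = math.exp(sum(per_dim_log_ratios) / dim)

print(joint_ratio, mean_ratio)
```

Under these assumptions, a 10% deviation per dimension shrinks the joint ratio to roughly 0.016, so a judgement made on the joint likelihood treats the data as far from \(\pi \simeq b\), whereas a judgement summarized per dimension does not.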
In addition, the counteraction alone did not satisfy ERC, as indicated in the ablation tests. This is probably due to saturation of the regularization gain \(\omega \) (see Fig. 6). Although a nonsaturated gain made learning unstable empirically, it would be better to relax the saturation to some extent. Alternatively, an autotuning trick, which is possible by once interpreting the regularization as the corresponding constraint [18, 29], is considered to be useful.
7 Conclusion
This study reconsidered the factors that determine whether ER is applicable to RL algorithms. To this end, the instability factors that ER might induce, especially in onpolicy algorithms, were first revealed. In other words, it was found that the policy gradient algorithms can be regarded as the minimization of triplet loss in metric learning, inheriting its instability factors. To alleviate them, the two stabilization tricks, the counteraction and mining, were proposed as countermeasures that can be attached to arbitrary RL algorithms. The counteraction and mining are responsible for i) expanding the set of acceptable empirical data for each RL algorithm and ii) excluding empirical data outside the set, respectively. Through multiple simulations, it was confirmed that ERC indeed held when these two stabilization tricks were used. Furthermore, the standard onpolicy algorithm with them achieved learning performance comparable to the stateoftheart offpolicy algorithm.
As described in the discussion, the two stabilization tricks proposed to satisfy ERC leave some room for improvement. In particular, since the hypotheses formulated in this study are now deeply related to the onpolicyness of the empirical data in the replay buffer, the tricks would be improved based on this perspective. Afterwards, other RL algorithms will be integrated with the stabilization tricks for further investigations of ERC. In addition, although only the simplest ER method was tested here, the proposed tricks allow for arbitrary ER methods, so it would be interesting to consider more sophisticated methods [35, 59] and methods modified to suit the problem [9] or algorithm [1]. In particular, by following the latter direction and developing an ER method suitable for onpolicy RL algorithms, a significant performance improvement might be expected.
Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Notes
Theoretically, their learning rules correspond to the expected SARSA [55], which is an offpolicy algorithm, but their implementations use a rough Monte Carlo approximation and would lack rigor.
Continuous spaces are assumed without loss of generality to match the experiments in this paper.
The actual implementation has two Q functions to conservatively compute the target value.
Although KL divergence does not actually satisfy the definition of distance, it is widely used in probability geometry because of its distancelike property, which is nonnegative and zero only when two probability distributions coincide.
The same is true if the second term is multiplied by the gain \(\lambda \in (0, 1)\).
References
Banerjee C, Chen Z, Noman N (2024) Improved soft actorcritic: mixing prioritized offpolicy samples with onpolicy experiences. IEEE Trans Neural Netw Learn Syst 35(3):3121–3129
Barron JT (2021) Squareplus: a softpluslike algebraic rectifier. arXiv preprint arXiv:2112.11687
Bejjani W, Papallas R, Leonetti M et al (2018a) Planning with a receding horizon for manipulation in clutter using a learned value function. arXiv:1803.08100
Bejjani W, Papallas R, Leonetti M et al (2018b) Planning with a receding horizon for manipulation in clutter using a learned value function. In: IEEERAS international conference on humanoid robots. IEEE, pp 1–9
Bellet A, Habrard A, Sebban M (2022) Metric learning. Springer Nature
Caggiano V, Wang H, Durandau G et al (2022) Myosuite–a contactrich simulation suite for musculoskeletal motor control. In: Learning for dynamics and control conference. PMLR, pp 492–507
Chen J, Li SE, Tomizuka M (2021) Interpretable endtoend urban autonomous driving with latent deep reinforcement learning. IEEE Trans Intell Transp Syst 23(6):5068–5078
Cheng D, Gong Y, Zhou S et al (2016) Person reidentification by multichannel partsbased cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344
Christianos F, Schäfer L, Albrecht S (2020) Shared experience actorcritic for multiagent reinforcement learning. Adv Neural Inf Process Syst 33:10707–10717
Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sampleefficient model predictive controlbased reinforcement learning approach. J Field Robot 38(3):331–354
Degris T, White M, Sutton RS (2012) Offpolicy actorcritic. In: International conference on machine learning, pp 179–186
Fakoor R, Chaudhari P, Smola AJ (2020) P3O: policyon policyoff policy optimization. In: Uncertainty in artificial intelligence. PMLR, pp 1017–1027
Fedus W, Ramachandran P, Agarwal R et al (2020) Revisiting fundamentals of experience replay. In: International conference on machine learning. PMLR, pp 3061–3071
Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actorcritic methods. In: International conference on machine learning. PMLR, pp 1587–1596
Ganin Y, Ustinova E, Ajakan H et al (2016) Domainadversarial training of neural networks. J Mach Learn Res 17(59):1–35
Gu SS, Lillicrap T, Turner RE et al (2017) Interpolated policy gradient: merging onpolicy and offpolicy gradient estimation for deep reinforcement learning. Adv Neural Inf Process Syst 30:3849–3858
Haarnoja T, Zhou A, Abbeel P et al (2018a) Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning. PMLR, pp 1861–1870
Haarnoja T, Zhou A, Hartikainen K et al (2018b) Soft actorcritic algorithms and applications. arXiv preprint arXiv:1812.05905
Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Math Financ 33(3):437–503
Hansen S, Pritzel A, Sprechmann P et al (2018) Fast deep reinforcement learning using online adjustments from the past. Adv Neural Inf Process Syst 31:10590–10600
Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: adaptive tdistribution estimated robust moments for noiserobust stochastic gradient optimization. Neurocomputing 557:126692
Kalashnikov D, Irpan A, Pastor P et al (2018) Scalable deep reinforcement learning for visionbased robotic manipulation. In: Conference on robot learning. PMLR, pp 651–673
Kapturowski S, Ostrovski G, Quan J et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International conference on learning representations
Kobayashi T (2019) Studentt policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49(12):4335–4347
Kobayashi T (2022a) L2c2: locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International conference on intelligent robots and systems. IEEE, pp 4032–4039
Kobayashi T (2022b) Optimistic reinforcement learning by forward kullbackleibler divergence optimization. Neural Netw 152:169–180
Kobayashi T (2023a) Intentionallyunderestimated value function at terminal state for temporaldifference learning with misdesigned reward. arXiv preprint arXiv:2308.12772
Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results Control Optim 10:100192
Kobayashi T (2023c) Soft actorcritic algorithm with trulysatisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T (2024) Consolidated adaptive tsoft update for deep reinforcement learning. In: IEEE World congress on computational intelligence
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Adv Robot 37(12):719–736
Levine S (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A et al (2016) Continuous control with deep reinforcement learning. In: International conference on learning representations
Lin LJ (1992) Selfimproving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321
Liu X, Zhu T, Jiang C et al (2022) Prioritized experience replay based on multiarmed bandit. Expert Syst Appl 189:116023
Mnih V, Kavukcuoglu K, Silver D et al (2015) Humanlevel control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning. PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International conference on machine learning. PMLR, pp 4851–4860
Oh I, Rho S, Moon S et al (2021) Creating prolevel ai for a realtime fighting game using deep reinforcement learning. IEEE Trans on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Adv Neural Inf Process Syst 31:8626–8638
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International conference on artificial intelligence and statistics. PMLR, pp 4078–4086
Paszke A, Gross S, Massa F et al (2019) Pytorch: an imperative style, highperformance deep learning library. Adv Neural Inf Process Syst 32:8026–8037
Saglam B, Mutlu FB, Cicek DC et al (2023) Actor prioritized experience replay. J Artif Intell Res 78:639–672
Schaul T, Quan J, Antonoglou I et al (2016) Prioritized experience replay. In: International conference on learning representations
Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823
Schulman J, Moritz P, Levine S et al (2016) Highdimensional continuous control using generalized advantage estimation. In: International conference on learning representations
Schulman J, Wolski F, Dhariwal P et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A et al (2022) Experience replay with likelihoodfree importance weights. In: Learning for dynamics and control conference. PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International conference on machine learning. PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press
Tai JJ, Wong J, Innocente M et al (2023) Pyflyt–uav simulation environments for reinforcement learning research. arXiv preprint arXiv:2304.01305
Todorov E, Erez T, Tassa Y (2012) Mujoco: a physics engine for modelbased control. In: IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y et al (2020) dm_control: software and tasks for continuous control. Softw Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning. IEEE, pp 177–184
Wang J, Song Y, Leung T et al (2014) Learning finegrained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393
Wang X, Song J, Qi P et al (2021) Scc: an efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning. PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N et al (2017) Sample efficient actorcritic with experience replay. In: International conference on learning representations
Wei W, Wang D, Li L et al (2024) Reattentive experience replay in offpolicy reinforcement learning. Machine Learning, pp 1–23
Wu P, Escontrela A, Hafner D et al (2023) Daydreamer: world models for physical robot learning. In: Conference on robot learning. PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X et al (2020) Hard negative examples are hard, but useful. In: European conference on computer vision, pp 126–142
Yu B, Liu T, Gong M et al (2018) Correcting the triplet selection bias for triplet loss. In: European conference on computer vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Adv Neural Inf Process Syst 32:12381–12392
Zhang S, Boehmer W, Whiteson S (2019) Generalized offpolicy actorcritic. Adv Neural Inf Process Syst 32:2001–2011
Zhao D, Wang H, Shao K et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence. IEEE, pp 1–6
Acknowledgements
This research was supported by “Strategic Research Projects” grant from ROIS (Research Organization of Information and Systems).
Author information
Contributions
Taisuke Kobayashi contributed to everything for this paper: Conceptualization, Methodology, Software, Validation, Investigation, Visualization, Funding acquisition, and Writing.
Ethics declarations
Competing Interests
The author declares that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Compliance with Ethical Standards
The data used in this study was exclusively generated by the author. No research involving human participants or animals has been performed.
Appendices
Appendix A: Details of implementation
The algorithms used in this paper are implemented with PyTorch [42]. This implementation is based on the one in the literature [27]. The characteristic hyperparameters in this implementation are listed in Table 1.
The policy and value functions are approximated by fullyconnected neural networks consisting of two hidden layers with 100 neurons each. As activation functions, the Squish function [2, 31] and RMSNorm [63] are combined. AdaTerm [21], a noiserobust optimizer, is adopted to optimize the parameters robustly against the noise caused by bootstrapped learning in RL. Similarly, target networks updated by the CATsoft update [30] are employed to stabilize learning (and to prevent learning speed degradation). In A2C, the target networks are applied to both the policy \(\pi \) and the value function V, while in SAC they are applied only to the action value function Q. In addition, A2C enhances output continuity by using L2C2 [25] with default parameters, whereas SAC does not, in order to reproduce its standard implementation.
Both A2C and SAC policies are modeled as Student’s tdistribution with high global exploration capability [24]. Therefore, the outputs from the networks are three model parameters: position, scale, and degrees of freedom. In SAC, as in its standard implementation, the process of converting the generated action to a bounded range is also explicitly accounted for in the probability distribution. In A2C, on the other hand, this process is performed implicitly on the environment side and is not reflected in the probability distribution.
SAC approximates the two action value functions \(Q_{1,2}\) with independent networks as in the standard implementation, and aims at conservative learning by selecting the smaller value. On the other hand, A2C aims at stable learning by outputting 10 values from shared networks and using the median as the representative value. In order to enhance the effect of ensemble learning, the outputs are computed with both learnable and unlearnable parameters, so that each output can easily take on different values (especially in unlearned regions) [40].
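The two aggregation rules described above can be sketched as follows. The function names and the example values are hypothetical and only illustrate why the minimum is conservative and the median is robust to a single outlying output; the networks themselves are omitted.

```python
import statistics

def conservative_q(q1, q2):
    # SAC-style aggregation: take the smaller of the two action value estimates.
    return min(q1, q2)

def representative_value(values):
    # A2C-style aggregation as described above: median of the ensemble outputs.
    return statistics.median(values)

# Hypothetical ensemble of 10 outputs with one outlier in an unlearned region.
ensemble_outputs = [1.0] * 9 + [100.0]

q_target = conservative_q(1.0, 0.5)
v_median = representative_value(ensemble_outputs)
v_mean = statistics.mean(ensemble_outputs)
print(q_target, v_median, v_mean)
```

In this example the median ignores the outlying output, while the mean would be pulled strongly toward it.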
A2C and SAC share the settings of ER, with a buffer size of 102,400 and a batch size of 256. The replay buffer is in FIFO format that deletes the oldest empirical data when the buffer size is exceeded. At the end of each episode, half of the empirical data stored in ER is replayed uniformly at random.
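The ER settings above can be sketched with a bounded deque. This is a simplified illustration rather than the implementation used in the experiments: the class and method names are hypothetical, and sampling without replacement within one replay phase is an assumption, since the text only states that half of the stored data is replayed uniformly at random.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer: the oldest data are dropped once capacity is exceeded."""

    def __init__(self, capacity=102_400, seed=0):
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        # Appending to a full deque with maxlen discards the oldest item.
        self.data.append(transition)

    def episode_end_batches(self, batch_size=256):
        # Replay half of the stored data, drawn uniformly at random
        # (assumed here to be without replacement), in minibatches.
        items = list(self.data)
        n_replay = len(items) // 2
        chosen = self.rng.sample(items, n_replay)
        for start in range(0, n_replay, batch_size):
            yield chosen[start:start + batch_size]

buffer = ReplayBuffer()
for step in range(1000):
    buffer.add((step, "state", "action", "reward", "next_state"))

batches = list(buffer.episode_end_batches())
print(len(batches), [len(batch) for batch in batches])
```

With 1000 stored transitions, 500 are replayed at the end of the episode, split into one full minibatch of 256 and a remainder.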
Appendix B: Application to PPO
The proposed stabilization tricks are applied to PPO [47], a more recent onpolicy algorithm than A2C. PPO multiplies the policy improvement by the policy likelihood ratio using importance sampling, and the ratio is clipped to force the gradient to zero if it is excessive. The recommended clipping threshold, 0.2, is employed. Since this paper uses an ER that accumulates empirical data as the simplest single transitions, GAE [46], which is often used in conjunction with PPO, is ignored. In addition, to check the regularization effects of the stabilization tricks, an extra regularization term, i.e. the policy entropy, is also omitted. As mentioned in the Introduction, since PPO is empirically ERapplicable, it is possible to quantitatively compare the learning trends and changes in the final policies due to each stabilization trick.
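The clipping behavior can be sketched for a single sample as follows. This follows the standard clipped surrogate of PPO [47] with the threshold 0.2; the function name is hypothetical, and the stabilization tricks, the value function, and minibatching are all omitted.

```python
def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """Return the clipped surrogate and its derivative with respect to the ratio."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    unclipped_term = ratio * advantage
    clipped_term = clipped_ratio * advantage
    surrogate = min(unclipped_term, clipped_term)
    # The gradient flows only while the unclipped term is the active one;
    # otherwise the sample is effectively excluded from the update.
    grad = advantage if unclipped_term <= clipped_term else 0.0
    return surrogate, grad

# The ratio pi / b is close to one: the sample contributes a gradient.
near = ppo_clipped_surrogate(ratio=1.05, advantage=1.0)
# The ratio is excessive in the direction favored by the advantage: zero gradient.
far = ppo_clipped_surrogate(ratio=1.5, advantage=1.0)
# The ratio is too small for a negative advantage: zero gradient as well.
low = ppo_clipped_surrogate(ratio=0.5, advantage=-1.0)
print(near, far, low)
```

Samples whose ratio has already moved beyond the threshold in the direction favored by the advantage yield a zero gradient, which is how the clipping excludes data with \(\pi \) far from b.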
With the above setup, the results of solving QuadXWaypointsv2 and FixedwingWaypointsv2 provided by PyFlyt [52] are summarized below. Note that these tasks control different drones, i.e. a quadrotor and a fixedwing drone, respectively. The learning curves for 7 trials and the test results after learning, under the four conditions obtained by enabling or disabling each of the two stabilization tricks, are depicted in Fig. 9.
First, the vanilla PPO shows remarkable but expected results. Although PPO is originally considered an onpolicy algorithm, as described in the text, it learned both tasks in combination with ER even without the proposed stabilization tricks. This is because, as discussed in Section 6.1, PPO performs regularization and clipping heuristically such that \(\pi \simeq b\) (i.e. onpolicyness) holds. In fact, PPO with only one of the stabilization tricks did not saturate the corresponding internal parameter, indicating that ERC was satisfied without loss of its functionality. Note, however, that PPO by itself may not be sufficient to satisfy ERC, since GAE, which was omitted from the implementation this time, relies more heavily on the behavior policy than the simple advantage function (i.e. TD error).
Next, the contributions of the proposed stabilization tricks are confirmed. First, it is easy to see that the mining trick increases learning speed, while the counteraction trick tends to decrease it. This may be due to the fact that the direction of policy improvements becomes clearer by making the mining trick replay only the empirical data that are useful for learning, while the counteraction trick restricts policy improvements by binding \(\pi \simeq b\). For the former, actually, the scale of TD error, which implies the learning direction, was increased when the mining trick was added.
On the other hand, the counteraction trick seems to improve the exploration capability in exchange for the learning speed. The return at the end of learning was maximized by including the counteraction trick on Fixedwing, although the return on Quadrotor with it was lower than the others due to slow convergence. This may be due to the increase in the entropy of \(\pi \) caused by regularizing it toward the various behavior policies in the replay buffer. In fact, the addition of the counteraction trick yielded a decrease of \(\ln \pi \), which corresponds to the negative entropy, during learning.
About this article
Cite this article
Kobayashi, T. Revisiting experience replayable conditions. Appl Intell 54, 9381–9394 (2024). https://doi.org/10.1007/s10489024056857
DOI: https://doi.org/10.1007/s10489024056857