1 Introduction

In the past decade, (deep) reinforcement learning (RL) [51] has made significant progress and has gained attention in various application fields, including game AI [39, 57], robot control [22, 60], autonomous driving [7, 10], and even finance [19]. RL agents can optimize their action policies for unknown environments in a trial-and-error manner. However, the trial-and-error process is computationally expensive, and the low sample efficiency of RL remains a challenge.

To improve sample efficiency, many RL algorithms utilize experience replay (ER) [34]. ER stores the empirical data obtained during the trial-and-error process in a replay buffer rather than consuming them once in an online, streaming fashion. By randomly and repeatedly replaying the stored empirical data for learning, the contribution of each empirical datum to learning, i.e. the sample efficiency, can be increased. Although ER is a simple technique, it is highly useful due to its compatibility with recent deep learning libraries and abundant computational resources. However, it is claimed that ER is only applicable to off-policy algorithms [13]. For example, proximal policy optimization (PPO), an on-policy algorithm, learns after collecting a certain amount of empirical data but generally does not reuse them [47].

Nevertheless, a survey of past cases suggests the efficacy of ER for on-policy algorithms. For example, previous studies [3, 65] reported that ER was successfully applied to state–action–reward–state–action (SARSA) [51], a typical on-policy algorithm.Footnote 1 SARSA is also used to learn the value functions in soft actor-critic (SAC) [17] and twin delayed deep deterministic policy gradient (TD3) [14], both of which are off-policy algorithms.Footnote 2 In addition, the author has successfully applied ER to PPO and its variants without any learning breakdowns [28] (also see Appendix B).

Fig. 1 Concept of two stabilization tricks for satisfying ERC: Under the hypothesis that different algorithms have different sets of acceptable empirical data, the first trick, counteraction, expands that set, while the second trick, mining, selects the empirical data to be replayed

From the above suggestive cases, it is expected that being an off-policy algorithm is a sufficient condition for the “experience replayable conditions” (ERC). What, then, are the necessary and sufficient conditions for ERC? This study hypothesizes that i) each policy improvement algorithm in RL has a corresponding set of empirical data that is acceptable for learning, and that ii) ER can be applied to it as long as only the empirical data belonging to that set are replayed. Under these hypotheses, one can expect that off-policy algorithms naturally satisfy ERC because their acceptable set coincides with the whole set of empirical data. As a remark, based on the past cases, it is assumed that there is no ERC in the context of learning value functions (this assumption will become clear in the numerical verification of this paper).

To demonstrate the validity of these hypotheses, this paper first reveals the instability factors that destabilize learning under ER. Specifically, the focus is on the fact revealed in [26] that the standard policy improvement algorithms can be derived as a triplet loss in metric learning [56] from the viewpoint of control as inference (CaI) [32]. That is, the instability factors of the triplet loss, i.e. i) repulsive forces from negative samples (so-called hard negatives) [8, 61] and ii) replays of inappropriate experiences (so-called distribution shift) [45, 62], are inherited by the policy improvement algorithms. Since these instability factors are activated by the way data are sampled, ER also inherits them by randomly selecting empirical data, causing learning breakdowns.

To alleviate the identified instability factors, two corresponding stabilization tricks, a counteraction and a mining, are proposed based on an experience discriminator. For the hypothesized ERC, i) the counteraction is responsible for expanding the acceptable set of empirical data, and ii) the mining is responsible for selectively replaying the empirical data belonging to it (see Fig. 1). Hence, ERC should hold by setting both the counteraction and the mining appropriately.

With the proposed stabilization tricks, it is verified whether ER becomes applicable to advantage actor-critic (A2C) [37], an on-policy algorithm, by satisfying ERC. First, a rough grid search of hyperparameters is performed to identify the range of counteraction and mining in which ERC holds. Second, ablation tests of the counteraction and mining with the found hyperparameters show the need for both. Finally, A2C with the proposed stabilization tricks is evaluated on more practical multi-dimensional tasks, exhibiting learning performance comparable to SAC [17], a state-of-the-art off-policy algorithm.

In summary, the contributions of this paper are threefold:

1. It is revealed that the reason why ER is limited to off-policy algorithms lies in the two instability factors associated with the triplet loss hidden in the policy improvement algorithms.

2. The two stabilization tricks, the counteraction and mining, are developed to alleviate the corresponding instability factors respectively.

3. With the two stabilization tricks, A2C, an on-policy algorithm, can satisfy ERC and achieve learning performance comparable to SAC.

2 Preliminaries

2.1 Basics of reinforcement learning

Let us briefly introduce the problem statement of RL [51]. RL aims to maximize the sum of future rewards, the so-called return, under a Markov decision process (MDP). Mathematically, an MDP is defined by the tuple \((\mathcal {S}, \mathcal {A}, p_e, r)\): \(\mathcal {S}\) and \(\mathcal {A}\) denote the state and action spaces,Footnote 3 respectively; \(p_e: \mathcal {S} \times \mathcal {A} \times \mathcal {S} \mapsto [0, \infty )\) gives the stochastic state transition function; and \(r: \mathcal {S} \times \mathcal {A} \mapsto \mathcal {R}\) defines the reward function for the current situation, with \(\mathcal {R} \subset \mathbb {R}\) the set of rewards.

Under the above MDP, an agent encounters the current state \(s_t \in \mathcal {S}\) of an (unknown) environment at the current time step \(t \in \mathbb {N}\). The agent (stochastically) decides its action \(a_t \in \mathcal {A}\) according to its learnable policy \(\pi (a_t \mid s_t): \mathcal {S} \times \mathcal {A} \mapsto [0, \infty )\). By interacting with the environment by \(a_t\), the agent enters the next state \(s_{t+1} =: s_t^\prime \in \mathcal {S}\) according to \(p_e(s_t^\prime \mid s_t, a_t)\). At the same time, the reward \(r_t = r(s_t, a_t)\) is obtained from the environment.

By repeating the above transition, the agent obtains the return, \(R_t = \sum _{k=0}^\infty \gamma ^k r_{t+k}\), with \(\gamma \in [0, 1)\) the discount factor. RL aims to maximize the expected return from t by optimizing \(\pi (a_t \mid s_t)\) as follows:

$$\begin{aligned} \pi ^*(a_t \mid s_t) = \arg \max _{\pi } \mathbb {E}_{a_t \sim \pi (a_t \mid s_t), \tau _t \sim \rho ^\pi }[R_t \mid s_t] \end{aligned}$$
(1)

where \(\rho ^\pi \) denotes the probability of generating the state-action trajectory, \(\tau _t = [s_t^\prime , a_{t+1}, \ldots ]\), under the combination of \(\pi \) and \(p_e\). Here, the expected return can be defined as the following learnable value functions.

$$\begin{aligned} V(s_t)&= \mathbb {E}_{a_t \sim \pi (a_t \mid s_t), \tau _t \sim \rho ^\pi }[R_t \mid s_t] \end{aligned}$$
(2)
$$\begin{aligned} Q(s_t, a_t)&= \mathbb {E}_{\tau _t \sim \rho ^\pi }[R_t \mid s_t, a_t] \end{aligned}$$
(3)

where \(V(s_t) = \mathbb {E}_{a_t \sim \pi (a_t \mid s_t)}[Q(s_t, a_t)]\) holds. The value functions are useful in acquiring \(\pi ^*\), and therefore, most RL algorithms learn them.
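To make the above definitions concrete, the following minimal Python sketch computes the discounted return \(R_t\) backwards over one episode; the reward sequence and discount factor are illustrative values, not taken from the paper.

```python
# Minimal sketch (not from the paper): discounted return R_t = sum_k gamma^k r_{t+k},
# computed backwards along one finished episode. Values are illustrative only.
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    R = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R   # R_t = r_t + gamma * R_{t+1}
        returns[t] = R
    return returns

rewards = [1.0, 0.0, 0.5, 1.0]           # hypothetical per-step rewards
print(discounted_returns(rewards))        # Monte-Carlo targets for V(s_t)
```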

2.2 RL algorithms with experience replay

ER stores the empirical data, \((s, a, s^\prime , r, b)\) (where \(b = \pi (a \mid s) \in \mathbb {R}_+\) is the policy likelihood of generating a, which is used for the proposed method), in a replay buffer \(\mathcal {B}\) with finite size \(|\mathcal {B}|\) [34]. For the sake of simplicity, this paper focuses on ER for learning algorithms that use each single transition, although ER for long-term sequences has also been developed [20, 23]. That is, the loss function to be minimized w.r.t. the learnable functions in RL (mainly \(\pi \), V, and/or Q) is required to be computable from each single empirical datum. The following minimization problem is therefore given under ER.

$$\begin{aligned} \min _{f} \mathbb {E}_{(s, a, s^\prime , r, b) \sim \mathcal {B}}[\ell _{f}^\textrm{alg}(s, a, s^\prime , r, b)] \end{aligned}$$
(4)

where f indicates the function(s) to be optimized and \(\ell _{f}^\textrm{alg}\) denotes the algorithm-dependent loss function for f (even if other functions are used in computing it, they are not optimized through its minimization). Although various variants have been proposed as attention to the practicality of ER has increased (e.g. prioritizing important empirical data with a high replay frequency [43, 44] and ignoring inappropriate empirical data [38, 48]), this paper employs the most standard implementation with a FIFO-type replay buffer and uniform sampling of empirical data.
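For reference, a minimal sketch of this standard ER setup, a FIFO replay buffer storing \((s, a, s^\prime , r, b)\) tuples with uniform sampling, might look as follows; the capacity and batch size are hypothetical defaults, not values from the paper.

```python
# Minimal sketch of the standard ER setup assumed in this paper: FIFO buffer,
# uniform sampling. Capacity and batch size are placeholder values.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # FIFO: oldest data is dropped first

    def store(self, s, a, s_next, r, b):
        # b = pi(a | s) at collection time (behavior-policy likelihood)
        self.buffer.append((s, a, s_next, r, b))

    def sample(self, batch_size=256):
        # uniform sampling approximates the expectation over B in (4)
        idx = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in idx]
```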

2.2.1 Soft actor-critic

Here, the two algorithms mainly used in this paper are introduced briefly. The first is SAC [17], a representative off-policy algorithm, which serves as the comparison. SAC learns two functions, \(\pi \) and Q, through minimization of the following loss functions.Footnote 4

$$\begin{aligned} \ell _{Q}^\textrm{SAC}(s, a, s^\prime , r)&= \frac{1}{2} \{ r + \gamma \mathbb {E}_{a^\prime \sim \pi (a^\prime \mid s^\prime )}[\bar{Q}(s^\prime , a^\prime ) \nonumber \\&- \alpha \ln \pi (a^\prime \mid s^\prime )] - Q(s, a) \}^2 \end{aligned}$$
(5)
$$\begin{aligned} \ell _{\pi }^\textrm{SAC}(s)&= \mathbb {E}_{a \sim \pi (a \mid s)}[- Q(s, a) + \alpha \ln \pi (a \mid s)] \end{aligned}$$
(6)

where \(\bar{Q}\) denotes the target value (without computational graph) and \(\alpha \ge 0\) denotes the magnitude of policy entropy, which can be auto-tuned by [29]. The two expectations w.r.t. \(\pi \) are roughly approximated by a one-sample Monte Carlo method.

As for the optimization of Q, expected SARSA [55], which can be off-policy, is employed. Indeed, the target value of Q is computed with \(\pi \), which can differ from the policy used to obtain the empirical data (the so-called behavior policy). In addition, the optimization of \(\pi \) is off-policy as well, since the Q minimized in \(\ell _{\pi }\) can depend on arbitrary policies (as in [33]). In this way, SAC is an off-policy algorithm overall. As a result, SAC can optimize \(\pi \) and Q with arbitrary empirical data, making it possible to use ER freely.
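As a rough illustration of (5) and (6), the following hedged PyTorch-style sketch assumes an actor module that returns an action distribution (with sample/rsample/log_prob) and critic/target_critic modules; it is not the authors' implementation, and the values of alpha and gamma are placeholders.

```python
# Hedged sketch of the SAC losses (5) and (6) under assumed module interfaces.
import torch

def sac_losses(batch, actor, critic, target_critic, alpha=0.2, gamma=0.99):
    s, a, s_next, r = batch   # tensors sampled from the replay buffer

    # critic loss (5): expected-SARSA target with entropy bonus, one-sample MC;
    # the target value carries no computational graph (cf. Q-bar)
    with torch.no_grad():
        dist_next = actor(s_next)
        a_next = dist_next.sample()
        log_pi_next = dist_next.log_prob(a_next).sum(-1)
        target = r + gamma * (target_critic(s_next, a_next) - alpha * log_pi_next)
    loss_q = 0.5 * (target - critic(s, a)).pow(2).mean()

    # actor loss (6): reparameterized action keeps the gradient path through pi
    dist = actor(s)
    a_pi = dist.rsample()
    log_pi = dist.log_prob(a_pi).sum(-1)
    loss_pi = (alpha * log_pi - critic(s, a_pi)).mean()
    return loss_q, loss_pi
```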

2.2.2 Advantage actor-critic

Next, A2C [37], an on-policy algorithm, is introduced as the baseline for the proposed method. In this paper, the advantage function is approximated by the temporal difference (TD) error \(\delta \) for simplicity. In other words, A2C in this paper learns \(\pi \) and V by minimizing the following loss functions.

$$\begin{aligned} \delta (s, s^\prime , r)&= r + \gamma \bar{V}(s^\prime ) - V(s) \nonumber \\ \ell _{V}^\textrm{A2C}(s, s^\prime , r)&= \frac{1}{2} \delta (s, s^\prime , r)^2 \end{aligned}$$
(7)
$$\begin{aligned} \ell _{\pi }^\textrm{A2C}(s, a, s^\prime , r)&= - \delta (s, s^\prime , r) \ln \pi (a \mid s) \end{aligned}$$
(8)

where \(\bar{V}\) denotes the target value as like \(\bar{Q}\).

As V necessarily depends on \(\pi \) by definition, the above learning rule for V requires r (and \(s^\prime \)) generated under \(\pi \). The learning rule for \(\pi \) is also derived from the policy gradient theorem, which again assumes the dependence of V on \(\pi \). In this way, A2C is an on-policy algorithm overall. As a result, A2C should theoretically not be able to learn with ER, which stores empirical data independent of \(\pi \). Note that the policy gradient algorithm can theoretically be converted to an off-policy algorithm by utilizing importance sampling [11], but due to computational instability and distribution shift, various countermeasures must be implemented [12, 58, 64] (see Section 6.1 for details).
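Analogously, a hedged sketch of (7) and (8) could be written as below, again assuming hypothetical actor, critic_v, and target_v modules; the target value is detached to mimic \(\bar{V}\).

```python
# Hedged sketch of the A2C losses (7) and (8) under assumed module interfaces.
import torch

def a2c_losses(batch, actor, critic_v, target_v, gamma=0.99):
    s, a, s_next, r = batch

    # TD error delta approximates the advantage; the target value has no graph
    with torch.no_grad():
        td_target = r + gamma * target_v(s_next)
    delta = td_target - critic_v(s)

    loss_v = 0.5 * delta.pow(2).mean()               # value loss (7)
    log_pi = actor(s).log_prob(a).sum(-1)
    loss_pi = (-delta.detach() * log_pi).mean()      # policy loss (8)
    return loss_v, loss_pi
```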

2.3 Metric learning with triplet loss

Metric learning [5] is a methodology for extracting an embedding space in which the similarity of input data \(x \in \mathcal {X}\) (e.g. images) can be measured. The triplet loss relevant to this study [56], one of the loss functions for extracting the embedding space \(\mathcal {Z}\), is briefly introduced. Three types of input data, the anchor, positive, and negative data (x, \(x^+\), and \(x^-\), respectively), are required to compute it. These are fed into a common network f, which outputs the corresponding features in \(\mathcal {Z}\). A distance function \(d: \mathcal {Z} \times \mathcal {Z} \mapsto \mathbb {R}_+\) is then prepared to learn the desired distance relationship among the inputs. That is, x should be close to \(x^+\) and away from \(x^-\), as shown in Fig. 2.

By minimizing the following triplet loss, this relationship can be acquired.

$$\begin{aligned} \ell _{f}^\textrm{triplet} = \max (0, d(x, x^+) + m - d(x, x^-)) \end{aligned}$$
(9)

where \(m \ge 0\) denotes the margin between the positive and negative clusters. The max operator is employed to preclude the divergence into negative infinity. As a result, the anchor data can be embedded near the positive data while being away from the negative data to some extent.
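For illustration, a minimal implementation of (9) with a squared Euclidean distance might look as follows; the embedding network f and the margin value are assumptions for this sketch.

```python
# Minimal sketch of the triplet loss (9); `f` is an assumed embedding network.
import torch

def triplet_loss(f, x, x_pos, x_neg, margin=0.2):
    d_pos = (f(x) - f(x_pos)).pow(2).sum(-1)    # d(x, x+)
    d_neg = (f(x) - f(x_neg)).pow(2).sum(-1)    # d(x, x-)
    # max(0, .) precludes divergence into negative infinity
    return torch.clamp(d_pos + margin - d_neg, min=0.0).mean()
```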

Fig. 2 Desired distance relationship in metric learning with triplet loss: The anchor data x should be in the same cluster as the positive data \(x^+\), and the negative data \(x^-\) should be taken from a different cluster away from it

3 Reasons for instabilization by experience replay

3.1 Policy improvements via control as inference

CaI has been proposed as a new interpretation of RL [32]. To show that the policy improvements (e.g. A2C) can be regarded as a kind of triplet loss, CaI is utilized as follows (see [26] for the details of the derivation).

In this framework, an optimality variable \(O \in \{0, 1\}\) is introduced. The probability mass functions for it, \(p(O=1 \mid s)\) and \(p(O=1 \mid s, a)\), can be defined with V and Q, respectively.

$$\begin{aligned} p(O=1 \mid s)&= e^{\beta (V(s) - C)} =: p_V(s) \end{aligned}$$
(10)
$$\begin{aligned} p(O=1 \mid s, a)&= e^{\beta (Q(s, a) - C)} =: p_Q(s, a) \end{aligned}$$
(11)

where \(\beta > 0\) denotes the inverse temperature parameter. That is, if \(\beta \) is small, these probabilities are likely to be near 1/2, increasing ambiguity; if \(\beta \) is large, they tend toward 0 or 1, making them deterministic. C is an unknown constant introduced for convenience so that \(V(s) - C \le 0\) and \(Q(s, a) - C \le 0\): it ensures that the above equations satisfy the definition of a probability, and it corresponds to the maximum value of the value function. Since numerical computation is impossible with C remaining, C must be excluded from the learning rule that is eventually derived. Note that since O is binary, the probability of \(O=0\) is given as \(p(O=0 \mid s) = 1 - p_V(s)\) and \(p(O=0 \mid s, a) = 1 - p_Q(s, a)\).

Using these probabilities, the optimal and non-optimal policies are inferred via Bayes' theorem. That is, \(\pi (a \mid s, O)\) is given as follows:

$$\begin{aligned} \pi (a \mid s, O)&= \frac{p(O \mid s, a) b(a \mid s)}{p(O \mid s)} \nonumber \\&= {\left\{ \begin{array}{ll} \frac{p_Q(s, a)}{p_V(s)} b(a \mid s) =: \pi ^+(a \mid s) & O = 1 \\ \frac{1 - p_Q(s, a)}{1 - p_V(s)} b(a \mid s) =: \pi ^-(a \mid s)& O = 0 \end{array}\right. } \end{aligned}$$
(12)

where \(b(a \mid s)\) denotes the behavior policy to sample a, which can be different from the current learnable policy \(\pi \).

Now, to optimize \(\pi \), the following minimization problem is solved.

$$\begin{aligned} \min _\pi \mathbb {E}_{s \sim \mathcal {S}}[\textrm{KL}(\pi (a \mid s) \mid \pi ^+(a \mid s)) - \textrm{KL}(\pi (a \mid s) \mid \pi ^-(a \mid s))] \end{aligned}$$
(13)

where \(\textrm{KL}(\cdot \mid \cdot )\) denotes Kullback-Leibler (KL) divergence between two probability distributions. The gradient of these two terms w.r.t. \(\pi \), \(g_\pi \), is derived as follows:

$$\begin{aligned} g_\pi&= \mathbb {E}_{s \sim \mathcal {S}} \left[ \nabla \mathbb {E}_{a \sim \pi (a \mid s)} \left[ \ln \frac{\pi (a \mid s)}{\pi ^+(a \mid s)} - \ln \frac{\pi (a \mid s)}{\pi ^-(a \mid s)} \right] \right] \nonumber \\&= \mathbb {E}_{s \sim \mathcal {S}, a \sim \pi (a \mid s)} \left[ - \nabla \ln \pi (a \mid s) \left\{ \beta (Q(s, a) - V(s)) + \ln \frac{1 - p_Q(s, a)}{1 - p_V(s)} \right\} \right] \nonumber \\&\propto \mathbb {E}_{(s, a) \sim \mathcal {B}} \left[ - (Q(s, a) - V(s)) \nabla \ln \pi (a \mid s) \right] \nonumber \\&\propto \mathbb {E}_{(s, a, s^\prime , r) \sim \mathcal {B}} \left[ - \delta (s, s^\prime , r) \nabla \ln \pi (a \mid s) \right] \end{aligned}$$
(14)
Fig. 3 Two stabilization tricks: According to the judgements of the experience discriminator, (a) counteraction applies a regularization to \(\pi \) to increase the misidentification rate to \(\pi \simeq b\); and (b) mining masks empirical data judged to be \(\pi \ne b\) and excludes them from replay/training

Here, by assuming \(\beta \rightarrow \infty \), this gradient can be simplified. In addition, the advantage function \(A = Q - V\) can be approximated by the TD error \(\delta \) (see Section 2.2.2). As mentioned above, when \(\beta \) is large, \(p(O=1)\) deterministically takes 0 or 1, and with \(\beta \rightarrow \infty \), \(O=1\) is not obtained unless \(V = Q = C\) (i.e. the value function is maximized by the optimized policy, which is consistent with (1)).

The minimization of (13) using the surrogate gradient is consistent with the minimization of the A2C loss function, \(\ell _\pi ^\textrm{A2C}\), by (stochastic) gradient descent. The original minimization problem can thus be interpreted as trying to move the anchor data \(\pi \) closer to the positive data \(\pi ^+\) and away from the negative data \(\pi ^-\), employing the KL divergence as the distance function d.Footnote 5 Compared to (9), the margin m and the max operator are not used, but this is because \(m=0\) and \(\pi ^{+}\) and \(\pi ^{-}\) are centered on b, which prevents divergence to infinity.

3.2 Instability factors hidden in triplet loss

As the policy improvements correspond to minimizing a triplet loss, their characteristics during learning should also inherit those of triplet-loss minimization. Ideally, the anchor data should approach the positive data, resulting in \(\pi \rightarrow \pi ^+\). On the other hand, minimizing the triplet loss differs from simple supervised learning problems, and several factors that destabilize learning have been reported. These instability factors are influenced by the way triplets are selected from the dataset. In the policy improvements, therefore, they are activated by randomly sampling empirical data from ER (and using them for learning). Under this assumption, the following two instability factors are identified, along with the solution guidelines noted in the context of metric learning, which should also be useful for RL combined with ER.

First, selecting inappropriate anchor, positive, and negative data might cause \(\textrm{KL}(\pi \mid \pi ^+) > \textrm{KL}(\pi \mid \pi ^-)\). In this case, the repulsion from \(\pi ^-\) is stronger than the attraction to \(\pi ^+\), the optimal solution cannot be found, and \(\pi \) is updated in a divergent manner. This is known as the hard negative problem [8, 61]. To alleviate this instability factor, the exclusion of hard-negative triplets and/or a regularization that suppresses the repulsion would be required.

Second, of all the triplets that could be constructed, only a few can be used for optimization. In fact, \(\pi ^+\) and \(\pi ^-\) are linked via the behavior policy b in the policy improvements, so arbitrary triplets cannot be constructed. This might cause distribution shift, inducing a large bias in learning [45, 62]. To alleviate this instability factor, it is desirable to eliminate triplets that are prone to bias and/or to regularize the distribution of selected triplets.

4 Tricks for experience replayable conditions

4.1 Experience discriminator

The instability factors induced by the above triplet loss have to be suppressed to satisfy ERC. To this end, two stabilization tricks, counteraction and mining, are heuristically designed (see Fig. 3). As a common module for both, an experience discriminator is first introduced; it estimates whether the empirical data in ER are suitable for triplet construction.

Specifically, it is required to judge whether a state-action pair \((s, a)\) in the buffer \(\mathcal {B}\) can be regarded as one generated by the current policy \(\pi \). According to the density trick, the following density ratio d satisfies this requirement.

$$\begin{aligned} d(s, a, b)&= \min \left( 0.5, \frac{\pi (a \mid s)}{\pi (a \mid s) + b} \right) \nonumber \\&= \min (0.5, \sigma (\ln \pi (a \mid s) - \ln b)) \end{aligned}$$
(15)

where \(\sigma (\cdot ) \in [0, 1]\) is the sigmoid function. Note that d should be equal to or less than 0.5 since \((s, a)\) is actually generated from the behavior policy with its likelihood b.

In addition, robust judgements should be considered w.r.t. the stochasticity of actions and the non-stationarity of policies. For this purpose, \(D:\mathcal {S} \mapsto [0, 1]\), which marginalizes d by a and has only s as input, is defined as a learnable model. As D corresponds to the probability parameter of Bernoulli distribution, it can be optimized by minimizing its negative log-likelihood.

$$\begin{aligned} \mathcal {L}_{D} = \mathbb {E}_{(s, a, b) \sim \mathcal {B}}[- d(s, a, b) \ln D(s) - (1 - d(s, a, b)) \ln (1 - D(s))] \end{aligned}$$
(16)

where the above d is employed as the supervised signal.
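A hedged sketch of (15) and (16) is given below, assuming a policy module returning an action distribution and a discriminator network for D(s); log_b is the stored log-likelihood \(\ln b\) of the behavior policy, and all names are assumptions.

```python
# Hedged sketch of the density ratio (15) and discriminator loss (16).
import torch

def density_ratio(policy, s, a, log_b):
    log_pi = policy(s).log_prob(a).sum(-1)
    # d = sigmoid(ln pi - ln b), clipped at 0.5 since (s, a) was generated by b
    return torch.clamp(torch.sigmoid(log_pi - log_b), max=0.5)

def discriminator_loss(discriminator, policy, s, a, log_b):
    d = density_ratio(policy, s, a, log_b).detach()     # supervised signal
    D = discriminator(s).clamp(1e-6, 1 - 1e-6)          # Bernoulli parameter D(s)
    # negative log-likelihood of Bernoulli(D) with label d, as in (16)
    return (-d * torch.log(D) - (1 - d) * torch.log(1 - D)).mean()
```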

4.2 Counteraction of deviations from non-optimal policies

The first trick, the counteraction, is proposed mainly as a countermeasure against the hard negative problem. In previous work [8], a regularization that changes the ratio of the two terms in (9) was proposed to counteract the repulsion from the negative data. This concept can be reproduced in (13) as follows:

$$\begin{aligned} \min _\pi \mathbb {E}_{s \sim \mathcal {S}}[(1 + \lambda )\textrm{KL}(\pi (a \mid s) \mid \pi ^+(a \mid s)) - \textrm{KL}(\pi (a \mid s) \mid \pi ^-(a \mid s))] \end{aligned}$$
(17)

where \(\lambda \ge 0\) denotes the gain for imbalancing the positive and negative terms.Footnote 6

Let us now show how this extension works in theory. As in the original minimization problem, the gradient of the added regularization is derived as below.

$$\begin{aligned} & \nabla \mathbb {E}_{s \sim \mathcal {S}}[\textrm{KL}(\pi (a \mid s) \mid \pi ^+(a \mid s))] \nonumber \\= & \nabla \mathbb {E}_{s \sim \mathcal {S}, a \sim \pi }[\ln \pi (a \mid s) - \ln b(a \mid s)\nonumber \\ & - \ln p_Q(s, a) + \ln p_V(s)] \nonumber \\= & \nabla \mathbb {E}_{s \sim \mathcal {S}}[\textrm{KL}(\pi (a \mid s) \mid b(a \mid s))] \nonumber \\ & - \nabla \mathbb {E}_{s \sim \mathcal {S}, a \sim \pi }[\ln p_Q(s, a) - \ln p_V(s)] \end{aligned}$$
(18)

Here, the second term reduces to (14) by using the definitions of \(p_V\) and \(p_Q\) in (10) and (11), respectively.

$$\begin{aligned} & - \nabla \mathbb {E}_{s \sim \mathcal {S}, a \sim \pi }[\ln p_Q(s, a) - \ln p_V(s)] \nonumber \\= & - \nabla \mathbb {E}_{s \sim \mathcal {S}, a \sim \pi }[\beta Q(s, a) - \beta C - \beta V(s) + \beta C] \nonumber \\= & \mathbb {E}_{s \sim \mathcal {S}, a \sim \pi }[- \nabla \ln \pi (a \mid s) \beta \{Q(s, a) - V(s)\}] \nonumber \\\propto & \mathbb {E}_{(s, a, s^\prime , r) \sim \mathcal {B}} \left[ - \delta (s, s^\prime , r) \nabla \ln \pi (a \mid s) \right] = (14) \end{aligned}$$
(19)

That is, the second term can be absorbed into the original loss.

The added regularization therefore only plays the role of constraining \(\pi \rightarrow b\). As b should be located between \(\pi ^+\) and \(\pi ^-\), this regularization is expected to avoid the hard negative and stabilize \(\pi \rightarrow \pi ^+\). However, implementing this regularization directly requires retaining the behavior policy \(b(a \mid s)\) for each experience, which is costly. As an alternative, this paper implements the regularization as adversarial learning against (16).

That is, it exploits the fact that a higher misidentification rate of D implies \(\pi \simeq b\), reducing the risk of the hard negative (see Fig. 3a). Since d has a computational graph w.r.t. \(\pi \), the following loss function can be given as the counteraction trick for regularizing \(\pi \) toward b.

$$\begin{aligned} \mathcal {L}_{C} = \mathbb {E}_{(s, a, b) \sim \mathcal {B}}[\omega d(s, a, b) \{ \ln D(s) - \ln (1 - D(s)) \}] \end{aligned}$$
(20)

where \(\omega \ge 0\) denotes the gain, which is designed below. One can see that the term related to D also behaves as a gain and is larger when \(D \ll 1\) (i.e. no misidentification). In addition, if d is clipped to 0.5 (i.e. \(\pi \simeq b\) holds), the gradient w.r.t. \(\pi \) is zero. Note that, in the actual implementation, the gradient reversal layer [15] is useful for lumping (16) and (20) together.

As for \(\omega \), \(d \ll 1\) yields a small gradient due to the sigmoid function, unlike \(D \ll 1\). To compensate for this and converge to \(\pi \rightarrow b\) faster even in such a case, \(\omega \) is designed in the manner of PI control, following the literature [50].

$$\begin{aligned} P(s, a, b)= & \eta _C (1 - 2 \hat{d}(s, a, b)) \nonumber \\ I= & \max (0, 0.5I + \mathbb {E}_{\mathcal {B}}[P(s, a, b)]) \nonumber \\ \omega (s, a, b)= & \max (0, P(s, a, b) + I) \end{aligned}$$
(21)

where \(\hat{d}\) means d without the computational graph, and \(\eta _C \ge 0\) denotes the hyperparameter for this counteraction. \(I \ge 0\) is the integral term with saturation by multiplying 0.5 (its initial value must be zero).
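Putting (20) and (21) together, a simplified sketch of the counteraction might look as follows. For brevity, the gradient reversal layer is omitted and the discriminator output is detached so that only \(\pi \) receives the gradient; the module names and default gain are assumptions for illustration.

```python
# Hedged sketch of the counteraction loss (20) with the PI-controlled gain (21).
import torch

class Counteraction:
    def __init__(self, eta_c=0.5):
        self.eta_c = eta_c
        self.I = 0.0          # integral term, initialized to zero

    def loss(self, policy, discriminator, s, a, log_b):
        log_pi = policy(s).log_prob(a).sum(-1)
        d = torch.clamp(torch.sigmoid(log_pi - log_b), max=0.5)  # keeps graph w.r.t. pi
        d_hat = d.detach()                                       # d without the graph

        # PI-controlled gain omega as in (21)
        P = self.eta_c * (1.0 - 2.0 * d_hat)
        self.I = max(0.0, 0.5 * self.I + P.mean().item())
        omega = torch.clamp(P + self.I, min=0.0)

        D = discriminator(s).detach().clamp(1e-6, 1 - 1e-6)      # only pi is regularized here
        return (omega * d * (torch.log(D) - torch.log(1 - D))).mean()   # (20)
```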

4.3 Mining of indistinguishable experiences

The second trick, the mining, is proposed mainly to mitigate the effect of distribution shift. In previous studies, semi-hard triplets, for which \(d(x, x^+) + m < d(x, x^-)\) holds, are considered useful [45]. Since the optimization problem in this study sets \(m=0\), \(\textrm{KL}(\pi \mid \pi ^+) < \textrm{KL}(\pi \mid \pi ^-)\) seems preferable. However, such triplets can also be regarded as hard ones, which would induce selection bias. Of course, \(\textrm{KL}(\pi \mid \pi ^+) > \textrm{KL}(\pi \mid \pi ^-)\) is still not suitable for learning because of the hard negative relationship described above. Hence, triplets with \(\textrm{KL}(\pi \mid \pi ^+) \simeq \textrm{KL}(\pi \mid \pi ^-)\) should be mined.

Specifically, it exploits the fact that, in this study, the anchor data is determined by the current policy \(\pi \), while the positive and negative data, \(\pi ^+\) and \(\pi ^-\), are located around the behavior policy b. In other words, indistinguishable empirical data with \(\pi \simeq b\) is likely to achieve the desired relationship (see Fig. 3b). The mining trick is therefore given as the following stochastic dropout [49].

$$\begin{aligned} p(M \mid s, a, b)= & 2\max (0, 0.5 - \min (d(s, a, b), D(s))^{\eta _M}) \nonumber \\ M= & {\left\{ \begin{array}{ll} 1 & p(M \mid s, a, b) \le \epsilon \sim \mathcal {U}(0, 1) \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(22)

where \(\mathcal {U}(l, u)\) is the uniform distribution within [l, u], and \(\eta _M \ge 0\) denotes the hyperparameter for this mining. When sampling the empirical data from the replay buffer \(\mathcal {B}\), each data point is screened by the mining: if \(M=1\), it is used for the optimization; if \(M=0\), it is excluded.
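A hedged sketch of the mining mask (22) is shown below; policy and discriminator are assumed modules, log_b is the stored \(\ln b\), and the default \(\eta _M\) is only a placeholder.

```python
# Hedged sketch of the mining mask (22): data judged as pi != b are
# stochastically dropped before being replayed.
import torch

def mining_mask(discriminator, policy, s, a, log_b, eta_m=2.0):
    with torch.no_grad():
        log_pi = policy(s).log_prob(a).sum(-1)
        d = torch.clamp(torch.sigmoid(log_pi - log_b), max=0.5)
        score = torch.minimum(d, discriminator(s))                 # min(d, D)
        p_drop = 2.0 * torch.clamp(0.5 - score.pow(eta_m), min=0.0)
        eps = torch.rand_like(p_drop)                              # epsilon ~ U(0, 1)
    return p_drop <= eps        # M = 1 (keep) when the dropout probability is small
```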

5 Simulations

5.1 Overview

Here, it is verified that ERC holds with the two stabilization tricks. To do so, A2C [37], an on-policy algorithm generally considered inapplicable to ER, is employed as the baseline (see Appendix A for detailed settings). Learning is conducted only with the empirical data replayed from ER at the end of each episode (without online learning).
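To clarify this protocol, a hedged sketch of one training episode is given below: empirical data are only stored during the episode, and all updates are performed from replayed minibatches afterwards. The helper names (env, agent, buffer, etc.) are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the training protocol: collect online, learn only from ER
# at the end of each episode. All helper objects are hypothetical.
def train_one_episode(env, agent, buffer, batch_size=256, n_updates=32):
    s = env.reset()
    done = False
    while not done:                              # data collection only
        a, log_b = agent.act(s)                  # store ln b for the two tricks
        s_next, r, done = env.step(a)
        buffer.store(s, a, s_next, r, log_b)
        s = s_next

    for _ in range(n_updates):                   # learning from replayed data only
        batch = buffer.sample(batch_size)
        batch = agent.mine(batch)                # mining: drop data judged as pi != b
        agent.update(batch)                      # A2C losses plus the counteraction
```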

The following three steps are performed for step-by-step verification. Through them, it is shown that alleviating the instability factors hidden in the triplet loss is effective not only for satisfying ERC but also for learning the optimal policy with sample efficiency comparable to SAC [17].

1. The effective ranges of the hyperparameters of the respective stabilization tricks, \(\eta _{C,M} \ge 0\), are roughly examined in a toy problem, determining the values to be used in the subsequent verification.

2. It is confirmed that both of the two stabilization tricks are complementarily essential to satisfy ERC and to learn the optimal policy through ablation tests on three major tasks implemented in Mujoco [53].

3. Finally, the A2C with the proposed stabilization tricks is evaluated on two tasks of relatively large problem scale in dm_control [54] in comparison to SAC, which is the latest off-policy algorithm.

Note that training for each task and condition is conducted 12 times with different random seeds in order to evaluate the performance statistically. Through the above, it is demonstrated that the stabilization tricks enable ER to be applied to A2C with excellent learning performance. In addition, the application of the two stabilization tricks to another on-policy algorithm, i.e. PPO [47], and their quantitative contribution to the learning performance are investigated in Appendix B.

Fig. 4 Grid search of \((\eta _C, \eta _M)\) on DoublePendulum: \(\eta _{C,M} = \{0.1, 0.5, 1.0, 5.0, 10.0\}\) are searched roughly in terms of return from the environment, and then \((\eta _C, \eta _M)\) used later are decided to be (0.5, 2.0) referring to the grid search results

5.2 Effective range of hyperparameters

A toy problem named DoublePendulum, in which a pendulum agent with two passive joints tries to balance in a standing position by moving its base, is solved. This task has only a one-dimensional action space, and its state space is limited by a terminal condition (i.e. excessive tilting). Thus, as this is a relatively simple problem and many empirical data should naturally satisfy ERC, the effect of \(\eta _{C,M}\) should behave in a Gaussian manner, enabling statistical adjustment.

To roughly check the effective parameter ranges of \(\eta _{C,M}\), \(5 \times 5 = 25\) conditions with \(\eta _{C,M} = \{0.1, 0.5, 1.0, 5.0, 10.0\}\) are compared. The test results, which are evaluated after learning, are summarized in Fig. 4. The results suggest the following two points.

First, \(\eta _M\) must be of a certain scale to activate the selection of empirical data, or else the performance can be significantly degraded. However, an excessively large \(\eta _M\) would result in too little empirical data being replayed. Therefore, \(\eta _M\) is considered reasonable in the vicinity of 1.

Second, \(\eta _C\) seems to have an appropriate range depending on \(\eta _M\). In other words, if \(\eta _M\) is excessively large and the replayable empirical data is too limited, it is desirable to increase them by increasing \(\eta _C\). On the other hand, if the empirical data are moderately selected around \(\eta _M \simeq 1\), it seems essential to relax the counteraction to some extent as \(\eta _C \le 1\).

Fig. 5 Returns of ablation tests: The addition of the two stabilization tricks enabled the learning with an on-policy algorithm, A2C, combined with ER

Fig. 6 \(\omega (s, a, b)\) in the counteraction trick: While the counteraction trick alone saturated the regularization gain and made it difficult to achieve \(\pi \simeq b\), the two tricks together made the regularization work properly without saturation

Based on the above two trends, and as a natural setting, \((\eta _C, \eta _M)\) are decided to be (0.5, 2.0) for the remaining experiments. The learning performance at this time is also shown in Fig. 4 (the black bar), and is higher than the rough grid search results (i.e. \(\sim 416\) while the others did not exceed 400).

Fig. 7 \(p(M \mid s, a, b)\) in the mining trick: While the mining trick alone saturated the dropout rate and prevented enough empirical data from being replayed, the two tricks together made the selection work properly without saturation

Fig. 8 Returns on practical tasks: From (a) and (b), it can be found that SAC has faster learning speed, while ERC has more stable learning ability, resulting in (c) learning performance of ERC comparable to SAC

5.3 Ablation tests

Next, the need for the two stabilization tricks is demonstrated through ablation tests. Specifically, the presence or absence of each stabilization trick (labeled with its initial ‘C’ or ‘M’) is switched by setting \(\eta _{C,M} = 0\). The four conditions of this combination are compared on the following three tasks in Mujoco [53]: Reacher, Hopper, and HalfCheetah.

The learning curves for the returns are shown in Fig. 5. As can be seen from the results, Vanilla (i.e. the standard A2C) without the two stabilization tricks did not learn any task at all, probably because it does not satisfy ERC. Adding the counteraction trick alone did not improve learning either, while adding the mining trick alone improved it somewhat. Nevertheless, neither trick on its own seems to satisfy ERC or learn the optimal policy. In contrast, only the case using both stabilization tricks was able to learn all tasks. From these results, it can be concluded that both stabilization tricks are necessary (and sufficient) to satisfy ERC and learn the optimal policy.

Now let us see why both of the stabilization tricks are required. To this end, the internal parameters of the respective stabilization tricks, \(\omega (s, a, b)\) and \(p(M \mid s, a, b)\), are depicted in Figs. 6 and 7. One can find that the cases with only one of the stabilization tricks saturated the respective internal parameters immediately. That is, with the counteraction trick alone, its regularization toward \(\pi \rightarrow b\) is not enough to satisfy ERC; and with the mining trick alone, its selection is too strict to learn the optimal policy. In contrast, when both are used, the internal parameters converge to task-specific values without saturation. That is why the two stabilization tricks need to be employed together: they increase the amount of replayable empirical data to a level where the optimal policy can be learned while still satisfying ERC.

5.4 Performance comparison

Finally, the learning performance of A2C with the proposed stabilization tricks (labeled ERC) is compared to that of the state-of-the-art off-policy algorithm, SAC [17]. Two practical tasks with more than 10-dimensional action spaces, QuadrupedWalk (with \(|\mathcal {A}| = 12\)) and Swimmer15D (with \(|\mathcal {A}| = 14\)) in dm_control [54], are solved.

The learning curves and test results after learning are depicted in Fig. 8. As can be seen, while SAC solved Quadruped at a high level, it did not solve Swimmer consistently. On the other hand, ERC sometimes delayed convergence on Quadruped, but showed high performance on Swimmer. In addition, the trends of the learning curves suggested that SAC learns faster, while ERC learns more stably. Thus, it can be said that ERC has a learning performance comparable to that of SAC, although they have different strengths in different tasks.

As a remark, the difference in performance between ERC and SAC might be attributed to the difference in the policy gradients used. Indeed, ERC learns the policy with likelihood-ratio gradients, while SAC does so with reparameterization gradients [41]. The former obtains the gradient over the entire (multi-dimensional) policy, which cannot optimize each component of the action independently; as a result, it is suited to tasks that require synchronization of actions. The latter, on the other hand, obtains the gradient for each component of the action, and can thus quickly learn tasks that do not require precise synchronization. These learning characteristics correspond to the two types of tasks solved in this study. Hence, it is suggested that SAC performed well on Quadruped, where synchronization within each leg is sufficient, and ERC performed well on Swimmer, where synchronization across all joints is needed. Note that policy improvements mixing the two gradients have been proposed and would be expected to be useful [16]. The proposed stabilization tricks allow for policy improvements combined with ER regardless of which gradient is used, which would facilitate the development of such hybrid algorithms and improve learning performance in a complementary manner.

6 Discussions

6.1 Relations to conventional algorithms

As demonstrated in the above section, the ERC hypothesized in this study can be achieved by limiting the replayed empirical data to those acceptable to the applied RL algorithm. At the same time, the improvement in learning efficiency due to ER can be increased by expanding the acceptable set of empirical data. These were achieved by the two proposed stabilization tricks, with \(\pi \simeq b\) being the key to both. This key corresponds to the concept of on-policyness, where the empirical data contained in the replay buffer can be regarded as generated from the current policy \(\pi \) [13]. In other words, the stabilization tricks aim to stabilize the triplet loss and consequently increase on-policyness. Conversely, the usefulness of increasing on-policyness in previous studies [38, 48] can be explained in terms of mitigating the instability factors hidden in the triplet loss.

Actually, ER can be applied by making A2C off-policy [11]. To this end, the most important trick is to correct the sampler by importance sampling, as described above. This involves weighting the original loss function by \(\pi /b\), which is unstable when \(b \simeq 0\) (i.e. when rare actions are sampled). Several heuristics have been proposed to resolve this instability [12, 58, 64]. Specifically, the most famous off-policy actor-critic algorithm with ER, ACER [58], modifies the update rules of not only the policy function but also the value function, whereas the proposed tricks indicate no need to modify the value function. Although Geoff-PAC [64] and P3O [12] only modify the policy improvements, the former is tedious and lacks the intuitive explanatory capability of ERC, while the latter is inappropriate for confirming ERC because it uses freshly collected empirical data for learning separately from those sampled from ER.

Thus, previous studies have only focused on off-policyness and on improving the learning performance of the algorithms, without discussing why ER becomes applicable. This study, on the other hand, explored ERC and implicitly obtained a utility similar to making the target algorithm off-policy. Indeed, the proposed stabilization tricks selectively replay the empirical data with \(\pi \simeq b\) while increasing the number of acceptable data. As a result, the weight in the importance sampling can be ignored as \(\pi /b \simeq 1\). That is why A2C with the proposed stabilization tricks, although not explicitly off-policy, made ER available as if it were off-policy.

As a remark, the report that PPO (and its variants), which is regarded as on-policy [47], could utilize ER empirically [28] is related to the above considerations and the proposed stabilization tricks. That is, PPO applies a regularization aiming at \(\pi \simeq b\), and if \(\pi \) deviates from b, it clips the weight of the importance sampling, effectively excluding that data. Thus, PPO can increase the on-policyness of empirical data and exclude unacceptable data from replay, so ER is actually applicable (see the empirical results in Appendix B).

Finally, another interpretation is given of why DDPG [33], SAC [17], and TD3 [14], which are considered off-policy, can stably learn the optimal policy using ER. These algorithms optimize the current policy \(\pi \) by using reparameterization gradients through the actions generated from \(\pi \) and fed into the action value function Q. As mentioned in the SAC paper, this can be derived from the minimization of the KL divergence between \(\pi \) and the optimal policy based on Q only. In other words, unlike algorithms that use likelihood-ratio gradients such as A2C, they do not deal with the triplet loss. Therefore, it can be considered that ER is applicable because there are no instability factors induced by the triplet loss in the first place.

6.2 Limitations of stabilization tricks

For ERC to hold, this paper proposed the two stabilization tricks, whose contributions were evaluated in the above section. However, they naturally leave room for improvement, probably because they have been designed heuristically. First, using them requires a discriminator D learned through (16), which incurs an extra computational cost. While it is possible to implement the two stabilization tricks using only d in (15) without D, d is an unstable variable, so a lightweight stabilizer would be needed to replace D.

It is also important to note that the decisions based on D and d rely on the likelihoods \(\pi \) and b. That is, when the policy deals with a high-dimensional action space, even a small deviation in action may be severely judged as \(\pi \ne b\). In fact, when the RL benchmark with a musculoskeletal model, i.e. myosuite [6], was tested with \(|\mathcal {A}|=39\), \(p(M \mid s, a, b)\) often converged to one depending on the conditions. This means that the empirical data were rarely replayed. To avoid this excessive exclusion of empirical data, it would be useful either to determine the final exclusion of empirical data by aggregating separate judgements on each action dimension, or to optimize the policy by masking only the inappropriate action dimensions.

In addition, the counteraction alone did not satisfy ERC, as indicated in the ablation tests. This is probably due to saturation of the regularization gain \(\omega \) (see Fig. 6). Although a non-saturated gain made learning unstable empirically, it would be better to relax the saturation to some extent. Alternatively, an auto-tuning trick, which becomes possible by interpreting the regularization as the corresponding constraint [18, 29], is considered to be useful.

7 Conclusion

This study reconsidered the factors that determine whether ER is applicable to RL algorithms. To this end, the instability factors that ER might induce, especially in on-policy algorithms, were first revealed. In other words, it was found that the policy gradient algorithms can be regarded as minimization of the triplet loss in metric learning, inheriting its instability factors. To alleviate them, the two stabilization tricks, the counteraction and mining, were proposed as countermeasures that can be attached to arbitrary RL algorithms. The counteraction and mining are responsible for i) expanding the set of acceptable empirical data for each RL algorithm and ii) excluding empirical data outside that set, respectively. Through multiple simulations, ERC indeed held when these two stabilization tricks were used. Furthermore, the standard on-policy algorithm equipped with them achieved learning performance comparable to the state-of-the-art off-policy algorithm.

As described in the discussion, the two stabilization tricks proposed to satisfy ERC leave some room for improvement. In particular, since the hypotheses formulated in this study are deeply related to the on-policyness of the empirical data in the replay buffer, they would be improved based on this perspective. Afterwards, other RL algorithms will be integrated with the stabilization tricks for further investigation of ERC. On the other hand, although the simplest ER method was tested for simplicity, since the proposed tricks allow for arbitrary ER methods, it would be interesting to consider more sophisticated methods [35, 59] and methods modified to suit the problem [9] or the algorithm [1]. In particular, by following the latter direction and developing an ER method suitable for on-policy RL algorithms, a significant performance improvement might be expected.