How does Retrace fix Off-Policyness?
Why Off-Policy Correction Matters
Consider a typical deep RL training loop. Your agent collects a batch of experience using its current policy, stores it in a replay buffer, then updates the policy with gradient descent. By the time you sample that experience, the policy may have changed significantly: the data was collected by an older version of the policy, but you're trying to improve the current one.
What problem does this cause? Without correction, our updates reflect how the policy used to act, not how the current policy would behave now. We therefore want a correction mechanism for when the data-collecting policy (called the behavior policy, μ) and the policy being improved (the target policy, π) disagree, so the target policy isn't unfairly penalized for decisions it would no longer take. This mismatch shows up everywhere: experience replay in DQN, distributed training where actors lag behind the learner (as in IMPALA), and learning from demonstrations or historical data.
Without correction, our value estimates inherit this mismatch: they become biased towards what the old policy did, not what the current policy would do. In the worst case, this bias compounds over updates and learning can diverge entirely.
The Core Problem
Data in the replay buffer was generated by a past version of the policy. Using it directly to update the current policy gives biased value estimates because the distribution of actions (and therefore trajectories) no longer matches.
Importance Sampling: The Basic Fix
The standard tool for correcting this distribution mismatch is importance sampling. The idea is simple: if the target policy is twice as likely to take some action compared to the behavior policy, we should count that transition twice as much. If it's half as likely, we should halve its contribution.
We capture this with the importance sampling ratio, which compares the probability each policy assigns to the action that was actually taken in state \(s_t\):

\[
\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}
\]
Here \(\pi(a_t | s_t)\) is the probability the target policy assigns to action \(a_t\) in state \(s_t\), and \(\mu(a_t | s_t)\) is the probability the behavior policy assigned to it. The ratio tells us how to reweight each transition:
- ρ > 1: The target policy favors this action more than the behavior policy did, so we upweight this transition
- ρ < 1: The target policy is less likely to take this action, so we downweight it
- ρ = 1: Both policies agree on this action, so no correction is needed
With this ratio in hand, we can correct off-policy data. But how we apply it makes a big difference, and that's where things get interesting.
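As a concrete sketch (the probabilities here are made up for illustration), computing the ratio over a batch of transitions is just an elementwise division:

```python
import numpy as np

# Made-up probabilities for three sampled transitions
target_probs = np.array([0.6, 0.2, 0.5])    # pi(a_t | s_t)
behavior_probs = np.array([0.3, 0.4, 0.5])  # mu(a_t | s_t)

# rho > 1 upweights, rho < 1 downweights, rho = 1 leaves unchanged
rhos = target_probs / behavior_probs        # [2.0, 0.5, 1.0]
```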
N-Step IS: The Natural Attempt
One-step TD only looks one step ahead: \(G_t = r_t + \gamma V(s_{t+1})\). If our value function \(V\) is inaccurate, this estimate inherits that error directly. A natural fix is to use multiple steps of real rewards before bootstrapping, giving us the n-step return:

\[
G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})
\]
More real rewards means less reliance on \(V\), which reduces bias. But there is a catch: those n steps were generated by the behavior policy \(\mu\), not the target policy \(\pi\). To correct for this, we weight the return by the product of importance sampling ratios across all n steps:

\[
G_t^{\text{IS}} = \left(\prod_{k=0}^{n-1} \rho_{t+k}\right)\left(r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\right)
\]
This product is unbiased: it correctly reweights the trajectory from \(\mu\)'s distribution to \(\pi\)'s. But products of random variables have a nasty property: their variance grows multiplicatively.
The Variance Explosion
Each individual ratio \(\rho_t\) might be modest on its own. But when you multiply them together, the product can grow or shrink exponentially with the number of steps:
- Exploding products: If the target policy consistently favors different actions from the behavior policy (each \(\rho \approx 2\)), after 10 steps the correction is \(2^{10} = 1024\). A single trajectory dominates the entire gradient update, injecting massive noise.
- Vanishing products: Conversely, if the target policy is less likely to take the actions that were sampled (each \(\rho \approx 0.5\)), the product becomes \(0.5^{10} \approx 0.001\). The trajectory is effectively ignored, wasting the data entirely.
In both cases, learning suffers. Exploding products cause wild, unstable updates. Vanishing products mean we learn nothing from most of our data. The more steps we use, the worse this gets.
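The exponential growth is easy to see numerically (the per-step ratios of 2 and 0.5 below are the toy values from the bullets above):

```python
import numpy as np

# A persistent mismatch of rho ~ 2 (or ~ 0.5) per step makes the
# n-step IS correction explode (or vanish) exponentially.
steps = np.arange(1, 11)
exploding = 2.0 ** steps   # each rho ~ 2
vanishing = 0.5 ** steps   # each rho ~ 0.5

print(exploding[-1])  # 1024.0
print(vanishing[-1])  # 0.0009765625
```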
The Bias-Variance Dilemma
This leaves us stuck. Using fewer steps keeps the IS product small (low variance) but forces us to trust a possibly inaccurate value function (high bias). Using more steps reduces that reliance on \(V\) (low bias) but makes the IS product explode or vanish (high variance). We need a way to use multi-step returns without the correction weights blowing up.
Retrace(λ): Safe Off-Policy Learning
Retrace breaks out of this dilemma with one key modification: instead of multiplying raw importance ratios, it truncates each ratio before multiplying. It defines a trace coefficient:

\[
c_t = \lambda \min(1, \rho_t)
\]
The \(\min(1, \rho_t)\) caps each ratio at 1, and the λ parameter (between 0 and 1) provides additional decay. Since every \(c_t\) is at most λ, the product of trace coefficients can only shrink over time. No matter how different the two policies are, the correction weights stay bounded.
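This boundedness is easy to check empirically. The sketch below draws hypothetical importance ratios from a lognormal distribution (an assumption for illustration, not part of Retrace) and compares the raw and truncated products:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ratios with heavy spread: some well above 1, some near 0
rhos = rng.lognormal(mean=0.0, sigma=1.0, size=10)

lambda_ = 0.95
raw_weight = np.prod(rhos)                               # n-step IS product
trace_weight = np.prod(lambda_ * np.minimum(1.0, rhos))  # Retrace product

# Every factor is at most lambda_, so the Retrace product can never
# exceed lambda_**10, no matter how extreme the raw ratios are.
assert trace_weight <= lambda_ ** len(rhos)
```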
Retrace uses these trace coefficients to weight TD errors across multiple steps. The target for each timestep is:

\[
Q^{\text{ret}}(s_t, a_t) = Q(s_t, a_t) + \sum_{s \ge t} \gamma^{s-t} \left(\prod_{i=t+1}^{s} c_i\right) \delta_s
\]
where \(\delta_t = r_t + \gamma V(s_{t+1}) - Q(s_t, a_t)\) is the TD error at step t. In practice, the target \(Q^{\text{ret}}\) is computed efficiently with a backward pass using the recursive form:

\[
Q^{\text{ret}}(s_t, a_t) = Q(s_t, a_t) + \delta_t + \gamma\, c_{t+1} \left(Q^{\text{ret}}(s_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1})\right)
\]
Bias and Variance
These two properties govern the quality of any value estimate. Bias is systematic error: if our value function \(V\) is inaccurate, any estimate that relies on it inherits that inaccuracy. Variance is noise across different samples: even if an estimate is correct on average, individual samples might swing wildly, making each gradient update unreliable.
Ideally we want both low bias and low variance. In practice, there is a trade-off. Using more real rewards reduces bias (less reliance on \(V\)) but increases variance (more randomness from the environment and policy). Off-policy IS corrections make this worse by multiplying ratios together.
| Method | Bias | Variance | Convergence |
|---|---|---|---|
| N-Step IS | Low | Explosive | Can diverge |
| Retrace(λ) | Low | Bounded | Guaranteed |
| V-trace | Low* | Bounded | Guaranteed |
*With finite \(\bar{\rho}\), V-trace converges to the value function of a policy between \(\mu\) and \(\pi\), introducing slight bias relative to the true target policy.
Implementation
Here's a practical implementation of Retrace in Python:
```python
import numpy as np

def compute_retrace(rewards, q_values, values,
                    target_probs, behavior_probs,
                    gamma, lambda_):
    """
    Compute Retrace targets for off-policy learning.

    Args:
        rewards: [T] rewards r_0 to r_{T-1}
        q_values: [T+1] Q(s_t, a_t) values
        values: [T+1] V(s_t) = E[Q(s_t, a)] values
        target_probs: [T] π(a_t|s_t) probabilities
        behavior_probs: [T] μ(a_t|s_t) probabilities
        gamma: discount factor
        lambda_: trace decay parameter
    """
    T = len(rewards)
    # Importance sampling ratios
    rhos = target_probs / behavior_probs
    # Truncate ratios (the key Retrace insight)
    cs = lambda_ * np.minimum(1.0, rhos)
    # TD errors: delta_t = r_t + gamma * V(s_{t+1}) - Q(s_t, a_t)
    deltas = rewards + gamma * values[1:] - q_values[:-1]
    # Backward pass over the recursion
    #   Delta_t = delta_t + gamma * c_{t+1} * Delta_{t+1}
    # The trace coefficient of step t+1 gates how much of the future
    # correction flows back past time t.
    targets = np.zeros(T)
    correction = 0.0  # holds c_{t+1} * Delta_{t+1}
    for t in reversed(range(T)):
        delta_sum = deltas[t] + gamma * correction
        targets[t] = q_values[t] + delta_sum
        correction = cs[t] * delta_sum
    return targets
```
Why Truncation Works
The variance problem in n-step IS comes from upweighting. When the target policy strongly prefers an action (ρ >> 1), n-step IS massively amplifies that transition. But we only observed it once. Amplifying a single observation doesn't give us more information, it just adds noise.
Retrace's truncation (\(\min(1, \rho)\)) takes an asymmetric approach to fix this:
- When ρ < 1 (target policy dislikes this action): the ratio passes through unchanged, correctly reducing this transition's influence.
- When ρ > 1 (target policy prefers this action): the ratio is capped at 1, leaving the transition at its natural observed weight rather than inflating it.
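A two-line illustration of this asymmetry (the ratio values are hypothetical):

```python
import numpy as np

rhos = np.array([0.3, 2.5])        # disliked action, preferred action
truncated = np.minimum(1.0, rhos)  # [0.3, 1.0]: downweighting passes
                                   # through, upweighting is capped at 1
```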
The Key Insight
Retrace says: trust the data you have. If the behavior policy took an action the target policy would avoid, downweight it. But if the behavior policy took an action the target policy likes, just use it as-is. Never amplify beyond what was actually observed.
Because every trace coefficient satisfies \(c_t \leq \lambda \leq 1\), the product of coefficients across multiple steps can only shrink. This is what bounds the variance: no matter how different π and μ are, the correction weights stay controlled.
Special Cases
Retrace smoothly interpolates between familiar algorithms depending on the setting:
- On-policy (π = μ): All ρ = 1, so c = λ, and Retrace reduces to TD(λ)
- λ = 0: All traces are zeroed out, giving one-step TD (maximum bias, minimum variance)
- λ = 1, π = μ: Equivalent to Monte Carlo returns (minimum bias, maximum variance)
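The last special case can be checked numerically. Below is a compact, self-contained version of the backward recursion with toy numbers of my own choosing; setting \(\pi = \mu\), \(\lambda = 1\), and a deterministic policy (so \(V(s_t) = Q(s_t, a_t)\) along the trajectory, making the equivalence exact rather than in expectation) collapses the Retrace target to the Monte Carlo return:

```python
import numpy as np

def retrace_targets(rewards, q_values, values, rhos, gamma, lambda_):
    # Backward recursion: Delta_t = delta_t + gamma * c_{t+1} * Delta_{t+1}
    T = len(rewards)
    cs = lambda_ * np.minimum(1.0, rhos)
    deltas = rewards + gamma * values[1:] - q_values[:-1]
    targets = np.zeros(T)
    correction = 0.0
    for t in reversed(range(T)):
        targets[t] = q_values[t] + deltas[t] + gamma * correction
        correction = cs[t] * (targets[t] - q_values[t])
    return targets

# On-policy (rho = 1), lambda = 1, deterministic policy (values == q),
# and zero bootstrap value at the horizon.
rewards = np.array([1.0, 2.0, 3.0])
q = np.array([0.5, 0.7, 0.2, 0.0])
targets = retrace_targets(rewards, q, q.copy(), np.ones(3), 0.9, 1.0)

mc = np.array([1.0 + 0.9 * 2.0 + 0.81 * 3.0,   # 5.23
               2.0 + 0.9 * 3.0,                # 4.7
               3.0])
assert np.allclose(targets, mc)
```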
V-trace: Scaling to Distributed Systems
V-trace (Espeholt et al., 2018) was developed for IMPALA, a massively distributed RL architecture where dozens of actors collect experience in parallel while a central learner updates the policy. In this setting, the behavior policy can be many updates behind the learner, making off-policy correction essential.
V-trace is closely related to Retrace but introduces a key structural change: it uses two separate clipping constants instead of one. The V-trace target is:

\[
v_t = V(s_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \left(\prod_{i=t}^{k-1} \bar{c}_i\right) \bar{\rho}_k\, \delta_k
\]

where \(\delta_k = r_k + \gamma V(s_{k+1}) - V(s_k)\) is the TD error of the value function, and
where the two clipping levels serve distinct roles:
- \(\bar{\rho}_t = \min(\bar{\rho}, \rho_t)\) clips the importance weight applied to each TD error. This controls what value function V-trace converges to. With \(\bar{\rho} = \infty\) it converges to \(V^\pi\); with finite \(\bar{\rho}\) it converges to the value function of a policy between \(\mu\) and \(\pi\).
- \(\bar{c}_t = \min(\bar{c}, \rho_t)\) clips the trace coefficient that determines how far corrections propagate backward. This is analogous to Retrace's \(c_t\) and controls the speed of convergence.
As with Retrace, this can be computed efficiently with a backward pass. Defining \(\tilde{\delta}_t = \bar{\rho}_t \delta_t\) as the clipped TD error, the recursive form is:

\[
v_t = V(s_t) + \tilde{\delta}_t + \gamma\, \bar{c}_t \left(v_{t+1} - V(s_{t+1})\right)
\]
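A minimal sketch of that backward pass (an illustration of the recursion under the definitions above, not the IMPALA codebase; the function name and toy inputs are mine):

```python
import numpy as np

def vtrace_targets(rewards, values, rhos, gamma, rho_bar=1.0, c_bar=1.0):
    """V-trace targets via the backward recursion
    v_t = V(s_t) + rho_bar_t * delta_t + gamma * c_bar_t * (v_{t+1} - V(s_{t+1}))."""
    T = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)  # controls the fixed point
    clipped_cs = np.minimum(c_bar, rhos)      # controls the trace length
    deltas = clipped_rhos * (rewards + gamma * values[1:] - values[:-1])
    vs = np.zeros(T)
    next_diff = 0.0  # v_{t+1} - V(s_{t+1}); zero beyond the horizon
    for t in reversed(range(T)):
        vs[t] = values[t] + deltas[t] + gamma * clipped_cs[t] * next_diff
        next_diff = vs[t] - values[t]
    return vs

# On-policy sanity check (rho = 1 everywhere, V = 0): the target
# reduces to the discounted sum of the observed rewards.
vs = vtrace_targets(np.array([1.0, 1.0]), np.zeros(3), np.ones(2), gamma=0.9)
```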
Relationship to Retrace
When \(\bar{\rho} = \bar{c} = 1\) and \(\lambda = 1\), V-trace and Retrace(1) produce the same targets. The key difference is that V-trace separates the two roles that clipping plays:
- Retrace uses a single clip: \(c_t = \lambda \min(1, \rho_t)\) appears both as the trace coefficient and implicitly as the TD error weight.
- V-trace decouples these: \(\bar{\rho}\) controls the fixed point while \(\bar{c}\) controls the trace length, and \(\bar{c} \leq \bar{\rho}\) is required.
This decoupling gives V-trace an extra knob. In highly distributed settings where the behavior policy can be very stale, using a moderate \(\bar{\rho}\) sacrifices convergence to the exact target policy in exchange for greater stability. In practice, the IMPALA paper found that \(\bar{\rho} = 1\) and \(\bar{c} = 1\) (equivalent to Retrace) works well, but having the option to tune \(\bar{\rho}\) independently is valuable when policies diverge significantly.
When to use which?
Use Retrace when you need guaranteed convergence to \(Q^\pi\) and the behavior policy is reasonably close. Use V-trace in distributed systems where actors may be many updates behind the learner and you need the extra stability from separate \(\bar{\rho}\) clipping.
References
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and Efficient Off-Policy Reinforcement Learning. NeurIPS 2016.
Espeholt, L., Soyer, H., Munos, R., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML 2018.
Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility Traces for Off-Policy Policy Evaluation. ICML 2000.