The Value of Autoresetting

Vectorised environments contain multiple sub-environments, each running its own episode. When one of these episodes ends, its sub-environment needs to be reset before a new episode can begin. We could wait until all the sub-environments have completed their episodes and then call env.reset() on them all, but this would be incredibly inefficient whenever episode lengths vary widely across sub-environments. The solution is to reset a sub-environment as soon as its episode finishes, referred to as autoresetting.
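The waste is easy to quantify with a toy example (the episode lengths below are hypothetical): with wait-for-all, every sub-environment idles until the longest episode in the batch finishes, while with autoresetting no step is wasted.

```python
# Toy illustration of the efficiency gap between wait-for-all and autoresetting.
episode_lengths = [3, 5, 9]  # hypothetical lengths for three sub-environments

useful_steps = sum(episode_lengths)                               # 17
wait_for_all_steps = max(episode_lengths) * len(episode_lengths)  # 27

utilisation = useful_steps / wait_for_all_steps
print(f"{utilisation:.0%} of steps are useful without autoresetting")  # 63%
```

The longer the tail of the episode-length distribution, the worse this ratio becomes.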

[Figure: timelines for three sub-environments (env 1, env 2, env 3), comparing "waiting until all episodes are finished" against "reset as soon as an episode is finished"]

The challenge is that there are two subtly different ways of implementing this autoresetting functionality, referred to as Next-Step and Same-Step in Gymnasium. The difference is not just an implementation detail: it directly affects the observations and bootstrap values in your rollout buffer, and therefore how GAE, VTrace, Retrace and other training algorithms must be computed.

How the Two Modes Differ

Both modes autoreset, but they answer a single question differently: when a sub-environment steps and its episode ends, should it immediately reset and return the new episode's first observation, or return the terminal observation and wait until the next timestep to reset? That timing difference determines where and when the terminal and reset observations are returned.

[Figure: Next-Step mode returns the terminal observation \(s_2\) as next_obs, with the reset observation \(s_0'\) arriving on the following step; Same-Step mode returns the reset observation \(s_0'\) immediately and places the terminal observation in info["final_obs"]]

Next-Step mode

The reset is deferred. When a sub-environment terminates, step() returns the terminal observation as next_obs, but the reset does not happen yet. On the next vector step() call, the sub-environment is reset and the action provided for it is ignored.

Importantly, this means that at episode boundaries within a rollout, the (observation, next_observation) pair is not a valid transition: the two observations are the final observation of one episode and the first observation of the next, so no meaningful error can be computed from them. Therefore, the step following each episode boundary must be masked out of the training loss.
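This masking can be sketched in plain NumPy (a toy example for a single sub-environment, where episode_over holds the per-step done flags): a step is invalid exactly when the previous step ended an episode.

```python
import numpy as np

# Per-step episode-over flags for one sub-environment (hypothetical rollout).
episode_over = np.array([0, 0, 1, 0, 0, 1, 0], dtype=bool)

# A step is valid unless the *previous* step ended an episode; the first step
# of the rollout is assumed valid here (in practice it depends on how the
# previous rollout ended).
valid = np.ones_like(episode_over)
valid[1:] = ~episode_over[:-1]

print(valid.astype(int))  # [1 1 1 0 1 1 0] -- steps 3 and 6 are masked out
```

Multiplying the per-step loss by this mask (and normalising by the number of valid steps) removes the stale transitions.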

Calling only one of step() or reset() per environment timestep also has throughput advantages in synchronous vectorised environments where reset() is a meaningfully expensive call, for example in video games like Atari and in physics simulators. In Same-Step mode, described below, every sub-environment calls step() and then the whole batch waits for any reset() calls to complete (which might be very few), so the remaining sub-environments idle and significant time is wasted. In Next-Step mode, each sub-environment performs exactly one of step() or reset() per timestep, so this idle time is much lower. As a result, some high-performance RL vectorisers prefer Next-Step for its compute efficiency.

Same-Step mode

The reset is immediate. When a sub-environment terminates, the reset happens within the same step() call: next_obs is the reset observation \(s_0'\), and the terminal observation is stored in info["final_obs"]. At the next step, the policy acts directly on \(s_0'\), so there are no stale observations and no masking is required.

The trade-off appears at truncated boundaries (time limit reached, not truly terminal): since next_obs is already the reset state, you cannot bootstrap \(V(s_\text{last})\) from it directly. You need to run info["final_obs"] through the value network to get the correct bootstrap target.
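A minimal sketch of that fix, assuming a hypothetical value_fn and an infos layout where info["final_obs"] holds the final observation for each env that hit a boundary (the exact layout of the vector infos dict varies, so treat the indexing here as illustrative):

```python
import numpy as np

def bootstrap_values(next_obs, truncations, infos, value_fn):
    """Return per-env bootstrap values, correcting truncated envs."""
    # V(s0') for freshly reset envs -- the wrong bootstrap target at truncation.
    next_values = value_fn(next_obs)
    if np.any(truncations):
        idx = np.where(truncations)[0]
        # Re-evaluate the value network on the stored final observations.
        final_obs = np.stack([infos["final_obs"][i] for i in idx])
        next_values[idx] = value_fn(final_obs)  # replace with V(s_last)
    return next_values
```

Terminated envs need no such correction, since their bootstrap is zero regardless of which observation next_obs holds.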

| Property | Next-Step Mode | Same-Step Mode |
| --- | --- | --- |
| Reset timing | Deferred to the next step() | Within the same step() |
| next_obs at termination | Terminal state \(s_\text{term}\) | Reset state \(s_0'\) |
| Terminal obs location | next_obs directly | info["final_obs"] |
| Stale obs at episode boundaries | Yes -- must mask | No |
| Truncation bootstrap | V(next_obs) directly | V(info["final_obs"]) required |
| Sync env throughput | Reset hidden in next batch | Resets block the batch |

How Autoreset Mode Affects GAE

Generalised Advantage Estimation (GAE), as used in PPO, operates on rollouts from vector environments, so it is a helpful example of how the autoreset mode impacts an implementation. Critically, at an episode boundary, two things must hold:

  1. The TD error \(\delta_t\) must use the correct bootstrap value (zero for termination, \(\gamma V(s_{t+1})\) for truncation).
  2. The GAE accumulation must reset: advantages from the new episode must not bleed back into the old one.

Both modes satisfy condition 2 the same way: multiply the incoming GAE by (1 - episode_over[t]) when accumulating backwards. The difference is in condition 1.
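Written out explicitly, the per-step TD error and backward GAE accumulation are

\[
\delta_t = r_t + \gamma \,(1 - \text{terminated}_t)\, V(s_{t+1}) - V(s_t),
\]
\[
\hat{A}_t = \delta_t + \gamma \lambda \,(1 - \text{episode\_over}_t)\, \hat{A}_{t+1},
\]

where \(\text{episode\_over}_t = \text{terminated}_t \lor \text{truncated}_t\). The \((1 - \text{terminated}_t)\) factor zeroes the bootstrap at termination (condition 1), while the \((1 - \text{episode\_over}_t)\) factor stops advantages from crossing the boundary (condition 2). The only mode-dependent part is where \(V(s_{t+1})\) comes from: the stored next observation in Next-Step mode, or info["final_obs"] at truncated steps in Same-Step mode.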

In Next-Step mode, the next_obs stored at a truncated step is the actual next state, captured before the reset, so you can bootstrap directly from values[t + 1]. For terminated steps, the bootstrap is zero. No special handling of info is needed. The catch: the step after any episode boundary carries a stale observation, so mask it out of the PPO loss.

import numpy as np

def compute_gae(
    observations,  # (rollout_length + 1, num_envs, *obs_shape)
    rewards,       # (rollout_length, num_envs)
    terminations,  # (rollout_length, num_envs)
    truncations,   # (rollout_length, num_envs)
    initial_episode_over,  # (num_envs,) -- did the previous rollout end on a boundary?
    gamma=0.99, gae_lambda=0.95,
):
    rollout_length, num_envs = rewards.shape

    values = value_fn(observations)  # assumed value network, -> (rollout_length + 1, num_envs)

    advantages = np.zeros((rollout_length, num_envs))
    gae = np.zeros(num_envs)
    episode_over = np.logical_or(terminations, truncations)

    for t in reversed(range(rollout_length)):
        # Zero bootstrap at termination; V(s_{t+1}) otherwise, including truncation.
        next_val = np.logical_not(terminations[t]) * values[t + 1]
        delta = rewards[t] + gamma * next_val - values[t]
        advantages[t] = gae = delta + gamma * gae_lambda * (1 - episode_over[t]) * gae

    # Mask out steps whose observation is stale: the first step of this rollout
    # if the previous rollout ended on a boundary, plus every step that
    # immediately follows a boundary within this rollout.
    mask = np.logical_not(
        np.concatenate([initial_episode_over[None, :], episode_over[:-1]], axis=0)
    )
    return advantages, mask

Summary

| Property | Next-Step | Same-Step |
| --- | --- | --- |
| Wasted steps per boundary | 1 (must mask) | 0 |
| Truncation bootstrap | V(next_obs) — correct | V(info["final_obs"]) — needs extra value call |
| info["final_obs"] required | No | For truncated steps only |
| All rollout steps usable | No | Yes |

If you have the option, which mode should you use?

It depends on where your bottleneck is. Same-Step is the simpler choice to implement: every step is valid and no masking is required, though you need to compute values for both the observations and next_observations buffers separately.
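For comparison with the Next-Step version above, a Same-Step GAE sketch might look like the following, assuming hypothetical per-step buffers in which next_values was computed from next_observations with truncated entries already corrected to use info["final_obs"]:

```python
import numpy as np

def compute_gae_same_step(values, next_values, rewards,
                          terminations, truncations,
                          gamma=0.99, gae_lambda=0.95):
    """Same-Step GAE: no masking, but values and next_values are separate buffers."""
    rollout_length, num_envs = rewards.shape
    episode_over = np.logical_or(terminations, truncations)
    advantages = np.zeros((rollout_length, num_envs))
    gae = np.zeros(num_envs)
    for t in reversed(range(rollout_length)):
        # Bootstrap is zero at termination; V(final_obs) at truncation
        # (already baked into next_values[t] by assumption).
        delta = rewards[t] + gamma * (1 - terminations[t]) * next_values[t] - values[t]
        gae = delta + gamma * gae_lambda * (1 - episode_over[t]) * gae
        advantages[t] = gae
    return advantages  # every step is usable; no mask returned
```

Note the structural difference from the Next-Step version: values and next_values are two independent (rollout_length, num_envs) buffers rather than slices of one (rollout_length + 1, num_envs) array.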

Next-Step is favoured by high-throughput libraries like EnvPool because it hides reset latency for environments where reset() is expensive (Atari, physics simulators). For environments with cheap resets -- random array initialisation, most gridworlds -- the throughput difference is negligible and the masking overhead of Next-Step is pure cost. The trade-off is one stale step per episode boundary that must be masked from the loss.

References

Towers, M., et al. (2024). Gymnasium: A Standard Interface for Reinforcement Learning Environments. github.com/Farama-Foundation/Gymnasium

Towers, M. (2025). Deep Dive: Autoreset Modes in Gymnasium v1.1 Vector Environments. farama.org