Twin Delayed Deep Deterministic Policy Gradient (TD3)
- class rlforge.agents.policy_gradient.td3.TD3Agent(state_dim, action_dim, policy_net_architecture=(256, 256), q_net_architecture=(256, 256), actor_lr=0.0001, critic_lr=0.001, discount=0.99, tau=0.005, update_frequency=1, buffer_size=1000000, mini_batch_size=256, update_start_size=256, action_low=None, action_high=None, noise_std=0.1, policy_delay=2, target_noise_std=0.2, target_noise_clip=0.5, device=None)
Twin Delayed Deep Deterministic Policy Gradient (TD3) Agent for continuous action spaces.
TD3 enhances the Deep Deterministic Policy Gradient (DDPG) algorithm with three core mechanisms:
- Twin Critics: Two Q-networks to reduce overestimation bias.
- Delayed Policy Updates: The actor (policy) is updated less frequently than the critics.
- Target Policy Smoothing: Adds clipped noise to target actions for more stable training.
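Twin critics and target policy smoothing both act on the regression target used to train the critics. The following is a minimal NumPy sketch of that target, assuming the conventional TD3 formulation. The function name `td3_target` and the toy callables are illustrative and are not part of rlforge; scalar action bounds of -1 and 1 are also an assumption.

```python
import numpy as np

def td3_target(rewards, dones, next_states, target_policy, target_q1, target_q2,
               discount=0.99, target_noise_std=0.2, target_noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=None):
    """Compute critic targets for a batch (illustrative helper, not rlforge API)."""
    rng = np.random.default_rng() if rng is None else rng
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    next_actions = target_policy(next_states)
    noise = rng.normal(0.0, target_noise_std, size=next_actions.shape)
    noise = np.clip(noise, -target_noise_clip, target_noise_clip)
    next_actions = np.clip(next_actions + noise, action_low, action_high)
    # Twin critics: use the element-wise minimum of the two target estimates.
    q_next = np.minimum(target_q1(next_states, next_actions),
                        target_q2(next_states, next_actions))
    # Terminal transitions (dones == 1) do not bootstrap from the next state.
    return rewards + discount * (1.0 - dones) * q_next

# Toy stand-ins for the target networks (assumptions for the example only).
policy = lambda s: np.tanh(s.sum(axis=1, keepdims=True))
q1 = lambda s, a: s.sum(axis=1) + a[:, 0]
q2 = lambda s, a: s.sum(axis=1) - a[:, 0]

states = np.array([[0.5, 0.5], [0.0, 0.0]])
targets = td3_target(np.array([1.0, 2.0]), np.array([0.0, 1.0]),
                     states, policy, q1, q2, rng=np.random.default_rng(0))
```

Taking the minimum of the two critics biases the target downward, which is what counteracts the overestimation mentioned above.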
Parameters
- state_dim : int
Dimension of the input state space.
- action_dim : int
Dimension of the continuous action space.
- policy_net_architecture : tuple of int, optional
Hidden layer sizes for the actor/policy network (default=(256, 256)).
- q_net_architecture : tuple of int, optional
Hidden layer sizes for the critic/Q-networks (default=(256, 256)).
- actor_lr : float, optional
Learning rate for the actor network (default=1e-4).
- critic_lr : float, optional
Learning rate for the critic networks (default=1e-3).
- discount : float, optional
Discount factor γ applied to future rewards (default=0.99).
- tau : float, optional
Polyak averaging factor for soft target network updates (default=0.005).
- update_frequency : int, optional
Frequency (in steps) of training updates (default=1).
- buffer_size : int, optional
Maximum size of the replay buffer (default=1,000,000).
- mini_batch_size : int, optional
Size of mini-batches sampled from the replay buffer (default=256).
- update_start_size : int, optional
Minimum number of transitions before updates begin (default=256).
- action_low : float or np.ndarray, optional
Lower bound(s) for continuous actions.
- action_high : float or np.ndarray, optional
Upper bound(s) for continuous actions.
- noise_std : float, optional
Standard deviation of Gaussian exploration noise added to actions (default=0.1).
- policy_delay : int, optional
Number of critic updates per update of the policy and target networks (default=2).
- target_noise_std : float, optional
Standard deviation of noise added to target actions during critic updates (default=0.2).
- target_noise_clip : float, optional
Clipping value for target action noise (default=0.5).
- device : str or torch.device, optional
Device to run computations on ("cpu" or "cuda"). Defaults to CUDA if available.
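As an illustration of how `tau`, `noise_std`, `action_low`, and `action_high` are conventionally used, here is a minimal pure-Python sketch. It assumes the common convention in which `tau` weights the online parameters; the helper names `soft_update` and `explore` are hypothetical, and scalar bounds are assumed for brevity.

```python
import random

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

def explore(action, noise_std=0.1, action_low=-1.0, action_high=1.0, rng=random):
    """Add Gaussian noise to each action component, then clip to the bounds."""
    noisy = [a + rng.gauss(0.0, noise_std) for a in action]
    return [min(max(a, action_low), action_high) for a in noisy]

# With tau=0.005 the target moves only 0.5% of the way toward the online weights.
new_target = soft_update([0.0, 1.0], [1.0, 1.0])
noisy_action = explore([0.0, 0.95], rng=random.Random(0))
```

A small `tau` keeps the target networks slowly varying, which stabilizes the bootstrapped critic targets.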
- end(reward)
Complete an episode in a single environment.
Stores the final transition into the replay buffer.
Parameters
- reward : float
Final reward received at the end of the episode.
- end_batch(rewards)
Complete a batch of episodes.
Stores terminal transitions into the replay buffer and performs TD3 updates if conditions are met.
Parameters
- rewards : array-like, shape (N,)
Final rewards received for each terminated environment.
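The conditions under which updates run are not spelled out here. Assuming they follow the constructor parameters `update_start_size`, `update_frequency`, and `policy_delay`, the gating could look like the sketch below. The function `plan_updates` and its arguments are hypothetical and only illustrate the scheduling idea.

```python
def plan_updates(total_steps, buffer_len, critic_updates_done,
                 update_start_size=256, update_frequency=1, policy_delay=2):
    """Return (update_critics, update_actor) for the current step (assumed logic)."""
    ready = buffer_len >= update_start_size        # enough transitions collected
    due = total_steps % update_frequency == 0      # update only every k-th step
    update_critics = ready and due
    # Actor and target networks are refreshed once per `policy_delay` critic updates.
    update_actor = update_critics and (critic_updates_done + 1) % policy_delay == 0
    return update_critics, update_actor

# Simulate six consecutive steps with a sufficiently full buffer.
schedule, critic_updates = [], 0
for step in range(1, 7):
    do_critic, do_actor = plan_updates(step, buffer_len=1000,
                                       critic_updates_done=critic_updates)
    critic_updates += int(do_critic)
    schedule.append((do_critic, do_actor))
```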
- load(filepath)
Load the agent’s state from a file.
This method updates all active networks and optimizers, and immediately synchronizes the target networks to match the loaded weights using a hard copy.
Parameters
- filepath : str
The path to the file containing the saved state.
- reset()
Reset the agent state for a new run.
Clears the replay buffer, resets counters, and rebuilds networks and optimizers to start training from scratch.
Notes
- Resets total_steps to zero.
- Clears cached previous state and action.
- Calls reset_nets_and_opts() to reinitialize networks.
Returns
- None
Agent state and networks are reset.
- reset_nets_and_opts()
Build or rebuild all networks and optimizers.
Initializes the policy network, twin Q-networks, and their target counterparts. Also sets up optimizers for the actor and both critics.
Workflow
1. Construct the policy network with Tanh activation on the output.
2. Construct twin Q-networks (Q1 and Q2).
3. Deep copy networks to create target policy and target critics.
4. Set target networks to evaluation mode (no gradient updates).
5. Initialize Adam optimizers for actor and critics.
Returns
- None
Networks and optimizers are rebuilt in-place.
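The workflow above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the library's code: the helper names `mlp` and `build_nets_and_opts`, the ReLU hidden activations, and critics that take the concatenated state and action are all assumptions. Note that `eval()` alone does not stop gradient tracking, so the sketch also disables `requires_grad` on the target parameters.

```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, out_act=None):
    """Fully connected network with ReLU hidden layers (assumed activation)."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ReLU()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

def build_nets_and_opts(state_dim, action_dim, policy_arch=(256, 256),
                        q_arch=(256, 256), actor_lr=1e-4, critic_lr=1e-3):
    # Steps 1-2: policy with Tanh output and twin Q-networks.
    nets = {
        "policy": mlp(state_dim, policy_arch, action_dim, out_act=nn.Tanh()),
        "q1": mlp(state_dim + action_dim, q_arch, 1),
        "q2": mlp(state_dim + action_dim, q_arch, 1),
    }
    # Steps 3-4: deep-copied targets, in eval mode and excluded from gradients.
    targets = {name: copy.deepcopy(net) for name, net in nets.items()}
    for target in targets.values():
        target.eval()
        for p in target.parameters():
            p.requires_grad_(False)
    # Step 5: one Adam optimizer for the actor and one per critic.
    opts = {
        "policy": torch.optim.Adam(nets["policy"].parameters(), lr=actor_lr),
        "q1": torch.optim.Adam(nets["q1"].parameters(), lr=critic_lr),
        "q2": torch.optim.Adam(nets["q2"].parameters(), lr=critic_lr),
    }
    return nets, targets, opts

nets, targets, opts = build_nets_and_opts(state_dim=3, action_dim=2,
                                          policy_arch=(8, 8), q_arch=(8, 8))
actions = nets["policy"](torch.zeros(4, 3))  # Tanh keeps outputs in [-1, 1]
```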
- save(filepath)
Save the agent’s complete state to a file.
Saves the state_dicts of the policy network, both critics, and all three optimizers, so that training can be resumed from the saved weights and optimizer state. On load(), the target networks are synchronized to the loaded weights with a hard copy.
Parameters
- filepath : str
The path to the file where the state should be saved.
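A checkpoint of this kind, together with the hard target synchronization described for `load`, might be written and restored as in the sketch below. The dictionary layout and the helper names `save_checkpoint` and `load_checkpoint` are assumptions for illustration; the actual file format used by `TD3Agent` is not documented here.

```python
import copy
import os
import tempfile
import torch
import torch.nn as nn

def save_checkpoint(filepath, nets, opts):
    """Write the state_dicts of the online networks and their optimizers."""
    torch.save({
        "nets": {name: net.state_dict() for name, net in nets.items()},
        "opts": {name: opt.state_dict() for name, opt in opts.items()},
    }, filepath)

def load_checkpoint(filepath, nets, opts, target_nets):
    """Restore networks and optimizers, then hard-copy weights into the targets."""
    checkpoint = torch.load(filepath, map_location="cpu")
    for name, net in nets.items():
        net.load_state_dict(checkpoint["nets"][name])
        target_nets[name].load_state_dict(net.state_dict())  # hard copy
    for name, opt in opts.items():
        opt.load_state_dict(checkpoint["opts"][name])

# Tiny stand-in networks (assumptions for the example only).
nets = {"policy": nn.Linear(3, 2), "q1": nn.Linear(5, 1), "q2": nn.Linear(5, 1)}
opts = {k: torch.optim.Adam(n.parameters(), lr=1e-3) for k, n in nets.items()}
target_nets = {k: copy.deepcopy(n) for k, n in nets.items()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "td3_checkpoint.pt")
    save_checkpoint(path, nets, opts)
    saved_weight = nets["policy"].weight.detach().clone()
    with torch.no_grad():
        nets["policy"].weight.add_(1.0)  # simulate further training
    load_checkpoint(path, nets, opts, target_nets)
```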
- start(state, deterministic=False)
Begin a new episode in a single environment.
Parameters
- state : array-like
Initial state of the environment.
- deterministic : bool, optional
If True, selects deterministic actions (default=False).
Returns
- np.ndarray
Selected action.
- start_batch(states, deterministic=False)
Begin a batch of episodes.
Selects actions for multiple environments simultaneously.
Parameters
- states : array-like, shape (N, state_dim)
Batch of initial states.
- deterministic : bool, optional
If True, selects deterministic actions (default=False).
Returns
- np.ndarray
Array of selected actions of shape (N, action_dim).
- step(reward, state, done=False, deterministic=False)
Take a step in a single environment.
Stores transition, performs updates if conditions are met, and selects the next action.
Parameters
- reward : float
Reward from the previous action.
- state : array-like
Next state observed.
- done : bool, optional
Whether the episode has terminated (default=False).
- deterministic : bool, optional
If True, selects deterministic actions (default=False).
Returns
- np.ndarray
Selected action.
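One plausible call order for the single-environment methods is `start`, then repeated `step`, then `end` on the terminal transition. The sketch below shows that loop with stand-ins: `StubAgent` and `ToyEnv` are hypothetical classes that only mimic the documented signatures. Since `step` also accepts a `done` flag, the library may support other patterns.

```python
import random

class StubAgent:
    """Hypothetical stand-in exposing the documented start/step/end signatures."""
    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = random.Random(seed)
        self.transitions = 0  # number of transitions the agent has been given

    def _act(self):
        return [self.rng.uniform(-1.0, 1.0) for _ in range(self.action_dim)]

    def start(self, state, deterministic=False):
        return self._act()

    def step(self, reward, state, done=False, deterministic=False):
        self.transitions += 1
        return self._act()

    def end(self, reward):
        self.transitions += 1

class ToyEnv:
    """Hypothetical 1-D environment that terminates after `horizon` steps."""
    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self):
        self.t, self.x = 0, 0.0
        return [self.x]

    def step(self, action):
        self.t += 1
        self.x += action[0]
        return [self.x], -abs(self.x), self.t >= self.horizon

def run_episode(agent, env):
    action = agent.start(env.reset())
    episode_return, done = 0.0, False
    while not done:
        state, reward, done = env.step(action)
        episode_return += reward
        if done:
            agent.end(reward)                   # terminal transition
        else:
            action = agent.step(reward, state)  # store, maybe update, act
    return episode_return

agent = StubAgent(action_dim=1)
episode_return = run_episode(agent, ToyEnv(horizon=5))
```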
- step_batch(rewards, next_states, dones, deterministic=False)
Take a step in multiple environments.
Stores transitions in the replay buffer, performs TD3 updates if conditions are met, and selects next actions.
Parameters
- rewards : array-like, shape (N,)
Rewards from the previous actions.
- next_states : array-like, shape (N, state_dim)
Next states observed.
- dones : array-like, shape (N,)
Boolean flags indicating episode termination.
- deterministic : bool, optional
If True, selects deterministic actions (default=False).
Returns
- np.ndarray
Array of selected actions of shape (N, action_dim).
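To make the array shapes of the batched interface concrete, here is a short NumPy sketch with N parallel environments. `StubBatchAgent` and the toy dynamics are hypothetical; they only mirror the documented `start_batch` and `step_batch` signatures and shapes.

```python
import numpy as np

class StubBatchAgent:
    """Hypothetical stand-in mirroring the documented batch signatures."""
    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = np.random.default_rng(seed)
        self.stored = 0  # transitions received so far

    def _act(self, states):
        n = np.asarray(states).shape[0]
        return self.rng.uniform(-1.0, 1.0, size=(n, self.action_dim))

    def start_batch(self, states, deterministic=False):
        return self._act(states)

    def step_batch(self, rewards, next_states, dones, deterministic=False):
        self.stored += len(rewards)
        return self._act(next_states)

N, state_dim, action_dim = 4, 3, 2
agent = StubBatchAgent(action_dim)
states = np.zeros((N, state_dim))            # shape (N, state_dim)
actions = agent.start_batch(states)          # shape (N, action_dim)
for _ in range(10):
    # Toy dynamics (assumption): every state component drifts with the action sum.
    next_states = states + 0.1 * actions.sum(axis=1, keepdims=True)
    rewards = -np.linalg.norm(next_states, axis=1)   # shape (N,)
    dones = np.zeros(N, dtype=bool)                  # shape (N,)
    actions = agent.step_batch(rewards, next_states, dones)
    states = next_states
```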