Deep Deterministic Policy Gradient (DDPG)

class rlforge.agents.policy_gradient.ddpg.DDPGAgent(state_dim, action_dim, policy_net_architecture=(256, 256), q_net_architecture=(256, 256), actor_lr=0.0001, critic_lr=0.001, discount=0.99, tau=0.001, update_frequency=1, buffer_size=1000000, mini_batch_size=64, update_start_size=256, action_low=None, action_high=None, noise_std=0.1, device=None)

Deep Deterministic Policy Gradient (DDPG) Agent for continuous action spaces.

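DDPG is an off-policy actor-critic algorithm that learns a deterministic policy for continuous control tasks. It combines the deterministic policy gradient (DPG) with ideas from deep Q-learning: transitions are replayed from a buffer, and slowly updated target copies of the actor and critic stabilize training. Exploration comes from noise added to the policy's actions.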

This implementation supports vectorized environments through the *_batch methods and builds its networks and optimizers internally, so that reset() can reinitialize them from scratch.

Parameters

state_dim (int): Dimension of the input state space.
action_dim (int): Dimension of the continuous action space.
policy_net_architecture (tuple of int, optional): Hidden layer sizes for the actor/policy network (default=(256, 256)).
q_net_architecture (tuple of int, optional): Hidden layer sizes for the critic/Q-network (default=(256, 256)).
actor_lr (float, optional): Learning rate for the actor network (default=1e-4).
critic_lr (float, optional): Learning rate for the critic network (default=1e-3).
discount (float, optional): Discount factor γ applied to future rewards (default=0.99).
tau (float, optional): Polyak averaging factor for soft target network updates (default=0.001).
update_frequency (int, optional): Frequency (in steps) of training updates (default=1).
buffer_size (int, optional): Maximum size of the replay buffer (default=1,000,000).
mini_batch_size (int, optional): Size of mini-batches sampled from the replay buffer (default=64).
update_start_size (int, optional): Minimum number of transitions before updates begin (default=256).
action_low (float or np.ndarray, optional): Lower bound(s) for continuous actions.
action_high (float or np.ndarray, optional): Upper bound(s) for continuous actions.
noise_std (float, optional): Standard deviation of Gaussian exploration noise (default=0.1).
device (str or torch.device, optional): Device to run computations on (“cpu” or “cuda”). Defaults to CUDA if available.
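
The tau parameter above controls how quickly the target networks track the online networks. A minimal sketch of a Polyak (soft) update, assuming the conventional form target <- tau * online + (1 - tau) * target. The function name soft_update and the dictionary-of-arrays layout are illustrative only and are not part of this class:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Blend online parameters into the target parameters in place."""
    for name, online in online_params.items():
        target_params[name] *= (1.0 - tau)   # keep most of the old target
        target_params[name] += tau * online  # move slightly toward the online values

# Toy arrays standing in for network weights (hypothetical).
online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
soft_update(target, online, tau=0.001)
# After one update every entry of target["w"] equals 0.001.
```

With a small tau such as the default 0.001, the target moves only 0.1% of the way toward the online weights per update, which is what keeps the bootstrapped targets slowly varying.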

end(reward)

Complete an episode in a single environment.

Stores the final transition into the replay buffer.

Parameters

reward (float): Final reward received at the end of the episode.

end_batch(rewards)

Complete a batch of episodes.

Stores terminal transitions into the replay buffer and performs DDPG updates if conditions are met.

Parameters

rewards (array-like, shape (N,)): Final rewards received for each terminated environment.

Notes

Each terminal transition is stored as (S_t, A_t, R_t, S_{t+1}=S_t, Done=True).
DDPG stores the noisy action that was executed in the environment.
Training is triggered after storing transitions if the replay buffer contains at least update_start_size samples.
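
The notes above can be illustrated with a small replay buffer. This is a sketch under stated assumptions, not the buffer used by this class: the ReplayBuffer name, its methods, and the toy values are hypothetical, and only the tuple layout and the update_start_size gate come from the notes.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest entries are dropped when full

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

update_start_size = 4  # small value for illustration; the documented default is 256
mini_batch_size = 2

buffer = ReplayBuffer(capacity=100)
for t in range(3):
    # Ordinary transitions with distinct next states.
    buffer.add([float(t)], [0.1], 1.0, [float(t + 1)], False)

# Terminal transition: the next state repeats the last state and done is True.
# The stored action is the noisy action that was actually executed.
last_state, noisy_action = [3.0], [0.25]
buffer.add(last_state, noisy_action, 0.0, last_state, True)

# Updates are only attempted once enough transitions have been collected.
ready = len(buffer) >= update_start_size
batch = buffer.sample(mini_batch_size) if ready else []
```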

load(path)

Load the agent’s parameters and optimizer states from a file.

Parameters

path (str): The file path from which to load the agent’s state.

reset()

Reset the agent state for a new run.

Clears the replay buffer, resets counters, and rebuilds networks and optimizers to start training from scratch.

Notes

Resets total_steps to zero.
Clears cached previous state and action.
Calls reset_nets_and_opts() to reinitialize networks and optimizers.

Returns

None: Agent state and networks are reset.

reset_nets_and_opts()

Build or rebuild all networks and optimizers.

Initializes the policy network, Q-network, and their target counterparts. Also sets up optimizers for actor and critic.

Workflow

Construct policy network with Tanh activation on the output.
Construct Q-network with state+action input and scalar output.
Deep copy networks to create target policy and target critic.
Set target networks to evaluation mode (no gradient updates).
Initialize Adam optimizers for actor and critic.

Returns

None: Networks and optimizers are rebuilt in-place.
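
The workflow above can be sketched in PyTorch as follows. This is an illustration, not the source of this method: the build_mlp helper, the ReLU hidden activations, and the explicit requires_grad_(False) call are assumptions, while the Tanh output, the state+action critic input, the deep copies, evaluation mode, and the Adam optimizers follow the listed steps.

```python
import copy
import torch
import torch.nn as nn

def build_mlp(in_dim, hidden_sizes, out_dim, output_activation=None):
    """Hypothetical helper: fully connected network with ReLU hidden layers."""
    layers, prev = [], in_dim
    for size in hidden_sizes:
        layers += [nn.Linear(prev, size), nn.ReLU()]
        prev = size
    layers.append(nn.Linear(prev, out_dim))
    if output_activation is not None:
        layers.append(output_activation)
    return nn.Sequential(*layers)

state_dim, action_dim = 3, 1

# Actor: Tanh keeps raw outputs in [-1, 1]. Critic: takes state and action, returns a scalar.
policy_net = build_mlp(state_dim, (256, 256), action_dim, nn.Tanh())
q_net = build_mlp(state_dim + action_dim, (256, 256), 1)

# Targets start as exact copies and are put in evaluation mode.
target_policy = copy.deepcopy(policy_net).eval()
target_q = copy.deepcopy(q_net).eval()
for p in list(target_policy.parameters()) + list(target_q.parameters()):
    p.requires_grad_(False)  # assumption: targets are excluded from autograd

actor_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Shape check on a dummy batch of two states.
state = torch.zeros(2, state_dim)
action = policy_net(state)
q_value = q_net(torch.cat([state, action], dim=1))
```

Note that eval() only changes the behavior of layers such as dropout or batch normalization; in this sketch it is the requires_grad_(False) call that stops gradients from reaching the target parameters.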

save(path)

Save the agent’s parameters and optimizer states to a file.

Parameters

path (str): The file path where the agent’s state should be saved.

start(state, deterministic=False)

Begin a new episode in a single environment.

Parameters

state (array-like): Initial state of the environment.
deterministic (bool, optional): If True, selects deterministic actions (default=False).

Returns

np.ndarray: Selected action.

start_batch(states, deterministic=False)

Begin a batch of episodes.

Selects actions for multiple environments simultaneously. Adds Gaussian noise for exploration if not deterministic.

Parameters

states (array-like, shape (N, state_dim)): Batch of initial states.
deterministic (bool, optional): If True, selects deterministic actions (default=False).

Returns

np.ndarray: Array of selected actions of shape (N, action_dim).
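
The exploration step described for start_batch can be sketched as below. It assumes that noise with standard deviation noise_std is added to the actor output and that the result is clipped to [action_low, action_high]; whether this class clips in exactly this way is not stated here, and the explore function is a hypothetical name.

```python
import numpy as np

def explore(actions, noise_std, action_low, action_high, rng, deterministic=False):
    """Optionally add Gaussian noise to actions, then clip them to the bounds."""
    actions = np.asarray(actions, dtype=np.float64)
    if not deterministic:
        actions = actions + rng.normal(0.0, noise_std, size=actions.shape)
    return np.clip(actions, action_low, action_high)

rng = np.random.default_rng(0)
# Stand-in for the actor output on N=2 environments with action_dim=2.
policy_out = np.array([[0.95, -0.2], [-0.99, 0.4]])

noisy = explore(policy_out, 0.1, -1.0, 1.0, rng)                        # exploration
greedy = explore(policy_out, 0.1, -1.0, 1.0, rng, deterministic=True)  # evaluation
```

Passing deterministic=True skips the noise, which is the usual choice when evaluating a trained policy.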

step(reward, state, done=False, deterministic=False)

Take a step in a single environment.

Stores the transition in the replay buffer, performs updates if conditions are met, and selects the next action.

Parameters

reward (float): Reward from the previous action.
state (array-like): Next state observed.
done (bool, optional): Whether the episode has terminated (default=False).
deterministic (bool, optional): If True, selects deterministic actions (default=False).

Returns

np.ndarray: Selected action.
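
One plausible way to drive the single-environment methods is start, then step for each non-terminal transition, then end for the terminal one. The sketch below shows that call order only. StandInAgent and ToyEnv are hypothetical classes written for this example; the stand-in acts randomly and just logs transitions, and routing the terminal reward through end rather than step(done=True) is an assumption.

```python
import random

class StandInAgent:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self):
        self.log = []     # recorded (s, a, r, s_next, done) tuples
        self.prev = None  # cached previous state and action

    def _act(self, state):
        return [random.uniform(-1.0, 1.0)]

    def start(self, state, deterministic=False):
        self.prev = (state, self._act(state))
        return self.prev[1]

    def step(self, reward, state, done=False, deterministic=False):
        self.log.append((*self.prev, reward, state, done))
        self.prev = (state, self._act(state))
        return self.prev[1]

    def end(self, reward):
        s, a = self.prev
        self.log.append((s, a, reward, s, True))  # terminal: next state repeats s

class ToyEnv:
    """Hypothetical 1-D environment that terminates after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], -abs(action[0]), self.t >= self.horizon

agent, env = StandInAgent(), ToyEnv(horizon=5)
state = env.reset()
action = agent.start(state)
while True:
    state, reward, done = env.step(action)
    if done:
        agent.end(reward)  # terminal reward closes the episode
        break
    action = agent.step(reward, state)
```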

step_batch(rewards, next_states, dones, deterministic=False)

Take a step in multiple environments.

Stores transitions in the replay buffer, performs DDPG updates if conditions are met, and selects next actions.

Parameters

rewards (array-like, shape (N,)): Rewards from the previous actions.
next_states (array-like, shape (N, state_dim)): Next states observed.
dones (array-like, shape (N,)): Boolean flags indicating episode termination.
deterministic (bool, optional): If True, selects deterministic actions (default=False).

Returns

np.ndarray: Array of selected actions of shape (N, action_dim).
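
For the update step mentioned above, the standard DDPG critic target is y = r + γ (1 − done) Q'(s', μ'(s')), where Q' and μ' are the target critic and target actor. The sketch below assumes this class uses that standard form. The td_targets function is a hypothetical name, and next_q stands in for the target critic's output.

```python
import numpy as np

def td_targets(rewards, dones, next_q, discount=0.99):
    """Bootstrapped critic targets with terminal masking."""
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    # Terminal transitions drop the bootstrap term entirely.
    return rewards + discount * not_done * np.asarray(next_q, dtype=np.float64)

# Two transitions: the first continues, the second is terminal.
y = td_targets(rewards=[1.0, 0.5], dones=[False, True], next_q=[10.0, 10.0])
# y[0] = 1.0 + 0.99 * 10.0 = 10.9, y[1] = 0.5
```

Because the done flag masks out the bootstrap term, the placeholder next state stored with a terminal transition has no effect on the target.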