Proximal Policy Optimization (PPO) Continuous
- class rlforge.agents.policy_gradient.ppo_continuous.PPOContinuous(state_dim, action_dim, network_architecture=[64, 64], actor_lr=0.0003, critic_lr=0.0003, discount=0.99, gae_lambda=0.95, clip_epsilon=0.2, update_epochs=10, mini_batch_size=64, rollout_length=2048, value_coef=0.5, entropy_coeff=0.0, max_grad_norm=0.5, tanh_squash=False, action_low=None, action_high=None, device=None)
Proximal Policy Optimization (PPO) Agent for continuous action spaces.
This agent implements PPO with Generalized Advantage Estimation (GAE), adapted for vectorized environments. Data is collected in (T, N, …) format and flattened to (T*N, …) for training. Networks are built internally to ensure proper re-initialization during reset.
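As an illustration of the (T, N, …) layout described above, the following is a minimal NumPy sketch of GAE computed over a rollout and then flattened to (T*N, …). It is not taken from the rlforge source: the function names `compute_gae` and `flatten_time_env`, the `last_values` bootstrap argument, and the convention that `dones[t]` marks an episode ending after step t are all assumptions made for this example.

```python
import numpy as np


def compute_gae(rewards, values, dones, last_values, discount=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over (T, N) arrays (illustrative sketch).

    rewards, values, dones: shape (T, N); dones[t] is true where the episode
    ended after step t (assumed convention).
    last_values: shape (N,), value estimates for the state after the final step.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    T, N = rewards.shape
    advantages = np.zeros((T, N), dtype=np.float64)
    gae = np.zeros(N, dtype=np.float64)
    next_values = np.asarray(last_values, dtype=np.float64)
    # Walk backwards in time so each step can reuse the advantage of the next one.
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error; bootstrapping is cut where the episode ended.
        delta = rewards[t] + discount * next_values * not_done - values[t]
        gae = delta + discount * gae_lambda * not_done * gae
        advantages[t] = gae
        next_values = values[t]
    returns = advantages + values
    return advantages, returns


def flatten_time_env(x):
    """Reshape an array from (T, N, ...) to (T*N, ...)."""
    x = np.asarray(x)
    T, N = x.shape[:2]
    return x.reshape(T * N, *x.shape[2:])
```

After flattening, every row is one transition, so mini-batches can be drawn without regard to which environment or time step a sample came from.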
Parameters
- state_dim : int
Dimension of the input state space.
- action_dim : int
Dimension of the continuous action space.
- network_architecture : list of int, optional
Sizes of hidden layers for both actor and critic networks (default=[64, 64]).
- actor_lr : float, optional
Learning rate for the actor/policy network (default=3e-4).
- critic_lr : float, optional
Learning rate for the critic/value network (default=3e-4).
- discount : float, optional
Discount factor γ applied to future rewards (default=0.99).
- gae_lambda : float, optional
GAE parameter λ controlling bias-variance tradeoff (default=0.95).
- clip_epsilon : float, optional
Clipping parameter for PPO objective (default=0.2).
- update_epochs : int, optional
Number of epochs per PPO update (default=10).
- mini_batch_size : int, optional
Size of mini-batches sampled during PPO updates (default=64).
- rollout_length : int, optional
Number of transitions per environment before an update (default=2048).
- value_coef : float, optional
Coefficient for value loss in PPO objective (default=0.5).
- entropy_coeff : float, optional
Coefficient for entropy bonus in PPO objective (default=0.0).
- max_grad_norm : float, optional
Maximum gradient norm for clipping (default=0.5).
- tanh_squash : bool, optional
Whether to apply tanh squashing to actions (default=False).
- action_low : float or np.ndarray, optional
Lower bound(s) for continuous actions.
- action_high : float or np.ndarray, optional
Upper bound(s) for continuous actions.
- device : str or torch.device, optional
Device to run computations on (“cpu” or “cuda”). Defaults to CUDA if available.
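To show how clip_epsilon, value_coef, and entropy_coeff typically enter the PPO objective, here is a minimal NumPy sketch of the standard clipped surrogate loss. It follows the usual textbook formulation and is not copied from this class, which may combine or weight the terms differently (for instance, it keeps separate actor and critic learning rates). The function name `ppo_loss` and its argument names are assumptions made for this example.

```python
import numpy as np


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns, entropy,
             clip_epsilon=0.2, value_coef=0.5, entropy_coeff=0.0):
    """Standard clipped PPO loss on a flattened mini-batch (illustrative sketch)."""
    # Probability ratio between the current policy and the one that collected the data.
    ratio = np.exp(new_log_probs - old_log_probs)
    clipped_ratio = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon)
    # Pessimistic (element-wise minimum) surrogate, negated because we minimize.
    policy_loss = -np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))
    # Mean squared error between predicted values and empirical returns.
    value_loss = np.mean((returns - values) ** 2)
    # The entropy bonus is subtracted so that higher entropy lowers the loss.
    return policy_loss + value_coef * value_loss - entropy_coeff * np.mean(entropy)
```

With a positive advantage, the clipping caps how much the objective can gain once the ratio exceeds 1 + clip_epsilon; with a negative advantage, the unclipped term is kept, so a harmful move away from the old policy is still fully penalized.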
- end(reward)
Complete an episode in a single environment.
Stores the final transition into the rollout buffer.
Parameters
- reward : float
Final reward received at the end of the episode.
- end_batch(rewards)
Complete episodes for multiple environments.
Stores terminal transitions into the rollout buffer and performs PPO updates if conditions are met.
Parameters
- rewards : array-like, shape (N,)
Final rewards received for each terminated environment.
- load(path)
Load the agent’s state from a file.
Restores the networks, the log standard deviation, and the optimizers. If the saved network architecture differs from the current one, the networks are rebuilt to match the saved state.
Parameters
- path : str
The file path from which to load the state.
- reset()
Reset the agent state for a new run.
Reinitializes the policy and value networks, optimizers, and clears the rollout buffer and cached transitions.
Workflow
1. Rebuild policy and value networks with fresh weights.
2. Reinitialize learnable log standard deviation parameter.
3. Reinitialize Adam optimizers for actor and critic.
4. Clear rollout buffer and reset step counter.
5. Reset cached previous state, action, log probability, and value.
Returns
- None
Agent state and networks are reset.
- save(path)
Save the agent’s state to a file.
This includes the state dictionaries for the policy and value networks, the learnable log standard deviation, and both optimizers.
Parameters
- path : str
The file path where the state should be saved.
- start(state)
Begin a new episode in a single environment.
Parameters
- state : array-like
Initial state of the environment.
Returns
- np.ndarray
Selected action.
- start_batch(states)
Begin new episodes in multiple environments.
Parameters
- states : array-like, shape (N, state_dim)
Batch of initial states.
Returns
- np.ndarray
Array of selected actions of shape (N, action_dim).
- step(reward, state, done=False)
Take a step in a single environment.
Stores transition, performs updates if conditions are met, and selects the next action.
Parameters
- reward : float
Reward from the previous action.
- state : array-like
Next state observed.
- done : bool, optional
Whether the episode has terminated (default=False).
Returns
- np.ndarray
Selected action.
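The single-environment methods start, step, and end are meant to be called in sequence over an episode. The sketch below shows one plausible call order using stand-ins: `StubAgent` and `ToyEnv` are hypothetical classes written for this example, not part of rlforge, and the stub returns random actions instead of running PPO. Closing the episode with end(reward) rather than step(..., done=True) is an assumption; the done flag on step is not exercised here.

```python
import numpy as np


class StubAgent:
    """Hypothetical stand-in exposing the start/step/end methods documented above."""

    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = np.random.default_rng(seed)
        self.calls = []  # records the order in which the methods are invoked

    def _act(self):
        return self.rng.uniform(-1.0, 1.0, size=self.action_dim)

    def start(self, state):
        self.calls.append("start")
        return self._act()

    def step(self, reward, state, done=False):
        self.calls.append("step")
        return self._act()

    def end(self, reward):
        self.calls.append("end")


class ToyEnv:
    """Hypothetical environment with fixed-length episodes and 3-dimensional states."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(3)

    def step(self, action):
        self.t += 1
        reward = -float(np.sum(np.square(action)))
        return np.full(3, float(self.t)), reward, self.t >= self.horizon


def run_episode(agent, env):
    """Drive one episode: start once, step until termination, then end."""
    action = agent.start(env.reset())
    total_reward = 0.0
    while True:
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            agent.end(reward)  # the final reward closes the episode
            return total_reward
        action = agent.step(reward, state)
```

Each call to step passes the reward earned by the previous action together with the newly observed state, which is why the loop fetches the environment transition before asking the agent for the next action.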
- step_batch(rewards, states, dones)
Take a step in multiple environments.
Stores transitions in the rollout buffer, performs PPO updates if conditions are met, and selects next actions.
Parameters
- rewards : array-like, shape (N,)
Rewards from the previous actions.
- states : array-like, shape (N, state_dim)
Next states observed.
- dones : array-like, shape (N,)
Boolean flags indicating episode termination.
Returns
- np.ndarray
Array of selected actions of shape (N, action_dim).
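The batched methods follow the same pattern across N environments. The sketch below illustrates one way the "conditions are met" check could work: transitions are buffered per step and an update fires once rollout_length steps have been collected, at which point the buffer stacks to shape (T, N). `BatchStubAgent` and `collect` are hypothetical names written for this example; the stub samples random actions and only counts updates, so the real trigger condition and buffer contents may differ.

```python
import numpy as np


class BatchStubAgent:
    """Hypothetical stand-in mirroring start_batch/step_batch as documented above."""

    def __init__(self, action_dim, rollout_length=4, seed=0):
        self.action_dim = action_dim
        self.rollout_length = rollout_length
        self.rng = np.random.default_rng(seed)
        self.rewards, self.dones = [], []
        self.num_updates = 0
        self.last_rollout_shape = None

    def _act(self, num_envs):
        return self.rng.uniform(-1.0, 1.0, size=(num_envs, self.action_dim))

    def start_batch(self, states):
        return self._act(len(states))

    def step_batch(self, rewards, states, dones):
        # Buffer one time step of data for all N environments.
        self.rewards.append(np.asarray(rewards, dtype=np.float64))
        self.dones.append(np.asarray(dones, dtype=bool))
        if len(self.rewards) == self.rollout_length:
            rollout = np.stack(self.rewards)  # shape (T, N)
            self.last_rollout_shape = rollout.shape
            self.num_updates += 1  # a real agent would run its PPO epochs here
            self.rewards.clear()
            self.dones.clear()
        return self._act(len(states))


def collect(agent, num_envs=3, state_dim=2, num_steps=8, seed=1):
    """Feed synthetic transitions from num_envs environments to the agent."""
    rng = np.random.default_rng(seed)
    states = rng.normal(size=(num_envs, state_dim))
    actions = agent.start_batch(states)
    for _ in range(num_steps):
        states = rng.normal(size=(num_envs, state_dim))
        rewards = -np.sum(actions ** 2, axis=1)
        dones = np.zeros(num_envs, dtype=bool)
        actions = agent.step_batch(rewards, states, dones)
    return actions
```

With rollout_length=4 and 8 steps, this stub performs two updates, each on a rollout of 4 steps across 3 environments.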