Proximal Policy Optimization (PPO) Discrete

class rlforge.agents.policy_gradient.ppo_discrete.PPODiscrete(state_dim, num_actions, actor_lr=0.0003, critic_lr=0.0003, discount=0.99, clip_epsilon=0.2, network_architecture=[64, 64], update_epochs=10, mini_batch_size=64, rollout_length=1024, value_coef=0.5, entropy_coeff=0.01, gae_lambda=0.95, device=None)

Proximal Policy Optimization (PPO) Agent for discrete action spaces.

This agent implements PPO with Generalized Advantage Estimation (GAE), adapted for vectorized environments. It supports rollout-based updates, advantage normalization, and the clipped surrogate objective.

Features

  • Separate actor and critic learning rates

  • Rollout-based updates with GAE

  • Vectorized batch processing (T, N)

  • Advantage normalization

  • Clipped objective for stable updates
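The GAE and advantage-normalization features listed above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the function names `compute_gae` and `normalize`, the nested-list (T, N) layout, and the convention that `dones[t][n]` marks the transition that ended an episode are all assumptions made for illustration.

```python
def compute_gae(rewards, values, dones, last_values, discount=0.99, gae_lambda=0.95):
    """Compute GAE advantages for a rollout laid out as T rows of N entries.

    rewards, values, dones: lists of T rows, each holding N per-environment entries.
    last_values: N bootstrap value estimates for the state after the final step.
    (Hypothetical helper; names and layout are illustrative.)
    """
    T, N = len(rewards), len(rewards[0])
    advantages = [[0.0] * N for _ in range(T)]
    for n in range(N):
        gae = 0.0
        next_value = last_values[n]
        # Walk backwards so each step can reuse the advantage of the step after it.
        for t in reversed(range(T)):
            not_done = 0.0 if dones[t][n] else 1.0
            # One-step TD error; bootstrapping is cut at episode boundaries.
            delta = rewards[t][n] + discount * next_value * not_done - values[t][n]
            gae = delta + discount * gae_lambda * not_done * gae
            advantages[t][n] = gae
            next_value = values[t][n]
    return advantages


def normalize(advantages, eps=1e-8):
    """Shift and scale all T*N advantages to zero mean and unit standard deviation."""
    flat = [a for row in advantages for a in row]
    mean = sum(flat) / len(flat)
    std = (sum((a - mean) ** 2 for a in flat) / len(flat)) ** 0.5
    return [[(a - mean) / (std + eps) for a in row] for row in advantages]
```

With `gae_lambda=0` the estimate reduces to the one-step TD error (low variance, more bias); with `gae_lambda=1` it becomes the discounted return minus the value baseline (less bias, more variance).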

Parameters

state_dim : int

Dimension of the input state space.

num_actions : int

Number of discrete actions available in the environment.

actor_lr : float, optional

Learning rate for the actor/policy network (default=3e-4).

critic_lr : float, optional

Learning rate for the critic/value network (default=3e-4).

discount : float, optional

Discount factor γ applied to future rewards (default=0.99).

clip_epsilon : float, optional

Clipping parameter ε for the PPO surrogate objective; the probability ratio is clipped to [1 − ε, 1 + ε] (default=0.2).

network_architecture : list of int, optional

Sizes of hidden layers for both actor and critic networks (default=[64, 64]).

update_epochs : int, optional

Number of epochs per PPO update (default=10).

mini_batch_size : int, optional

Size of mini-batches sampled during PPO updates (default=64).

rollout_length : int, optional

Number of transitions per environment before an update (default=1024).

value_coef : float, optional

Coefficient for value loss in PPO objective (default=0.5).

entropy_coeff : float, optional

Coefficient for entropy bonus in PPO objective (default=0.01).

gae_lambda : float, optional

GAE parameter λ controlling bias-variance tradeoff (default=0.95).

device : str or torch.device, optional

Device to run computations on (“cpu” or “cuda”). Defaults to CUDA if available.
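To show how `clip_epsilon`, `value_coef`, and `entropy_coeff` typically interact, here is a minimal pure-Python sketch of the standard PPO loss over one mini-batch. The function name `ppo_loss`, the mean-squared-error value loss, and the combination of all three terms into a single scalar are assumptions; since the agent takes separate actor and critic learning rates, its actual implementation may well optimize the policy and value terms separately.

```python
import math


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns, entropies,
             clip_epsilon=0.2, value_coef=0.5, entropy_coeff=0.01):
    """Standard PPO loss for one mini-batch (hypothetical helper, not the library API)."""
    n = len(advantages)
    surrogate = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the rollout policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
        # Pessimistic bound: take the smaller of the unclipped and clipped terms.
        surrogate += min(ratio * adv, clipped * adv)
    policy_loss = -surrogate / n
    # Assumed value loss: mean squared error against the return targets.
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    entropy = sum(entropies) / n
    # The entropy bonus is subtracted so that minimizing the loss encourages exploration.
    return policy_loss + value_coef * value_loss - entropy_coeff * entropy
```

The clipping only takes effect when it makes the objective smaller: a ratio of 2 with a positive advantage is capped at 1 + ε, while the same ratio with a negative advantage is left unclipped and fully penalized.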

end(reward)

Complete an episode in a single environment.

Stores the final transition into the rollout buffer.

Parameters

reward : float

Final reward received at the end of the episode.

end_batch(rewards)

Complete episodes for multiple environments.

Stores terminal transitions into the rollout buffer and performs PPO updates if conditions are met.

Parameters

rewards : array-like, shape (N,)

Final rewards received for each terminated environment.

load(filepath)

Load the agent’s state from a file.

Restores the network weights and optimizer states, and verifies that the stored hyperparameters match the current agent configuration.

Parameters

filepath : str

The path to the PyTorch checkpoint file.

Returns

None
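The hyperparameter check described for `load` could look roughly like the sketch below. The helper name `find_mismatches` and the dictionary keys are hypothetical, and the documentation does not say how the agent reacts to a mismatch (raising, warning, or something else), so the sketch only reports the differences.

```python
def find_mismatches(stored, current):
    """Return {name: (stored_value, current_value)} for every differing hyperparameter.

    Hypothetical helper; a key present in only one dictionary is reported with None
    on the side where it is missing.
    """
    mismatches = {}
    for name in sorted(set(stored) | set(current)):
        if name not in stored or name not in current or stored[name] != current[name]:
            mismatches[name] = (stored.get(name), current.get(name))
    return mismatches


# Illustrative values only: a checkpoint saved with a different clip_epsilon.
stored_hparams = {"state_dim": 4, "num_actions": 2, "clip_epsilon": 0.2}
current_hparams = {"state_dim": 4, "num_actions": 2, "clip_epsilon": 0.1}
mismatches = find_mismatches(stored_hparams, current_hparams)
```

Checking structural settings such as `state_dim`, `num_actions`, and `network_architecture` before restoring weights avoids shape errors that would otherwise surface only when the state dictionaries are loaded.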

reset()

Reset the agent state for a new run.

Reinitializes the policy and value networks and their optimizers, and clears the rollout buffer and cached transitions.

Workflow

  1. Rebuild policy and value networks with fresh weights.

  2. Reinitialize Adam optimizers for actor and critic.

  3. Clear rollout buffer and reset step counter.

  4. Reset cached previous state, action, log probability, and value.

Returns

None

Agent state and networks are reset.

save(filepath)

Save the agent’s state to a file.

This method serializes the policy and value networks, their respective optimizers, and all relevant hyperparameters into a single dictionary stored as a PyTorch checkpoint.

Parameters

filepath : str

The destination path where the model should be saved (e.g., ‘ppo_agent.pt’).

Returns

None

start(new_state)

Begin a new episode in a single environment.

Parameters

new_state : array-like

Initial state of the environment.

Returns

int

Selected action.

start_batch(states)

Begin new episodes in multiple environments.

Parameters

states : array-like, shape (N, state_dim)

Batch of initial states.

Returns

np.ndarray

Array of selected actions of shape (N,).

step(reward, new_state, done=False)

Take a step in a single environment.

Stores the transition, performs updates if conditions are met, and selects the next action.

Parameters

reward : float

Reward from the previous action.

new_state : array-like

Next state observed.

done : bool, optional

Whether the episode has terminated (default=False).

Returns

int

Selected action.
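The single-environment methods `start`, `step`, and `end` suggest an interaction loop like the one sketched below. To keep the example runnable without the library, it uses two hypothetical stand-ins: `RandomAgent`, which mimics only the method signatures documented here, and `CountdownEnv`, a toy environment. The sketch assumes terminal transitions are reported through `end`; how that interacts with the `done` flag of `step` is not specified here.

```python
import random


class RandomAgent:
    """Hypothetical stand-in exposing the documented start/step/end signatures."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.transitions = 0  # counts transitions a real agent would store

    def start(self, new_state):
        return self.rng.randrange(self.num_actions)

    def step(self, reward, new_state, done=False):
        self.transitions += 1
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.transitions += 1


class CountdownEnv:
    """Toy environment: reward 1.0 per step, episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], 1.0, self.t >= self.horizon


def run_episode(agent, env):
    """Run one episode and return the undiscounted sum of rewards."""
    state = env.reset()
    action = agent.start(state)
    total = 0.0
    while True:
        state, reward, done = env.step(action)
        total += reward
        if done:
            agent.end(reward)  # final transition of the episode
            return total
        action = agent.step(reward, state)
```

Swapping `RandomAgent` for a `PPODiscrete` instance should need no change to `run_episode`, provided the environment's state matches `state_dim` and its action count matches `num_actions`.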

step_batch(rewards, states, dones)

Take a step in multiple environments.

Stores transitions in the rollout buffer, performs PPO updates if conditions are met, and selects next actions.

Parameters

rewards : array-like, shape (N,)

Rewards from the previous actions.

states : array-like, shape (N, state_dim)

Next states observed.

dones : array-like, shape (N,)

Boolean flags indicating episode termination.

Returns

np.ndarray

Array of selected actions of shape (N,).
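The exact conditions that trigger a PPO update are not spelled out here. A common scheme, assumed in the sketch below, is to update once `rollout_length` steps have been collected from each of the N environments, flatten the (T, N) rollout into T × N samples, and then make `update_epochs` shuffled passes over it in chunks of `mini_batch_size`. The generator name `minibatch_indices` is hypothetical.

```python
import random


def minibatch_indices(rollout_length, num_envs, mini_batch_size, update_epochs, seed=0):
    """Yield lists of flat sample indices, one list per gradient step.

    Hypothetical helper: index i refers to sample i of the flattened (T, N) rollout.
    """
    rng = random.Random(seed)
    total = rollout_length * num_envs
    for _ in range(update_epochs):
        order = list(range(total))
        rng.shuffle(order)  # fresh permutation each epoch
        for start in range(0, total, mini_batch_size):
            yield order[start:start + mini_batch_size]


# Small illustrative numbers: T=8 steps, N=2 environments, batches of 4, 3 epochs.
batches = list(minibatch_indices(8, 2, 4, 3))
```

Under these assumptions, the default settings with a single environment would give 10 × (1024 / 64) = 160 gradient steps per update, and the count grows linearly with the number of environments.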