Policy Gradient Agents

Policy gradient methods directly optimize the agent’s policy by adjusting its parameters in the direction that increases expected rewards. Unlike value-based methods, which learn action-value functions and derive policies indirectly, policy gradient agents learn stochastic policies that can naturally handle both discrete and continuous action spaces.

A common approach is the actor-critic architecture, where:

  • The actor represents the policy and selects actions.

  • The critic estimates value functions and provides feedback to improve the actor’s parameters.

RLForge currently includes:

  • REINFORCE — the classic Monte Carlo policy gradient algorithm that updates parameters directly based on returns, without a critic.

  • Softmax Actor-Critic — uses a softmax distribution over discrete actions, allowing the agent to balance exploration and exploitation while learning directly from policy gradients.

  • Gaussian Actor-Critic — outputs continuous actions by sampling from a Gaussian distribution parameterized by mean and variance. The current implementation supports a single continuous output, making it suitable for environments with one-dimensional action spaces.

  • Deep Deterministic Policy Gradient (DDPG) — an off-policy actor-critic method for continuous control, using deterministic policies and target networks.

  • Twin Delayed Deep Deterministic Policy Gradient (TD3) — improves DDPG by using twin critics, delayed policy updates, and target policy smoothing.

  • Soft Actor-Critic (SAC) — an off-policy actor-critic method that maximizes both reward and entropy, encouraging exploration with stochastic policies.

  • Proximal Policy Optimization (PPO-Discrete) — applies PPO to discrete action spaces, using clipped objectives and GAE for stable updates.

  • Proximal Policy Optimization (PPO-Continuous) — applies PPO to continuous action spaces, supporting Gaussian policies with optional tanh squashing.