REINFORCE with Baseline

class rlforge.agents.policy_gradient.reinforce.REINFORCEAgent(num_actions, dims_ranges, alpha_theta=0.002, alpha_w=0.002, discount=0.99, iht_size=4096, num_tilings=8, num_tiles=8, baseline=False, wrap_dims=())

REINFORCE (Monte Carlo Policy Gradient) Agent for discrete action spaces.

This agent implements the REINFORCE algorithm using linear function approximation over tile-coded features. It supports both vanilla REINFORCE and REINFORCE with a baseline.

Vanilla REINFORCE (baseline=False): Uses the Monte Carlo return G_t as the update factor.
REINFORCE with Baseline (baseline=True): Uses the advantage A_t = G_t - V(s_t, w) as the update factor.
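
The two update factors above can be illustrated with a small NumPy sketch. It is not the library's code: the helper names `discounted_returns` and `update_factors` are hypothetical, and the baseline values V(s_t, w) are passed in as a plain array rather than computed from tile-coded features.

```python
import numpy as np

def discounted_returns(rewards, discount=0.99):
    """Compute G_t for every time step by sweeping the episode backwards."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + discount * g  # G_t = R_{t+1} + discount * G_{t+1}
        returns[t] = g
    return returns

def update_factors(returns, values=None):
    """Return G_t when no baseline is given, else A_t = G_t - V(s_t, w)."""
    if values is None:
        return returns
    return returns - np.asarray(values, dtype=float)

rewards = [1.0, 0.0, 2.0]            # R_1, R_2, R_3 of a three-step episode
returns = discounted_returns(rewards, discount=0.5)
advantages = update_factors(returns, values=[1.0, 1.0, 1.0])
print(returns)     # [1.5 1.  2. ]
print(advantages)  # [0.5 0.  1. ]
```

Subtracting the baseline leaves the expected gradient unchanged while typically reducing its variance, which is the usual motivation for setting baseline=True.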

Parameters

num_actions (int): Number of discrete actions available in the environment.
dims_ranges (list of tuple): Ranges for each state dimension, used by the tile coder.
alpha_theta (float, optional): Policy step size (default=2e-3).
alpha_w (float, optional): Baseline step size (default=2e-3). Only used if baseline=True.
discount (float, optional): Discount factor γ applied to future rewards (default=0.99).
iht_size (int, optional): Size of the index hash table for tile coding (default=4096).
num_tilings (int, optional): Number of tilings used in tile coding (default=8).
num_tiles (int, optional): Number of tiles per dimension (default=8).
baseline (bool, optional): Whether to use a baseline value function (default=False).
wrap_dims (tuple, optional): Dimensions to wrap in tile coding (default=()).

end(reward)

Process the final reward and execute the Monte Carlo update.

Computes returns G_t for the entire episode, calculates the update factor (G_t or A_t depending on baseline usage), updates the baseline function if enabled, and applies the policy gradient update to the actor.

Parameters

reward (float): Final reward received at the end of the episode.
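
As a rough illustration of the update that end() describes, the sketch below applies one time step's baseline and policy update for a linear softmax policy over binary tile features. Everything here is an assumption made for illustration: the function `reinforce_step`, the weight layout (`theta` with one row per action, `w` as a flat vector), and the omission of any γ^t scaling or per-tiling step-size normalisation. The library's actual update may differ in these details.

```python
import numpy as np

def softmax(preferences):
    """Numerically stable softmax over a 1-D array of action preferences."""
    shifted = preferences - np.max(preferences)
    exp = np.exp(shifted)
    return exp / exp.sum()

def reinforce_step(theta, w, tiles, action, g, alpha_theta, alpha_w, use_baseline):
    """Update theta (and w) in place for one time step; return the factor used."""
    factor = g
    if use_baseline:
        factor = g - w[tiles].sum()   # A_t = G_t - V(s_t, w)
        w[tiles] += alpha_w * factor  # move V(s_t, w) toward G_t
    probs = softmax(theta[:, tiles].sum(axis=1))
    grad = -probs                     # d ln pi(a|s) / d h(s, b) = 1[b == a] - pi(b|s)
    grad[action] += 1.0
    theta[:, tiles] += alpha_theta * factor * grad[:, None]
    return factor

theta = np.zeros((2, 16))   # hypothetical: 2 actions, 16 hashed feature indices
w = np.zeros(16)
tiles = np.array([1, 5])    # hypothetical active tile indices for s_t
factor = reinforce_step(theta, w, tiles, action=0, g=1.0,
                        alpha_theta=0.1, alpha_w=0.1, use_baseline=True)
```

With all weights starting at zero, the taken action's preference rises and the other action's falls by the same amount, since the softmax gradient components sum to zero.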

reset()

Reset the agent’s internal state.

Resets the actor (and baseline if enabled), clears the trajectory, and resets cached state-action pairs for a new experiment run.

start(state)

Begin a new episode.

Extracts active tiles for the initial state, computes action preferences, applies softmax to obtain action probabilities, samples an action, and caches the state-action pair.

Parameters

state (np.ndarray): The initial state or feature vector.

Returns

int: The chosen action.
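
The action-selection steps described for start() can be sketched as follows, assuming action preferences are the sum of per-action weights at the active tile indices. The names `select_action`, `theta`, and `active_tiles` are hypothetical, and the tile indices are hard-coded here rather than produced by a tile coder.

```python
import numpy as np

def softmax(preferences):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = preferences - np.max(preferences)
    exp = np.exp(shifted)
    return exp / exp.sum()

def select_action(theta, active_tiles, rng):
    """Sample an action from the softmax over linear action preferences."""
    preferences = theta[:, active_tiles].sum(axis=1)  # h(s, a) for each action
    probs = softmax(preferences)
    action = rng.choice(len(probs), p=probs)
    return int(action), probs

rng = np.random.default_rng(0)
theta = np.zeros((3, 4096))  # 3 actions, iht_size=4096 weights per action
active_tiles = np.array([5, 17, 256, 1024, 2048, 3000, 3500, 4000])  # one per tiling
action, probs = select_action(theta, active_tiles, rng)
```

With zero-initialised weights every preference is equal, so the first action is drawn uniformly at random.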

step(reward, state)

Perform a step transition.

Stores the transition (S_t, A_t, R_{t+1}), computes action preferences for the new state, applies softmax, samples the next action, and caches it.

Parameters

reward (float): Reward received from the previous action.
state (np.ndarray): The new state observed from the environment.

Returns

int: The next action chosen by the agent.
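
To show how start, step, and end fit together over one episode, here is a self-contained sketch that uses a stand-in agent with the same three methods. `StubAgent`, `run_episode`, and the scripted transition list are all hypothetical; in real use, an REINFORCEAgent and an actual environment would take their place.

```python
import numpy as np

class StubAgent:
    """Hypothetical stand-in exposing the same start/step/end methods."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)
        self.rewards = []

    def start(self, state):
        self.rewards = []  # new episode: clear the stored trajectory
        return int(self.rng.integers(self.num_actions))

    def step(self, reward, state):
        self.rewards.append(reward)
        return int(self.rng.integers(self.num_actions))

    def end(self, reward):
        self.rewards.append(reward)  # a real agent would run its update here

def run_episode(agent, initial_state, transitions):
    """Drive one episode from scripted (reward, next_state, done) tuples."""
    actions = [agent.start(initial_state)]
    for reward, next_state, done in transitions:
        if done:
            agent.end(reward)  # terminal transition: no next action is chosen
            break
        actions.append(agent.step(reward, next_state))
    return actions

agent = StubAgent(num_actions=2)
transitions = [
    (0.0, np.array([0.1]), False),
    (0.0, np.array([0.2]), False),
    (1.0, None, True),
]
actions = run_episode(agent, np.array([0.0]), transitions)
```

One action is selected per visited state, so an episode with three rewards yields three actions: one from start and two from step, with end consuming the final reward.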