REINFORCE with Baseline

class rlforge.agents.policy_gradient.reinforce.REINFORCEAgent(num_actions, dims_ranges, alpha_theta=0.002, alpha_w=0.002, discount=0.99, iht_size=4096, num_tilings=8, num_tiles=8, baseline=False, wrap_dims=())

REINFORCE (Monte Carlo Policy Gradient) Agent for discrete action spaces.

This agent implements the REINFORCE algorithm using linear function approximation over tile-coded features. It supports both vanilla REINFORCE and REINFORCE with a baseline.

  • Vanilla REINFORCE (baseline=False): Uses Monte Carlo return G_t as the update factor.

  • REINFORCE with Baseline (baseline=True): Uses advantage A_t = G_t - V(s_t, w) as the update factor.
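The two update factors above can be illustrated with a minimal sketch. The helper names `discounted_returns` and `update_factors` are hypothetical and not part of rlforge; the sketch assumes the rewards of one episode are available as a list and, when a baseline is used, that the baseline values V(s_t, w) have already been evaluated.

```python
def discounted_returns(rewards, discount=0.99):
    """Return [G_0, ..., G_{T-1}], where G_t = R_{t+1} + discount * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Sweep backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + discount * g
        returns[t] = g
    return returns


def update_factors(returns, baseline_values=None):
    """Vanilla REINFORCE uses G_t; with a baseline it uses G_t - V(s_t, w)."""
    if baseline_values is None:
        return list(returns)
    return [g - v for g, v in zip(returns, baseline_values)]


returns = discounted_returns([1.0, 1.0, 1.0], discount=0.5)
advantages = update_factors(returns, baseline_values=[1.0, 1.0, 1.0])
```

Subtracting a state-dependent baseline leaves the expected gradient unchanged and typically reduces its variance, which is the usual motivation for setting baseline=True.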

Parameters

num_actions : int

Number of discrete actions available in the environment.

dims_ranges : list of tuple

Ranges for each state dimension, used by the tile coder.

alpha_theta : float, optional

Policy step size (default=2e-3).

alpha_w : float, optional

Baseline step size (default=2e-3). Only used if baseline=True.

discount : float, optional

Discount factor γ applied to future rewards (default=0.99).

iht_size : int, optional

Size of the index hash table for tile coding (default=4096).

num_tilings : int, optional

Number of tilings used in tile coding (default=8).

num_tiles : int, optional

Number of tiles per dimension (default=8).

baseline : bool, optional

Whether to use a baseline value function (default=False).

wrap_dims : tuple, optional

Dimensions to wrap in tile coding (default=()).

end(reward)

Process the final reward and execute the Monte Carlo update.

Computes returns G_t for the entire episode, calculates the update factor (G_t or A_t depending on baseline usage), updates the baseline function if enabled, and applies the policy gradient update to the actor.

Parameters

reward : float

Final reward received at the end of the episode.
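The update performed by end() can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `reinforce_episode_update` is a hypothetical name, the episode is assumed to be stored as (active_tiles, action, reward) triples, the weights are plain nested lists, and the step sizes are applied as given. Sutton and Barto's pseudocode additionally scales each policy update by γ^t; that factor is omitted here because the description above does not say whether the agent applies it.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_episode_update(theta, w, episode,
                             alpha_theta=0.002, alpha_w=0.002, discount=0.99):
    """Apply one Monte Carlo update in place.

    theta   : list of per-action weight lists (the actor).
    w       : baseline weight list, or None for vanilla REINFORCE.
    episode : list of (active_tiles, action, reward) triples (assumed layout).
    """
    # Compute the return G_t for every time step by sweeping backwards.
    returns = [0.0] * len(episode)
    g = 0.0
    for t in reversed(range(len(episode))):
        g = episode[t][2] + discount * g
        returns[t] = g

    for (tiles, action, _), g_t in zip(episode, returns):
        factor = g_t
        if w is not None:
            # Advantage A_t = G_t - V(s_t, w), then move V(s_t, w) toward G_t.
            factor = g_t - sum(w[i] for i in tiles)
            for i in tiles:
                w[i] += alpha_w * factor
        # For a linear softmax policy over binary features, the gradient of
        # ln pi(a|s) for action b on each active tile is 1[b == a] - pi(b|s).
        probs = softmax([sum(row[i] for i in tiles) for row in theta])
        for b, row in enumerate(theta):
            grad = (1.0 if b == action else 0.0) - probs[b]
            for i in tiles:
                row[i] += alpha_theta * factor * grad
```

The advantage is computed before the baseline weights change, so the policy update for a time step uses the value estimate that was in effect at the start of that step's update.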

reset()

Reset the agent’s internal state.

Resets the actor (and baseline if enabled), clears the trajectory, and resets cached state-action pairs for a new experiment run.

start(state)

Begin a new episode.

Extracts active tiles for the initial state, computes action preferences, applies softmax to obtain action probabilities, samples an action, and caches the state-action pair.

Parameters

state : np.ndarray

The initial state or feature vector.

Returns

int

The chosen action.
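The action-selection steps described for start() can be sketched as below. The function names are hypothetical, and the sketch assumes the tile coder has already produced a list of active tile indices and that the actor holds one weight list per action; how rlforge stores these internally is not stated here.

```python
import math
import random


def action_preferences(theta, active_tiles):
    """Preference h(s, a): sum of action a's weights over the active tiles."""
    return [sum(row[i] for i in active_tiles) for row in theta]


def softmax(prefs):
    """Convert preferences to probabilities; subtracting the max avoids overflow."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


theta = [[0.0] * 8 for _ in range(3)]   # 3 actions, 8 weights each (toy sizes)
probs = softmax(action_preferences(theta, active_tiles=[1, 4, 6]))
action = sample_action(probs, rng=random.Random(0))
```

With all weights at zero every action has the same preference, so the initial policy in this sketch is uniform.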

step(reward, state)

Perform a step transition.

Stores the transition (S_t, A_t, R_{t+1}), computes action preferences for the new state, applies softmax, samples the next action, and caches the new state-action pair.

Parameters

reward : float

Reward received from the previous action.

state : np.ndarray

The new state observed from the environment.

Returns

int

The next action chosen by the agent.
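The start(), step(), and end() methods are meant to be called in sequence by an episode loop. The sketch below shows one way such a loop might look. `CorridorEnv`, `RandomAgent`, and `run_episode` are hypothetical stand-ins written for this example: the environment interface (reset() and step(action) returning state, reward, done) is an assumption, and `RandomAgent` only mimics the documented method signatures so the example runs without rlforge installed. A REINFORCEAgent would take its place and would receive np.ndarray states rather than plain lists.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: reach the goal cell at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return [float(self.pos)]

    def step(self, action):
        # Action 1 moves right; any other action moves left (clipped at 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= self.length
        return [float(self.pos)], -1.0, done


class RandomAgent:
    """Stand-in with the same start/step/end signatures as documented above."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.rewards = []

    def start(self, state):
        self.rewards = []
        return self.rng.randrange(self.num_actions)

    def step(self, reward, state):
        self.rewards.append(reward)
        return self.rng.randrange(self.num_actions)

    def end(self, reward):
        self.rewards.append(reward)


def run_episode(agent, env, max_steps=1000):
    """Drive one episode and return the undiscounted sum of rewards."""
    action = agent.start(env.reset())
    total = 0.0
    for t in range(max_steps):
        state, reward, done = env.step(action)
        total += reward
        if done or t == max_steps - 1:
            # The last reward goes to end(), which is where a REINFORCE
            # agent would run its Monte Carlo update.
            agent.end(reward)
            break
        action = agent.step(reward, state)
    return total


total = run_episode(RandomAgent(num_actions=2, seed=0), CorridorEnv(length=3))
```

Each reward is passed to the agent exactly once: intermediate rewards through step() and the final one through end().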