REINFORCE with Baseline
- class rlforge.agents.policy_gradient.reinforce.REINFORCEAgent(num_actions, dims_ranges, alpha_theta=0.002, alpha_w=0.002, discount=0.99, iht_size=4096, num_tilings=8, num_tiles=8, baseline=False, wrap_dims=())
REINFORCE (Monte Carlo Policy Gradient) Agent for discrete action spaces.
This agent implements the REINFORCE algorithm using linear function approximation over tile-coded features. It supports both vanilla REINFORCE and REINFORCE with a baseline.
Vanilla REINFORCE (baseline=False): Uses Monte Carlo return G_t as the update factor.
REINFORCE with Baseline (baseline=True): Uses advantage A_t = G_t - V(s_t, w) as the update factor.
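The two update factors above can be sketched in a few lines of NumPy. This is an illustration, not the rlforge implementation: the helper names `discounted_returns` and `update_factors` are hypothetical, and it assumes `rewards[t]` holds R_{t+1} and that the baseline estimates V(s_t, w) are supplied as an array.

```python
import numpy as np

def discounted_returns(rewards, discount=0.99):
    """Compute G_t for every step t, where rewards[t] is R_{t+1}."""
    returns = np.zeros(len(rewards))
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + discount * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + discount * g
        returns[t] = g
    return returns

def update_factors(returns, baseline_values=None):
    """Return G_t (vanilla) or G_t - V(s_t, w) (with baseline)."""
    returns = np.asarray(returns, dtype=float)
    if baseline_values is None:
        return returns
    return returns - np.asarray(baseline_values, dtype=float)

# Three-step episode with discount 0.5: G is [1.5, 1.0, 2.0].
G = discounted_returns([1.0, 0.0, 2.0], discount=0.5)
# Hypothetical baseline estimates of 1.0 per state give advantages [0.5, 0.0, 1.0].
A = update_factors(G, baseline_values=[1.0, 1.0, 1.0])
```

Subtracting a state-dependent baseline leaves the expected gradient unchanged and typically reduces its variance, which is the usual motivation for baseline=True.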
Parameters
- num_actions : int
Number of discrete actions available in the environment.
- dims_ranges : list of tuple
Ranges for each state dimension, used by the tile coder.
- alpha_theta : float, optional
Policy step size (default=2e-3).
- alpha_w : float, optional
Baseline step size (default=2e-3). Only used if baseline=True.
- discount : float, optional
Discount factor γ applied to future rewards (default=0.99).
- iht_size : int, optional
Size of the index hash table for tile coding (default=4096).
- num_tilings : int, optional
Number of tilings used in tile coding (default=8).
- num_tiles : int, optional
Number of tiles per dimension (default=8).
- baseline : bool, optional
Whether to use a baseline value function (default=False).
- wrap_dims : tuple, optional
Dimensions to wrap in tile coding (default=()).
- end(reward)
Process the final reward and execute the Monte Carlo update.
Computes returns G_t for the entire episode, calculates the update factor (G_t or A_t depending on baseline usage), updates the baseline function if enabled, and applies the policy gradient update to the actor.
Parameters
- reward : float
Final reward received at the end of the episode.
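A minimal sketch of what the per-time-step part of this update could look like for a linear softmax policy over binary tile features. It is not the rlforge source: the function name, the `(num_actions, iht_size)` weight layout, and the argument order are assumptions. It also omits the γ^t scaling used in some textbook presentations and does not divide the step sizes by the number of tilings; whether rlforge does either is not stated here.

```python
import numpy as np

def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    e = np.exp(prefs - np.max(prefs))
    return e / e.sum()

def reinforce_step_update(theta, w, active_tiles, action, g,
                          alpha_theta, alpha_w, baseline):
    """Update theta (and w if baseline=True) in place for one time step.

    theta: (num_actions, iht_size) policy weights (assumed layout).
    w: (iht_size,) baseline weights.
    active_tiles: integer indices of the active tiles, assumed distinct.
    """
    factor = g
    if baseline:
        # V(s_t, w) is the sum of the weights of the active tiles.
        factor = g - w[active_tiles].sum()
        w[active_tiles] += alpha_w * factor
    probs = softmax(theta[:, active_tiles].sum(axis=1))
    # Gradient of log pi(a|s) w.r.t. each active weight: 1[b == a] - pi(b|s).
    grad = -probs
    grad[action] += 1.0
    theta[:, active_tiles] += alpha_theta * factor * grad[:, None]
    return factor

theta = np.zeros((2, 8))
w = np.zeros(8)
factor = reinforce_step_update(theta, w, np.array([0, 3]), action=1, g=1.0,
                               alpha_theta=0.1, alpha_w=0.1, baseline=True)
```

In this example both actions start with probability 0.5, so the weights of the taken action on the active tiles rise by 0.05 and those of the other action fall by 0.05.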
- reset()
Reset the agent’s internal state.
Resets the actor (and baseline if enabled), clears the trajectory, and resets cached state-action pairs for a new experiment run.
- start(state)
Begin a new episode.
Extracts active tiles for the initial state, computes action preferences, applies softmax to obtain action probabilities, samples an action, and caches the state-action pair.
Parameters
- state : np.ndarray
The initial state or feature vector.
Returns
- int
The chosen action.
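The preference, softmax, and sampling steps described for start() can be sketched as follows. The tile coder is left out: `active_tiles` is assumed to be the integer index array it would return, and `select_action` and the `(num_actions, iht_size)` weight layout are hypothetical names and shapes chosen for illustration.

```python
import numpy as np

def select_action(theta, active_tiles, rng):
    """Sample an action from a softmax over linear action preferences."""
    # Preference of each action: sum of its weights on the active tiles.
    prefs = theta[:, active_tiles].sum(axis=1)
    # Subtract the maximum before exponentiating to avoid overflow.
    exp_prefs = np.exp(prefs - prefs.max())
    probs = exp_prefs / exp_prefs.sum()
    action = int(rng.choice(len(probs), p=probs))
    return action, probs

rng = np.random.default_rng(0)
theta = np.zeros((3, 16))
theta[2, [1, 5]] = 50.0  # make action 2 overwhelmingly preferred
action, probs = select_action(theta, np.array([1, 5]), rng)
```

With all-zero weights the policy is uniform, so early episodes explore every action with equal probability.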
- step(reward, state)
Perform a step transition.
Stores the transition (S_t, A_t, R_{t+1}), computes action preferences for the new state, applies softmax, samples the next action, and caches it.
Parameters
- reward : float
Reward received from the previous action.
- state : np.ndarray
The new state observed from the environment.
Returns
- int
The next action chosen by the agent.
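The following sketch shows how start(), step(), and end() would typically be called over one episode. Everything except the three method names is hypothetical: `RandomAgent` is a stand-in with the same interface (not the rlforge class), and `CorridorEnv` is a toy environment invented for the example.

```python
import numpy as np

class RandomAgent:
    """Stand-in exposing start/step/end; it only records rewards."""
    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)
        self.rewards = []

    def start(self, state):
        self.rewards = []
        return int(self.rng.integers(self.num_actions))

    def step(self, reward, state):
        self.rewards.append(reward)
        return int(self.rng.integers(self.num_actions))

    def end(self, reward):
        self.rewards.append(reward)

class CorridorEnv:
    """Hypothetical 1-D corridor: start at 0, terminate at `length`."""
    def __init__(self, length=3, max_steps=50):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return np.array([float(self.pos)])

    def step(self, action):
        # Action 1 moves right; any other action moves left (clipped at 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.t += 1
        done = self.pos >= self.length or self.t >= self.max_steps
        return np.array([float(self.pos)]), -1.0, done

def run_episode(agent, env):
    """Drive one episode and return the undiscounted sum of rewards."""
    state = env.reset()
    action = agent.start(state)
    total = 0.0
    while True:
        state, reward, done = env.step(action)
        total += reward
        if done:
            agent.end(reward)  # the final reward goes to end(), not step()
            return total
        action = agent.step(reward, state)

agent = RandomAgent(num_actions=2)
total = run_episode(agent, CorridorEnv())
```

Because the update is Monte Carlo, learning happens only inside end(); step() just accumulates the trajectory.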