Gaussian Actor-Critic
- class rlforge.agents.policy_gradient.gaussian_actor_critic.GaussianActorCriticAgent(actor_step_size, critic_step_size, avg_reward_step_size, dims_ranges, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())
Gaussian Actor-Critic Agent with average reward baseline.
This agent implements a continuous-action actor-critic algorithm using tile coding for feature extraction and linear function approximation for both the actor (policy) and critic (value function). The actor outputs the mean and log standard deviation of a Gaussian distribution, from which actions are sampled. Updates are guided by the temporal-difference (TD) error and an average reward baseline.
Parameters
- actor_step_size : float
Learning rate for the actor updates.
- critic_step_size : float
Learning rate for the critic updates.
- avg_reward_step_size : float
Step size for updating the average reward baseline.
- dims_ranges : list of tuple
Ranges for each state dimension, used by the tile coder.
- iht_size : int, optional
Size of the index hash table for tile coding (default=4096).
- num_tilings : int, optional
Number of tilings used in tile coding (default=8).
- num_tiles : int, optional
Number of tiles per dimension (default=8).
- wrap_dims : tuple, optional
Dimensions to wrap in tile coding (default=()).
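To illustrate how dims_ranges, num_tilings, and num_tiles interact, here is a minimal tile-coding sketch. It is not the library's tile coder: the helper name active_tiles is hypothetical, each tiling is shifted by a simple uniform offset, and hashing into an index hash table (iht_size) and wrapping (wrap_dims) are left out.

```python
import math


def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling (hypothetical helper, no hashing)."""
    cells_per_dim = num_tiles + 1  # one extra cell absorbs the tiling offset
    cells_per_tiling = cells_per_dim ** len(state)
    tiles = []
    for tiling in range(num_tilings):
        index = 0
        for value, (low, high) in zip(state, dims_ranges):
            # Scale the value so the range [low, high] spans num_tiles tiles.
            scaled = (value - low) / (high - low) * num_tiles
            # Shift each tiling by a different fraction of one tile width.
            shifted = scaled + tiling / num_tilings
            # Clamp so out-of-range values still map to a valid cell.
            coord = max(0, min(int(math.floor(shifted)), num_tiles))
            index = index * cells_per_dim + coord
        # Give every tiling its own block of indices.
        tiles.append(tiling * cells_per_tiling + index)
    return tiles


tiles = active_tiles([0.5, -0.2], [(-1.0, 1.0), (-1.0, 1.0)])
```

Exactly one tile is active in each tiling, so every state is represented by num_tilings binary features.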
- end(reward)
Complete an episode.
Performs a final update of the actor and critic using the terminal reward and the last cached state-action pair.
Parameters
- reward : float
Final reward received at the end of the episode.
- get_td_error(prev_tiles, reward, active_tiles, avg_reward)
Compute the temporal-difference (TD) error.
Calculates the TD error using the reward, average reward baseline, the critic’s prediction for the next state, and the critic’s prediction for the previous state.
Parameters
- prev_tiles : array-like
Active tiles for the previous state.
- reward : float
Reward received for the transition.
- active_tiles : array-like
Active tiles for the current state.
- avg_reward : float
Current estimate of the average reward baseline.
Returns
- float
The computed TD error.
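The calculation described above can be sketched as follows. This assumes the standard average-reward form, delta = reward - avg_reward + v(next) - v(previous), with a linear critic whose value is the sum of the weights at the active tiles. The function td_error and the explicit critic_w argument are illustrative names, not the library's internals.

```python
def td_error(critic_w, prev_tiles, reward, active_tiles, avg_reward):
    """Average-reward TD error for a linear critic over binary tile features."""
    # With binary features, the value estimate is the sum of the active weights.
    v_prev = sum(critic_w[i] for i in prev_tiles)
    v_next = sum(critic_w[i] for i in active_tiles)
    return reward - avg_reward + v_next - v_prev


critic_w = [0.0, 0.5, 1.0, 2.0]
delta = td_error(critic_w, [0, 1], 1.0, [2, 3], 0.25)
```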
- reset()
Reset the agent.
Resets the actor and critic weights to their initial values and clears the average reward baseline.
- select_action(mu, sigma)
Select an action by sampling from a Gaussian distribution.
Parameters
- mu : float
Mean of the Gaussian distribution.
- sigma : float
Standard deviation of the Gaussian distribution.
Returns
- float
The sampled continuous action.
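A minimal sketch of this sampling step, using only the standard library. The helper name sample_action and its rng argument are assumptions made for illustration; the library may use a different random number source.

```python
import random


def sample_action(mu, sigma, rng=random):
    """Draw one action from a normal distribution with mean mu and std sigma."""
    return rng.gauss(mu, sigma)


rng = random.Random(0)  # seeded generator for reproducible draws
action = sample_action(1.0, 0.5, rng)
```

A larger sigma spreads the samples further from mu, which is how the policy explores.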
- start(new_state)
Begin a new episode.
Extracts active tiles for the initial state, computes the Gaussian parameters (mean and standard deviation), samples an action, and caches the state-action pair for future updates.
Parameters
- new_state : array-like
The initial state observed from the environment.
Returns
- float
The continuous action selected by the agent.
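One way the Gaussian parameters could be computed from the active tiles is sketched below. It assumes two separate linear weight vectors, one for the mean and one for the log standard deviation, and exponentiates the latter so that sigma stays positive. The names gaussian_params, mu_w, and log_sigma_w are hypothetical.

```python
import math


def gaussian_params(mu_w, log_sigma_w, active_tiles):
    """Return (mu, sigma) for a linear Gaussian policy over binary tile features."""
    mu = sum(mu_w[i] for i in active_tiles)
    # The actor parameterises log(sigma); exp() maps it back to a positive std.
    sigma = math.exp(sum(log_sigma_w[i] for i in active_tiles))
    return mu, sigma


mu, sigma = gaussian_params([0.5, 0.25, 0.0], [0.0, 0.0, 0.0], [0, 1])
```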
- step(reward, new_state)
Take a step in the environment.
Updates the actor and critic weights using the TD error and average reward baseline, then selects the next action by sampling from the Gaussian distribution defined by the actor.
Parameters
- reward : float
Reward received from the previous action.
- new_state : array-like
The new state observed from the environment.
Returns
- float
The next continuous action chosen by the agent.
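The update performed in a step can be sketched as below, assuming the textbook average-reward actor-critic rule for a Gaussian policy with a log-sigma parameterisation. The function name, its argument list, and the choice not to divide the step sizes by the number of tilings are all assumptions; the library's implementation may differ.

```python
def actor_critic_update(critic_w, mu_w, log_sigma_w, avg_reward, prev_tiles,
                        action, mu, sigma, delta,
                        actor_step_size, critic_step_size, avg_reward_step_size):
    """Update the weights in place and return the new average-reward estimate."""
    avg_reward += avg_reward_step_size * delta
    # Gradients of log N(action; mu, sigma) w.r.t. mu and log(sigma).
    grad_mu = (action - mu) / sigma ** 2
    grad_log_sigma = (action - mu) ** 2 / sigma ** 2 - 1.0
    # Binary features: only the weights of the previously active tiles change.
    for i in prev_tiles:
        critic_w[i] += critic_step_size * delta
        mu_w[i] += actor_step_size * delta * grad_mu
        log_sigma_w[i] += actor_step_size * delta * grad_log_sigma
    return avg_reward


critic_w, mu_w, log_sigma_w = [0.0] * 4, [0.0] * 4, [0.0] * 4
avg_reward = actor_critic_update(critic_w, mu_w, log_sigma_w, 0.0, [1, 2],
                                 action=1.0, mu=0.0, sigma=1.0, delta=2.0,
                                 actor_step_size=0.5, critic_step_size=0.25,
                                 avg_reward_step_size=0.125)
```

A positive TD error moves the mean toward the action just taken, and widens sigma only when that action was more than one standard deviation from the mean.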