Gaussian Actor Critic

class rlforge.agents.policy_gradient.gaussian_actor_critic.GaussianActorCriticAgent(actor_step_size, critic_step_size, avg_reward_step_size, dims_ranges, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())

Gaussian Actor-Critic Agent with average reward baseline.

This agent implements a continuous-action actor-critic algorithm using tile coding for feature extraction and linear function approximation for both the actor (policy) and critic (value function). The actor outputs the mean and log standard deviation of a Gaussian distribution, from which actions are sampled. Updates are guided by the temporal-difference (TD) error and an average reward baseline.
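As a rough illustration of the policy parameterization described above, the sketch below computes the Gaussian mean and standard deviation from two linear weight vectors indexed by active tiles. The function name `gaussian_params` and the weight arrays `actor_w_mu` and `actor_w_sigma` are hypothetical names chosen for this example, not rlforge attributes; the only assumption taken from the description is that the actor outputs a mean and a log standard deviation.

```python
import numpy as np

def gaussian_params(actor_w_mu, actor_w_sigma, active_tiles):
    """Return (mu, sigma) for a linear Gaussian policy over binary tile features."""
    # With binary features, a linear output is the sum of the weights at the active tiles.
    mu = actor_w_mu[active_tiles].sum()
    # The second weight vector models log(sigma); exponentiating keeps sigma positive.
    sigma = np.exp(actor_w_sigma[active_tiles].sum())
    return mu, sigma

iht_size = 4096
actor_w_mu = np.zeros(iht_size)      # hypothetical weights for the mean
actor_w_sigma = np.zeros(iht_size)   # hypothetical weights for log(sigma)
tiles = np.array([3, 17, 42, 99, 256, 1024, 2048, 4000])  # one index per tiling

mu, sigma = gaussian_params(actor_w_mu, actor_w_sigma, tiles)
```

With all weights at zero, this parameterization yields a standard normal policy (mean 0, standard deviation 1), which is one reason a log-scale parameterization is convenient for initialization.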

Parameters

actor_step_size (float): Learning rate for the actor updates.
critic_step_size (float): Learning rate for the critic updates.
avg_reward_step_size (float): Step size for updating the average reward baseline.
dims_ranges (list of tuple): Ranges for each state dimension, used by the tile coder.
iht_size (int, optional): Size of the index hash table for tile coding (default=4096).
num_tilings (int, optional): Number of tilings used in tile coding (default=8).
num_tiles (int, optional): Number of tiles per dimension (default=8).
wrap_dims (tuple, optional): Dimensions to wrap in tile coding (default=()).
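To make the roles of `dims_ranges`, `num_tilings`, and `num_tiles` more concrete, here is a minimal uniform-grid tile coder. It is a simplified stand-in, not the coder rlforge uses: it assumes each entry of `dims_ranges` is a `(low, high)` pair, it does not hash indices into a table of size `iht_size`, and it ignores `wrap_dims`. The function name `active_tiles` is hypothetical.

```python
import numpy as np

def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling for a uniform, offset grid."""
    state = np.asarray(state, dtype=float)
    lows = np.array([lo for lo, _ in dims_ranges], dtype=float)
    highs = np.array([hi for _, hi in dims_ranges], dtype=float)
    # Scale each dimension so that its range spans num_tiles tile widths.
    scaled = (state - lows) / (highs - lows) * num_tiles
    side = num_tiles + 1                       # extra tile absorbs the tiling offsets
    tiles_per_tiling = side ** len(dims_ranges)
    indices = []
    for t in range(num_tilings):
        offset = t / num_tilings               # shift each tiling by a fraction of a tile
        coords = np.clip(np.floor(scaled + offset).astype(int), 0, side - 1)
        flat = 0
        for c in coords:                       # row-major flattening of grid coordinates
            flat = flat * side + int(c)
        indices.append(t * tiles_per_tiling + flat)
    return np.array(indices)

tiles = active_tiles([0.0, 0.0], dims_ranges=[(-1.0, 1.0), (-1.0, 1.0)])
```

Each state activates exactly `num_tilings` features, so nearby states share some but not all tiles, which is what gives tile coding its local generalization.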

end(reward)

Complete an episode.

Performs a final update of the actor and critic using the terminal reward and the last cached state-action pair.

Parameters

reward (float): Final reward received at the end of the episode.
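The sketch below shows where `end` sits in an episode relative to `start` and `step`, using only the method signatures documented on this page. The environment callables `env_reset` and `env_step`, the `RecordingAgent` stand-in, and the helper `run_episode` are all hypothetical and exist only to make the example runnable without rlforge installed.

```python
def run_episode(agent, env_reset, env_step, max_steps=1000):
    """Drive one episode through an agent exposing start/step/end."""
    state = env_reset()
    action = agent.start(state)            # first action, no reward yet
    total_reward = 0.0
    for _ in range(max_steps):
        reward, state, done = env_step(action)
        total_reward += reward
        if done:
            agent.end(reward)              # final update, no next action
            break
        action = agent.step(reward, state)  # update, then choose the next action
    return total_reward

class RecordingAgent:
    """Stand-in with the same interface; records calls instead of learning."""
    def __init__(self):
        self.calls = []
    def start(self, new_state):
        self.calls.append("start")
        return 0.0
    def step(self, reward, new_state):
        self.calls.append("step")
        return 0.0
    def end(self, reward):
        self.calls.append("end")

def make_countdown_env(length=3):
    """Toy environment that terminates after `length` steps with reward -1 per step."""
    counter = {"n": 0}
    def env_reset():
        counter["n"] = 0
        return [0.0]
    def env_step(action):
        counter["n"] += 1
        return -1.0, [float(counter["n"])], counter["n"] >= length
    return env_reset, env_step

agent = RecordingAgent()
env_reset, env_step = make_countdown_env(length=3)
total = run_episode(agent, env_reset, env_step)
```

A real `GaussianActorCriticAgent` instance could be passed in place of `RecordingAgent`, provided the environment wrapper returns states in the format the agent's tile coder expects.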

get_td_error(prev_tiles, reward, active_tiles, avg_reward)

Compute the temporal-difference (TD) error.

Calculates the TD error using the reward, average reward baseline, the critic’s prediction for the next state, and the critic’s prediction for the previous state.

Parameters

prev_tiles (array-like): Active tiles for the previous state.
reward (float): Reward received for the transition.
active_tiles (array-like): Active tiles for the current state.
avg_reward (float): Current estimate of the average reward baseline.

Returns

float: The computed TD error.
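The four quantities listed above combine into the differential (average-reward) TD error, delta = reward - avg_reward + v(next state) - v(previous state). The sketch below is a standalone version of that formula. It assumes a linear critic over binary tile features, so each value prediction is the sum of the critic weights at the active tiles; the explicit `critic_w` argument is an assumption of this example, since the documented method takes no weight argument.

```python
import numpy as np

def td_error(critic_w, prev_tiles, reward, active_tiles, avg_reward):
    """Differential TD error for a linear critic over binary tile features."""
    v_prev = critic_w[prev_tiles].sum()    # critic's prediction for the previous state
    v_next = critic_w[active_tiles].sum()  # critic's prediction for the current state
    return reward - avg_reward + v_next - v_prev

critic_w = np.zeros(16)
critic_w[[0, 1]] = 0.5   # previous state's tiles: v_prev = 1.0
critic_w[[2, 3]] = 1.0   # current state's tiles:  v_next = 2.0

delta = td_error(critic_w, prev_tiles=[0, 1], reward=1.0,
                 active_tiles=[2, 3], avg_reward=0.25)
# delta = 1.0 - 0.25 + 2.0 - 1.0 = 1.75
```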

reset()

Reset the agent.

Resets the actor and critic weights to their initial values and clears the average reward baseline.

select_action(mu, sigma)

Select an action by sampling from a Gaussian distribution.

Parameters

mu (float): Mean of the Gaussian distribution.
sigma (float): Standard deviation of the Gaussian distribution.

Returns

float: The sampled continuous action.
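Sampling a continuous action from the policy can be sketched with NumPy's random generator. This is an illustrative standalone function, not the library's code; how rlforge seeds or stores its random number generator is not stated here, so the module-level `rng` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumed generator; seeding is for reproducibility only

def select_action(mu, sigma, rng=rng):
    """Draw one action from a Gaussian with mean mu and standard deviation sigma."""
    return float(rng.normal(loc=mu, scale=sigma))

# Empirical check: many samples should centre on mu with spread close to sigma.
samples = np.array([select_action(2.0, 0.5) for _ in range(10_000)])
```

Because the action is unbounded, environments with limited action ranges typically need the sampled value clipped or squashed before it is applied.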

start(new_state)

Begin a new episode.

Extracts active tiles for the initial state, computes the Gaussian parameters (mean and standard deviation), samples an action, and caches the state-action pair for future updates.

Parameters

new_state (array-like): The initial state observed from the environment.

Returns

float: The continuous action selected by the agent.

step(reward, new_state)

Take a step in the environment.

Updates the actor and critic weights using the TD error and average reward baseline, then selects the next action by sampling from the Gaussian distribution defined by the actor.

Parameters

reward (float): Reward received from the previous action.
new_state (array-like): The new state observed from the environment.

Returns

float: The next continuous action chosen by the agent.
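One common form of the update performed inside `step` is sketched below, following the standard average-reward actor-critic rule with a Gaussian policy. It is an assumption that rlforge uses exactly this ordering and these gradients. The `weights` dictionary, its keys, and the function name `actor_critic_update` are hypothetical. The gradients of the log-density with respect to the mean and log standard deviation outputs are (a - mu) / sigma^2 and (a - mu)^2 / sigma^2 - 1, respectively.

```python
import numpy as np

def actor_critic_update(weights, prev_tiles, action, delta,
                        actor_step_size, critic_step_size, avg_reward_step_size):
    """Apply one TD-error-driven update; prev_tiles must contain unique indices."""
    # Recompute the policy parameters that generated `action` in the previous state.
    mu = weights["actor_mu"][prev_tiles].sum()
    sigma = np.exp(weights["actor_sigma"][prev_tiles].sum())

    # Average reward baseline and critic both move in the direction of the TD error.
    weights["avg_reward"] += avg_reward_step_size * delta
    weights["critic"][prev_tiles] += critic_step_size * delta

    # Log-likelihood gradients for the mean and log(sigma) outputs.
    grad_mu = (action - mu) / sigma**2
    grad_log_sigma = (action - mu) ** 2 / sigma**2 - 1.0
    weights["actor_mu"][prev_tiles] += actor_step_size * delta * grad_mu
    weights["actor_sigma"][prev_tiles] += actor_step_size * delta * grad_log_sigma
    return weights

weights = {
    "actor_mu": np.zeros(8),
    "actor_sigma": np.zeros(8),
    "critic": np.zeros(8),
    "avg_reward": 0.0,
}
weights = actor_critic_update(weights, prev_tiles=[0, 1], action=2.0, delta=2.0,
                              actor_step_size=0.1, critic_step_size=0.5,
                              avg_reward_step_size=0.01)
```

In this example a positive TD error follows an action two standard deviations above the mean, so the update raises both the mean and the log standard deviation at the active tiles, making similar actions more likely.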