Gaussian Actor Critic

class rlforge.agents.policy_gradient.gaussian_actor_critic.GaussianActorCriticAgent(actor_step_size, critic_step_size, avg_reward_step_size, dims_ranges, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())

Gaussian Actor-Critic Agent with average reward baseline.

This agent implements a continuous-action actor-critic algorithm using tile coding for feature extraction and linear function approximation for both the actor (policy) and critic (value function). The actor outputs the mean and log standard deviation of a Gaussian distribution, from which actions are sampled. Updates are guided by the temporal-difference (TD) error and an average reward baseline.
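As a rough illustration of the linear Gaussian actor described above, the sketch below computes the policy parameters from a set of active tiles. The names `gaussian_params`, `w_mu`, and `w_sigma` are hypothetical and are not taken from the library; the sketch assumes binary tile features, so each linear output is just the sum of the weights at the active indices.

```python
import numpy as np

def gaussian_params(w_mu, w_sigma, active_tiles):
    """Return (mu, sigma) for a linear actor over binary tile features."""
    # With binary features, the dot product reduces to summing active weights.
    mu = w_mu[active_tiles].sum()
    # The actor parameterizes log(sigma); exponentiating keeps sigma positive.
    sigma = np.exp(w_sigma[active_tiles].sum())
    return mu, sigma

iht_size = 4096
w_mu = np.zeros(iht_size)       # hypothetical weights for the mean
w_sigma = np.zeros(iht_size)    # hypothetical weights for log(sigma)
active = np.array([3, 17, 101, 999])
w_mu[active] = 0.25

mu, sigma = gaussian_params(w_mu, w_sigma, active)
```

Parameterizing the log of the standard deviation, rather than the standard deviation itself, means no constraint is needed to keep it positive during gradient updates.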

Parameters

actor_step_size : float

Learning rate for the actor updates.

critic_step_size : float

Learning rate for the critic updates.

avg_reward_step_size : float

Step size for updating the average reward baseline.

dims_ranges : list of tuple

Ranges for each state dimension, used by the tile coder.

iht_size : int, optional

Size of the index hash table for tile coding (default=4096).

num_tilings : int, optional

Number of tilings used in tile coding (default=8).

num_tiles : int, optional

Number of tiles per dimension (default=8).

wrap_dims : tuple, optional

Dimensions to wrap in tile coding (default=()).
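To give a feel for how `dims_ranges`, `num_tilings`, and `num_tiles` interact, here is a simplified grid tile coder. It is only a sketch: it does not hash into an index hash table and ignores `wrap_dims`, so it should not be read as the library's implementation. The function name `active_tiles` and the offset scheme are assumptions made for illustration.

```python
import numpy as np

def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling for a continuous state."""
    state = np.asarray(state, dtype=float)
    lows = np.array([lo for lo, _ in dims_ranges], dtype=float)
    highs = np.array([hi for _, hi in dims_ranges], dtype=float)
    # Scale each dimension from its range to [0, num_tiles].
    scaled = (state - lows) / (highs - lows) * num_tiles
    # Each tiling needs one extra tile per dimension to cover its offset.
    tiles_per_tiling = (num_tiles + 1) ** len(dims_ranges)
    indices = []
    for t in range(num_tilings):
        # Shift each tiling by a different fraction of a tile width.
        offset = t / num_tilings
        coords = np.clip(np.floor(scaled + offset).astype(int), 0, num_tiles)
        # Flatten the per-dimension coordinates into one index in this tiling.
        flat = 0
        for c in coords:
            flat = flat * (num_tiles + 1) + int(c)
        indices.append(t * tiles_per_tiling + flat)
    return np.array(indices)

dims_ranges = [(-1.0, 1.0), (-1.0, 1.0)]
tiles = active_tiles([0.1, -0.3], dims_ranges)
```

Each state activates exactly one tile per tiling, so the feature vector always has `num_tilings` active entries, and nearby states share most of them.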

end(reward)

Complete an episode.

Performs a final update of the actor and critic using the terminal reward and the last cached state-action pair.

Parameters

reward : float

Final reward received at the end of the episode.

get_td_error(prev_tiles, reward, active_tiles, avg_reward)

Compute the temporal-difference (TD) error.

Calculates the TD error using the reward, average reward baseline, the critic’s prediction for the next state, and the critic’s prediction for the previous state.

Parameters

prev_tiles : array-like

Active tiles for the previous state.

reward : float

Reward received for the transition.

active_tiles : array-like

Active tiles for the current state.

avg_reward : float

Current estimate of the average reward baseline.

Returns

float

The computed TD error.
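In the average-reward setting, the TD error is commonly written as the reward minus the average reward estimate, plus the difference between the critic's values for the next and previous states. A minimal sketch of that calculation follows, assuming a linear critic over binary tile features; the standalone function `td_error` and the `critic_w` array are hypothetical stand-ins for whatever the agent stores internally.

```python
import numpy as np

def td_error(critic_w, prev_tiles, reward, active_tiles, avg_reward):
    """Average-reward TD error for a linear critic with binary features."""
    v_prev = critic_w[prev_tiles].sum()    # value estimate of previous state
    v_next = critic_w[active_tiles].sum()  # value estimate of current state
    return reward - avg_reward + v_next - v_prev

critic_w = np.zeros(16)
critic_w[[0, 1]] = 0.5   # previous state value: 1.0
critic_w[[2, 3]] = 1.0   # current state value: 2.0

delta = td_error(critic_w, np.array([0, 1]), 1.0, np.array([2, 3]), 0.25)
# 1.0 - 0.25 + 2.0 - 1.0 = 1.75
```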

reset()

Reset the agent.

Resets the actor and critic weights to their initial values and clears the average reward baseline.

select_action(mu, sigma)

Select an action by sampling from a Gaussian distribution.

Parameters

mu : float

Mean of the Gaussian distribution.

sigma : float

Standard deviation of the Gaussian distribution.

Returns

float

The sampled continuous action.
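Sampling from a Gaussian with a given mean and standard deviation can be sketched with NumPy's random generator. The helper `sample_action` and the explicit `rng` argument are assumptions for this example; the agent's own method takes only `mu` and `sigma`, and how it manages its random state is not documented here.

```python
import numpy as np

def sample_action(mu, sigma, rng):
    """Draw one continuous action from N(mu, sigma^2)."""
    return float(rng.normal(loc=mu, scale=sigma))

rng = np.random.default_rng(seed=0)
samples = np.array([sample_action(1.5, 0.2, rng) for _ in range(10_000)])
```

Over many draws, the sample mean and standard deviation should approach `mu` and `sigma`, which is a quick sanity check on the sampler.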

start(new_state)

Begin a new episode.

Extracts active tiles for the initial state, computes the Gaussian parameters (mean and standard deviation), samples an action, and caches the state-action pair for future updates.

Parameters

new_state : array-like

The initial state observed from the environment.

Returns

float

The continuous action selected by the agent.
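`start` is the first call in the start/step/end cycle documented on this page. The loop below sketches how the three methods might be driven for one episode. Both `RandomGaussianAgent` and `ToyEnv` are hypothetical stand-ins written for this example; the stand-in agent only mirrors the method names and does no learning.

```python
import numpy as np

class RandomGaussianAgent:
    """Hypothetical stand-in exposing start/step/end; it does not learn."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def start(self, new_state):
        return float(self.rng.normal(0.0, 1.0))

    def step(self, reward, new_state):
        return float(self.rng.normal(0.0, 1.0))

    def end(self, reward):
        pass

class ToyEnv:
    """Hypothetical 1-D task: reward is -|action|, episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return np.array([0.0])

    def step(self, action):
        self.t += 1
        return np.array([float(self.t)]), -abs(action), self.t >= 5

def run_episode(agent, env):
    state = env.reset()
    action = agent.start(state)             # first action of the episode
    total, steps = 0.0, 0
    while True:
        state, reward, done = env.step(action)
        total += reward
        steps += 1
        if done:
            agent.end(reward)               # final update, no next action
            return total, steps
        action = agent.step(reward, state)  # update, then pick next action

total_reward, num_steps = run_episode(RandomGaussianAgent(), ToyEnv())
```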

step(reward, new_state)

Take a step in the environment.

Updates the actor and critic weights using the TD error and average reward baseline, then selects the next action by sampling from the Gaussian distribution defined by the actor.

Parameters

reward : float

Reward received from the previous action.

new_state : array-like

The new state observed from the environment.

Returns

float

The next continuous action chosen by the agent.
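The update performed in a step can be sketched as follows. This is a textbook-style average-reward actor-critic update with a Gaussian policy, written under the assumption of binary tile features and an actor that parameterizes the log standard deviation. The function `actor_critic_update` and all array names are hypothetical, and the library's actual update may differ in its details.

```python
import numpy as np

def actor_critic_update(w_mu, w_sigma, critic_w, avg_reward,
                        prev_tiles, action, reward, active_tiles,
                        actor_step_size, critic_step_size,
                        avg_reward_step_size):
    """Apply one in-place update; return (td_error, new_avg_reward)."""
    # Average-reward TD error from the critic's two value estimates.
    delta = (reward - avg_reward
             + critic_w[active_tiles].sum() - critic_w[prev_tiles].sum())
    # Move the average reward estimate toward the observed rewards.
    avg_reward += avg_reward_step_size * delta
    # Critic: semi-gradient TD update on the previous state's active tiles.
    critic_w[prev_tiles] += critic_step_size * delta
    # Actor: gradient of log N(action; mu, sigma^2) w.r.t. each linear output.
    mu = w_mu[prev_tiles].sum()
    sigma = np.exp(w_sigma[prev_tiles].sum())
    grad_mu = (action - mu) / sigma**2
    grad_log_sigma = (action - mu) ** 2 / sigma**2 - 1.0
    w_mu[prev_tiles] += actor_step_size * delta * grad_mu
    w_sigma[prev_tiles] += actor_step_size * delta * grad_log_sigma
    return delta, avg_reward

w_mu, w_sigma, critic_w = np.zeros(8), np.zeros(8), np.zeros(8)
delta, avg_reward = actor_critic_update(
    w_mu, w_sigma, critic_w, avg_reward=0.0,
    prev_tiles=np.array([0, 1]), action=1.0, reward=1.0,
    active_tiles=np.array([2, 3]),
    actor_step_size=0.1, critic_step_size=0.1, avg_reward_step_size=0.01,
)
```

A positive TD error pushes the mean toward the action just taken. The standard deviation grows only when that action was more than one standard deviation from the mean, and shrinks otherwise.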