Softmax Actor Critic

class rlforge.agents.policy_gradient.softmax_actor_critic.SoftmaxActorCriticAgent(actor_step_size, critic_step_size, avg_reward_step_size, num_actions, dims_ranges, temperature=1, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())

Softmax Actor-Critic Agent with average reward baseline.

This agent implements a vanilla actor-critic algorithm using tile coding for feature extraction and linear function approximation for both the actor (policy) and critic (value function). The actor updates are guided by the temporal-difference (TD) error, and actions are selected using a softmax distribution over action preferences.

Parameters

actor_step_size (float): Learning rate for the actor updates.
critic_step_size (float): Learning rate for the critic updates.
avg_reward_step_size (float): Step size for updating the average reward baseline.
num_actions (int): Number of discrete actions available in the environment.
dims_ranges (list of tuple): Ranges for each state dimension, used by the tile coder.
temperature (float, optional): Temperature parameter for softmax exploration (default=1).
iht_size (int, optional): Size of the index hash table for tile coding (default=4096).
num_tilings (int, optional): Number of tilings used in tile coding (default=8).
num_tiles (int, optional): Number of tiles per dimension (default=8).
wrap_dims (tuple, optional): Dimensions to wrap in tile coding (default=()).

end(reward)

Complete an episode.

Performs a final update of the actor and critic using the terminal reward and the last cached state-action pair.

Parameters

reward (float): Final reward received at the end of the episode.

get_td_error(prev_tiles, reward, active_tiles, avg_reward)

Compute the temporal-difference (TD) error.

Calculates the TD error using the reward, average reward baseline, the critic’s prediction for the next state, and the critic’s prediction for the previous state.

Parameters

prev_tiles (array-like): Active tiles for the previous state.
reward (float): Reward received for the transition.
active_tiles (array-like): Active tiles for the current state.
avg_reward (float): Current estimate of the average reward baseline.

Returns

float: The computed TD error.
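
The description above corresponds to the usual average-reward TD error, delta = reward - avg_reward + v(next state) - v(previous state). The following is a minimal sketch of that calculation, not the library's code. It assumes a linear critic over binary tile features, so a state's value is the sum of the weights at its active tile indices. The names `value`, `td_error`, and `critic_w` are hypothetical.

```python
def value(weights, tiles):
    """Linear critic with binary features: sum the weights of the active tiles."""
    return sum(weights[i] for i in tiles)


def td_error(critic_w, prev_tiles, reward, active_tiles, avg_reward):
    """Average-reward TD error: reward - avg_reward + v(next) - v(previous)."""
    return (reward - avg_reward
            + value(critic_w, active_tiles)
            - value(critic_w, prev_tiles))


# Hypothetical critic weights, one entry per tile index.
critic_w = [0.0] * 16
critic_w[3] = 0.5
critic_w[7] = 0.25

delta = td_error(critic_w, prev_tiles=[0, 1], reward=1.0,
                 active_tiles=[3, 7], avg_reward=0.2)
print(delta)
```

A positive result means the transition turned out better than the critic and the average reward baseline predicted.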

reset()

Reset the agent.

Resets the actor and critic weights to their initial values and clears the average reward baseline.

select_action(q_values, temperature)

Select an action using a softmax distribution.

Parameters

q_values (array-like): Action preferences or Q-values for the current state.
temperature (float): Temperature parameter controlling exploration.

Returns

int: The action selected by sampling from the softmax distribution.
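
Softmax selection with a temperature can be sketched as follows, using only the standard library. This is an illustration of the technique rather than the library's implementation. The helper names `softmax` and `sample_action` are hypothetical, and subtracting the maximum before exponentiating is an assumed numerical-stability step that the source does not mention.

```python
import math
import random


def softmax(preferences, temperature=1.0):
    """Convert preferences to probabilities; higher temperature flattens them."""
    scaled = [p / temperature for p in preferences]
    m = max(scaled)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(preferences, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(preferences, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # seeded generator, only to make the demo repeatable
probs = softmax([1.0, 2.0, 3.0], temperature=1.0)
action = sample_action([1.0, 2.0, 3.0], temperature=1.0, rng=rng)
print(probs, action)
```

Lower temperatures concentrate probability on the highest preference, while higher temperatures move the distribution toward uniform.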

start(new_state)

Begin a new episode.

Extracts active tiles for the initial state, computes action preferences, selects an action using softmax, and caches the state-action pair for future updates.

Parameters

new_state (array-like): The initial state observed from the environment.

Returns

int: The action selected by the agent.
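
To illustrate the "computes action preferences" step, here is a minimal sketch that assumes the actor keeps one weight vector per action and that tile coding yields binary features. Under that assumption, each preference is the sum of that action's weights at the active tile indices. `action_preferences` and the layout of `actor_w` are hypothetical and may differ from how the library stores its weights.

```python
def action_preferences(actor_w, active_tiles):
    """Return one preference per action: the sum of its weights at the active tiles."""
    return [sum(row[i] for i in active_tiles) for row in actor_w]


# Hypothetical actor weights: 2 actions, 4 tile indices.
actor_w = [
    [0.1, 0.0, 0.2, 0.0],  # weights for action 0
    [0.0, 0.3, 0.0, 0.4],  # weights for action 1
]

prefs = action_preferences(actor_w, active_tiles=[0, 2])
print(prefs)
```

The resulting preferences are what a softmax policy would then turn into action probabilities.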

step(reward, new_state)

Take a step in the environment.

Updates the actor and critic weights using the TD error and average reward baseline, then selects the next action.

Parameters

reward (float): Reward received from the previous action.
new_state (array-like): The new state observed from the environment.

Returns

int: The next action chosen by the agent.
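
The update described for step can be sketched as a single average-reward actor-critic update with linear function approximation. This is a hedged illustration under stated assumptions, not the library's code. It assumes binary tile features and one actor weight vector per action, and it uses the exact softmax log-policy gradient, including the division by temperature, which the library may or may not apply. The function `actor_critic_update` and all of its argument names are hypothetical.

```python
import math


def softmax(preferences, temperature=1.0):
    """Numerically stable softmax over action preferences."""
    scaled = [p / temperature for p in preferences]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_update(actor_w, critic_w, avg_reward, prev_tiles, prev_action,
                        reward, active_tiles, actor_step, critic_step, avg_step,
                        temperature=1.0):
    """Apply one update in place; return the TD error and the new average reward."""
    def v(tiles):
        return sum(critic_w[i] for i in tiles)

    # TD error, computed before any weights change.
    delta = reward - avg_reward + v(active_tiles) - v(prev_tiles)

    # Policy probabilities for the previous state, also before the update.
    prefs = [sum(row[i] for i in prev_tiles) for row in actor_w]
    probs = softmax(prefs, temperature)

    # Average reward baseline moves toward the observed reward signal.
    avg_reward += avg_step * delta

    # Critic: the gradient of a linear value function is 1 on each active tile.
    for i in prev_tiles:
        critic_w[i] += critic_step * delta

    # Actor: gradient of log pi is (indicator - pi(a)) / temperature per active tile.
    for a, row in enumerate(actor_w):
        grad = ((1.0 if a == prev_action else 0.0) - probs[a]) / temperature
        for i in prev_tiles:
            row[i] += actor_step * delta * grad

    return delta, avg_reward


actor_w = [[0.0] * 8 for _ in range(2)]  # 2 actions, 8 tile indices
critic_w = [0.0] * 8
delta, avg_reward = actor_critic_update(
    actor_w, critic_w, avg_reward=0.0, prev_tiles=[0, 1], prev_action=1,
    reward=1.0, active_tiles=[2, 3],
    actor_step=0.1, critic_step=0.2, avg_step=0.05)
print(delta, avg_reward)
```

With a positive TD error, the taken action's preference rises on the previously active tiles and the other actions' preferences fall, which is the sense in which the TD error guides the actor.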