Softmax Actor Critic

class rlforge.agents.policy_gradient.softmax_actor_critic.SoftmaxActorCriticAgent(actor_step_size, critic_step_size, avg_reward_step_size, num_actions, dims_ranges, temperature=1, iht_size=4096, num_tilings=8, num_tiles=8, wrap_dims=())

Softmax Actor-Critic Agent with average reward baseline.

This agent implements a vanilla actor-critic algorithm using tile coding for feature extraction and linear function approximation for both the actor (policy) and critic (value function). The actor updates are guided by the temporal-difference (TD) error, and actions are selected using a softmax distribution over action preferences.

Parameters

actor_step_size : float

Learning rate for the actor updates.

critic_step_size : float

Learning rate for the critic updates.

avg_reward_step_size : float

Step size for updating the average reward baseline.

num_actions : int

Number of discrete actions available in the environment.

dims_ranges : list of tuple

Ranges for each state dimension, used by the tile coder.

temperature : float, optional

Temperature parameter for softmax exploration (default=1).

iht_size : int, optional

Size of the index hash table for tile coding (default=4096).

num_tilings : int, optional

Number of tilings used in tile coding (default=8).

num_tiles : int, optional

Number of tiles per dimension (default=8).

wrap_dims : tuple, optional

Dimensions to wrap in tile coding (default=()).
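
To show how dims_ranges, num_tilings, num_tiles and iht_size could interact, here is a minimal tile-coding sketch. It is not rlforge's tile coder: the helper name active_tiles is hypothetical, it hashes with Python's built-in hash modulo iht_size (so collisions are possible), and it ignores wrap_dims.

```python
import math

def active_tiles(state, dims_ranges, num_tilings=8, num_tiles=8, iht_size=4096):
    """Return one tile index per tiling for `state` (hypothetical helper)."""
    tiles = []
    for tiling in range(num_tilings):
        coords = [tiling]
        for x, (low, high) in zip(state, dims_ranges):
            # Rescale the dimension so that its range spans num_tiles tiles.
            scaled = (x - low) / (high - low) * num_tiles
            # Shift each tiling by a different fraction of one tile width.
            coords.append(int(math.floor(scaled + tiling / num_tilings)))
        # Map (tiling, coordinates) into a table with iht_size entries.
        tiles.append(hash(tuple(coords)) % iht_size)
    return tiles

ranges = [(0.0, 1.0), (0.0, 1.0)]
tiles_a = active_tiles([0.5, 0.5], ranges)
tiles_b = active_tiles([0.51, 0.5], ranges)   # nearby state, same tiles here
```

Each state activates exactly num_tilings indices, which is why the value and preference estimates reduce to sums over a handful of weights.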

end(reward)

Complete an episode.

Performs a final update of the actor and critic using the terminal reward and the last cached state-action pair.

Parameters

reward : float

Final reward received at the end of the episode.

get_td_error(prev_tiles, reward, active_tiles, avg_reward)

Compute the temporal-difference (TD) error.

Calculates the TD error using the reward, average reward baseline, the critic’s prediction for the next state, and the critic’s prediction for the previous state.

Parameters

prev_tiles : array-like

Active tiles for the previous state.

reward : float

Reward received for the transition.

active_tiles : array-like

Active tiles for the current state.

avg_reward : float

Current estimate of the average reward baseline.

Returns

float

The computed TD error.
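
The description above matches the differential (average-reward) TD error, reward - avg_reward + v(next) - v(previous). The sketch below assumes that form and assumes binary tile features, so a state's value is the sum of the critic weights at its active tiles. The critic_weights argument and the state_value helper are illustrative; the real method reads the weights from the agent.

```python
def state_value(critic_weights, tiles):
    # With binary tile features, the linear estimate is the sum of the
    # weights at the active tile indices.
    return sum(critic_weights[i] for i in tiles)

def td_error(critic_weights, prev_tiles, reward, active_tiles, avg_reward):
    # Assumed differential TD error: (R - avg_reward) + v(S') - v(S).
    return (reward - avg_reward
            + state_value(critic_weights, active_tiles)
            - state_value(critic_weights, prev_tiles))

critic_weights = [0.0] * 16
critic_weights[3] = 0.5   # tile shared by both states
critic_weights[7] = 1.0   # tile active only in the current state
delta = td_error(critic_weights, prev_tiles=[1, 3], reward=2.0,
                 active_tiles=[3, 7], avg_reward=0.5)
# (2.0 - 0.5) + (0.5 + 1.0) - 0.5 = 2.5
```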

reset()

Reset the agent.

Resets the actor and critic weights to their initial values and clears the average reward baseline.

select_action(q_values, temperature)

Select an action using a softmax distribution.

Parameters

q_values : array-like

Action preferences or Q-values for the current state.

temperature : float

Temperature parameter controlling exploration.

Returns

int

The action selected by sampling from the softmax distribution.
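
A minimal sketch of softmax action selection, assuming the preferences are divided by the temperature before exponentiation (the usual convention, not confirmed for this library). The function names are illustrative. Subtracting the maximum before calling exp avoids overflow without changing the probabilities.

```python
import math
import random

def softmax_probs(preferences, temperature=1.0):
    # Assumption: preferences are divided by the temperature.
    scaled = [p / temperature for p in preferences]
    shift = max(scaled)                      # numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(preferences, temperature=1.0, rng=random):
    # Draw one action index with probability given by the softmax.
    probs = softmax_probs(preferences, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

uniform = softmax_probs([0.3, 0.3, 0.3])          # equal preferences
warm = softmax_probs([1.0, 2.0], temperature=1.0)
cold = softmax_probs([1.0, 2.0], temperature=0.1)
action = sample_action([1.0, 2.0, 0.5], rng=random.Random(0))
```

Under this convention a lower temperature concentrates probability on the highest preference, while a higher one moves the distribution toward uniform.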

start(new_state)

Begin a new episode.

Extracts active tiles for the initial state, computes action preferences, selects an action using softmax, and caches the state-action pair for future updates.

Parameters

new_state : array-like

The initial state observed from the environment.

Returns

int

The action selected by the agent.
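
As a rough illustration of the "computes action preferences" step, the sketch below assumes one actor weight vector per action and binary tile features, so each preference is the sum of that action's weights at the active tiles. The names action_preferences and actor_weights are hypothetical and the library's internal layout may differ.

```python
def action_preferences(actor_weights, tiles):
    # actor_weights[a][i] is the weight of tile i for action a (assumed layout).
    return [sum(weights[i] for i in tiles) for weights in actor_weights]

num_actions, iht_size = 3, 16
actor_weights = [[0.0] * iht_size for _ in range(num_actions)]
actor_weights[2][5] = 1.0          # action 2 is preferred where tile 5 is active

tiles = [1, 5, 9]                  # active tiles for the initial state
prefs = action_preferences(actor_weights, tiles)

# Cache what a later update would need: the tiles and the chosen action.
cached = {"tiles": tiles, "action": max(range(num_actions), key=prefs.__getitem__)}
```

The greedy choice in the last line only keeps the example deterministic; the agent itself samples from the softmax distribution.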

step(reward, new_state)

Take a step in the environment.

Updates the actor and critic weights using the TD error and average reward baseline, then selects the next action.

Parameters

reward : float

Reward received from the previous action.

new_state : array-like

The new state observed from the environment.

Returns

int

The next action chosen by the agent.
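
The update performed in step can be sketched as follows. This follows the textbook average-reward actor-critic with a softmax policy and is an assumption about the method, not a copy of its code. The function name and argument layout are hypothetical, and the actor gradient shown (indicator minus action probability) assumes a temperature of 1.

```python
def actor_critic_update(actor_w, critic_w, avg_reward, prev_tiles, prev_action,
                        delta, probs, actor_step_size, critic_step_size,
                        avg_reward_step_size):
    """Apply one update in place and return the new average reward."""
    # Move the average reward estimate toward the observed rewards.
    avg_reward += avg_reward_step_size * delta
    # Critic: semi-gradient TD(0) on the tiles of the previous state.
    for i in prev_tiles:
        critic_w[i] += critic_step_size * delta
    # Actor: gradient of log softmax is (1 - pi) for the taken action
    # and (-pi) for every other action, on the previous state's tiles.
    for a, weights in enumerate(actor_w):
        indicator = 1.0 if a == prev_action else 0.0
        for i in prev_tiles:
            weights[i] += actor_step_size * delta * (indicator - probs[a])
    return avg_reward

actor_w = [[0.0] * 4 for _ in range(2)]
critic_w = [0.0] * 4
avg_reward = actor_critic_update(
    actor_w, critic_w, avg_reward=0.0, prev_tiles=[0, 2], prev_action=1,
    delta=1.0, probs=[0.5, 0.5], actor_step_size=0.1, critic_step_size=0.2,
    avg_reward_step_size=0.01)
```

A positive TD error raises the preference for the action that was taken and lowers the others, in proportion to how likely each one was.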