Soft Actor-Critic (SAC)

class rlforge.agents.policy_gradient.sac.SACAgent(state_dim, action_dim, policy_net_architecture=(64, 64), q_net_architecture=(64, 64), actor_lr=0.0003, critic_lr=0.0003, alpha_lr=0.0003, discount=0.99, tau=0.005, update_frequency=1, buffer_size=1000000, mini_batch_size=256, update_start_size=256, tanh_squash=True, action_low=None, action_high=None, target_entropy_factor=0.9, device=None)

Soft Actor-Critic (SAC) Agent for continuous action spaces.

SAC is an off-policy actor-critic algorithm that optimizes a stochastic policy in an entropy-regularized reinforcement learning framework. It balances exploration and exploitation by maximizing the expected return together with the policy's entropy, weighted by the entropy coefficient α.
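
For reference, the entropy-regularized objective is commonly written as below. This is the standard maximum-entropy formulation from the SAC literature, not something taken from this implementation's source. The symbols τ (a trajectory), r (the reward function), and H (entropy) are notation introduced here; γ and α correspond to the discount and entropy coefficient described under Parameters.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ \log \pi(a \mid s) \big]
```

Setting α = 0 recovers the usual discounted-return objective; larger α rewards more random policies.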

This implementation builds all of its networks internally (the policy, the twin Q-networks, and the entropy-tuning parameters), so that reset() can rebuild them from scratch.

Parameters

state_dim : int

Dimension of the input state space.

action_dim : int

Dimension of the continuous action space.

policy_net_architecture : tuple of int, optional

Hidden layer sizes for the policy network (default=(64, 64)).

q_net_architecture : tuple of int, optional

Hidden layer sizes for the Q-networks (default=(64, 64)).

actor_lr : float, optional

Learning rate for the actor/policy network (default=3e-4).

critic_lr : float, optional

Learning rate for the critic/Q-networks (default=3e-4).

alpha_lr : float, optional

Learning rate for the entropy coefficient α (default=3e-4).

discount : float, optional

Discount factor γ applied to future rewards (default=0.99).

tau : float, optional

Polyak averaging factor for soft target network updates (default=0.005).

update_frequency : int, optional

Frequency (in steps) of training updates (default=1).

buffer_size : int, optional

Maximum size of the replay buffer (default=1,000,000).

mini_batch_size : int, optional

Size of mini-batches sampled from the replay buffer (default=256).

update_start_size : int, optional

Minimum number of transitions before updates begin (default=256).

tanh_squash : bool, optional

Whether to apply tanh squashing to actions (default=True).

action_low : float or np.ndarray, optional

Lower bound(s) for continuous actions.

action_high : float or np.ndarray, optional

Upper bound(s) for continuous actions.

target_entropy_factor : float, optional

Factor for target entropy calculation (default=0.9).

device : str or torch.device, optional

Device to run computations on (“cpu” or “cuda”). Defaults to CUDA if available.
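
To make the roles of buffer_size, mini_batch_size, update_start_size, and update_frequency concrete, here is a minimal, framework-free sketch. ReplayBufferSketch and should_update are hypothetical names invented for this illustration, and the gating rule (buffer large enough and step count divisible by update_frequency) is an assumption about how these parameters combine, not a description of rlforge's internals.

```python
import random
from collections import deque


class ReplayBufferSketch:
    """Hypothetical FIFO replay buffer; not rlforge's actual class."""

    def __init__(self, buffer_size):
        # deque(maxlen=...) drops the oldest transition once the buffer is full.
        self.storage = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, mini_batch_size, rng=random):
        # Uniform sampling without replacement within one mini-batch.
        return rng.sample(list(self.storage), mini_batch_size)

    def __len__(self):
        return len(self.storage)


def should_update(total_steps, buffer_len, update_start_size=256, update_frequency=1):
    """Assumed gating rule: enough data collected and on an update step."""
    return buffer_len >= update_start_size and total_steps % update_frequency == 0


# A tiny buffer of capacity 3 keeps only the three most recent transitions.
buffer = ReplayBufferSketch(buffer_size=3)
for t in range(5):
    buffer.add(state=t, action=0.0, reward=1.0, next_state=t + 1, done=False)
```

With the defaults (update_start_size=256, update_frequency=1), this rule would trigger an update on every step once 256 transitions have been stored.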

end(reward)

Complete an episode in a single environment.

Stores the final transition in the replay buffer.

Parameters

reward : float

Final reward received at the end of the episode.

end_batch(rewards)

Complete a batch of episodes.

Stores terminal transitions in the replay buffer and performs SAC updates if conditions are met.

Parameters

rewards : array-like, shape (N,)

Final rewards received for each terminated environment.

load(filepath)

Load the agent’s state from a file.

Parameters

filepath : str

Path to the file containing the saved state dictionary.

reset()

Reset the agent state for a new run.

Clears the replay buffer, resets counters, and rebuilds networks and optimizers to start training from scratch.

Notes

  • Resets total_steps to zero.

  • Clears cached previous state and action.

  • Calls reset_nets_and_opts() to reinitialize networks and optimizers.

Returns

None

Agent state and networks are reset.

reset_nets_and_opts(target_entropy_factor=0.9, init_weights=False)

Build or rebuild all networks and optimizers.

Initializes the policy network, twin Q-networks, target Q-networks, and learnable parameters for log standard deviation and log α. Also sets up optimizers for actor, critics, and α.

Parameters

target_entropy_factor : float, optional

Factor used to compute target entropy (default=0.9).

init_weights : bool, optional

If True, initializes target entropy based on action_dim (default=False).

Workflow

  1. Construct policy network (outputs mean actions).

  2. Construct twin Q-networks (Q1 and Q2).

  3. Deep copy Q-networks to create target critics.

  4. Initialize learnable parameters: log_std and log_alpha.

  5. Compute target entropy if init_weights=True.

  6. Initialize Adam optimizers for actor, critics, and α.

  7. Update α from log_alpha (α = exp(log_alpha)).

Returns

None

Networks, parameters, and optimizers are rebuilt in-place.
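
The following framework-free sketch illustrates three of the workflow steps: copying the critics to create targets, deriving α from log_alpha, and the soft target update governed by the constructor's tau parameter. It assumes standard Polyak averaging. Plain dicts of weight lists stand in for the real networks, soft_update is a hypothetical helper, and the initial value of log_alpha is chosen arbitrarily for illustration.

```python
import copy
import math


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for name, online_values in online_params.items():
        target_params[name] = [
            tau * w + (1.0 - tau) * w_targ
            for w, w_targ in zip(online_values, target_params[name])
        ]


# Toy "critic": a dict of flat weight lists stands in for a Q-network.
q1 = {"layer0": [0.5, -1.0], "layer1": [2.0]}
q1_target = copy.deepcopy(q1)  # targets start as exact, independent copies

log_alpha = 0.0                # learnable scalar (initial value assumed here)
alpha = math.exp(log_alpha)    # optimizing log_alpha keeps alpha positive

# Pretend an optimizer step changed the online critic, then track it slowly.
q1["layer0"] = [1.5, -1.0]
soft_update(q1_target, q1, tau=0.005)
```

With a small tau such as 0.005, each call moves the target only 0.5% of the way toward the online weights, which keeps the bootstrapped Q-targets stable.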

save(filepath)

Save the agent’s state (networks, optimizers, and parameters) to a file.

Parameters

filepath : str

Path to the file where the state dictionary will be saved.

start(state, deterministic=False)

Begin a new episode in a single environment.

Parameters

state : array-like

Initial state of the environment.

deterministic : bool, optional

If True, selects deterministic actions (default=False).

Returns

np.ndarray

Selected action.
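
As a rough illustration of what the deterministic flag and the constructor's tanh_squash, action_low, and action_high options typically mean for a Gaussian policy, consider the sketch below. It is an assumption about the usual SAC recipe rather than rlforge's code: select_action is a hypothetical function, mean and log_std stand in for the policy outputs, and the bounds are taken to be per-dimension lists.

```python
import math
import random


def select_action(mean, log_std, deterministic=False, tanh_squash=True,
                  action_low=None, action_high=None, rng=random):
    """Pick one action from a diagonal Gaussian policy head (illustrative)."""
    action = []
    for i, (mu, ls) in enumerate(zip(mean, log_std)):
        # Deterministic mode uses the mean; otherwise add Gaussian noise.
        u = mu if deterministic else mu + math.exp(ls) * rng.gauss(0.0, 1.0)
        if tanh_squash:
            u = math.tanh(u)  # squashed into [-1, 1]
            if action_low is not None and action_high is not None:
                lo, hi = action_low[i], action_high[i]
                u = lo + 0.5 * (u + 1.0) * (hi - lo)  # rescale to [lo, hi]
        action.append(u)
    return action


# Evaluation-style call: no sampling noise, bounds of [-2, 2].
greedy = select_action([0.0], [0.0], deterministic=True,
                       action_low=[-2.0], action_high=[2.0])
```

Deterministic selection is generally used for evaluation, while the stochastic default drives exploration during training.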

start_batch(states, deterministic=False)

Begin a batch of episodes.

Selects actions for multiple environments simultaneously.

Parameters

states : array-like, shape (N, state_dim)

Batch of initial states.

deterministic : bool, optional

If True, selects deterministic actions (default=False).

Returns

np.ndarray

Array of selected actions of shape (N, action_dim).

step(reward, state, done=False, deterministic=False)

Take a step in a single environment.

Stores transition, performs updates if conditions are met, and selects the next action.

Parameters

reward : float

Reward from the previous action.

state : array-like

Next state observed.

done : bool, optional

Whether the episode has terminated (default=False).

deterministic : bool, optional

If True, selects deterministic actions (default=False).

Returns

np.ndarray

Selected action.
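
The call order of start(), step(), and end() over one episode can be sketched as follows. StubAgent is a hypothetical stand-in that mirrors the documented signatures but returns random lists instead of np.ndarray and does no learning. The toy environment and the choice to call end() on the terminal transition (rather than step() with done=True) are assumptions made for this illustration.

```python
import random


class StubAgent:
    """Hypothetical stand-in exposing the same start/step/end call shape."""

    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = random.Random(seed)
        self.transitions = 0  # counts rewards received, i.e. transitions seen

    def _act(self):
        return [self.rng.uniform(-1.0, 1.0) for _ in range(self.action_dim)]

    def start(self, state, deterministic=False):
        return self._act()

    def step(self, reward, state, done=False, deterministic=False):
        self.transitions += 1
        return self._act()

    def end(self, reward):
        self.transitions += 1


def run_episode(agent, horizon=5):
    """Toy 1-D environment: the state is a counter, the reward is -|a|."""
    state = [0.0]
    action = agent.start(state)
    total_reward = 0.0
    for t in range(1, horizon + 1):
        reward = -abs(action[0])
        total_reward += reward
        state = [float(t)]
        if t == horizon:
            agent.end(reward)  # terminal transition, no next action needed
        else:
            action = agent.step(reward, state)
    return total_reward


agent = StubAgent(action_dim=1)
episode_return = run_episode(agent, horizon=5)
```

Note that each reward passed to step() or end() belongs to the action returned by the previous call, which is why the agent caches its previous state and action.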

step_batch(rewards, next_states, dones, deterministic=False)

Take a step in multiple environments.

Stores transitions in the replay buffer, performs SAC updates if conditions are met, and selects next actions.

Parameters

rewards : array-like, shape (N,)

Rewards from the previous actions.

next_states : array-like, shape (N, state_dim)

Next states observed.

dones : array-like, shape (N,)

Boolean flags indicating episode termination.

deterministic : bool, optional

If True, selects deterministic actions (default=False).

Returns

np.ndarray

Array of selected actions of shape (N, action_dim).
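
For readers wondering what an SAC update computes from the stored rewards and dones, the sketch below shows the standard soft Bellman target used to train the twin critics. It follows the textbook SAC formulation and is only assumed to match this implementation. soft_td_targets is a hypothetical function, and the Q-values and log-probabilities are made-up numbers standing in for target-network and policy outputs.

```python
def soft_td_targets(rewards, dones, q1_next, q2_next, next_log_probs,
                    alpha, discount=0.99):
    """y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    targets = []
    for r, d, q1, q2, logp in zip(rewards, dones, q1_next, q2_next, next_log_probs):
        # Taking the minimum of the twin target critics curbs overestimation.
        soft_value = min(q1, q2) - alpha * logp
        # Terminal transitions (done=True) do not bootstrap from the next state.
        targets.append(r + discount * (1.0 - float(d)) * soft_value)
    return targets


targets = soft_td_targets(
    rewards=[1.0, 0.5],
    dones=[False, True],
    q1_next=[2.0, 3.0],
    q2_next=[1.5, 4.0],
    next_log_probs=[-1.0, -0.5],
    alpha=0.2,
)
```

The dones flags matter here: for a terminated environment the target reduces to the final reward alone.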