Bandits

class rlforge.environments.bandits.Bandits(k=10, mean_rewards=None, reward_std=1.0)

Basic k-armed bandit environment.

The k-armed bandit problem is a fundamental reinforcement learning setting where an agent repeatedly chooses among k actions (“arms”), each associated with an unknown reward distribution. The agent’s objective is to maximize cumulative reward by balancing exploration and exploitation.

Parameters

k (int, optional): Number of arms (actions) (default: 10).
mean_rewards (array-like, optional): True mean reward for each arm. If None, the means are sampled from N(0, 1).
reward_std (float, optional): Standard deviation of reward noise (default: 1.0).

Attributes

k (int): Number of arms.
mean_rewards (numpy.ndarray): True mean reward for each arm.
reward_std (float): Standard deviation of reward noise.
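
As a rough illustration of how the mean_rewards default described above could be resolved, here is a minimal standard-library sketch. It is not rlforge's source: the helper name resolve_mean_rewards, the seed argument, and the length check are all assumptions made for the example, and it returns a plain list rather than a numpy.ndarray.

```python
import random


def resolve_mean_rewards(k=10, mean_rewards=None, seed=None):
    """Hypothetical helper mirroring the documented default for mean_rewards."""
    rng = random.Random(seed)  # `seed` is an illustrative extra, not a documented parameter
    if mean_rewards is None:
        # Documented behaviour: sample the true means from N(0, 1).
        return [rng.gauss(0.0, 1.0) for _ in range(k)]
    means = [float(m) for m in mean_rewards]
    # Assumption: one mean per arm is required; the real class may behave differently.
    if len(means) != k:
        raise ValueError("mean_rewards must contain exactly k values")
    return means


explicit = resolve_mean_rewards(k=3, mean_rewards=[0.1, 0.9, 0.5])
sampled = resolve_mean_rewards(k=10, seed=0)
print(explicit)
print(len(sampled))
```

Passing explicit means is useful for reproducible experiments, while the sampled default gives a fresh problem instance on each construction.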

optimal_action()

Return the index of the optimal arm.

Returns

int: Index of the arm with the highest true mean reward.
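
Conceptually, this is an argmax over the true means. The sketch below shows that idea on a plain list; the standalone function and its tie-breaking rule (first maximal index) are assumptions for illustration, since the source does not say how ties are resolved.

```python
def optimal_action(mean_rewards):
    """Index of the largest true mean; ties go to the lowest index (assumed)."""
    return max(range(len(mean_rewards)), key=lambda a: mean_rewards[a])


best = optimal_action([0.1, 0.9, 0.5])
print(best)  # 1
```

Because it reads the true means, this value is typically used for evaluation (for example, the fraction of steps on which an agent picked the optimal arm), not as an input to the agent itself.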

pull(action)

Pull the specified arm and observe a reward.

Parameters

action (int): Index of the arm to pull (0 ≤ action < k).

Returns

reward (float): Sampled reward from the arm’s distribution.
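
To show how pull and optimal_action fit into an agent loop, the following sketch runs a sample-average epsilon-greedy agent against a stand-in environment. SketchBandits is a hypothetical class that mirrors the documented interface so the example runs without rlforge installed; the Gaussian reward noise, the seed argument, the IndexError on out-of-range actions, and the run_epsilon_greedy helper are all assumptions, not confirmed library behaviour.

```python
import random


class SketchBandits:
    """Stand-in with the documented interface; not rlforge's implementation."""

    def __init__(self, k=10, mean_rewards=None, reward_std=1.0, seed=None):
        self._rng = random.Random(seed)  # `seed` is a hypothetical addition
        if mean_rewards is None:
            mean_rewards = [self._rng.gauss(0.0, 1.0) for _ in range(k)]
        self.k = k
        self.mean_rewards = list(mean_rewards)
        self.reward_std = reward_std

    def optimal_action(self):
        return max(range(self.k), key=lambda a: self.mean_rewards[a])

    def pull(self, action):
        # Assumed validation of the documented range 0 <= action < k.
        if not 0 <= action < self.k:
            raise IndexError("action must satisfy 0 <= action < k")
        # Assumed Gaussian noise around the arm's true mean.
        return self._rng.gauss(self.mean_rewards[action], self.reward_std)


def run_epsilon_greedy(env, steps=2000, epsilon=0.1, seed=0):
    """Sample-average epsilon-greedy; returns value estimates and pull counts."""
    rng = random.Random(seed)
    estimates = [0.0] * env.k
    counts = [0] * env.k
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(env.k)  # explore: uniformly random arm
        else:
            action = max(range(env.k), key=lambda a: estimates[a])  # exploit
        reward = env.pull(action)
        counts[action] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


env = SketchBandits(k=3, mean_rewards=[0.0, 1.0, 0.5], reward_std=1.0, seed=42)
estimates, counts = run_epsilon_greedy(env)
print("estimated best arm:", max(range(env.k), key=lambda a: estimates[a]))
print("true best arm:", env.optimal_action())
```

With the real class, the same loop should only need env replaced by a Bandits instance, provided its methods behave as documented above.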