Bandits

class rlforge.environments.bandits.Bandits(k=10, mean_rewards=None, reward_std=1.0)

Basic k-armed bandit environment.

The k-armed bandit problem is a fundamental reinforcement learning setting where an agent repeatedly chooses among k actions (“arms”), each associated with an unknown reward distribution. The agent’s objective is to maximize cumulative reward by balancing exploration and exploitation.

Parameters

k : int, optional

Number of arms (actions) (default: 10).

mean_rewards : array-like, optional

True mean reward for each arm. If None (the default), each arm’s mean is sampled from a standard normal distribution, N(0, 1).

reward_std : float, optional

Standard deviation of reward noise (default: 1.0).
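The parameter defaults above can be illustrated with a small standalone sketch. It does not use rlforge: `resolve_mean_rewards` is a hypothetical helper, and both the use of Python's `random.gauss` to draw each mean and the length check are assumptions about behavior this reference does not spell out.

```python
import random


def resolve_mean_rewards(k=10, mean_rewards=None, rng=None):
    """Hypothetical helper mirroring the documented defaults (not rlforge code)."""
    rng = rng if rng is not None else random.Random()
    if mean_rewards is None:
        # Documented default: true means are sampled from N(0, 1).
        return [rng.gauss(0.0, 1.0) for _ in range(k)]
    means = [float(m) for m in mean_rewards]
    # Assumption: one mean per arm; the real class may validate differently.
    if len(means) != k:
        raise ValueError(f"expected {k} mean rewards, got {len(means)}")
    return means


# Five randomly drawn means (seeded generator so the draw is repeatable).
sampled = resolve_mean_rewards(k=5, rng=random.Random(0))
# Explicit means are passed through unchanged.
fixed = resolve_mean_rewards(k=3, mean_rewards=[0.1, 0.9, 0.5])
```

Passing explicit means is useful when an experiment needs a known optimal arm; leaving `mean_rewards` as None gives a fresh random problem instance each time.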

Attributes

k : int

Number of arms.

mean_rewards : numpy.ndarray

True mean reward for each arm.

reward_std : float

Standard deviation of reward noise.

optimal_action()

Return the index of the optimal arm.

Returns

int

Index of the arm with the highest true mean reward.
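As a rough illustration of what "optimal arm" means here, the following standalone sketch picks the index of the largest true mean. `best_arm` is a hypothetical name, not part of rlforge, and resolving ties in favor of the lowest index is an assumption; this reference does not say how ties are handled.

```python
def best_arm(mean_rewards):
    """Return the index of the largest mean (hypothetical stand-in, not rlforge code)."""
    means = list(mean_rewards)
    if not means:
        raise ValueError("mean_rewards must contain at least one arm")
    # max() returns the first maximal element, so ties go to the lowest index.
    return max(range(len(means)), key=means.__getitem__)


# Arms 1 and 3 share the highest mean; this sketch selects arm 1.
arm = best_arm([0.2, 1.5, -0.3, 1.5])
```

Because the result depends only on the true means, it is typically used for evaluation (for example, measuring how often an agent selects the best arm) rather than by the agent itself.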

pull(action)

Pull the specified arm and observe a reward.

Parameters

action : int

Index of the arm to pull (0 ≤ action < k).

Returns

reward : float

Reward sampled from the chosen arm’s distribution, centered on mean_rewards[action] with noise of standard deviation reward_std.
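To show how `pull` and `optimal_action` fit into the exploration-exploitation loop described at the top, here is a minimal sketch of an epsilon-greedy agent. It runs against `SketchBandit`, a hypothetical stand-in written only from the signatures documented here. Gaussian reward noise, the `seed` argument, the bounds check, and the `epsilon_greedy` helper are all assumptions, not rlforge's implementation.

```python
import random


class SketchBandit:
    """Stand-in with the documented interface (hypothetical, not rlforge code)."""

    def __init__(self, k=10, mean_rewards=None, reward_std=1.0, seed=None):
        self.rng = random.Random(seed)  # `seed` is an addition for reproducibility
        self.k = k
        if mean_rewards is None:
            mean_rewards = [self.rng.gauss(0.0, 1.0) for _ in range(k)]
        self.mean_rewards = list(mean_rewards)
        self.reward_std = reward_std

    def optimal_action(self):
        return max(range(self.k), key=self.mean_rewards.__getitem__)

    def pull(self, action):
        if not 0 <= action < self.k:
            raise IndexError(f"action must satisfy 0 <= action < {self.k}")
        # Assumption: reward = true mean + Gaussian noise with std reward_std.
        return self.rng.gauss(self.mean_rewards[action], self.reward_std)


def epsilon_greedy(env, steps=1000, epsilon=0.1, seed=None):
    """Estimate arm values with sample averages; explore with probability epsilon."""
    rng = random.Random(seed)
    estimates = [0.0] * env.k
    counts = [0] * env.k
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(env.k)  # explore: uniform random arm
        else:
            action = max(range(env.k), key=estimates.__getitem__)  # exploit
        reward = env.pull(action)
        counts[action] += 1
        # Incremental sample-average update of the chosen arm's estimate.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, counts, total


env = SketchBandit(k=3, mean_rewards=[0.1, 0.9, 0.5], reward_std=1.0, seed=0)
estimates, counts, total = epsilon_greedy(env, steps=2000, epsilon=0.1, seed=1)
```

With enough steps, the arm with the largest estimate will usually coincide with `env.optimal_action()`, though with noisy rewards and few pulls this is not guaranteed. A larger `epsilon` spends more pulls on exploration, which improves the estimates of rarely chosen arms at the cost of cumulative reward.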