Step
▶ Auto
Reset
A
contextual bandit
: the environment presents a context
x
, the agent picks an action
a
from a set, receives reward
r(x, a)
, and the episode ends. No state transitions — each round is independent.