A contextual bandit: the environment presents a context x, the agent picks an action a from a set, receives reward r(x, a), and the episode ends. No state transitions — each round is independent.