Language generation framed as a contextual bandit: the prompt is the context x, the full response is the action y, and a reward model assigns a scalar reward to the complete output. Click "Sample Response" to see the LM generate and receive a reward.
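The bandit framing can be sketched in a few lines of code. This is a toy illustration, not a real LM or reward model: `sample_response` and `reward_model` are hypothetical stand-ins that mimic the interaction loop (observe context x, emit a full response y, receive one scalar reward for the whole output, with no per-token feedback).

```python
import random

random.seed(0)

def sample_response(prompt: str) -> str:
    """Stand-in 'language model': picks one canned full response (the action y)."""
    candidates = ["Paris.", "I think it's Paris.", "Not sure."]
    return random.choice(candidates)

def reward_model(prompt: str, response: str) -> float:
    """Stand-in reward model: scores the complete (prompt, response) pair."""
    return 1.0 if "Paris" in response else 0.0

prompt = "What is the capital of France?"   # context x
response = sample_response(prompt)          # action y: the entire response at once
reward = reward_model(prompt, response)     # one scalar reward for the full output
print(f"{prompt!r} -> {response!r} | reward: {reward}")
```

Note that, unlike a full reinforcement-learning setup, there is no intermediate state or per-step reward here: the episode is a single (context, action, reward) triple, which is exactly what makes this a contextual bandit.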