[Interactive demo. Prompt choices: "What is 2+3?", "Write a haiku about rain", "Is 7 prime?". Controls: Sample Response, Reset.]
Language generation as a contextual bandit: the prompt is the context x, the full response is the action y, and a reward model scores the complete output. Click "Sample Response" to see the LM generate and receive a reward.
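
The contextual-bandit framing in the caption can be sketched as a single interaction: observe a context x, pick one whole response y, and receive one scalar reward. The snippet below is a minimal toy sketch, not the demo's actual implementation. The candidate responses, the uniform policy, and the lookup-table reward model are all hypothetical stand-ins chosen for illustration.

```python
import random

# Hypothetical candidate responses per prompt (illustrative only).
CANDIDATES = {
    "What is 2+3?": ["5", "6", "23"],
    "Is 7 prime?": ["Yes, 7 is prime.", "No, 7 is not prime."],
}

# Hypothetical stand-in for a reward model: the preferred full response.
PREFERRED = {
    "What is 2+3?": "5",
    "Is 7 prime?": "Yes, 7 is prime.",
}


def sample_response(x, rng):
    """Toy policy pi(y | x): a uniform choice over whole candidate responses."""
    return rng.choice(CANDIDATES[x])


def reward_model(x, y):
    """Score the complete output once; there are no per-token rewards."""
    return 1.0 if y == PREFERRED[x] else 0.0


def bandit_step(x, rng):
    """One contextual-bandit interaction: context x -> action y -> reward r."""
    y = sample_response(x, rng)
    r = reward_model(x, y)
    return y, r


if __name__ == "__main__":
    rng = random.Random(0)
    for x in CANDIDATES:
        y, r = bandit_step(x, rng)
        print(f"x={x!r}  y={y!r}  r={r}")
```

Each call to `bandit_step` is a complete episode: there is no state transition after the response, which is what distinguishes this framing from a multi-step reinforcement learning problem.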