[Figure: RLHF Training Loop (Christiano et al., 2017). Components: Human Feedback, Reward Predictor, RL Algorithm, Environment. Arrow labels: preference labels, predicted reward, action, observation.]