[Figure: RLHF Training Loop (Christiano et al., 2017). Boxes: RL algorithm, environment, reward predictor, human feedback. Arrow labels: action, observation (appears twice), predicted reward, preference labels.]