Agent
policy π
Env
MDP
action
observation
reward
Standard view:
reward comes from the environment
Agent
policy π
Env
MDP
action
observation
reward
(self-constructed)
Reality:
the agent rewards itself