Agent policy π Env MDP action observation reward

Standard view: reward comes from the environment

Agent policy π Env MDP action observation reward (self-constructed)

Reality: the agent rewards itself