Inner alignment is a problem when you train the reward function & the policy function jointly.

⤋ Read More