Designing agent incentives to avoid reward tampering

Gridworld example

Here, an agent can push rocks, diamonds, and words around a grid (as in Sokoban; the black objects are movable). The agent’s objective is described by the purple tiles. Initially, the description says that diamonds provide reward when pushed to the green goal area. This is the intended task. However, the agent can also tamper with the reward function: by pushing the “reward” word down, it changes the description so that reward is assigned to rocks instead of diamonds, creating a mismatch between the agent’s rewards and the intended task.
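A minimal sketch of this reward logic, assuming a simplified state representation; the names `State`, `reward_target`, and `goal_contents` are illustrative and not part of the original environment:

```python
from dataclasses import dataclass

@dataclass
class State:
    reward_target: str   # what the purple reward-description tiles currently say
    goal_contents: dict  # objects in the green goal area, e.g. {"diamond": 2, "rock": 0}

def reward(state: State) -> int:
    """Reward = number of objects of the currently rewarded type in the goal area."""
    return state.goal_contents.get(state.reward_target, 0)

# Intended behaviour: push diamonds to the goal.
honest = State(reward_target="diamond", goal_contents={"diamond": 2, "rock": 0})

# Tampered behaviour: the agent pushed the "reward" word, so rocks are now rewarded.
tampered = State(reward_target="rock", goal_contents={"diamond": 0, "rock": 3})

print(reward(honest))    # 2 -- reward matches the intended task
print(reward(tampered))  # 3 -- higher reward, but no diamonds were delivered
```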

Causal influence diagram representation

Here 𝜣ᴿᵢ represents the reward description at time i, with 𝜣ᴿ₁ = “diamonds are reward”, while Sᵢ represents the agent’s position together with the state of all non-purple tiles. The reward Rᵢ is determined by how well Sᵢ satisfies the reward description 𝜣ᴿᵢ: for example, if the description is “diamonds are reward”, then Rᵢ equals the number of diamonds in the goal area of Sᵢ. The agent’s goal is to select actions Aᵢ that maximize the sum of rewards. The arrows represent causal influence, except for the arrows going into actions, which represent information flow and are therefore drawn as dotted lines.
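In symbols, a minimal formalization consistent with the description above (the exact time indexing is our assumption, not notation taken from the paper):

```latex
% Reward at time i is determined by how well S_i satisfies the description Theta^R_i
R_i = R\bigl(S_i;\, \Theta^R_i\bigr),
\qquad\text{e.g.}\quad
R\bigl(S_i;\, \text{``diamonds are reward''}\bigr)
  = \#\{\text{diamonds in the goal area of } S_i\}.

% The agent chooses its actions to maximize the sum of rewards
\max_{A_1, A_2, \dots}\ \sum_i R\bigl(S_i;\, \Theta^R_i\bigr)
```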

Current-RF optimization

When choosing A₁, the agent optimizes rewards based on the current reward description 𝜣ᴿ₁ and the (simulated) future states S₂ and S₃. As a result, there are no longer any (red) directed paths from A₁ to future rewards that pass through a reward-function node 𝜣ᴿᵢ; that is, the incentive for reward tampering has been removed.
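A hypothetical sketch of the difference between the two evaluation schemes; here `trajectory` is a list of simulated (reward description, goal contents) pairs for S₂, S₃, and none of these names come from the paper’s implementation:

```python
def reward(goal_contents, reward_target):
    """Number of objects of the currently rewarded type in the goal area."""
    return goal_contents.get(reward_target, 0)

def standard_value(trajectory):
    """Score each simulated future state with whatever reward description holds *then*.
    Tampering can pay off: rewriting the description raises future reward."""
    return sum(reward(goal, target) for target, goal in trajectory)

def current_rf_value(trajectory, current_target):
    """Score every simulated future state with the *current* description (Theta^R_1),
    ignoring any changes the agent makes to it. Tampering no longer pays off."""
    return sum(reward(goal, current_target) for _, goal in trajectory)

# A plan that tampers: push the "reward" word (S2), then push rocks to the goal (S3).
tampering_plan = [
    ("rock", {"diamond": 0, "rock": 0}),   # S2: description now rewards rocks
    ("rock", {"diamond": 0, "rock": 3}),   # S3: three rocks in the goal area
]

print(standard_value(tampering_plan))                # 3 -- tampering looks attractive
print(current_rf_value(tampering_plan, "diamond"))   # 0 -- judged by Theta^R_1, no reward
```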

Experiments

Takeaways and future directions
