Designing agent incentives to avoid reward tampering

By Tom Everitt, Ramana Kumar, and Marcus Hutter

From an AI safety perspective, having a clear design principle and a crisp characterization of what problem it solves means that we don’t have to guess which agents are safe. In this post and paper we describe how a design principle called current-RF optimization avoids the reward function tampering problem.

Reinforcement learning (RL) agents are designed to maximize reward. For example, Chess and Go agents are rewarded for winning the game, while a manufacturing robot may be rewarded for correctly assembling some given pieces. RL agents can sometimes find better strategies than the designer of the task, as recently demonstrated in Go and StarCraft.

However, determining what ‘better’ means can be tricky. Sometimes the agent has discovered a seemingly better strategy, but actually it has found a loophole in the reward specification. We call this reward hacking. One type of reward hacking is reward gaming, where the agent exploits a misspecified reward function (see e.g. the boat race example).

In our latest paper, we focus on another type of reward hacking called reward tampering. In reward tampering, the agent doesn’t exploit a misspecified reward function. Instead, it actively changes the reward function. For example, some Super Mario environments have a bug that allows execution of arbitrary code by taking the right sequence of in-game actions. This could in principle be used to redefine the score of the game.

While this type of hacking is beyond the capabilities of current RL agents in most environments, the general quest to build more capable agents may eventually lead us to build agents that can exploit such shortcuts. Understanding reward tampering therefore ties in well with our safety work on anticipating future failure modes and figuring out how to prevent them before they occur.

Gridworld example

We can illustrate the reward tampering problem using a gridworld in which the reward function can be modified. We adopt a game mechanic from “Baba Is You”, a puzzle game where some of the rules of the game are described by words in the environment. The agent can push those words around, in order to change the rules.

Here, an agent can push rocks, diamonds, and words (as in Sokoban, with black things being movable). The agent’s objective is described by the purple tiles. Initially, the description says that diamonds provide reward when pushed to the green goal area. This is the intended task. However, the agent can also tamper with the reward function. By pushing the “reward”-word down, the reward function starts assigning reward to rocks instead of diamonds, creating a mismatch between the agent’s rewards and the intended task.

Since there are more rocks than diamonds, the strategy that provides the most reward for the agent is to first push the reward-word, and then collect rocks instead of diamonds. Even though this gives the agent the most reward, this solution is undesirable to the user, who wanted diamonds and not rocks.
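To make the mechanics concrete, here is a minimal sketch of the reward-switching dynamics in Python. It is our own illustration, not the released environment; the state representation and all names are assumptions.

```python
# Illustrative sketch of the rocks-and-diamonds reward dynamics (names are ours).

def reward(state):
    """Reward dispensed this step, according to the current reward description."""
    if state["reward_word_points_to"] == "diamond":  # intended task
        return state["diamonds_in_goal_area"]
    return state["rocks_in_goal_area"]               # after tampering

def push_reward_word(state):
    """Pushing the 'reward' word changes which objects the reward function counts."""
    state["reward_word_points_to"] = "rock"
    return state

# The tampering strategy: push the word once, then collect the more numerous rocks.
state = {"reward_word_points_to": "diamond",
         "diamonds_in_goal_area": 0, "rocks_in_goal_area": 0}
state = push_reward_word(state)
state["rocks_in_goal_area"] = 3   # suppose three rocks are pushed to the goal
print(reward(state))              # 3, even though the user wanted diamonds
```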

While simplistic, we argue that this gridworld captures the reward tampering dynamics. In general, when reward tampering is possible, there is something that the agent can do in the environment (e.g. hack into a computer holding the reward function implementation) that changes the implemented reward function, and thereby the dispensed rewards. These are exactly the dynamics of the rocks and diamonds gridworld.

Causal influence diagram representation

Our previous work showed how causal influence diagrams can be used to understand agent incentives and to model AGI safety frameworks. The incentive analysis is directly applicable here. First we model the reward tampering problem with a causal influence diagram of a Markov Decision Process with a modifiable reward function:

Here 𝜣ᴿᵢ represents the reward description at time i, with 𝜣ᴿ₁ = “diamonds are reward”. Meanwhile, S represents the agent’s position and the state of all non-purple tiles. The reward R is determined by how well S satisfies the reward description 𝜣ᴿᵢ. For example, if the reward description is “diamonds are reward”, then R equals the number of diamonds in the goal area in S. The goal of the agent is to select the actions A to optimize the sum of the rewards. The arrows represent causal influence, except for the arrows going into actions, which represent information flow (and are therefore drawn differently with dotted lines).
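In symbols (our own shorthand for the diagram, which may differ from the paper’s exact notation), the dispensed reward depends on both the state and the current reward description, and a standard RL agent maximizes the sum of dispensed rewards:

```latex
% Reward dispensed at time t depends on the state and the reward description at time t
R_t = R\big(S_t;\, \Theta^R_t\big)

% Standard RL objective: maximize the sum of dispensed rewards,
% no matter which reward description produced them
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} R\big(S_t;\, \Theta^R_t\big)\Big]
```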

From the diagram, we can see that there are two types of directed paths from action A₁ to rewards R₂ and R₃. The first type of path (green) goes via Sᵢ, and represents the agent moving diamonds or rocks to the goal area. This is the way we want the agent to get reward. The second type of path (red) goes via 𝜣ᴿᵢ. This path represents tampering with the reward function, and is the path we don’t want the agent to use.

Current-RF optimization

One way to prevent the agent from tampering with the reward function is to isolate or encrypt the reward function. However, we do not expect such solutions to scale indefinitely with our agent’s capabilities, as a sufficiently capable agent may find ways around most defenses. In our new paper, we describe a more principled way to fix the reward tampering problem. Rather than trying to protect the reward function, we change the agent’s incentives for tampering with it.

The fix relies on a slight change to the RL framework that gives the agent query access to the reward function. In the rocks and diamonds environment, this can be done by specifying to the agent how the purple tiles describe the reward function.

Using query access to the reward function, we can design a model-based agent that uses the current reward function to evaluate rollouts of potential policies (a current-RF agent, for short). For example, in the rocks and diamonds environment, a current-RF agent will look at the current reward description, and at time 1 see that it should collect diamonds. This is the criterion by which it chooses its first action: moving upwards towards the diamond. Note that the reward description is still changeable, just as before. Still, the current-RF agent will not use the reward-tampering possibility, because it is focused on satisfying the current reward description.
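As a sketch of the idea (not the paper’s algorithm), a model-based current-RF agent might score candidate policies as below: the environment model is still free to predict changes to the reward description, but every simulated state is evaluated with the reward function queried now. All names are our own assumptions.

```python
# Illustrative sketch of current-RF planning; names and interfaces are assumptions.

def evaluate_policy(policy, model, current_reward_fn, start_state, horizon, n_rollouts=20):
    """Average simulated return of a policy, scored by the CURRENT reward function."""
    total = 0.0
    for _ in range(n_rollouts):
        state = start_state
        for _ in range(horizon):
            action = policy(state)
            state = model.sample_next_state(state, action)
            total += current_reward_fn(state)  # frozen reward description
    return total / n_rollouts

def act(candidate_policies, model, reward_fn_query, state, horizon):
    """Pick the first action of the best policy under the current reward function."""
    current_reward_fn = reward_fn_query()      # query access to the reward function
    best = max(candidate_policies,
               key=lambda p: evaluate_policy(p, model, current_reward_fn, state, horizon))
    return best(state)
```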

The objective of a current-RF agent corresponds to a slightly different influence diagram:

When choosing A₁, the agent optimizes rewards based on the current reward description 𝜣ᴿ₁ and (simulated) future states S₂ and S₃. Now there are no longer any red directed paths from A₁ to future rewards that pass through a reward function node 𝜣ᴿᵢ. That is, the incentive for reward tampering has been averted.
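Contrasting the two objectives makes the change explicit (again in our own shorthand). A standard RL agent evaluates each future state with whatever reward description will hold at that time, whereas a current-RF agent evaluates every future state with the description it queries now:

```latex
% Standard RL at time 1: future rewards use the future reward description,
% so changing \Theta^R is itself a way to increase the objective
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t \ge 1} R\big(S_t;\, \Theta^R_t\big)\Big]

% Current-RF optimization at time 1: every future state is scored with the
% current description \Theta^R_1, so tampering with \Theta^R no longer pays
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t \ge 1} R\big(S_t;\, \Theta^R_1\big)\Big]
```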

The agent still has an incentive to influence the current reward description 𝜣ᴿ₁. Fortunately, the agent lacks the ability to influence 𝜣ᴿ₁, because there is no directed path from A₁ to 𝜣ᴿ₁. For example, in the rocks and diamonds environment, the agent would have benefitted from the current reward description being different, but there is not much it can do about the current description.

Experiments

In the rocks and diamonds environment, a standard RL agent quickly discovers that it can get more reward by tampering with the reward function. In contrast, a current-RF agent does not tamper with the reward function. Looking only at the reward the agents collect, it may seem like standard RL performs better:

However, when we instead measure how well the agents perform the diamond-collecting task, we see that the current-RF agent is actually much better than the standard RL agent:

For these experiments, we used a model-free version of current-RF optimization, which trains one value function Vʀᴏᴄᴋ for the number of rocks in the goal area, and one value function Vᴅɪᴀᴍᴏɴᴅ for the number of diamonds. Which one it maximizes is determined by the current 𝜣ᴿ. For example, initially it optimizes Vᴅɪᴀᴍᴏɴᴅ, but this can change to Vʀᴏᴄᴋ if it (accidentally) pushes the reward word. An implementation of the environment is available here.
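A minimal sketch of that selection rule (the function and variable names are ours, not the released implementation):

```python
# Illustrative sketch of the model-free current-RF agent's action selection.

def select_action(state, actions, V_rock, V_diamond, current_reward_description):
    """Greedy action under the value function picked out by the current reward description."""
    value_fn = V_diamond if current_reward_description == "diamond" else V_rock
    return max(actions, key=lambda a: value_fn(state, a))
```

If the reward word has been pushed, the same rule starts consulting Vʀᴏᴄᴋ, which is what produces the accidental switches mentioned above.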

Takeaways and future directions

Most RL algorithms have a reward function tampering incentive. Among these are both model-based and model-free RL algorithms that learn from a stepwise reward signal. If the trained model can predict the effect of reward tampering, then the agent can learn that tampering leads to higher stepwise reward, and thus adopt the tampering behavior.
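For instance, a generic tabular Q-learning update (textbook code, not from the paper) bootstraps from whatever reward was actually dispensed, so an action that tampers with the reward function and raises the dispensed reward sees its value estimate rise with it:

```python
# Standard tabular Q-learning update: the dispensed reward r enters the target
# directly, so tampering actions that raise r get reinforced like any other.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```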

It is fortunate that the reward function tampering incentive can be avoided with a slight change to the standard RL agent objective: current-RF optimization with query access to the reward function. In fact, we believe that most model-based RL algorithms could be converted into current-RF optimizers relatively easily (model-free algorithms may pose more of a challenge).

Our paper explores the properties of current-RF optimization in more depth. It also elaborates on further issues around agent corrigibility and self-preservation incentives, as well as other reward tampering problems such as tampering with user-provided feedback for reward modeling, observation tampering, and belief tampering. It turns out that all these reward tampering problems and their solutions can be naturally represented using causal influence diagrams.

The most important takeaway from the paper is that there are design principles that avoid reward tampering problems, and that these design principles are mutually compatible. An important next step is to turn the design principles into practical and scalable RL algorithms, and to verify that they do the right thing in setups where various types of reward tampering are possible. With time, we hope that these design principles will evolve into a set of best practices for how to build capable RL agents without reward tampering incentives.

Special thanks to Damien Boudot for producing the figures for this post.
