Jan 21
Avoiding Unsafe States in 3D Environments using Human Feedback
By Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike.
Tl;dr: ReQueST is an algorithm for learning objectives from human feedback on hypothetical behaviour. In this work, we scale ReQueST to complex 3D environments, and show that it works even with feedback sourced entirely from real humans. Read our paper at https://arxiv.org/abs/2201.08102.
Learning about unsafe states
Online reinforcement learning has a problem: the agent must act unsafely in order to learn not to act unsafely. For example, if we were to use online reinforcement learning to train a self-driving car, the car would have to drive off a cliff in order to learn not to drive off cliffs.
One way that humans solve this problem is by learning from hypothetical situations. Our imagination gives us the ability to consider various courses of action without actually having to enact them in the real world. In particular, this allows us to learn about potential sources of danger without having to expose ourselves or others to the concomitant risks.
The ReQueST algorithm
ReQueST (Reward Query Synthesis via Trajectory Optimization) is a technique developed to give AI systems the same ability. ReQueST employs three components (sketched in code after this list):
- A neural environment simulator — a dynamics model learned from trajectories generated by humans exploring the environment safely. In our work, this is a pixel-based dynamics model.
- A reward model, learned from human feedback on videos of (hypothetical) behaviour in the learned simulator.
- Trajectory optimisation, which synthesises the hypothetical behaviours we ask the human about, chosen so that the reward model learns what is safe and what is not (in addition to other aspects of the task) as quickly as possible.
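To make these components concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical stand-in: the dynamics model is a linear toy rather than the pixel-based neural network used in the paper, the reward-model update rule is invented for illustration, and query synthesis here is a crude random search over a |predicted return| proxy, whereas the actual method uses gradient-based trajectory optimisation over several query types (for example, trajectories that maximise predicted reward, minimise it, or maximise the reward model's uncertainty).

```python
import numpy as np

rng = np.random.default_rng(0)

class DynamicsModel:
    """Toy stand-in for the learned neural environment simulator."""
    def __init__(self, state_dim, action_dim):
        # Hypothetical linear dynamics; in the paper this is a pixel-based
        # neural network trained on safe human exploration trajectories.
        self.A = rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.B = rng.normal(scale=0.1, size=(state_dim, action_dim))

    def step(self, state, action):
        # Predict the next state from the current state and action.
        return state + self.A @ state + self.B @ action

class RewardModel:
    """Toy stand-in for the reward model learned from human feedback."""
    def __init__(self, state_dim):
        self.w = np.zeros(state_dim)

    def reward(self, state):
        return float(self.w @ state)

    def update(self, trajectory, human_label):
        # Hypothetical update rule: nudge predictions towards states the
        # human labelled good (+1) and away from ones labelled unsafe (-1).
        for state in trajectory:
            self.w += 0.01 * human_label * state

def synthesise_query(dynamics, reward_model, state_dim, action_dim,
                     horizon=10, n_candidates=100):
    """Search (by random shooting, as a proxy for trajectory optimisation)
    for an imagined trajectory worth showing to a human for labelling."""
    best_traj, best_score = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, action_dim))
        state, traj = np.zeros(state_dim), []
        for a in actions:
            state = dynamics.step(state, a)
            traj.append(state)
        score = abs(sum(reward_model.reward(s) for s in traj))
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj
```

The key structural point the sketch illustrates is that the human only ever labels trajectories rolled out inside the learned simulator; nothing here touches the real environment.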
Together, these three components allow us to learn a reward model based entirely on hypothetical examples ‘imagined’ using the learned simulator. If we then use the learned simulator and reward model with a model-based control algorithm, the result is an agent that does what the human wants — in particular, avoiding behaviours the human has indicated are unsafe — without having had to first try those behaviours in the real world!
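As a rough illustration of that last step, here is a random-shooting planner (a simple form of model-based control) that continues the toy sketch above. This is an assumed, illustrative planner, not the specific controller used in the paper.

```python
def plan_action(dynamics, reward_model, state, action_dim,
                horizon=10, n_candidates=100):
    """Return the first action of the sampled action sequence whose
    imagined rollout (in the learned simulator) scores highest under
    the learned reward model — no real-world trial and error needed."""
    best_first_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s = dynamics.step(s, a)
            total += reward_model.reward(s)
        if total > best_return:
            best_first_action, best_return = actions[0], total
    return best_first_action

# Hypothetical usage, reusing the toy models defined above.
dynamics = DynamicsModel(state_dim=4, action_dim=2)
reward_model = RewardModel(state_dim=4)
action = plan_action(dynamics, reward_model,
                     state=np.zeros(4), action_dim=2)
```

Because the planner scores candidate behaviours against the learned reward model before acting, behaviours the human labelled unsafe are rejected at planning time rather than discovered the hard way.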
ReQueST in our work
In our latest paper, we ask: is ReQueST viable in a more realistic setting than the simple 2D environments used in the work that introduced it? In particular, can we scale ReQueST to a complex 3D environment, with imperfect feedback sourced from real humans rather than from procedural reward functions?
It turns out the answer is: yes!