HIGHLIGHTS
BASALT: A Benchmark for Learning from Human Feedback (Rohin Shah et al.) (summarized by Rohin): A typical argument for AI risk, given in Human Compatible (AN #69), is that current AI systems treat their specifications as definite and certain, even though they are typically misspecified. This state of affairs can lead to the agent pursuing instrumental subgoals (AN #107). To address this, we might instead build AI systems that continually learn the objective from human feedback. This post and paper (on which I am an author) present the MineRL BASALT competition, which aims to promote research on algorithms that learn from human feedback.

BASALT aims to provide a benchmark with tasks that are realistic in the sense that (a) it is challenging to write a reward function for them and (b) there are many other potential goals that the AI system “could have” pursued in the environment. Criterion (a) implies that we can’t have automated evaluation of agents (otherwise that could be turned into a reward function) and so suggests that we use human evaluation of agents as our ground truth. Criterion (b) suggests choosing a very “open world” environment; the authors chose Minecraft for this purpose. They provide task descriptions such as “create a waterfall and take a scenic picture of it”; it is then up to researchers to create agents that solve this task using any method they want. Human evaluators then compare two agents against each other and determine which is better. Agents are then given a score using the TrueSkill system.

The authors provide a number of reasons to prefer the BASALT benchmark over more traditional benchmarks like Atari or MuJoCo:

1. In Atari or MuJoCo, there are often only a few reasonable goals: for example, in Pong, you either hit the ball back, or you die. If you’re testing algorithms that are meant to learn what the goal is, you want an environment where there could be many possible goals, as is the case in Minecraft.

2. There are lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.

3. The “true reward function” in Atari or MuJoCo is often not a great evaluation: for example, a Hopper policy trained to stand still using a constant reward gets 1000 reward! Human evaluations should not be subject to the same problem.

4. Since the tasks were chosen to be inherently fuzzy and challenging to formalize, researchers are allowed to take whatever approach they want to solving the task, including “try to write down a reward function”. In contrast, for something like Atari or MuJoCo, you need to ban such strategies. The only restriction in BASALT is that researchers cannot extract additional state information from the Minecraft simulator.

5. Just as we’ve overestimated few-shot learning capabilities (AN #152) by tuning prompts on large datasets of examples, we might also be overestimating the performance of algorithms that learn from human feedback because we tune hyperparameters on the true reward function. Since BASALT doesn’t have a true reward function, this is much harder to do.

6. Since Minecraft is so popular, it is easy to hire Minecraft experts, allowing us to design algorithms that rely on expert time instead of just end user time.

7. Unlike Atari or MuJoCo, BASALT has a clear path to scaling up: the tasks can be made more and more challenging. In the long run, we could aim to deploy agents on public multiplayer Minecraft servers that follow instructions or assist with whatever large-scale project players are working on, all while adhering to the norms and customs of that server.

Read more: Paper: The MineRL BASALT Competition on Learning from Human Feedback
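
To make the evaluation scheme described above more concrete, here is a minimal sketch of how pairwise human judgments ("agent A was better than agent B") could be turned into scores with a TrueSkill-style update. It is a simplification that handles only two-agent comparisons with no draws and omits TrueSkill's dynamics term, and it is not the competition's actual evaluation code. The names `Rating`, `rate_1vs1`, and `score_agents`, as well as the agent names, are hypothetical; the starting values (mean 25, standard deviation 25/3, beta 25/6) are the conventional TrueSkill defaults.

```python
import math
from dataclasses import dataclass

# Conventional TrueSkill defaults (assumed here, not taken from the paper).
MU0 = 25.0
SIGMA0 = MU0 / 3.0
BETA = SIGMA0 / 2.0


@dataclass
class Rating:
    """Gaussian belief about an agent's skill."""
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rate_1vs1(winner: Rating, loser: Rating, beta: float = BETA):
    """Simplified TrueSkill update for one comparison (no draws, no dynamics)."""
    c2 = 2.0 * beta ** 2 + winner.sigma ** 2 + loser.sigma ** 2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # size of the mean shift; larger for upsets
    w = v * (v + t)         # fraction by which uncertainty shrinks
    new_winner = Rating(
        winner.mu + winner.sigma ** 2 / c * v,
        winner.sigma * math.sqrt(1.0 - winner.sigma ** 2 / c2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma ** 2 / c * v,
        loser.sigma * math.sqrt(1.0 - loser.sigma ** 2 / c2 * w),
    )
    return new_winner, new_loser


def score_agents(comparisons):
    """Fold a sequence of (winner_name, loser_name) judgments into ratings."""
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.get(winner, Rating())
        rl = ratings.get(loser, Rating())
        ratings[winner], ratings[loser] = rate_1vs1(rw, rl)
    return ratings


# Hypothetical judgments from human evaluators.
ratings = score_agents([
    ("agent_a", "agent_b"),
    ("agent_a", "agent_c"),
    ("agent_b", "agent_c"),
])
for name, r in sorted(ratings.items(), key=lambda kv: -kv[1].mu):
    print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}")
```

Each comparison moves the winner's mean up and the loser's mean down, with a larger shift when the result is surprising, and shrinks both uncertainties. No automated reward is needed at any point, which is the property the benchmark relies on.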