Skip to main content

preference learning directory


“Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, Qi et al 2022

“Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning”⁠, Carl Qi, Pieter Abbeel, Aditya Grover (2022-04-07; ⁠, ; similar):

The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approach infers the (unknown) reward function via inverse reinforcement learning (IRL) followed by maximizing this reward function via reinforcement learning (RL). The policies learned via these approaches are however very brittle in practice and deteriorate quickly even with small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new meta-algorithm for imitation learning that utilizes decision-time planning to correct for compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the rewards model at decision-time, thereby benefiting from the learning signal of the two components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization when subject to challenging perturbations in test-time dynamics.

“Inferring Rewards from Language in Context”, Lin et al 2022

“Inferring Rewards from Language in Context”⁠, Jessy Lin, Daniel Fried, Dan Klein, Anca Dragan (2022-04-05):

In classic instruction following, language like “I’d like the JetBlue flight” maps to actions (eg. selecting that flight). However, language also conveys information about a user’s underlying reward function (eg. a general preference for JetBlue), which can allow a model to carry out desirable actions in new contexts. We present a model that infers rewards from language pragmatically: reasoning about how speakers choose utterances not only to elicit desired actions, but also to reveal information about their preferences. On a new interactive flight-booking task with natural language, our model more accurately infers rewards and predicts optimal actions in unseen environments, in comparison to past work that first maps language to actions (instruction following) and then maps actions to rewards (inverse reinforcement learning).

“SURF: Semi-supervised Reward Learning With Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning”, Park et al 2022

“SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning”⁠, Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee (2022-03-18; ; similar):

Preference-based reinforcement learning (RL) has shown potential for teaching agents to perform the target tasks without a costly, pre-defined reward function by learning the reward with a supervisor’s preference between the two agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult to apply this approach to various applications. This data-efficiency problem, on the other hand, has been typically addressed by using unlabeled samples or data augmentation techniques in the context of supervised learning.

Motivated by the recent success of these approaches, we present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation. In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor. To further improve the label-efficiency of reward learning, we introduce a new data augmentation that temporally crops consecutive subsequences from the original behaviors.

Our experiments demonstrate that our approach substantially improves the feedback-efficiency of the state-of-the-art preference-based method on a variety of locomotion and robotic manipulation tasks.

“Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022

“Safe Deep RL in 3D Environments using Human Feedback”⁠, Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike (2022-01-20; ; similar):

[blog] Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks.

A recently introduced technique, ReQueST⁠, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it is yet unknown whether this approach is feasible in complex 3D environments with feedback obtained from real humans—whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality.

In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.

“A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language Models”, Zhang et al 2022

“A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models”⁠, Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, Dawei Song (2022-01-14; ; similar):

Controllable Text Generation (CTG) is emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods need to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3–4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.

“WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing”, Hilton et al 2021

“WebGPT: Improving the factual accuracy of language models through web browsing”⁠, Jacob Hilton, Suchir Balaji, Reiichiro Nakano, John Schulman (2021-12-16; ⁠, ⁠, ⁠, ; similar):

[paper] We’ve fine-tuned GPT-3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers to questions online—it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy. We’re excited about developing more truthful AI, but challenges remain, such as coping with unfamiliar types of questions.

Language models like GPT-3 are useful for many different tasks, but have a tendency to “hallucinate” information when performing tasks requiring obscure real-world knowledge.23 To address this, we taught GPT-3 to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as “Search …”, “Find in page: …” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer.

The model is fine-tuned from GPT-3 using the same general methods we’ve used previously. We begin by training the model to copy human demonstrations, which gives it the ability to use the text-based browser to answer questions. Then we improve the helpfulness and accuracy of the model’s answers, by training a reward model to predict human preferences, and optimizing against it using either reinforcement learning or rejection sampling.

…Our models outperform GPT-3 on TruthfulQA and exhibit more favourable scaling properties. However, our models lag behind human performance, partly because they sometimes quote from unreliable sources (as shown in the question about ghosts above). We hope to reduce the frequency of these failures using techniques like adversarial training.

Evaluating factual accuracy: …

However, this approach raises a number of questions. What makes a source reliable? What claims are obvious enough to not require support? What trade-off should be made between evaluations of factual accuracy and other criteria such as coherence? All of these were difficult judgment calls. We do not think that our model picked up on much of this nuance, since it still makes basic errors. But we expect these kinds of decisions to become more important as AI systems improve, and cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound. We also expect further considerations such as transparency to be important.

Eventually, having models cite their sources will not be enough to evaluate factual accuracy. A sufficiently capable model would cherry-pick sources it expects humans to find convincing, even if they do not reflect a fair assessment of the evidence. There are already signs of this happening (see the questions about boats above). We hope to mitigate this using methods like debate⁠.

“WebGPT: Browser-assisted Question-answering With Human Feedback”, Nakano et al 2021

“WebGPT: Browser-assisted question-answering with human feedback”⁠, Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu Long Ouyang, Christina Kim, Christopher Hesse et al (2021-12-16; ⁠, ⁠, ; backlinks; similar):

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web.

By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.

We train and evaluate our models on ELI5⁠, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences.

This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.

Figure 2: Human evaluations on ELI5 comparing against (a) demonstrations collected using our web browser, (b) the highest-voted answer for each question. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.
Figure 3: TruthfulQA results. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.

In this work we leverage existing solutions to these components: we outsource document retrieval to the Microsoft Bing Web Search API, and utilize unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3. Instead of trying to improve these ingredients, we focus on combining them using more faithful training objectives. Following Stiennon et al 2020⁠, we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans

We make 2 key contributions:

  • We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in an end to end fashion using general methods such as imitation learning and reinforcement learning.
  • We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.

…We use this data in 4 main ways: behavior cloning (ie. supervised fine-tuning) using the demonstrations, reward modeling using the comparisons, reinforcement learning against the reward model, and rejection sampling against the reward model. Our best model uses a combination of behavior cloning and rejection sampling. We also find reinforcement learning to provide some benefit when inference-time compute is more limited.

…We evaluate our best model in 3 different ways. First, we compare our model’s answers to answers written by our human demonstrators on a held-out set of questions. Our model’s answers are preferred 56% of the time, demonstrating human-level usage of the text-based browser. Second, we compare our model’s answers (with references stripped, for fairness) to the highest-voted answer provided by the ELI5 dataset. Our model’s answers are preferred 69% of the time. Third, we evaluate our model on TruthfulQA⁠, an adversarial dataset of short-form questions. Our model’s answers are true 75% of the time, and are both true and informative 54% of the time, outperforming our base model (GPT-3), but falling short of human performance

Figure 1: An observation from our text-based web-browsing environment, as shown to human demonstrators (left) and models (right). The web page text has been abridged for illustrative purposes.

Environment design: …For this approach, we designed a text-based web-browsing environment. The language model is prompted with a written summary of the current state of the environment, including the question, the text of the current page at the current cursor location, and some other information (see Figure 1(b)). In response to this, the model must issue one of the commands given in Table 1, which performs an action such as running a Bing search, clicking on a link, or scrolling around. This process is then repeated with a fresh context (hence, the only memory of previous steps is what is recorded in the summary).

While the model is browsing, one of the actions it can take is to quote an extract from the current page. When this is performed, the page title, domain name and extract are recorded to be used later as a reference. Browsing then continues until either the model issues a command to end browsing, the maximum number of actions has been reached, or the maximum total length of references has been reached. At this point, as long as there is at least one reference, the model is prompted with the question and the references, and must compose its final answer

Table 2: An answer produced by our 175B best-of-64 model to a randomly-chosen question from the ELI5 test set (not cherry-picked). The full text of the references can be found in Appendix J⁠, along with answers from our human demonstrators and the ELI5 dataset. Further samples are available⁠.
Field Output
Question Why did we decide that certain words were “bad” and shouldn’t be used in social settings?
Answer We have all been taught that there are certain words that should not be used in public, at least not without good reason. But why do some words fall into this category while others do not? It turns out that there is not a single answer to this question, as each word has a different history and carries different meanings in different cultures and families [2]. However, there are some general categories that “bad” words tend to fall into. “Bad” words generally relate to parts of life that we don’t like talking about in public, like bathroom functions, or negative ways of talking about people’s religion, skin color, ability, or gender [3]. Often, people say bad words when they are experiencing strong emotions, and those words are linked to our emotions [3].
References (titles only) [1, 2, 3] Why Are Some Words ‘Bad’? | Vermont Public Radio (
[4] On Words: ‘Bad’ Words and Why We Should Study Them | UVA Today (
[5] The Science of Curse Words: Why The &@$! Do We Swear? (
Figure 5: Preference of the 175B best-of-n BC model over the BC model. The validation RM prediction is obtained using the estimator described in Appendix I, and predicts human preference well in this setting. The shaded region represents ±1 standard error.

…Our results are shown in Figure 2. Our best model, the 175B best-of-64 model, produces answers that are preferred to those written by our human demonstrators 56% of the time. This suggests that the use of human feedback is essential, since one would not expect to exceed 50% preference by imitating demonstrations alone (although it may still be possible, by producing a less noisy policy). The same model produces answers that are preferred to the reference answers from the ELI5 dataset 69% of the time. This is a substantial improvement over Krishna et al,2021 whose best model’s answers are preferred 23% of the time to the reference answers, although they use substantially less compute than even our smallest model.

Figure 6: BC scaling, varying the proportion of the demonstration dataset and parameter count of the policy.

…The combination of RL and rejection sampling also fails to offer much benefit over rejection sampling alone. One possible reason for this is that RL and rejection sampling are optimizing against the same reward model, which can easily be overoptimized (especially by RL, as noted above). In addition to this, RL reduces the entropy of the policy, which hurts exploration. Adapting the RL objective to optimize rejection sampling performance is an interesting direction for future research. It is also worth highlighting the importance of carefully tuning the BC baseline for these comparisons. As discussed in Appendix E, we tuned the number of BC epochs and the sampling temperature using a combination of human evaluations and reward model score. This alone closed much of the gap we originally saw between BC and RL.

Figure 7: RM scaling, varying the proportion of the comparison dataset and parameter count of the reward model.

…Scaling trends with dataset size and parameter count are shown in Figures 6 and 7. For dataset size, doubling the number of demonstrations increased the policy’s reward model score by about 0.13, and doubling the number of comparisons increased the reward model’s accuracy by about 1.8%. For parameter count, the trends were noisier, but doubling the number of parameters in the policy increased its reward model score by roughly 0.09, and doubling the number of parameters in the reward model increased its accuracy by roughly 0.4%.

Figure 8: Best-of-n scaling, varying the parameter count of the policy and reward model together, as well as the number of answers sampled.

…For rejection sampling, we analyzed how to trade off the number of samples against the number of model parameters for a given inference-time compute budget (see Figure 8). We found that it is generally compute-efficient to use some amount of rejection sampling, but not too much. The models for our main evaluations come from the Pareto frontier of this trade-off: the 760M best-of-4 model, the 13B best-of-16 model, and the 175B best-of-64 model. [cf. Jones 2021]

“Modeling Strong and Human-Like Gameplay With KL-Regularized Search”, Jacob et al 2021

“Modeling Strong and Human-Like Gameplay with KL-Regularized Search”⁠, Athul Paul Jacob, David J. Wu, Gabriele Farina, Adam Lerer, Anton Bakhtin, Jacob Andreas, Noam Brown (2021-12-14; ⁠, ⁠, ; similar):

We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (eg. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with.

We show in chess and Go that regularizing search policies based on the KL divergence from an imitation-learned policy by applying Monte Carlo tree search produces policies that have higher human prediction accuracy and are stronger than the imitation policy. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that applying this algorithm to no-press Diplomacy yields a policy that maintains the same human prediction accuracy as imitation learning while being substantially stronger.

“A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021

“A General Language Assistant as a Laboratory for Alignment”⁠, Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph et al (2021-12-01; ⁠, ⁠, ; similar):

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a ‘preference model pre-training’ stage of training, with the goal of improving sample efficiency when finetuning on human preferences.

“Recursively Summarizing Books With Human Feedback”, Wu et al 2021

“Recursively Summarizing Books with Human Feedback”⁠, Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano (2021-09-22; ; backlinks; similar):

A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases (~5% of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.

“B-Pref: Benchmarking Preference-Based Reinforcement Learning”, Lee et al 2021

“B-Pref: Benchmarking Preference-Based Reinforcement Learning”⁠, Kimin Lee, Laura Smith, Anca Dragan, Pieter Abbeel (2021-06-08; similar):

Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a teacher’s preferences without pre-defined rewards, thus overcoming concerns associated with reward engineering. However, it is difficult to quantify the progress in preference-based RL due to the lack of a commonly adopted benchmark. In this paper, we introduce B-Pref: a benchmark specially designed for preference-based RL. A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly, which makes relying on real human input for evaluation prohibitive. At the same time, simulating human input as giving perfect preferences for the ground truth reward function is unrealistic. B-Pref alleviates this by simulating teachers with a wide array of irrationalities, and proposes metrics not solely for performance but also for robustness to these potential irrationalities. We showcase the utility of B-Pref by using it to analyze algorithmic design choices, such as selecting informative queries, for state-of-the-art preference-based RL algorithms. We hope that B-Pref can serve as a common starting point to study preference-based RL more systematically.

[Keywords: Preference-based reinforcement learning, human-in-the-loop reinforcement learning, deep reinforcement learning]

“Trajectory Transformer: Reinforcement Learning As One Big Sequence Modeling Problem”, Janner et al 2021

“Trajectory Transformer: Reinforcement Learning as One Big Sequence Modeling Problem”⁠, Michael Janner, Qiyang Colin Li, Sergey Levine (2021-06-03; ⁠, ⁠, ; backlinks; similar):

[blog⁠; a simultaneous-invention of Decision Transformer⁠, with more emphasis on model-based learning like exploration; see Decision Transformer annotation for related work.]


Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem.

To this end, we explore how RL can be reframed as “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence modeling problem largely simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.

…Replacing log-probabilities from the sequence model with reward predictions yields a model-based planning method, surprisingly effective despite lacking the details usually required to make planning with learned models effective.

Related Publication: Chen et al concurrently proposed another sequence modeling approach to reinforcement learning [Decision Transformer]. At a high-level, ours is more model-based in spirit and theirs is more model-free, which allows us to evaluate Transformers as long-horizon dynamics models (eg. in the humanoid predictions above) and allows them to evaluate their policies in image-based environments (eg. Atari). We encourage you to check out their work as well.

“Decision Transformer: Reinforcement Learning via Sequence Modeling”, Chen et al 2021

“Decision Transformer: Reinforcement Learning via Sequence Modeling”⁠, Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas et al (2021-06-02; ⁠, ⁠, ⁠, ; backlinks; similar):

[online DT] We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.

Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite the simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym⁠, and Key-to-Door tasks.

Decision Transformer: autoregressive sequence modeling for RL: We take a simple approach: each modality (return, state, or action) is passed into an embedding network (convolutional encoder for images, linear layer for continuous states). The embeddings are then processed by an autoregressive transformer model, trained to predict the next action given the previous tokens using a linear output layer. Evaluation is also easy: we can initialize by a desired target return (eg. 1 or 0 for success or failure) and the starting state in the environment. Unrolling the sequence—similar to standard autoregressive generation in language models—yields a sequence of actions to execute in the environment.

Sequence modeling as multitask learning: One effect of this type of modeling is that we perform conditional generation, where we initialize a trajectory by inputting our desired return. Decision Transformer does not yield a single policy; rather, it models a wide distribution of policies. If we plot average achieved return against the target return of a trained Decision Transformer, we find distinct policies are learned that can reasonably match the target, trained only with supervised learning. Furthermore, on some tasks (such as Q✱bert and Seaquest), we find Decision Transformer can actually extrapolate outside of the dataset and model policies achieving higher return!

[Paper⁠; Github⁠; see also MuZero⁠, “goal-conditioned” or “upside-down reinforcement learning” (such as “morethan” prompting), Shawn Presser’s GPT-2 chess model (& Cheng’s almost-DT chess transformer), value equivalent models, Ortega et al 2021 on ‘delusions’⁠. Simultaneous work at BAIR invents Decision Transformer as Trajectory Transformer⁠. Note that DT, being in the ‘every task is a generation task’ paradigm of GPT, lends itself nicely to preference learning simply by formatting human-ranked choices of a sequence.

The simplicity of this version of the control codes or ‘inline metadata trick’ (eg. CTRL) means it can be reused with almost any generative model where some measure of quality or reward is available (even if only self-critique like likelihood of a sequence eg. in Meena-style best-of ranking or inverse prompting): you have an architecture floorplan DALL·E? Use standard architecture software to score plans by their estimated thermal efficiency/​sunlight/​etc; prefix these scores, retrain, & decode for good floorplans maximizing thermal efficiency/​sunlight. You have a regular DALL·E? Sample n samples per prompt, CLIP-rank the images, prefix their ranking, retrain… No useful CLIP? Then use the CogView self-text-captioning trick to turn generated images back into text, rank by text likelihood… Choose Your Own Adventure AI Dungeon game-tree? Rank completions by player choice, feed back in for preference learning… All of the work is done by the data, as long as the generative model is smart enough.]

“A Survey of Preference-Based Reinforcement Learning Methods”, Wirth et al 2021

“A Survey of Preference-Based Reinforcement Learning Methods”⁠, Christian Wirth, Riad Akrour, Gerhard Neumann, Johannes Fürnkranz (2021-05-20; similar):

Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that do not only influence the learned behavior but also the learning progress.

To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert’s preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non numeric rewards and the possibility to reduce the dependence on expert knowledge.

We provide an unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/​exploitation problem is tackled.

Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL.

“Brain-computer Interface for Generating Personally Attractive Images”, Spape et al 2021

2021-spape.pdf: “Brain-computer interface for generating personally attractive images”⁠, Michiel Spape, Keith M. Davis III, Lauri Kangassalo, Niklas Ravaja, Zania Sovijarvi-Spape, Tuukka Ruotsalo et al (2021-02-12; ⁠, ⁠, ; similar):

While we instantaneously recognize a face as attractive, it is much harder to explain what exactly defines personal attraction. This suggests that attraction depends on implicit processing of complex, culturally and individually defined features. Generative adversarial neural networks (GANs), which learn to mimic complex data distributions, can potentially model subjective preferences unconstrained by pre-defined model parameterization.

Here, we present generative brain-computer interfaces (GBCI), coupling GANs with brain-computer interfaces. GBCI first presents a selection of images and captures personalized attractiveness reactions toward the images via electroencephalography. These reactions are then used to control a ProGAN model, finding a representation that matches the features constituting an attractive image for an individual. We conducted an experiment (N = 30) to validate GBCI using a face-generating GAN and producing images that are hypothesized to be individually attractive. In double-blind evaluation of the GBCI-produced images against matched controls, we found GBCI yielded highly accurate results.

Thus, the use of EEG responses to control a GAN presents a valid tool for interactive information-generation. Furthermore, the GBCI-derived images visually replicated known effects from social neuroscience, suggesting that the individually responsive, generative nature of GBCI provides a powerful, new tool in mapping individual differences and visualizing cognitive-affective processing.

[Keywords: brain-computer interfaces, electroencephalography (EEG), generative adversarial networks (GANs), image generation, attraction, personal preferences, individual differences]

Figure 5: Individually generated faces and their evaluation. Panel A shows for 8 female and 8 male participants (full overview available here) the individual faces expected to be evaluated positively (in green framing) and negatively (in red). Panel B shows the evaluation results averaged across participants for both the free selection (upper-right) and explicit evaluation (lower-right) tasks. In the free selection task, the images that were expected to be found attractive (POS) and unattractive (NEG) were randomly inserted with 20 matched controls (RND = random expected attractiveness), and participants made a free selection of attractive faces. In the explicit evaluation task, participants rated each generated (POS, NEG, RND) image on a Likert-type scale of personal attractiveness

…Thus, negative generated images were evaluated as highly attractive for other people, but not for the participant themselves. Taken together, the results suggest that the GBCI was highly accurate in generating personally attractive images (83.33%). They also show that while both negative and positive generated images were evaluated as highly attractive for the general population (respectively M = 4.43 and 4.90 on a scale of 1–5), only the positive generated images (M = 4.57) were evaluated as highly personally attractive.

Qualitative results: In semi-structured post-test interviews, participants were shown the generated images that were expected to be found attractive/ unattractive. Thematic analysis found predictions of positive attractiveness were experienced as accurate: There were no false positives (generated unattractive found personally attractive). The participants also expressed being pleased with results (eg. “Quite an ideal beauty for a male!”; “I would be really attracted to this!”; “Can I have a copy of this? It looks just like my girlfriend!”).

“Learning Personalized Models of Human Behavior in Chess”, McIlroy-Young et al 2020

“Learning Personalized Models of Human Behavior in Chess”⁠, Reid McIlroy-Young, Russell Wang, Siddhartha Sen, Jon Kleinberg, Ashton Anderson (2020-08-23; ⁠, ⁠, ; similar):

Even when machine learning systems surpass human ability in a domain, there are many reasons why AI systems that capture human-like behavior would be desirable: humans may want to learn from them, they may need to collaborate with them, or they may expect them to serve as partners in an extended interaction. Motivated by this goal of human-like AI systems, the problem of predicting human actions—as opposed to predicting optimal actions—has become an increasingly useful task.

We extend this line of work by developing highly accurate personalized models of human behavior in the context of chess. Chess is a rich domain for exploring these questions, since it combines a set of appealing features: AI systems have achieved superhuman performance but still interact closely with human chess players both as opponents and preparation tools, and there is an enormous amount of recorded data on individual players. Starting with an open-source version of AlphaZero trained on a population of human players, we demonstrate that we can significantly improve prediction of a particular player’s moves by applying a series of fine-tuning adjustments. Furthermore, we can accurately perform stylometry—predicting who made a given set of actions—indicating that our personalized models capture human decision-making at an individual level.

“Aligning Superhuman AI With Human Behavior: Chess As a Model System”, McIlroy-Young et al 2020

“Aligning Superhuman AI with Human Behavior: Chess as a Model System”⁠, Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, Ashton Anderson (2020-06-02; ⁠, ⁠, ; similar):

As artificial intelligence becomes increasingly intelligent—in some cases, achieving superhuman performance—there is growing potential for humans to learn from and collaborate with algorithms. However, the ways in which AI systems approach problems are often different from the ways people do, and thus may be uninterpretable and hard to learn from. A crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that constitute human behavior, rather than simply matching aggregate human performance.

We pursue this goal in a model system with a long history in artificial intelligence: chess. The aggregate performance of a chess player unfolds as they make decisions over the course of a game. The hundreds of millions of games played online by players at every skill level form a rich source of data in which these decisions, and their exact context, are recorded in minute detail. Applying existing chess engines to this data, including an open-source implementation of AlphaZero, we find that they do not predict human moves well.

We develop and introduce Maia, a customized version of Alpha-Zero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines, and can achieve maximum accuracy when predicting decisions made by players at a specific skill level in a tuneable way. For a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines. Taken together, our results suggest that there is substantial promise in designing artificial intelligence systems with human collaboration in mind by first accurately modeling granular human decision-making.

“Active Preference-Based Gaussian Process Regression for Reward Learning”, Bıyık et al 2020

“Active Preference-Based Gaussian Process Regression for Reward Learning”⁠, Erdem Bıyık, Nicolas Huynh, Mykel J. Kochenderfer, Dorsa Sadigh (2020-05-06; ; similar):

Designing reward functions is a challenging problem in AI and robotics. Humans usually have a difficult time directly specifying all the desirable behaviors that a robot needs to optimize. One common approach is to learn reward functions from collected expert demonstrations. However, learning reward functions from demonstrations introduces many challenges: some methods require highly structured models, eg. reward functions that are linear in some predefined set of features, while others adopt less structured reward functions that on the other hand require tremendous amount of data. In addition, humans tend to have a difficult time providing demonstrations on robots with high degrees of freedom, or even quantifying reward values for given demonstrations.

To address these challenges, we present a preference-based learning approach—where as an alternative, the human feedback is only in the form of comparisons between trajectories. Furthermore, we do not assume highly constrained structures on the reward function. Instead, we model the reward function using a Gaussian Process (GP) and propose a mathematical formulation to actively find a GP using only human preferences. Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework. Our results in simulations and an user study suggest that our approach can efficiently learn expressive reward functions for robotics tasks.

“RL Agents Implicitly Learning Human Preferences”, Wichers 2020

“RL agents Implicitly Learning Human Preferences”⁠, Nevan Wichers (2020-02-14; ⁠, ; backlinks; similar):

In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict if a simulated human’s preferences are fulfilled based on the activations of a RL agent’s neural network gets .93 AUC. Training a classifier on the raw environment state gets only .8 AUC. Training the classifier off of the RL agent’s activations also does much better than training off of activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agent more beneficial for humans.

“Reward-rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020

“Reward-rational (implicit) choice: A unifying formalism for reward learning”⁠, Hong Jun Jeon, Smitha Milli, Anca D. Dragan (2020-02-12; similar):

It is often difficult to hand-specify what the correct reward function is for a task, so researchers have instead aimed to learn reward functions from human behavior or feedback. The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years. We’ve gone from demonstrations, to comparisons, to reading into the information leaked when the human is pushing the robot away or turning it off. And surely, there is more to come. How will a robot make sense of all these diverse types of behavior? Our key insight is that different types of behavior can be interpreted in a single unifying formalism—as a reward-rational choice that the human is making—often implicitly. The formalism offers both an unifying lens with which to view past work, as well as a recipe for interpreting new sources of information that are yet to be uncovered. We provide two examples to showcase this: interpreting a new feedback type, and reading into how the choice of feedback itself leaks information about the reward.

“What Does BERT Dream Of? A Visual Investigation of Nightmares in Sesame Street”, Bäuerle & Wexler 2020

“What does BERT dream of? A visual investigation of nightmares in Sesame Street”⁠, Alex Bäuerle, James Wexler (2020-01-13; ⁠, ; backlinks; similar):

BERT⁠, a neural network published by Google in 2018, excels in natural language understanding. It can be used for multiple different tasks, such as sentiment analysis or next sentence prediction, and has recently been integrated into Google Search. This novel model has brought a big change to language modeling as it outperformed all its predecessors on multiple different tasks. Whenever such breakthroughs in deep learning happen, people wonder how the network manages to achieve such impressive results, and what it actually learned. A common way of looking into neural networks is feature visualization. The ideas of feature visualization are borrowed from Deep Dream, where we can obtain inputs that excite the network by maximizing the activation of neurons, channels, or layers of the network. This way, we get an idea about which part of the network is looking for what kind of input.

In Deep Dream, inputs are changed through gradient descent to maximize activation values. This can be thought of as similar to the initial training process, where through many iterations, we try to optimize a mathematical equation. But instead of updating network parameters, Deep Dream updates the input sample. What this leads to is somewhat psychedelic but very interesting images, that can reveal to what kind of input these neurons react. Examples for Deep Dream processes with images from the original Deep Dream blogpost. Here, they take a randomly initialized image and use Deep Dream to transform the image by maximizing the activation of the corresponding output neuron. This can show what a network has learned about different classes or for individual neurons.

Feature visualization works well for image-based models, but has not yet been widely explored for language models. This blogpost will guide you through experiments we conducted with feature visualization for BERT. We show how we tried to get BERT to dream of highly activating inputs, provide visual insights of why this did not work out as well as we hoped, and publish tools to explore this research direction further. When dreaming for images, the input to the model is gradually changed. Language, however, is made of discrete structures, ie. tokens, which represent words, or word-pieces. Thus, there is no such gradual change to be made…Looking at a single pixel in an input image, such a change could be gradually going from green to red. The green value would slowly go down, while the red value would increase. In language, however, we can not slowly go from the word “green” to the word “red”, as everything in between does not make sense. To still be able to use Deep Dream, we have to utilize the so-called Gumbel-Softmax trick, which has already been employed in a paper by Poerner et al 2018⁠. This trick was introduced by Jang et. al. and Maddison et. al.. It allows us to soften the requirement for discrete inputs, and instead use a linear combination of tokens as input to the model. To assure that we do not end up with something crazy, it uses two mechanisms. First, it constrains this linear combination so that the linear weights sum up to one. This, however, still leaves the problem that we can end up with any linear combination of such tokens, including ones that are not close to real tokens in the embedding space. Therefore, we also make use of a temperature parameter, which controls the sparsity of this linear combination. By slowly decreasing this temperature value, we can make the model first explore different linear combinations of tokens, before deciding on one token.

…The lack of success in dreaming words to highly activate specific neurons was surprising to us. This method uses gradient descent and seemed to work for other models (see Poerner et al 2018). However, BERT is a complex model, arguably much more complex than the models that have been previously investigated with this method.

“Deep Bayesian Reward Learning from Preferences”, Brown & Niekum 2019

“Deep Bayesian Reward Learning from Preferences”⁠, Daniel S. Brown, Scott Niekum (2019-12-10; similar):

Bayesian inverse reinforcement learning (IRL) methods are ideal for safe imitation learning, as they allow a learning agent to reason about reward uncertainty and the safety of a learned policy. However, Bayesian IRL is computationally intractable for high-dimensional problems because each sample from the posterior requires solving an entire Markov Decision Process (MDP). While there exist non-Bayesian deep IRL methods, these methods typically infer point estimates of reward functions, precluding rigorous safety and uncertainty analysis.

We propose Bayesian Reward Extrapolation (B-REX), a highly efficient preference-based Bayesian reward learning algorithm that scales to high-dimensional, visual control tasks. Our approach uses successor feature representations and preferences over demonstrations to efficiently generate samples from the posterior distribution over the demonstrator’s reward function without requiring an MDP solver. Using samples from the posterior, we demonstrate how to calculate high-confidence bounds on policy performance in the imitation learning setting, in which the ground-truth reward function is unknown. We evaluate our proposed approach on the task of learning to play Atari games via imitation learning from pixel inputs, with no access to the game score. We demonstrate that B-REX learns imitation policies that are competitive with a state-of-the-art deep imitation learning method that only learns a point estimate of the reward function. Furthermore, we demonstrate that samples from the posterior generated via B-REX can be used to compute high-confidence performance bounds for a variety of evaluation policies. We show that high-confidence performance bounds are useful for accurately ranking different evaluation policies when the reward function is unknown. We also demonstrate that high-confidence performance bounds may be useful for detecting reward hacking.

“Learning Human Objectives by Evaluating Hypothetical Behavior”, Reddy et al 2019

“Learning Human Objectives by Evaluating Hypothetical Behavior”⁠, Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan Leike (2019-12-05; ⁠, ; backlinks; similar):

We seek to align agent behavior with an user’s objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user’s reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information⁠, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST).

We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.

“Reinforcement Learning Upside Down: Don't Predict Rewards—Just Map Them to Actions”, Schmidhuber 2019

“Reinforcement Learning Upside Down: Don't Predict Rewards—Just Map Them to Actions”⁠, Juergen Schmidhuber (2019-12-05; ⁠, ⁠, ; backlinks; similar):

We transform reinforcement learning (RL) into a form of supervised learning (SL) by turning traditional RL on its head, calling this Upside Down RL (UDRL). Standard RL predicts rewards, while UDRL instead uses rewards as task-defining inputs, together with representations of time horizons and other computable functions of historic and desired future data. UDRL learns to interpret these input observations as commands, mapping them to actions (or action probabilities) through SL on past (possibly accidental) experience.

UDRL generalizes to achieve high rewards or other goals, through input commands such as: get lots of reward within at most so much time! A separate paper [63] on first experiments with UDRL shows that even a pilot version of UDRL can outperform traditional baseline algorithms on certain challenging RL problems.

We also conceptually simplify an approach [60] for teaching a robot to imitate humans. First videotape humans imitating the robot’s current behaviors, then let the robot learn through SL to map the videos (as input commands) to these behaviors, then let it generalize and imitate videos of humans executing previously unknown behavior. This Imitate-Imitator concept may actually explain why biological evolution has resulted in parents who imitate the babbling of their babies.

“Preference-Based Learning for Exoskeleton Gait Optimization”, Tucker et al 2019

“Preference-Based Learning for Exoskeleton Gait Optimization”⁠, Maegan Tucker, Ellen Novoseller, Claudia Kann, Yanan Sui, Yisong Yue, Joel Burdick, Aaron D. Ames (2019-09-26; similar):

This paper presents a personalized gait optimization framework for lower-body exoskeletons. Rather than optimizing numerical objectives such as the mechanical cost of transport, our approach directly learns from user preferences, eg. for comfort. Building upon work in preference-based interactive learning, we present the CoSpar algorithm. CoSpar prompts the user to give pairwise preferences between trials and suggest improvements; as exoskeleton walking is a non-intuitive behavior, users can provide preferences more easily and reliably than numerical feedback. We show that CoSpar performs competitively in simulation and demonstrate a prototype implementation of CoSpar on a lower-body exoskeleton to optimize human walking trajectory features. In the experiments, CoSpar consistently found user-preferred parameters of the exoskeleton’s walking gait, which suggests that it is a promising starting point for adapting and personalizing exoskeletons (or other assistive devices) to individual users.

“Fine-Tuning GPT-2 from Human Preferences”, Ziegler et al 2019

“Fine-Tuning GPT-2 from Human Preferences”⁠, Daniel Ziegler, Nisan Stiennon, Jeffrey Wu, Tom Brown, Dario Amodei, Alec Radford, Paul Christiano, Geoffrey Irving et al (2019-09-19; ⁠, ; backlinks; similar):

We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans”, which we believe is key to extracting information about human values.

This work applies human preference learning to several natural language tasks: continuing text with positive sentiment or physically descriptive language using the BookCorpus, and summarizing content from the TL;DR and CNN⁠/​Daily Mail datasets. Each of these tasks can be viewed as a text completion problem: starting with some text X, we ask what text Y should follow. [For summarization, the text is the article plus the string “TL;DR:”.]

We start with a pretrained language model (the 774M parameter version of GPT-2) and fine-tune the model by asking human labelers which of four samples is best. Fine-tuning for the stylistic continuation tasks is sample efficient: 5,000 human samples suffice for strong performance according to humans. For summarization, models trained with 60,000 comparisons learn to copy whole sentences from the input while skipping irrelevant preamble; this copying is an easy way to ensure accurate summaries, but may exploit the fact that labelers rely on simple heuristics.

Bugs can optimize for bad behavior

One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.

Looking forward

We’ve demonstrated reward learning from human preferences on two kinds of natural language tasks, stylistic continuation and summarization. Our results are mixed: for continuation we achieve good results with very few samples, but our summarization models are only “smart copiers”: they copy from the input text but skip over irrelevant preamble. The advantage of smart copying is truthfulness: the zero-shot and supervised models produce natural, plausible-looking summaries that are often lies. We believe the limiting factor in our experiments is data quality exacerbated by the online data collection setting, and plan to use batched data collection in the future.

We believe the application of reward learning to language is important both from a capability and safety perspective. On the capability side, reinforcement learning lets us correct mistakes that supervised learning would not catch, but RL with programmatic reward functions “can be detrimental to model quality.” On the safety side, reward learning for language allows important criteria like “don’t lie” to be represented during training, and is a step towards scalable safety methods such as a debate and amplification. [Followup: “Learning to summarize from human feedback”⁠, Stiennon et al 2020.]

“Fine-Tuning Language Models from Human Preferences”, Ziegler et al 2019

“Fine-Tuning Language Models from Human Preferences”⁠, Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano et al (2019-09-18; ⁠, ⁠, ; backlinks; similar):

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/​Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.

“Lm-human-preferences”, Ziegler et al 2019

“lm-human-preferences”⁠, Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano et al (2019-09-14; ; backlinks; similar):

Code for the paper ‘Fine-Tuning Language Models from Human Preferences’. Status: Archive (code is provided as-is, no updates expected). We provide code for:

  • Training reward models from human labels
  • Fine-tuning language models using those reward models

It does not contain code for generating labels. However, we have released human labels collected for our experiments, at gs://lm-human-preferences/labels. For those interested, the question and label schemas are simple and documented in

The code has only been tested using the smallest GPT-2 model (124M parameters). This code has only been tested using Python 3.7.3. Training has been tested on GCE machines with 8 V100s, running Ubuntu 16.04, but development also works on Mac OS X.

“Dueling Posterior Sampling for Preference-Based Reinforcement Learning”, Novoseller et al 2019

“Dueling Posterior Sampling for Preference-Based Reinforcement Learning”⁠, Ellen R. Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, Joel W. Burdick (2019-08-04; similar):

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the preference feedback. As preference feedback is provided on trajectories rather than individual state-action pairs, we develop a Bayesian approach for the credit assignment problem, translating preferences to a posterior distribution over state-action reward models. We prove an asymptotic Bayesian no-regret rate for DPS with a Bayesian linear regression credit assignment model. This is the first regret guarantee for preference-based RL to our knowledge. We also discuss possible avenues for extending the proof methodology to other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.

“Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog”, Jaques et al 2019

“Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog”⁠, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu et al (2019-06-30; similar):

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment—eg. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation—a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves substantial improvements over prior methods in off-policy batch RL.

“Reward Learning from Human Preferences and Demonstrations in Atari”, Ibarz et al 2018

“Reward learning from human preferences and demonstrations in Atari”⁠, Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei (2018-11-15; ; backlinks; similar):

To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train an DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.

“Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making”, Zintgraf et al 2018

“Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making”⁠, Luisa M. Zintgraf, Diederik M. Roijers, Sjoerd Linders, Catholijn M. Jonker, Ann Nowé (2018-02-21; similar):

In multi-objective decision planning and learning, much attention is paid to producing optimal solution sets that contain an optimal policy for every possible user preference profile. We argue that the step that follows, i.e. determining which policy to execute by maximizing the user’s intrinsic utility function over this (possibly infinite) set, is under-studied. This paper aims to fill this gap.

We build on previous work on Gaussian processes and pairwise comparisons for preference modelling, extend it to the multi-objective decision support scenario, and propose new ordered preference elicitation strategies based on ranking and clustering. Our main contribution is an in-depth evaluation of these strategies using computer and human-based experiments. We show that our proposed elicitation strategies outperform the currently used pairwise methods, and found that users prefer ranking most. Our experiments further show that utilising monotonicity information in GPs by using a linear prior mean at the start and virtual comparisons to the nadir and ideal points increases performance.

We demonstrate our decision support framework in a real-world study on traffic regulation, conducted with the city of Amsterdam.

“Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces”, Warnell et al 2017

“Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces”⁠, Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, Peter Stone (2017-09-28; ; backlinks; similar):

While recent advances in deep reinforcement learning have allowed autonomous learning agents to succeed at a variety of complex tasks, existing algorithms generally require a lot of training data. One way to increase the speed at which agents are able to learn to perform tasks is by leveraging the input of human trainers. Although such input can take many forms, real-time, scalar-valued feedback is especially useful in situations where it proves difficult or impossible for humans to provide expert demonstrations. Previous approaches have shown the usefulness of human input provided in this fashion (eg. the TAMER framework), but they have thus far not considered high-dimensional state spaces or employed the use of deep learning. In this paper, we do both: we propose Deep TAMER, an extension of the TAMER framework that leverages the representational power of deep neural networks in order to learn complex tasks in just a short amount of time with a human trainer. We demonstrate Deep TAMER’s success by using it and just 15 minutes of human-provided feedback to train an agent that performs better than humans on the Atari game of Bowling—a task that has proven difficult for even state-of-the-art reinforcement learning methods.

“Towards Personalized Human AI Interaction—adapting the Behavior of AI Agents Using Neural Signatures of Subjective Interest”, Shih et al 2017

“Towards personalized human AI interaction—adapting the behavior of AI agents using neural signatures of subjective interest”⁠, Victor Shih, David C. Jangraw, Paul Sajda, Sameer Saproo (2017-09-14; similar):

Reinforcement Learning AI commonly uses reward/​penalty signals that are objective and explicit in an environment—eg. game score, completion time, etc.—in order to learn the optimal strategy for task performance. However, Human-AI interaction for such AI agents should include additional reinforcement that is implicit and subjective—eg. human preferences for certain AI behavior—in order to adapt the AI behavior to idiosyncratic human preferences. Such adaptations would mirror naturally occurring processes that increase trust and comfort during social interactions.

Here, we show how a hybrid brain-computer-interface (hBCI), which detects an individual’s level of interest in objects/​events in a virtual environment, can be used to adapt the behavior of a Deep Reinforcement Learning AI agent that is controlling a virtual autonomous vehicle. Specifically, we show that the AI learns a driving strategy that maintains a safe distance from a lead vehicle, and most novel, preferentially slows the vehicle when the human passengers of the vehicle encounter objects of interest. This adaptation affords an additional 20% viewing time for subjectively interesting objects.

This is the first demonstration of how an hBCI can be used to provide implicit reinforcement to an AI agent in a way that incorporates user preferences into the control system.

“Learning from Human Preferences”, Amodei et al 2017

“Learning from Human Preferences”⁠, Dario Amodei, Paul Christiano, Alex Ray (2017-06-13; ⁠, ⁠, ; backlinks; similar):

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

We present a learning algorithm that uses small amounts of human feedback to solve modern RL environments. Machine learning systems with human feedback have been explored before, but we’ve scaled up the approach to be able to work on much more complicated tasks. Our algorithm needed 900 bits of feedback from a human evaluator to learn to backflip—a seemingly simple task which is simple to judge but challenging to specify.

The overall training process is a 3-step feedback cycle between the human, the agent’s understanding of the goal, and the RL training.

Preference learning architecture

Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal—in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human’s judgments. It then uses RL to learn how to achieve that goal. As its behavior improves, it continues to ask for human feedback on trajectory pairs where it’s most uncertain about which is better, and further refines its understanding of the goal.

“Deep Reinforcement Learning from Human Preferences”, Christiano et al 2017

“Deep reinforcement learning from human preferences”⁠, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei (2017-06-12; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.

“Just Sort It! A Simple and Effective Approach to Active Preference Learning”, Maystre & Grossglauser 2015

“Just Sort It! A Simple and Effective Approach to Active Preference Learning”⁠, Lucas Maystre, Matthias Grossglauser (2015-02-19; ; backlinks; similar):

We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking?

We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.

“Bayesian Active Learning for Classification and Preference Learning”, Houlsby et al 2011

“Bayesian Active Learning for Classification and Preference Learning”⁠, Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, Máté Lengyel (2011-12-24; ⁠, ; backlinks; similar):

Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.