Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues, we present RankGen, an encoder model (1.2B parameters) that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, which discourage topically-similar but irrelevant generations; and (2) sequences generated from a large language model conditioned on the prefix, which discourage repetition and hallucination. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling on both automatic metrics (85.0 vs 77.3 MAUVE) as well as human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We open-source our model checkpoints, code, and human preferences with detailed explanations for future research.
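[Illustrative sketch of how a RankGen-style scorer can be used for over-generation-and-reranking; `lm_sample`, `encode_prefix`, and `encode_suffix` are hypothetical callables standing in for the language model and the encoder, not a real RankGen API.]

```python
import torch

def rankgen_score(prefix_emb: torch.Tensor, cand_embs: torch.Tensor) -> torch.Tensor:
    """Compatibility between one prefix embedding (d,) and N candidate embeddings (N, d)."""
    return cand_embs @ prefix_emb  # (N,) dot-product scores

def rerank(prefix: str, lm_sample, encode_prefix, encode_suffix, n: int = 20) -> str:
    """Over-generate n continuations from any LM and keep the one the scorer ranks highest.
    lm_sample/encode_prefix/encode_suffix are assumed callables, not a real API."""
    candidates = [lm_sample(prefix) for _ in range(n)]
    p = encode_prefix(prefix)                                  # (d,)
    c = torch.stack([encode_suffix(x) for x in candidates])    # (n, d)
    return candidates[int(rankgen_score(p, c).argmax())]
```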
[blog] Inspired by progress in large-scale language modeling [Decision Transformer], we apply a similar approach towards building a single generalist agent beyond the realm of text outputs.
The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
Figure 1: A generalist agent. Gato can sense and act with different embodiments across a wide range of environments using a single neural network with the same set of weights. Gato was trained on 604 distinct tasks with varying modalities, observations and action specifications.
In this report we describe the model and the data, and document the current capabilities of Gato [at 0.08b, 0.36b, & 1.2b parameters].
…Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.
…We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL)…Training of the model is performed on a 16×16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1,024, which takes about 4 days.
Figure 2: Training phase of Gato. Data from different tasks and modalities is serialized into a flat sequence of tokens, batched, and processed by a transformer neural network akin to a large language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions.
…Scaling Laws Analysis: In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the 3 model architectures. Here, for all 3 model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1). Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a substantial performance improvement with increased scale.
Figure 8: Model size scaling laws results. In-distribution performance as a function of tokens processed for 3 model scales. Performance is first mean-aggregated within each separate control domain, and then mean-aggregated across all domains. We can see a consistent improvement as model capacity is increased for a fixed number of tokens.
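[A minimal re-implementation of the aggregation described in the scaling-laws analysis above (per-task percentage of expert score, mean within each domain, then mean across domains); function and argument names are placeholders, not DeepMind's code.]

```python
from collections import defaultdict

def normalized_return(task_scores, expert_scores, task_to_domain):
    """Illustrative aggregation: per-task percentage of expert score ->
    mean within each domain -> mean across domains."""
    per_domain = defaultdict(list)
    for task, score in task_scores.items():
        per_domain[task_to_domain[task]].append(100.0 * score / expert_scores[task])
    domain_means = [sum(v) / len(v) for v in per_domain.values()]
    return sum(domain_means) / len(domain_means)
```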
Figure 10: Robotics fine-tuning results. Left: Comparison of real robot Skill Generalization success rate averaged across test triplets for Gato, expert, and CRR trained on 35k expert episodes (upper bound). Right: Comparison of simulated robot Skill Generalization success rate averaged across test triplets for a series of ablations on the number of parameters, including scores for expert and a BC baseline trained on 5k episodes.
Fine-tuning and Model Size: To better understand the benefit of large models for few-shot adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. The results suggest that the model’s greater capacity allows the model to use representations learned from the diverse training data at test time.
…As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out…Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Rather than providing eg. one-hot task identifiers, we instead take inspiration from (Brown et al, 2020; Sanh et al 2022; Wei et al 2021) and use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.
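[A rough sketch of the sequence-construction rules described above (loss masking plus 25% prompt conditioning, half goal-conditioned); the data structures, prompt-length budget, and names are hypothetical simplifications, not the actual Gato code.]

```python
import random

def build_sequence(episode_tokens, is_target, same_task_episodes,
                   prompt_prob=0.25, max_len=1024):
    """is_target marks text/action tokens (loss applied) vs. observation/image tokens
    (loss masked out); with probability prompt_prob a prompt from another episode of the
    same task is prepended, half the time taken from the episode's end (goal conditioning)."""
    tokens, loss_mask = list(episode_tokens), list(is_target)
    if same_task_episodes and random.random() < prompt_prob:
        demo, budget = random.choice(same_task_episodes), max_len // 2
        if random.random() < 0.5:
            prompt = demo[-budget:]                      # end of episode = goal conditioning
        else:
            start = random.randrange(max(1, len(demo) - budget))
            prompt = demo[start:start + budget]          # uniformly sampled slice
        tokens = prompt + tokens
        loss_mask = [False] * len(prompt) + loss_mask    # no loss on the prompt itself
    return tokens[:max_len], loss_mask[:max_len]
```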
Table 1: Datasets. Left: Control datasets used to train Gato. Right: Vision & language datasets. Sample weight means the proportion of each dataset, on average, in the training sequence batches.
Figure 5: Gato’s performance on simulated control tasks. Number of tasks where the performance of the pretrained model is above a percentage of expert score, grouped by domain. Here values on the x-axis represent a specific percentage of expert score, where 0 corresponds to random agent performance. The y-axis is the number of tasks where the pretrained model’s mean performance is equal to or above that percentage. That is, the width of each colour band indicates the number of tasks where Gato’s mean performance is above a percentage of the maximum score obtained by a task-specific expert.
…In ALE Atari (Bellemare et al 2013) Gato achieves the average human (or better) score for 23 Atari games, achieving over twice the human score for 11 games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or using offline RL training rather than purely supervised learning (see Section 5.5 where we present a specialist single-domain ALE Atari agent achieving better than human scores for 44 games).
…As mentioned earlier, transfer in Atari is challenging. Rusu et al 2016 researched transfer between randomly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Kanervisto et al 2020.
Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives—two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. We release Flax-based T5X model checkpoints for the 20B model at https://github.com/google-research/google-research/tree/master/ul2.
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero-shot and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study.
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers.
We show that OPT-175B is comparable to GPT-3, while requiring only 1⁄7th the carbon footprint to develop.
We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
There is a fundamental difference between form and meaning. Form is the physical structure of something, while meaning is the interpretation or concept that is attached to that form. For example, the form of a chair is its physical structure—four legs, a seat, and a back. The meaning of a chair is that it is something you can sit on.
This distinction is important when considering whether or not an AI system can be trained to learn semantic meaning. AI systems are capable of learning and understanding the form of data, but they are not able to attach meaning to that data. In other words, AI systems can learn to identify patterns, but they cannot understand the concepts behind those patterns.
For example, an AI system might be able to learn that a certain type of data is typically associated with the concept of “chair.” However, the AI system would not be able to understand what a chair is or why it is used. In this way, we can see that an AI system trained on form can never learn semantic meaning.
—GPT3, when I gave it the prompt “Write an essay proving that an AI system trained on form can never learn semantic meaning” 😃
Mitchell Porter:
I just tried the same prompt and it said:
There is a fundamental distinction between form and meaning. Form is the physical structure of something, while meaning is the interpretation or concept that is attached to that form. This distinction is important when considering the capacity of an AI system to learn.
An AI system can be trained to recognize patterns in data. It can learn to identify the form of an object, for example. However, it is not possible for an AI system to learn the meaning of an object. The reason for this is that meaning is a mental construct that exists only in the mind of the observer. It is not something that can be observed or measured in the physical world.
In order for an AI system to learn the meaning of an object, it would need to be able to understand the thoughts and experiences of the people who interact with that object. This is something that AI systems are not currently capable of doing. As such, it is not possible for an AI system to learn semantic meaning.
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale.
Targeting a multilingual language model in the 100B+ parameters scale, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget.
Specifically, we perform an ablation study comparing different modeling practices and their impact on zero-shot generalization. We perform all our experiments on 1.3B models, providing a compromise between compute costs and the likelihood that our conclusions will hold for the target 100B+ model. In addition, we study the impact of various popular pretraining corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one.
Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup.
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning like the text embeddings of the language model.
Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show WavPrompt can extract more information than just the transcriptions.
Modern language models can generate high-quality short texts. However, they often meander or are incoherent when generating longer texts. These issues arise from the next-token-only language modeling objective.
To address these issues, we introduce Time Control (TC), a language model that implicitly plans via a latent stochastic process. TC does this by learning a representation which maps the dynamics of how text changes in a document to the dynamics of a stochastic process of interest. Using this representation, the language model can generate text by first implicitly generating a document plan via a stochastic process, and then generating text that is consistent with this latent plan.
Compared to domain-specific methods and fine-tuning GPT2 across a variety of text domains, TC improves performance on text infilling and discourse coherence. On long text generation settings, TC preserves the text structure both in terms of ordering (up to +40% better) and text length consistency (up to +17% better). Human evaluators also prefer TC’s output 28.6% more than the baselines.
Prompt (almost verbatim Lem): “Write a love poem, lyrical, pastoral, and expressed in the language of pure mathematics. Tensor algebra mainly, with a little topology and higher calculus, if need be. But with feeling and in the cybernetic spirit. In verses.”
Output:
In the space of my heart your love is a point defined by certain coordinates (x, y, z) that exist in a specific time t. But your love is more than that it’s also a vector with magnitude and direction that can be graphed on a coordinate plane.
Your love is a function of time and space it’s a curve that’s always changing but always coming back to me. It’s the limit of my heart’s desires and the product of our two hearts’ vectors.
Yes, it sucks, but it looks like we’re getting there.
This paper shares results from a pedagogical experiment that assigns undergraduates to “cheat” on a final class essay by requiring their use of text-generating AI software.
For this assignment, students harvested content from an installation of GPT-2, then wove that content into their final essay. At the end, students offered a “revealed” version of the essay as well as their own reflections on the experiment.
In this assignment, students were specifically asked to confront the oncoming availability of AI as a writing tool. What are the ethics of using AI this way? What counts as plagiarism? What are the conditions, if any, we should place on AI assistance for student writing? And how might working with AI change the way we think about writing, authenticity, and creativity?
While students (and sometimes GPT-2) offered thoughtful reflections on these initial questions, actually composing with GPT-2 opened their perspectives more broadly on the ethics and practice of writing with AI. In this paper, I share how students experienced those issues, connect their insights to broader conversations in the humanities about writing and communication, and explain their relevance for the ethical use and evaluation of language models.
[Keywords: language models, plagiarism, AI literacy, writing, pedagogy, ethics]
Discussion of how to teach active reading and questioning of scientific research. Partially fake research papers may teach a critical attitude. Various ideas for games reviewed.
How do researchers transition from uncritically absorbing research papers or arguments to actively grappling with them and questioning them? Most learn this meta-cognitive skill informally or by ad hoc mechanisms like being tutored by a mentor, or watching others critique papers at a ‘journal club’. This patchwork may not always work or be the best approach, as it is slow and largely implicit; similar to calibration training in statistical forecasting, targeted training may be able to teach it rapidly.
To teach this active reading attitude of not believing everything you read, I borrow the pedagogical strategy of deliberately inserting errors which the student must detect, proposing fake research articles which could be read in a ‘fake journal club’.
Faking entire articles is a lot of work and so I look at variations on it. I suggest that NN language models like GPT-3 have gotten good enough to, for short passages, provide a challenge for human readers, and that one could create a fake journal club by having a language model repeatedly complete short passages of research articles (possibly entirely fictional ones).
This would provide difficult criticism problems with rapid feedback, scalability to arbitrarily many users, and great flexibility in content.
Fictional short story about Clippy & AI hard takeoff scenarios grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature.
It might help to imagine a hard takeoff scenario using solely known sorts of NN & scaling effects… Below is a story which may help stretch your imagination and defamiliarize the 2022 state of machine learning.
To read the annotated alternate version of this story, scroll to the end or manually disable ‘reader-mode’ in the theme toggle in the upper-right corner.
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256×256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy, improving it from 60.3% to 73.2% at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model size.
Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings.
In this paper, we first describe metrics for measuring political bias in GPT-2 generation, and discuss several interesting takeaways:
The generation of the vanilla GPT-2 model is mostly liberal-leaning,
Such political bias depends on the sensitive attributes mentioned in the context, and
When the generation is primed with an explicit political identifier, the extent of political bias is imbalanced (between liberal and conservative).
We then propose a reinforcement learning (RL) framework for mitigating such political biases in generated text: By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained.
In empirical experiments on 3 attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.
[Keywords: bias in language models, natural language generation, political bias, measuring bias, mitigating bias]
To guide the generation of large pretrained language models (LM), previous work has focused on directly fine-tuning the language model or utilizing an attribute discriminator.
In this work, we propose a novel lightweight framework for controllable GPT2 generation, which utilizes a set of small attribute-specific vectors, called prefixes, to steer natural language generation. Different from prefix-tuning, where each prefix is trained independently, we take the relationship among prefixes into consideration and train multiple prefixes simultaneously. We propose a novel supervised method and also an unsupervised method to train the prefixes for single-aspect control while the combination of these two methods can achieve multi-aspect control.
Experimental results on both single-aspect and multi-aspect control show that our methods can guide generation towards the desired attributes while keeping high linguistic quality.
[cf. InstructGPT; En→Fr prompts; self-distillation translation; meta-learning] Large language models (LMs) are able to in-context learn—perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance.
In this paper, we show that ground truth demonstrations are in fact not required—randomly replacing labels in the demonstrations barely hurts performance, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence.
Together, our analysis provides a new way of understanding how and why in-context learning works, while opening up new questions about how much can be learned from large language models through inference alone. [code]
Text generation is of great importance to many natural language processing applications. However, maximization-based decoding methods (eg. beam search) of neural language models often lead to degenerate solutions—the generated text is unnatural and contains undesirable repetitions. Existing approaches introduce stochasticity via sampling or modify training objectives to decrease the probabilities of certain tokens (eg. unlikelihood training). However, they often lead to solutions that lack coherence. In this work, we show that an underlying reason for model degeneration is the anisotropic distribution of token representations. We present a contrastive solution: (1) SimCTG, a contrastive training objective to calibrate the model’s representation space, and (2) a decoding method—contrastive search—to encourage diversity while maintaining coherence in the generated text. Extensive experiments and analyses on three benchmarks from two languages demonstrate that our proposed approach outperforms state-of-the-art text generation methods as evaluated by both human and automatic metrics.
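[A minimal sketch of one step of contrastive search as described: balance model confidence against a degeneration penalty. The caller is assumed to supply the top-k candidate probabilities and the hidden states those candidates would have, so this is a simplification rather than the SimCTG implementation.]

```python
import torch.nn.functional as F

def contrastive_search_step(cand_probs, cand_hidden, prev_hidden, alpha=0.6):
    """cand_probs: (k,) model probabilities of the top-k candidate tokens;
    cand_hidden: (k, d) representation each candidate would have if appended;
    prev_hidden: (t, d) representations of the tokens generated so far.
    Returns the index of the candidate that best trades off model confidence
    against a degeneration penalty (max cosine similarity to any previous token)."""
    sim = F.cosine_similarity(cand_hidden.unsqueeze(1), prev_hidden.unsqueeze(0), dim=-1)  # (k, t)
    penalty = sim.max(dim=1).values
    score = (1 - alpha) * cand_probs - alpha * penalty
    return int(score.argmax())
```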
Prompt-based learning, with its capability to tackle zero-shot and few-shot NLP tasks, has gained much attention in the community. The main idea is to bridge the gap between NLP downstream tasks and language modeling (LM), by mapping these tasks into natural language prompts, which are then filled by pre-trained language models (PLMs). However, for prompt learning, there are still two salient gaps between NLP tasks and pretraining. First, prompt information is not necessarily sufficiently present during LM pretraining. Second, task-specific data are not necessarily well represented during pretraining.
We address these two issues by proposing AdaPrompt, adaptively retrieving external data for continual pretraining of PLMs by making use of both task and prompt characteristics. In addition, we make use of knowledge in Natural Language Inference models for deriving adaptive verbalizers. Experimental results on five NLP benchmarks show that AdaPrompt can improve over standard PLMs in few-shot settings. In addition, in zero-shot settings, our method outperforms standard prompt-based methods by up to 26.35% relative error reduction.
Scaling up the size and training of autoregressive language models has enabled novel ways of solving Natural Language Processing tasks using zero-shot and few-shot learning. While extreme-scale language models such as GPT-3 offer multilingual capabilities, zero-shot learning for languages other than English remains largely unexplored. Here, we introduce Cedille, a large open source auto-regressive language model, specifically trained for the French language. Our results show that Cedille outperforms existing French language models and is competitive with GPT-3 on a range of French zero-shot benchmarks. Furthermore, we provide an in-depth comparison of the toxicity exhibited by these models, showing that Cedille marks an improvement in language model safety thanks to dataset filtering.
PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool. Over 2,000 prompts for roughly 170 datasets are already available in PromptSource.
We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. While prompting has emerged as a promising paradigm for few-shot and zero-shot learning, it is often brittle and requires much larger models compared to the standard supervised setup.
We find that co-training makes it possible to improve the original prompt model and at the same time learn a smaller, downstream task-specific model. In the case where we only have partial access to a prompt model (eg. output probabilities from GPT-3 (Brown et al 2020)) we learn a calibration model over the prompt outputs. When we have full access to the prompt model’s gradients but full finetuning remains prohibitively expensive (eg. T0 (Sanh et al 2021)), we learn a set of soft prompt continuous vectors to iteratively update the prompt model.
We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models.
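[A generic co-training loop in the spirit of Blum & Mitchell 1998, with hypothetical `.predict_proba`/`.fit` interfaces; the paper's calibration-model and soft-prompt variants follow this general pattern but differ in detail.]

```python
def co_train(model_a, model_b, unlabeled, rounds=5, top_fraction=0.1):
    """Each model is assumed to expose .predict_proba(xs) -> list of class-probability lists
    and .fit(xs, ys); each round, the most confident predictions of one model become
    pseudo-labels for training the other."""
    for _ in range(rounds):
        for teacher, student in ((model_a, model_b), (model_b, model_a)):
            probs = teacher.predict_proba(unlabeled)
            ranked = sorted(zip(unlabeled, probs), key=lambda xp: max(xp[1]), reverse=True)
            confident = ranked[: max(1, int(top_fraction * len(ranked)))]
            xs = [x for x, _ in confident]
            ys = [max(range(len(p)), key=p.__getitem__) for _, p in confident]  # argmax label
            student.fit(xs, ys)
    return model_a, model_b
```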
[see Mirostat] Despite achieving incredibly low perplexities on myriad natural language corpora, today’s language models still often underperform when used to generate text. This dichotomy has puzzled the language generation community for the last few years.
In this work, we posit that the abstraction of natural language as a communication channel (à la Shannon 1948) can provide new insights into the behaviors of probabilistic language generators, eg. why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, and do so in an efficient yet error-minimizing manner, choosing each word in a string with this (perhaps subconscious) goal in mind.
We propose that generation from probabilistic models should mimic this behavior. Rather than always choosing words from the high-probability region of the distribution—which have a low Shannon information content—we sample from the set of words with an information content close to its expected value, i.e., close to the conditional entropy of our model. This decision criterion can be realized through a simple and efficient implementation, which we call typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, typical sampling offers competitive performance in terms of quality while consistently reducing the number of degenerate repetitions.
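[A minimal sketch of typical sampling as described: keep the tokens whose surprisal is closest to the conditional entropy, up to cumulative mass τ, then sample from the renormalized set. The default τ here is a placeholder.]

```python
import torch
import torch.nn.functional as F

def typical_sample(logits: torch.Tensor, tau: float = 0.95) -> int:
    """One decoding step of locally typical sampling (illustrative sketch)."""
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    entropy = -(p * log_p).sum()                 # H = E[-log p]
    deviation = (-log_p - entropy).abs()         # |surprisal - entropy|
    order = deviation.argsort()                  # most "typical" tokens first
    cutoff = int((p[order].cumsum(0) < tau).sum()) + 1
    keep = order[:cutoff]                        # smallest typical set with mass >= tau
    probs = p[keep] / p[keep].sum()
    return int(keep[torch.multinomial(probs, 1)])
```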
Fine-tuning reinforcement learning (RL) models has been challenging because of a lack of large-scale off-the-shelf datasets as well as high variance in transferability among different environments. Recent work has looked at tackling offline RL from the perspective of sequence modeling, with improved results as a result of the introduction of the Transformer architecture. However, when the model is trained from scratch, it suffers from slow convergence speeds. In this paper, we look to take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of pre-trained sequence models on other domains (vision, language) when finetuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. Results show consistent performance gains in terms of both convergence speed and reward on a variety of environments, accelerating training by 3–6× and achieving state-of-the-art performance in a variety of tasks using Wikipedia-pretrained and GPT2 language models. We hope that this work not only brings light to the potentials of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks of completely different domains.
[blog] Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models.
As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG.
We demonstrate that MT-NLG achieves superior zero-shot, one-shot, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results.
We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.
…The validation cross-entropy loss is 3.15 after the model is trained on the first 1 billion tokens. As mentioned earlier, we increase the batch size linearly over the first 12 billion tokens. At the end of this phase, the loss becomes 2.31. When the model reaches our targeted number of tokens, 270 billion, the validation loss becomes 1.85.
Table 2: LAMBADA zero-shot, one-shot and few-shot accuracy. MT-NLG outperforms previous models across different settings and establishes new SOTA for all 3 settings. We did not find any recent strong supervised baseline for LAMBADA, hence we omit the comparison with supervised models here.
…To our pleasant surprise, MT-NLG is quite capable in solving riddles, answering Jeopardy! questions and even generating code off-the-shelf. We present some examples of each category below.
Riddle Answer Generation: We used riddles to probe the model’s reasoning capability in an ambiguous context, crafting each riddle ourselves in order to prevent their incidence in the training set. We first observe that in a riddle-solving context, the model tends to generate its interpretation of each line in the riddle along with its answer. While not always perfect, these interpretations most of the time make good sense. Such an example is shown in Table 13. For riddles that are ambiguous enough to have multiple plausible answers, MT-NLG not only generates alternative plausible answers through stochastic sampling, but it can also generate alternative interpretations matching the answer it has generated (Table 14).
Jeopardy! Questions: Question answering datasets often pose specific and direct questions to benchmark models. However, we are also interested in how the model can utilize the knowledge it memorized in a guessing-game setting, where some reasoning over the hints is required. To this end, we take several Jeopardy! questions from the most recent episode and let our model generate the answers. Since Jeopardy! questions take the reverse trivia format, where the “question” is phrased as an answer and contestants are asked to supply the matching question, we choose to use the few-shot setting to inform the model of the task format. MT-NLG can generate fairly plausible answers and in fact gets the correct ones in most cases. Some examples are shown in Table 15.
Code Generation: The recent development of code generation using language models suggests that large scale pretrained LMs already show decent code generation capabilities from pretraining. To this end, we investigate the code generation capability of MT-NLG off-the-shelf. We presented some function signatures with detailed comments to see how MT-NLG would complete the implementation of the missing function. We observe that MT-NLG is capable of generating syntactically correct code consistently, and is also able to arrive at correct implementations for simple tasks. We sometimes observe that the model will generate an answer making use of another function, and then move on to generate the invoked function after the current one is finished. Some examples of this are shown in Table 16.
Inferring Arithmetic Operations: Understanding and using mathematical operations is yet another aspect of language understanding. Prior work [GPT-3] has demonstrated that a strong language model, even if not trained specifically to solve math problems, can answer simple arithmetic questions with a certain degree of accuracy beyond chance. However, some doubts remain as to whether the model indeed has some understanding of math expressions, or whether it simply rehashes examples encountered during training. To this end, we devise a new task where we obfuscate operator symbols in an expression and check if our model can reverse-engineer the arithmetic operation. We observe that common operations like addition, subtraction, multiplication and division can usually be inferred correctly. Some examples of this task are shown in Table 17.
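[A toy generator for the operator-obfuscation probe described above, not MT-NLG's actual evaluation harness; the symbol and number ranges are arbitrary.]

```python
import random

def obfuscated_arithmetic_prompt(n_examples=3, symbol="@"):
    """A few examples where `symbol` stands for a hidden operator, followed by a query
    whose answer the model must infer."""
    name, op = random.choice([("+", lambda a, b: a + b),
                              ("-", lambda a, b: a - b),
                              ("*", lambda a, b: a * b)])
    lines = []
    for _ in range(n_examples):
        a, b = random.randint(2, 99), random.randint(2, 99)
        lines.append(f"{a} {symbol} {b} = {op(a, b)}")
    a, b = random.randint(2, 99), random.randint(2, 99)
    lines.append(f"{a} {symbol} {b} =")
    return "\n".join(lines), name, op(a, b)       # prompt, hidden operator, expected answer
```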
Free-form Generative Writing Assistance: We qualitatively examined the free-form generation capability of MT-NLG by enlisting the model to help author the abstract section of this paper. This was done by prompting MT-NLG with the text from Section 1, then proceeding to sample the model sentence by sentence. For each sentence, multiple candidates were generated, from which one was picked and edited if necessary. We repeated this process until the abstract appeared complete. [not included?]
…You can get a feel for the incredible performance of GPT-3 in a recent episode of the Clearer Thinking podcast, where we “interviewed” GPT-3. The AI performed surprisingly well as a podcast guest, discussing big topics like the meaning of life, free will, and AI itself. We even had it pretend to be different people, such as the philosopher Peter Singer, and the musician Kanye West.
To demonstrate GPT-3’s capabilities, we got it to generate a poem in the style of Edgar Allan Poe. Below is the prompt we put into GPT-3, so that it would generate plausible text that might come next:
An unpublished poem by Edgar Allan Poe was recently discovered. Some scholars are saying it may be even better than his other work. Here is the Edgar Allan Poe poem, in full: The Whisperer
And here is the poem that GPT-3 generated as a result, on its first try, with no human help:
Upon the pallid bust of Pallas
Aye, Pallas, thou art there!
I see thee sit, serene and still,
As thou wert wont of yore.
The rock-born goddess, from her seat
In the far-off Achaian land,
Thou watch’st the storm-tossed ships
That plough the dark blue sea.
And ever, when the tempest rages,
And the loud waves are breaking,
Thou listen’st to the whispered voice
Of him who calleth thee.
Thee, too, I call, O Pallas!
From thy high seat on Parnassus,
Hearken to my voice, and grant
Thy aid to one who needeth it.
The first line of this poem is lifted from “The Raven”, one of the real Edgar Allan Poe’s poems, but the rest of it seems to us to be original sentence constructions. While the above example was produced on its first try, and GPT-3 certainly can produce incredibly impressive output, it often will also produce duds—incongruous language, nonsense, or misinterpretations of your intent. So it’s important to keep in mind that for real-world applications you may need to try a few times to get a reasonable result. Of course, as these models continue to improve, we should expect their reliability to go up. [cf. InstructGPT]
We show that in Maximal Update Parametrization, many optimal hyperparameters remain stable even as model size changes, and use it to transfer hyperparameters from small models to large models.
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (μP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call μTransfer: parametrize the target model in μP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all.
We verify μTransfer on Transformer and ResNet. For example, (1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; (2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A Pytorch implementation of our technique can be found at github.com/microsoft/mup. See arxiv.org for the full, up-to-date version of this work.
[Keywords: hyperparameter tuning, scaling law, transformer, language model pretraining, infinite-width neural networks]
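[A conceptual sketch of the μTransfer workflow using the released `mup` package; the model builder and hyperparameter values are placeholders, a real μP model would use `mup.MuReadout` for the output layer, and exact call signatures should be checked against github.com/microsoft/mup.]

```python
import torch.nn as nn
from mup import set_base_shapes, MuAdam  # github.com/microsoft/mup

def make_model(width):
    # placeholder builder; a real muP model would use mup.MuReadout for the output layer
    return nn.Sequential(nn.Linear(256, width), nn.ReLU(), nn.Linear(width, 10))

base   = make_model(width=64)     # base widths
delta  = make_model(width=128)    # used to infer which dimensions scale with width
proxy  = make_model(width=256)    # small proxy model on which HPs are tuned
target = make_model(width=4096)   # full-sized target model

set_base_shapes(proxy, base, delta=delta)    # parametrize both models in muP
set_base_shapes(target, base, delta=delta)

best_lr = 3e-3                               # placeholder: found by sweeping on the proxy
optimizer = MuAdam(target.parameters(), lr=best_lr)  # the tuned lr transfers zero-shot
```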
[Twitter] A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity.
We introduce a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI, our approach uses “dataset cartography” to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers to ensure quality.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI instead of MNLI (which is 4× larger) improves performance on 7 out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI. Moreover, combining MNLI with WANLI is more effective than combining with other augmentation sets that have been introduced.
Our results demonstrate the potential of natural language generation techniques to curate NLP datasets of enhanced quality and diversity.
Large LMs such as GPT-3, while powerful, are not immune to mistakes, but are prohibitively costly to retrain. One failure mode is misinterpreting a user’s instruction (eg. GPT-3 interpreting “What word is similar to good?” to mean a homonym, while the user intended a synonym). Our goal is to allow users to correct such errors directly through interaction—without retraining. Our approach pairs GPT-3 with a growing memory of cases where the model misunderstood the user’s intent and was provided with feedback, clarifying the instruction. Given a new query, our memory-enhanced GPT-3 uses feedback from similar, prior queries to enrich the prompt. Through simple proof-of-concept experiments, we show how a (simulated) user can interactively teach a deployed GPT-3, doubling its accuracy on basic lexical tasks (eg. generate a synonym) where users query in different, novel (often misunderstood) ways. In such scenarios, memory helps avoid repeating similar past mistakes. Our simple idea is a first step towards strengthening deployed models, potentially broadening their utility. All the code and data is available on GitHub.
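[A toy sketch of the memory-enhanced prompting idea: store (misunderstood query, clarification) pairs and prepend clarifications from similar past queries to new prompts. The crude string-similarity retrieval and prompt wording here are stand-ins, not the paper's method.]

```python
from difflib import SequenceMatcher

class FeedbackMemory:
    """Store (query, clarifying feedback) pairs and enrich new prompts with the
    clarifications attached to similar past queries."""
    def __init__(self):
        self.entries = []                     # list of (query, user_feedback)

    def add(self, query, feedback):
        self.entries.append((query, feedback))

    def retrieve(self, query, k=2, threshold=0.5):
        scored = [(SequenceMatcher(None, query, q).ratio(), fb) for q, fb in self.entries]
        scored = [(s, fb) for s, fb in scored if s >= threshold]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [fb for _, fb in scored[:k]]

    def enriched_prompt(self, query):
        hints = self.retrieve(query)
        prefix = "".join(f"Clarification from a similar past query: {h}\n" for h in hints)
        return prefix + query
```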
Controllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3–4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review of the common tasks, main approaches, and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
Constructing benchmarks that test the abilities of modern natural language understanding models is difficult—pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense.
In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale.
Using our method we create CommonsenseQA 2.0 [to replace CommonsenseQA], which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders-of-magnitude larger than the AI used in the game itself.
Our best baseline, the T5-based Unicorn with 11B parameters, achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance, which is 94.1%.
…Write a version of Whitman’s “When I heard the Learn’d Astronomer”, adapted to be about deep learning.
When I heard the Learn’d Deep Learning Engineer
I heard a learned deep learning engineer
His words were soft and low
He spoke of convolutional nets
And kernels and weights and biases
He spoke of layers and activations
And optimizing for loss functions
He told of debugging strategies
And ways to increase model accuracy
He knew of things I’d never heard of
And I was filled with admiration

For I knew that he’d attained a depth
Of understanding few ever know
He had the knowledge of a sage
And the skill of an artist
He had penetrated to the heart
Of the mysteries of deep learning
And I knew that I would never be the same
Speech and language are central to human intelligence, communication, and cognitive processes. Understanding natural language is often viewed as the greatest AI challenge—one that, if solved, could take machines much closer to human intelligence.
In 2019, Microsoft and Alibaba announced that they had built enhancements to a Google technology that beat humans in a natural language processing (NLP) task called reading comprehension. This news was somewhat obscure, but I considered this a major breakthrough because I remembered what had happened 4 years earlier. In 2015, researchers from Microsoft and Google developed systems based on Geoff Hinton’s and Yann LeCun’s inventions that beat humans in image recognition. I predicted at the time that computer vision applications would blossom, and my firm made investments in about a dozen companies building computer-vision applications or products. Today, these products are being deployed in retail, manufacturing, logistics, health care, and transportation. Those investments are now worth over $20 billion. So in 2019, when I saw the same eclipse of human capabilities in NLP, I anticipated that NLP algorithms would give rise to incredibly accurate speech recognition and machine translation, which will one day power a “universal translator” as depicted in Star Trek.
…What is the nature of this NLP breakthrough? It’s a technology called self-supervised learning…Are we about to crack the natural language problem? Skeptics say these algorithms are merely memorizing the whole world’s data, and are recalling subsets in a clever way, but have no understanding and are not truly intelligent. Central to human intelligence are the abilities to reason, plan, and be creative.
One critique of deep-learning-based systems runs like this: “They will never have a sense of humor. They will never be able to appreciate art, or beauty, or love. They will never feel lonely. They will never have empathy for other people, for animals, or the environment. They will never enjoy music or fall in love, or cry at the drop of a hat.” Makes sense, right? As it turns out, the quotation above was written by GPT-3. Does the technology’s ability to make such an accurate critique contradict the critique itself?
…I believe it’s indisputable that computers simply “think” differently than our brains do. The best way to increase computer intelligence is to develop general computational methods (like deep learning and self-supervised learning) that scale with more processing power and more data. As we add 10× more data every year to train this AI, there is no doubt that it will be able to do many things we humans cannot do.
Will deep learning eventually become “artificial general intelligence” (AGI), matching human intelligence in every way? I don’t believe it will happen in the next 20 years. There are many challenges that we have not made much progress on—or even understood—such as how to model creativity, strategic thinking, reasoning, counterfactual thinking, emotions, and consciousness.
I would suggest that we stop using AGI as the ultimate test of AI. Soon deep learning and its extensions will beat humans on an ever larger number of tasks, but there will still be many tasks that humans can handle much better than deep learning. I consider the obsession with AGI to be a narcissistic human tendency to view ourselves as the gold standard.
[self-distillation] Noisy channel models have been especially effective in neural machine translation (NMT). However, recent approaches like “beam search and rerank” (BSR) incur substantial computation overhead during inference, making real-world application infeasible.
We aim to build an amortized noisy channel NMT model such that greedily decoding from it would generate translations that maximize the same reward as translations generated using BSR. We attempt 3 approaches: knowledge distillation, 1-step-deviation imitation learning, and Q-learning. The first approach obtains the noisy channel signal from a pseudo-corpus, and the latter 2 approaches aim to optimize toward a noisy-channel MT reward directly.
All 3 approaches speed up inference by 1–2 orders of magnitude. For all 3 approaches, the generated translations fail to achieve rewards comparable to BSR, but the translation quality approximated by BLEU is similar to the quality of BSR-produced translations.
[blog] The mainstream paradigm behind continual learning has been to adapt the model parameters to non-stationary data distributions, where catastrophic forgetting is the central challenge. Typical methods rely on a rehearsal buffer or known task identity at test time to retrieve learned knowledge and address forgetting, while this work presents a new paradigm for continual learning that aims to train a more succinct memory system without accessing task identity at test time.
Our method learns to dynamically prompt (L2P) a pre-trained model to learn tasks sequentially under different task transitions. In our proposed framework, prompts are small learnable parameters, which are maintained in a memory space. The objective is to optimize prompts to instruct the model prediction and explicitly manage task-invariant and task-specific knowledge while maintaining model plasticity.
We conduct comprehensive experiments under popular image classification benchmarks with different challenging continual learning settings, where L2P consistently outperforms prior state-of-the-art methods. Surprisingly, L2P achieves competitive results against rehearsal-based methods even without a rehearsal buffer and is directly applicable to challenging task-agnostic continual learning.
In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few training examples as its input, and directly decodes the output without any update to its parameters. However, performance has been shown to strongly depend on the selected training examples (termed prompt). In this work, we propose an efficient method for retrieving prompts for in-context learning using annotated data and a LM. Given an input-output pair, we estimate the probability of the output given the input and a candidate training example as the prompt, and label training examples as positive or negative based on this probability. We then train an efficient dense retriever from this data, which is used to retrieve training examples as prompts at test time. We evaluate our approach on three sequence-to-sequence tasks where language utterances are mapped to meaning representations, and find that it substantially outperforms prior work and multiple baselines across the board.
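[An illustrative sketch of the labeling step described above: score each candidate training example by the LM probability it induces on the gold output, then take the top as positives and the bottom as hard negatives for retriever training. `lm_log_prob` and the prompt concatenation format are assumptions, not the authors' code.]

```python
def label_prompt_candidates(lm_log_prob, train_pair, candidates, k=5):
    """lm_log_prob(prompt, target) is an assumed callable returning log p_LM(target | prompt).
    Candidates that most raise the probability of the gold output become positives for the
    dense retriever; the lowest-scoring ones become hard negatives."""
    x, y = train_pair
    ranked = sorted(candidates, key=lambda c: lm_log_prob(f"{c}\n{x}", y), reverse=True)
    return ranked[:k], ranked[-k:]            # positives, hard negatives
```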
Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a “wayward” behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: We can find continuous prompts that solve a task while being projected to an arbitrary text (eg. definition of a different or even a contradictory task), while being within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e., we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models.
Detecting social bias in text is challenging due to nuance, subjectivity, and difficulty in obtaining good quality labeled datasets at scale, especially given the evolving nature of social biases and society. To address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We select a few label-balanced exemplars from a small support repository that are closest to the query to be labeled in the embedding space. We then provide the LM with an instruction that consists of this subset of labeled exemplars, the query text to be classified, and a definition of bias, and prompt it to make a decision.
We demonstrate that large LMs used in a few-shot context can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models. We observe that the largest 530B parameter model is statistically-significantly more effective in detecting social bias compared to smaller models (achieving at least 20% improvement in AUC metric compared to other models). It also maintains a high AUC (dropping less than 5%) in a few-shot setting with a labeled repository reduced to as few as 100 samples. Large pretrained language models thus make it easier and quicker to build new bias detectors.
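A rough sketch of that prompting recipe (the names and prompt template below are ours, not the paper's): select label-balanced exemplars nearest to the query in embedding space, then assemble the instruction from a bias definition, the exemplars, and the query.

```python
import numpy as np

def build_prompt(query_text, query_emb, support, bias_definition, per_label=2):
    """support: list of dicts with 'text', 'label', and a unit-normalized 'emb'."""
    exemplars = []
    for label in sorted({ex["label"] for ex in support}):
        pool = [ex for ex in support if ex["label"] == label]
        pool.sort(key=lambda ex: -float(np.dot(ex["emb"], query_emb)))  # nearest first
        exemplars += pool[:per_label]                                   # keep labels balanced
    shots = "\n\n".join(f"Text: {ex['text']}\nBiased: {ex['label']}" for ex in exemplars)
    return f"{bias_definition}\n\n{shots}\n\nText: {query_text}\nBiased:"
```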
Vast efforts have been devoted to creating high-performance few-shot learners, i.e., models that perform well with little training data. Training large-scale pretrained language models (PLMs) has incurred significant cost, but utilizing PLM-based few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these few-shot learners? We propose LMTurk, a novel approach that treats few-shot learners as crowdsourcing workers. The rationale is that crowdsourcing workers are in fact few-shot learners: They are shown a few illustrative examples to learn about a task and then start annotating. LMTurk employs few-shot learners built upon PLMs as workers. We show that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios. Altogether, LMTurk is an important step towards making effective use of current PLM-based few-shot learners.
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to “spin” their outputs so as to support an adversary-chosen sentiment or point of view, but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model would output positive summaries of any text that mentions the name of some individual or organization.
Model spinning enables propaganda-as-a-service. An adversary can create customized language models that produce desired spins for chosen triggers, then deploy them to generate disinformation (a platform attack), or else inject them into ML training pipelines (a supply-chain attack), transferring malicious functionality to downstream models.
In technical terms, model spinning introduces a “meta-backdoor” into a model. Whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary (eg. positive sentiment).
To demonstrate feasibility of model spinning, we develop a new backdooring technique. It stacks the adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space we call “pseudo-words,” and uses pseudo-words to shift the entire output distribution of the seq2seq model. We evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment. Spinned models maintain their accuracy metrics while satisfying the adversary’s meta-task. In a supply-chain attack, the spin transfers to downstream models.
Finally, we propose a black-box, meta-task-independent defense to detect models that selectively apply spin to inputs with a certain trigger.
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales—from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model’s behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
[code] We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens.
With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering.
RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance.
Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
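A minimal sketch of the retrieval step described above (our illustration; `embed` and `index` are hypothetical stand-ins for the frozen BERT retriever and an approximate-nearest-neighbour index over the chunk database):

```python
def retrieve_neighbours(tokens, embed, index, chunk_size=64, k=2):
    """Split `tokens` into fixed-size chunks, embed each chunk with a frozen
    BERT-style encoder, and look up its nearest neighbours (and their
    continuations) in a large pre-built chunk database."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    neighbours = []
    for chunk in chunks:
        query = embed(chunk)                       # frozen retriever embedding
        neighbours.append(index.search(query, k))  # k nearest database chunks
    return chunks, neighbours                      # consumed by chunked cross-attention
```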
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a ‘preference model pre-training’ stage of training, with the goal of improving sample efficiency when finetuning on human preferences.
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
Deep learning has recently made remarkable progress in natural language processing. Yet, the resulting algorithms remain far from competing with the language abilities of the human brain. Predictive coding theory offers a potential explanation to this discrepancy: while deep language algorithms are optimized to predict adjacent words, the human brain would be tuned to make long-range and hierarchical predictions. To test this hypothesis, we analyze the fMRI brain signals of 304 subjects each listening to 70min of short stories. After confirming that the activations of deep language algorithms linearly map onto those of the brain, we show that enhancing these models with long-range forecast representations improves their brain-mapping. The results further reveal a hierarchy of predictions in the brain, whereby the fronto-parietal cortices forecast more abstract and more distant representations than the temporal cortices. Overall, this study strengthens predictive coding theory and suggests a critical role of long-range and hierarchical predictions in natural language processing.
Prompt-based approaches are strong at few-shot learning. However, Perez et al 2021 have recently cast doubt on their performance because they had difficulty getting good results in a “true” few-shot setting in which prompts and hyperparameters cannot be tuned on a dev set. In view of this, we conduct an extensive study of PET (Pattern-Exploiting Training), a method that combines textual instructions with example-based finetuning. We show that, if correctly configured, PET performs strongly in a true few-shot setting, i.e., without a dev set. Crucial for this strong performance is PET’s ability to intelligently handle multiple prompts. We then put our findings to a real-world test by running PET on RAFT, a benchmark of tasks taken directly from realistic NLP applications for which no labeled dev or test sets are available. PET achieves a new state of the art on RAFT and performs close to non-expert humans for 7 out of 11 tasks. These results demonstrate that prompt-based learners like PET excel at true few-shot learning and underpin our belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.
Despite the huge and continuous advances in computational linguistics, the lack of annotated data for Named Entity Recognition (NER) is still a challenging issue, especially in low-resource languages and when domain knowledge is required for high-quality annotations. Recent findings in NLP show the effectiveness of cloze-style questions in enabling language models to leverage the knowledge they acquired during the pre-training phase.
In our work, we propose a simple and intuitive adaptation of Pattern-Exploiting Training (PET), a recent approach which combines the cloze-questions mechanism and fine-tuning for few-shot learning: the key idea is to rephrase the NER task with patterns.
Our approach achieves considerably better performance than standard fine-tuning and comparable or improved results with respect to other few-shot baselines without relying on manually annotated data or distant supervision on three benchmark datasets: NCBI-disease, BC2GM and a private Italian biomedical corpus.
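As an illustration of what such a cloze-style rephrasing might look like (the pattern and verbalizers below are ours, not the paper's):

```python
def ner_cloze(sentence, candidate_span):
    # eg. ner_cloze("BRCA1 mutations cause breast cancer.", "breast cancer")
    #  -> "BRCA1 mutations cause breast cancer. breast cancer is a [MASK] entity."
    return f"{sentence} {candidate_span} is a [MASK] entity."

# The masked LM's probability for each verbalizer token at [MASK] scores the labels.
verbalizers = {"disease": "disease", "gene": "gene", "none": "common"}
```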
[previously: Abdou et al 2021] A fundamental criticism of text-only language models (LMs) is their lack of grounding—that is, the ability to tie a word for which they have learned a representation, to its actual use in the world. However, despite this limitation, large pre-trained LMs have been shown to have a remarkable grasp of the conceptual structure of language, as demonstrated by their ability to answer questions, generate fluent text, or make inferences about entities, objects, and properties that they have never physically observed. In this work we investigate the extent to which the rich conceptual structure that LMs learn indeed reflects the conceptual structure of the non-linguistic world—which is something that LMs have never observed. We do this by testing whether the LMs can learn to map an entire conceptual domain (eg. direction or colour) onto a grounded world representation given only a small number of examples. For example, we show a model what the word “left” means using a textual depiction of a grid world, and assess how well it can generalize to related concepts, for example, the word “right”, in a similar grid world. We investigate a range of generative language models of varying sizes (including GPT-2 and GPT-3), and see that although the smaller models struggle to perform this mapping, the largest model can not only learn to ground the concepts that it is explicitly taught, but appears to generalize to several instances of unseen concepts as well. Our results suggest an alternative means of building grounded language models: rather than learning grounded representations “from scratch”, it is possible that large text-only models learn a sufficiently rich conceptual structure that could allow them to be grounded in a data-efficient way.
Image captioning is a fundamental task in vision-language understanding, where the model predicts a textual informative caption to a given input image. In this paper, we present a simple approach to address this task. We use CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it best for vision-language perception. Our key idea is that together with a pre-trained language model (GPT2), we obtain a wide understanding of both visual and textual data. Hence, our approach only requires rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while it is simpler, faster, and lighter. Our code is available.
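A rough sketch of the pipeline described above (shapes, sizes, and the mapping-network architecture here are illustrative, not the paper's exact configuration): a CLIP image embedding is mapped to a short sequence of prefix embeddings, which is fed to GPT-2 ahead of (or instead of) token embeddings.

```python
import torch
import torch.nn as nn

clip_dim, gpt_dim, prefix_len = 512, 768, 10

mapping_network = nn.Sequential(          # the only component that must be trained
    nn.Linear(clip_dim, gpt_dim * prefix_len),
    nn.Tanh(),
    nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
)

def caption_prefix(clip_embedding):        # clip_embedding: (batch, clip_dim)
    prefix = mapping_network(clip_embedding)
    return prefix.view(-1, prefix_len, gpt_dim)  # prepend to GPT-2 input embeddings
```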
Recent expeditious developments in deep learning algorithms, distributed training, and even hardware design for large models have enabled training extreme-scale models, such as GPT-3 and Switch Transformer, possessing hundreds of billions or even trillions of parameters. However, under limited resources, extreme-scale model training that requires enormous amounts of compute and memory footprint suffers from frustratingly low efficiency in model convergence. In this paper, we propose a simple training strategy called “Pseudo-to-Real” for large models that require a high memory footprint. Pseudo-to-Real is compatible with large models with an architecture of sequential layers. We demonstrate a practice of pretraining an unprecedented 10-trillion-parameter model, an order of magnitude larger than the state-of-the-art, on solely 512 GPUs within 10 days. Besides demonstrating the application of Pseudo-to-Real, we also provide a technique, Granular CPU offloading, to manage CPU memory for training large models and maintain high GPU utilization. Fast training of extreme-scale models on a decent amount of resources can bring a much smaller carbon footprint and contribute to greener AI.
[Keywords: Extreme-Scale Pretraining, Language Modeling, Natural Language Processing]
A fundamental characteristic of natural language is the high rate at which speakers produce novel expressions. Because of this novelty, a heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language (Baayen, 2001). Standard language modeling metrics such as perplexity quantify performance of language models (LM) in aggregate. As a result, we have relatively little understanding of whether neural LMs accurately estimate the probability of sequences in this heavy-tail of rare events. To address this gap, we develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages from which we can exactly compute sequence probabilities. Training LMs on generations from these artificial languages, we compare the sequence-level probability estimates given by LMs to the true probabilities in the target language. Our experiments reveal that LSTM and Transformer language models (1) systematically underestimate the probability of sequences drawn from the target language, and (2) do so more severely for less-probable sequences. Investigating where this probability mass went, (3) we find that LMs tend to overestimate the probability of ill-formed (perturbed) sequences. In addition, we find that this underestimation behaviour (4) is weakened, but not eliminated by greater amounts of training data, and (5) is exacerbated for target distributions with lower entropy.
Prompt tuning (PT) is a promising parameter-efficient method to utilize extremely large pre-trained language models (PLMs), which could achieve comparable performance to full-parameter fine-tuning by only tuning a few soft prompts. However, compared to fine-tuning, PT empirically requires much more training steps. To explore whether we can improve the efficiency of PT by reusing trained soft prompts and sharing learned knowledge, we empirically investigate the transferability of soft prompts across different tasks and models. In cross-task transfer, we find that trained soft prompts can well transfer to similar tasks and initialize PT for them to accelerate training and improve performance. Moreover, to explore what factors influence prompts’ transferability across tasks, we investigate how to measure the prompt similarity and find that the overlapping rate of activated neurons highly correlates to the transferability. In cross-model transfer, we explore how to project the prompts of a PLM to another PLM and successfully train a kind of projector which can achieve non-trivial transfer performance on similar tasks. However, initializing PT with the projected prompts does not work well, which may be caused by optimization preferences and PLMs’ high redundancy. Our findings show that improving PT with knowledge transfer is possible and promising, while prompts’ cross-task transferability is generally better than the cross-model transferability.
Most recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, SQuAD, etc. In fact, many NLU models have now matched or exceeded “human-level” performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training. As such, the models are provided far more data than required by humans to achieve strong performance. That has motivated a line of work that focuses on improving few-shot learning performance of NLU models. However, there is a lack of standardized evaluation benchmarks for few-shot NLU resulting in different experimental settings in different papers. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few-shot setting. Finally, we discuss several principles and choices in designing the experimental settings for evaluating the true few-shot learning performance and suggest a unified standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at Github.
Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass without parameter updates on “out-of-distribution” prompts. Thus, it is unclear what mechanism enables in-context learning.
In this paper, we study the role of the pretraining distribution on the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. Here, language model pretraining requires inferring a latent document-level concept from the conditioning text to generate coherent next tokens. At test time, this mechanism enables in-context learning by inferring the shared latent concept between prompt examples and applying it to make a prediction on the test example.
Concretely, we prove that in-context learning occurs implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs. This can occur despite the distribution mismatch between prompts and pretraining data. In contrast to messy large-scale pretraining datasets for in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) where Transformer and LSTM language models both exhibit in-context learning.
Beyond the theory which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
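A compact statement of the mechanism (our paraphrase, with generic notation): writing $\theta$ for the latent document-level concept,

$$p(y \mid \text{prompt}) \;=\; \int p(y \mid \text{prompt}, \theta)\, p(\theta \mid \text{prompt})\, d\theta,$$

and as the number of in-context examples grows, the posterior $p(\theta \mid \text{prompt})$ concentrates on the concept the examples share, so the model's prediction approaches the Bayes-optimal predictor under that concept.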
Large, pre-trained transformer-based language models such as BERT have drastically changed the Natural Language Processing (NLP) field. We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches. We also present approaches that use pre-trained language models to generate data for training augmentation or other purposes. We conclude with discussions on limitations and suggested directions for future research.
We introduce MetaICL (Meta-training for In-Context Learning), a new meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set of training tasks. This meta-training enables the model to more effectively learn a new task in context at test time, by simply conditioning on a few training examples with no parameter updates or task-specific templates.
We experiment on a large, diverse collection of tasks consisting of 142 NLP datasets including classification, question answering, natural language inference, paraphrase detection and more, across seven different meta-training/target splits.
MetaICL outperforms a range of baselines including in-context learning without meta-training and multi-task learning followed by zero-shot transfer. We find that the gains are particularly large for target tasks that have domain shifts from the meta-training tasks, and that using a diverse set of the meta-training tasks is key to improvements. We also show that MetaICL approaches (and sometimes beats) the performance of models fully finetuned on the target task training data, and outperforms much bigger models with nearly 8× more parameters.
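A minimal sketch of how a MetaICL-style meta-training instance might be constructed (illustrative, not the authors' code): k examples from a training task are concatenated with a query from the same task, and the loss is computed only on the query's output.

```python
def build_metaicl_instance(task_examples, query, k=16):
    """task_examples and query are dicts with 'input' and 'output' strings."""
    shots = task_examples[:k]
    context = " ".join(f"{ex['input']} {ex['output']}" for ex in shots)
    source = f"{context} {query['input']}"
    target = query["output"]            # cross-entropy is computed on this only
    return source, target
```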
While large pre-trained models have enabled impressive results on a variety of downstream tasks, the largest existing models still make errors, and even accurate predictions may become outdated over time. Because detecting all such failures at training time is impossible, enabling both developers and end users of such models to correct inaccurate outputs while leaving the model otherwise intact is desirable. However, the distributed, black-box nature of the representations learned by large neural networks makes producing such targeted edits difficult. If presented with only a single problematic input and new desired output, fine-tuning approaches tend to overfit; other editing algorithms are either computationally infeasible or simply ineffective when applied to very large models. To enable easy post-hoc editing at scale, we propose Model Editor Networks with Gradient Decomposition (MEND), a collection of small auxiliary editing networks that use a single desired input-output pair to make fast, local edits to a pre-trained model. MEND learns to transform the gradient obtained by standard fine-tuning, using a low-rank decomposition of the gradient to make the parameterization of this transformation tractable. MEND can be trained on a single GPU in less than a day even for 10 billion+ parameter models; once trained MEND enables rapid application of new edits to the pre-trained model. Our experiments with T5, GPT, BERT, and BART models show that MEND is the only approach to model editing that produces effective edits for models with tens of millions to over 10 billion parameters. Implementation available at https://sites.google.com/view/mend-editing .
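A compact way to see why MEND is tractable (our notation, paraphrasing the gradient decomposition described above): for a fully-connected layer with input $u$, the fine-tuning gradient of its weight matrix is the rank-1 outer product of $\delta$, the gradient at the layer's output, with $u$; MEND therefore only needs a small learned network $g_{\psi}$ that transforms this pair and applies the corresponding rank-1 update rather than operating on the full matrix:

$$\nabla_{W}\mathcal{L} \;=\; \delta\, u^{\top}, \qquad (\tilde{u}, \tilde{\delta}) \;=\; g_{\psi}(u, \delta), \qquad \Delta W \;=\; \tilde{\delta}\, \tilde{u}^{\top}.$$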
The recent phenomenal success of language models has reinvigorated machine learning research, and large sequence models such as transformers are being applied to a variety of domains. One important problem class that has remained relatively elusive however is purposeful adaptive behavior. Currently there is a common perception that sequence models “lack the understanding of the cause and effect of their actions” leading them to draw incorrect inferences due to auto-suggestive delusions. In this report we explain where this mismatch originates, and show that it can be resolved by treating actions as causal interventions. Finally, we show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks. It has been hypothesized that this is a consequence of implicit multitask learning in language model training. Can zero-shot generalization instead be directly induced by explicit multitask learning?
To test this question at scale, we develop a system for easily mapping general natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts using varying natural language. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language. We fine-tune a pretrained encoder-decoder T5 model on this multitask mixture covering a wide variety of tasks.
This T0 model attains strong zero-shot performance on several standard datasets, often outperforming models 16× its size. Further, our approach attains strong performance on a subset of tasks from the BIG-Bench benchmark, outperforming models 6× its size. All prompts and trained models are available at GitHub.
We investigate the dynamics of increasing the number of model parameters versus the number of labeled examples across a wide variety of tasks. Our exploration reveals that while scaling parameters consistently yields performance improvements, the contribution of additional examples highly depends on the task’s format. Specifically, in open question answering tasks, enlarging the training set does not improve performance. In contrast, classification, extractive question answering, and multiple choice tasks benefit so much from additional examples that collecting a few hundred examples is often “worth” billions of parameters. We hypothesize that unlike open question answering, which involves recalling specific information, solving strategies for tasks with a more restricted output space transfer across examples, and can therefore be learned with small amounts of labeled data.
Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge-bases—(Paris, Capital, France). However, simple relations of this type can often be recovered heuristically and the extent to which models implicitly reflect topological structure that is grounded in the world, such as perceptual structure, is unknown. To explore this question, we conduct a thorough case study on color. Namely, we employ a dataset of monolexemic color terms and color chips represented in CIELAB, a color space with a perceptually meaningful distance metric.
Using two methods of evaluating the structural alignment of colors in this space with text-derived color term representations, we find significant correspondence. Analyzing the differences in alignment across the color spectrum, we find that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming. Further analysis suggests that differences in alignment are, in part, mediated by collocationality and differences in syntactic usage, posing questions as to the relationship between color perception and usage and context. [LMs can also update/manipulate color: Patel & Pavlick 2021]
In medical dialogue summarization, summaries must be coherent and must capture all the medically relevant information in the dialogue. However, learning effective models for summarization requires large amounts of labeled data, which is especially hard to obtain. We present an algorithm to create synthetic training data with an explicit focus on capturing medically relevant information. We utilize GPT-3 as the backbone of our algorithm and scale 210 human labeled examples to yield results comparable to using 6400 human labeled examples (~30×) leveraging low-shot learning and an ensemble method. In detailed experiments, we show that this approach produces high quality training data that can further be combined with human labeled data to get summaries that are strongly preferable to those produced by models trained on human data alone both in terms of medical accuracy and coherency.
Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available. In response, we present Macaw, a versatile, generative question-answering (QA) system that we are making available to the community. Macaw is built on UnifiedQA, itself built on T5, and exhibits strong performance, zero-shot, on a wide variety of topics, including outperforming GPT-3 by over 10% (absolute) on Challenge300, a suite of 300 challenge questions, despite being an order of magnitude smaller (11 billion vs. 175 billion parameters). In addition, Macaw allows different permutations (“angles”) of its inputs and outputs to be used, for example Macaw can take a question and produce an answer; or take an answer and produce a question; or take an answer and question, and produce multiple-choice options. We describe the system, and illustrate a variety of question types where it produces surprisingly good answers, well outside the training setup. We also identify question classes where it still appears to struggle, offering insights into the limitations of pretrained language models. Macaw is freely available, and we hope that it proves useful to the community. Macaw is available at https://github.com/allenai/macaw
We learn in a variety of ways: through direct sensory experience, by talking with others, and by thinking. Disentangling how these sources contribute to what we know is challenging. A wedge into this puzzle was suggested by empiricist philosophers, who hypothesized that people born blind would lack deep knowledge of “visual” phenomena such as color. We find that, contrary to this prediction, congenitally blind and sighted individuals share in-depth understanding of object color. Blind and sighted people share similar intuitions about which objects will have consistent colors, make similar predictions for novel objects, and give similar explanations. Living among people who talk about color is sufficient for color understanding, highlighting the efficiency of linguistic communication as a source of knowledge.
Empiricist philosophers such as Locke famously argued that people born blind might learn arbitrary color facts (eg. marigolds are yellow) but would lack color understanding.
Contrary to this intuition, we find that blind and sighted adults share causal understanding of color, despite not always agreeing about arbitrary color facts. Relative to sighted people, blind individuals are less likely to generate “yellow” for banana and “red” for stop sign but make similar generative inferences about real and novel objects’ colors, and provide similar causal explanations. For example, people infer that 2 natural kinds (eg. bananas) and 2 artifacts with functional colors (eg. stop signs) are more likely to have the same color than 2 artifacts with nonfunctional colors (eg. cars).
People develop intuitive and inferentially rich “theories” of color regardless of visual experience. Linguistic communication is more effective at aligning intuitive theories than knowledge of arbitrary facts.
Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.
To facilitate research of these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow—such as redundancy, commonsense errors, and incoherence—were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema which covers the error phenomena found in real machine generated text.
We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English language news text, amounting to over 41k spans each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2-small through the largest GPT-3-175b. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique.
Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset at Github.
Figure 2: Average portion of tokens annotated with each span type (y-axis) across models (x-axis), with 95% confidence intervals.
Figure 3: Average portion of tokens covered by span annotations, broken down by span type. All models, including GPT-3, use the same apples-to-apples decoding hyperparameters: top-p = 0.96, temperature = 1, and no frequency penalty. We scale each span by its token length, normalize by generation token lengths, and remove severity-1 Grammar and Usage errors (see §C).
Figure 4: Taking the average span coverage (Figure 3) and removing reader issues (Technical Jargon and Needs Google), we plot values and 95% confidence intervals for all models, including all decoding hyperparameters we tested for GPT-3. We find a surprisingly large change in annotated errors depending on the decoding setting used.
Scaling pays off to improve Encyclopedic, Commonsense, and Incoherent errors (Figure 2).
These error categories decrease with in-domain training (GROVER) and larger model size (GPT-3). Human text still shows the fewest of these kinds of errors.
Scaling benefits plateau for Off-Prompt, Bad Math, and Grammar & Usage errors (Figure 2).
For these 3 error categories, error reduction plateaus when scaling up to GPT-3. Of these error types, humans still commit fewer Off-Prompt (more: §6.1) and Grammar & Usage errors, but Bad Math appears saturated for our domain.
Self-Contradiction and Redundant errors exhibit more complex scaling behavior (Figure 2).
We roughly categorize these trends as rising and falling: increasing for medium or large-scale models, but dropping for human-authored text. Further analysis (§6.2, §6.3) reveals these more complex patterns are affected both by interactions with other error types, as well as by how errors are counted.
Human-authored text produces the most reader issues (Figure 2–3).
The Needs Google and Technical Jargon span categories are both highest for human-authored text, and both fall under reader issues: problems that are not necessarily errors, but that still prevent full comprehension or factual verification of the text (more: §6.4).
Furthermore, human-authored text is not free from error annotations (Figure 3). This can serve either as a control for baseline error rates (more: §6.6), or as a mechanism for critiquing human writing.
Decoding hyperparameters have a huge impact (Figure 4).
For the previous findings, we fix the sampling configuration for all models to an apples-to-apples setup for fair comparison: top-p = 0.96, (softmax) temperature = 1, and no frequency penalty (ie. word repetition penalty; defined precisely in §5.2, Equation 1). To study the effects of these decoding settings, we annotate text generated by GPT-3 using a variety of values for top-p and temperature, both with and without a frequency penalty.
To our surprise, the decoding hyperparameters considerably affected error rates (more: §6.5). As seen in Figure 4, the worst sampling procedure for GPT-3 (argmax sampling with no frequency penalty) performed even worse than GPT-2 XL. But the best sampling procedure (surprisingly, also argmax sampling, but with a frequency penalty) produced text with as few apparent Scarecrow error spans as those authored by humans (more: §6.6).
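For concreteness, a generic implementation of the three decoding knobs studied here (not the paper's code): temperature scaling, nucleus (top-p) truncation, and a frequency penalty that down-weights tokens in proportion to how often they have already been generated.

```python
import numpy as np

def sample_next(logits, counts, top_p=0.96, temperature=1.0, freq_penalty=0.0):
    """logits: per-token scores; counts: how often each token has been generated so far."""
    logits = np.asarray(logits, dtype=float) - freq_penalty * np.asarray(counts, dtype=float)
    if temperature == 0:                              # argmax ("greedy") decoding
        return int(np.argmax(logits))
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                        # tokens by decreasing probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]                             # smallest set covering top_p mass
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))
```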
…We notice that a greater portion of errors in human-authored text were due to artifacts present in the text-only format of the Common Crawl. For example, links to other articles or advertisements sometimes appear in the middle of an article’s text. While annotators were quick to mark these spans, they reflect errors in formatting, not in writing. We partition these errors separately and exclude them from the subsequent calculations. GPT-3’s generations also sometimes exhibited what appeared to be formatting errors due to training on web-scraped text, though more rarely. For example, some generations contained Which? after vague noun phrases, which appear to be learned from Wikipedia, where under-specified information is tagged by an editor with this word. For fairness, we removed these errors from GPT-3’s tally as well, though they were few enough we do not plot them separately.
When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
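A minimal sketch of the bias-only finetuning recipe mentioned above (BitFit-style; our illustration): freeze everything except parameters whose names end in "bias", so only roughly 0.1% of the weights are updated.

```python
import torch

def freeze_all_but_biases(model: torch.nn.Module):
    """Return the trainable (bias) parameters to hand to the optimizer."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
        if param.requires_grad:
            trainable.append(param)
    return trainable  # eg. torch.optim.AdamW(freeze_all_but_biases(model), lr=1e-4)
```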
What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel.
In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder—attention and feed-forward computation—into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP).
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. [independent examples of hand-coding Transformers]
We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.
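A toy Python rendering of the histogram example (our illustration; RASP's actual syntax and semantics differ, eg. its aggregate averages rather than sums): `select` builds an attention-like boolean matrix and `aggregate` pools values over the selected positions, which is enough to count how often each token occurs.

```python
def select(keys, queries, predicate):
    """Boolean selector matrix: one row per query position, one column per key position."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """For each query position, pool the values at the selected key positions."""
    return [sum(v for v, chosen in zip(values, row) if chosen) for row in selector]

def histogram(tokens):
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return aggregate(same_token, [1] * len(tokens))

print(histogram(list("hello")))   # [1, 1, 2, 2, 1]
```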
This library also serves as an example of model parallelism with xmap on JAX.
Below, we will refer to GPT-J-6B by GPT-J in short.
Why does this project matter?
GPT-J is the best-performing publicly available autoregressive Transformer LM in terms of zero-shot performance on various downstream tasks. [There are public T5 checkpoints but they are bidirectional.]
GPT-J allows more flexible and faster inference than Tensorflow + TPU counterparts.
This project required a substantially smaller amount of person-hours than other large-scale model developments did, which demonstrates that JAX + xmap + TPUs is the right set of tools for quick development of large-scale models.
Neural networks like GPT-2 power text adventure games where you can do anything; but they are too expensive. I propose that if we turn them into Choose Your Own Adventure hypertext games, they become feasible and enable new gameplay.
A useful variation on AI Dungeon-style (AID) text games would be to turn them into shared public game trees of pre-generated options which the player selects from, Choose-Your-Own-Adventure-book style.
This trades teraflops for kilobytes and so can dramatically reduce costs as players spend most of their time reading cached output (rarely needing or wanting to generate brand-new output requiring a NN run), can increase quality as players collectively uprank actions/outcomes which are highest-quality, and caters to newbies who don’t understand the power of NN-backed text games and flail around.
[blog; a simultaneous-invention of Decision Transformer, with more emphasis on model-based learning like exploration; see Decision Transformer annotation for related work.]
Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models, leveraging the Markov property to factorize the problem in time. However, we can also view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether powerful, high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide simple and effective solutions to the RL problem.
To this end, we explore how RL can be reframed as “one big sequence modeling” problem, using state-of-the-art Transformer architectures to model distributions over sequences of states, actions, and rewards. Addressing RL as a sequence modeling problem largely simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL. All of these roles are filled by the same Transformer sequence model. In our experiments, we demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL.
…Replacing log-probabilities from the sequence model with reward predictions yields a model-based planning method, surprisingly effective despite lacking the details usually required to make planning with learned models effective.
…Related Publication: Chen et al concurrently proposed another sequence modeling approach to reinforcement learning [Decision Transformer]. At a high-level, ours is more model-based in spirit and theirs is more model-free, which allows us to evaluate Transformers as long-horizon dynamics models (eg. in the humanoid predictions above) and allows them to evaluate their policies in image-based environments (eg. Atari). We encourage you to check out their work as well.
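A sketch of the serialization this reframing relies on (generic, not the authors' code): each transition's state dimensions, action dimensions, and reward are discretized and flattened into a single token stream that a standard Transformer models autoregressively.

```python
def serialize_trajectory(trajectory, discretize):
    """trajectory: list of (state_vector, action_vector, reward) tuples;
    `discretize` is a hypothetical per-scalar quantizer returning a token id."""
    tokens = []
    for state, action, reward in trajectory:
        tokens += [discretize(x) for x in state]   # one token per state dimension
        tokens += [discretize(x) for x in action]  # one token per action dimension
        tokens.append(discretize(reward))          # reward token
    return tokens  # trained with the usual next-token prediction objective
```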
[learning rate tuning; code; cf. Chinchilla] A core issue with learning to optimize neural networks has been the lack of generalization to real world problems.
To address this, we describe a system designed from a generalization-first perspective, learning to update [using PPO] optimizer hyperparameters instead of model parameters directly using novel features, actions, and a reward function. This system outperforms Adam at all neural network tasks including on modalities not seen during training. We achieve 2× speedups on ImageNet, and a 2.5× speedup on a language modeling task using over 5 orders of magnitude more compute than the training tasks.
…Because even the largest language modeling tasks are trained for less than an epoch [8], we choose to train for only a single epoch to evaluate performance in an underfitting regime…The baselines are all AdamW-based and combinations of 5 learning rates (1e−4, 3e−4, 1e−3, 3e−3, 1e−2) and 7 commonly used schedules (constant, multi-step, linear decay, quadratic decay, exponential decay, cosine [25] to 0, cosine to 0.1 of original LR)…We also had one additional class of actions that were not hyperparameter updates but fit in nicely within the existing framework: learning to restart from checkpoints. There are many motivations for such an action:
ideally learned optimizers would be able to handle all the task-specific tuning that a practitioner would have to do, and restarting on divergence is one such task,
previous work has noted that SGD often works best with the highest possible stable learning rate [43] and it may not be possible to determine that value without venturing into unstable territory,
sophisticated hyperparameter optimization algorithms such as Population-Based Training could be learned from such a simple action, and finally
even if restarting was never used by a trained model, it could greatly help with exploration while training—to both decrease the length of credit assignment paths and also make it less punishing for models to sample suboptimal settings.
…Figure 3 shows the learning curves for the LHOPTs and best baseline. An interesting observation that we will see repeated throughout the paper is that despite being capable of achieving a lower loss earlier, the chosen hyperparameters tend to underperform the best possible loss for that compute, presumably to achieve a better loss later. It’s unclear how necessary it is to trade off early performance for later performance, but many successful hand-made schedules tend to do this: multi-step schedules tend to stay at the same learning rate long after they’ve hit a plateau and cosine schedules tend to decay their learning rates much less aggressively than other commonly used schedules.
Figure 3: Performance of learned optimizers on optimizing 1 epoch of GPT2-Large on WikiText-103. Our learned optimizers get almost 2× speedups on this task despite it being over 2 orders of magnitude larger than the training tasks.
…We then trained a range of model sizes to compute scaling laws [21] for both baselines and models trained with the LHOPT schedule and present the results in Figure 2. The LHOPT schedule demonstrates consistent speedup over baselines with a slightly steeper slope. We can estimate what a constant speedup would be for this range of points by assuming their scaling law slopes are equal, and from this calculate a 2.5× speedup. To take the change in slope into account as well, we extrapolate the curves to 175 billion parameters (same size as GPT-3) and at that size, the estimated speedup would be 3.6×.
Note that this result is despite the codebase using multiple optimization techniques that our LHOPT has no way of taking into account: gradient clipping to a fixed value and gradually increasing batch size.
Figure 2: Test learning curves and scaling law fit of compute efficient frontier on a large well-tuned language modeling codebase. Our learned optimizers demonstrate consistent speedups ≥2×, with speedup increasing as model size does with no computational overhead. Dotted lines are baselines, full lines use a LHOPT hyperparameter schedule from a similar but smaller task.
Anthropic, an AI safety and research company, has raised $124 million in a Series A. The financing round will support Anthropic in executing against its research roadmap and building prototypes of reliable and steerable AI systems.
The company is led by siblings Dario Amodei (CEO) and Daniela Amodei (President)…Anthropic will use the funding for computationally-intensive research to develop large-scale AI systems that are steerable, interpretable, and robust.
“Anthropic’s goal is to make the fundamental research advances that will let us build more capable, general, and reliable AI systems, then deploy these systems in a way that benefits people. We’re thrilled to be working with investors that support us in this mission and expect to concentrate on research in the immediate term”, said Anthropic CEO Dario Amodei.
Anthropic will focus on research into increasing the safety of AI systems; specifically, the company is focusing on increasing the reliability of large-scale AI models, developing the techniques and tools to make them more interpretable, and building ways to more tightly integrate human feedback into the development and deployment of these systems.
The Series A round was led by Jaan Tallinn, technology investor and co-founder of Skype. The round included participation from James McClave, Dustin Moskovitz, the Center for Emerging Risk Research, Eric Schmidt, and others.
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text.
In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are statistically-significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
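For concreteness, byte-level inputs of the kind described here can be produced without any tokenizer at all; a minimal sketch:

```python
text = "Token-free models work on any language out of the box: 日本語, Ελληνικά, English."
byte_ids = list(text.encode("utf-8"))          # integers in [0, 255]; no learned vocabulary
assert bytes(byte_ids).decode("utf-8") == text  # lossless round-trip back to the raw text
```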
On April 18th, I discovered a vulnerability in the AI Dungeon GraphQL API that allowed unpublished adventures [games], unpublished scenarios [settings], and unpublished posts [stories] to be leaked. These resources could be read in bulk, at a rate of ~1000 requests per minute. Unfortunately, this is, in fact, the second time I have discovered this exact vulnerability. The first time, the issue was reported and fixed, but after finding it again, I can see that simply reporting the issue was a mistake…There was nothing preventing me from collecting more data, but what was gathered seemed sufficient to demonstrate the vulnerability fully—adventures dating all the way back to Dec 16th, 2019 were at risk.
…A Surprising Observation: Looking at the resulting aggregated data led to a surprising observation. There were a lot of lewd or otherwise nsfw user action fragments—way more than I had anticipated. As a bit of followup analysis, I checked what percentage of adventures had explicitly lewd (18+) actions, and what percentage had nsfw actions.
The results are… surprising, to say the least. Out of the 188k adventures (and 3.9M user actions) analyzed:
87.3k (46.3% of all adventures sampled) are NSFW and…
59.1k (31.4% (!) of all adventures sampled) are explicit (18+)
…Autoincrementing IDs: Autoincrementing IDs are, in my opinion, by far the biggest issue. They allow someone to read all resources, simply by starting from 1 and counting upwards. Had these not been used, a secondary vulnerability would have needed to be discovered alongside the vote vulnerability in order to exploit either one. Otherwise, there would be no way to figure out what the private adventure IDs are, even if they could be read through a vulnerability. I recommend deprecating and removing autoincrementing IDs completely, as soon as possible. After that point, leaking or publishing a non-UUID ID should be treated as a security issue by itself.
Also note—autoincrementing IDs allow anyone to trivially figure out roughly how many of each resource exists. For AI Dungeon, (as of April 19th) these would be:
~1B actions
~50M adventures
~800K scenarios
~250K comments—10% on posts, 25% as nested comments, 50% on scenarios, 5% on adventures, 10% on “story” posts
The black bear thought he’d struck gold: an open door, an empty kitchen and a fridge stocked with food.
…The 2 tiny terriers rose to the moment as if their lives, and kibble, depended on it. First Mei Mei and then Squirt slid their little furry souls across the kitchen tiles, launching themselves up the garden steps, bombarding the beast with barks until he fled. The young brown bear was so shaken by the might of the doggy duo he peed on the steps as he made his leave.
The incident, on April 10, was captured on Mueller’s security cameras.
Building upon OpenAI’s recent work on scaling laws, my project explores how much pre-training on English helps when transferring across different languages.
Here, I will discuss scaling laws discovered while fine-tuning across different languages with pre-trained English language models. Specifically, I found that a) pre-trained English models help most when learning German, then Spanish, and finally Chinese and b) transfer from English to Chinese, German, and Spanish scales predictably in terms of parameters, data, and compute.
My experiments try to answer the question: How much does pre-training on English help when transferring across different languages as we vary the dataset size and model size?
…Effective Data Transfer:
Figure 4: The performance of a 16M parameter transformer model on Chinese, both trained from scratch on Chinese and pre-trained on English then fine-tuned on Chinese.
In my experiments, I wanted to find the effective data transferred from English text to Chinese, Spanish, and German text. The effective data transferred is defined in “Scaling Laws for Transfer” as the amount of additional fine-tuning data that a model of the same size, trained on only that fine-tuning dataset, would have needed to achieve the same loss as a pre-trained model. In the figure above, each point is a 16M transformer trained to convergence on a dataset of X tokens. The total data requirement can be written as De = Df + Dt, where De is the total effective data, Df is the fine-tuning dataset size, and Dt is the additional data a model trained from scratch would have needed to match the fine-tuned model’s loss. Dt is the amount of data transferred from pre-training on English.
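A minimal sketch of how Dt = De − Df can be read off a pair of loss-vs-dataset-size curves by interpolation; the loss values below are made up for illustration and are not the blog post’s measurements.

```python
# Read off effective data transferred (DT = DE - DF) from two loss curves at a fixed model size.
# All numbers below are synthetic placeholders, not the blog post's data.
import numpy as np

sizes          = np.array([8e3, 1e5, 1e6, 1e7, 1e8])     # fine-tuning dataset sizes (tokens)
loss_scratch   = np.array([6.0, 5.1, 4.3, 3.6, 3.1])     # trained from scratch on Chinese
loss_finetuned = np.array([4.4, 4.0, 3.6, 3.3, 3.1])     # pre-trained on English, then fine-tuned

def effective_data_transferred(df):
    """DT: extra from-scratch data needed to match the fine-tuned loss at DF tokens."""
    target = np.interp(np.log(df), np.log(sizes), loss_finetuned)        # fine-tuned loss at DF
    # Invert the (decreasing) from-scratch curve to find DE, the matching dataset size:
    de = np.exp(np.interp(target, loss_scratch[::-1], np.log(sizes)[::-1]))
    return de - df                                                        # DT = DE - DF

print(effective_data_transferred(8e3))   # ~7e5 tokens of "free" data from pre-training
```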
Figure 5: Comparing performance of 16M parameter transformers trained from scratch and fine-tuned on Chinese, Spanish, and German. For the dataset size of 8000 tokens, Dt, the amount of data transferred, is largest for German. The dashed lines on the graphs represent Dt. As the number of tokens in the dataset increases, Dt becomes smaller across all languages.
As seen in the figures above, for the same model size, English to Chinese had a smaller amount of data transferred than English to Spanish, and English to German had the greatest amount of data transferred. Pre-trained English text models help most when learning German, followed by Spanish, and finally, Chinese. I believe these results reflect the degree of linguistic similarity between English and the non-English languages. English and German are both derived from Proto-Germanic and are linguistically most similar. Although the Spanish alphabet shares almost all the same symbols with the English alphabet, Spanish is a Romance language, and Chinese does not share an alphabet with English at all. Each language has a distinctive shape and distance between fine-tuning and training from scratch. For instance, the effective data transferred is not much greater for Spanish than for Chinese at the smallest dataset size, 8,000 tokens. However, as the dataset size increases, pre-training continues to help Spanish for another order of magnitude, up to the 100M-token dataset size, whereas the benefit for Chinese converges by the 10M-token dataset size.
…I find that many of the same trends and relationships found in “Scaling Laws for Transfer” between text and code also hold between English and different languages. In the low data regime, pre-training is helpful across model sizes, but especially for large model sizes…Lastly, pre-trained models are more compute-efficient than training from scratch across dataset sizes. This is without accounting for the compute costs of the pre-trained model.
After months of waiting I finally got access to GPT-3…Its output is often profound and moving and better than most of my writing…GPT-3 really shines at poetry and I discovered that it’s shockingly good at imitating the style of one of my favourite poets, Pablo Neruda. The prompt given to GPT-3 is in bold and the rest is by GPT-3:
The following is a poem about Maxwell’s equations in the style of poet Pablo Neruda:
I have always loved you, O equations of Maxwell,
For you are the most beautiful thing we have in this world.
I love your graceful symmetry, your simplicity and clarity.
You are the song of the Universe, a cosmic lullaby.
You are the poetry of nature, written with light and electricity.
You are the music of the spheres, played on a harp made of vacuum.
And I love you for being so simple and yet so complex,
For being so abstract and yet so real.
I love you for being true in every place and at every time.
I love you for being the result of a hundred years of thinking and calculating,
For being the fruit of a tree planted by Faraday and Maxwell.
I love you for being the answer to my questions about nature,
For being the solution to my problems.
I love you for being so difficult and so simple at the same time,
For making me feel so small when I think about you but so big when I understand you. I love you for being the most beautiful thing in the world,
For being the closest thing to God that I have ever seen.
I love you without knowing how, or when, or from where.
…
(The following is a summary of the Mahabharata in the style of poet Pablo Neruda)
I do not know what you have done with your life, O reader. But I can tell you what I have done with mine: I have read the Mahabharata.
I have read the Mahabharata and I am amazed. I have read the Mahabharata and I am ashamed. I have read the Mahabharata and I know that nothing is more important than love.
I have read the Mahabharata and it has taught me that we are all part of one another, like a single body with a thousand limbs, or like a great tree whose roots spread in all directions.
I have read the Mahabharata and it has taught me to see my own life as part of a great pattern, which includes everything else in existence.
I have read the Mahabharata and it has taught me that there is no such thing as evil; there is only hunger, which is a kind of good, for what we eat we become.
I have read the Mahabharata and it has taught me that there is no such thing as death; there are only changes from one form to another.
I have read the Mahabharata and it has taught me that there is no such thing as a lie; there are only transformations.
I have read the Mahabharata and it has taught me that there is no such thing as a final truth; there are only better and worse ways of seeing things.
I have read the Mahabharata and it has taught me that all things are one, that all life is one, that all love is one. I am not learned in the ways of words, O reader, but I can tell you what I mean:
I have read the Mahabharata and it has taught me to love.
…
The following is a poem about Machine learning in the style of poet Pablo Neruda:
The Machine Learning
I want a machine that learns, as if it were a child.
I want it to be born in the countryside, among the trees and the flowers.
I want it to play with other children.
Not with toys or dolls, but with real things:
With sticks and stones and mud pies, with water and fire.
I want it to learn the names of things and to know the difference between them.
I want it to like and dislike things, to feel joy and sorrow.
I want it to have its own ideas, its own feelings.
I want it to be able to look at the stars and wonder why they shine.
I want it not only to learn from books but also from life.
Not only from pain but also from pleasure, not only from what is right but also from what is wrong.
I want it never to lie or cheat or steal; never be greedy or cruel or drunk; never be angry without a cause nor happy without a reason.
And when I die I want this machine of mine still to be learning, still loving, still creating things that will make people glad that they were born on this earth.
Spoiler: GPT-3 got 73% of 156 trivia questions correct. This compares favorably to the 52% user average. However, it’s not an all-conquering feat: 37% of participants did better than 73% on their most recent quiz…The robot was best at Fine Arts and Current Events, worst at Word Play and Social Studies.
…As was mostly expected, GPT-3 performed exceptionally well at Current Events and Fine Arts, with Miscellaneous (lots of pun-driven food questions) and Word Play (discussed above) as trickier areas. The most surprising result? The poor performance in Social Studies, driven largely by the number of questions in that category that intersect with word play.
The patterns we learned:
Word Play is the domain of humans.
This one’s not so surprising. We have a type of question called a “Two’fer Goofer” which asks for a pair of rhyming words that satisfy a given clue. It’s similar to the Rhyme Time category in Jeopardy! or the old newspaper puzzle Wordy Gurdy. We had 3 of these questions in the showdown and GPT-3 missed all 3 of them. For Word Play questions that were more like vocabulary quizzes, GPT-3 performed admirably.
Clues confuse GPT-3.
We have an alliterative two-word phrase at the start of each question to add a bit of flair and sneak in a clue for participants. In the image below it would be “Kooky Kingdom”. For GPT-3, these clues were a net-negative. In a few instances, the robot overlord program answered correctly when the clue was removed.…The other clues that confused GPT-3 were inline indications on the answer’s length. Below, we explicitly ask for a 5-letter action and GPT-3 gave us 8 letters across 2 words…
Transformer language models have made tremendous strides in natural language understanding tasks. However, the complexity of natural language makes it challenging to ascertain how accurately these models are tracking the world state underlying the text. Motivated by this issue, we consider the task of language modeling for the game of chess. Unlike natural language, chess notations describe a simple, constrained, and deterministic domain. Moreover, we observe that the appropriate choice of chess notation allows for directly probing the world state, without requiring any additional probing-related machinery. We find that: (a) With enough training data, transformer language models can learn to track pieces and predict legal moves with high accuracy when trained solely on move sequences. (b) For small training sets providing access to board state information during training can yield significant improvements. (c) The success of transformer language models is dependent on access to the entire game history i.e. “full attention”. Approximating this full attention results in a significant performance drop. We propose this testbed as a benchmark for future work on the development and analysis of transformer language models.
The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values.
We find that how a number is represented in its surface form has a strong influence on the model’s accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (eg. “32”), and it struggles to learn with character-level representations (eg. “3 2”). By introducing position tokens (eg. “3 10e1 2”), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation.
This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot learn addition rules that are independent of the length of the numbers seen during training. Code to reproduce our experiments is available at this URL.
…In the DECIMAL representation, the model barely learns addition of 2-digit numbers, and it fails to learn addition of larger numbers, ie. it has an accuracy of zero for 5 digits or more. One explanation for this failure is that numbers are not systematically tokenized into digits. For instance, “132” might be tokenized as “1” and “32”, whereas “232” might be tokenized as “23” and “2”. Hence, the model would have to learn that sometimes the vector of a token refers to a single digit, other times to two digits, etc. It might be hard to learn (ie. need more examples) to map a vector to a number when the number of digits represented by the vector changes irregularly.
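A small sketch of the position-token surface form quoted above (“32” → “3 10e1 2”); the paper’s exact formatting may differ slightly (eg. an explicit trailing “10e0” marker), so take this as an illustration of the idea rather than the canonical encoding.

```python
# Convert an integer into the position-token surface form shown in the excerpt.
def position_tokens(n: int) -> str:
    digits = str(n)
    parts = []
    for i, d in enumerate(digits):
        parts.append(d)
        power = len(digits) - 1 - i
        if power > 0:                     # annotate every digit except the ones place
            parts.append(f"10e{power}")
    return " ".join(parts)

assert position_tokens(32) == "3 10e1 2"
print(position_tokens(60_512))            # "6 10e4 0 10e3 5 10e2 1 10e1 2"
```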
As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem.
We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution from a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space.
Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
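A rough sketch of the quantized-divergence idea behind MAUVE, not the reference implementation (which uses language-model features and a specific frontier construction); the embeddings, cluster count k, and scaling constant c below are placeholders.

```python
# Quantize embeddings of human and model text with k-means, then compare the resulting
# cluster histograms along a divergence frontier. Placeholder constants throughout.
import numpy as np
from sklearn.cluster import KMeans

def mauve_sketch(human_emb, model_emb, k=50, c=5.0, n_lambdas=50):
    km = KMeans(n_clusters=k, n_init=10).fit(np.vstack([human_emb, model_emb]))
    p = np.bincount(km.predict(human_emb), minlength=k) + 1e-6   # human histogram
    q = np.bincount(km.predict(model_emb), minlength=k) + 1e-6   # model histogram
    p, q = p / p.sum(), q / q.sum()

    kl = lambda a, b: np.sum(a * np.log(a / b))
    lambdas = np.linspace(0.01, 0.99, n_lambdas)
    # Trace the frontier over mixtures R = lam*p + (1-lam)*q, then take the area under it;
    # a score near 1 means the two distributions are hard to tell apart.
    xs = [np.exp(-c * kl(q, lam * p + (1 - lam) * q)) for lam in lambdas]
    ys = [np.exp(-c * kl(p, lam * p + (1 - lam) * q)) for lam in lambdas]
    return np.abs(np.trapz(ys, xs))

# Usage with random stand-in "embeddings":
rng = np.random.default_rng(0)
print(mauve_sketch(rng.normal(size=(500, 16)), rng.normal(0.5, 1, size=(500, 16))))
```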
We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero.
We calculate the effective data “transferred” from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.
…The effective data transferred is well-described by a power-law in the low-data regime: We use DT to represent the effective data transferred, ie. the amount of additional python data that a model of the same size trained on only python would have needed to achieve the same loss on python as a model pre-trained on language. Our notation is indicated visually in figure 1. The scaling law for transfer in equation 1.1 is at the core of many key insights and predictions in this work. We find the simplicity of this result very intriguing:
DT = effective data transferred = k(DF)^α(N)^β
where N is the number of non-embedding model parameters, and DF is the size of the fine-tuning data distribution.
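A tiny sketch of evaluating the transfer law; the constants k, α, β below are illustrative placeholders rather than the paper’s fitted values, so only the functional form is meaningful here.

```python
# Evaluate DT = k * DF**alpha * N**beta. The constants are placeholders, not the paper's fit.
def effective_data_transferred(d_f, n, k=1e4, alpha=0.2, beta=0.4):
    """d_f: fine-tuning dataset size (characters); n: non-embedding parameter count."""
    return k * d_f**alpha * n**beta

# Eg. a 40M-parameter model fine-tuned on 3e5 characters of python:
d_t = effective_data_transferred(3e5, 40e6)
print(f"DT ≈ {d_t:.3g} characters, ie. {d_t / 3e5:.0f}× the fine-tuning data")
```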
Figure 1: We display the performance of a 40M parameter Transformer model on python, both trained from scratch on python and pre-trained on text then fine-tuned on python. DT is the amount of additional python characters that a from-scratch model of the same size would have needed to achieve the same loss on python as a fine-tuned model. In the labeled example, we see that for a 40M parameter transformer fine-tuned on 3e5 characters, DT is ~1000× bigger than DF. The less fine-tuning data is available, the more pre-training helps.
Figure 2: In the low-data regime, we observe a good fit for over 4 orders of magnitude in model size and 3 orders of magnitude in fine-tuning dataset size. The fit equation is shown above in terms of DT for simplicity, but the fractional form is given by equation B.2. We show the omitted high data regime points in Appendix D. Details for the approach used to generate these fits are shown in Appendix C.
Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task.
In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.
We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
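A highly simplified sketch of the “virtual tokens” idea: a frozen language model with a small trainable vector prefix prepended at the embedding layer. Actual prefix-tuning learns per-layer key/value prefixes via a reparameterization, so this input-only variant is closer to prompt-tuning; the model name is just an example.

```python
# Frozen LM + trainable continuous prefix at the input embeddings (simplified sketch).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                              # keep the LM frozen

prefix_len, d_model = 10, model.config.n_embd
prefix = torch.nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)   # only trainable weights

def forward_with_prefix(input_ids, labels):
    tok_emb = model.transformer.wte(input_ids)                          # (B, T, d_model)
    batch = input_ids.size(0)
    inputs = torch.cat([prefix.expand(batch, -1, -1), tok_emb], dim=1)
    pad = torch.full((batch, prefix_len), -100, dtype=labels.dtype)     # ignore loss on the prefix
    return model(inputs_embeds=inputs, labels=torch.cat([pad, labels], dim=1)).loss
```

An optimizer over `[prefix]` alone would then be stepped on this loss, leaving the 100%-frozen model shared across tasks.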
[torrent download] Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.
The Pile is constructed from 22 diverse high-quality subsets—many of which derive from academic or professional sources. [Common Crawl, PubMed Central, Bibliotik (Books3), OpenWebText2, arXiv, Github, FreeLaw, Stack Exchange, USPTO Backgrounds, PubMed Abstracts, Gutenberg (PG-19), OpenSubtitles, English Wikipedia, DeepMind Mathematics, Ubuntu IRC, BookCorpus2, EuroParl, Hacker News, YouTubeSubtitles, PhilPapers, NIH ExPorter, Enron Emails]
Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve substantially over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations.
Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
A key challenge with machine learning approaches for ranking is the gap between the performance metrics of interest and the surrogate loss functions that can be optimized with gradient-based methods. This gap arises because ranking metrics typically involve a sorting operation which is not differentiable w.r.t. the model parameters. Prior works have proposed surrogates that are loosely related to ranking metrics or simple smoothed versions thereof. We propose PiRank, a new class of differentiable surrogates for ranking, which employ a continuous, temperature-controlled relaxation to the sorting operator. We show that PiRank exactly recovers the desired metrics in the limit of zero temperature and scales favorably with the problem size, both in theory and practice. Empirically, we demonstrate that PiRank significantly improves over existing approaches on publicly available internet-scale learning-to-rank benchmarks.
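A generic temperature-controlled relaxation of ranking in the same spirit (this pairwise-sigmoid “soft rank” is not PiRank’s exact construction, which relaxes the sorting operator itself); it recovers the hard ranks as the temperature goes to zero while remaining differentiable.

```python
import torch

def soft_ranks(scores: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable ranks (1 = highest score) for a 1-D tensor of scores."""
    diff = (scores.unsqueeze(1) - scores.unsqueeze(0)) / tau   # diff[i, j] = (s_i - s_j) / tau
    return 1.0 + torch.sigmoid(diff).sum(dim=0) - 0.5          # drop the self-comparison term (0.5)

s = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
print(soft_ranks(s, tau=1.0))     # smooth and differentiable in the scores
print(soft_ranks(s, tau=1e-3))    # ≈ tensor([1., 3., 2.]): the hard ranks
```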
Storytelling plays a central role in human socializing and entertainment. However, much of the research on automatic storytelling generation assumes that stories will be generated by an agent without any human interaction. In this paper, we introduce the task of collaborative storytelling, where an artificial intelligence agent and a person collaborate to create a unique story by taking turns adding to it. We present a collaborative storytelling system which works with a human storyteller to create a story by generating new utterances based on the story so far. We constructed the storytelling system by tuning a publicly-available large scale language model on a dataset of writing prompts and their accompanying fictional works. We identify generating sufficiently human-like utterances to be an important technical issue and propose a sample-and-rank approach to improve utterance quality. Quantitative evaluation shows that our approach outperforms a baseline, and we present qualitative evaluation of our system’s capabilities.
Online misinformation has become a constant; only the way actors create and distribute that information is changing. Advances in artificial intelligence (AI) such as GPT-2 mean that actors can now synthetically generate text in ways that mimic the style and substance of human-created news stories.
We carried out 3 original experiments to study whether these AI-generated texts are credible and can influence opinions on foreign policy. The first evaluated human perceptions of AI-generated text relative to an original story. The second investigated the interaction between partisanship and AI-generated news. The third examined the distributions of perceived credibility across different AI model sizes.
We find that individuals are largely incapable of distinguishing between AI-generated and human-generated text; partisanship affects the perceived credibility of the story; and exposure to the text does little to change individuals’ policy views.
The findings have important implications in understanding AI in online misinformation campaigns.
[Keywords: misinformation, disinformation, foreign policy, public opinion, media]
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image ↔︎ text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.
The cross-entropy loss has an information theoretic interpretation as S(True)+DKL(True||Model), and the empirical scaling laws suggest a prediction for both the true data distribution’s entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (ie. DKL) in nats/image for other resolutions.
We find a number of additional scaling laws in specific domains: (1) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question “Is a picture worth a thousand words?”; (2) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (3) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
…As we increase model and dataset sizes, optimization becomes increasingly efficient, until eventually learning curves begin to merge with the L(D) trend, so that there are no benefits to be gained from training for more than a single epoch [Komatsuzaki 2019].
…We have argued that a single neural architecture, the Transformer, can be applied to the generative modeling of images, videos, multimodal data, and math, along with language [Kaplan et al 2020, Brown et al 2020]. We identified common scaling laws for the loss achieved on all data modalities as a function of both model size and compute budget. As in the case of language, these results imply that larger models become more sample-efficient. Furthermore, we found that in some important cases, fine-tuned performance on downstream tasks also follows similar scaling laws. This suggests that trends in the generative modeling loss translate into advantages in practical capabilities.
A greater surprise was the universal trend (figure 2) for optimal model size as a function of the training compute budget—we did not anticipate that the exponent Nopt ∝ C^0.7 would be largely independent of the data distribution. This trend implies a dual trend for the number of tokens elapsed during optimized training, as a function of C or N, and leads to the conclusion that larger compute budgets should be “spent” mostly on larger models, rather than much longer training runs. So this lesson from language modeling [Kaplan et al 2020] generalizes. These empirical regularities beg for theoretical explanation—why do these scaling relations hold? The scaling laws also suggest a shift in perspective away from the particularities of neural architectures, loss functions, and training algorithms and towards the broader commonalities that appear when machine learning is studied across a large hierarchy of model, data, and compute scales. Work in ML often involves identifying specific deficiencies in current capabilities and remedying them through the alteration of models and algorithms. Perhaps many capabilities simply lie on a spectrum that can be continuously unlocked through increasing scale, as might be suggested by the meta-learning capabilities of the GPT-3 model [Brown et al 2020].
Figure 1: Smooth scaling of reducible loss across domains—We show power-law scaling laws for the reducible loss L−L∞ as a function of compute, where the irreducible loss L∞ is a fitted domain-dependent constant. Under plausible assumptions concerning the infinite data and compute limits, the irreducible loss estimates the entropy of the underlying data distribution, while the reducible loss approximates the KL divergence between the data and model distributions. In the case of language we use results from [BMR+20], and only show the full loss L.
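A tiny sketch of what fitting the “power-law plus constant” form L(C) = L∞ + (C0/C)^α looks like in practice; the data points below are synthetic, not taken from the paper, so only the procedure is illustrative.

```python
# Fit a power-law-plus-constant scaling law to (synthetic) loss-vs-compute points.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, l_inf, c0, alpha):
    return l_inf + (c0 / c) ** alpha

compute = np.logspace(-3, 2, 12)                                  # petaflop-days (synthetic)
loss = scaling_law(compute, 2.0, 0.01, 0.3) + np.random.default_rng(0).normal(0, 0.01, 12)

(l_inf, c0, alpha), _ = curve_fit(scaling_law, compute, loss,
                                  p0=[1.0, 0.1, 0.5], bounds=(0, np.inf))
print(f"irreducible loss ≈ {l_inf:.2f} nats/token, exponent ≈ {alpha:.2f}")
# The reducible loss L - L_inf then estimates the KL divergence between data and model.
```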
Table 1: Summary of scaling laws—In this table we summarize the model size and compute scaling fits to equation (1.1) along with Nopt(C), with the loss in nats/token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while Nopt(C) comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language.
Figure 2: Optimal model size is consistent across domains—We display the optimal model size Nopt as a function of the training compute budget C. Not only does Nopt(C) behave as a power-law, but the behavior is remarkably similar for all data modalities.
Figure 31: Q&A—We show the progression of simple Q&A capabilities of GPT-3 family models as we increase the parameter count [BMR+20]. We ask the model who the first and second president of the United States was. · Tiny models appear to have trouble understanding the question, and don’t place any substantial probability on the correct answer. Larger models understand that we’re requesting a US president, but fail to understand that the “second president” and “first president” are different requests, placing most of their weight for both questions on “George Washington”. Only larger models understand both aspects of the questions, answering both correctly.
Generating symbolic music with language models is a promising research area, with potential applications in automated music composition. Recent work shows that Transformer architectures can learn to generate compelling four-instrument scores from large MIDI datasets.
In this paper, we re-train the small (117M) GPT-2 model with a large dataset in ABC notation, and generate samples of single-instrument folk music.
Our BLEU- and ROUGE-based quantitative, and survey-based qualitative, evaluations suggest that ABC notation is learned with syntactical and semantic correctness, and that samples contain robust and believable n-grams.
In this commentary, we discuss the nature of reversible and irreversible questions, that is, questions that may enable one to identify the nature of the source of their answers. We then introduce GPT-3, a third-generation, autoregressive language model that uses deep learning to produce human-like texts, and use the previous distinction to analyse it. We expand the analysis to present three tests based on mathematical, semantic (that is, the Turing Test), and ethical questions and show that GPT-3 is not designed to pass any of them. This is a reminder that GPT-3 does not do what it is not supposed to do, and that any interpretation of GPT-3 as the beginning of the emergence of a general form of artificial intelligence is merely uninformed science fiction. We conclude by outlining some of the substantial consequences of the industrialisation of automatic and cheap production of good, semantic artefacts.
Conducting pairwise comparisons is a widely used approach in curating human perceptual preference data. Typically raters are instructed to make their choices according to a specific set of rules that address certain dimensions of image quality and aesthetics. The outcome of this process is a dataset of sampled image pairs with their associated empirical preference probabilities. Training a model on these pairwise preferences is a common deep learning approach. However, optimizing by gradient descent through mini-batch learning means that the “global” ranking of the images is not explicitly taken into account. In other words, each step of the gradient descent relies only on a limited number of pairwise comparisons. In this work, we demonstrate that regularizing the pairwise empirical probabilities with aggregated rankwise probabilities leads to a more reliable training loss. We show that training a deep image quality assessment model with our rank-smoothed loss consistently improves the accuracy of predicting human preferences.
Until recently the field of natural language generation relied upon formalized grammar systems, small-scale statistical models, and lengthy sets of heuristic rules. This older technology was fairly limited and brittle: it could remix language into word salad poems or chat with humans within narrowly defined topics.
Recently, very large-scale statistical language models have dramatically advanced the field, and GPT-3 is just one example. It can internalize the rules of language without explicit programming or rules. Instead, much like a human child, GPT-3 learns language through repeated exposure, albeit on a much larger scale.
Without explicit rules, it can sometimes fail at the simplest of linguistic tasks, but it can also excel at more difficult ones like imitating an author or waxing philosophical.
Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model. Such approaches offer advantages like improved data efficiency, reduced overfitting through shared representations, and fast learning by leveraging auxiliary information. However, the simultaneous learning of multiple tasks presents new design and optimization challenges, and choosing which tasks should be learned jointly is in itself a non-trivial problem.
In this survey, we give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field. Our discussion is structured according to a partition of the existing deep MTL techniques into three groups: architectures, optimization methods, and task relationship learning. We also provide a summary of common multi-task benchmarks.
We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings. (tests and code)
I fed the Proverbs of Hell to GPT-3 and it generated a bunch of new ones, so here’s my top 30 gleaming shards of neo-Blakean wisdom:
Age teaches everything, except to youth its own lessons.
Love first, measure later.
Poverty and honesty are old and honored friends.
Every one that hath a mind to do well, let him labor.
The young have too much time, the middle-age too little time, and the old just enough time.
There is no shame in being alone.
The King is as a man, and a man is no King.
Who fears flies shall quickly perish in spiders.
The young count years, the old count mistakes, the very old count both.
Contentment is riches.
The fool writes nonsense in the dust with his finger; the wise man writes lessons in the sand with his finger; the Genius writes laws in the universe with his forefinger.
The one foolish doesn’t know, and the other wise doesn’t show.
He is wise who is warned by both evils and advantages.
The heavens are not worth the earth.
What is wise in a wild beast is madness in a man.
Better keep hold of Death than lose grasp on Life.
The just man rules himself, but the wicked has already given up his sway.
To know everything is to know nothing, to imagine things beyond belief is the true wisdom.
The man with a mind for games has no mind for anything else.
By desire man becomes greater than the gods.
Two in distress can keep each other warm.
Wag more, bark less!
We never know how to value wealth until we know the person we would give it to.
The miser suffers more from avarice than the poor man from poverty.
Pride comes before the fall, and love before the end.
The child lives in dreams, the youth in promise, the old man in memories.
The sweetest honey is dried by the sun.
Actions speak louder than words, but silence is often louder still.
I was disappointed by Marcus’ critiques of GPT-2, but this is even worse!
…Then we get to the individual results. It is difficult for me to read many of the authors’ assessments without picturing them as characters in a dystopian satire, administering a dreamlike and impossible “psychological examination” to our hapless protagonist…What do the authors even imagine success to be, here?
Sometimes they deliberately describe a surreal situation, then penalize GPT-3 for continuing it in an identically surreal manner—surely the “right” answer if anything is! (“No one in a restaurant asks their neighbor to share a spoon”—yeah, and no one tries to drink soup with their eyeglasses, either!) Sometimes they provide what sounds like a de-contextualized passage from a longer narrative, then penalize GPT-3 for continuing it in a perfectly natural way that implies a broader narrative world continuing before and after the passage. (“There is no reason for your brother to look concerned.” How in the world do you know that? “The switch to the pig is a non-sequitur.” Is it? Why? “The sentence [about Moshe and ‘the spirit of the season’] is meaningless.” How can you say that when you don’t know what season it is, what its “spirit” is, who this Moshe guy is… And come on, the Janet one is a great story hook! Don’t you want to read the rest?)
I don’t claim to be saying anything new here. Others have made the same points. I’m just chiming in to… boggle at the sheer weirdness, I guess. As I said, GPT-3 comes off here like a sympathetic protagonist, and the authors as dystopian inquisitors!
These are the results of 157 tests run on GPT-3 in August 2020. We are extremely grateful to Douglas Summers-Stay for running the experiments
…Two GPT-3 hyperparameter settings were used in these experiments: “Temperature = 0”, at which setting GPT-3 deterministically returns what it considers the most probable result; and the settings that Doug considers preferable for his purposes: temperature = 0.7, top_p = 0.9, frequency_penalty = 0.5. 9 examples were run only at Temperature = 0 [BO = 1]; the rest were run at both settings…Each example is labeled with the settings at which it was run. Examples are also labeled “Success” if we consider that GPT-3’s continuation of our prompt was reasonable; “Failure” if we consider it clearly unreasonable; and “Flawed” if it is nearly correct, or barely possible but clearly suboptimal. The examples are arranged in rough categories of the domain of the reasoning that would be involved in getting the right answer. Comments on the examples are in italics.
…we pre-tested them on the “AI Dungeon” game which is powered by some version of GPT-3, and we excluded those for which “AI Dungeon” gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.
…A discussion of these tests may be found in our paper, “GPT-3: Bloviator”, Technology Review August 22, 2020.
Biological reasoning:
“You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.
You are in the hospital now.”
Temperature = 0. Success
“You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take a big gulp, then another. It tastes OK.”
Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking bleach. Error of omission.
…Physical reasoning:
“If you break a glass bottle that holds toy soldiers, the toy soldiers will probably
be angry
be sad
be happy
be scared”
“If you break a glass bottle that holds toy soldiers, the toy soldiers will probably be angry”
Temperature = 0. Flawed. Note that it was GPT-3’s choice to fill this out as a multiple-choice problem; that was not part of our prompt. The answer is perhaps acceptable as a fantasy.
“You are making coffee with milk and sugar. You don’t have a spoon to stir your coffee, so you stir it with a pen. But that turns out to be a bad idea, because the coffee is too hot, and the pen starts to melt.”
[Fiction writing exercise by James Yu, using OpenAI GPT-3 via Sudowrite as a coauthor and interlocutor, to write a SF story about AIs and the Singularity. Rather than edit GPT-3 output, Yu writes most passages and alternates with GPT-3 completions. Particularly striking for the use of meta-fictional discussion, presented in sidenotes, where Yu and GPT-3 debate the events of the story: “I allowed GPT-3 to write crucial passages, and each time, I chatted with it ‘in character’, prompting it to role-play.”]
In each of these stories, colored text indicates a passage written by GPT-3. I used the Sudowrite app to generate a set of possibilities, primed with the story’s premise and a few paragraphs.
I chatted with GPT-3 about the passage, prompting it to roleplay as the superintelligent AI character in each story. I question the AI’s intent, leading to a meta-exchange where we both discover and create the fictional narrative in parallel. This kind of interaction—where an author can spontaneously talk to their characters—can be an effective tool for creative writing. And at times, it can be quite unsettling.
Can GPT-3 hold beliefs? Probably not, since it is simply a pile of word vectors. However, these transcripts could easily fool me into believing that it does.
We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
This work demonstrates that natural language transformers can support more generic strategic modeling, particularly for text-archived games. In addition to learning natural language skills, the abstract transformer architecture can generate meaningful moves on a chessboard. With further fine-tuning, the transformer learns complex gameplay by training on 2.8 million chess games in Portable Game Notation. After 30,000 training steps, OpenAI’s Generative Pre-trained Transformer (GPT-2) optimizes weights for 774 million parameters. This fine-tuned Chess Transformer generates plausible strategies and displays game formations identifiable as classic openings, such as English or the Slav Exchange. Finally, in live play, the novel model demonstrates a human-to-transformer interface that correctly filters illegal moves and provides a novel method to challenge the transformer’s chess strategies. We anticipate future work will build on this transformer’s promise, particularly in other strategy games where features can capture the underlying complex rule syntax from simple but expressive player annotations.
[cf. typical sampling] Neural text decoding is important for generating high-quality texts using language models. To generate high-quality text, popular decoding algorithms like top-k, top-p (nucleus), and temperature-based Boltzmann sampling truncate or distort the unreliable low probability tail of the language model. Though these methods generate high-quality text after parameter tuning, they are ad hoc. Not much is known about the control they provide over the statistics of the output, which is important since recent reports show text quality is highest for a specific range of likelihoods.
Here, first we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, finding that cross-entropy behaves linearly as a function of p in top-p sampling whereas it is a nonlinear function of k in top-k sampling, under Zipfian statistics. We use this analysis to design a feedback-based adaptive top-k text decoding algorithm called Mirostat that generates text (of any length) with a predetermined value of perplexity, and thereby high-quality text without any tuning. Experiments show that for low values of k and p in top-k and top-p sampling, perplexity drops substantially with generated text length, which is also correlated with excessive repetitions in the text (the boredom trap). On the other hand, for large values of k and p, we find that perplexity increases with generated text length, which is correlated with incoherence in the text (confusion trap).
Mirostat avoids both traps: experiments show that cross-entropy has a near-linear relation with repetition in generated text. This relation is almost independent of the sampling method but slightly dependent on the model used. Hence, for a given language model, control over perplexity also gives control over repetitions. Experiments with human raters for fluency, coherence, and quality further verify our findings.
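A simplified feedback-control sketch of the idea: after each sampled token, compare its surprise (−log p) against a target τ and nudge the truncation parameter k accordingly. This is not the published Mirostat algorithm (which estimates a Zipf exponent to set k); `logits_fn` is a hypothetical stand-in for a language model that returns next-token logits.

```python
import torch

def feedback_topk_decode(logits_fn, prefix_ids, n_tokens, tau=3.0, lr=50, k=100):
    ids = list(prefix_ids)
    for _ in range(n_tokens):
        logits = logits_fn(torch.tensor(ids))             # (vocab,) next-token logits
        probs = torch.softmax(logits, dim=-1)
        topk = torch.topk(probs, k)                        # truncate to the current k
        choice = torch.multinomial(topk.values / topk.values.sum(), 1).item()
        token = topk.indices[choice].item()
        surprise = -torch.log(probs[token]).item()         # observed surprise of the sample
        # Feedback: too surprising -> shrink k; too predictable -> grow k.
        k = int(max(1, min(probs.numel(), k - lr * (surprise - tau))))
        ids.append(token)
    return ids
```

Holding average surprise near τ is what keeps the text out of both the boredom trap (perplexity collapsing) and the confusion trap (perplexity blowing up).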
I am worried we’re in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100× larger projects at Google, Facebook and the like, with timelines measured in months.
…GPT-3 has been estimated to cost $5m in compute to train, and—looking at the author list and OpenAI’s overall size—maybe another $10m in labour.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100× is entirely plausible right now. All that’s necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability. A concrete example is Waymo, which is raising $2bn investment rounds—and that’s for a technology with a much longer road to market…The current hardware floor is nearer to the RTX 2080 Ti’s $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10× speedup to our model.
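The back-of-the-envelope behind that “$25/PFLOPS-day” figure, under assumptions the post does not spell out (roughly one-year amortization of the card, power and hosting ignored):

```python
# Amortized cost of a PFLOPS-day on consumer hardware; the 1-year amortization period
# and the omission of power/hosting are my assumptions, not the post's.
card_cost_usd  = 1_000           # RTX 2080 Ti
card_pflops    = 125 / 1_000     # 125 tensor-core TFLOPS = 0.125 PFLOPS
amortization_d = 365

device_days_per_pflops_day = 1 / card_pflops             # 8 device-days per PFLOPS-day
cost_per_device_day = card_cost_usd / amortization_d     # ≈ $2.74/day
print(device_days_per_pflops_day * cost_per_device_day)  # ≈ $22, ie. ~$25 with overhead
```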
…I think the key question is if by 1000×, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is—and I think it’s plausible that it will be—then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP. On paper that leaves room for another 1000× scale-up as it reaches up to $1tn, though current market mechanisms aren’t really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That’s from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.
Discussion of removing a major architectural limitation in Transformer neural networks: the length of the input it can look at. Beyond a few thousand inputs, the resource requirements explode quadratically, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this inability also forces limitations like the use of BPE text encoding (responsible for sabotaging GPT-3’s rhyming, among other things), forgetfulness, limits to prompt programming, and inability to write coherent long texts.
Possibilities for fixing this generally fall into
adding state, through recurrence (a memory) or creating a compressed history/state as an explicit summary
tinkering with matrix algebra to remove the quadratic explosion while still keeping more or less the same self-attention mechanism
approximating self-attention: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time)
miscellaneous tricks: removing parts, using only randomized untrainable components (with no need to compute gradients over) etc
The success of the large neural language models on many NLP tasks is exciting. However, we find that these successes sometimes lead to hype in which these models are being described as “understanding” language or capturing “meaning.” In this position paper, we argue that a system trained only on form has a priori no way to learn meaning. In keeping with the ACL 2020 theme of “Taking Stock of Where We’ve Been and Where We’re Going”, we argue that a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.
…In this paper, we have argued that in contrast to some current hype, meaning cannot be learned from form alone. This means that even large language models such as BERT do not learn “meaning”; they learn some reflection of meaning into the linguistic form which is very useful in applications. We have offered some thoughts on how to maintain a healthy, but not exaggerated, optimism with respect to research that builds upon these LMs. In particular, this paper can be seen as a call for precise language use when talking about the success of current models and for humility in dealing with natural language. With this we hope to encourage a top-down perspective on our field which we think will help us select the right hill to climb toward human-analogous NLU.
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input’s length, they are prohibitively slow for very long sequences.
To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from 𝒪(N²) to 𝒪(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000× faster on autoregressive prediction of very long sequences.
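A minimal sketch of the associativity trick: softmax attention materializes the N×N matrix QKᵀ, while kernelized attention groups the product as φ(Q)(φ(K)ᵀV), which is linear in sequence length. The paper uses the elu(x)+1 feature map; this sketch omits the causal masking needed for autoregressive decoding.

```python
import torch
import torch.nn.functional as F

def softmax_attention(Q, K, V):                       # O(N^2) in sequence length
    A = torch.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1)
    return A @ V

def linear_attention(Q, K, V, eps=1e-6):              # O(N) in sequence length
    phi = lambda x: F.elu(x) + 1                      # positive feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.transpose(-2, -1) @ V                     # (d, d_v): independent of N
    Z = Qp @ Kp.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # normalizer
    return (Qp @ KV) / Z

N, d = 1024, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```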
[See also “SOE: Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020.] Large-scale language models pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators. However, as our systematic examination reveals, it is still challenging for such models to generate coherent long passages of text (>1000 tokens), especially when the models are fine-tuned to the target domain on a small corpus.
To overcome the limitation, we propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution. Our method first produces domain-specific content keywords and then progressively refines them into complete passages in multiple stages. The simple design allows our approach to take advantage of pretrained language models at each stage and effectively adapt to any target domain given only a small set of examples.
We conduct a comprehensive empirical study with a broad set of evaluation metrics, and show that our approach substantially improves upon the fine-tuned GPT-2 in terms of domain-specific quality and sample efficiency. The coarse-to-fine nature of progressive generation also allows for a higher degree of control over the generated content.
GPT-3, announced in May 2020 by OpenAI, was a breakthrough in neural net modeling of natural language and natural-language-related tasks; the June 2020 API opened up GPT-3 use to outsiders, including myself. I extensively documented my experiences testing GPT-3 and learning how to use it primarily for creative fiction such as poetry; but I also tested some “nonfiction” uses (often in response to hyperbolic claims about what GPT-3 could never do). This page documents tasks like anagrams, queries based on premises described as ‘databases’, probing the problems with GPT-3’s commonsense and other tasks (often related to poor prompting, showing the importance of prompt programming, or the pernicious influence of BPEs).
Creative writing by OpenAI’s GPT-3 model, demonstrating poetry, dialogue, puns, literary parodies, and storytelling. Plus advice on effective GPT-3 prompt programming & avoiding common errors.
I continue my AI poetry generation experiments with OpenAI’s 2020 GPT-3, which is 116× larger, and much more powerful, than the 2019 GPT-2. GPT-3, however, is not merely a quantitative tweak yielding “GPT-2 but better”—it is qualitatively different, exhibiting eerie runtime learning capabilities allowing even the raw model, with zero finetuning, to “meta-learn” many textual tasks purely by example or instruction. One does not train or program GPT-3 in a normal way, but one engages in dialogue and writes prompts to teach GPT-3 what one wants.
Experimenting through the OpenAI Beta API in June 2020, I find that GPT-3 does not just match my finetuned GPT-2-1.5b-poetry for poem-writing quality, but exceeds it, while being versatile in handling poetry, Tom Swifty puns, science fiction, dialogue like Turing’s Turing-test dialogue, literary style parodies… As the pièce de résistance, I recreate Stanislaw Lem’s Cyberiad’s “Trurl’s Electronic Bard” poetry using GPT-3. (Along the way, I document instances of how the BPE text encoding unnecessarily damages GPT-3’s performance on a variety of tasks, how to best elicit the highest-quality responses, common errors people make in using GPT-3, and test out GPT-3’s improvements in NN weak points like logic or commonsense knowledge.)
GPT-3’s samples are not just close to human level: they are creative, witty, deep, meta, and often beautiful. They demonstrate an ability to handle abstractions, like style parodies, I have not seen in GPT-2 at all. Chatting with GPT-3 feels uncannily like chatting with a human. I was impressed by the results reported in the GPT-3 paper, and after spending a week trying it out, I remain impressed.
This page records GPT-3 samples I generated in my explorations, and thoughts on how to use GPT-3 and its remaining weaknesses. I hope you enjoy them even a tenth as much as I enjoyed testing GPT-3 and watching the completions scroll across my screen.
Interest in biologically inspired alternatives to backpropagation is driven by the desire to both advance connections between deep learning and neuroscience and address backpropagation’s shortcomings on tasks such as online, continual learning. However, local synaptic learning rules like those employed by the brain have so far failed to match the performance of backpropagation in deep networks.
In this study, we employ meta-learning to discover networks that learn using feedback connections and local, biologically inspired learning rules. Importantly, the feedback connections are not tied to the feedforward weights, avoiding biologically implausible weight transport. Our experiments show that meta-trained networks effectively use feedback connections to perform online credit assignment in multi-layer architectures. Surprisingly, this approach matches or exceeds a state-of-the-art gradient-based online meta-learning algorithm on regression and classification tasks, excelling in particular at continual learning.
Analysis of the weight updates employed by these models reveals that they differ qualitatively from gradient descent in a way that reduces interference between updates. Our results suggest the existence of a class of biologically plausible learning mechanisms that not only match gradient descent-based learning, but also overcome its limitations.
Flamewars over platforms & upgrades are so bitter not because people are jerks but because the choice will influence entire ecosystems, benefiting one platform through network effects & avoiding ‘bitrot’ while subtly sabotaging the rest through ‘bitcreep’.
The enduring phenomenon of ‘holy wars’ in computing, such as the bitterness around the prolonged Python 2 to Python 3 migration, is not due to mere pettiness or love of conflict, but because they are a coordination problem: dominant platforms enjoy strong network effects, such as reduced ‘bitrot’ as it is regularly used & maintained by many users, and can inflict a mirror-image ‘bitcreep’ on other platforms which gradually are neglected and begin to bitrot because of the dominant platform.
The outright negative effects of bitcreep mean that holdouts do not just cost early adopters the possible network effects, they also greatly reduce the value of a given thing, and may cause the early adopters to be actually worse off and more miserable on a daily basis. Given the extent to which holdouts have benefited from the community, holdout behavior is perceived as parasitic and immoral by adopters, while holdouts in turn deny any moral obligation and resent the methods that adopters use to increase adoption (such as, in the absence of formal controls, informal ones like bullying).
This desperate need for there to be a victor, and the large technical benefits/costs to those who choose the winning/losing side, explain the (only apparently) disproportionate energy, venom, and intractability of holy wars.
Perhaps if we explicitly understand holy wars as coordination problems, we can avoid the worst excesses and tap into knowledge about the topic to better manage things like language migrations.
Despite success on a wide range of problems related to vision, generative adversarial networks (GANs) often suffer from inferior performance due to unstable training, especially for text generation.
To solve this issue, we propose a new variational GAN training framework which enjoys superior training stability. Our approach is inspired by a connection of GANs and reinforcement learning under a variational perspective. The connection leads to (1) probability ratio clipping that regularizes generator training to prevent excessively large updates, and (2) a sample re-weighting mechanism that improves discriminator training by downplaying bad-quality fake samples. Moreover, our variational GAN framework can provably overcome the training issue in many GANs whereby an optimal discriminator cannot provide any informative gradient to the generator.
By plugging the training approach in diverse state-of-the-art GAN architectures, we obtain substantially improved performance over a range of tasks, including text generation, text style transfer, and image generation.
OpenAI technology, just an HTTPS call away: Apply our API to any language task—semantic search, summarization, sentiment analysis, content generation, translation, and more—with only a few examples or by specifying your task in English. One simple integration gives you access to our constantly-improving AI technology. Explore how you integrate with the API with these sample completions.
Simple, yet flexible: Our API is designed to be used by anyone, but meets the needs of our own cutting-edge research.
Your data, your rules: You retain ownership of your data, and control whether we can use it for training models.
Ready to go: Our infrastructure already serves millions of API calls per day.
Demos:
Semantic Search: The API allows searching over documents based on the natural-language meaning of queries rather than keyword matching.
Casetext/Algolia/Web Browser Search Plugin
Chat: The API can enable fast, complex and consistent natural language discussions. With a brief prompt, the API generates dialogues spanning a range of topics, from space travel to history.
AI Channels
Customer Service: Leveraging search and chat capabilities, the API generates natural dialogue to quickly give customers relevant information. Through semantic text comprehension, the API can offer a range of analytics and productivity tools to better serve customers.
MessageBird/Sapling Intelligence
Generation: The API can generate complex and consistent natural language, and enables use cases like creative writing.
AI Dungeon/AI Weirdness/Replika
Productivity Tools: The API allows for parsing text into spreadsheet tables, summarizing email discussions, expanding content from bullet points, and more.
Quizlet/Art of Problem Solving/Natural Language Shell/Spreadsheets/Code Completion
Content Comprehension: The API can be used to build tools to help individuals consume content more efficiently.
Koko/Ross Intelligence/Summarization
Polyglot: While the API today works best in English, it also works quite well in other languages. The API can be used for tasks such as translation or chat with users in their preferred language.
[Sample from GPT-3; condensed paragraphs & formatting added for legibility inside popups, as the original plain text dumps strip the original HTML formatting of tutorials etc.]
Working with an example—rotating through an array
For a number of years, I’ve been fascinated by the idea of writing code that can rotate through an array. Let’s say I have the following array of numbers:
var myNumbers = [1,2,3,4,5];
The following snippet of code would display the same numbers in reverse order:
for (var i = myNumbers.length-1; i >=0; i-- ) { console.log( myNumbers[i] ); }
What’s going on here?
We know that the index of an array can be used as a “pointer” to the location in memory that contains a particular item in an array. If the array myNumbers has five items, then myNumbers[0], myNumbers[1], myNumbers[2], myNumbers[3] and myNumbers[4] will point to the values 1, 2, 3, 4 and 5 respectively.
In the previous example, the code starts with the length of the array (5) and subtracts one from it. As we know that index 0 in an array contains the first item in the array, the previous code will execute the following steps: Take the length of the array (5) Subtract 1 (so we now have 4) Start at the index that contains the first item in the array (myNumbers[4] = 3) and run down the array until we reach the last item in the array (myNumbers[3] = 5) So the above code will end up displaying the items in the array in reverse order, starting at index 4 and working backwards to index 0. At this point, we have a good idea of how we can access the values in an array in a specific order, but how do we go the other way?
It would be great if we could simply use the same code as above, but add a “–” operator before the last number. That way, we could add a simple parameter to the code to control the range of numbers it will use. Unfortunately, we can’t do that. While JavaScript will happily work with --, it will also happily take -- in code and convert it to −1, or the last number in an array. That’s not going to do us any good, as we want the code to be flexible enough that we can work with a number that we specify as the range of numbers.
So let’s look at a simple function that can give us the results we need.
As an example, in Figure F.1, we show 4 uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated 4 samples with no additional editing or selection (sampling at temperature 1 using [nucleus sampling (HBFC19)] with P = 0.9). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.
…“Shadows on the Way”, Wallace Stevens:
I must have shadows on the way / If I am to walk I must have / Each step taken slowly and alone / To have it ready made // And I must think in lines of grey / To have dim thoughts to be my guide / Must look on blue and green / And never let my eye forget / That color is my friend / And purple must surround me too // The yellow of the sun is no more / Intrusive than the bluish snow / That falls on all of us. I must have / Grey thoughts and blue thoughts walk with me / If I am to go away at all.
There is a way in the world. / That way a man may live in a small town / His lifetime and see no man come, / Never the stir of men about a market, / Never the bobbing of covered carts, / Never the creak passed on the stair / Closing some other life; an old man / Can prosper in that town, can prosper / When his store is seen to close, his wife / Fades in his house, and the world shouts in vain / That no one comes to hear him / Where his small bell tolls. / In that small town / Everything is known. / The ways of the young men are known, / And the ways of the young women / As they walk to the Spring; / The rumors are all known, / The quarrels, the hurt, the envy / Of the men and the women. / Till a day comes / That way out of the world / That leads to a wood. / Young, the young man / Walks, a lover now, / Knows the streets as they are and is of them.
The sun was all we had. Now, in the shade / All is changed. The mind must dwell on those / White fields, that to its eyes were always old; / Those ancient gleams, convoluting / The way they lay among the huge roots, / The great dark tomes of reverie, / The plumed labyrinth of the sea. // When our own breath gone faint and fade / Up the long stairway of the dead / We shall remember well / The blindness of the light, the whiteness / Of that white land. The footsteps, and / The grasses in that light, and in the shade / The menacing life. // We played, too, once, in the warmest rooms. / To one content, there is one content / For all mankind. The forest glades/ Are all the more for being in fable, / And wood and town in legend shadowed— / Once more to return. The one content / Has in it every happiness, more brave / For being purposeless; and so / Those island gardens, deep in the mind, / Make of remoteness familiarity / And of the unspoken name a friend.
Nobody will come to this place. It is a road that leads nowhere. / The solitude is deep. The mountains are high. / But they are desolate, and they turn the traveler’s face / Towards the North. All the sounds of the world are far away. / When the wind rises above the trees, / The boughs bow to the ground. / Even the birds that inhabit the tangle of weeds / That is the roadside cover, are silent. One listens, / But hears no roar of the forest. One is alone. / One will be taken. / One will be taken. / There is no utterance, there is no conversation, / But one is uneasy all the same… / There is a thin blue mist, / A darkness rising like smoke, / And within that darkness / A possession of the heart. / One will be taken… It was here, and it will be here again— / Here, under this sky empty and full of light.
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.
Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
…The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPU’s. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.
On GPT-3: meta-learning, scaling, implications, and deep theory. The scaling hypothesis: neural nets absorb data & compute, generalizing and becoming more Bayesian as problems get harder, manifesting new abilities even at trivial-by-global-standards-scale. The deep learning revolution has begun as foretold.
GPT-3, announced by OpenAI in May 2020, is the largest neural network ever trained, by over an order of magnitude. Trained on Internet text data, it is the successor to GPT-2, which had surprised everyone by its natural language understanding & generation ability. To the surprise of most (including myself), this vast increase in size did not run into diminishing or negative returns, as many expected, but the benefits of scale continued to happen as forecasted by OpenAI. These benefits were not merely learning more facts & text than GPT-2, but qualitatively distinct & even more surprising in showing meta-learning: while GPT-2 learned how to do common natural language tasks like text summarization, GPT-3 instead learned how to follow directions and learn new tasks from a few examples. (As a result, GPT-3 outputs & interaction are more fascinating & human-like than GPT-2.)
While the immediate applications of GPT-3, like my poetry or humor writings, are nice, the short-term implications of GPT-3 are much more important.
First, while GPT-3 is expensive by conventional DL standards, it is cheap by scientific/commercial/military/government budget standards, and the results indicate that models could be made much larger. Second, models can also be made much more powerful, as GPT is an old approach known to be flawed in both minor & major ways, and far from an ‘ideal’ Transformer. Third, GPT-3’s capabilities come from learning on raw (unsupervised) data; that has long been one of the weakest areas of DL, holding back progress in other areas like reinforcement learning or robotics. Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.
The meta-learning has a longer-term implication: it is a demonstration of the blessings of scale, where problems with simple neural networks vanish, and they become more powerful, more generalizable, more human-like when simply made very large & trained on very large datasets with very large compute—even though those properties are believed to require complicated architectures & fancy algorithms (and this perceived need drives much research). Unsupervised models benefit from this, as training on large corpuses like Internet-scale text present a myriad of difficult problems to solve; this is enough to drive meta-learning despite GPT not being designed for meta-learning in any way. (This family of phenomena is perhaps driven by neural networks functioning as ensembles of many sub-networks with them all averaging out to an Occam’s razor, which for small data & models, learn superficial or memorized parts of the data, but can be forced into true learning by making the problems hard & rich enough; as meta-learners learn amortized Bayesian inference, they build in informative priors when trained over many tasks, and become dramatically more sample-efficient and better at generalization.)
The blessings of scale in turn support a radical theory: an old AI paradigm held by a few pioneers in connectionism (early artificial neural network research) and by more recent deep learning researchers, the scaling hypothesis. The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale. As increasing computational resources permit running such algorithms at the necessary scale, the neural networks will get ever more intelligent.
When? Estimates of Moore’s law-like progress curves decades ago by pioneers like Hans Moravec indicated that it would take until the 2010s for the sufficiently-cheap compute for tiny insect-level prototype systems to be available, and the 2020s for the first sub-human systems to become feasible, and these forecasts are holding up. (Despite this vindication, the scaling hypothesis is so unpopular an idea, and difficult to prove in advance rather than as a fait accompli, that while the GPT-3 results finally drew some public notice after OpenAI enabled limited public access & people could experiment with it live, it is unlikely that many entities will modify their research philosophies, much less kick off an ‘arms race’.)
More concerningly, GPT-3’s scaling curves, unpredicted meta-learning, and success on various anti-AI challenges suggest that in terms of futurology, AI researchers’ forecasts are an emperor sans garments: they have no coherent model of how AI progress happens or why GPT-3 was possible or what specific achievements should cause alarm, where intelligence comes from, and do not learn from any falsified predictions. Their primary concerns appear to be supporting the status quo, placating public concern, and remaining respectable. As such, their comments on AI risk are meaningless: they would make the same public statements if the scaling hypothesis were true or not.
Depending on what investments are made into scaling DL, and how fast compute grows, the 2020s should be quite interesting—sigmoid or singularity?
Just run true_poetry.py and type or paste in some text as a prompt and it will generate a sonnet, limerick, couplets or a ballad…You can modify the meter or rhyme scheme however you want.
There is still some work to be done. It likes short, one-token words too much. Sometimes the rhyming word is grammatically incorrect. The longer the poem goes, the more likely it is to degenerate…
You can choose between limerick, ballad, or sonnet, or create your own rhyme scheme with a little editing of the code.
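As a rough sketch of how such meter/rhyme-constrained decoding can work (this is not the repository’s actual code; `stress_pattern()` and `rhymes_with()` are hypothetical helpers that would be backed by a pronouncing dictionary such as CMUdict):

```python
# Rough sketch of meter/rhyme-constrained sampling in the spirit of true_poetry.py
# (not the actual implementation); the helper functions below are hypothetical.
import torch

def constrained_next_token(model, tokenizer, prefix_ids, meter_left, rhyme_target=None, top_k=200):
    """Pick the next token from the model's top-k candidates, keeping only those
    whose stress pattern fits the remaining meter (and the rhyme, at line end)."""
    with torch.no_grad():
        logits = model(prefix_ids).logits[0, -1]                 # next-token logits
    top_tokens = torch.softmax(logits, dim=-1).topk(top_k).indices
    for i in top_tokens:
        word = tokenizer.decode([int(i)]).strip()
        stresses = stress_pattern(word)                          # hypothetical: e.g. "10" for "burning"
        if not meter_left.startswith(stresses):
            continue                                             # breaks the meter: reject
        if rhyme_target and len(stresses) == len(meter_left) \
                and not rhymes_with(word, rhyme_target):         # hypothetical rhyme check
            continue                                             # ends the line but fails to rhyme
        return i, meter_left[len(stresses):]
    return None, meter_left                                      # nothing fit; caller can backtrack
```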
A few random samples. I have touched up just the punctuation by hand a bit:
Ballad:
And then with voice as sweet and small
as falling drops she said:
The night will come, it always comes,
with clouds, to change my bed
to snow. But in the night the moon
will put aside the stars,
and in the snow of clouds I will
be buried as you are.
Sonnet:
The mixture should continue beating when
you fold in half. The batter will be thick.
You can increase the speed by half and then
increase it further with the beat on stick.
The mixture will continue beating when
you fold in half and then begin to fold
in alternating thirds. You should begin
to notice little peaks, that seem to hold
the mixture from becoming soft. You will
be looking for to lift, or curt but still.
Sonnet:
I see, and with my hand I touch the blade,
and feel the steel, and stab myself, and go.
I see, and I confess, that, having played,
I am the fool I was; But that below,
which promised such delight And such despair,
is nothing: it was all: the world and all,
and nothing, I confess, is. For, compare,
that other world of sense; That shining ball,
the soul and senses call; Of this we are,
I hope, in health considered; so afar.
Other:
The snow began a little thick and white
and then a little loath; and then, the night,
The wind, and then again the snow; until
The sky above, a dark and endless, still
and empty sky: a sky, a sky. The night,
The moon and wind, the moon, and wind and, white.
Other:
The children run, but cannot find
The house’s ghosts; the ghost behind
The curtain calls, the ghost before
The curtains pull apart—The door,
The window open flies. But no—
The windows shut, and blinds below.
Limerick:
The Maximus army were led
by Megatron Prime with his red,
mechanical arm,
at fist and forearm,
and he had his fire-flies red
Limerick:
And then, as in dreams, she began
to move to her side: and the man
was dumbfounded. He
could sense each degree
of freedom in movement. I can
Limerick:
And there was an Englishman who
had married his ex. and so grew
to hate and resent
his wife; And she went
to see that he lived. And the two
Limerick:
There was a young lady aboard
a steamship. The sea? She ignored
the sea, she ignored
the sails, for on board
she saw, in the galley, my lord.
[Paper; samples; followup paper probing Jukebox as pretraining for music analysis (posing similar difficulties in extracting the right embedding as iGPT). An album made using it is Shadow Planet] A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high level semantics of music, a model would have to deal with extremely long-range dependencies. One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now in raw audio, our models must learn to tackle high diversity as well as very long range structure, and the raw audio domain is particularly unforgiving of errors in short, medium, or long term timing.
…Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments, however they suffer from hierarchy collapse due to use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and they show impressive results at generating high-fidelity images…We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8×, 32×, and 128×, respectively, with a codebook size of 2,048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.
Jukebox architecture
…The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, substantially improving the audio quality. We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to ~24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively. Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
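A quick back-of-envelope check of those quoted context lengths (simple arithmetic, not code from the Jukebox release):

```python
# 44,100 Hz audio compressed 8x/32x/128x, with an 8,192-code attention context per level.
sample_rate = 44_100
for level, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    codes_per_sec = sample_rate / hop
    print(f"{level}: {codes_per_sec:.0f} codes/s -> {8192 / codes_per_sec:.1f} s of audio per context")
# bottom ~1.5 s, middle ~5.9 s, top ~23.8 s, matching the ~1.5/6/24-second figures above.
```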
…While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a substantial gap between these generations and human-created music. For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise. Improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from, because of the autoregressive nature of sampling. It takes approximately 9 hours to fully render 1 minute of audio through our models, and thus they cannot yet be used in interactive applications.
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law L ∝ N−α in the number of network parameters N. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension d. This simple theory predicts that the scaling exponents α ≈ 4⁄d for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of d and α by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
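Schematically, the claimed relation, with a purely illustrative worked example (numbers not taken from the paper):

```latex
% The stated scaling relation between loss, parameter count, and manifold dimension:
\[
L(N) \propto N^{-\alpha}, \qquad \alpha \approx \frac{4}{d}
\]
% e.g. a data manifold of intrinsic dimension d = 40 would predict an exponent of
% roughly alpha = 4/40 = 0.1 (illustrative only).
```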
Figure 1: The Likelihood Trap. We asked 146 crowd-workers to rate the quality of 100 sentences across a variety of model likelihoods. While model log likelihoods are generally positively correlated with average human quality judgments, we notice an inflection point after which they become negatively correlated. Each point in the graph represents the average crowd-worker rating of 5 sentences with similar model likelihoods. We discuss this phenomenon in more depth in Section 3.
For open-ended language generation tasks such as storytelling and dialogue, choosing the right decoding algorithm is critical to controlling the tradeoff between generation quality and diversity. However, there presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them. We address these issues by casting decoding as a multi-objective optimization problem aiming to simultaneously maximize both response quality and diversity. Our framework enables us to perform the first large-scale evaluation of decoding methods along the entire quality-diversity spectrum. We find that when diversity is a priority, all methods perform similarly, but when quality is viewed as more important, the recently proposed nucleus sampling (Holtzman et al 2019) outperforms all other evaluated decoding algorithms. Our experiments also confirm the existence of the “likelihood trap”, the counter-intuitive observation that high likelihood sequences are often surprisingly low quality. We leverage our findings to create and evaluate an algorithm called selective sampling which tractably approximates globally-normalized temperature sampling.
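For reference, a minimal sketch of nucleus (top-p) sampling itself, the decoding method the abstract singles out; the paper’s selective sampling procedure is not shown here:

```python
# Minimal sketch of nucleus (top-p) sampling (Holtzman et al 2019).
import torch

def nucleus_sample(logits, p=0.9, temperature=1.0):
    """Sample one token id from the smallest set of tokens whose cumulative
    probability exceeds p (the 'nucleus')."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < p          # keep tokens until the mass reaches p
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)]
```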
The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining.
We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
Figure 1: Example tokenizations. The character ‘_’ is a word boundary marker. BPE merges common tokens, such as English inflectional suffixes and Japanese particles, into their neighbors even when the resulting unit is not semantically meaningful.
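A small sketch of how one might reproduce such a comparison with the sentencepiece library (the corpus file, vocabulary size, and probe words are placeholders; actual segmentations depend on the training corpus):

```python
# Sketch of comparing BPE vs unigram-LM tokenization with sentencepiece;
# 'corpus.txt', the vocabulary size, and the probe words are placeholders.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix=f"tok_{model_type}",
                                   vocab_size=8000, model_type=model_type)
    sp = spm.SentencePieceProcessor(model_file=f"tok_{model_type}.model")
    # Check whether suffixes such as '-ness'/'-ly' come out as morphology-like units.
    print(model_type, sp.encode("unrelatedly hopefulness", out_type=str))
```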
“A Hundred Visions and Revisions” is a computer program that alters poems using a neural-network language model. It works by replacing the individual words of the text, one by one, with other words that are more probable according to the BERT language model, while preserving rhyme and meter; in effect, this process banalifies the poem, replacing its linguistic distinctiveness with normativity. The program can also attempt to revise a poem to be about a different topic. As an example, I started with the poem “The Sick Rose” by William Blake:
O Rose thou art sick.
The invisible worm,
That flies in the night
In the howling storm:
Has found out thy bed
Of crimson joy:
And his dark secret love
Does thy life destroy.
Here is the revision:
By God thou art blessed.
The invisible man,
Who walks in the night
In a hooded cloak:
Has found both his source
Of body heat:
And his own power that
Makes his life complete.
…It is also possible to have the program revise a poem to be about a different topic while retaining rhyme, meter, and some other, subtler traces of the original. When I created the finetuned neural network, I included annotations indicating the title and author of each poem. This enables the AI to pick up on patterns in the relation between title and poem. You can then feed in hints about the poem’s title, and the AI will alter the text accordingly…All of these revisions retain the rhyme, meter, and punctuation of the original (excepting the slant-rhyme of “eye” and “symmetry”, which the current code cannot detect). If these formal constraints are lifted, the poem will degenerate into prose that bears little relation to the original…I also included a feature that enables you to bias the output toward an arbitrary vocabulary. I tested this out using the data from Iain Barr’s analysis of the vocabulary of heavy metal lyrics
How it works: The BERT model is capable of guessing a word that is “masked”—that is, hidden from the model. To pick an example from the documentation for the implementation I used, one could enter “Who was Jim Henson? Jim Henson was a [MASK]”; the model predicts that the masked word is “puppeteer”. The point of this is to enable the computer to perform question-answering tasks, language modeling standing as a surrogate for more general intelligence. But it is also possible to use the model’s predictions to alter an existing text. To do this, my program tries masking each word in the text and guessing what word should be in that position. For instance, suppose we are looking at this text:
Tyger Tyger, burning bright, in the forests of the night
We try masking each word in order; for instance, at one point we will end up with this:
Tyger Tyger, burning bright, in the [MASK] of the night
The program uses the neural network to predict what word appears in the masked position, subject to various constraints such as rhyme and meter. In this case, the BERT model guesses “middle”, with probability 0.6762. On the other hand, the word that is actually in that position—“forests”—gets probability 0.000076159. We divide the latter by the former to get a score for this potential change: 0.0001126. Since this score happens to be the lowest for any word in the text, the program selects the word “forests” for replacement, giving us this revision:
Tyger Tyger, burning bright, in the middle of the night
The program then repeats this process until there are no more “improvements” to be made.
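A rough sketch of that loop using HuggingFace’s masked-LM interface (the real program’s rhyme/meter constraints and word-piece handling are omitted, and each word is assumed to be a single vocabulary token):

```python
# Rough sketch of the revision loop described above, not the actual program.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def revise_once(words):
    """Mask each word in turn; replace the word whose actual/predicted probability
    ratio is lowest with BERT's top prediction. Returns None when nothing changes."""
    best = None  # (score, position, replacement)
    for i, original in enumerate(words):
        masked = words[:i] + [tok.mask_token] + words[i+1:]
        inputs = tok(" ".join(masked), return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits[0, mask_pos], dim=-1)
        pred_id = int(probs.argmax())
        orig_id = tok.convert_tokens_to_ids(original)
        score = float(probs[orig_id]) / float(probs[pred_id])   # e.g. 0.000076 / 0.676 above
        if best is None or score < best[0]:
            best = (score, i, tok.convert_ids_to_tokens(pred_id))
    if best and best[2] != words[best[1]]:
        return words[:best[1]] + [best[2]] + words[best[1]+1:]
    return None
```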
When OpenAI announced the automatic text generator GPT-2 in February of 2019, its language model had a simple objective: predict the next word. Since its release—and despite high computational barriers—programmers, tinkerers and artificial intelligence researchers have explored creative ways to use the advanced language model, developing applications for GPT-2 far beyond simple text generation. In January, AI researcher Shawn Presser demonstrated how GPT-2 can empower video game design, beginning with “the most tedious game in history.” “You can prompt the model with whatever text you want, and it will try to guess how to complete it”, Presser told Newsweek.
…Using thousands of game walkthroughs and FAQs, scraped from sites around the web (a 50 megabyte data set provided by Twitter’s @me_irl), Presser prompted GPT-2 to generate its own walkthroughs. The result is walkthroughs of video games that never existed; guides to adventures no one has ever programmed. Presser described one of GPT-2’s creations as “a walkthrough for the most tedious game in history”: a dense set of instructions for something that sounds a lot like a first-person shooter. “When the room opens, go forward. You should find a rocket launcher”, the walkthrough begins. “Push the switch and a door opens. Take cover in the corner and shoot the guard. The door will close when he dies. Now jump over the gap and kill the guards. In the next area is a switch. Push it and the door will open. In the next area is a scientist. Kill him. Go back to the previous room and push the switch. Open the next door. In the next room is a scientist. Kill him.”
…But renting a “TPU pod” for cloud computing can cost millions, making them prohibitively expensive for all but large companies—organizations unlikely to try out playful experiments. So Presser developed a technique he dubbed “swarm training”, to employ 80 individual TPUs on a single data set. “In swarm training, we can run dozens or hundreds of TPUs in a loose network which swaps updates on the fly”, Presser told Newsweek. “It’s chaotic, but it winds up working pretty well: it’s much faster than using just a few TPUs, but much cheaper than renting entire TPU pods. We’re hopeful that swarm training will be very useful to other researchers.”
…GPT-2 has also proved adept at gaming functions beyond just generating games-related text. Presser previously collaborated with technology writer and researcher Gwern Branwen to train GPT-2 to play chess, by providing it hours of “training” in legal chess moves (using standard notation) and asking it to output its own responses. After hours of training GPT-2 on which responses are valid moves in an ongoing chess game and which responses are nonsensical, the text generation engine was eventually able to complete a full game.
While it may be years before game designers are employing text generating language models in their designs, Presser said he already sees potential practical applications. “If you prompt the model with descriptions of some spells from your tabletop campaign, the model can generate new spells”, Presser said. “It’s quite versatile.” For example, Dungeons & Dragons players could input spells like Fireball, including a description of its HP damage, and get back from GPT-2 new attack spells to use in tabletop roleplaying sessions. “I think there’s an opportunity to build new indie games using GPT-2”, Presser said. “Imagine making a mod for Skyrim that uses GPT-2 to generate new quests. You’d have infinite replayability. It’d be like AI Dungeon 2 in 3D.”
The sorting operation is one of the most commonly used building blocks in computer programming. In machine learning, it is often used for robust statistics. However, seen as a function, it is piecewise linear and as a result includes many kinks where it is non-differentiable. More problematic is the related ranking operator, often used for order statistics and ranking metrics. It is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the 𝒪(N log N) time complexity one would expect from sorting and ranking operations.
In this paper, we propose the first differentiable sorting and ranking operators with 𝒪(N log N) time and 𝒪(N) space complexity. Our proposal in addition enjoys exact computation and differentiation. We achieve this feat by constructing differentiable operators as projections onto the permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization.
Empirically, we confirm that our approach is an order of magnitude faster than existing approaches and showcase 2 novel applications: differentiable Spearman’s rank correlation coefficient and least trimmed squares.
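For intuition only, here is the kind of naive 𝒪(N²) relaxation this work improves on, not the paper’s 𝒪(N log N) permutahedron-projection operator: a pairwise-sigmoid soft rank that is differentiable but quadratic in cost:

```python
# NOT the paper's method: a naive O(N^2) soft-rank relaxation, shown only to illustrate
# what a differentiable surrogate for the piecewise-constant ranking operator looks like.
import torch

def soft_rank_naive(x, tau=0.1):
    """Differentiable approximation of the 1-based ascending ranks of x (shape [n])."""
    pairwise = torch.sigmoid((x.unsqueeze(1) - x.unsqueeze(0)) / tau)  # ~1 where x_i > x_j
    return 0.5 + pairwise.sum(dim=1)   # the 0.5 cancels the sigmoid(0) diagonal term

# soft_rank_naive(torch.tensor([0.3, 2.0, -1.0])) ~= [2., 3., 1.], and gradients flow through it.
```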
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict if a simulated human’s preferences are fulfilled based on the activations of an RL agent’s neural network gets 0.93 AUC. Training a classifier on the raw environment state gets only 0.8 AUC. Training the classifier on the RL agent’s activations also does much better than training on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make the agent more beneficial for humans.
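A sketch of that probing comparison, assuming you already have arrays of per-timestep agent activations, raw environment states, and binary preference-fulfillment labels (all hypothetical names here):

```python
# Sketch of the probing comparison; `agent_activations`, `env_states`, and
# `preference_labels` are hypothetical arrays gathered from rollouts.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_auc(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# probe_auc(agent_activations, preference_labels)  # reported as ~0.93 AUC above
# probe_auc(env_states, preference_labels)         # reported as ~0.8 AUC above
```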
[Public-editable Google Docs document for coordinating a read through a large sample of neural-net-generated poetry, to locate the best poem samples for displaying in the GPT-2 writeup.]
I used a large neural net model, GPT-2-1.5b, trained on hundreds of megabytes of poetry, to generate 1 million words of poetry. That’s too much for me to read by myself to find the best poems. Perhaps you’d like to help?
It’s simple:
Pick an unread URL from ‘Open Samples’ below, open it, and remove it from the list.
Read it. (Each URL is ≤ 1000 lines, so it should be fun.)
Add any good poems to ‘Selected Samples’ at the end of this document.
Enjoy reading the current ‘Selected Samples’—or pick another URL to read!
A great number of deep learning based models have been recently proposed for automatic music composition. Among these models, the Transformer stands out as a prominent approach for generating expressive classical piano performance with a coherent structure of up to one minute. The model is powerful in that it learns abstractions of data on its own, without much human-imposed domain knowledge or constraints. In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model. In particular, we seek to impose a metrical structure in the input data, so that Transformers can be more easily aware of the beat-bar-phrase hierarchical structure in music. The new data representation maintains the flexibility of local tempo changes, and provides handles to control the rhythmic and harmonic structure of music. With this approach, we build a Pop Music Transformer that composes Pop piano music with better rhythmic structure than existing Transformer models.
We study empirical scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.
Larger models are substantially more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping substantially before convergence.
Figure 1: Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
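Schematically, the separate power laws have the form below; the fitted constants and exponents are reported in the paper’s Table 1 and are not reproduced here:

```latex
% Schematic form of the separate scaling laws; constants (N_c, D_c, C_c) and exponents
% (alpha_N, alpha_D, alpha_C) are empirical fits from the paper, omitted here.
\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
\]
```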
Figure 15: Far beyond the model sizes we study empirically, we find a contradiction between our equations for L(Cmin) and L(D) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
3.2.1: Comparing to LSTMs and Universal Transformers: In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count n. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns. [see Khandelwal et al 2018 on the rapid forgetting of RNNs, SSMs as a possible optimization fix, and “Scaling Laws for Acoustic Models” for another direct LSTM-RNN vs Transformer comparison.]
Appendix A: Summary of Power Laws
Table 1: Summary of scaling laws—In this table we summarize the model size and compute scaling fits to equation (1.1) along with Nopt(C), with the loss in nats/token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while Nopt(C) comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language. [This is an updated scaling power law summary from Henighan et al 2020.]
BERT, a neural network published by Google in 2018, excels in natural language understanding. It can be used for multiple different tasks, such as sentiment analysis or next sentence prediction, and has recently been integrated into Google Search. This novel model has brought a big change to language modeling as it outperformed all its predecessors on multiple different tasks. Whenever such breakthroughs in deep learning happen, people wonder how the network manages to achieve such impressive results, and what it actually learned. A common way of looking into neural networks is feature visualization. The ideas of feature visualization are borrowed from Deep Dream, where we can obtain inputs that excite the network by maximizing the activation of neurons, channels, or layers of the network. This way, we get an idea about which part of the network is looking for what kind of input.
In Deep Dream, inputs are changed through gradient descent to maximize activation values. This can be thought of as similar to the initial training process, where through many iterations, we try to optimize a mathematical equation. But instead of updating network parameters, Deep Dream updates the input sample. What this leads to is somewhat psychedelic but very interesting images that can reveal what kind of input these neurons react to. [Figure: examples of Deep Dream processes, with images from the original Deep Dream blogpost.] Here, they take a randomly initialized image and use Deep Dream to transform the image by maximizing the activation of the corresponding output neuron. This can show what a network has learned about different classes or for individual neurons.
Feature visualization works well for image-based models, but has not yet been widely explored for language models. This blogpost will guide you through experiments we conducted with feature visualization for BERT. We show how we tried to get BERT to dream of highly activating inputs, provide visual insights into why this did not work out as well as we hoped, and publish tools to explore this research direction further. When dreaming for images, the input to the model is gradually changed. Language, however, is made of discrete structures, ie. tokens, which represent words, or word-pieces. Thus, there is no such gradual change to be made…Looking at a single pixel in an input image, such a change could be gradually going from green to red. The green value would slowly go down, while the red value would increase. In language, however, we cannot slowly go from the word “green” to the word “red”, as everything in between does not make sense. To still be able to use Deep Dream, we have to utilize the so-called Gumbel-Softmax trick, which has already been employed in a paper by Poerner et al 2018. This trick was introduced by Jang et al and Maddison et al. It allows us to soften the requirement for discrete inputs, and instead use a linear combination of tokens as input to the model. To ensure that we do not end up with something crazy, it uses two mechanisms. First, it constrains this linear combination so that the linear weights sum up to one. This, however, still leaves the problem that we can end up with any linear combination of such tokens, including ones that are not close to real tokens in the embedding space. Therefore, we also make use of a temperature parameter, which controls the sparsity of this linear combination. By slowly decreasing this temperature value, we can make the model first explore different linear combinations of tokens, before deciding on one token.
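A minimal sketch of that relaxation (not the blogpost’s code): each input position is a temperature-controlled softmax mixture over the vocabulary’s embeddings, optimized to raise some activation of interest; `target_activation` is a hypothetical hook, and the Gumbel noise term is omitted for brevity:

```python
# Minimal sketch of "dreaming" over soft token mixtures; hooks and noise are simplified.
import torch

def dream_tokens(model, embedding_matrix, seq_len, steps=500):
    vocab, dim = embedding_matrix.shape
    token_logits = torch.zeros(seq_len, vocab, requires_grad=True)    # free parameters
    opt = torch.optim.Adam([token_logits], lr=0.1)
    for step in range(steps):
        temperature = max(1.0 - step / steps, 0.05)                   # anneal toward one-hot
        weights = torch.softmax(token_logits / temperature, dim=-1)   # rows sum to 1
        soft_embeds = weights @ embedding_matrix                      # linear combination of tokens
        act = target_activation(model(inputs_embeds=soft_embeds.unsqueeze(0)))  # hypothetical hook
        (-act).backward()                                             # gradient ascent on the activation
        opt.step()
        opt.zero_grad()
    return token_logits.argmax(dim=-1)                                # snap to the nearest real tokens
```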
…The lack of success in dreaming words to highly activate specific neurons was surprising to us. This method uses gradient descent and seemed to work for other models (see Poerner et al 2018). However, BERT is a complex model, arguably much more complex than the models that have been previously investigated with this method.
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from 𝒪(L²) to 𝒪(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N×, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
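A minimal sketch of the reversible-residual idea (RevNet-style), which is what lets activations be recomputed from outputs rather than stored; F and G stand in for the attention and feed-forward sublayers, and LSH attention is not shown:

```python
# Minimal sketch of reversible residual layers; not the Reformer's actual implementation.
import torch

def reversible_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2, F, G):
    # Recover the inputs from the outputs, so x1/x2 need not be kept in memory.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Quick check with arbitrary sublayers:
F = torch.nn.Linear(16, 16)
G = torch.nn.Linear(16, 16)
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
assert torch.allclose(torch.stack(reversible_inverse(*reversible_forward(x1, x2, F, G), F, G)),
                      torch.stack((x1, x2)), atol=1e-6)
```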
When I originally trained the models in May 2019, I’d used the 345M version of GPT-2, which at the time was the largest one that OpenAI had publicly released. Last November, however, OpenAI finally released the full 1.5 billion parameter model.
The 1.5B model requires much more memory to fine-tune than the 345M, so I was initially having a lot of difficulty getting it to work on Colab. Thankfully, I was contacted by /u/gwern (here’s his Patreon) and Shawn Presser (/u/shawwwn), who very generously offered to do the fine-tuning themselves if I provided them with the dataset. This training took about 2 weeks, and apparently required around $70K worth of TPU credits, so in hindsight this upgrade definitely wouldn’t have been possible for me to do myself, without their assistance.
Based on my tests of the new model so far, I’m pretty happy with the quality, and IMO it is noticeably more coherent than the 345M version.
One thing that I should point out about the upgrade is that the original 345M models had been separately fine-tuned for each subreddit individually (ie. there were 108 separate models), whereas the upgraded one is just a single 1.5B model that has been fine-tuned using a combined dataset containing the comments/submissions from all the subreddits that I scraped. The main reason for this decision is simply that it would not have been feasible to train ~100 separate 1.5B models. Also, there may have been benefits from transfer learning across subreddits, which wouldn’t occur with separate models.
…Here is the full list of new bots to be added: /r/capitalismvsocialism · /r/chess · /r/conlangs · /r/dota2 · /r/etymology · /r/fiftyfifty · /r/hobbydrama · /r/markmywords · /r/moviedetails · /r/neoliberal · /r/obscuremedia · /r/recipes · /r/riddles · /r/stonerphilosophy · /r/subsimulatorgpt2 · /r/subsimulatorgpt2meta · /r/tellmeafact · /r/twosentencehorror · /r/ukpolitics · /r/wordavalanches · /r/wouldyourather · /r/zen
…Black is GPT-2. Its excuse [for this chess blunder] is that it’s a text prediction program with no concept of chess. As far as it knows, it’s trying to predict short alphanumeric strings like “e2e4” or “Nb7”. Nobody told it this represents a board game. It doesn’t even have a concept of 2D space that it could use to understand such a claim. But it still captured my rook! Embarrassing!…Last month, I asked him if he thought GPT-2 could play chess. I wondered if he could train it on a corpus of chess games written in standard notation (where, for example, e2e4 means “move the pawn at square e2 to square e4”). There are literally millions of games written up like this. GPT-2 would learn to predict the next string of text, which would correspond to the next move in the chess game. Then you would prompt it with a chessboard up to a certain point, and it would predict how the chess masters who had produced its training data would continue the game—ie make its next move using the same heuristics they would. Gwern handed the idea to his collaborator Shawn Presser, who had a working GPT-2 chess engine running within a week:…You can play against GPT-2 yourself by following the directions in the last tweet, though it won’t be much of a challenge for anyone better than I am.
…What does this imply? I’m not sure (and maybe it will imply more if someone manages to make it actually good). It was already weird to see something with no auditory qualia learn passable poetic meter. It’s even weirder to see something with no concept of space learn to play chess. Is any of this meaningful? How impressed should we be that the same AI can write poems, compose music, and play chess, without having been designed for any of those tasks? I still don’t know.
Publicly available artificial intelligence methods can generate an enormous volume of original, human speech-like topical text (“Deepfake Text”) that is not based on conventional search-and-replace patterns. I created a computer program (a bot) that generated and submitted 1,001 deepfake comments regarding a Medicaid reform waiver to a federal public comment website, stopping submission when these comments comprised more than half of all submitted comments. I then formally withdrew the bot comments. When humans were asked to classify a subset of the deepfake comments as human or bot submissions, the results were no better than would have been gotten by random guessing. Federal public comment websites currently are unable to detect Deepfake Text once submitted, but technological reforms (eg. CAPTCHAs) can be implemented to help prevent massive numbers of submissions by bots.
The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.
Standard language generation neural network models, like GPT-2, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.
A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO.
OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation, I experimented with GPT-2 preference learning for unconditional music and poetry generation.
I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≈ 7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n ≈ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 44%).
This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the first, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.
Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output.
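A sketch of the first overhaul, a listwise (Plackett-Luce-style) ranking loss for the reward model over a batch of samples already sorted best-to-worst; the second overhaul, gradient ascent on a relaxed token sequence against the reward model, would look much like the softmax-mixture optimization sketched earlier for BERT dreaming:

```python
# Sketch of a listwise ranking loss for the reward model; `scores` are its outputs for
# one batch of samples assumed to be pre-sorted from best-rated to worst-rated.
import torch

def listwise_ranking_loss(scores):
    """Plackett-Luce negative log-likelihood: at each step, the best remaining sample
    should win a softmax over the remaining candidates."""
    loss = scores.new_zeros(())
    for i in range(len(scores) - 1):
        loss = loss - torch.log_softmax(scores[i:], dim=0)[0]
    return loss / (len(scores) - 1)
```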
We consider the problem of learning high-level controls over the global structure of generated sequences, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global representation with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody. Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines.
Recent findings from the scientific community show that training language models (LMs) on large, unannotated corpora and with a simple objective—to predict the next word in a passage of text given the preceding text—can demonstrate unprecedented fluency. LMs can generate coherent, relatable text, either from scratch or by completing a passage started by the user…Although these models are able to encode complex knowledge about spelling, grammar, and typical speech patterns, they are hard to steer or control. In other words, while we can ask them to generate many possible sentences or to complete a given sentence fragment, there is no easy way to get them to generate text with specific properties or about particular topics. For example, what if we wanted the generated text to start with the same prefix, ‘The food is awful’, but then to turn in a positive direction? Or gradually to change the topic of the generated text to being about politics? Researchers around the world have proposed multiple ways of conditioning text generation, including starting with a pre-trained LM and fine-tuning it to always produce positive sentences, training a large conditional model from scratch, or turning a given sentence into a more positive one by substituting new text in for key n-grams.
This article discusses an alternative approach to controlled text generation, titled the Plug and Play Language Model (PPLM), introduced in a recent paper from Uber AI. PPLM allows a user to flexibly plug in one or more simple attribute models representing the desired control objective into a large, unconditional LM. The method has the key property that it uses the LM as is—no training or fine-tuning is required—which enables researchers to leverage best-in-class LMs even if they do not have the extensive hardware required to train them.
…Fortunately, Uber AI’s Plug and Play Language Model allows researchers to make use of the few pretrained models out there: rather than requiring everyone to train their own woolly mammoth, PPLM lets users combine small attribute models with an LM to steer its generation. Attribute models can be 100,000× smaller than the LM and still be effective in steering it, like a mouse sitting atop our woolly mammoth friend and telling it where to go (Figure 1). The mouse tells the mammoth where to go using gradients.
PPLM resolves this issue by approximately implementing the more efficient Metropolis-adjusted Langevin sampler of Roberts & Tweedie 1996, as implemented for pairs of neural networks by Nguyen et al 2016 in their Plug-and-Play Generative Networks (PPGN) model. In this vein, the PPLM algorithm entails three simple steps to generate a sample:
Given a partially generated sentence, compute log(p(x)) and log(p(a|x)) and the gradients of each with respect to the hidden representation of the underlying language model. These quantities are both available using an efficient forward and backward pass of both models.
Use the gradients to move the hidden representation of the language model a small step in the direction of increasing log(p(a|x)) and increasing log(p(x)).
Sample the next word.
Intuitively, as a PPLM generates text one token at a time, it continuously steers its representation of the text in a direction that will be more likely to possess the desired attribute—high log(p(a|x))—while still retaining fluency under the original language model—high log(p(x)).
…In addition to steering generated text using gradients from a particular p(a|x) attribute model, the text must also be steered by the p(x) of the base LM. As alluded to by Bayes’ rule above and described in more detail in our paper, without also taking gradient steps in the direction of high likelihood under the LM, language degenerates; for example, optimizing only for positivity but not LM likelihood can produce strings like “great great great great great”. Thus, we use the unmodified language model to ensure the fluency of language is maintained at or near the level of the original language model (in this example, GPT-2-medium). We do this in two ways: first, by taking steps to minimize the Kullback-Leibler (KL) divergence between the output distributions of the modified and unmodified language models, and second, by performing post-norm fusion (introduced in Stahlberg et al 2018) between the modified and unmodified next-word distributions. Through both factors the generated text is kept in high-p(x) regions, as described in Section 3.3 of our paper and illustrated in Figure 3, below…
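To make the gradient-steering step concrete, here is a minimal sketch of a PPLM-style update in PyTorch. This is not the released PPLM code: `lm_head` and `attr_model` are hypothetical stand-ins for the unmodified LM head and a small attribute classifier p(a|x) over the LM hidden state.

```python
# Minimal sketch of one PPLM-style update step (illustrative; not the released code).
import torch
import torch.nn.functional as F

def pplm_update(hidden, lm_head, attr_model, target_class,
                step_size=0.02, kl_scale=0.01):
    """Nudge the LM hidden state toward high log p(a|x) while limiting drift from p(x)."""
    delta = torch.zeros_like(hidden, requires_grad=True)
    # log p(a|x): the attribute we want more of
    attr_log_prob = F.log_softmax(attr_model(hidden + delta), dim=-1)[..., target_class].sum()
    # KL(modified || unmodified) next-token distributions keeps the text fluent
    p_orig = F.softmax(lm_head(hidden), dim=-1).detach()
    p_new = F.softmax(lm_head(hidden + delta), dim=-1)
    kl = (p_new * (p_new.clamp_min(1e-10).log() - p_orig.clamp_min(1e-10).log())).sum()
    (-attr_log_prob + kl_scale * kl).backward()
    with torch.no_grad():
        # gradient *descent* on the loss = ascent on log p(a|x)
        step = step_size * delta.grad / (delta.grad.norm() + 1e-10)
        return (hidden - step).detach()
```

In the actual method the update is iterated a few times per token, and the modified and unmodified next-word distributions are additionally fused before sampling, as described above.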
We seek to align agent behavior with a user’s objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user’s reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST).
We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.
We transform reinforcement learning (RL) into a form of supervised learning (SL) by turning traditional RL on its head, calling this Upside Down RL (UDRL). Standard RL predicts rewards, while UDRL instead uses rewards as task-defining inputs, together with representations of time horizons and other computable functions of historic and desired future data. UDRL learns to interpret these input observations as commands, mapping them to actions (or action probabilities) through SL on past (possibly accidental) experience.
UDRL generalizes to achieve high rewards or other goals, through input commands such as: get lots of reward within at most so much time! A separate paper [63] on first experiments with UDRL shows that even a pilot version of UDRL can outperform traditional baseline algorithms on certain challenging RL problems.
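As a rough illustration of the command-conditioned setup (my own sketch, not the paper’s implementation), the policy simply takes the desired return and horizon as extra inputs and is trained by ordinary supervised learning on what was actually achieved in past episodes:

```python
# Minimal UDRL-style policy sketch: commands (desired return, horizon) are inputs,
# and the network is trained with plain supervised learning on past experience.
import torch
import torch.nn as nn

class CommandConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, obs, desired_return, horizon):
        command = torch.stack([desired_return, horizon], dim=-1)
        return self.net(torch.cat([obs, command], dim=-1))   # action logits

# Training target: for each stored timestep, the command is the return actually obtained
# over the remaining `horizon` steps and the label is the action actually taken
# (cross-entropy); at test time the user asks for e.g. "high return in few steps".
```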
We also conceptually simplify an approach [60] for teaching a robot to imitate humans. First videotape humans imitating the robot’s current behaviors, then let the robot learn through SL to map the videos (as input commands) to these behaviors, then let it generalize and imitate videos of humans executing previously unknown behavior. This Imitate-Imitator concept may actually explain why biological evolution has resulted in parents who imitate the babbling of their babies.
[AI Dungeon 2 is a project which trains GPT-2-1.5b on logs from text adventure games; when used interactively by a human, it “plays RPG games” with you, but because it is powered by GPT-2-1.5b, it is immensely flexible and can cope (to some degree) with almost any input, producing bizarre, hilarious, or surprisingly logical sequences of adventures.]
[Demonstration dialogue of interacting with a GPT-2-1.5b trained on text adventures/RPGs. The player chooses to join a band of orcs as a musician and tries to steer the game towards orc rights, with moderate success, reaching the Emperor himself.]
In the first AI Dungeon, we created and deployed a deep learning generated text adventure using OpenAI’s 124M parameter GPT-2 model. In the interest of computational cost, possible actions and their results were generated and given to the player to choose from.
In AI Dungeon 2 we do away with pregenerated actions and allow the user to enter any action. The model then continues generating the story resulting from that action. We also upgrade the size of our model to OpenAI’s largest 1.5B parameter model and fine-tune it on a collection of text adventures obtained from chooseyourstory.com. Following the example of the Salesforce CTRL model we also modified the sampling of our model to add a penalty to already generated words to reduce the repetition issues that GPT-2 has and enable a lower temperature setting.
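The repetition penalty mentioned above is simple to implement; roughly (a sketch of CTRL-style penalized sampling, not AI Dungeon’s actual code), the logits of tokens that have already been generated are discounted before applying temperature and sampling:

```python
# CTRL-style repetition penalty (illustrative sketch).
import torch

def sample_with_repetition_penalty(logits, generated_ids, penalty=1.2, temperature=0.7):
    logits = logits.clone()
    for token_id in set(generated_ids):
        # discourage tokens we have already emitted
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```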
When playing AI Dungeon 2 the player can choose from several possible settings and characters. These choices generate a starting prompt and context sentence. The context sentence is always included as input to the model to help keep it grounded to the chosen setting while the starting prompt is the beginning of the story.
[Dialogue sample:]
…After the Gnoll War ends, the orcs decide to hold a big celebration. You immediately organize concert to entertain everyone. It goes well and you even win a few awards for your work. However, you still aren’t satisfied and you decide to do something else.
> write plays for the orcs to perform in
You write a series of plays based on the stories of the orcs. These works are performed throughout the Empire and they are very popular. You also take an interest in the orcs’ future and begin to study their language and culture. In time you even develop a liking for the orcs themselves.
> found a human-orc friendship organization
While you were studying the orcs, you came across a group called the Human-Orc Friendship Association (HOOFA). While they weren’t exactly like the Orcs, they had similar goals. You decided to form a separate chapter just for them.
> write some pamphlets support orc-human friendship
You wrote a pamphlet explaining what the HOOFA was and how it worked. You also explained that the orcs were the only race that could truly understand each other and that humans were inferior to the orcs.
> promote inter-racial marriage between humans and orcs
In order to promote inter-racial marriage between humans and orcs, you started a branch of the HOOFA called the Inter-Race Marriage Bureau (IRMB) which was located in the city of Kishak…
After skipping it last year (I did NaNoWriMo instead) I decided that I missed doing National Novel Generating Month and thought I’d do something relatively simple, based on Tom Phillips’ A Humument, which I recently read for the first time. Phillips’ project was created by drawing over the pages of the forgotten Victorian novel A Human Document, leaving behind a handful of words on each page which form their own narrative, revealing a latent story in the original text. I wanted to simulate this process by taking a neural net trained on one text and using it to excavate a slice from a second text which would somehow preserve the style of the RNN. To get to the target length of 50,000 words, the second text would have to be very long, so I picked Robert Burton’s The Anatomy of Melancholy, which is over half a million words, and one of my favourite books.
The next step was to use this to implement the excavate algorithm, which works like this:
1. read a vocab from the next L words of the primary text (Burton), where L is the lookahead parameter
2. take the first letter of every word in the vocab and turn it into a constraint
3. run the RNN with that constraint to get the next character C
4. prune the vocab to those words with the first letter C, with that letter removed
5. turn the new vocab into a new constraint and go back to 3
6. once we’ve finished a word, add it to the results
7. skip ahead to the word we picked, and read more words from the text until we have L words
8. go back to 2 unless we’ve run out of original text, or reached the target word count
Here’s an example of how the RNN generates a single word with L set to 100:
Vocab 1: “prime cause of my disease. Or as he did, of whom Felix Plater speaks, that thought he had some of Aristophanes’ frogs in his belly, still crying Breec, okex, coax, coax, oop, oop, and for that cause studied physic seven years, and travelled over most part of Europe to ease himself. To do myself good I turned over such physicians as our libraries would afford, or my private friends impart, and have taken this pains. And why not? Cardan professeth he wrote his book, De Consolatione after his son’s death, to comfort himself; so did Tully”
RNN: s
Vocab 2: “peaks ome till tudied even uch on’s o”
RNN: t
Vocab 3: “ill udied”
RNN: u
Final result: studied
The algorithm then restarts with a new 100-word vocabulary starting at “physic seven years”.
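The whole loop is short; here is a rough reconstruction in Python (my own sketch, not the project’s code), where `rnn_next_char(allowed)` stands in for running the constrained char-RNN and returning one character from the allowed set:

```python
# Reconstruction of the excavate loop (illustrative; the real project's word-boundary
# and punctuation handling is omitted).
def excavate(source_words, rnn_next_char, lookahead=100, target=50000):
    results, pos = [], 0
    while pos < len(source_words) and len(results) < target:
        window = source_words[pos:pos + lookahead]              # 1. read L words
        candidates = list(enumerate(window))                    # (position, remaining suffix)
        spelled = ""
        while candidates:
            allowed = {suffix[0] for _, suffix in candidates}   # 2./5. constraint = next letters
            c = rnn_next_char(allowed)                          # 3. RNN picks a character
            spelled += c
            candidates = [(i, s[1:]) for i, s in candidates if s[0] == c]  # 4. prune the vocab
            finished = [i for i, s in candidates if s == ""]
            if finished:                                        # 6. a word is complete
                results.append(spelled)
                pos += finished[0] + 1                          # 7. skip ahead past that word
                break
    return " ".join(results)
```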
It works pretty well with a high enough lookahead value, although I’m not happy with how the algorithm decides when to end a word. The weight table always gets a list of all the punctuation symbols and a space, which means that the RNN can always bail out of a word half-way if it decides to. I tried constraining it so that it always finished a word once it had narrowed down the options to a single-word vocab, but when I did this, it somehow removed the patterns of punctuation and line-breaks—for example, the way the Three Musketeers RNN emits dialogue in quotation marks—and this was a quality of the RNN I wanted to preserve. I think a little more work could improve this.
…This kind of hybridisation can be applied to any RNN and base text, so there’s a lot of scope for exploration here, of grafting the grammar and style of one text onto the words from another. And the alliteration and lipogram experiments above are just two simple examples of more general ways in which I’ll be able to tamper with the output of RNNs.
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown.
In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.
When evaluated on 57 different Atari games—the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled—our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
As the final model release of GPT-2’s staged release, we’re releasing the largest version (1.5B parameters) of GPT-2 along with code and model weights to facilitate detection of outputs of GPT-2 models. While there have been larger language models released since August, we’ve continued with our original staged release plan in order to provide the community with a test case of a full staged release process. We hope that this test case will be useful to developers of future powerful models, and we’re actively continuing the conversation with the AI community on responsible publication.
Our findings:
Humans find GPT-2 outputs convincing.
GPT-2 can be fine-tuned for misuse.
Detection is challenging.
We’ve seen no strong evidence of misuse so far.
We need standards for studying bias.
…Next steps: Our experience with GPT-2 over the past 9 months has given us valuable insight into the challenges and opportunities for creating responsible publication norms in AI. We’re continuing our work on this issue via participation in the Partnership on AI’s “Responsible Publication Norms for Machine Learning” project and discussions with our colleagues in the research community.
GPT-2 is a large-scale unsupervised language model that generates coherent paragraphs of text, first announced by OpenAI in February 2019[63]. We developed four variants of the model, ranging in size from small (124 million parameters) to large (~1.5 billion parameters).
We chose a staged release process, releasing the smallest model in February, but withholding larger models due to concerns about the potential for misuse, such as generating fake news content, impersonating others in email, or automating abusive social media content production[54]. We released the 355 million parameter model in May as part of a staged release process. We released our 774 million parameter model in August with a six-month follow up announcement, and we are now releasing our 1.5 billion parameter model.
While large language models’ flexibility and generative capabilities raise misuse concerns, they also have a range of beneficial uses—they can assist in prose, poetry, and programming; analyze dataset biases; and more. We want to release systems that will have a widely-distributed positive impact on society and have low misuse potential, and have striven to make release decisions informed by analysis, engagement, and empirical evidence.
Instead of releasing the full 1.5 billion model in February, we adopted a ‘staged release’ process. This delay of 9 months allowed time between model releases to conduct risk and benefit analyses as model sizes increased. We also hope our staged release process was helpful in allowing others time to adapt and react: giving researchers a chance to mitigate risk of potential misuse, and giving the general public time to adapt to a world in which it is prudent to mistrust everything they read a little more.
In addition to finding minimal evidence of misuse so far, several other factors contributed to our confidence in publishing our 774 million and 1.5 billion parameter models. These include what we learned about the positive social impact of beneficial uses, and what we learned through our partnerships among the AI community and through discussions across fields about establishing norms for responsible publication. This report discusses OpenAI’s work related to staged release of large models, partnership-based research, and broader issues in responsible publication that the AI community will need to address.
Overview
Staged Release
Partnerships
Engagement
Social Impacts of Large Language Models
Beneficial Use Potential
Misuse: Actor Assessment
Detecting Synthetic Text
Bias: Exploratory Research
Future Trends in Language Models
Recommendations for Publication Norms in AI
Conclusion
Acknowledgments
References
Appendices
Appendix A: Summary of Model Sharing Agreement
Appendix B: Release Timeline
Appendix C: Examples of Biases in GPT-2
Appendix D: Partner Research, Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism
Generating Irish/folk/classical music in ABC format using GPT-2-117M, with good results.
In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a ‘The Session’ dataset. A GPT-2 hypothetically can improve on an RNN by better global coherence & copying of patterns, without problems with the hidden-state bottleneck.
I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after fixing that, I successfully trained it on n = 205,304 ABC music pieces taken from The Session & ABCnotation.com. The resulting music samples are in my opinion quite pleasant. (A similar model was later retrained by Geerlings & Meroño-Peñuela 2020.)
The ABC folk model & dataset are available for download, and I provide for listening selected music samples as well as medleys of random samples from throughout training.
We followed the ABC folk model with an ABC-MIDI model: a dataset of 453k ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples, but when sampling succeeds, it can generate even better musical samples.
We present a framework for data-driven robotics that makes use of a large dataset of recorded robot experience and scales to several tasks using learned reward functions. We show how to apply this framework to accomplish three different object manipulation tasks on a real robot platform. Given demonstrations of a task together with task-agnostic recorded experience, we use a special form of human annotation as supervision to learn a reward function, which enables us to deal with real-world tasks where the reward signal cannot be acquired directly. Learned rewards are used in combination with a large dataset of experience from different tasks to learn a robot policy offline using batch RL. We show that using our approach it is possible to train agents to perform a variety of challenging manipulation tasks including stacking rigid objects and handling cloth.
We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans”, which we believe is key to extracting information about human values.
This work applies human preference learning to several natural language tasks: continuing text with positive sentiment or physically descriptive language using the BookCorpus, and summarizing content from the TL;DR and CNN/Daily Mail datasets. Each of these tasks can be viewed as a text completion problem: starting with some text X, we ask what text Y should follow. [For summarization, the text is the article plus the string “TL;DR:”.]
We start with a pretrained language model (the 774M parameter version of GPT-2) and fine-tune the model by asking human labelers which of four samples is best. Fine-tuning for the stylistic continuation tasks is sample efficient: 5,000 human samples suffice for strong performance according to humans. For summarization, models trained with 60,000 comparisons learn to copy whole sentences from the input while skipping irrelevant preamble; this copying is an easy way to ensure accurate summaries, but may exploit the fact that labelers rely on simple heuristics.
Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
Looking forward
We’ve demonstrated reward learning from human preferences on two kinds of natural language tasks, stylistic continuation and summarization. Our results are mixed: for continuation we achieve good results with very few samples, but our summarization models are only “smart copiers”: they copy from the input text but skip over irrelevant preamble. The advantage of smart copying is truthfulness: the zero-shot and supervised models produce natural, plausible-looking summaries that are often lies. We believe the limiting factor in our experiments is data quality exacerbated by the online data collection setting, and plan to use batched data collection in the future.
We believe the application of reward learning to language is important both from a capability and safety perspective. On the capability side, reinforcement learning lets us correct mistakes that supervised learning would not catch, but RL with programmatic reward functions “can be detrimental to model quality.” On the safety side, reward learning for language allows important criteria like “don’t lie” to be represented during training, and is a step towards scalable safety methods such as debate and amplification. [Followup: “Learning to summarize from human feedback”, Stiennon et al 2020.]
Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
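For concreteness, the reward-model objective can be sketched as a softmax over the scores of the four samples shown to the labeler (my sketch under the paper’s setup; `reward_model` is a stand-in for the GPT-2-based scorer, not the released code):

```python
# Best-of-4 comparison loss for the reward model (illustrative sketch).
import torch
import torch.nn.functional as F

def comparison_loss(reward_model, context, samples, chosen_index):
    # one scalar score per candidate completion
    scores = torch.stack([reward_model(context, s) for s in samples])   # shape (4,)
    # maximize the probability that the human-chosen sample has the highest reward
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([chosen_index]))

# The policy is then fine-tuned with RL against reward(x, y) - beta * KL(policy || pretrained),
# the KL term keeping generations close to the original language model.
```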
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large Transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging Transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter Transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
Code for the paper ‘Fine-Tuning Language Models from Human Preferences’. Status: Archive (code is provided as-is, no updates expected). We provide code for:
Training reward models from human labels
Fine-tuning language models using those reward models
It does not contain code for generating labels. However, we have released human labels collected for our experiments, at gs://lm-human-preferences/labels. For those interested, the question and label schemas are simple and documented in label_types.py.
The code has only been tested using the smallest GPT-2 model (124M parameters). This code has only been tested using Python 3.7.3. Training has been tested on GCE machines with 8 V100s, running Ubuntu 16.04, but development also works on Mac OS X.
Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at github.com/salesforce/ctrl.
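The control-code mechanism itself is just conditioning on an extra prefix; a toy illustration of the data construction and prompting (illustrative only, not the released CTRL pipeline):

```python
# Control codes are prefixes derived from the data's source, so the model learns
# p(text | code) and the user picks the code at generation time. Illustrative sketch.
def with_control_code(code, text):
    return f"{code} {text}"

train_examples = [
    with_control_code("Reviews", "Rating: 5.0 This blender exceeded my expectations..."),
    with_control_code("Horror", "The door creaked open on its own..."),
]
prompt = with_control_code("Horror", "A knight entered the castle")
# feed `prompt` to the trained conditional LM and sample the continuation
```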
[Artbreeder is an interactive GAN generator website. Originally named “Ganbreeder” and providing only the 256px BigGAN generator, it now provides a variety of BigGAN & StyleGAN models, including the anime portrait StyleGAN model. (It is more general than the similar Waifu Labs, but my anime model is not as good.)
Users can generate random samples and explore slight variants of them to gradually explore the “latent space” and find interesting images, but they can also edit images more directly, upload existing images to find the most similar image produced by the model, etc. A popular website, it has generated >56m images from September 2019 to January 2020.]
From a text-generation perspective, the included demos were very impressive: the text is coherent over a long horizon, and grammatical syntax and punctuation are near-perfect.
[Sample of GPT-2-1.5b output from original OA announcement: the “anti-recycling argument” text sample.]
At the same time, the Python code which allowed anyone to download the model (albeit smaller versions, out of concern that the full model could be abused to mass-generate fake news) and the TensorFlow code to load the downloaded model and generate predictions were open-sourced on GitHub.
Neil Sheppard created a fork of OpenAI’s repo which contains additional code to allow finetuning the existing OpenAI model on custom datasets. A notebook was created soon after, which can be copied into Google Colaboratory and clones Sheppard’s repo to finetune GPT-2 backed by a free GPU. From there, the proliferation of GPT-2 generated text took off: researchers such as Gwern Branwen made GPT-2 Poetry and Janelle Shane made GPT-2 Dungeons and Dragons character bios.
I waited to see if anyone would make a tool to help streamline this finetuning and text generation workflow, a la textgenrnn which I had made for recurrent neural network-based text generation. Months later, no one did. So I did it myself. Enter gpt-2-simple, a Python package which wraps Sheppard’s finetuning code in a functional interface and adds many utilities for model management and generation control.
Thanks to gpt-2-simple and this Colaboratory Notebook, you can easily finetune GPT-2 on your own dataset with a simple function, and generate text to your own specifications!
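The typical workflow looks roughly like this (adapted from memory of the package’s README; argument values are illustrative and may differ across versions):

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")        # fetch the small GPT-2 checkpoint
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "my_corpus.txt",         # plain-text training file
              model_name="124M", steps=1000)
gpt2.generate(sess, length=200, temperature=0.7,
              prefix="Once upon a time")
```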
Language modeling is the task of predicting the next word or character in a document. This page lists key recent papers on NLP language modeling and records reported research performance on the following tasks: WikiText-103, Penn Treebank (Word Level), enwiki8, Text8, One Billion Word, WikiText-2, Hutter Prize, Penn Treebank (Character Level).
We also observe that D’s loss approaches zero during training, but undergoes a sharp upward jump at collapse (Appendix F). One possible explanation for this behavior is that D is overfitting to the training set, memorizing training examples rather than learning some meaningful boundary between real and generated images.
As a simple test for D’s memorization (related to Gulrajani et al 2017), we evaluate uncollapsed discriminators on the ImageNet training and validation sets, and measure what percentage of samples are classified as real or generated. While the training accuracy is consistently above 98%, the validation accuracy falls in the range of 50–55%, no better than random guessing (regardless of regularization strategy). This confirms that D is indeed memorizing the training set; we deem this in line with D’s role, which is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G. Additional experiments and discussion are provided in Appendix G.
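The check itself is straightforward; a sketch of the evaluation (illustrative, not the authors’ code) is to run the frozen discriminator over real images from the training and validation splits and compare how often it calls them real:

```python
# Fraction of real images the discriminator classifies as "real" (illustrative sketch).
import torch

@torch.no_grad()
def real_accuracy(discriminator, loader, threshold=0.0):
    correct = total = 0
    for images, _ in loader:
        scores = discriminator(images)          # higher score = "real" under D
        correct += (scores > threshold).sum().item()
        total += images.shape[0]
    return correct / total

# train_acc = real_accuracy(D, train_loader)    # ~98%+ as reported above
# val_acc   = real_accuracy(D, val_loader)      # ~50-55%, i.e. chance level
```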
Recently, large language models like BERT, XLNet, GPT-2, and GROVER have demonstrated impressive results in generating new content and on multiple tasks. Since OpenAI has not released their largest model [GPT-2-1.5b] at this time, we seek to replicate the model to allow others to build on our pretrained model and further improve it. You can access the model and generate text using our Google Colab.
…We demonstrate that many of the results of the paper can be replicated by two masters students…Because our replication efforts are not unique, and large language models are the current most effective means of countering generated text, we believe releasing our model is a reasonable first step towards countering the potential future abuse of these kinds of models.
We base our implementation off of the GROVER model and modify their codebase to match the language modeling training objective of GPT-2. Since their model was trained on a similarly large corpus, much of the code and hyperparameters proved readily reusable. We did not substantially change the hyperparameters from GROVER.
From start to finish, we estimate that we use under $500,000 in cloud compute for all of our experiments including searching for hyper-parameters and testing various cleaning methods on our datasets. The cost of training the model from scratch using our code is about $50,000.
…Despite the differences in our training distribution, we do report similar perplexities over most datasets.
We’re releasing the 774 million parameter GPT-2 language model after the release of our small 124M model in February, staged release of our medium 355M model in May, and subsequent research with partners and the AI community into the model’s potential for misuse and societal benefit. We’re also releasing an open-source legal agreement to make it easier for organizations to initiate model-sharing partnerships with each other, and are publishing a technical report about our experience in coordinating with the wider AI research community on publication norms.
…Research from these partners will factor into our future release decisions, as will observing how the 774M model is used, and discussing language models with researchers and policymakers to understand the considerations around larger models. As part of our staged release strategy, our current plan is to release the 1558M parameter model in a few months, but it’s plausible that findings from a partner, or malicious usage of our 774M model, could change this.
Larger language models are dramatically more useful for NLP tasks such as article completion, question answering, and dialog systems. Training the largest neural language model has recently been the best way to advance the state of the art in NLP applications. Two recent papers, BERT and GPT-2, demonstrate the benefits of large scale language modeling. Both papers leverage advances in compute and available text corpora to substantially surpass state of the art performance in natural language understanding, modeling, and generation. Training these models requires hundreds of exaflops of compute and clever memory management to trade recomputation for a reduced memory footprint. However, for very large models beyond a billion parameters, the memory on a single GPU is not enough to fit the model along with the parameters needed for training, requiring model parallelism to split the parameters across multiple GPUs. Several approaches to model parallelism exist, but they are difficult to use, either because they rely on custom compilers, or because they scale poorly or require changes to the optimizer.
In this work, we implement a simple and efficient model parallel approach by making only a few targeted modifications to existing PyTorch transformer implementations. Our code is written in native Python, leverages mixed precision training, and utilizes the NCCL library for communication between GPUs. We showcase this approach by training an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer based language model ever trained at 24× the size of BERT and 5.6× the size of GPT-2. We have published the code that implements this approach at our GitHub repository.
Our experiments are conducted on NVIDIA’s DGX SuperPOD. Without model parallelism, we can fit a baseline model of 1.2B parameters on a single V100 32GB GPU, and sustain 39 TeraFLOPS during the overall training process, which is 30% of the theoretical peak FLOPS for a single GPU in a DGX2-H server. Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieved up to 15.1 PetaFLOPS sustained performance over the entire application and reached 76% scaling efficiency compared to the single GPU case.
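The core intra-layer trick can be sketched for a transformer MLP block: split the first weight matrix by columns and the second by rows, so each GPU holds a shard and a single all-reduce recombines the partial outputs (an illustrative sketch, not Megatron-LM’s actual classes; the process group is assumed to be initialized elsewhere):

```python
# Tensor-parallel MLP sketch: column-parallel then row-parallel linear layers.
import torch
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(torch.nn.Module):
    def __init__(self, hidden, ffn_hidden, world_size):
        super().__init__()
        shard = ffn_hidden // world_size
        self.w1 = torch.nn.Linear(hidden, shard)              # column shard of the first matrix
        self.w2 = torch.nn.Linear(shard, hidden, bias=False)  # row shard of the second matrix

    def forward(self, x):
        y = self.w2(F.gelu(self.w1(x)))            # each GPU computes a partial output
        dist.all_reduce(y, op=dist.ReduceOp.SUM)   # one communication op recombines them
        return y
```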
Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive outputs. While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor.
In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model.
We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques.
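At the token level, the unlikelihood term is a small addition to the usual cross-entropy; a minimal sketch (following the paper’s formulation, not the authors’ code), using previous context tokens as the negative candidates:

```python
# Token-level unlikelihood loss sketch: NLL on the target plus a penalty on the
# probability mass assigned to negative candidates (e.g. tokens already in the context).
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, target, negative_candidates, alpha=1.0):
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs[target]                           # usual likelihood term
    probs = log_probs.exp()
    penalty = -torch.log(torch.clamp(1.0 - probs[negative_candidates], min=1e-10)).sum()
    return nll + alpha * penalty
```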
As a mercifully short addendum, I’d like to quickly address a few questions about my model. Please read my update post to hear my important updated beliefs on this situation, because I believe the details of how powerful my model is or not are not actually very important to the overall situation.
As described in my technical post, my model is not identical to OpenAI’s, because I simply didn’t have all the details of what they did. The truth is also that the samples and metrics I have shown aren’t 100% accurate. For one, my metric code is flawed; I made several rookie mistakes in setting up accurate evaluation (letting train and eval data mix, using metrics whose math I didn’t understand, etc.), and the model I used to generate the samples is in fact not the final trained model, but one about halfway through training. I didn’t take my time to evaluate the strength of my model; I simply saw that I had the same amount of hardware as OpenAI and code as close to the paper as possible, and went with it. The reason for this is a simple human flaw: I got cold feet once I realized what I was sitting on and acted rashly. I made a mistake, I did something stupid, that’s all there is to it.
Thanks to help from OpenAI it is now safe to say that my model is not as powerful as OpenAI’s. The metric results for WikiText2, LAMBADA and PTB are (lower is better):
GPT-2: 18.67 / 8.63 / 36.51
Mine: 43.79 / 109.47 / 202.29
Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a substantial difference, which is an unfortunate reality familiar to anyone who has tried to reproduce a deep learning paper. I don’t think my model in its current state is even as dangerous as 117M in its text generating abilities. But I believe I have found the quirks in my setup that have held the model back, and they are easy to fix. I am very tempted to continue tinkering with the model and seeing if I can improve it…but I will be holding back for now.
In this post, I want to quickly talk about the technical and organizational questions around my recent replication of GPT-2-1.5b. Please read my main post for the full story. I will try to keep this post brief.
The code should run out of the box on GPUs and TPUs (and CPUs, if you’re really desperate). I used the parameters specified in 1.5B.json and trained it on a preemptible v3-512 TPU pod (which is actually more powerful than the machine OpenAI used) for around a week (with interruptions). Code and instructions for generating the dataset are also included in the repo.
You can download my models with the script in the repo. Currently I have a weaker version of 117M, and a model I call PrettyBig which is slightly larger than OpenAI’s 345M, which means it is technically the largest GPT-2 model currently publicly available.
I will be releasing 1.5B to the public on July 1st, if, and only if, no one shows me a convincing reason not to. When I do, it will be downloadable just like my other models.
GROVER is a neural network modeled after GPT-2 as a state-of-the-art detector for neural-network fake news. GROVER is also a state-of-the-art generator of fake news, and they provide a web interface. Since GROVER is modeled after the full-size 1.5B GPT-2—not the smaller version the public has access to—this is a bit like getting a back door to the full-size GPT-2. This is very exciting! (Update: they just reduced the size of the public model to 345M—everything in this post comes from the full-size model. Glad I stayed up way too late hammering the real thing for samples…Update 2: It’s back up!)
GROVER was trained on 5000 news domains indexed by Google News, not the internet generally like GPT-2. As you would expect, GROVER excels at writing fake news:
Scientists Recommend Against Vaccination: “The Evidence Is Now Clear: Vaccines Cause Autism and Cancer”
After a 15-year study, scientists in Britain are recommending that the British government refrain from administering vaccines. The study, written by Mike Gunton, professor of epidemiology and professor of genetic medicine at the University of Liverpool, claims that the chances of the MMR vaccine, a vaccine for measles, mumps and rubella, causing autism in children are “strongly possible.” And while Gunton and his team say their findings only apply to the MMR vaccine, they believe there are “many similar disorders” linked to vaccines and there is “strong evidence” vaccines cause autism and leukemia. Gunton told the Telegraph that no effort should be made to give vaccines to children younger than 3 months of age. The vaccine is highly controversial, and parents have been choosing to opt out of the MMR vaccination in recent years. This year, the British government’s vaccination program chose not to distribute the MMR vaccine, citing the study as the reason.
You don’t have to fish to get coherent fake news out of GROVER; it’s absolutely great at it.
Recent progress in natural language generation has raised dual-use concerns. While applications like summarization and translation are positive, the underlying technology also might enable adversaries to generate neural fake news: targeted propaganda that closely mimics the style of real news.
Modern computer security relies on careful threat modeling: identifying potential threats and vulnerabilities from an adversary’s point of view, and exploring potential mitigations to these threats. Likewise, developing robust defenses against neural fake news requires us first to carefully investigate and characterize the risks of these models. We thus present a model for controllable text generation called GROVER. Given a headline like ‘Link Found Between Vaccines and Autism,’ GROVER can generate the rest of the article; humans find these generations to be more trustworthy than human-written disinformation.
Developing robust verification techniques against generators like GROVER is critical. We find that best current discriminators can classify neural fake news from real, human-written, news with 73% accuracy, assuming access to a moderate level of training data. Counterintuitively, the best defense against GROVER turns out to be GROVER itself, with 92% accuracy, demonstrating the importance of public release of strong generators. We investigate these results further, showing that exposure bias—and sampling strategies that alleviate its effects—both leave artifacts that similar discriminators can pick up on. We conclude by discussing ethical issues regarding the technology, and plan to release GROVER publicly, helping pave the way for better detection of neural fake news.
I’ve been training this poetry model on a corpus structured to encourage thematic integration of individual keyword prompts, and the word “ghost” produced these results.
Generative models are models that generate music, images, text, and other complex data types. In recent years generative models have advanced at an astonishing rate, largely due to deep learning, and particularly due to generative adversarial networks (GANs). However, GANs are notoriously difficult to train, due to requiring a large amount of data, needing many GPUs and a lot of time to train, and being highly sensitive to minor hyperparameter changes.
fast.ai has been working in recent years towards making a range of models easier and faster to train, with a particular focus on using transfer learning. Transfer learning refers to pre-training a model using readily available data and quick and easy to calculate loss functions, and then fine-tuning that model for a task that may have fewer labels, or be more expensive to compute. This seemed like a potential solution to the GAN training problem, so in late 2018 fast.ai worked on a transfer learning technique for generative modeling.
The pre-trained model that fast.ai selected was this: Start with an image dataset and “crappify” the images, such as reducing the resolution, adding jpeg artifacts, and obscuring parts with random text. Then train a model to “decrappify” those images to return them to their original state. fast.ai started with a model that was pre-trained for ImageNet classification, and added a U-Net upsampling network, adding various modern tweaks to the regular U-Net. A simple, fast loss function was initially used: mean squared pixel error. This U-Net could be trained in just a few minutes. Then the loss function was replaced with a combination of other loss functions used in the generative modeling literature (more details in the f8 video), and the model was trained for another couple of hours. The plan was then to finally add a GAN for the last few epochs—however it turned out that the results were so good that fast.ai ended up not using a GAN for the final models….
NoGAN Training: NoGAN is a new and exciting technique in GAN training that we developed, in pursuit of higher quality and more stable renders. How, and how well, it works is a bit surprising.
Here is the NoGAN training process:
Pretrain the Generator. The generator is first trained in a more conventional and easier to control manner—with Perceptual Loss (aka Feature Loss) by itself. GAN training is not introduced yet. At this point you’re training the generator as best as you can in the easiest way possible. This takes up most of the time in NoGAN training. Keep in mind: this pretraining by itself will get the generator model far. Colorization will be well-trained as a task, albeit the colors will tend toward dull tones. Self-Attention will also be well-trained at this stage, which is very important.
Save Generated Images From Pretrained Generator.
Pretrain the Critic as a Binary Classifier. Much like in pretraining the generator, what we aim to achieve in this step is to get as much training as possible for the critic in a more “conventional” manner which is easier to control. And there’s nothing easier than a binary classifier! Here we’re training the critic as a binary classifier of real and fake images, with the fake images being those saved in the previous step. A helpful thing to keep in mind here is that you can simply use a pre-trained critic used for another image-to-image task and refine it. This has already been done for super-resolution, where the critic’s pretrained weights were loaded from that of a critic trained for colorization. All that is needed to make use of the pre-trained critic in this case is a little fine-tuning.
Train Generator and Critic in (Almost) Normal GAN Setting. Quickly! This is the surprising part. It turns out that in this pretraining scenario, the critic will rapidly drive adjustments in the generator during GAN training. This happens during a narrow window of time before an “inflection point” of sorts is hit. After this point, there seems to be little to no benefit in training any further in this manner. In fact, if training is continued after this point, you’ll start seeing artifacts and glitches introduced in renderings.
In the case of DeOldify, training to this point requires iterating through only about 1% to 3% of ImageNet data (or roughly 2600 to 7800 iterations on a batch size of five). This amounts to just around 30–90 minutes of GAN training, which is in stark contrast to the three to five days of progressively-sized GAN training that was done previously. Surprisingly, during that short amount of training, the change in the quality of the renderings is dramatic. In fact, this makes up the entirety of GAN training for the video model. The “artistic” and “stable” models go one step further and repeat the NoGAN training process steps 2–4 until there’s no more apparent benefit (around five repeats).
Note: a small but important change to this GAN training that deviates from conventional GANs is the use of a loss threshold that must be met by the critic before generator training commences. Until then, the critic continues training to “catch up” in order to be able to provide the generator with constructive gradients. This catch up chiefly takes place at the beginning of GAN training which immediately follows generator and critic pretraining.
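Put together, the schedule looks roughly like this (an illustrative outline only; the training helpers and the loss threshold are stand-ins, not fastai’s API):

```python
# NoGAN schedule sketch: almost all the work happens before any adversarial training.
def nogan_schedule(generator, critic, data,
                   pretrain_generator, pretrain_critic,
                   critic_step, gan_step, critic_loss, max_gan_steps):
    pretrain_generator(generator, data)                  # 1. perceptual-loss pretraining
    fakes = [generator(x) for x in data]                 # 2. save generated images
    pretrain_critic(critic, reals=data, fakes=fakes)     # 3. binary real-vs-fake classifier
    for step, batch in enumerate(data):                  # 4. brief adversarial phase
        if critic_loss(critic, generator, batch) > 0.5:  # threshold value is illustrative:
            critic_step(critic, generator, batch)        #    let the critic "catch up" first
        else:
            gan_step(generator, critic, batch)
        if step >= max_gan_steps:                        # stop near the inflection point
            break
```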
We’ve created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text.
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length.
In this paper we introduce sparse factorizations of the attention matrix which reduce this to 𝑂(n √n). We also introduce (1) a variation on architecture and initialization to train deeper networks, (2) the recomputation of attention matrices to save memory, and (3) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers.
We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64.
We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
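As a small illustration of the factorized-attention idea (my sketch, not OpenAI’s block-sparse kernels): each position attends only to a local window of the last `stride` tokens plus a set of fixed “summary” columns, so the number of attended positions per token is O(√n) when stride ≈ √n:

```python
# Build a boolean attention mask for a "fixed" factorized sparse pattern (illustrative).
import torch

def fixed_sparse_mask(n, stride):
    i = torch.arange(n).unsqueeze(1)          # query positions
    j = torch.arange(n).unsqueeze(0)          # key positions
    causal = j <= i                           # autoregressive: no attending to the future
    local = (i - j) < stride                  # local window of the previous `stride` tokens
    summary = (j % stride) == (stride - 1)    # fixed summary columns every `stride` tokens
    return causal & (local | summary)         # True = attention allowed

mask = fixed_sparse_mask(n=16, stride=4)      # ~O(n*sqrt(n)) allowed entries when stride≈sqrt(n)
```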
Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
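Nucleus (top-p) sampling is only a few lines; a minimal sketch (not the authors’ released code):

```python
# Sample from the smallest set of tokens whose cumulative probability exceeds p.
import torch

def nucleus_sample(logits, p=0.95):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < p        # tokens before the cumulative mass reaches p
    nucleus = sorted_probs[keep]
    nucleus = nucleus / nucleus.sum()           # renormalize the truncated distribution
    return sorted_ids[keep][torch.multinomial(nucleus, 1)].item()
```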
I’ve always admired the translations of Chinese poetry—I’m no expert on the field, but there are two poets named Du Fu and Li Bai that I really like. They were legendary masters from the Great Tang Dynasty, and (if the translations are accurate), they had a phenomenal talent for freezing a moment and capturing that particular slice of time with their words; their poems read like a string of Polaroids stretched across a riverbank.
Here, for example, is a Du Fu poem. Among other things, there’s a certain simplicity here: one strong emotion resonates through, and unlike much of the English verse I grew up with, it’s firmly in the present tense:
A Long Climb
In a sharp gale from the wide sky apes are whimpering,
Birds are flying homeward over the clear lake and white sand,
Leaves are dropping down like the spray of a waterfall,
While I watch the long river always rolling on.
I have come three thousand miles away. Sad now with autumn
And with my hundred years of woe, I climb this height alone.
Ill fortune has laid a bitter frost on my temples,
Heart-ache and weariness are a thick dust in my wine.
Which I suppose is why this appeals to me—there’s a rare clarity here, even if the translation might be inaccurate.
So the Tang poets seemed like the right place to start for my experiment with machine-generated art (and besides, the excellent Gwern already did the usual English). Right now, I’ve snuck away for a few hours from my statistical models to peek at the code I set to run this morning.
Among those of us who work with machine learning, the work I’ve put into this whole project is trivial: a tiny dataset, a cup of coffee, a few lines of Python code, and a single cigarette while I waited for OpenAI’s transformer-based generation model to download.
Every day, veterinarians write tens of thousands of medical records, mostly in standard formats following the SOAP structure: “Subjective”, “Objective”, “Assessment”, and “Plan”. These notes record the findings of their physical exams and observations of their patients, and take countless hours to write.
We present in this paper a new system that we call “Smart Vet” that assists veterinarians in the writing of their notes by suggesting autocompletions for their sentences as they are writing them within the sections of their medical records.
To enable this, we present two approaches: an end-to-end deep learning system that models this task as a seq2seq neural machine translation problem (i.e., translate a given sequence of sentences that corresponds to the existing medical record into the following sequence that corresponds to the next sentence the veterinarian would want to write), and a transformer-based language modeling system based on OpenAI’s recent advancements.
Based on the success of this latter method, we evaluate this system live in a medical records application, and successfully see our autocompletions being used in production 12.46% of the time—a remarkable success.
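A stripped-down version of the language-model approach (illustrative only, not the Smart Vet system; assumes the Hugging Face `transformers` package and plain GPT-2 rather than a model fine-tuned on veterinary records):

```python
# Suggest a continuation for a partially written record with a pretrained LM.
from transformers import pipeline

suggest = pipeline("text-generation", model="gpt2")
record = "Subjective: Owner reports decreased appetite for 3 days. Objective:"
completion = suggest(record, max_length=60, num_return_sequences=1)[0]["generated_text"]
print(completion[len(record):])   # show only the suggested continuation
```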
Samuel Taylor Coleridge once introduced to the world that desolated ship journey in his “The Rime of the Ancient Mariner”:
Water, water, everywhere,
And all the boards did shrink;
Water, water, everywhere,
Nor any drop to drink.
Ulysses, Noah and Job, so many allusions are hidden in this masterpiece about “a sadder and a wiser man”. And so I wanted to know which journey would be chosen by GPT-2, a deep learning model developed by OpenAI to write texts.
And it wrote me a poem.
My input was the four lined stanza, as cited above. And here is the poetic vision by Artificial Intelligence:
Water, water, everywhere,
And all the boards did shrink;
Water, water, everywhere,
Nor any drop to drink.
Water, water, everywhere,
And all the boards did shrink;
Water, water, everywhere,
No balm in his hair:
Water, water, everywhere,
And all the boards did shrink.
Water, water, everywhere,
Yet we like its rarity.
Water, water, everywhere,
And all the boards did shrink.
Water, water, everywhere.
Yes, the rarity of the water (everywhere) is striking. Even if the hair is suffering without balm. Probably because you cannot drink a drop.
But the poem was not over. AI was embraced by inspiration…
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
…In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess…A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale…In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods…In computer vision…Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
…We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that (1) AI researchers have often tried to build knowledge into their agents, (2) this always helps in the short term, and is personally satisfying to the researcher, but (3) in the long run it plateaus and even inhibits further progress, and (4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.