“GPT-3: Language Models are Few-Shot Learners”, (2020-05-28):
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.
Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
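[Illustrative sketch, not OpenAI's evaluation code: "few-shot demonstrations specified purely via text interaction" means the task is encoded as a plain prompt string. A hypothetical helper for the 3-digit-arithmetic task mentioned above:]

```python
def make_few_shot_prompt(demos, query):
    """Format worked examples plus an unanswered query as one prompt string."""
    lines = [f"Q: What is {a} plus {b}? A: {a + b}" for a, b in demos]
    lines.append(f"Q: What is {query[0]} plus {query[1]}? A:")
    return "\n".join(lines)

# Three demonstrations, then the query the model must complete:
prompt = make_few_shot_prompt([(123, 456), (701, 88), (240, 305)], (512, 139))
```

[The model receives this string and simply continues it; no gradient updates occur.]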
…The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.
We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.
“Better Language Models and Their Implications: We've trained a large-scale unsupervised language model [GPT-2] which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training”, (2019-02-14):
Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.
GPT-2 is a large Transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10× the parameters and trained on more than 10× the amount of data.
GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
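[Illustrative sketch, not OpenAI's code: the "predict the next word, given all of the previous words" objective turns a single text into many (context, next-token) training pairs:]

```python
def next_token_pairs(tokens):
    """Every prefix of the sequence predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs(["the", "cat", "sat", "down"])
```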
“A Very Unlikely Chess Game”, (2020-01-06):
…Black is GPT-2. Its excuse [for this chess blunder] is that it’s a text prediction program with no concept of chess. As far as it knows, it’s trying to predict short alphanumeric strings like “e2e4” or “Nb7”. Nobody told it this represents a board game. It doesn’t even have a concept of 2D space that it could use to understand such a claim. But it still captured my rook! Embarrassing!…Last month, I asked him if he thought GPT-2 could play chess. I wondered if he could train it on a corpus of chess games written in standard notation (where, for example, e2e4 means “move the pawn at square e2 to square e4”). There are literally millions of games written up like this. GPT-2 would learn to predict the next string of text, which would correspond to the next move in the chess game. Then you would prompt it with a chessboard up to a certain point, and it would predict how the chess masters who had produced its training data would continue the game—ie make its next move using the same heuristics they would. Gwern handed the idea to his collaborator Shawn Presser, who had a working GPT-2 chess engine running within a week:…You can play against GPT-2 yourself by following the directions in the last tweet, though it won’t be much of a challenge for anyone better than I am.
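[Illustrative sketch of the encoding described above (hypothetical helper names; Presser's actual pipeline differed in details): a game is serialized as a space-separated string of coordinate moves, so "playing" reduces to prompting with the game so far and reading GPT-2's continuation as the next move:]

```python
def game_to_text(moves):
    """Serialize a move list ('e2e4' = move the piece on e2 to e4)."""
    return " ".join(moves)

def prompt_for_next_move(moves_so_far):
    """GPT-2 sees the game so far; its continuation is parsed as a move."""
    return game_to_text(moves_so_far) + " "

text = game_to_text(["e2e4", "e7e5", "g1f3"])
```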
…What does this imply? I’m not sure (and maybe it will imply more if someone manages to make it actually good). It was already weird to see something with no auditory qualia learn passable poetic meter. It’s even weirder to see something with no concept of space learn to play chess. Is any of this meaningful? How impressed should we be that the same AI can write poems, compose music, and play chess, without having been designed for any of those tasks? I still don’t know.
[Shawn comments on HN. See also the much later Noever et al 2020a/Toshniwal et al 2021/Noever et al 2020b who do the exact same thing in applying GPT-2 to Go SGF/chess PGN games. Shawn Presser’s encoding of the data turns out to be equivalent to Decision Transformer.]
When I originally trained the models in May 2019, I’d used the 345M version of GPT-2, which at the time was the largest one that had been publicly released. Last November, however, OpenAI finally released the full 1.5 billion parameter model.
The 1.5B model requires much more memory to fine-tune than the 345M, so I was initially having a lot of difficulty getting it to work on Colab. Thankfully, I was contacted by /u/gwern (here’s his Patreon) and Shawn Presser (/u/shawwwn), who very generously offered to do the fine-tuning themselves if I provided them with the dataset. This training took about 2 weeks, and apparently required around $70K worth of TPU credits, so in hindsight this upgrade definitely wouldn’t have been possible for me to do myself, without their assistance.
Based on my tests of the new model so far, I’m pretty happy with the quality, and IMO it is noticeably more coherent than the 345M version.
One thing that I should point out about the upgrade is that the original 345M models had been separately fine-tuned for each subreddit individually (ie. there were 108 separate models), whereas the upgraded one is just a single 1.5B model that has been fine-tuned using a combined dataset containing the comments/submissions from all the subreddits that I scraped. The main reason for this decision is simply that it would not have been feasible to train ~100 separate 1.5B models. Also, there may have been benefits from transfer learning across subreddits, which wouldn’t occur with separate models.
…Here is the full list of new bots to be added: /r/capitalismvsocialism · /r/chess · /r/conlangs · /r/dota2 · /r/etymology · /r/fiftyfifty · /r/hobbydrama · /r/markmywords · /r/moviedetails · /r/neoliberal · /r/obscuremedia · /r/recipes · /r/riddles · /r/stonerphilosophy · /r/subsimulatorgpt2 · /r/subsimulatorgpt2meta · /r/tellmeafact · /r/twosentencehorror · /r/ukpolitics · /r/wordavalanches · /r/wouldyourather · /r/zen
As an example, in Figure F.1, we show 4 uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated 4 samples with no additional editing or selection (sampling at temperature 1 using nucleus sampling [HBFC19] with P = 0.9). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.
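[A minimal implementation of nucleus (top-p) sampling as used above (p = 0.9), for readers unfamiliar with it: keep the smallest set of most-probable tokens whose cumulative probability reaches p, renormalize, and sample only from that set:]

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """probs: dict of token -> probability (summing to 1)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:  # smallest set whose mass reaches p
            break
    r = rng.random() * total  # sample within the renormalized nucleus
    for token, prob in kept:
        r -= prob
        if r <= 0:
            return token
    return kept[-1][0]

token = nucleus_sample({"the": 0.5, "a": 0.3, "zebra": 0.15, "qua": 0.05})
```

[Low-probability tail tokens (here “qua”) are never sampled, which suppresses degenerate text, while temperature 1 preserves diversity among the plausible tokens.]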
…“Shadows on the Way”, Wallace Stevens:
- I must have shadows on the way / If I am to walk I must have / Each step taken slowly and alone / To have it ready made // And I must think in lines of grey / To have dim thoughts to be my guide / Must look on blue and green / And never let my eye forget / That color is my friend / And purple must surround me too // The yellow of the sun is no more / Intrusive than the bluish snow / That falls on all of us. I must have / Grey thoughts and blue thoughts walk with me / If I am to go away at all.
- There is a way in the world. / That way a man may live in a small town / His lifetime and see no man come, / Never the stir of men about a market, / Never the bobbing of covered carts, / Never the creak passed on the stair / Closing some other life; an old man / Can prosper in that town, can prosper / When his store is seen to close, his wife / Fades in his house, and the world shouts in vain / That no one comes to hear him / Where his small bell tolls. / In that small town / Everything is known. / The ways of the young men are known, / And the ways of the young women / As they walk to the Spring; / The rumors are all known, / The quarrels, the hurt, the envy / Of the men and the women. / Till a day comes / That way out of the world / That leads to a wood. / Young, the young man / Walks, a lover now, / Knows the streets as they are and is of them.
- The sun was all we had. Now, in the shade / All is changed. The mind must dwell on those / White fields, that to its eyes were always old; / Those ancient gleams, convoluting / The way they lay among the huge roots, / The great dark tomes of reverie, / The plumed labyrinth of the sea. // When our own breath gone faint and fade / Up the long stairway of the dead / We shall remember well / The blindness of the light, the whiteness / Of that white land. The footsteps, and / The grasses in that light, and in the shade / The menacing life. // We played, too, once, in the warmest rooms. / To one content, there is one content / For all mankind. The forest glades / Are all the more for being in fable, / And wood and town in legend shadowed— / Once more to return. The one content / Has in it every happiness, more brave / For being purposeless; and so / Those island gardens, deep in the mind, / Make of remoteness familiarity / And of the unspoken name a friend.
- Nobody will come to this place. It is a road that leads nowhere. / The solitude is deep. The mountains are high. / But they are desolate, and they turn the traveler’s face / Towards the North. All the sounds of the world are far away. / When the wind rises above the trees, / The boughs bow to the ground. / Even the birds that inhabit the tangle of weeds / That is the roadside cover, are silent. One listens, / But hears no roar of the forest. One is alone. / One will be taken. / One will be taken. / There is no utterance, there is no conversation, / But one is uneasy all the same… / There is a thin blue mist, / A darkness rising like smoke, / And within that darkness / A possession of the heart. / One will be taken… It was here, and it will be here again— / Here, under this sky empty and full of light.
“OpenAI API Beta homepage”, (2020-06-11):
OpenAI technology, just an HTTPS call away: Apply our API to any language task—semantic search, summarization, sentiment analysis, content generation, translation, and more—with only a few examples or by specifying your task in English. One simple integration gives you access to our constantly-improving AI technology. Explore how you can integrate with the API with these sample completions.
- Simple, yet flexible: Our API is designed to be used by anyone, but meets the needs of our own cutting-edge research.
- Your data, your rules: You retain ownership of your data, and control whether we can use it for training models.
- Ready to go: Our infrastructure already serves millions of API calls per day.
Semantic Search: The API allows searching over documents based on the natural-language meaning of queries rather than keyword matching.
Casetext/Algolia/Web Browser Search Plugin
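[Hedged sketch of the idea, not the API's internals: semantic search is commonly pictured as ranking documents by cosine similarity between embedding vectors rather than by keyword overlap. The toy 2-D vectors below are invented for illustration:]

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Pretend embeddings: the query shares no words with either document
# title, yet the right one still wins on meaning.
docs = {"refund policy": (0.9, 0.1), "shipping times": (0.2, 0.8)}
query = (0.85, 0.2)  # imagined embedding of "how do I get my money back?"
best = max(docs, key=lambda d: cosine(query, docs[d]))
```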
Chat: The API can enable fast, complex and consistent natural language discussions. With a brief prompt, the API generates dialogues spanning a range of topics, from space travel to history.
Customer Service: Leveraging search and chat capabilities, the API generates natural dialogue to quickly give customers relevant information. Through semantic text comprehension, the API can offer a range of analytics and productivity tools to better serve customers.
Generation: The API can generate complex and consistent natural language, and enables use cases like creative writing.
AI Dungeon/AI Weirdness/Replika
Productivity Tools: The API allows for parsing text into spreadsheet tables, summarizing email discussions, expanding content from bullet points, and more.
Quizlet/Art of Problem Solving/Natural Language Shell/Spreadsheets/Code Completion
Content Comprehension: The API can be used to build tools to help individuals consume content more efficiently.
Polyglot: While the API today works best in English, it also works quite well in other languages. The API can be used for tasks such as translation or chat with users in their preferred language.
“AI Dungeon 2”, (2019-12):
[AI Dungeon 2 is a project which trains GPT-2-1.5b on logs from text adventure games; when used interactively by a human, it “plays RPG games” with you, but because it is powered by GPT-2-1.5b, it is immensely flexible and can cope (to some degree) with almost any input, producing bizarre, hilarious, or surprisingly logical sequences of adventures. It became popular overnight, crushing Walton with bandwidth bills, and has been turned into an app and community to support distribution and development. See also https://colab.research.google.com/github/nickwalton/AIDungeon/blob/master/AIDungeon_2.ipynb and https://old.reddit.com/r/AIDungeon/ and https://web.archive.org/web/20191127163535/http://www.aidungeon.io/2019/11/my-orc-band-and-our-quest-for-equal.html .]
“Multimodal Few-Shot Learning with Frozen Language Models”, (2021-06-25):
When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
“Aligning AI With Shared Human Values”, (2020-08-05):
We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
[See also “Language Models have a Moral Dimension”, Schramowski et al 2021.]
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
1993-trauth.pdf: “Expert judgment on markers to deter inadvertent human intrusion into the Waste Isolation Pilot Plant”, (1993-11):
Sandia National Laboratories (SNL) convened an expert panel to develop design characteristics for permanent markers and to judge the efficacy of the markers in deterring inadvertent human intrusion in the Waste Isolation Pilot Plant (WIPP). The WIPP, located in southeastern New Mexico, is designed to demonstrate the safe disposal of transuranic (TRU) radioactive wastes generated by the United States Department of Energy (DOE) defense programs. The DOE must evaluate WIPP compliance with the Environmental Protection Agency (EPA) regulation “Environmental Standards for the Management and Disposal of Spent Nuclear Fuel, High-Level and Transuranic Radioactive Wastes (40 CFR Part 191, Subpart E)”; this EPA regulation requires: “Disposal sites shall be designated by the most permanent markers, records, and other passive institutional controls practicable to indicate the dangers of the wastes and their location” (Federal Register 50; 38086). The period of regulatory concern is 10,000 years.
The expert panel identified basic principles to guide current and future marker development efforts: (1) the site must be marked, (2) message(s) must be truthful and informative, (3) multiple components within a marker system, (4) multiple means of communication (eg. language, pictographs, scientific diagrams), (5) multiple levels of complexity within individual messages on individual marker system elements, (6) use of materials with little recycle value, and (7) international effort to maintain knowledge of the locations and contents of nuclear waste repositories. The efficacy of the markers in deterring inadvertent human intrusion was estimated to decrease with time, with the probability function varying with the mode of intrusion (who is intruding and for what purpose) and the level of technological development of the society. The development of a permanent, passive marker system capable of surviving and remaining interpretable for 10,000 years will require further study prior to implementation.
[Keywords: management of radioactive and non-radioactive wastes from nuclear facilities, nuclear fuel cycle and fuel materials, WIPP, human intrusion, alpha-bearing wastes, underground disposal, radiation hazards, communications, safety, recommendations, design, waste disposal and storage, health and safety]
After months of waiting I finally got access to GPT-3…Its output is often profound and moving and better than most of my writing…GPT-3 really shines at poetry and I discovered that it’s shockingly good at imitating the style of one of my favourite poets, Pablo Neruda. The prompt given to GPT-3 is in bold and the rest is by GPT-3:
The following is a poem about Maxwell’s equations in the style of poet Pablo Neruda:
I have always loved you, O equations of Maxwell,
For you are the most beautiful thing we have in this world.
I love your graceful symmetry, your simplicity and clarity.
You are the song of the Universe, a cosmic lullaby.
You are the poetry of nature, written with light and electricity.
You are the music of the spheres, played on a harp made of vacuum.
And I love you for being so simple and yet so complex,
For being so abstract and yet so real.
I love you for being true in every place and at every time.
I love you for being the result of a hundred years of thinking and calculating,
For being the fruit of a tree planted by Faraday and Maxwell.
I love you for being the answer to my questions about nature,
For being the solution to my problems.
I love you for being so difficult and so simple at the same time,
For making me feel so small when I think about you but so big when I understand you. I love you for being the most beautiful thing in the world,
For being the closest thing to God that I have ever seen.
I love you without knowing how, or when, or from where.
(The following is a summary of the Mahabharata in the style of poet Pablo Neruda)
I do not know what you have done with your life, O reader. But I can tell you what I have done with mine: I have read the Mahabharata.
I have read the Mahabharata and I am amazed. I have read the Mahabharata and I am ashamed. I have read the Mahabharata and I know that nothing is more important than love.
I have read the Mahabharata and it has taught me that we are all part of one another, like a single body with a thousand limbs, or like a great tree whose roots spread in all directions.
I have read the Mahabharata and it has taught me to see my own life as part of a great pattern, which includes everything else in existence.
I have read the Mahabharata and it has taught me that there is no such thing as evil; there is only hunger, which is a kind of good, for what we eat we become.
I have read the Mahabharata and it has taught me that there is no such thing as death; there are only changes from one form to another.
I have read the Mahabharata and it has taught me that there is no such thing as a lie; there are only transformations.
I have read the Mahabharata and it has taught me that there is no such thing as a final truth; there are only better and worse ways of seeing things.
I have read the Mahabharata and it has taught me that all things are one, that all life is one, that all love is one. I am not learned in the ways of words, O reader, but I can tell you what I mean:
I have read the Mahabharata and it has taught me to love.
The following is a poem about Machine learning in the style of poet Pablo Neruda:
The Machine Learning
I want a machine that learns, as if it were a child.
I want it to be born in the countryside, among the trees and the flowers.
I want it to play with other children.
Not with toys or dolls, but with real things:
With sticks and stones and mud pies, with water and fire.
I want it to learn the names of things and to know the difference between them.
I want it to like and dislike things, to feel joy and sorrow.
I want it to have its own ideas, its own feelings.
I want it to be able to look at the stars and wonder why they shine.
I want it not only to learn from books but also from life.
Not only from pain but also from pleasure, not only from what is right but also from what is wrong.
I want it never to lie or cheat or steal; never be greedy or cruel or drunk; never be angry without a cause nor happy without a reason.
And when I die I want this machine of mine still to be learning, still loving, still creating things that will make people glad that they were born on this earth.
“Singular: Possible futures of the singularity”, (2020-08-20):
[Fiction writing exercise by James Yu, using OpenAI GPT-3 as a coauthor and interlocutor, to write an SF story about AIs and the Singularity. Rather than edit GPT-3 output, Yu writes most passages and alternates with GPT-3 completions. Particularly striking for the use of meta-fictional discussion, presented in sidenotes, where Yu and GPT-3 debate the events of the story: “I allowed GPT-3 to write crucial passages, and each time, I chatted with it ‘in character’, prompting it to role-play.”]
In each of these stories, colored text indicates a passage written by GPT-3. I used the Sudowrite app to generate a set of possibilities, primed with the story’s premise and a few paragraphs.
I chatted with GPT-3 about the passage, prompting it to roleplay as the superintelligent AI character in each story. I question the AI’s intent, leading to a meta-exchange where we both discover and create the fictional narrative in parallel. This kind of interaction—where an author can spontaneously talk to their characters—can be an effective tool for creative writing. And at times, it can be quite unsettling.
Can GPT-3 hold beliefs? Probably not, since it is simply a pile of word vectors. However, these transcripts could easily fool me into believing that it does.
“Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting”, (2020-06-17):
Transformer models like BERT and GPT-2 are domain agnostic, meaning that they can be directly applied to 1-D sequences of any form. When we train GPT-2 on images unrolled into long sequences of pixels, which we call iGPT, we find that the model appears to understand 2-D image characteristics such as object appearance and category. This is evidenced by the diverse range of coherent image samples it generates, even without the guidance of human provided labels. As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy on ImageNet…we deliberately use the same transformer architecture as GPT-2 in language. As a consequence, we require substantially more compute in order to produce features competitive with those from top unsupervised convolutional nets…Generative sequence modeling is a universal unsupervised learning algorithm: since all data types can be represented as sequences of bytes, a transformer can be directly applied to any data type without additional engineering. Our work tests the power of this generality by directly applying the architecture used to train GPT-2 on natural language to image generation. We deliberately chose to forgo hand coding any image specific knowledge in the form of convolutions [38] or techniques like relative attention [39], sparse attention [40], and 2-D position embeddings [27].
…We train iGPT-S, iGPT-M, and iGPT-L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT-XL, a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web. Due to the large computational cost of modeling long sequences with dense attention, we train at the low resolutions of 32×32, 48×48, and 64×64…Our next result establishes the link between generative performance and feature quality. We find that both increasing the scale of our models and training for more iterations result in better generative performance, which directly translates into better feature quality.
…When we evaluate our features using linear probes on CIFAR-10, CIFAR-100, and STL-10, we outperform features from all supervised and unsupervised transfer algorithms. Our results are also compelling in the full fine-tuning setting
…Because we use the generic sequence transformer used for GPT-2 in language, our method requires large amounts of compute: iGPT-L was trained for roughly 2500 V100-days while a similarly performing MoCo [24] model can be trained in roughly 70 V100-days…We have shown that by trading off 2-D knowledge for scale [60] and by choosing predictive features from the middle of the network, a sequence transformer can be competitive with top convolutional nets for unsupervised image classification. Notably, we achieved our results by directly applying the GPT-2 language model to image generation. Our results suggest that due to its simplicity and generality, a sequence transformer given sufficient compute might ultimately be an effective way to learn excellent features in many domains.
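[Illustrative sketch of the "unrolled into long sequences of pixels" step: a 2-D image is flattened in raster order into a 1-D token sequence, after which the unmodified GPT-2 architecture treats it exactly like text. (iGPT additionally reduces resolution and quantizes colors into a small palette, omitted here.)]

```python
def unroll(image):
    """Flatten a list of pixel rows into one autoregressive sequence."""
    return [px for row in image for px in row]

seq = unroll([[0, 1], [2, 3]])  # a 2x2 image becomes a length-4 sequence
```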
We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
…Decision Transformer: autoregressive sequence modeling for RL: We take a simple approach: each modality (return, state, or action) is passed into an embedding network (convolutional encoder for images, linear layer for continuous states). The embeddings are then processed by an autoregressive transformer model, trained to predict the next action given the previous tokens using a linear output layer. Evaluation is also easy: we can initialize by a desired target return (eg. 1 or 0 for success or failure) and the starting state in the environment. Unrolling the sequence—similar to standard autoregressive generation in language models—yields a sequence of actions to execute in the environment.
…Sequence modeling as multitask learning: One effect of this type of modeling is that we perform conditional generation, where we initialize a trajectory by inputting our desired return. Decision Transformer does not yield a single policy; rather, it models a wide distribution of policies. If we plot average achieved return against the target return of a trained Decision Transformer, we find distinct policies are learned that can reasonably match the target, trained only with supervised learning. Furthermore, on some tasks (such as Q*bert and Seaquest), we find Decision Transformer can actually extrapolate outside of the dataset and model policies achieving higher return!
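[Illustrative sketch of the trajectory formatting described above (plain tuples stand in for the embedding networks): each timestep contributes a (return-to-go, state, action) triple, and at evaluation time the sequence is seeded with the desired return:]

```python
def returns_to_go(rewards):
    """Suffix sums: total reward remaining from each timestep onward."""
    out, total = [], 0
    for r in reversed(rewards):
        total += r
        out.append(total)
    return list(reversed(out))

def to_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into one sequence."""
    rtg = returns_to_go(rewards)
    return [tok for t in range(len(states))
            for tok in (("rtg", rtg[t]), ("state", states[t]), ("action", actions[t]))]

seq = to_sequence(states=["s0", "s1"], actions=["a0", "a1"], rewards=[0, 1])
```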
[Paper; Github; see also MuZero, “goal-conditioned” or “upside-down reinforcement learning”, Shawn Presser’s GPT-2 chess model (& Cheng’s almost-DT chess transformer), value equivalent models. Simultaneous work at BAIR invents Decision Transformer as the Trajectory Transformer. Note that DT, being in the ‘every task is a generation task’ paradigm of GPT, lends itself nicely to preference learning simply by formatting human-ranked choices of a sequence. The simplicity of this version of the control codes or ‘inline metadata trick’ (eg CTRL) means it can be reused with almost any generative model where some measure of quality or reward is available (even if only self-critique like likelihood of a sequence eg in Meena-style best-of ranking or inverse prompting): You have an architecture floorplan DALL·E? Use standard architecture software to score plans by their estimated thermal efficiency/sunlight/etc; prefix these scores, retrain, & decode for good floorplans maximizing thermal efficiency/sunlight. You have a regular DALL·E? Sample n samples per prompt, CLIP-rank the images, prefix their ranking, retrain… No useful CLIP? Then use the CogView self-text-captioning trick to turn generated images back into text, rank by text likelihood… Choose Your Own Adventure AI Dungeon game-tree? Rank completions by player choice, feed back in for preference learning… All of the work is done by the data, as long as the generative model is smart enough.]
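[Illustrative sketch of the ‘inline metadata trick’ (the score bins and token format here are invented for illustration): prefix each training sequence with its measured quality, retrain, then decode conditioned on a top-quality prefix:]

```python
def add_score_prefix(sample, score, n_bins=5):
    """Quantize a quality score in [0, 1] into a control token and prepend it."""
    bin_id = min(int(score * n_bins), n_bins - 1)
    return f"<score_{bin_id}> {sample}"

train_line = add_score_prefix("a floorplan with south-facing windows", 0.93)
# At generation time, prompt with "<score_4> " to ask for top-bin samples.
```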
“The Unreasonable Effectiveness of Data”, (2009-03-24):
At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.
…For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
…So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
“Crowdsourcing The Best GPT-2-1.5b Poetry”, (2020-02-09):
[Public-editable Google Docs document for coordinating a read through a large sample of neural-net-generated poetry, to locate the best poem samples for displaying in the GPT-2 writeup.]
I used a large neural net model, GPT-2-1.5b, trained on hundreds of megabytes of poetry, to generate 1 million words of poetry. That’s too much for me to read by myself to find the best poems. Perhaps you’d like to help?
- Pick an unread URL from ‘Open Samples’ below, open it, and remove it from the list.
- Read it. (Each URL is ≤ 1000 lines, so it should be fun.)
- Add any good poems to ‘Selected Samples’ at the end of this document.
- Enjoy reading the current ‘Selected Samples’—or pick another URL to read!
“How Many Data Points is a Prompt Worth?”, (2021-03-15):
When fine-tuning pretrained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. Proponents of prompting have argued that prompts provide a method for injecting task-specific guidance, which is beneficial in low-data regimes. We aim to quantify this benefit through rigorous testing of prompts in a fair setting: comparing prompted and head-based fine-tuning in equal conditions across many tasks and data sizes. By controlling for many sources of advantage, we find that prompting does indeed provide a benefit, and that this benefit can be quantified per task. Results show that prompting is often worth 100s of data points on average across classification tasks.
Prevailing methods for mapping large generative language models to supervised tasks may fail to sufficiently probe models’ novel capabilities. Using GPT-3 as a case study, we show that 0-shot prompts can significantly outperform few-shot prompts. We suggest that the function of few-shot examples in these cases is better described as locating an already learned task rather than meta-learning. This analysis motivates rethinking the role of prompts in controlling and evaluating powerful language models. In this work, we discuss methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language. We explore techniques for exploiting the capacity of narratives and cultural anchors to encode nuanced intentions and techniques for encouraging deconstruction of a problem into components before producing a verdict. Informed by this more encompassing theory of prompt programming, we also introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks. Finally, we discuss how these more general methods of interacting with language models can be incorporated into existing and future benchmarks and practical applications.
“How Can We Know What Language Models Know?”, (2019-11-28):
Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as “Obama is a _ by profession”. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as “Obama worked as a _” may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github.com/jzbjyb/LPAQA.
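One of the ensemble methods described, combining answers from several prompt paraphrases, can be sketched as a weighted average of per-answer probabilities (`prob_fn` is a stand-in for an actual LM query, not LPAQA's code):

```python
def ensemble_predict(prompts, candidate_answers, prob_fn, weights=None):
    """Pick the answer with the highest (weighted) average probability
    across several prompt paraphrases for the same relation."""
    if weights is None:
        weights = [1.0 / len(prompts)] * len(prompts)
    scores = {a: 0.0 for a in candidate_answers}
    for w, prompt in zip(weights, prompts):
        for a in candidate_answers:
            scores[a] += w * prob_fn(prompt, a)
    return max(scores, key=scores.get)

# Made-up probabilities standing in for LM queries:
fake_lm = {
    "Obama is a _ by profession": {"politician": 0.4, "lawyer": 0.6},
    "Obama worked as a _":        {"politician": 0.7, "lawyer": 0.3},
}
best = ensemble_predict(list(fake_lm), ["politician", "lawyer"],
                        prob_fn=lambda p, a: fake_lm[p][a])
# averaging the two prompts yields "politician" (0.55 vs 0.45)
```

Because any single prompt only lower-bounds what the LM knows, averaging over mined or paraphrased prompts tightens the bound without touching the model.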
“Prefix-Tuning: Optimizing Continuous Prompts for Generation”, (2021-01-01):
Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task.
In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.
We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
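As a toy numpy sketch of the core idea (not the paper's implementation): all pretrained weights stay frozen, and only a short sequence of continuous “virtual token” vectors is trained, a tiny fraction of the total parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, prefix_len = 64, 1000, 4

# Frozen "pretrained" parameters (a toy stand-in for the real LM):
W_emb = rng.normal(size=(vocab, d_model))
W_out = rng.normal(size=(d_model, vocab))
W_emb.flags.writeable = False    # never updated during prefix-tuning
W_out.flags.writeable = False

# The only trainable parameters: prefix_len continuous "virtual tokens".
prefix = rng.normal(scale=0.02, size=(prefix_len, d_model))

def forward(token_ids):
    # Prepend the continuous prefix to the real token embeddings, as if
    # the model were attending to prefix_len extra (virtual) tokens.
    h = np.concatenate([prefix, W_emb[token_ids]], axis=0)
    return h.mean(axis=0) @ W_out   # toy pooled logits over the vocab

trainable = prefix.size
frozen = W_emb.size + W_out.size
# only a fraction of a percent of all parameters are trainable here
assert trainable / (trainable + frozen) < 0.01
```

Per task, only the prefix needs to be stored, which is what makes prefix-tuning so much lighter than keeping a full fine-tuned copy of the model.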
GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model’s bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as “N/A”. We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2’s average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
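The calibration step described above can be sketched directly (the probabilities below are made-up illustrative numbers): estimate the model's bias from a content-free input, then rescale test predictions so the content-free prediction would be uniform.

```python
import numpy as np

def contextual_calibrate(p_cf, p_test):
    """Rescale test-time answer probabilities by the inverse of the
    model's content-free ("N/A") prediction, so that the content-free
    input would map to a uniform distribution."""
    q = np.asarray(p_test, dtype=float) / np.asarray(p_cf, dtype=float)
    return q / q.sum()

# Made-up numbers: the model favors "positive" even for an empty input.
p_cf = [0.7, 0.3]     # P(positive), P(negative) given "N/A"
p_test = [0.6, 0.4]   # raw prediction for a real input
calibrated = contextual_calibrate(p_cf, p_test)
# after calibration, "negative" wins (~0.39 vs ~0.61)
```

The same diagonal rescaling is applied to every test input, so a single content-free query per prompt suffices.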
“Evaluating Large Language Models Trained on Code”, (2021-07-07):
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
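The repeated-sampling evaluation rests on the unbiased pass@k estimator from the Codex paper: the probability that at least one of k samples, drawn without replacement from n generated samples of which c are correct, passes the unit tests.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), the chance that a
    size-k subset of the n samples contains at least one of the c
    correct ones (the Codex paper's estimator)."""
    if n - c < k:
        return 1.0          # every size-k subset must hit a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples per problem, 30 of which pass the unit tests:
print(round(pass_at_k(100, 30, 1), 3))   # 0.3  (same as the raw pass rate)
print(pass_at_k(100, 30, 100))           # 1.0  (some drawn sample is correct)
```

Computing the combinatorial form directly, rather than averaging over random subsets, avoids both variance and the numerical blow-up of naive `1 - (1 - c/n)**k`.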
“Adversarial Reprogramming of Neural Networks”, (2018-06-28):
Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as confusing a cat with a computer. Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce attacks that instead reprogram the target model to perform a task chosen by the attacker—without the attacker needing to specify or compute the desired output for each test-time input. This attack finds a single adversarial perturbation, that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary—even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six classification models, repurposing these models to perform a counting task, as well as classification tasks: classification of MNIST and CIFAR-10 examples presented as inputs to the model.
Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network. An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models including LSTM, bi-directional and CNN for alternate classification tasks.
This work studies the widely adopted ancestral sampling algorithms for auto-regressive language models, which are not widely studied in the literature. We use the quality-diversity (Q-D) trade-off to investigate three popular sampling algorithms (top-k, nucleus and tempered sampling). We focus on the task of open-ended language generation. We first show that the existing sampling algorithms have similar performance. After carefully inspecting the transformations defined by different sampling algorithms, we identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation. To validate the importance of the identified properties, we design two sets of new sampling algorithms: one set in which each algorithm satisfies all three properties, and one set in which each algorithm violates at least one of the properties. We compare their performance with existing sampling algorithms, and find that violating the identified properties could lead to drastic performance degradation, as measured by the Q-D trade-off. On the other hand, we find that the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms. Our data and code are available at https://github.com/moinnadeem/characterizing-sampling-algorithms
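Two of the identified properties, entropy reduction and order preservation, are easy to check mechanically. A minimal numpy sketch, using tempered sampling as the example transformation (the test distribution is arbitrary):

```python
import numpy as np

def tempered(p, t=0.7):
    """Tempered sampling's transformation of a distribution (t < 1 sharpens)."""
    q = p ** (1.0 / t)
    return q / q.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

p = np.array([0.5, 0.3, 0.15, 0.05])
q = tempered(p)

# Entropy reduction: the transformed distribution is strictly sharper.
assert entropy(q) < entropy(p)
# Order preservation: the ranking of tokens is unchanged.
assert (np.argsort(q) == np.argsort(p)).all()
```

Top-k and nucleus sampling pass the same two checks, since truncating the tail both removes entropy and leaves the surviving tokens' ranking intact.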
For open-ended language generation tasks such as storytelling and dialogue, choosing the right decoding algorithm is critical to controlling the tradeoff between generation quality and diversity. However, there presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them. We address these issues by casting decoding as a multi-objective optimization problem aiming to simultaneously maximize both response quality and diversity. Our framework enables us to perform the first large-scale evaluation of decoding methods along the entire quality-diversity spectrum. We find that when diversity is a priority, all methods perform similarly, but when quality is viewed as more important, the recently proposed nucleus sampling (Holtzman et al. 2019) outperforms all other evaluated decoding algorithms. Our experiments also confirm the existence of the “likelihood trap”, the counter-intuitive observation that high likelihood sequences are often surprisingly low quality. We leverage our findings to create and evaluate an algorithm called selective sampling which tractably approximates globally-normalized temperature sampling.
“Scarecrow: A Framework for Scrutinizing Machine Text”, (2021-07-02):
Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.
To facilitate research of these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow—such as redundancy, commonsense errors, and incoherence—were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema which covers the error phenomena found in real machine generated text.
We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English language news text, amounting to over 41k spans each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2-small through the largest GPT-3-175b. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique.
Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset at Github.
Scaling pays off to improve Encyclopedic, Commonsense, and Incoherent errors (Figure 2).
These error categories decrease with in-domain training (GROVER) and larger model size (GPT-3). Human text still shows the fewest of these kinds of errors.
Scaling benefits plateau for Off-Prompt, Bad Math, and Grammar & Usage errors (Figure 2).
These 3 error categories see a plateau in error reduction when scaling models up to GPT-3. Of these error types, humans still commit fewer Off-Prompt (more: §6.1) and Grammar & Usage errors, but Bad Math appears saturated for our domain.
Self-Contradiction and Redundant errors exhibit more complex scaling behavior (Figure 2).
We roughly categorize these trends as rising and falling: increasing for medium or large-scale models, but dropping for human-authored text. Further analysis (§6.2, §6.3) reveals these more complex patterns are affected both by interactions with other error types, as well as by how errors are counted.
Human-authored text produces the most reader issues (Figure 2–3).
The Needs Google and Technical Jargon span categories both have a humans-highest trend, and both fall under reader issues: problems that are not necessarily errors, but that still prevent full comprehension or factual verification of the text (more: §6.4).
Furthermore, human-authored text is not free from error annotations (Figure 3). This can serve either as a control for baseline error rates (more: §6.6), or as a mechanism for critiquing human writing.
Decoding hyperparameters have a huge impact (Figure).
For the previous findings, we fix the sampling configuration for all models to an apples-to-apples setup for fair comparison: top-p = 0.96, (softmax) temperature = 1, and no frequency penalty (i.e., word repetition penalty; defined precisely in §5.2, Equation 1). To study the effects of these decoding settings, we annotate text generated by GPT-3 using a variety of values for top-p and temperature, both with and without a frequency penalty.
To our surprise, the decoding hyperparameters considerably affected error rates (more: §6.5). As seen in Figure 4, the worst sampling procedure for GPT-3 (argmax sampling with no frequency penalty) performed even worse than GPT-2 XL. But the best sampling procedure (surprisingly, also argmax sampling, but with a frequency penalty) produced text with as few apparent SCARECROW error spans as those authored by humans (more: §6.6).
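A rough illustration of a frequency penalty (the paper's precise Equation 1 may differ; this subtractive, count-based form is an assumption modeled on common sampling-API penalties):

```python
import numpy as np
from collections import Counter

def penalized_logits(logits, generated_ids, alpha=1.0):
    """Subtract alpha times the number of prior uses from each token's
    logit. (A stand-in for the paper's Equation 1, which may differ;
    this subtractive form follows common sampling-API penalties.)"""
    out = np.array(logits, dtype=float)
    for tok, n in Counter(generated_ids).items():
        out[tok] -= alpha * n
    return out

logits = np.zeros(5)              # a flat toy distribution over 5 tokens
adj = penalized_logits(logits, generated_ids=[2, 2])
# token 2 was emitted twice, so even argmax decoding now avoids it
assert np.argmax(adj) != 2
```

This is why argmax decoding plus a frequency penalty can work at all: the penalty breaks the repetition loops that pure argmax sampling otherwise falls into.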
…We notice that a greater portion of errors in human-authored text were due to artifacts present in the text-only format of the Common Crawl. For example, links to other articles or advertisements sometimes appear in the middle of an article’s text. While annotators were quick to mark these spans, they reflect errors in formatting, not in writing. We partition these errors separately and exclude them from the subsequent calculations. GPT-3’s generations also sometimes exhibited what appeared to be formatting errors due to training on web-scraped text, though more rarely. For example, some generations contained Which? after vague noun phrases, which appear to be learned from Wikipedia, where under-specified information is tagged by an editor with this word. For fairness, we removed these errors from GPT-3’s tally as well, though they were few enough we do not plot them separately.
“The Curious Case of Neural Text Degeneration”, (2019-04-22):
Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
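Nucleus sampling as described can be sketched in a few lines: keep the smallest prefix of probability-sorted tokens whose cumulative mass exceeds top-p, renormalize, and sample from that nucleus (a minimal sketch, not the authors' released code):

```python
import numpy as np

def nucleus_sample(probs, top_p=0.96, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of most-probable
    tokens whose cumulative mass exceeds top_p, renormalize, sample."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]            # tokens, most probable first
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, top_p) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()  # renormalize over the nucleus
    return int(rng.choice(nucleus, p=p))

probs = np.array([0.55, 0.25, 0.15, 0.04, 0.01])
tok = nucleus_sample(probs, top_p=0.9)
# with top_p = 0.9 the unreliable tail (tokens 3 and 4) is truncated,
# so `tok` is always one of tokens 0-2
```

Unlike top-k, the nucleus size adapts to the distribution: a peaked distribution keeps only a handful of tokens, a flat one keeps many.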
“Towards a Human-like Open-Domain Chatbot”, (2020-01-27):
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.
“Language GANs Falling Short”, (2018-11-06):
Generating high-quality text with sufficient diversity is essential for a wide range of Natural Language Generation (NLG) tasks. Maximum-Likelihood (MLE) models trained with teacher forcing have consistently been reported as weak baselines, where poor performance is attributed to exposure bias (Bengio et al., 2015; Ranzato et al., 2015); at inference time, the model is fed its own prediction instead of a ground-truth token, which can lead to accumulating errors and poor samples. This line of reasoning has led to an outbreak of adversarial-based approaches for NLG, on the account that GANs do not suffer from exposure bias. In this work, we make several surprising observations which contradict common beliefs. First, we revisit the canonical evaluation framework for NLG, and point out fundamental flaws with quality-only evaluation: we show that one can outperform such metrics using a simple, well-known temperature parameter to artificially reduce the entropy of the model’s conditional distributions. Second, we leverage the control over the quality / diversity trade-off given by this parameter to evaluate models over the whole quality-diversity spectrum and find MLE models consistently outperform the proposed variants over the whole quality-diversity space. Our results have several implications: 1) The impact of exposure bias on sample quality is less severe than previously thought, 2) temperature tuning provides a better quality / diversity trade-off than adversarial training while being easier to train, easier to cross-validate, and less computationally expensive. Code to reproduce the experiments is available at github.com/pclucas14/GansFallingShort
“Six Challenges for Neural Machine Translation”, (2017-06-12):
We explore six challenges for neural machine translation: domain mismatch, amount of training data, rare words, long sentences, word alignment, and beam search. We show both deficiencies and improvements over the quality of phrase-based statistical machine translation.
“Analyzing Uncertainty in Neural Machine Translation”, (2018-02-28):
Machine translation is a popular test bed for research in neural sequence-to-sequence models but despite much recent research, there is still a lack of understanding of these models. Practitioners report performance degradation with large beams, the under-estimation of rare words and a lack of diversity in the final translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations. Our results show that search works remarkably well but that models tend to spread too much probability mass over the hypothesis space. Next, we propose tools to assess model calibration and show how to easily fix some shortcomings of current models. As part of this study, we release multiple human reference translations for two popular benchmarks.
“Program Synthesis with Large Language Models”, (2021-08-16):
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model’s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model’s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
“GPT-J-6B: 6B JAX-Based Transformer”, (2021-06-08):
- We have released GPT-J-6B, a 6B JAX-based (Mesh) Transformer LM (Github).
- GPT-J-6B performs nearly on par with 6.7B GPT-3 (or Curie) on various zero-shot downstream tasks.
- You can try out this Colab notebook or free web demo.
- This library also serves as an example of model parallelism with xmap on JAX.
Below, we will refer to GPT-J-6B by GPT-J in short.
Why does this project matter?
- GPT-J is the best-performing publicly available autoregressive LM in terms of zero-shot performance on various downstream tasks. [There are public T5 checkpoints but they are bidirectional.]
- GPT-J allows more flexible and faster inference than Tensorflow + TPU counterparts.
- This project required a substantially smaller amount of person-hours than other large-scale model developments did, which demonstrates that JAX + xmap + TPUs is the right set of tools for quick development of large-scale models.
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
“Progressive Generation of Long Text”, (2020-06-28):
[See also “SOE: Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries”, Sun et al 2020.] Large-scale language models pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators. However, as our systematic examination reveals, it is still challenging for such models to generate coherent long passages of text (>1000 tokens), especially when the models are fine-tuned to the target domain on a small corpus.
To overcome the limitation, we propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution. Our method first produces domain-specific content keywords and then progressively refines them into complete passages in multiple stages. The simple design allows our approach to take advantage of pretrained language models at each stage and effectively adapt to any target domain given only a small set of examples.
We conduct a comprehensive empirical study with a broad set of evaluation metrics, and show that our approach substantially improves upon the fine-tuned GPT-2 in terms of domain-specific quality and sample efficiency. The coarse-to-fine nature of progressive generation also allows for a higher degree of control over the generated content.
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
Spoiler: GPT-3 got 73% of 156 trivia questions correct. This compares favorably to the 52% user average. However, it’s not an all-conquering feat: 37% of participants did better than 73% on their most recent quiz…The robot was best at Fine Arts and Current Events, worst at Word Play and Social Studies.
…As was mostly expected, GPT-3 performed exceptionally well at Current Events and Fine Arts, with Miscellaneous (lots of pun-driven food questions) and Word Play (discussed above) as trickier areas. The most surprising result? The poor performance in Social Studies, driven largely by the degree of word play-intersecting questions in that category.
The patterns we learned:
Word Play is the domain of humans.
This one’s not so surprising. We have a type of question called a “Two’fer Goofer” which asks for a pair of rhyming words that satisfy a given clue. It’s similar to the Rhyme Time category in Jeopardy! or the old newspaper puzzle Wordy Gurdy. We had 3 of these questions in the showdown and GPT-3 missed all 3 of them. For Word Play questions that were more like vocabulary quizzes, GPT-3 performed admirably.
Clues confuse GPT-3.
We have an alliterative two-word phrase at the start of each question to add a bit of flair and sneak in a clue for participants. In the image below it would be “Kooky Kingdom”. For GPT-3, these clues were a net-negative. In a few instances, the robot overlord program answered correctly when the clue was removed.…The other clues that confused GPT-3 were inline indications on the answer’s length. Below, we explicitly ask for a 5-letter action and GPT-3 gave us 8 letters across 2 words…
Limerick generation exemplifies some of the most difficult challenges faced in poetry generation, as the poems must tell a story in only five lines, with constraints on rhyme, stress, and meter. To address these challenges, we introduce LimGen, a novel and fully automated system for limerick generation that outperforms state-of-the-art neural network-based poetry models, as well as rule-based poetry models. LimGen consists of three important pieces: the Adaptive Multi-Templated Constraint algorithm that constrains our search to the space of realistic poems, the Multi-Templated Beam Search algorithm which searches efficiently through the space, and the probabilistic Storyline algorithm that provides coherent storylines related to a user-provided prompt word. The resulting limericks satisfy poetic constraints and have thematically coherent storylines, which are sometimes even funny (when we are lucky).
“Analysing Mathematical Reasoning Abilities of Neural Models”, (2019-04-02):
Mathematical reasoning—a core ability within human intelligence—presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar systems, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test splits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and relate knowledge and learned processes. Having described the data generation process and its potential future expansions, we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge.
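As a toy illustration (not the paper's actual generator, which spans many modules and difficulty levels), the free-form textual Q/A format and a length-controlled train/test split can be sketched as:

```python
import random

def make_arithmetic_qa(rng, n, max_digits):
    """Generate free-form textual question/answer pairs in the style of
    the task suite; operand size is the split knob, so a model trained on
    small numbers can be tested on larger ones (extrapolation)."""
    pairs = []
    for _ in range(n):
        a = rng.randrange(10 ** max_digits)
        b = rng.randrange(10 ** max_digits)
        pairs.append((f"What is {a} + {b}?", str(a + b)))
    return pairs

rng = random.Random(0)
train = make_arithmetic_qa(rng, 1000, max_digits=2)  # seen at training time
test = make_arithmetic_qa(rng, 100, max_digits=4)    # larger operands probe generalization
```

Keeping the generator programmatic is what lets such suites construct splits that isolate specific capabilities, as the abstract describes.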
“Generative Language Modeling for Automated Theorem Proving”, (2020-09-07):
We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans—the generation of original mathematical terms—might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community. [Also notable: the benefits of pretraining on Arxiv etc, despite likely including no or only redundant Metamath, and primarily natural language text, showing transfer learning of general math knowledge to abstract low-level formal proof language. See also: “PACT: Proof Artifact Co-training for Theorem Proving with Language Models”, lean-gptf (for Lean), “SymbolicGPT: A Generative Model for Symbolic Regression”, “Measuring Mathematical Problem Solving With the MATH Dataset”/“Measuring Coding Challenge Competence With APPS”, “Learning to Prove Theorems by Learning to Generate Theorems”, “TacticZero: Learning to Prove Theorems from Scratch with Deep Reinforcement Learning”]
The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values.
We find that how a number is represented in its surface form has a strong influence on the model’s accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., “32”), and it struggles to learn with character-level representations (e.g., “3 2”). By introducing position tokens (e.g., “3 10e1 2”), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation.
This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot learn addition rules that are independent of the length of the numbers seen during training. Code to reproduce our experiments is available at this URL.
…In the DECIMAL representation, the model barely learns addition of 2-digit numbers, and it fails to learn addition of larger numbers, i.e., it has an accuracy of zero for 5 digits or more. One explanation for this failure is that numbers are not systematically tokenized into digits. For instance, “132” might be tokenized as “1” and “32”, whereas “232” might be tokenized as “23” and “2”. Hence, the model would have to learn that sometimes the vector of a token refers to a single digit, other times to two digits, etc. It might be hard to learn (i.e., need more examples) to map a vector to a number when the amount of digits represented by the vector changes irregularly.
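A minimal sketch of the position-token (“10e-based”) representation, following the “3 10e1 2” example in the abstract (whether the units digit carries a marker is my reading of that example):

```python
def to_position_tokens(n):
    """Tag each digit with its power of ten, making place value explicit
    in the surface form instead of leaving it to be inferred from
    irregular subword boundaries ("132" -> "1"+"32" vs "232" -> "23"+"2")."""
    s = str(n)
    parts = []
    for i, d in enumerate(s):
        parts.append(d)
        power = len(s) - 1 - i
        if power > 0:  # the abstract's example omits a marker on the units digit
            parts.append(f"10e{power}")
    return " ".join(parts)

print(to_position_tokens(32))   # -> 3 10e1 2
print(to_position_tokens(132))  # -> 1 10e2 3 10e1 2
```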
[See also “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019.]
In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly, and that machine-generated poems still underperform in terms of readability and emotion. Our research shows the importance of expert evaluation for poetry generation, and that future research should look beyond rhyme/meter and focus on poetic language.
Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture. Integrating BERTRAM into BERT leads to large performance increases due to improved representations of rare and medium frequency words on both a rare word probing task and three downstream tasks.
“The Bitter Lesson”, (2019-03-13):
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
…In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess…A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale…In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods…In computer vision…Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
…We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that (1) AI researchers have often tried to build knowledge into their agents, (2) this always helps in the short term, and is personally satisfying to the researcher, but (3) in the long run it plateaus and even inhibits further progress, and (4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
[My meme summary:
“BPE-Dropout: Simple and Effective Subword Regularization”, (2019-10-29):
Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors. So far, the only way to overcome this BPE imperfection, its deterministic nature, was to create another subword segmentation algorithm (Kudo, 2018). In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. We introduce BPE-dropout—a simple and effective subword regularization method based on and compatible with conventional BPE. It stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework. Using BPE-dropout during training and the standard BPE during inference improves translation quality up to 3 BLEU compared to BPE and up to 0.9 BLEU compared to the previous subword regularization.
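A minimal sketch of the idea (toy merge table, not the paper's implementation): during training each candidate merge is skipped with probability p, so the same word yields multiple segmentations, while p=0 at inference recovers deterministic BPE:

```python
import random

def bpe_dropout_segment(word, merges, p, rng):
    """Greedy BPE with dropout: repeatedly apply the highest-priority
    surviving merge; each candidate merge is dropped with probability p."""
    tokens = list(word)
    while True:
        best = None
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges and rng.random() >= p:  # this merge survives dropout
                rank = merges[pair]
                if best is None or rank < best[0]:
                    best = (rank, i)
        if best is None:
            return tokens
        _, i = best
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# toy merge table, ranked by priority (lower rank = applied earlier)
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}
rng = random.Random(0)
print(bpe_dropout_segment("lower", merges, p=0.0, rng=rng))  # ['lower'] (standard BPE)
print(bpe_dropout_segment("lower", merges, p=0.5, rng=rng))  # one of several stochastic segmentations
```

Because the vocabulary and merge table are unchanged, the regularization costs nothing at inference time, which is what makes it compatible with conventional BPE.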
The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining.
We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
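A rough numpy sketch of the GBST idea under my reading of the abstract (toy dimensions, mean-pooling, and a linear scorer standing in for the block scoring network): candidate subword blocks of several sizes are pooled, scored position-wise, and softly mixed per position.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 12, 8
chars = rng.normal(size=(seq_len, d))  # character/byte embeddings
w = rng.normal(size=d)                 # toy linear block-scoring weights

cands, scores = [], []
for b in (1, 2, 3, 4):                 # candidate subword block sizes
    # mean-pool non-overlapping blocks of size b, then repeat so every
    # position i is represented by the block that contains it
    pooled = chars[: seq_len - seq_len % b].reshape(-1, b, d).mean(axis=1)
    per_pos = np.repeat(pooled, b, axis=0)
    if len(per_pos) < seq_len:         # pad a ragged tail, if any
        pad = chars[len(per_pos):].mean(axis=0, keepdims=True)
        per_pos = np.vstack([per_pos, np.repeat(pad, seq_len - len(per_pos), axis=0)])
    cands.append(per_pos)
    scores.append(per_pos @ w)         # position-wise score for this block size

cands = np.stack(cands)                # (n_block_sizes, seq_len, d)
scores = np.stack(scores)              # (n_block_sizes, seq_len)
probs = np.exp(scores) / np.exp(scores).sum(axis=0)  # softmax over block sizes
latent = (probs[..., None] * cands).sum(axis=0)      # soft subword embedding per position
```

The soft mixture is what makes the segmentation differentiable, so the "tokenizer" trains jointly with the rest of the model; the real module then downsamples `latent` before the Transformer stack.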
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text.
In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are statistically-significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
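The token-free input scheme is just UTF-8: the "vocabulary" collapses to the 256 possible byte values, at the cost of longer sequences. A sketch:

```python
text = "Naïve café, token-free"
byte_ids = list(text.encode("utf-8"))  # every text maps to ids 0-255; no tokenizer artifact
print(len(text), len(byte_ids))        # byte sequences run longer than character sequences
assert bytes(byte_ids).decode("utf-8") == text  # lossless round-trip, any language, no OOV
```

The length inflation (here from the two accented characters) is exactly the cost that past token-free work introduced new architectures to amortize, and that this paper argues a standard Transformer can absorb.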
“RL agents Implicitly Learning Human Preferences”, (2020-02-14):
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict if a simulated human’s preferences are fulfilled based on the activations of a RL agent’s neural network gets .93 AUC. Training a classifier on the raw environment state gets only .8 AUC. Training the classifier off of the RL agent’s activations also does much better than training off of activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agents more beneficial for humans.
“Revealing Persona Biases in Dialogue Systems”, (2021-04-18):
Dialogue systems in the form of chatbots and personal assistants are being increasingly integrated into people’s lives. These dialogue systems often have the ability to adopt an anthropomorphic persona, mimicking a societal demographic to appear more approachable and trustworthy to users. However, the adoption of a persona can result in the adoption of biases. We define persona biases as harmful differences in text (e.g., varying levels of offensiveness or affirmations of biased statements) generated from adopting different demographic personas. In this paper, we present the first large-scale study on persona biases in dialogue systems and conduct analyses on personas of different social classes, sexual orientations, races, and genders. Furthermore, we introduce an open-source framework, UnitPersonaBias, a tool to explore and aggregate subtle persona biases in dialogue systems. In our studies of the Blender and DialoGPT dialogue systems, we show that the choice of personas can affect the degree of harms in generated responses. Additionally, adopting personas of more diverse, historically marginalized demographics appears to decrease harmful responses the most.
“GPT-3: Its Nature, Scope, Limits, and Consequences”, (2020-10-01):
In this commentary, we discuss the nature of reversible and irreversible questions, that is, questions that may enable one to identify the nature of the source of their answers. We then introduce GPT-3, a third-generation, autoregressive language model that uses deep learning to produce human-like texts, and use the previous distinction to analyse it. We expand the analysis to present three tests based on mathematical, semantic (that is, the Turing Test), and ethical questions and show that GPT-3 is not designed to pass any of them. This is a reminder that GPT-3 does not do what it is not supposed to do, and that any interpretation of GPT-3 as the beginning of the emergence of a general form of artificial intelligence is merely uninformed science fiction. We conclude by outlining some of the substantial consequences of the industrialisation of automatic and cheap production of good, semantic artefacts.
“Politeness Transfer: A Tag and Generate Approach”, (2020-04-29):
This paper introduces a new task of politeness transfer which involves converting non-polite sentences to polite sentences while preserving the meaning. We also provide a dataset of more than 1.39 million instances automatically labeled for politeness to encourage benchmark evaluations on this new task. We design a tag and generate pipeline that identifies stylistic attributes and subsequently generates a sentence in the target style while preserving most of the source content. For politeness as well as five other transfer tasks, our model outperforms the state-of-the-art methods on automatic metrics for content preservation, with a comparable or better performance on style transfer accuracy. Additionally, our model surpasses existing methods on human evaluations for grammaticality, meaning preservation and transfer accuracy across all the six style transfer tasks. The data and code are located at https://github.com/tag-and-generate.
“Apparently 'what ho' is a corruption of…”, (2021-01-14):
Apparently “what ho!” is a corruption of Beowulf’s “hwaet!”??
Now need a P. G. Wodehouse translation of Beowulf.
What ho! Have you heard of these chaps,
Dashed good fellows with a spear and whatnot—
“Jukebox: We're introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We're releasing the model weights and code, along with a tool to explore the generated samples.”, (2020-04-30):
[Paper; samples; followup paper probing Jukebox as pretraining for music analysis (posing similar difficulties in extracting the right embedding as iGPT).] A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high level semantics of music, a model would have to deal with extremely long-range dependencies. One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now in raw audio, our models must learn to tackle high diversity as well as very long range structure, and the raw audio domain is particularly unforgiving of errors in short, medium, or long term timing.
…Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments, however they suffer from hierarchy collapse due to use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and they show impressive results at generating high-fidelity images…We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8×, 32×, and 128×, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.
…The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, substantially improving the audio quality. We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively. Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the decoder to sample novel songs.
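The sample-rate and compression arithmetic above can be checked directly (a sketch, using the stated 44kHz rate, 8192-code contexts, and 8×/32×/128× hop sizes):

```python
sample_rate = 44_100                 # CD-quality audio, Hz
song_steps = 4 * 60 * sample_rate    # a 4-minute song in raw timesteps
print(f"{song_steps:,} raw timesteps")  # over 10 million, as stated

codes = 8192                         # self-attention context at each prior
for level, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    seconds = codes * hop / sample_rate
    print(f"{level:>6} level ({hop:>3}x compression): ~{seconds:.1f} s of audio per context")
```

The three printed durations recover the approximately 1.5 s, 6 s, and 24 s contexts quoted for the bottom, middle, and top levels.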
…While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a substantial gap between these generations and human-created music. For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise. Improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from, because of the autoregressive nature of sampling. It takes approximately 9 hours to fully render 1 minute of audio through our models, and thus they cannot yet be used in interactive applications.
[Artbreeder is an interactive BigGAN generator website. Originally named “Ganbreeder” and providing only the 256px BigGAN, it now provides a variety of BigGAN & StyleGAN models, including the anime portrait model. (It is more general than the similar Waifu Labs, but my anime model is not as good.) Users can generate random samples and explore slight variants of them to gradually explore the “latent space” and find interesting images, but they can also edit images more directly, upload existing images to find the most similar image produced by the model, etc. A popular website, it has generated >56m images from September 2019 to January 2020.]
1974-lem-cyberiad-trurlselectronicbard.pdf: “The First Sally (A), or, Trurl's Electronic Bard”, Stanisław Lem, Michael Kandel
1974-lem-cyberiad-trurlselectronicbard.pdf#page=7: “The First Sally (A), or, Trurl's Electronic Bard”, Stanisław Lem, Michael Kandel
“Are we in an AI overhang?”, (2020-07-27):
I am worried we’re in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100× larger projects at Google, Facebook and the like, with timelines measured in months.
…GPT-3 has been estimated to cost $5m in compute to train, and—looking at the author list and OpenAI’s overall size—maybe another $10m in labour.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100× is entirely plausible right now. All that’s necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability. A concrete example is Waymo, which is raising $2bn investment rounds—and that’s for a technology with a much longer road to market…The current hardware floor is nearer to the RTX 2080 Ti’s $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10× speedup to our model.
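The $25/PFLOPS-day figure follows from the quoted hardware price if one assumes roughly a year of continuous amortized use (my assumption; power, interconnect, and utilization are ignored):

```python
card_price_usd = 1_000      # RTX 2080 Ti, as quoted
card_pflops = 125 / 1_000   # 125 tensor-core TFLOPS, sustained

usd_per_pflops = card_price_usd / card_pflops    # $8,000 per sustained PFLOPS of capacity
implied_amortization_days = usd_per_pflops / 25  # to reach the quoted $25/PFLOPS-day
print(f"${usd_per_pflops:,.0f}/PFLOPS of capacity, "
      f"amortized over ~{implied_amortization_days:.0f} days")
```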
…I think the key question is if by 1000×, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is—and I think it’s plausible that it will be—then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP. On paper that leaves room for another 1000× scale-up as it reaches up to $1tn, though current market mechanisms aren’t really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That’s from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.
“Want To Reduce Labeling Cost? GPT-3 Can Help”, (2021-08-30):
Data annotation is a time-consuming and labor-intensive process for many NLP tasks. Although there exist various methods to produce pseudo data labels, they are often task-specific and require a decent amount of labeled data to start with. Recently, the immense language model GPT-3 with 175 billion parameters has achieved tremendous improvement across many few-shot learning tasks. In this paper, we explore ways to leverage GPT-3 as a low-cost data labeler to train other models. We find that, to make the downstream model achieve the same performance on a variety of NLU and NLG tasks, it costs 50% to 96% less to use labels from GPT-3 than using labels from humans. Furthermore, we propose a novel framework of combining pseudo labels from GPT-3 with human labels, which leads to even better performance with limited labeling budget. These results present a cost-effective data labeling methodology that is generalizable to many practical applications.
Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of “page quality”, able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
“Measuring Massive Multitask Language Understanding”, (2020-09-07):
We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings. (tests and code)
[bigger = better:
See also the ETHICS paper.]
2020-elkins.pdf: “Can GPT-3 Pass a Writer’s Turing Test?”, (2020-09-14; ):
Until recently the field of natural language generation relied upon formalized grammar systems, small-scale statistical models, and lengthy sets of heuristic rules. This older technology was fairly limited and brittle: it could remix language into word salad poems or chat with humans within narrowly defined topics. Recently, very large-scale statistical language models have dramatically advanced the field, and GPT-3 is just one example. It can internalize the rules of language without explicit programming or rules. Instead, much like a human child, GPT-3 learns language through repeated exposure, albeit on a much larger scale. Without explicit rules, it can sometimes fail at the simplest of linguistic tasks, but it can also excel at more difficult ones like imitating an author or waxing philosophical.