This page is a changelog for Gwern.net: a monthly reverse chronological list of recent major writings/changes/additions.
Following my writing can be a little difficult because it is often so incremental. So every month, in addition to my regular /r/Gwern subreddit submissions, I write up reasonably-interesting changes and send them to the mailing list along with a compilation of links & reviews (archives).
A subreddit for posting links of interest and also for announcing updates to gwern.net (which can be used as an RSS feed). Submissions are categorized similarly to the monthly newsletter and typically will be collated there.
Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient use of metadata and more controllable sampling of generated output by feeding in the desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.
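As a concrete illustration of the proposed format, here is a minimal preprocessing sketch (the filenames and the "author|" delimiter are illustrative, not the exact format used in the experiment): every line of every e-book is prefixed with its author before the combined corpus is handed to torch-rnn.

```python
# Minimal sketch of the inline-metadata idea (paths and the "author|" delimiter are
# illustrative): prefix every line of every e-book with its author, so the char-RNN
# can learn the association and later be steered by seeding a sample with that prefix.
import glob
import os

with open("metadata-corpus.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("gutenberg/*.txt"):               # one e-book per author
        author = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as book:
            for line in book:
                line = line.strip()
                if line:                                     # drop blank lines
                    out.write(f"{author}|{line}\n")          # e.g. "austen|It is a truth..."
```

At sampling time, seeding generation with a prefix such as "austen|" then steers the model toward that author’s style, which is the controllable-sampling benefit described above.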
Individuals with lower socio-economic status (SES) are at increased risk of physical and mental illnesses and tend to die at an earlier age [1-3]. Explanations for the association between SES and health typically focus on factors that are environmental in origin. However, common single nucleotide polymorphisms (SNPs) have been found collectively to explain around 18% (SE = 5%) of the phenotypic variance of an area-based social deprivation measure of SES. Molecular genetic studies have also shown that physical and psychiatric diseases are at least partly heritable. It is possible, therefore, that phenotypic associations between SES and health arise partly due to a shared genetic etiology. We conducted a genome-wide association study (GWAS) on social deprivation and on household income using the 112,151 participants of UK Biobank. We find that common SNPs explain 21% (SE = 0.5%) of the variation in social deprivation and 11% (SE = 0.7%) in household income. Two independent SNPs attained genome-wide significance for household income, rs187848990 on chromosome 2, and rs8100891 on chromosome 19. Genes in the regions of these SNPs have been associated with intellectual disabilities, schizophrenia, and synaptic plasticity. Extensive genetic correlations were found between both measures of socioeconomic status and illnesses, anthropometric variables, psychiatric disorders, and cognitive ability. These findings show that some SNPs associated with SES are involved in the brain and central nervous system. The genetic associations with SES are probably mediated via other partly-heritable variables, including cognitive ability, education, personality, and health.
“Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs”, Lee, S. Hong Ripke, Stephan Neale, Benjamin M. Faraone, Stephen V. Purcell, Shaun M. Perlis, Roy H. Mowry, Bryan J. Thapar, Anita Goddard, Michael E. Witte, John S. Absher, Devin Agartz, Ingrid Akil, Huda Amin, Farooq Andreassen, Ole A. Anjorin, Adebayo Anney, Richard Anttila, Verneri Arking, Dan E. Asherson, Philip Azevedo, Maria H. Backlund, Lena Badner, Judith A. Bailey, Anthony J. Banaschewski, Tobias Barchas, Jack D. Barnes, Michael R. Barrett, Thomas B. Bass, Nicholas Battaglia, Agatino Bauer, Michael Bayés, Mònica Bellivier, Frank Bergen, Sarah E. Berrettini, Wade Betancur, Catalina Bettecken, Thomas Biederman, Joseph Binder, Elisabeth B. Black, Donald W. Blackwood, Douglas H. R Bloss, Cinnamon S. Boehnke, Michael Boomsma, Dorret I. Breen, Gerome Breuer, René Bruggeman, Richard Cormican, Paul Buccola, Nancy G. Buitelaar, Jan K. Bunney, William E. Buxbaum, Joseph D. Byerley, William F. Byrne, Enda M. Caesar, Sian Cahn, Wiepke Cantor, Rita M. Casas, Miguel Chakravarti, Aravinda Chambert, Kimberly Choudhury, Khalid Cichon, Sven Cloninger, C. Robert Collier, David A. Cook, Edwin H. Coon, Hilary Cormand, Bru Corvin, Aiden Coryell, William H. Craig, David W. Craig, Ian W. Crosbie, Jennifer Cuccaro, Michael L. Curtis, David Czamara, Darina Datta, Susmita Dawson, Geraldine Day, Richard De Geus, Eco J. Degenhardt, Franziska Djurovic, Srdjan Donohoe, Gary J. Doyle, Alysa E. Duan, Jubao Dudbridge, Frank Duketis, Eftichia Ebstein, Richard P. Edenberg, Howard J. Elia, Josephine Ennis, Sean Etain, Bruno Fanous, Ayman Farmer, Anne E. Ferrier, I. Nicol Flickinger, Matthew Fombonne, Eric Foroud, Tatiana Frank, Josef Franke, Barbara Fraser, Christine Freedman, Robert Freimer, Nelson B. Freitag, Christine M. Friedl, Marion Frisén, Louise Gallagher, Louise Gejman, Pablo V. Georgieva, Lyudmila Gershon, Elliot S. Geschwind, Daniel H. Giegling, Ina Gill, Michael Gordon, Scott D. Gordon-Smith, Katherine Green, Elaine K. Greenwood, Tiffany A. Grice, Dorothy E. Gross, Magdalena Grozeva, Detelina Guan, Weihua Gurling, Hugh De Haan, Lieuwe Haines, Jonathan L. Hakonarson, Hakon Hallmayer, Joachim Hamilton, Steven P. Hamshere, Marian L. Hansen, Thomas F. Hartmann, Annette M. Hautzinger, Martin Heath, Andrew C. Henders, Anjali K. Herms, Stefan Hickie, Ian B. Hipolito, Maria Hoefels, Susanne Holmans, Peter A. Holsboer, Florian Hoogendijk, Witte J. Hottenga, Jouke-Jan Hultman, Christina M. Hus, Vanessa Ingason, Andrés Ising, Marcus Jamain, Stéphane Jones, Edward G. Jones, Ian Jones, Lisa Tzeng, Jung-Ying Kähler, Anna K. Kahn, René S. Kandaswamy, Radhika Keller, Matthew C. Kennedy, James L. Kenny, Elaine Kent, Lindsey Kim, Yunjung Kirov, George K. Klauck, Sabine M. Klei, Lambertus Knowles, James A. Kohli, Martin A. Koller, Daniel L. Konte, Bettina Korszun, Ania Krabbendam, Lydia Krasucki, Robert Kuntsi, Jonna Kwan, Phoenix Landén, Mikael Långström, Niklas Lathrop, Mark Lawrence, Jacob Lawson, William B. Leboyer, Marion Ledbetter, David H. Lee, Phil H. Lencz, Todd Lesch, Klaus-Peter Levinson, Douglas F. Lewis, Cathryn M. Li, Jun Lichtenstein, Paul Lieberman, Jeffrey A. Lin, Dan-Yu Linszen, Don H. Liu, Chunyu Lohoff, Falk W. Loo, Sandra K. Lord, Catherine Lowe, Jennifer K. Lucae, Susanne MacIntyre, Donald J. Madden, Pamela A. F Maestrini, Elena Magnusson, Patrik K. E Mahon, Pamela B. Maier, Wolfgang Malhotra, Anil K. Mane, Shrikant M. Martin, Christa L. Martin, Nicholas G. 
Mattheisen, Manuel Matthews, Keith Mattingsdal, Morten McCarroll, Steven A. McGhee, Kevin A. McGough, James J. McGrath, Patrick J. McGuffin, Peter McInnis, Melvin G. McIntosh, Andrew McKinney, Rebecca McLean, Alan W. McMahon, Francis J. McMahon, William M. McQuillin, Andrew Medeiros, Helena Medland, Sarah E. Meier, Sandra Melle, Ingrid Meng, Fan Meyer, Jobst Middeldorp, Christel M. Middleton, Lefkos Milanova, Vihra Miranda, Ana Monaco, Anthony P. Montgomery, Grant W. Moran, Jennifer L. Moreno-De-Luca, Daniel Morken, Gunnar Morris, Derek W. Morrow, Eric M. Moskvina, Valentina Muglia, Pierandrea Mühleisen, Thomas W. Muir, Walter J. Müller-Myhsok, Bertram Murtha, Michael Myers, Richard M. Myin-Germeys, Inez Neale, Michael C. Nelson, Stan F. Nievergelt, Caroline M. Nikolov, Ivan Nimgaonkar, Vishwajit Nolen, Willem A. Nöthen, Markus M. Nurnberger, John I. Nwulia, Evaristus A. Nyholt, Dale R. O'Dushlaine, Colm Oades, Robert D. Olincy, Ann Oliveira, Guiomar Olsen, Line Ophoff, Roel A. Osby, Urban Owen, Michael J. Palotie, Aarno Parr, Jeremy R. Paterson, Andrew D. Pato, Carlos N. Pato, Michele T. Penninx, Brenda W. Pergadia, Michele L. Pericak-Vance, Margaret A. Pickard, Benjamin S. Pimm, Jonathan Piven, Joseph Posthuma, Danielle Potash, James B. Poustka, Fritz Propping, Peter Puri, Vinay Quested, Digby J. Quinn, Emma M. Ramos-Quiroga, Josep Antoni Rasmussen, Henrik B. Raychaudhuri, Soumya Rehnström, Karola Reif, Andreas Ribasés, Marta Rice, John P. Rietschel, Marcella Roeder, Kathryn Roeyers, Herbert Rossin, Lizzy Rothenberger, Aribert Rouleau, Guy Ruderfer, Douglas Rujescu, Dan Sanders, Alan R. Sanders, Stephan J. Santangelo, Susan L. Sergeant, Joseph A. Schachar, Russell Schalling, Martin Schatzberg, Alan F. Scheftner, William A. Schellenberg, Gerard D. Scherer, Stephen W. Schork, Nicholas J. Schulze, Thomas G. Schumacher, Johannes Schwarz, Markus Scolnick, Edward Scott, Laura J. Shi, Jianxin Shilling, Paul D. Shyn, Stanley I. Silverman, Jeremy M. Slager, Susan L. Smalley, Susan L. Smit, Johannes H. Smith, Erin N. Sonuga-Barke, Edmund J. S St Clair, David State, Matthew Steffens, Michael Steinhausen, Hans-Christoph Strauss, John S. Strohmaier, Jana Stroup, T. Scott Sutcliffe, James S. Szatmari, Peter Szelinger, Szabocls Thirumalai, Srinivasa Thompson, Robert C. Todorov, Alexandre A. Tozzi, Federica Treutlein, Jens Uhr, Manfred van den Oord, Edwin J. C G. Van Grootheest, Gerard Van Os, Jim Vicente, Astrid M. Vieland, Veronica J. Vincent, John B. Visscher, Peter M. Walsh, Christopher A. Wassink, Thomas H. Watson, Stanley J. Weissman, Myrna M. Werge, Thomas Wienker, Thomas F. Wijsman, Ellen M. Willemsen, Gonneke Williams, Nigel Willsey, A. Jeremy Witt, Stephanie H. Xu, Wei Young, Allan H. Yu, Timothy W. Zammit, Stanley Zandi, Peter P. Zhang, Peng Zitman, Frans G. Zöllner, Sebastian Devlin, Bernie Kelsoe, John R. Sklar, Pamela Daly, Mark J. O'Donovan, Michael C. Craddock, Nicholas Sullivan, Patrick F. Smoller, Jordan W. Kendler, Kenneth S. Wray, Naomi R (2013):
Most psychiatric disorders are moderately to highly heritable. The degree to which genetic variation is unique to individual disorders or shared across disorders is unclear. To examine shared genetic etiology, we use genome-wide genotype data from the Psychiatric Genomics Consortium (PGC) for cases and controls in schizophrenia, bipolar disorder, major depressive disorder, autism spectrum disorders (ASD) and attention-deficit/hyperactivity disorder (ADHD). We apply univariate and bivariate methods for the estimation of genetic variation within and covariation between disorders. SNPs explained 17-29% of the variance in liability. The genetic correlation calculated using common SNPs was high between schizophrenia and bipolar disorder (0.68 ± 0.04 s.e.), moderate between schizophrenia and major depressive disorder (0.43 ± 0.06 s.e.), bipolar disorder and major depressive disorder (0.47 ± 0.06 s.e.), and ADHD and major depressive disorder (0.32 ± 0.07 s.e.), low between schizophrenia and ASD (0.16 ± 0.06 s.e.) and non-significant for other pairs of disorders as well as between psychiatric disorders and the negative control of Crohn's disease. This empirical evidence of shared genetic etiology for psychiatric disorders can inform nosology and encourages the investigation of common pathophysiologies for related disorders.
We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing data. We demonstrate using simulations based on whole-genome sequencing data that ~97% and ~68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ~17 million imputed variants explain 56% (standard error (s.e.) = 2.3%) of variance for height and 27% (s.e. = 2.5%) of variance for body mass index (BMI), and we find evidence that height-associated and BMI-associated variants have been under natural selection. Considering the imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60–70% for height and 30–40% for BMI. Therefore, the missing heritability is small for both traits. For further discovery of genes associated with complex traits, a study design with SNP arrays followed by imputation is more cost-effective than whole-genome sequencing at current prices.
Higher paternal age at offspring conception increases de novo genetic mutations (Kong et al., 2012). Based on evolutionary genetic theory we predicted that the offspring of older fathers would be less likely to survive and reproduce, i.e. have lower fitness. In a sibling control study, we find clear support for negative paternal age effects on offspring survival, mating and reproductive success across four large populations with an aggregate n > 1.3 million in main analyses. Compared to a sibling born when the father was 10 years younger, individuals had 4-13% fewer surviving children in the four populations. Three populations were pre-industrial (1670-1850) Western populations and showed a pattern of paternal age effects across the offspring’s lifespan. In 20th-century Sweden, we found no negative paternal age effects on child survival or marriage odds. Effects survived tests for competing explanations, including maternal age and parental loss. To the extent that we succeeded in isolating a mutation-driven effect of paternal age, our results can be understood to show that de novo mutations reduce offspring fitness across populations and time. We can use this understanding to predict the effect of increasingly delayed reproduction on offspring genetic load, mortality and fertility.
Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. One such architecture, the dynamic memory network (DMN), obtained high accuracy on a variety of language tasks. However, it was not shown whether the architecture achieves strong results for question answering when supporting facts are not marked during training or whether it could be applied to other modalities such as images. Based on an analysis of the DMN, we propose several improvements to its memory and input modules. Together with these changes we introduce a novel input module for images in order to be able to answer visual questions. Our new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.
We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
Gatys et al. (2015) showed that optimizing pixels to match features in a convolutional network with respect to reference image features is a way to render images of high visual quality. We show that unrolling this gradient-based optimization yields a recurrent computation that creates images by incrementally adding onto a visual "canvas". We propose a recurrent generative model inspired by this view, and show that it can be trained using adversarial training to generate very good image samples. We also propose a way to quantitatively compare adversarial networks by having the generators and discriminators of these networks compete against each other.
Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.
Context: There is substantial debate about whether the results of nonrandomized studies are consistent with the results of randomized controlled trials on the same topic.
Objectives: To compare results of randomized and nonrandomized studies that evaluated medical interventions and to examine characteristics that may explain discrepancies between randomized and nonrandomized studies.
Data Sources: MEDLINE (1966–March 2000), the Cochrane Library (Issue 3, 2000), and major journals were searched.
Study Selection: Forty-five diverse topics were identified for which both randomized trials (n = 240) and nonrandomized studies (n = 168) had been performed and had been considered in meta-analyses of binary outcomes.
Data Extraction: Data on events per patient in each study arm and design and characteristics of each study considered in each meta-analysis were extracted and synthesized separately for randomized and nonrandomized studies.
Data Synthesis: Very good correlation was observed between the summary odds ratios of randomized and nonrandomized studies (r = 0.75; p < 0.001); however, nonrandomized studies tended to show larger treatment effects (28 vs 11; p = 0.009). Between-study heterogeneity was frequent among randomized trials alone (23%) and very frequent among nonrandomized studies alone (41%). The summary results of the 2 types of designs differed beyond chance in 7 cases (16%). Discrepancies beyond chance were less common when only prospective studies were considered (8%). Occasional differences in sample size and timing of publication were also noted between discrepant randomized and nonrandomized studies. In 28 cases (62%), the natural logarithm of the odds ratio differed by at least 50%, and in 15 cases (33%), the odds ratio varied at least 2-fold between nonrandomized studies and randomized trials.
Conclusions: Despite good correlation between randomized trials and nonrandomized studies—in particular, prospective studies—discrepancies beyond chance do occur and differences in estimated magnitude of treatment effect are very common.
Design principles and operational modes are explored that underlie the information processing capacity of the human brain. The hypothesis is put forward that in higher organisms, especially in primates, the complexity of the neural circuitry of the cerebral cortex is the neural correlate of the brain’s coherence and predictive power, and, thus, a measure of intelligence. It will be argued that with the evolution of the human brain we have nearly reached the limits of biological intelligence. [Keywords: Biological intelligence, Cognition, Consciousness, Cerebral cortex, Primates, Information processing, Neural networks, Cortical design, Human brain evolution]
Personality psychology aims to explain the causes and the consequences of variation in behavioural traits. Because of the observational nature of the pertinent data, this endeavour has provoked many controversies. In recent years, the computer scientist Judea Pearl has used a graphical approach to extend the innovations in causal inference developed by Ronald Fisher and Sewall Wright. Besides shedding much light on the philosophical notion of causality itself, this graphical framework now contains many powerful concepts of relevance to the controversies just mentioned. In this article, some of these concepts are applied to areas of personality research where questions of causation arise, including the analysis of observational data and the genetic sources of individual differences.
The Pirahã are an indigenous people of the Amazon Rainforest in Brazil. They are the sole surviving subgroup of the Mura people, and are hunter-gatherers. They live mainly on the banks of the Maici River in Humaitá and Manicoré in the state of Amazonas. As of 2018, they number 800 individuals. The Pirahã people do not call themselves Pirahã but instead the Hi'aiti'ihi, roughly translated as "the straight ones."
The Tale of the Princess Kaguya is a 2013 Japanese animated fantasy drama film co-written and directed by Isao Takahata, based on The Tale of the Bamboo Cutter, a 10th-century Japanese literary tale. It was produced by Studio Ghibli for Nippon Television Network, Dentsu, Hakuhodo DYMP, Walt Disney Japan, Mitsubishi, Toho and KDDI, and distributed by Toho.
Subscription page for the monthly gwern.net newsletter. There are monthly updates, which will include summaries of projects I’ve worked on that month (the same as the changelog), collations of links or discussions from my subreddit, and book/movie reviews. You can also browse the archives since December 2013.
Newsletter tag: archive of all issues back to 2013 for the gwern.net newsletter (monthly updates, which will include summaries of projects I’ve worked on that month (the same as the changelog), collations of links or discussions from my subreddit, and book/movie reviews).
In February 2019, following up on my 2015–2016 text-generation experiments with char-RNNs, I experiment with the cutting-edge Transformer NN architecture for language modeling & text generation. Using OpenAI’s GPT-2-117M model pre-trained on a large Internet corpus and nshepperd’s finetuning code, I retrain GPT-2-117M on a large (117MB) Project Gutenberg poetry corpus. I demonstrate how to train 2 variants: “GPT-2-poetry”, trained on the poems as a continuous stream of text, and “GPT-2-poetry-prefix”, with each line prefixed with the metadata of the PG book it came from. In May 2019, I trained the next-largest GPT-2, GPT-2-345M, similarly, for a further quality boost in generated poems. In October 2019, I retrained GPT-2-117M on a Project Gutenberg corpus with improved formatting, and combined it with a contemporary poem dataset based on the Poetry Foundation’s website. With just a few GPU-days on 1080ti GPUs, GPT-2-117M finetuning can produce high-quality poetry which is more thematically consistent than my char-RNN poems, capable of modeling subtle features like rhyming, and sometimes even a pleasure to read. I list the many possible ways to improve poem generation and further approach human-level poems. For the highest-quality AI poetry to date, see my followup page, “GPT-3 Creative Writing”.
I continue my AI poetry generation experiments with OpenAI’s 2020 GPT-3, which is 116× larger, and much more powerful, than the 2019 GPT-2. GPT-3, however, is not merely a quantitative tweak yielding “GPT-2 but better”—it is qualitatively different, exhibiting eerie runtime learning capabilities allowing even the raw model, with zero finetuning, to “meta-learn” many textual tasks purely by example or instruction. One does not train or program GPT-3 in a normal way, but one engages in dialogue and writes prompts to teach GPT-3 what one wants.
Experimenting through the OpenAI Beta API in June 2020, I find that GPT-3 does not just match my finetuned GPT-2-1.5b-poetry for poem-writing quality, but exceeds it, while being versatile in handling poetry, Tom Swifty puns, science fiction, dialogue like Turing’s Turing-test dialogue, literary style parodies… As the pièce de résistance, I recreate Stanislaw Lem’s Cyberiad’s “Trurl’s Electronic Bard” poetry using GPT-3. (Along the way, I document instances of how the BPE text encoding unnecessarily damages GPT-3’s performance on a variety of tasks, how to best elicit the highest-quality responses, common errors people make in using GPT-3, and test out GPT-3’s improvements in NN weak points like logic or commonsense knowledge.)
GPT-3’s samples are not just close to human level: they are creative, witty, deep, meta, and often beautiful. They demonstrate an ability to handle abstractions, like style parodies, I have not seen in GPT-2 at all. Chatting with GPT-3 feels uncannily like chatting with a human. I was impressed by the results reported in the GPT-3 paper, and after spending a week trying it out, I remain impressed.
This page records GPT-3 samples I generated in my explorations, and thoughts on how to use GPT-3 and its remaining weaknesses. I hope you enjoy them even a tenth as much as I enjoyed testing GPT-3 and watching the completions scroll across my screen.
The Poetry Foundation is a Chicago-based American foundation created to promote poetry in the wider culture. It was formed from Poetry magazine, which it continues to publish, with a 2003 gift of $200 million from philanthropist Ruth Lilly.
In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a dataset from ‘The Session’. A GPT-2 hypothetically can improve on an RNN by better global coherence & copying of patterns, without problems with the hidden-state bottleneck.
I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after fixing that, I successfully trained it on n = 205,304 ABC music pieces taken from The Session & ABCnotation.com. The resulting music samples are in my opinion quite pleasant. (A similar model was later retrained by Geerlings & Meroño-Peñuela 2020.)
We followed the ABC folk model with an ABC-MIDI model: a dataset of 453k ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples, but when sampling succeeds, it can generate even better musical samples.
Standard language generation neural network models, like GPT-2, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.
A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO.
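To make the reward-model half of this setup concrete, a toy sketch of the pairwise (Bradley-Terry-style) objective is given below; this is a minimal illustration under my own assumptions, not OpenAI’s implementation, and `reward_model` stands in for a GPT-2 with a scalar output head.

```python
# Toy sketch of the reward-model objective in preference learning: a Bradley-Terry
# pairwise loss that pushes the human-preferred sample's score above the rejected
# sample's score. `reward_model` is a hypothetical GPT-2 with a scalar output head.
import torch.nn.functional as F

def preference_loss(reward_model, preferred_tokens, rejected_tokens):
    r_preferred = reward_model(preferred_tokens)   # scalar reward per sample
    r_rejected = reward_model(rejected_tokens)     # scalar reward per sample
    # -log sigmoid(r_preferred - r_rejected): minimized when preferred samples score higher
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

The trained reward model then serves as the ‘environment’ reward for the PPO agent, as described above.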
OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation, I experimented with GPT-2 preference learning for unconditional music and poetry generation.
I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≈ 7,400 ratings compiled over 23 iterations of training+sampling from November 2019 to January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n ≈ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or ~44%).
This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the first, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.
Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output.
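A hedged sketch of the second proposal (dropping the PPO agent and doing plain gradient ascent against the reward model) might look like the following; `reward_model_from_embeddings` is a hypothetical reward model that accepts soft embedding inputs rather than discrete token ids, and all hyperparameters are placeholders.

```python
# Sketch of optimizing a text sample directly against the reward model: gradient
# ascent on a relaxed (soft) token sequence, instead of training a separate agent.
# `reward_model_from_embeddings` is hypothetical; it maps per-position embeddings
# to a scalar reward.
import torch

def optimize_sample(reward_model_from_embeddings, embedding_matrix,
                    seq_len=64, steps=200, lr=0.1):
    vocab_size = embedding_matrix.shape[0]
    # Free parameters: unnormalized logits over the vocabulary at each position.
    logits = torch.zeros(seq_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = torch.softmax(logits, dim=-1)        # soft one-hot per position
        embeds = probs @ embedding_matrix            # expected embedding per position
        reward = reward_model_from_embeddings(embeds)
        (-reward).backward()                         # ascend the reward
        opt.step()
        opt.zero_grad()
    return logits.argmax(dim=-1)                     # discretize to token ids
```

The point of the sketch is the shape of the proposal, not the details: the reward model’s gradients do the work directly, with no blackbox RL in the loop.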
…Black is GPT-2. Its excuse [for this chess blunder] is that it’s a text prediction program with no concept of chess. As far as it knows, it’s trying to predict short alphanumeric strings like “e2e4” or “Nb7”. Nobody told it this represents a board game. It doesn’t even have a concept of 2D space that it could use to understand such a claim. But it still captured my rook! Embarrassing!…Last month, I asked him if he thought GPT-2 could play chess. I wondered if he could train it on a corpus of chess games written in standard notation (where, for example, e2e4 means “move the pawn at square e2 to square e4”). There are literally millions of games written up like this. GPT-2 would learn to predict the next string of text, which would correspond to the next move in the chess game. Then you would prompt it with a chessboard up to a certain point, and it would predict how the chess masters who had produced its training data would continue the game – ie make its next move using the same heuristics they would. Gwern handed the idea to his collaborator Shawn Presser, who had a working GPT-2 chess engine running within a week:…You can play against GPT-2 yourself by following the directions in the last tweet, though it won’t be much of a challenge for anyone better than I am.
…What does this imply? I’m not sure (and maybe it will imply more if someone manages to make it actually good). It was already weird to see something with no auditory qualia learn passable poetic meter. It’s even weirder to see something with no concept of space learn to play chess. Is any of this meaningful? How impressed should we be that the same AI can write poems, compose music, and play chess, without having been designed for any of those tasks? I still don’t know.
“Language Models are Few-Shot Learners”, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020-05-28):
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.
Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Compared to GPT-2, GPT-3 improves less than expected on character-level tasks like rhyming, alliteration, punning, anagrams or permutations, acrostic poems, and arithmetic, despite being very good at many other closely-related kinds of writing like satire.
Why? A plausible explanation is an obscure technical detail: as a performance optimization, GPT does not see characters but sub-word-chunks called “byte-pair encodings” (BPEs). Because GPTs never see characters but opaque partial-words, which vary chaotically based on the specific word and even the surrounding context, they are unable to easily learn about character-level aspects of language, like similar spellings or sounds, and are forced to learn relationships much more indirectly, like by brute-force memorizing of pairs of words.
Some experiments with reformatting GPT-3’s poorest-performing tasks to avoid inconsistent BPE encodings of strings show small to large performance gains, consistent with this theory.
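One can see the inconsistency directly with the GPT-2 BPE vocabulary (shown here via the Hugging Face tokenizer; the exact splits depend on the vocabulary, but the point is that trivially-related spellings map to unrelated token sequences):

```python
# Why character-level tasks are hard under BPE: the same word is split into
# different opaque chunks depending on capitalization and surrounding whitespace,
# so the model never reliably sees the individual letters it shares with rhymes
# or anagrams.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
for s in ["rhyme", " rhyme", "Rhyme", "RHYME"]:
    print(repr(s), "->", tok.tokenize(s))
# The four variants tokenize into different sub-word pieces, none of which expose
# the shared letter sequence r-h-y-m-e.
```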
The GPT-3 neural network is so large a model in terms of power and dataset that it exhibits qualitatively different behavior: you do not apply it to a fixed set of tasks which were in the training dataset, requiring retraining on additional data if one wants to handle a new task (as one would have to retrain GPT-2); instead, you interact with it, expressing any task in terms of natural language descriptions, requests, and examples, tweaking the prompt until it “understands” & it meta-learns the new task based on the high-level abstractions it learned from the pretraining.
This is a rather different way of using a DL model, and it’s better to think of it as a new kind of programming, where the prompt is now a “program” which programs GPT-3 to do new things.
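A minimal sketch of this “prompt as program” style, using the 2020-era `openai` Python client and the Completion endpoint (the engine name, sampling parameters, and the translation task are all illustrative):

```python
# Few-shot "prompt programming": a couple of English-to-French examples teach the
# task, and the completion is the model's answer for the final line. Assumes the
# 2020-era `openai` package and the "davinci" engine; adjust for current APIs.
import openai

openai.api_key = "sk-..."  # your API key

prompt = """English: The cat sleeps on the mat.
French: Le chat dort sur le tapis.

English: I would like a cup of tea.
French: Je voudrais une tasse de thé.

English: Where is the train station?
French:"""

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=30,
    temperature=0.3,
    stop=["\n\n"],          # stop at the end of the answer
)
print(response.choices[0].text.strip())
```

Changing the task means changing the prompt, not the model: no gradient updates are involved.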
While training GPT-2-117M on a folk music corpus written in ABC format, we found that an otherwise-high-quality model kept generating persistent syntax errors: random spaces would appear, rendering a music piece either erroneous or lower-quality. Why? It seems to be some issue with the GPT BPE encoder’s handling of spaces which makes it difficult to emit the right space-separated characters. We found that ABC does not actually require spaces, and we simply removed all spaces from the corpus—noticeably improving the quality of generated pieces.
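The fix itself is essentially a one-liner; a sketch (with illustrative filenames):

```python
# The space-stripping fix described above: since ABC notation does not require
# spaces, remove them all so the BPE encoder never has to reproduce space-separated
# tokens. Filenames are illustrative.
with open("abc-corpus.txt", encoding="utf-8") as f:
    text = f.read()

with open("abc-corpus-nospace.txt", "w", encoding="utf-8") as f:
    f.write(text.replace(" ", ""))
```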
Generating symbolic music with language models is a promising research area, with potential applications in automated music composition. Recent work shows that Transformer architectures can learn to generate compelling four-instrument scores from large MIDI datasets. In this paper, we re-train the small (117M) GPT-2 model with a large dataset in ABC notation, and generate samples of single-instrument folk music. Our BLEU and ROUGE based quantitative, and survey based qualitative, evaluations suggest that ABC notation is learned with syntactical and semantic correctness, and that samples contain robust and believable n-grams.
To expand the ABC GPT-2 model to cover a wider variety of musical genres, I turn to the next-most compact widespread music encoding format: MIDI. There are hundreds of thousands of MIDIs which can be decompiled to ABC format, averaging ~10k BPEs—within GPT-2-117M’s feasible context window when trained on TPUs (which permit training of context windows up to 30k wide).
We combine the ABC corpus from before with 2 large MIDI datasets converted to ABC, yielding ~453k usable ABC-MIDI musical files (~5.1GB of text). We trained January–April 2020 on our TPU swarm (with many interruptions), achieving a final loss of ~0.2 (underfit).
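A sketch of the decompilation step, calling the `midi2abc` tool from the abcMIDI suite on each file and concatenating the results into one training corpus; the `-o` output flag and directory layout are assumptions to check against your midi2abc version.

```python
# Decompile MIDI files to ABC text with midi2abc and build a single training corpus.
# The "-o" flag and paths are assumptions; verify against your midi2abc version.
import glob
import subprocess

pieces = []
for midi_path in glob.glob("midi/**/*.mid", recursive=True):
    abc_path = midi_path + ".abc"
    result = subprocess.run(["midi2abc", midi_path, "-o", abc_path])
    if result.returncode == 0:                      # skip files midi2abc rejects
        with open(abc_path, encoding="utf-8", errors="ignore") as f:
            pieces.append(f.read())

with open("abc-midi-corpus.txt", "w", encoding="utf-8") as out:
    out.write("\n\n".join(pieces))
```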
Sampling from the final model is hit-or-miss, as it is prone to the likelihood repetition trap, and it generates instruments one by one, so it is common for instruments to be cut off or otherwise broken during sampling (indicating that sampling is increasingly a bigger problem than training for long-range sequence modeling). However, successful pieces are possible, and are musically far more diverse than the folk ABC corpus, with many pleasingly complex samples.
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets.
We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset—matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text.
These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
This work applies natural language modeling to generate plausible strategic moves in the ancient game of Go. We train the Generative Pretrained Transformer (GPT-2) to mimic the style of Go champions as archived in Smart Game Format (SGF), which offers a text description of move sequences. The trained model further generates valid but previously unseen strategies for Go. Because GPT-2 preserves punctuation and spacing, the raw output of the text generator provides inputs to game visualization and creative patterns, such as the Sabaki project’s game engine using auto-replays. Results demonstrate that language modeling can capture both the sequencing format of championship Go games and their strategic formations. Compared to random game boards, the GPT-2 fine-tuning shows efficient opening move sequences favoring corner play over less advantageous center and side play. Game generation as a language modeling task offers novel approaches to more than 40 other board games where historical text annotation provides training data (e.g., Amazons & Connect 4/6).
This work demonstrates that natural language transformers can support more generic strategic modeling, particularly for text-archived games. In addition to learning natural language skills, the abstract transformer architecture can generate meaningful moves on a chessboard. With further fine-tuning, the transformer learns complex gameplay by training on 2.8 million chess games in Portable Game Notation. After 30,000 training steps, OpenAI’s Generative Pre-trained Transformer (GPT-2) optimizes weights for 774 million parameters. This fine-tuned Chess Transformer generates plausible strategies and displays game formations identifiable as classic openings, such as English or the Slav Exchange. Finally, in live play, the novel model demonstrates a human-to-transformer interface that correctly filters illegal moves and provides a novel method to challenge the transformer’s chess strategies. We anticipate future work will build on this transformer’s promise, particularly in other strategy games where features can capture the underlying complex rule syntax from simple but expressive player annotations.
A decompiler is a computer program that takes an executable file as input, and attempts to create a high level source file which can be recompiled successfully. It is therefore the opposite of a compiler, which takes a source file and makes an executable. Decompilers are usually unable to perfectly reconstruct the original source code, and as such, will frequently produce obfuscated code. Nonetheless, decompilers remain an important tool in the reverse engineering of computer software.