May 2020 gwern.net newsletter with anime GAN updates, links on AI scaling, discussion of GPT-3, and 1 book review.
26 Dec 2019–07 Jun 2020
finished
certainty: log
importance: 0
This is the May 2020 edition of the gwern.net newsletter; previous, April 2020 (archives). This is a summary of the revision-history RSS feed, overlapping with my Changelog & /r/gwern; brought to you by my donors on Patreon.
Writings
Ganbooru prototype: released 256px BigGAN trained on Danbooru2019; Danbooru2019 Figures dataset
gwern.net:
- experimental <srcset> mobile image optimization
- popups.js: +support for reverse-footnote popups
Mailing List Switch
The newsletter moved this month to Substack due to reaching the TinyLetter 5000-subscriber limit. Please let me know of any issues beyond the known issue of length truncation. (Note that reading the website version on desktop is the recommended way for annotations etc.)
GPT-3
On “GPT-3: Language Models are Few-Shot Learners”, Brown et al 2020 (poems; random samples)
Learning to learn. OA releases the long-awaited followup to GPT-2, one model to rule them all: a 117✕ larger 175b-parameter model with far more powerful language generation, which lets it solve a wide variety of problems from arithmetic to English translation to unscrambling anagrams to SAT analogies—purely from being prompted with text examples, without any specialized training or finetuning whatsoever, merely next-word prediction training on a big Internet text corpus. This implies GPT-3’s attention mechanisms serve as “fast weights” that have “learned to learn” by training on sufficiently varied data¹, forcing it to do more than just learn ordinary textual relationships. Like OpenAI’s Jukebox just weeks ago, the announcement of GPT-3 appears to have sunk almost without a trace, so I will go into more depth than usual.
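To make "prompted with text examples" concrete, here is a minimal sketch of how a few-shot prompt can be assembled as plain text. The helper name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the anagram examples are assumptions made for illustration, not the format used in Brown et al 2020. No model is called; the sketch only shows that the task specification lives entirely in the input string.

```python
from typing import List, Tuple


def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Format solved examples plus one unsolved query as a single text prompt.

    Hypothetical helper: the layout is an assumption for illustration only.
    """
    lines = [task, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    # The final answer is left blank; a language model would continue the
    # text from here by predicting the next tokens.
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    task="Unscramble the letters into an English word.",
    examples=[("tac", "cat"), ("odg", "dog")],
    query="drib",
)
print(prompt)
```

Nothing about the model's weights changes between tasks in this setup: swapping the examples for translations or arithmetic problems changes the task, which is the sense in which the learning happens at runtime.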
“Attacks only get better.” 2 years ago, GPT-1 was interestingly useful pretraining and adorable with its “sentiment neuron”. 1 year ago, GPT-2 was impressive with its excellent text generation & finetuning capabilities. This year, GPT-3 is scary because it’s a small & shallow model compared to what’s possible², with a simple uniform architecture³ trained in the dumbest way possible (unidirectional prediction of next text token) on a single impoverished modality (random Internet text dumps) on tiny data (fits on a laptop), and yet, the first version already manifests crazy runtime meta-learning—and the scaling curves still are not bending! The samples are also better than ever, whether it’s GPT-3 inventing new dick jokes or writing (mostly working) JavaScript tutorials about rotating arrays. Does it set SOTA on every task? No, of course not. But the question is not whether we can lawyerly find any way in which it might not work, but whether there is any way in which it might work. And there are many ways it might work better (see the “Limitations” section for just a few).
Scaling still working. I was surprised, as I had expected closer to 100b parameters, and I thought that the performance of CTRL/Meena/MegatronLM/T5/Turing-NLG/GPipe suggested that, the scaling papers⁴ notwithstanding, the scaling curves had started to bend and by 100b, it might be hard to justify further scaling. However, GPT-3 hits twice that without noticeable change in scaling factors. This suggests that it would be both possible and useful to head to trillions of parameters, and eyeballing the graphs, many benchmarks like the Winograd schema WinoGrande would fall by 10t parameters.
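The "eyeballing the graphs" step amounts to fitting a straight line in log-log space and reading it off at a larger parameter count. The sketch below shows that procedure on made-up numbers: the three data points are invented for illustration and are not taken from Brown et al 2020 or any other paper, and the function names are hypothetical. Whether a real curve keeps following the fitted line is exactly the open question.

```python
import math

# Made-up (parameter count, benchmark error rate) pairs for illustration only;
# these are NOT numbers from Brown et al 2020 or any other paper.
points = [(1e9, 0.40), (1e10, 0.32), (1e11, 0.256)]


def fit_power_law(data):
    """Least-squares fit of log(error) = a + b * log(params)."""
    xs = [math.log(n) for n, _ in data]
    ys = [math.log(e) for _, e in data]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    a = y_mean - b * x_mean
    return a, b


def predict_error(a, b, params):
    """Extrapolate the fitted power law to a new parameter count."""
    return math.exp(a + b * math.log(params))


a, b = fit_power_law(points)
for n in (1e12, 1e13):
    print(f"{n:.0e} params -> predicted error {predict_error(a, b, n):.3f}")
```

A negative slope `b` that stays constant across the observed range is what "the scaling curves are not bending" means; a bend would show up as the largest models sitting above the fitted line.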
We don’t know how to train NNs. As I keep saying, “NNs are lazy” and can do far more than we make them do when we push them beyond easy answers & cheap shortcuts. The bitter lesson is the harder and bigger, the better. (Besides GPT-3, one could mention recent progress in semi-supervised learning & the model-based DRL renaissance.)
Blessings of scale: stability→generalization→meta-learning. GPT-3 is hamstrung by its training & data, but DL enjoys an unreasonably effective blessing of dimensionality—just simply training a big model on a lot of data induces better properties like meta-learning without even the slightest bit of that architecture being built in; and in general, training on more and harder tasks creates ever more human-like performance, generalization, and robustness. OA5 does not just scale to, but stabilizes at, minibatches of millions due to gradient noise. OA5-like, BigGAN stabilizes at large-scale image datasets like JFT-300M & benefits from unusually large minibatches, while classifier CNNs like BiT transfer & robustify with human-like errors⁵, multimodal learning produces better representations on less data (eg ViLBERT/VideoBERT, motivating OA’s interest), and RNNs can predict videos. AlphaStar reaches human-level with hundreds of competing self-players to cover possible strategies. Imitation learning DRL like MetaMimic generalizes at hundreds of tasks to train a deep net. Disentanglement emerges in StyleGAN with sufficiently deep w embeddings, or in relational networks/GQN/Transformers with enough samples to force factorization. Training Dactyl on millions of domain randomizations induced similar implicit meta-learning where during each runtime invocation, the RNN probes its environment and encodes its understanding of robot hand control into its hidden state; and DD-PPO outperforms classical robot planners by scaling 2 orders. Or in Procgen, training on hundreds of levels trains agents individually, but at thousands of levels, they begin to generalize to unseen levels.
AlphaZero demonstrated truly superhuman Go without ‘delusions’ just by training a bigger model on a richer signal & pro-level play without any search—and MuZero, for that matter, demonstrated that just training an RNN end-to-end to predict a reward on enough data is enough to obsolete even AlphaZero and learn tree search implicitly (but better). And on and on.
The scaling hypothesis (that, once we find a scalable architecture like self-attention or convolutions, we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data) looks increasingly plausible.
Keeping track. GPT-3 in 2020 makes as good a point as any to take a look back on the past decade. In 2010, one could easily fit everyone in the world who genuinely believed in deep learning into a moderate-sized conference room (assisted slightly by the fact that 3 of them were busy founding DeepMind). Someone interested in machine learning in 2010 might have read about some stuff in recognizing hand-written digits using all of 1–2 million parameters, or some modest neural tweaks to standard hidden Markov model voice-recognition. In 2010, who would have predicted that over the next 10 years, deep learning would undergo a Cambrian explosion causing a mass extinction of alternative approaches throughout machine learning, that models would scale up to 175,000 million parameters, and that these enormous models would just spontaneously develop all these capabilities, aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as Moravec, Schmidhuber, Sutskever, Legg, & Amodei?
Hindsight is 20/20. Even in 2015, the scaling hypothesis seemed highly dubious: you needed something to scale, after all, and it was all too easy to look at flaws in existing systems and imagine that they would never go away and progress would sigmoid any month now, soon. Like the genomics revolution where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially & deliver powerful PGSes soon, while sober experts wrung their hands over “missing heritability” & the miraculous complexity of biology & scoffed about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly. Yet, here we are: all honor to the fanatics, and shame and humiliation to the critics!⁶ If only one could go back 10 years, or even 5, to watch every AI researcher’s head explode reading this paper… Unfortunately, few heads appear to be exploding now, because human capacity for hindsight & excuses is boundless (“I can get that much with finetuning, anyway I predicted it all along, how boring”) and “there is no fire alarm”. (If you are still certain that there is near-zero probability of AGI in the next few decades, why? Did you predict—in writing—capabilities like GPT-3? Is this how you expect AI failure to look in the decades beforehand? What specific task, what specific number, would convince you otherwise? How would the world look different than it does now if these crude prototype insect-brain-sized DL systems were not on a path to success?)
Authority without accountability. What should we think about the experts? Projections of failure were made by eminent, respectable, serious people. They spoke in considered tones of why AI hype was excessive and might trigger an “AI winter”, and the fundamental flaws of fashionable approaches and why brute force could not work. These statements were made routinely in 2014, 2015, 2016… And they were wrong. I am aware of few issuing a mea culpa or reflecting on it.
Phatic, not predictive. There is, however, a certain tone of voice the bien pensant all speak in, whose sound is the same whether right or wrong; a tone shared with many statements in January to March of this year; a tone we can also find in a 1940 Scientific American article authoritatively titled, “Don’t Worry—It Can’t Happen”, which advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)
The iron law of bureaucracy: Cathedral gothic. This tone of voice is the voice of authority. The voice of authority insists on calm, and people not “panicking” (the chief of sins). The voice of authority assures you that it won’t happen (because it can’t happen). The voice utters simple arguments about why the status quo will prevail, and considers only how the wild new idea could fail (and not all the possible options). The voice is not, and does not deal in, uncertainty; things will either happen or they will not, and since it will not happen, there is no need to take any precautions (and you should not worry because it can’t happen). The voice does not believe in drawing lines on graphs (it is rank numerology). The voice does not issue any numerical predictions (which could be falsified). The voice is opposed to unethical things like randomized experiments on volunteers (but will overlook the insult). The voice does not have a model of the future (because a model implies it does not already know the future). The voice is concerned about its public image (and unkind gossip about it by other speakers of the voice). The voice is always sober, respectable, and credentialed (the voice would be pleased to write an op-ed for your national magazine and/or newspaper). The voice speaks, and is not spoken to (you cannot ask the voice what objective fact would change its mind). The voice never changes its mind (until it does). The voice is never surprised by events in the world (only disappointed). The voice advises you to go back to sleep (right now).
When someone speaks about future possibilities, what is the tone of their voice?
Media
Links
AI:
Matters Of Scale:
- GPT-3: see above
- “Measuring the Algorithmic Efficiency of Neural Networks”, Hernandez & Brown 2020 (blog/interview; the first prototype is never the best one, but given enough compute & time, you can refine it and figure out how it should have been done all along, and this paper quantifies the neural net hardware overhang just since 2012: “it now takes 44✕ less compute to train…to the level of AlexNet”. Unsurprising—eg the experience curve in linear programming: Bixby 2002; see Grace 2013/Yudkowsky 2013. We don’t know how to train the right kind of neural nets and make huge mistakes with the simplest things, as capability jumps like resnets or EfficientNet or R2D2 occasionally remind us.)
- “IntelliCode Compose: Code Generation Using [GPT-2] Transformer”, Svyatkovskiy et al 2020 (video?; unclear if application of ZeRO-2)
- “GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce” (blog; one model, 7 datasets, 89m images, 83 losses/tasks, and +8% search quality boost worldwide)
“Deep neuroethology of a virtual rodent”, Merel et al 2019 (media)
“Go-Explore 2: First return then explore”, Ecoffet et al 2020
“Learning to Simulate Dynamic Environments with GameGAN”, Kim et al 2020 (project page; an unexpected appearance of a Neural Turing Machine)
“Exploring Bayesian Optimization: Breaking Bayesian Optimization into small, sizeable chunks”, Agnihotri & Batra 2020
“This Word Does Not Exist” (GPT-2); “This Fursona Does Not Exist (TFDNE)” editor (a simple but high-quality StyleGAN 2 face model of furries, also available on Artbreeder; interesting for how the fur flew due to legal fuzziness & some artists acting like animals, howling about ‘theft’ & free fursonas being a wolf in sheep’s clothing upsetting their pecking order⁷—though the creator has outfoxed the paper tiger threats, these kittlesome questions will dog ML as DL models multiply like rabbits)
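The "44✕ less compute" figure quoted in the Hernandez & Brown 2020 entry above can be converted into a halving time with one line of arithmetic. The 7-year window (2012 to 2019) is an assumption made here for illustration; the entry itself only says "since 2012", so treat the result as a rough sketch, not a number from the paper.

```python
import math

efficiency_gain = 44.0  # "44x less compute" quoted in the entry above
window_years = 7.0      # ASSUMPTION: 2012 to 2019; the entry only says "since 2012"

# Number of times the required compute was cut in half over the window.
halvings = math.log2(efficiency_gain)

# Average time per halving, in months.
halving_months = window_years * 12 / halvings

print(f"{halvings:.2f} halvings, one every {halving_months:.1f} months")
```

Changing `window_years` shows how sensitive the rate is to the assumed start and end dates, which is why the window is flagged as an assumption.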
Genetics:
Everything Is Heritable:
- “Local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits”, Zhang et al 2020 (“autism/IQ rg…could be explained by 2 etiologically-distinct genetic signatures w/bidirectional local genetic correlations”)
- “Identification of 370 loci for age at onset of sexual and reproductive behaviour, highlighting common aetiology with reproductive biology, externalizing behaviour and longevity”, Mills et al 2020
- “Genome-wide association study of school grades identifies a genetic overlap between language ability, psychopathology and creativity”, Rajagopal et al 2020 (“math performance was severely affected whereas language performance (Danish and English) was relatively unaffected or enhanced in those with psychiatric disorders”)
- “GWAS of Depression Phenotypes in the Million Veteran Program and Meta-analysis in More than 1.2 Million Participants Yields 178 Independent Risk Loci”, Levey et al 2020
- “A Single Gene Causes Thelytokous Parthenogenesis, the Defining Feature of the Cape Honeybee Apis mellifera capensis”, Yagound et al 2020
- “Insights into the genetic architecture of the human face”, White et al 2020
Recent Evolution:
- “Sex-biased reduction in reproductive success drives selective constraint on human genes”, Gardner et al 2020; “Genome-wide analysis identifies genetic effects on reproductive success and ongoing natural selection at the FADS locus”, Mathieson et al 2020 (previously: Barban et al 2016/Tropf et al 2016/Verweij et al 2017)
- “Disentangling selection on genetically correlated polygenic traits using whole-genome genealogies”, Stern et al 2020
Engineering:
Statistics/meta-science/mathematics:
- “Variability in the analysis of a single neuroimaging dataset by many teams”, Botvinik-Nezer et al 2020
- “Remembering John Conway’s FRACTRAN, a ridiculous, yet surprisingly deep language”, Reginald Braithwaite (how does the recently-deceased John Conway’s 1980 esolang lead to the Collatz conjecture?)
- “Tumbling toast, Murphy’s Law and the fundamental constants”, Matthews 1995 (overview; anthropics size argument from Press 1980; see also Bacon et al 2001/Borghi 2012)
Politics/religion:
- Review of The Cultural Revolution
- “The Voluntariness of Voluntary Consent: Consent Searches and the Psychology of Compliance”, Sommers & Bohns 2019 (people are bad at predicting resistance to police requests; see also Christin et al 2012)
- Operation INFEKTION (see also Gordon 1997)
- “Progress Studies for Aspiring Young Scholars” (experimental online summer class for high school students by Jason Crawford on development)
Psychology/biology:
- “Understanding immunity through the lens of disease ecology”, Hedrick 2017⁸ (“…for the past few thousand years, we human beings have been the most diseased species on earth”; followup to Hedrick 2004)
- “How sanitation conquered disease long before vaccines or antibiotics”, Jason Crawford
- “Everyday Life as an Intelligence Test: Effects of Intelligence and Intelligence Context”, Gordon 1997
- “Objective and subjective experiences of child maltreatment and their relationships with psychopathology”, Danese & Widom 2020 (nothing in psychology makes sense except in the light of individual-differences)
- “Brainless but Multi-Headed: Decision Making by the Acellular Slime Mould Physarum polycephalum”, Beekman & Latty 2015
- “I’m paid biweekly, just not by leprechauns: Evaluating valid-but-incorrect response rates to attention check items”, Curran & Hauser 2019 (how do “lizardman constant” responders justify it? Or, ‘free response is the devil’)
Technology:
- “Reflections on How Designers Design with Data”, Bigelow et al 2014 (why are data visualizations so bad—superficially pretty but misleading or useless? Because many designers don’t look at the data, avoid automation & create manually so they can focus on pretty shapes/colors & enjoying fiddling with it, and ignore readers)
- “Do Ads Harm News Consumption?”, Yan et al 2020 (“Users who adopt ad blockers subsequently consume 20% more news articles corresponding to 10% more categories. The effect persists over time…”; see my ad page)
- “The 1-Bit Instrument: The Fundamentals of 1-Bit Synthesis, Their Implementational Implications, and Instrumental Possibilities”, Troise 2020
Economics:
- “In Ohio, the Amish Take On the Coronavirus” (supply and demand: masks can be easily made anywhere if prices are allowed to rise & they are not illegal to sell)
- “The Story of America’s Most Prolific Counterfeiter” (how Frank Bourassa tricked a Swiss mill into selling him the unique U.S. dollar linen-paper to create $317m in perfect counterfeit money & mostly got away with it)
Fiction:
Misc:
Books
Fiction:
- The Battle Between the Frogs and the Mice: A Tiny Homeric Epic, translated Stallings 2009 (review)
Music
- “Sept Jours sans Elle (Vocal)” (Raven’s Jig; Une Semaine chez les Écarlates {2018}) [classical]
- “Un Jour Joueur” (Raven’s Jig; Une Semaine chez les Écarlates {2018}) [classical]
- “Bons et mauvais Jours” (Raven’s Jig; Une Semaine chez les Écarlates {2018}) [classical]
MLP:
- “Morning in Baltimare” (Mane in Green; II. The Journey [The Quest of the Lost Sapphire—Ep. 2] {2017}) [instrumental rock]
- “Love and Reflection” (Dionte George; Ignite {2020}) [jazz]
- “Second Prances (Vocal VIP)” (Etherium Apex ft. Nicole Carino {2020}) [electronic]
- “Spun” (The Wasteland Wailers feat. Brittany Church & Haymaker; Ignite {2020}) [country]
- “Equiterian Empire” (Carbon Maestro; Celestial Divide OST) [orchestral]
- “The Storm Is Coming VIP [Single Purpose Remix]” (UndreamedPanic feat. Metajoker; Ignite {2020}) [rock]
- “Mare Cognitum” (Idyllia feat. Velvet R. Wings; Ignite {2020}) [orchestral rock]
- “Fire City (Day & Night)” (Wandering Artist; Ignite {2020}) [orchestral]
- “What Remains” (Totalspark; Ignite {2020}) [Liquid Drum & Brass]
Doujin:
- “Come, Sweet Death [Komm, süsser Tod]” (Platina Jazz feat. Niklas Gabrielsson; Anime Standards Vol. 6 {2019}) [jazz]
- “Hope” (Simpsonill {2017}) [electronic]
On implicit meta-learning, see: Botvinick et al 2019, Clune 2019, Schmidhuber 2015/2018, Weng 2018/Weng 2019.↩︎
GPT-3 hardly costs more than a few million dollars of compute (now), while Big Science projects like ITER blow >5000✕ the funding to mostly fail. GPT-3 could have been done decades ago with global computing resources & scientific budgets; what could be done with today’s hardware & budgets that we just don’t know or care to do? There is a hardware overhang.↩︎
Eg no use of brain imitation learning or neural architecture search to try to tailor the model, or even decide basic hyperparameters like widths (which as EfficientNet shows, can make quite a difference even in “well-understood and hand-optimized vanilla architectures”).↩︎
Specifically: Sun et al 2017, Hestness et al 2017, Shallue et al 2018, McCandlish et al 2018, Rosenfeld et al 2019, Li et al 2020, Kaplan et al 2020, Roller et al 2020. It is noteworthy that the pursuit of large models is driven almost exclusively by OpenAI & industry entities (the latter of which are content with far smaller models), and that academia has evinced an almost total disinterest (disgust, even). For all that the scaling hypothesis is ‘obvious’ and scaling is ‘predicted’, there is remarkably little interest in actually doing it. Perhaps we should pay more attention to what people do rather than what they say.↩︎
One interesting aspect of image scaling experiments is that even when performance is ‘plateauing’ on the original task & approaching label error, the transfer learning continues to improve. Apparently the internal representations, even when adequate for mere classification and so the score cannot increase more than a small percentage, become more human-like—encoding dark knowledge? I’ve noticed with language models, the final fractions of a loss appear to make a substantial difference to generated sample quality, perhaps because it is only after all the easier modeling is finished that the lazy language model is forced to squeeze out the next bit of performance by more correctly modeling more sophisticated things like logic, objects, world-knowledge, etc.↩︎
Now that GPT-3’s few-shot and T5 finetuning have begun to make people like Gary Marcus feel slightly nervous about WinoGrande, they have begun preparing their excuses for why Winograd schemas weren’t really good measures of commonsense reasoning/intelligence (because intelligence, of course, is whatever AI can’t do yet).↩︎
Don’t worry: we already have short-shorts & ear-TIPS to hedge against fursona inflation. That said, we advise taking a large position in equineties image macro funds to benefit from a flight to quality and herding: it’ll be a bear market for kinky bonds—and that’s no bull.↩︎
Some interesting references:
Coevolution Of Virulence:
- Experimental Epidemiology, Greenwood et al 1936 (editorial)
- “Population biology of infectious diseases: Part I”/“Part II”, Anderson & May 1979
- “Coevolution of hosts and parasites”, Anderson & May 1982
Passaging:
- “Experimental Evolution of Parasites”, Ebert 1998
- “History of Sabin attenuated poliovirus oral live vaccine strains”, Sabin & Boulger 1973 (making Sabin’s polio vaccine by dozens of passages through monkeys & monkey tissues)