May 2020 news & 'On GPT-3'

May 2020 newsletter: GPT-3 scaling, implications, deep theory; anime GAN updates, and 1 book review.
newsletter, NN, insight-porn
26 Dec 2019–04 Aug 2020; finished; certainty: log; importance: 0

This is the edition of ; previous, (). This is a summary of the revision-history RSS feed, overlapping with my & ; brought to you by my donors on Patreon.



On , Brown et al 2020 ( & my followup , compare ; random samples; with real-world demos)

Read The Samples

I strongly encourage anyone interested in GPT-3 to also at least skim OA’s random samples, or better yet, my samples in —reading the paper & looking at some standard benchmark graphs does not give a good feel for what working with GPT-3 is like or the diversity of things it can do which are missed by benchmarks.

Learning to learn. OA releases the long-awaited followup to , one model to rule them all: a 117× larger 175b-parameter model with far more powerful language generation, which lets it solve a wide variety of problems from arithmetic¹ to English translation to unscrambling anagrams to SAT analogies—purely from being prompted with text examples, without any specialized training or finetuning whatsoever, merely next-word prediction training on a big Internet text corpus. This implies GPT-3’s attention mechanisms serve as that have “learned to learn” by training on sufficiently varied data², forcing it to do more than just learn ordinary textual relationships. Like OpenAI’s just weeks ago (itself a remarkable demonstration of scaling in synthesizing raw audio music complete with remarkably realistic voices/instruments), the announcement of GPT-3 appears to have sunk almost without a trace, so I will go into more depth than usual.
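To make "prompted with text examples" concrete, here is a minimal sketch of how a few-shot arithmetic prompt might be assembled. The `Q:`/`A:` template, the function name `build_few_shot_prompt`, and the `use_commas` switch are all illustrative assumptions rather than anything taken from the paper; the comma option only reflects the footnote's observation that comma-formatted numbers seem to help with the BPE issue. No model is called: the string is simply what one would hand to a text-completion model.

```python
def build_few_shot_prompt(examples, query, use_commas=False):
    """Assemble a plain-text few-shot prompt: solved examples, then one open query.

    `examples` is a list of (a, b) integer pairs whose sums are shown as answers;
    `query` is the (a, b) pair left for the model to complete.
    The Q:/A: layout is a hypothetical template, not the paper's exact format.
    """
    def fmt(n):
        # Optionally insert thousands separators, e.g. 12345 -> "12,345".
        return f"{n:,}" if use_commas else str(n)

    lines = []
    for a, b in examples:
        lines.append(f"Q: What is {fmt(a)} plus {fmt(b)}?")
        lines.append(f"A: {fmt(a + b)}")
    qa, qb = query
    lines.append(f"Q: What is {fmt(qa)} plus {fmt(qb)}?")
    lines.append("A:")  # left open: the model's continuation is the answer
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    examples=[(12345, 54321), (20001, 999)],
    query=(48213, 10077),
    use_commas=True,
)
print(prompt)
```

The point of the sketch is that nothing task-specific happens outside the prompt text itself: the "training" for the task is the handful of solved lines.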

“Attacks only get better.” 2 years ago, was interestingly useful pretraining and adorable with its “sentiment neuron”. 1 year ago, GPT-2 was impressive with its excellent text generation & finetuning capabilities. This year, GPT-3 is scary because it’s a magnificently obsolete architecture from early 2018, which is small & shallow compared to what’s possible³, with a simple uniform architecture⁴ trained in the dumbest way possible (unidirectional prediction of next text token) on a single impoverished modality (random Internet HTML text dumps⁵) on tiny data (fits on a laptop), sampled in a dumb way⁶, and yet, the first version already manifests crazy runtime meta-learning—and the scaling curves still are not bending! The samples are also better than ever, whether it’s GPT-3 inventing new penis jokes⁷ or writing (mostly working) about rotating arrays.
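For readers unfamiliar with how crude "sampled in a dumb way" is, the following is a minimal sketch of top-k sampling, the heuristic named in the footnotes. It is a generic illustration over a toy list of logits, not GPT-3's actual decoding code; `top_k_sample`, `toy_logits`, and the choice of `k` and `temperature` are assumptions made for the example.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Pick one token index by sampling among the k highest-scoring logits.

    Everything outside the top k gets probability zero; the survivors are
    renormalized with a softmax at the given temperature.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy "vocabulary" of 5 tokens; indices 3 and 0 have the highest scores.
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

The hard cutoff is what makes this a heuristic: a token ranked k+1 is never emitted no matter how plausible it is, which is one reason a sample can fail to reveal knowledge the model has.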

Is GPT actually part of AGI—or is the cake a lie? (LeCun 2019)

Not the whole picture, but a big part. Does it set SOTA on every task? No, of course not. But the question is not whether we can lawyerly find any way in which it might not work, but . And there are many ways it might work better (see the “Limitations” section for just a few). Does GPT-3 do anything like steer a robot around SF shooting lasers and rockets at humans⸮ No, of course not. It is ‘just’ a text prediction model, an idiot savant of text; but an idiot savant, we should remember, is only a genetic mutation or bit of brain damage away from a normal human. If RL is the cherry on the top of the supervised learning frosting, and supervised learning is the frosting on top of the unsupervised learning cake, well, it looks like the cake layers are finally rising.

Scaling still working. I was surprised, as I had expected closer to 100b parameters, and I thought that the performance of ///// suggested that, the scaling papers⁸ notwithstanding, the scaling curves had started to bend and by 100b, it might be hard to justify further scaling. However, in the latest version of , GPT-3 hits twice that without noticeable change in scaling factors: its scaling continues to be roughly logarithmic/power-law, as it was for much smaller models & as forecast, and it has not hit a regime where gains effectively halt or start to require increases vastly beyond feasibility. That suggests that it would be both possible and useful to head to trillions of parameters (which are still well within available compute & budgets, requiring merely thousands of GPUs & perhaps $10–$100m budgets assuming no improvements which of course there will be, see Hernandez & Brown 2020 etc in this issue), and eyeballing the graphs, many benchmarks like the would fall by 10t parameters.

A better GPT-3 lesson.

Anti-scaling: penny-wise, pound-foolish. GPT-3 is an extraordinarily expensive model by the standards of machine learning: it is estimated that training it may require the annual cost of more machine learning researchers than you can count on one hand (~$5m⁹), up to $30 of hard drive space to store the model (500–800GB), and multiple pennies of electricity per 100 pages of output (0.4 kWh). Researchers are concerned about the prospects for scaling: can ML afford to run projects which cost more than 0.1 milli-Manhattan-Projects⸮¹⁰ Surely it would be too expensive, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100× to a trivial thing like human-like performance in many domains⸮ Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels.¹¹ Nonetheless, I think we can expect further scaling. (10×? No, 10× isn’t cool. You know what’s cool? 100–1000×.)
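The unit conversions behind the joke can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the figures quoted in this piece (~$5m training estimate, ~$24b Manhattan Project, 0.4 kWh per 100 pages); the electricity price and the assumption that cost scales linearly with a 100× scale-up are mine, chosen purely for illustration.

```python
MANHATTAN_PROJECT_USD = 24e9       # ~$24b, the figure given in the footnotes
GPT3_TRAINING_USD = 5e6            # ~$5m training-compute estimate quoted above
KWH_PER_100_PAGES = 0.4            # energy per 100 generated pages, from the paper
ASSUMED_USD_PER_KWH = 0.10         # assumption: a round retail electricity price


def milli_manhattans(usd):
    """Express a dollar cost in thousandths of a Manhattan Project."""
    return usd / MANHATTAN_PROJECT_USD * 1000


gpt3 = milli_manhattans(GPT3_TRAINING_USD)
# Assumption: naive linear scaling of cost with a 100x larger training run.
scaled_100x = milli_manhattans(100 * GPT3_TRAINING_USD)
cents_per_100_pages = KWH_PER_100_PAGES * ASSUMED_USD_PER_KWH * 100

print(f"GPT-3: {gpt3:.2f} milli-Manhattan-Projects")
print(f"100x, linear cost: {scaled_100x:.1f} milli-Manhattan-Projects")
print(f"electricity: {cents_per_100_pages:.0f} cents per 100 pages")
```

Under these assumptions the 100× figure lands in the low tens of milli-Manhattan-Projects, the same order of magnitude as the round number in the text; discounts or efficiency gains would pull it lower.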

How far will scaling go? The scaling papers suggest that the leaps we have seen over the past few years are not even half way there in terms of absolute likelihood loss, never mind what real-world capabilities each additional decrement translates into. The scaling curves are extremely clean; from Kaplan et al 2020:

DL scaling laws: compute, data, model parameters.

GPT-3 represents ~10³ on this chart, leaving plenty of room for further loss decreases—especially given the uncertainty in extrapolation:

Projecting DL power laws: still room beyond GPT-3.

Lo and behold, the scaling laws continue for GPT-3 models for several orders past Kaplan et al 2020; from Brown et al 2020:

GPT-3 continues to scale as predicted.

If we see such striking gains in reducing the validation loss from ~4 to ~2, what is left to emerge as we reduce to 1.5, or 1? How far does this go, exactly? Bueller? Bueller? (See also Meena’s perplexity vs human-ness chatbot ratings.)
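To get a feel for what "how far does this go" means numerically, here is a sketch of a pure power law in compute, L(C) = (C_c / C)^α. The constants `ALPHA` and `C_CRIT` are assumptions: they are set to roughly the compute-frontier values I recall from Kaplan et al 2020 and should be checked against the paper before being relied on. The sketch also assumes the chart's compute axis is in petaflop/s-days and ignores any irreducible loss floor, so the extrapolations are illustrative only.

```python
ALPHA = 0.050     # assumed power-law exponent for compute
C_CRIT = 3.1e8    # assumed critical compute scale, in petaflop/s-days


def loss_at(compute_pf_days):
    """Predicted loss (nats/token) at a given training compute, pure power law."""
    return (C_CRIT / compute_pf_days) ** ALPHA


def compute_for(loss):
    """Invert the power law: compute needed to reach a target loss."""
    return C_CRIT / loss ** (1 / ALPHA)


gpt3_scale = 1e3  # roughly where GPT-3 sits on the chart
print(f"loss at GPT-3 scale: {loss_at(gpt3_scale):.2f}")
for target in (1.5, 1.0):
    ratio = compute_for(target) / gpt3_scale
    print(f"loss {target}: ~{ratio:,.0f}x GPT-3's compute")
```

With an exponent this small, each 10× of compute trims the loss by only about a tenth, which is why modest-looking loss targets translate into very large compute multiples, and why small uncertainty in the exponent swamps any precise forecast.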

“Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.”

Geoff Hinton

We don’t know how to train NNs. As I keep saying, “NNs are lazy” and can do far more than we make them do when we push them beyond easy answers & cheap shortcuts. The is the harder and bigger, the better. (Besides GPT-3, one could mention recent progress in semi-supervised learning & the model-based DRL renaissance.)

AlphaGo Zero: ‘just stack moar layers lol!’

Blessings of scale: stability→generalization→meta-learning. GPT-3 is hamstrung by its training & data, but DL enjoys an unreasonably effective blessing of dimensionality—just simply training a big model on a lot of data induces better properties like meta-learning without even the slightest bit of that architecture being built in; and in general, training on more and harder tasks creates ever more human-like performance, generalization, and robustness. The GPT models, and for images, show that simply scaling up models & datasets without any supervision produces results competitive with the best (and most complex) alternatives, using the same simple architecture. OA5 does not just scale to, but stabilizes at, minibatches of millions due to . OA5-like, stabilizes at large-scale image datasets like JFT-300M & benefits from unusually large minibatches, while classifier CNNs like /Dojolonga et al 2020 or or transfer & robustify with human-like errors¹², multimodal learning produces better representations on less data (eg /, motivating ), and RNNs can . reaches human-level with hundreds of competing self-players to cover possible strategies. Imitation learning DRL like generalizes at hundreds of tasks to train a deep net. Disentanglement emerges in with sufficiently deep w embeddings, with enough parameters to train raw audio in the aforementioned Jukebox, or in // with enough samples to force factorization. Training on millions of domain randomizations induced similar implicit meta-learning where during each runtime invocation, the RNN probes its environment and encodes its understanding of robot hand control into its hidden state; and outperforms classical robot planners by scaling 2 orders. Or in , training on hundreds of levels trains agents individually, but at thousands of levels, they begin to generalize to unseen levels. 
demonstrated truly superhuman Go without ‘delusions’ just by training a bigger model on a richer signal & pro-level play without any search—and , for that matter, demonstrated that just training an RNN end-to-end to predict a reward on enough data is enough to obsolete even AlphaZero and learn tree search implicitly (but better). And on and on.

The scaling hypothesis that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly (eg “The Brain as a Universal Learning Machine” or Hawkins), we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data, looks increasingly plausible.

Keeping track. GPT-3 in 2020 makes as good a point as any to take a look back on the past decade. In 2010, one could easily fit everyone in the world who genuinely believed in deep learning into a moderate-sized conference room (assisted slightly by the fact that 3 of them were busy founding ). Someone interested in machine learning in 2010 might have read about some stuff in recognizing hand-written digits using all of 1–2 million parameters, or some modest neural tweaks to standard hidden Markov model voice-recognition. In 2010, who would have predicted that over the next 10 years, deep learning would undergo a Cambrian explosion causing a mass extinction of alternative approaches throughout machine learning, that models would scale up to 175,000 million parameters, and that these enormous models would just spontaneously develop all these capabilities, aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as , Schmidhuber, Sutskever, Legg, & Amodei? In 1998, 22 years ago, Moravec noted that AI research could be deceptive, and hardware limits meant that “intelligent machine research did not make steady progress in its first 50 years, it marked time for 30 of them!”, predicting that as Moore’s law continued, “things will go much faster in the next 50 years than they have in the last 50.” The accelerating pace of the last 10 years should wake anyone from their dogmatic slumber and make them sit upright. And there are 28 years left in Moravec’s forecast…

The temptation, that many do not resist so much as revel in, is to give in to a déformation professionnelle and dismiss any model as “just” this or that (“just billions of IF statements” or “just a bunch of multiplications” or “just millions of memorized web pages”), missing the forest for the trees, as Moravec commented of chess engines:

The event was notable for many reasons, but one especially is of interest here. Several times during both matches, Kasparov reported signs of mind in the machine. At times in the second tournament, he worried there might be humans behind the scenes, feeding Deep Blue strategic insights!…In all other chess computers, he reports a mechanical predictability stemming from their undiscriminating but limited lookahead, and absence of long-term strategy. In Deep Blue, to his consternation, he saw instead an “alien intelligence.”

…Deep Blue’s creators know its quantitative superiority over other chess machines intimately, but lack the chess understanding to share Kasparov’s deep appreciation of the difference in the quality of its play. I think this dichotomy will show up increasingly in coming years. Engineers who know the mechanism of advanced robots most intimately will be the last to admit they have real minds. From the inside, robots will indisputably be machines, acting according to mechanical principles, however elaborately layered. Only on the outside, where they can be appreciated as a whole, will the impression of intelligence emerge. A human brain, too, does not exhibit the intelligence under a neurobiologist’s microscope that it does participating in a lively conversation.

But of course, if we ever succeed in AI, or in reductionism in general, it must be by reducing Y to ‘just X’. Showing that some task requiring intelligence can be solved by a well-defined algorithm with no ‘intelligence’ is precisely what success must look like! (Otherwise, the question has been thoroughly begged & the problem has only been pushed elsewhere; computer chips are made of Boolean logic, not especially little homunculi.)

“Give it the compute, give it the data, and it will do amazing things. This stuff is like—it’s like alchemy!”

Ilya Sutskever, summer 2019

Hindsight is 20/20. Even in 2015, the scaling hypothesis seemed highly dubious: you needed something to scale, after all, and it was all too easy to look at flaws in existing systems and imagine that they would never go away and progress would sigmoid any month now, soon. Like the genomics revolution where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially & deliver powerful PGSes soon, while sober experts wrung their hands over “missing heritability” & the miraculous complexity of biology & scoffed about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly. Yet, here we are: all honor to the fanatics, shame and humiliation to the critics!¹³ If only one could go back 10 years, or even 5, to watch every AI researcher’s head explode reading this paper… Unfortunately, few heads appear to be exploding now, because human capacity for hindsight & excuses is boundless (“I can get that much with finetuning, anyway I predicted it all along, how boring”) and, unfortunately, for AGI. (If you are still certain that there is near-zero probability of AGI in the next few decades, why? Did you predict—in writing—capabilities like GPT-3? Is this how you expect AI failure to look in the decades beforehand? What specific task, what specific number, would convince you otherwise? How would the world look different than it does now if these crude prototype insect-brain-sized DL systems were not on a path to success?)

Authority without accountability. What should we think about the experts? Projections of failure were made by eminent, respectable, serious people. They spoke in considered tones of why AI hype was excessive and might trigger an “AI winter”, and the fundamental flaws of fashionable approaches and why brute force could not work. These statements were made routinely in 2014, 2015, 2016… And they were wrong. I am aware of few issuing a mea culpa or reflecting on it.¹⁴ It is a puzzling failure, and I’ve reflected on it before.

Phatic, not predictive. There is, however, a certain tone of voice the bien pensant all speak in, whose sound is the same whether right or wrong; a tone shared with many statements in January to March of this year; a tone we can also find in a 1940 Scientific American article authoritatively titled, , which advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)

The iron law of bureaucracy: Cathedral gothic. This tone of voice is the voice of .
The voice of authority insists on calm, and people not “panicking” (the chief of sins).
The voice of authority assures you that it won’t happen (because it can’t happen).
The voice utters simple arguments about why the status quo will prevail, and considers only how the wild new idea could fail (and not all the possible options).
The voice is not, and does not deal in, uncertainty; things will either happen or they will not, and since it will not happen, there is no need to take any precautions (and you should not worry because it can’t happen).
The voice does not believe in drawing lines on graphs (it is rank numerology).
The voice does not issue any numerical predictions (which could be falsified).
The voice will not share its source code (for complicated reasons which cannot be explained to the laity).
The voice is opposed to unethical things like randomized experiments on volunteers (but will overlook the insult).
The voice does not have a model of the future (because a model implies it does not already know the future).
The voice is concerned about its public image (and unkind gossip about it by other speakers of the voice).
The voice is always sober, respectable, and credentialed (the voice would be pleased to write an op-ed for your national magazine and/or newspaper).
The voice speaks, and is not spoken to (you cannot ask the voice what objective fact would change its mind).
The voice never changes its mind (until it does).
The voice is never surprised by events in the world (only disappointed).
The voice advises you to go back to sleep (right now).

When someone speaks about future possibilities, what is the tone of their voice?








  1. Given the number of comments on the paper’s arithmetic benchmark, I should point out that the arithmetic benchmark appears to greatly understate GPT-3’s abilities due to the BPE encoding issue: even using commas markedly improves its 5-digit addition ability, for example. The BPE issue also appears to explain much of the poor performance on the anagram/shuffling tasks. This is something to keep in mind for any task which requires character-level manipulation or understanding.↩︎

  2. On implicit meta-learning, see: , , /, /.↩︎

  3. GPT-3 hardly costs more than a few million dollars of compute (now) and is cheap to run (pg39: “Even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs.”), while IBM’s (otherwise useless) Deep Blue AI project reputedly cost >$10m for the final iteration (reports of $192m appear to be a confusion with the estimated value of publicity mentioned in pg187 of Hsu’s Behind Deep Blue) and Big Science projects like blow >5000× the funding to mostly fail. (The particle physicists, incidentally, are back asking for ≫$24b, based, presumably, on the countless scientific revolutions & world-changing breakthroughs that the LHC’s >$12b investment produced…) GPT-3 could have been done decades ago with global computing resources & scientific budgets; what could be done with today’s hardware & budgets that we just don’t know or care to do? There is a hardware overhang. (See also the Whole Brain Emulation Roadmap & .)↩︎

  4. Eg a narrow context window severely limits it, and motivates the need for efficient attention. More broadly, GPT-3 does nothing exotic—no use of or neural architecture search to try to tailor the model, or even decide basic hyperparameters like widths (which as shows, can make quite a difference even in “well-understood and hand-optimized vanilla architectures”).↩︎

  5. Not even PDFs—so no Google Books, no Arxiv, no Libgen, no Sci-Hub…↩︎

  6. Generating text from a LM can reveal the presence of knowledge, but not its absence, and it is universally agreed that the current crude heuristic methods like top-k cannot possibly be optimal.↩︎

  7. ‘A man is at the doctor’s office, and the doctor tells him, “I’ve got some good news and some bad news for you.” / The man says, “Well, I can’t take the bad news right now, so give me the good news first.” / The doctor says, “Well, the good news is that you have an 18-inch penis.” / The man looks stunned for a moment, and then asks, “What’s the bad news?” / The doctor says, “Your brain’s in your dick.”’↩︎

  8. Specifically: , , , , , , , , , , , .

    It is noteworthy that the pursuit of large models is driven almost exclusively by OpenAI & industry entities (the latter of which are content with far smaller models), and that academia has evinced an almost total disinterest—disgust & anger, even, and denial (one might say “green AI” is green with envy). For all that the scaling hypothesis is ‘obvious’ and scaling is ‘predicted’, there is remarkably little interest in actually doing it. Perhaps we should pay more attention to what people do rather than what they say.↩︎

  9. Roughly around Chuan Li’s estimate, using nominal list prices without discounts (which could be steep as the marginal costs of cloud compute are substantially lower). The R&D project cost would be much higher, but is amortized over all subsequent models & projects.↩︎

  10. The Manhattan Project cost ~$24b.↩︎

  11. As if we live in a world where grad students could go to the Moon on a ramen budget if we just wished hard enough, or as if “green AI” approaches to try to create small models without going through big models did not look increasingly futile and like throwing good money after bad, and were not the least green of all AI research…↩︎

  12. One interesting aspect of image scaling experiments like is that even when performance is ‘plateauing’ on the original task & approaching label error, the transfer learning continues to improve. Apparently the internal representations, even when adequate for mere classification and so the score cannot increase more than a small percentage, become more human-like—because it’s encoding or more ? I’ve noticed with language models, the final fractions of a loss appear to make a substantial difference to generated sample quality, perhaps because it is only after all the easier modeling is finished that the lazy language model is forced to squeeze out the next bit of performance by more correctly modeling more sophisticated things like logic, objects, world-knowledge, etc.↩︎

  13. Now that GPT-3’s few-shot and have begun to make people like Gary Marcus feel slightly nervous about WinoGrande, they have begun for why Winograd schemas good measures of commonsense reasoning/intelligence (because intelligence, of course, is whatever AI can’t do yet).↩︎

  14. Feynman: “There are several references to previous flights; the acceptance and success of these flights are taken as evidence of safety. But erosion and blowby are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in the unexpected and not thoroughly understood way. The fact that this danger did not lead to catastrophe before is no guarantee that it will not the next time, unless it is completely understood.”↩︎

  15. Don’t worry: we already have short-shorts & ear- to hedge against fursona inflation. That said, we advise taking a large position in equineties image macro funds to benefit from a flight to quality and herding: it’ll be a bear market for kinky bonds—and that’s no bull.↩︎

  16. Some interesting references: