May 2020 news & 'On GPT-3'

May 2020 newsletter: GPT-3 scaling, implications, deep theory; anime GAN updates, and 1 book review.
newsletter, NN, insight-porn
2019-12-26–2020-10-29 · finished · certainty: possible · importance: 10

This is the May 2020 edition of the newsletter; it is a summary of the revision-history RSS feed, overlapping with my other feeds; brought to you by my donors on Patreon.


On GPT-3: Meta-Learning, Scaling, Implications, And Deep Theory

On “Language Models are Few-Shot Learners”, Brown et al 2020 (and my followup; compare the random samples and real-world demos)

GPT-3, announced by OpenAI in May 2020, is the largest neural network ever trained, by over an order of magnitude. Trained on Internet text data, it is the successor to GPT-2, which had surprised everyone with its natural-language understanding & generation ability. To the surprise of most (including myself), this vast increase in size did not run into diminishing or negative returns, as many expected, but the benefits of scale continued as forecasted by OpenAI. These benefits were not merely learning more facts & text than GPT-2, but qualitatively distinct & even more surprising in showing meta-learning: while GPT-2 learned how to do common natural-language tasks like text summarization, GPT-3 instead learned how to follow directions and learn new tasks from a few examples. (As a result, GPT-3 outputs & interaction are more fascinating & human-like than GPT-2’s.)

While the immediate applications of GPT-3, like my poetry or humor writings, are nice, the short-term implications of GPT-3 are much more important.

First, while GPT-3 is expensive by conventional DL standards, it is cheap by scientific/commercial/military/government budget standards, and the results indicate that models could be made much larger. Second, models can also be made much more powerful, as GPT is an old approach known to be flawed in both minor & major ways, and far from an ‘ideal’ Transformer. Third, GPT-3’s capabilities come from learning on raw (unsupervised) data; that has long been one of the weakest areas of DL, holding back progress in other areas like reinforcement learning or robotics. Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.

The meta-learning has a longer-term implication: it is a demonstration of the blessings of scale, where problems with simple neural networks vanish, and they become more powerful, more generalizable, more human-like when simply made very large & trained on very large datasets with very large compute—even though those properties are believed to require complicated architectures & fancy algorithms (and this perceived need drives much research). Unsupervised models benefit from this, as training on large corpora like Internet-scale text presents a myriad of difficult problems to solve; this is enough to drive meta-learning despite GPT not being designed for meta-learning in any way. (This family of phenomena is perhaps driven by neural networks functioning as ensembles of many sub-networks which all average out to an Occam’s razor: for small data & models, they learn superficial or memorized parts of the data, but can be forced into true learning by making the problems hard & rich enough; as meta-learners learn amortized Bayesian inference, they build in informative priors when trained over many tasks, and become dramatically more sample-efficient and better at generalization.)

The blessings of scale in turn support a radical theory: an old AI paradigm held by a few pioneers in connectionism (early artificial neural network research) and by more recent deep learning researchers, the scaling hypothesis. The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale. As increasing computational resources permit running such algorithms at the necessary scale, the neural networks will get ever more intelligent.

When? Estimates of Moore’s law-like progress curves decades ago by pioneers like Hans Moravec indicated that it would take until the 2010s for sufficiently-cheap compute for tiny insect-level prototype systems to be available, and the 2020s for the first sub-human systems to become feasible, and these forecasts are holding up. (Despite this vindication, the scaling hypothesis is so unpopular an idea, and difficult to prove in advance rather than as a fait accompli, that while the GPT-3 results finally drew some public notice after OpenAI enabled limited public access & people could experiment with it live, it is unlikely that many entities will modify their research philosophies, much less kick off an ‘arms race’.)

More concerningly, GPT-3’s scaling curves, unpredicted meta-learning, and success on various anti-AI challenges suggest that in terms of futurology, AI researchers’ forecasts are an emperor sans garments: they have no coherent model of how AI progress happens, why GPT-3 was possible, what specific achievements should cause alarm, or where intelligence comes from, and they do not learn from falsified predictions. Their primary concerns appear to be supporting the status quo, placating public concern, and remaining respectable. As such, their comments on AI risk are meaningless: they would make the same public statements whether the scaling hypothesis were true or not.

Depending on what investments are made into scaling DL, and how fast compute grows, the 2020s should be quite interesting—sigmoid or singularity?


Read The Samples

I strongly encourage anyone interested in GPT-3 to also at least skim OA’s random samples, or better yet, my own samples—reading the paper & looking at some standard benchmark graphs does not give a good feel for what working with GPT-3 is like, or the diversity of things it can do which are missed by benchmarks.

Learning to learn. In May 2020, OA released—to remarkably little interest from researchers, no blog post, no media blitz, and little public discussion beyond the snidely dismissive—the long-awaited followup to GPT-2, one model to rule them all: a 117× larger 175b-parameter model with far more powerful language generation, which lets it solve a wide variety of problems from arithmetic to English translation to unscrambling anagrams to SAT analogies—purely from being prompted with text examples, without any specialized training or finetuning whatsoever, merely next-word prediction training on a big Internet text corpus. This implies that GPT-3’s attention mechanisms have “learned to learn” from training on sufficiently varied data, forcing it to do more than just learn ordinary textual relationships. Like OpenAI’s Jukebox just weeks ago (itself a remarkable demonstration of scaling in synthesizing raw audio music complete with remarkably realistic voices/instruments), the announcement of GPT-3 appears to have sunk almost without a trace, so I will go into more depth than usual.
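To make the “purely from being prompted with text examples” point concrete, here is a minimal sketch of the few-shot prompt format from Brown et al 2020 (the word pairs are my own illustrative choices, not taken from the paper; no actual API call is made):

```python
# Illustrative sketch of few-shot prompting: the entire task specification
# is just text examples followed by an incomplete final example. The model
# is asked to continue the text; no gradient updates or finetuning occur.
prompt = (
    "English: cheese\nFrench: fromage\n"
    "English: house\nFrench: maison\n"
    "English: cat\nFrench:"
)
# A model that has "learned to learn" should infer the English->French
# pattern from the two solved examples and continue with " chat".
print(prompt)
```

The point of the format is that the ‘training’ for the new task happens entirely inside a single forward pass over the prompt.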

Flexing GPT

“Attacks only get better.” 2 years ago, GPT-1 was interestingly useful pretraining and adorable with its “sentiment neuron”. 1 year ago, GPT-2 was impressive with its excellent text generation & finetuning capabilities. This year, GPT-3 is scary because it’s a magnificently obsolete architecture from early 2018, which is small & shallow compared to what’s possible, with a simple uniform architecture trained in the dumbest way possible (unidirectional prediction of the next text token) on a single impoverished modality (random Internet HTML text dumps) on tiny data (fits on a laptop), sampled in a dumb way, its benchmark performance sabotaged by bad prompts & encoding (especially arithmetic & commonsense reasoning), and yet, the first version already manifests crazy runtime meta-learning—and the scaling curves still are not bending! The samples are also better than ever, whether it’s GPT-3 inventing new penis jokes or writing (mostly working) code for rotating arrays.

It’s odd that this qualitative leap appears to be largely missed by the standard NLP benchmarks. Nothing in the raw metrics reported on, say, Penn Tree Bank or LAMBADA or WinoGrande would lead you to expect all of this hilarious and creative output; the meta-learning results might, but only if you thought meta-learning was important. This suggests to me that a useful post-GPT-3 contribution would be figuring out how to benchmark these sorts of flexible text generation capabilities (possibly something along the lines of Chollet’s image-based ARC).

Baking The Cake

Is GPT actually part of AGI—or is the cake a lie? (LeCun 2019)

Not the whole picture, but a big part. Does it set SOTA on every task? No, of course not. But the question is not whether we can lawyerly find any way in which it might not work, but whether there is any way it might work. And there are many ways it might work better (see the “Limitations” section for just a few). Does GPT-3 do anything like steer a robot around SF shooting lasers and rockets at humans⸮ No, of course not. It is ‘just’ a text prediction model, an idiot savant of text; but an idiot savant, we should remember, is only a genetic mutation or bit of brain damage away from a normal human. If RL is the cherry on the top of the supervised learning frosting, and supervised learning is the frosting on top of the unsupervised learning cake, well, it looks like the cake layers are finally rising.

A better GPT-3 lesson.

Scaling still working. I was surprised, as I had expected closer to 100b parameters, and I thought that the performance of earlier large models suggested that, the scaling papers notwithstanding, the scaling curves had started to bend and by 100b, it might be hard to justify further scaling. However, GPT-3 hits twice that without noticeable change in scaling factors: its scaling continues to be roughly logarithmic/power-law, as it was for much smaller models & as forecast, and it has not hit a regime where gains effectively halt or start to require increases vastly beyond feasibility. That suggests that it would be both possible and useful to head to trillions of parameters (which are still well within available compute & budgets, requiring merely thousands of GPUs & perhaps $10–$100m budgets assuming no improvements, which of course there will be; see Hernandez & Brown 2020 etc in this issue), and eyeballing the graphs, many benchmarks would fall by 10t parameters.
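The “roughly logarithmic/power-law” scaling can be sketched numerically. Kaplan et al 2020 fit parameter-count scaling as a power law L(N) = (N_c/N)^α_N; the constants below (α_N ≈ 0.076, N_c ≈ 8.8×10¹³ non-embedding parameters) are their reported fit, but treat the outputs as rough extrapolations, not predictions:

```python
# Sketch of the Kaplan et al 2020 parameter scaling law: cross-entropy
# loss falls as a power law in non-embedding parameter count N.
def loss(n_params, alpha=0.076, n_c=8.8e13):
    """Fitted loss (nats/token) as a function of parameter count."""
    return (n_c / n_params) ** alpha

# GPT-2-sized, GPT-3-sized, and a hypothetical 10-trillion-parameter model:
for n in (1.5e9, 175e9, 10e12):
    print(f"{n:9.3g} params -> predicted loss ~{loss(n):.2f}")
```

Because the exponent is so small, each constant improvement in loss costs an order of magnitude more parameters, which is why the argument turns on whether those last fractions of a bit are the ones that matter.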

GPT-3: not even that much compute—3,640 petaflop/s-days, only 2× their estimate for AlphaGo Zero, 1,860. (Historical graph modified by myself from Amodei et al 2018.)

Anti-scaling: penny-wise, pound-foolish. GPT-3 is an extraordinarily expensive model by the standards of machine learning: it is estimated that training it may require the annual cost of more machine learning researchers than you can count on one hand (~$5m), up to $30 of hard drive space to store the model (500–800GB), and multiple pennies of electricity per 100 pages of output (0.4 kWh). Researchers are concerned about the prospects for scaling: can ML afford to run projects which cost more than 0.1 milli-Manhattan-Projects⸮ Surely it would be too expensive, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100× to a trivial thing like human-like performance in many domains⸮ Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels. Nonetheless, I think we can expect further scaling. (10×? No, 10× isn’t cool. You know what’s cool? 100–1000×.)


How far will scaling go? The scaling papers suggest that the leaps we have seen over the past few years are not even halfway there in terms of absolute likelihood loss, never mind what real-world capabilities each additional decrement translates into. The scaling curves are clean; from Kaplan et al 2020:

DL scaling laws: compute, data, model parameters.

GPT-3 represents ~10³ petaflop/s-days on this chart, leaving plenty of room for further loss decreases—especially given the uncertainty in extrapolation:

Projecting DL power laws: still room beyond GPT-3.

Lo and behold, the scaling laws continue for GPT-3 models several orders of magnitude past Kaplan et al 2020; from Brown et al 2020:

GPT-3 continues to scale as predicted. (Note GPT-3’s curve has not ‘bounced’, and it trained for only ~0.5 epochs; see Table 2.2.)

If we see such striking gains from halving the validation loss but with so far left to go, what is left to emerge as we halve it again? How far does this go, exactly? How do we predict what emerges when? Bueller? Bueller? (See also Meena’s perplexity vs human-ness chatbot ratings, GPT-3-written news articles’ probability of fooling humans by parameter count, and GPT-3 model size vs Q&A accuracy.)

Blessings Of Scale

“Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.”

Geoff Hin­ton

We don’t know how to train NNs. The blessings of scale is the observation that for deep learning, everything gets better as it gets larger (in contrast to the usual outcome in research, where small things are hard and large things impossible). The bigger the neural net/compute/data/problem, the faster it learns, the better it learns, the stabler it learns, and so on. A problem we can’t solve at all at small n may suddenly become straightforward with millions or billions of n. “NNs are lazy”: they can do far more than we make them do when we push them beyond easy answers & cheap shortcuts. The rule of thumb: the harder and bigger, the better. (Besides GPT-3, one could mention recent progress in semi-supervised learning & the model-based DRL renaissance.)

AlphaGo Zero: ‘just stack moar layers lol!’

Blessings of scale: stability→generalization→meta-learning. GPT-3 is hamstrung by its training & data, but DL enjoys an unreasonably effective blessing of scale—just simply training a big model on a lot of data induces better properties like meta-learning without even the slightest bit of that architecture being built in; and in general, training on more and harder tasks creates ever more human-like performance, generalization, and robustness. The GPT natural-language & programming-language models, and their image counterparts, show that simply scaling up models & datasets without any supervision produces results competitive with the best (and most complex) alternatives, using the same simple architecture, gradually passing from superficial surface correlations to more human-like brain activity and linguistic biases as data increases. OA5 does not just scale to, but stabilizes at, minibatches of millions. OA5-like, BigGAN stabilizes at large-scale image datasets like JFT-300M & benefits from unusually large minibatches, while classifier CNNs (Dojolonga et al 2020) transfer & robustify with human-like errors, and multimodal learning produces better representations on fewer data. AlphaStar reaches human-level with hundreds of competing self-players to cover possible strategies. Imitation-learning DRL generalizes at hundreds of tasks to train a deep net. Disentanglement emerges in StyleGAN with sufficiently deep w embeddings, in the aforementioned Jukebox with enough parameters to train on raw audio, or with enough samples to force factorization.
Training on millions of domain randomizations induced similar implicit meta-learning in OpenAI’s robot-hand work: during each runtime invocation, the RNN probes its environment and encodes its understanding of robot hand control into its hidden state, and it outperforms classical robot planners by scaling 2 orders of magnitude. Or in procedurally-generated environments, training on hundreds of levels trains agents individually, but at thousands of levels, they begin to generalize to unseen levels. Bigger models trained on a richer signal & pro-level play demonstrated truly superhuman Go without ‘delusions’ and without any search—and, for that matter, just training an RNN end-to-end to predict a reward on enough data turns out to be enough to obsolete even AlphaZero and learn tree search implicitly (but better). And on and on. DM researcher Matthew Botvinick, discussing their meta-reinforcement-learning work where they were surprised to discover meta-learning emerging, and that it did so regardless of which specific architecture they used:

“…it’s something that just happens. In a sense, you can’t avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can’t stop it.”

But why? Why do they transfer and generalize? Why do these blessings of scale exist? Why do we need to train large models when small models provably exist with the same performance? Why do larger models not overfit (though they could) and generalize better than smaller models?

These are all, ahem, deep questions about neural networks, and heavily debated, but right now, I would suggest that the answer lies in some mix of the model compression/distillation, overparameterization, and lottery-ticket literatures.

Big models work because they encode a dizzyingly vast number of sub-models in an extremely abstract space, representing countless small sub-models, one of which is likely to solve the problem well, and so ensures the problem is soluble by the overall model. They function as an ensemble: even though there are countless overfit sub-models inside the single big model, they all average out, leading to a preference for simple solutions. This Occam’s razor biases the model towards simple solutions which are flexible enough to gradually expand in complexity to match the data.

However, “neural nets are lazy”: sub-models which memorize pieces of the data, or latch onto superficial features, learn quickest and are the easiest to represent internally. If the model & data & compute are not big or varied enough, the optimization, by the end of the cursory training, will have only led to a sub-model which achieves a low loss but missed important pieces of the desired solution.

On the other hand, GPT-3 is a sufficiently powerful model that its sub-models can do anything from poetry to arithmetic, and it is trained on so much data that those superficial models may do well early on, but gradually fall behind more abstract models; a sub-model which memorizes some of the data is indeed much simpler than a sub-model which encodes genuine arithmetic (a NN can probably memorize tens of thousands of lookup table entries storing examples of addition in the space it would take to encode an abstract algorithm like ‘addition’), but it can’t possibly memorize all the instances of arithmetic (implicit or explicit) in GPT-3’s Internet-scale dataset. If a memorizing sub-model tried to do so, it would become extremely large and be penalized. Eventually, after enough examples and enough updates, the simplest ‘arithmetic’ model which accurately predicts the data just is arithmetic. And the meta-learning, after seeing enough instances of algorithms which vary slightly within each sample, making it hard to learn each task separately, just is learning of more generic algorithms, yielding sub-models which achieve lower loss than the rival sub-models, which either fail to predict well or bloat unacceptably. (GPT-2-1.5b apparently was too small or shallow to ensemble easily over sub-models encoding meta-learning algorithms, or perhaps was not trained long enough on enough data to locate the meta-learner models; GPT-3 was.)
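The memorization-vs-algorithm tradeoff can be made vivid with a back-of-envelope calculation (the numbers are mine, purely illustrative): a lookup table for d-digit addition needs one entry per ordered pair of operands, so its size explodes exponentially in d, while the addition algorithm itself stays constant-size.

```python
# Why a memorizing sub-model eventually loses to an 'arithmetic' sub-model:
# the storage a lookup table needs grows as the square of the operand range,
# i.e. exponentially in the number of digits, while an algorithm is O(1).
def table_entries(digits):
    """Entries a lookup table needs to cover all d-digit addition pairs."""
    return (10 ** digits) ** 2  # every ordered pair of operands

print(table_entries(2))   # 2-digit addition: 10,000 entries, easily memorized
print(table_entries(10))  # 10-digit addition: 10^20 entries, hopeless
```

So a sub-model that memorizes small additions is cheap, but covering the long tail of arithmetic scattered across an Internet-scale corpus forces it to bloat without bound, at which point the constant-size abstract algorithm wins on the implicit simplicity prior.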

So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.

Scaling Hypothesis

The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly (as Hawkins argues), we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data. More powerful NNs are ‘just’ scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains. While I was highly skeptical of scaling hypothesis advocates when I first became interested in AI 2004–2010 (back when AI was stuck in the doldrums of hopelessly narrow tools and dates like 2028 seemed impossibly far away), which smacked of numerology and “if you build it they will come” logic (at the time, we certainly didn’t have general algorithms that you could just throw compute at), in 2020, I have to admit, I was wrong and they were right. We built the compute, and the algorithms did come, and the scaling hypothesis has only looked more and more plausible every year since 2010.

Why Does Pretraining Work?

The pretraining thesis goes something like this:

“Figure 1: Envisioned evolution of NLP research through three different eras or curves” (the hypothetical S-curves & progress in natural language modeling)

Humans, one might say, are the cyanobacteria of AI: we constantly emit large amounts of structured data, which implicitly rely on logic, causality, object permanence, history—all of that good stuff. All of that is implicit and encoded into our writings and videos and ‘data exhaust’. A model learning to predict must learn to understand all of that to get the best performance; as it predicts the easy things which are mere statistical pattern-matching, what’s left are the hard things.

Early on in training, a model learns the crudest levels: that some letters are more frequent than others, that every 5 characters or so there is a space, and so on. It goes from predicting uniformly-distributed bytes to what looks like Base-60 encoding—alphanumeric gibberish. As crude as this may be, it’s enough to make quite a bit of absolute progress: a random predictor needs 8 bits to ‘predict’ a byte/character, but just by at least matching letter and space frequencies, it can almost halve its error to around 5 bits. Because it is learning so much from every character, and because the learned frequencies are simple, it can happen so fast that if one is not logging samples frequently, one might not even observe the improvement.

As training progresses, the task becomes more difficult. Now it begins to learn which words actually exist and which do not. It doesn’t know anything about meaning, but at least now when it’s asked to predict the second half of a word, it can actually do that to some degree, saving it a few more bits. This takes a while because any specific instance will show up only occasionally: a word may not appear in a dozen samples, and there are many thousands of words to learn. With some more work, it has learned that punctuation, pluralization, and possessives are all things that exist. Put that together, and it may have progressed again, all the way down to 3–4 bits error per character! (While the progress is gratifyingly fast, it’s still all gibberish, make no mistake: a sample may be spelled correctly, but it doesn’t make even a bit of sense.)

But once a model has learned a good English vocabulary and correct formatting/spelling, what’s next? There’s not much juice left in predicting within-words. The next thing is picking up associations among words. What words tend to come first? What words ‘cluster’ and are often used near each other? Nautical terms tend to get used a lot with each other in sea stories, and likewise Bible passages, or American history Wikipedia articles, and so on. If the word “Jefferson” is the last word, then “Washington” may not be far away, and it should hedge its bets on predicting that ‘W’ is the next character, and then if it shows up, go all-in on “ashington”. Such bag-of-words approaches still predict badly, but now we’re down to perhaps <3 bits per character.
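Even the crudest form of context helps here. A minimal sketch (mine, not from any paper): a character-bigram model that conditions each character on the one before it, scored in-sample on a toy repetitive text, already pays far fewer bits than the unigram-frequency model above would:

```python
# A character-bigram model: P(next char | current char), fit by counting.
# Scored on its own training text, so this is an optimistic in-sample
# number -- the point is only that context shrinks the bits/char bill.
from collections import Counter, defaultdict
from math import log2

def bigram_bits_per_char(text):
    """Average -log2 probability per character under a fitted bigram model."""
    pair_counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        pair_counts[a][b] += 1
    total_bits = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(pair_counts[a].values())
        total_bits += -log2(pair_counts[a][b] / total)
    return total_bits / (len(text) - 1)

sample = "jefferson was president after washington " * 20
print(bigram_bits_per_char(sample))
```

Each step up the ladder in the text (unigrams, bigrams, bag-of-words, syntax, semantics) is the same move: spend model capacity on longer-range structure, collect the remaining bits.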

What next? Does it stop there? Not if there is enough data and the earlier stuff like learning English vocab doesn’t hem the model in by using up its learning ability. Gradually, other words like “President” or “general” or “after” begin to show the model subtle correlations: “Jefferson was President after…” With many such passages, the word “after” begins to serve a use in predicting the next word, and then the use can be broadened.

By this point, the loss is perhaps 2 bits: every additional 0.1 bit decrease comes at a steeper cost and takes more time. However, now the sentences have started to make sense. A sentence like “Jefferson was President after Washington” does in fact mean something (and if occasionally we sample “Washington was President after Jefferson”, well, what do you expect from such an un-converged model). Jarring errors will immediately jostle us out of any illusion about the model’s understanding, and so training continues. (Around here, Markov chain & n-gram models start to fall behind; they can memorize increasingly large chunks of the training corpus, but they can’t solve increasingly critical syntactic tasks like balancing parentheses or quotes, much less start to ascend from syntax to semantics.)

Now training is hard. Even subtler aspects of language must be modeled, such as keeping pronouns consistent. This is hard in part because the model’s errors are becoming rare, and because the relevant pieces of text are increasingly distant and ‘long-range’. As it makes progress, the absolute size of errors shrinks dramatically. Consider the case of associating names with gender pronouns: the difference between “Janelle ate some ice cream, because he likes sweet things like ice cream” and “Janelle ate some ice cream, because she likes sweet things like ice cream” is one no human could fail to notice, and yet, it is a difference of a single letter. If we compared two models, one of which didn’t understand gender pronouns at all and guessed ‘he’/‘she’ purely at random, and one which understood them perfectly and always guessed ‘she’, the second model would attain a lower average error of barely <0.02 bits per character!
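The pronoun arithmetic can be worked out explicitly: a coin-flip between two equally-likely pronouns costs exactly 1 bit, and amortizing that single bit over the whole sentence gives the tiny per-character figure quoted:

```python
# Worked version of the <0.02 bits/char claim: the random-guessing model
# pays 1 extra bit (a fair coin-flip between 'he' and 'she') once per
# sentence, which we then average over every character of the sentence.
sentence = ("Janelle ate some ice cream, "
            "because she likes sweet things like ice cream")
extra_bits = 1.0                     # log2(2) = 1 bit for the 50/50 guess
per_char = extra_bits / len(sentence)
print(len(sentence), per_char)       # comes out under the 0.02 quoted
```

This is why the late-training losses look so flat on a graph: entire cognitive capabilities are hiding inside hundredths of a bit.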

Nevertheless, as training continues, these problems and more, like imitating genres, get solved, and eventually at a loss of 1–2 bits (where a small char-RNN might converge on a small corpus like Shakespeare or some Project Gutenberg ebooks), we will finally get samples that sound human—at least, for a few sentences. These final samples may convince us briefly, but, aside from issues like repetition loops, even with good samples, the errors accumulate: a sample will state that someone is “alive” and then 10 sentences later, use the word “dead”, or it will digress into an irrelevant argument instead of the expected next argument, or someone will do something physically improbable, or it may just continue for a while without seeming to get anywhere.

All of these errors cost far less than 0.02 bits per character; we are now talking not hundredths of bits per character but less than ten-thousandths.

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character. What is in that missing >0.4?

“Yeah, but there’s more to being smart than knowing compression schemes!” “No there’s not!” “Shoot—he knows the secret!!”

Well—everything! Everything that the model misses. While just babbling random words was good enough at the beginning, at the end, it needs to be able to reason its way through the most difficult textual scenarios requiring causality or commonsense reasoning. Every error where the model predicts that ice cream put in a freezer will “melt” rather than “freeze”, every case where the model can’t keep straight whether a person is alive or dead, every time that the model chooses a word that doesn’t help build somehow towards the ultimate conclusion of an ‘essay’, every time that it lacks the theory of mind to compress novel scenes describing the Machiavellian scheming of a dozen individuals at dinner jockeying for power as they talk, every use of logic or abstraction or instructions or Q&A where the model is befuddled and needs more bits to cover up for its mistake where a human would think, understand, and predict. Each of these cognitive breakthroughs allows ever so slightly better prediction of a few relevant texts; nothing less than true understanding will suffice for ideal prediction.

If we trained a model which reached that loss of <0.7, which could predict text indistinguishable from a human, whether in a dialogue or quizzed about ice cream or being tested on SAT analogies or tutored in mathematics, if for every string the model did just as good a job of predicting the next character as you could do, how could we say that it doesn’t truly understand everything? (If nothing else, we could, by definition, replace humans in any kind of text-writing job!)

The last bits are deepest. The implication here is that the final few bits are the most valuable bits, which require the most of what we think of as intelligence. A helpful analogy here might be our actions: for the most part, all humans execute actions equally well. We all pick up a tea mug without dropping it, and can lift our legs to walk down thousands of steps without falling even once. For everyday actions, anybody, of any intelligence, can get enough practice & feedback to do them quite well. Where individuals differ is when they start running into novel choices, rare choices, choices that take seconds but unfold over a lifetime, choices where we will never get any feedback (like after our death). One only has to make a single bad decision, out of a lifetime of millions of discrete decisions, to wind up in jail or dead. A small absolute average improvement in decision quality, if it is in those decisions, may be far more important than its quantity indicates. (Why do humans have such large brains, when animals like chimpanzees do so many ordinary activities seemingly as well with a fraction of the expense? Why is language worthwhile? Perhaps because of considerations like these. We may be at our most human while filling out the paperwork for life insurance.)

Rea­sons for doubt. The pre­train­ing the­sis, while log­i­cally impec­ca­ble—how is a model sup­posed to solve all pos­si­ble trick ques­tions with­out under­stand­ing, just guess­ing?—n­ever struck me as con­vinc­ing, an argu­ment admit­ting nei­ther confu­ta­tion nor con­vic­tion. It feels too much like a magic trick: “here’s some infor­ma­tion the­o­ry, here’s a human bench­mark, here’s how we can encode all tasks as a sequence pre­dic­tion prob­lem, hey presto, intel­li­gence!” There are lots of algo­rithms which are Tur­ing-­com­plete or ‘uni­ver­sal’ in some sense; there are lots of algo­rithms like AIXI which solve AI in some the­o­ret­i­cal sense (Schmid­hu­ber et al have many of these cute algo­rithms such as ‘the fastest pos­si­ble algo­rithm for all prob­lems’, with the minor catch of some con­stant fac­tors which require com­put­ers big­ger than the uni­verse).

Why think pre­train­ing or sequence mod­el­ing is not another one of them? Sure, if the model got a low enough loss, it’d have to be intel­li­gent, but how could you prove that would hap­pen in prac­tice? (Train­ing char-RNNs was fun, but they had­n’t exactly rev­o­lu­tion­ized deep learn­ing.) It might require more text than exists, count­less petabytes of data for all of those sub­tle fac­tors like log­i­cal rea­son­ing to rep­re­sent enough train­ing sig­nal, amidst all the noise and dis­trac­tors, to train a mod­el. Or maybe your mod­els are too small to do more than absorb the sim­ple sur­face-level sig­nals, and you would have to scale them 100 orders of mag­ni­tude for it to work, because the scal­ing curves did­n’t coop­er­ate. Or maybe your mod­els are fun­da­men­tally bro­ken, and stuff like abstrac­tion require an entirely dif­fer­ent archi­tec­ture to work at all, and what­ever you do, your cur­rent mod­els will sat­u­rate at poor per­for­mance. Or it’ll train, but it’ll spend all its time try­ing to improve the sur­face-level mod­el­ing, absorb­ing more and more lit­eral data and facts with­out ever ascend­ing to the higher planes of cog­ni­tion as planned. Or…

‘The pos­si­bil­i­ties of devel­op­ing an atomic weapon and the desir­abil­ity of doing it secretly were dis­cussed at a Prince­ton Uni­ver­sity con­fer­ence in which I par­tic­i­pated in March 1939… said this rare vari­ety could not be sep­a­rated from com­mon ura­nium except by turn­ing the coun­try into a gigan­tic fac­to­ry. Bohr was wor­ried that this could be done and that an atomic bomb could be devel­ope­d—but he hoped that nei­ther could be accom­plished. Years lat­er, when Bohr came to Los Alam­os, I was pre­pared to say, “You see . . .” But before I could open my mouth, he said: “You see, I told you it could­n’t be done with­out turn­ing the whole coun­try into a fac­to­ry. You have done just that.”


But apparently, it would’ve worked fine. Even RNNs probably would’ve worked—Transformers are nice, but they seem mostly to be about efficiency.16 (Training large RNNs is much more expensive, and doing BPTT over multiple nodes is much harder engineering-wise.) It just required more compute & data than anyone was willing to risk on it until a few true believers were able to get their hands on a few million dollars of compute.

  1. Q: Did any­one pre­dict, quan­ti­ta­tive­ly, that this would hap­pen where it did?

    A: Not that I know of.

  2. Q: What would future scaled-up mod­els learn?

    GPT-2-1.5b had a cross-entropy WebText validation loss of ~3.3 (based on the perplexity of ~10 in Figure 4, and log2(10) = 3.32). GPT-3 halved that loss to ~1.73 judging from Brown et al 2020 and using the scaling formula (). For a hypothetical GPT-4, if the scaling curve continues for another 2–3 orders of magnitude of compute (100–1000×) before crossing over and hitting harder diminishing returns, the cross-entropy loss will drop to ~1.24 ().

    If GPT-3 gained so much meta-learn­ing and world knowl­edge by drop­ping its absolute loss ~50% when start­ing from GPT-2’s lev­el, what capa­bil­i­ties would another ~30% improve­ment over GPT-3 gain? (Cut­ting the loss that much would still not reach human-level, as far as I can tell.17) What would a drop to ≤1, per­haps using wider con­text win­dows or recur­ren­cy, gain?

    A: I don’t know.

  3. Q: Does any­one?

    A: Not that I know of.
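The loss arithmetic in Q2 can be checked against the power-law fit OpenAI reports for these models, L(C) = 2.57 · C^(−0.048), with compute C in petaflop/s-days. A minimal sketch, assuming that functional form and the paper's ~3,640 pf-days estimate for GPT-3 (the 1000× figure is the hypothetical extrapolation above):

```python
# Sketch: the cross-entropy scaling law reported for GPT-3,
# L(C) = 2.57 * C**-0.048, with C in petaflop/s-days.
def predicted_loss(compute_pf_days: float) -> float:
    return 2.57 * compute_pf_days ** -0.048

GPT3_COMPUTE = 3640  # approximate training compute of GPT-3 175b, in pf-days

print(round(predicted_loss(GPT3_COMPUTE), 2))         # ~1.73: GPT-3's loss
print(round(predicted_loss(GPT3_COMPUTE * 1000), 2))  # ~1.24: the hypothetical 'GPT-4'
```

These are losses on the paper's own validation distribution, so they are comparable to each other but not directly to loss figures measured on other corpora or units.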


In the problem of decoding, the most important information which we can possess is the knowledge that the message which we are reading is not gibberish… In a similar way, when we consider a problem of nature such as that of atomic reactions and atomic explosives, the largest single item of information which we can make public is that they exist. Once a scientist attacks a problem which he knows to have an answer, his entire attitude is changed. He is already some 50% of his way toward that answer… the one secret concerning the atomic bomb which might have been kept and which was given to the public and to all potential enemies without the least inhibition, was that of the possibility of its construction. Take a problem of this importance and assure the scientific world that it has an answer; then both the intellectual ability of the scientists and the existing laboratory facilities are so widely distributed that the quasi-independent realization of the task will be a matter of merely a few years anywhere in the world.

, pg124–125, (em­pha­sis added)

“People who work in machine learning simply didn’t think that neural networks could do much. People didn’t believe large neural networks could be trained… The ideas were all there, the thing that was missing was a lot of supervised data and a lot of compute. Once you have [those two], then there is a third thing that is needed—and that is conviction. Conviction that if you take the right stuff, which already exists, and apply and mix it with a lot of data and a lot of compute, that it will in fact work. And so that was the missing piece.”

Ilya Sutskever

What can we expect from future DL work? Will GPT-3 kickstart an arms race where soon we will be discussing, blasé, what would now seem like ludicrously farfetched schemes, like a bidirectional multimodal Transformer 100× the size trained on 100× the data (video/text/PDFs-as-images/photo/robotics) with supplementary supervised learning, as the backbone of a MuZero-like learning+planning DRL agent running on thousands of tasks (such as coding) simultaneously?

The exis­tence of implies that the lim­it­ing fac­tor here is less hard­ware than humans: will any orga­ni­za­tion treat GPT-3 as a Sput­nik moment and invest aggres­sively in scal­ing pro­grams? Is there a GPT-4-equivalent brew­ing away inside Deep­Mind or Google Brain’s TPU pods now? They aren’t stu­pid, they have the hard­ware, they have the bud­gets, they have the peo­ple.

But I think they lack a vision. As far as I can tell: they do not have any such thing, because Google Brain & Deep­Mind do not believe in the scal­ing hypoth­e­sis the way that Sutskev­er, Amodei and oth­ers at OA do. Just read through machine learn­ing Twit­ter to see the dis­dain for the scal­ing hypoth­e­sis. (A quar­ter year on from GPT-3 and count­ing, can you name a sin­gle dense model as large as the 17b Turing-NLG—never mind larger than GPT-3?)

Google Brain is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc V. Le’s group occasionally surprises you. They’ll dabble in something like GShard, but mostly because they expect to be able to deploy it, or something like it, to production in Google Translate.

DeepMind18 holds what we might call the “weak scaling hypothesis”: they believe that AGI will require us to “find the right algorithms” effectively replicating a mammalian brain module by module, and that while these modules will be extremely large & expensive by contemporary standards (which is why compute is important, to give us “a more powerful tool with which to hunt for the right algorithms”), they still need to be invented & finetuned piece by piece, with little risk or surprise until the final assembly. Each piece, however, itself can scale: there’s no magical intelligence gland or quantum woo which creates a bright line between humans and, say, chimpanzees or rodents. (As much as we humans extravagantly admire our own capabilities like language or logic, those are relatively minor flourishes on the basic brain—each organism solves the same basic problems, like exploration, long-term memory, learning world-models, associating rewards with specific actions, meta-learning, etc.) As such, once you have a rat-level AGI, a human-level AGI is just more so. (And rats are a lot easier to experiment on.) That is how you get DM contraptions like which throw the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. (See also Sam Altman’s podcast interview comments that OA’s advantage over unnamed rivals with more compute is that the lack of compute makes them stay “small and focused”—“for sure”, like a startup approach.) When someone seems to have come up with a scalable architecture for cracking a hard problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then is the game plan.
They have been biting off and chewing pieces of the brain for a decade, and it’ll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM ‘bet the company’ on any moonshot; Google’s cashflow isn’t going anywhere (nor is DM’s budget), and slow and steady wins the race.

Going beyond that, most other research labs like Tesla or FAIR are irrel­e­vant and unin­ter­est­ed. Chi­nese AI com­pa­nies are a ques­tion mark: past the lan­guage bar­ri­er, I seem to dis­cern inter­est in AGI & lit­tle of the reflex­ive West­ern oppo­si­tion, and com­pa­nies like Baidu occa­sion­ally release impor­tant research (such as the early scal­ing paper Hes­t­ness et al 2017), but over­all, Chi­nese AI may be over­es­ti­mat­ed, and they seem to suf­fer from a kind of Dutch dis­ease—­fund­ing for sur­veil­lance tech­nol­o­gy, and for nar­row e-com­merce nich­es, is so plen­ti­ful that other areas are neglect­ed.

OA, lack­ing any­thing like DM’s long-term fund­ing from Google or its enor­mous head­count, is mak­ing a star­tup-­like bet that they know an impor­tant truth which is a secret: “the scal­ing hypoth­e­sis is true!” So, sim­ple DRL algo­rithms like PPO on top of large sim­ple archi­tec­tures like RNNs or Trans­form­ers can emerge, exploit­ing the bless­ings of scale, and meta-learn their way to pow­er­ful capa­bil­i­ties, enabling fur­ther fund­ing for still more com­pute & scal­ing, in a vir­tu­ous cycle. This is why OA had to revise its cor­po­rate form: lack­ing any enor­mous endow­ment or extremely deep­-pock­eted patron like Google, where does it get the money to scale (or hire machine learn­ing engineer/researchers who can com­mand salaries in the mil­lion­s)? OA has to earn the nec­es­sary mon­ey, so in a move like Mozilla Foun­da­tion own­ing Mozilla Cor­po­ra­tion (to sell Fire­fox search engine place­men­t), or the Her­shey orphan­age own­ing Her­shey Choco­late or the Girl Scouts licens­ing their cook­ies, Ope­nAI switched from a pure non­profit funded by dona­tions to a non­profit which owns a for-profit subsidiary/startup, “Ope­nAI LP”, which can take invest­ments and engage in for-profit activ­i­ties. OA LP, while con­trolled by OA, can then shoot for the moon. And if OA is wrong to trust in the , well, they never could com­pete with DM directly using DM’s favored approach, and were always going to be an also-ran foot­note, so they have no regret.

While all of this hypo­thet­i­cally can be repli­cated rel­a­tively eas­ily (never under­es­ti­mate the amount of tweak­ing and spe­cial sauce it takes) by com­peti­tors if they wished (the nec­es­sary amounts of com­pute bud­gets are still triv­ial in terms of Big Sci­ence or other invest­ments like AlphaGo or AlphaS­tar or Way­mo, after all), said com­peti­tors lack the very most impor­tant thing, which no amount of money or GPUs can ever cure: the courage of their con­vic­tions. They are too hide­bound and deeply philo­soph­i­cally wrong to ever admit fault and try to over­take OA until it’s too late. How can we talk seri­ously about any kind of mil­i­tary Man­hat­tan Project when the US mil­i­tary does­n’t even let its devel­op­ers use Ten­sor­flow or PyTorch, or about gov­ern­ment projects in the shadow of coro­n­avirus? This might seem absurd (surely the Bit­ter Lesson/scaling hypoth­e­sis have now earned enough prior prob­a­bil­ity to be taken seri­ously and receive major research invest­ments to test how far they can go, espe­cially given how impor­tant the impli­ca­tions are), but look at the repeated crit­i­cism of OA every time they release a new exam­ple of the scal­ing hypoth­e­sis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3… To para­phrase St Augustine, most peo­ples’ reac­tion to the Bit­ter Les­son or scal­ing hypoth­e­sis is “grant me scale & com­pute—but not yet”.19

A crit­i­cal indi­ca­tor will be whether orga­ni­za­tions beyond ‘the usual sus­pects’ (Mi­crosoft team has reached , but there is also Nvidia, Sales­force, Allen, Google DM/GB, Connor/EleutherAI, Face­book FAIR) start par­tic­i­pat­ing or if they con­tinue to dis­miss scal­ing. At least as of 2020-10-26, 152 days lat­er, no model has come near GPT-3, and indeed, no model has even exceeded Turing-NLG’s 17b.20

Critiquing The Critics

Keep­ing track. GPT-3 in 2020 makes as good a point as any to take a look back on the past decade. In 2010, one could eas­ily fit every­one in the world who gen­uinely believed in deep learn­ing into a mod­er­ate-­sized con­fer­ence room (as­sisted slightly by the fact that 3 of them were busy found­ing ). Some­one inter­ested in machine learn­ing in 2010 might have read about some inter­est­ing stuff from weirdo diehard con­nec­tion­ists in rec­og­niz­ing hand-writ­ten dig­its using all of 1–2 mil­lion para­me­ters, or some mod­est neural tweaks to stan­dard voice-recog­ni­tion hid­den Markov mod­els. In 2010, who would have pre­dicted that over the next 10 years, deep learn­ing would undergo a Cam­brian explo­sion caus­ing a mass extinc­tion of alter­na­tive approaches through­out machine learn­ing, that mod­els would scale up to 175,000 mil­lion para­me­ters, and that these enor­mous mod­els would just spon­ta­neously develop all these capa­bil­i­ties?

No one. That is, no one aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as , Schmidhuber, Sutskever, Legg, & Amodei. One of the more shocking things about looking back is realizing how unsurprising and easily predicted all of this was if you listened to the right people. In 1998, 22 years ago, Moravec noted that AI research could be deceptive, and hardware limits meant that “intelligent machine research did not make steady progress in its first 50 years, it marked time for 30 of them!”, predicting that as Moore’s law continued, “things will go much faster in the next 50 years than they have in the last 50.” Moravec further observed that part of the reason for rapid progress was the hardware overhang: while supercomputers of the necessary power would exist long before the connectionist revolution began, no one would be allowed to use them, as they would be devoted to ‘more important’ (prestigious) hard STEM work, like “physics simulations” (ie climate simulations & nuclear bombs)21, and “AI research must wait for the power to become more affordable.” Affordable meaning a workstation roughly ~$1,854; sufficiently cheap compute to rival a human would arrive sometime in the 2020s, with the 2010s seeing affordable systems in the lizard–mouse range. As it happens, the start of the DL revolution is typically dated to in 2012, by a grad student using 2 GTX 580 3GB GPUs (launch list price of… $657, for a system build cost of perhaps $1,901). 2020 saw GPT-3 arrive, and as discussed before, there are many reasons to expect the cost to fall, in addition to the large hardware compute gains that are being forecast for the 2020s.

The accel­er­at­ing pace of the last 10 years should wake any­one from their dog­matic slum­ber and make them sit upright. And there are 28 years left in Moravec’s fore­cast…

The temptation, that many do not resist so much as revel in, is to give in to a déformation professionnelle and dismiss any model as “just” this or that (“just billions of IF statements” or “just a bunch of multiplications” or “just millions of memorized web pages”), missing the forest for the trees, as Moravec commented of chess engines:

The event was notable for many rea­sons, but one espe­cially is of inter­est here. Sev­eral times dur­ing both match­es, Kas­parov reported signs of mind in the machine. At times in the sec­ond tour­na­ment, he wor­ried there might be humans behind the sce­nes, feed­ing Deep Blue strate­gic insight­s!…In all other chess com­put­ers, he reports a mechan­i­cal pre­dictabil­ity stem­ming from their undis­crim­i­nat­ing but lim­ited looka­head, and absence of long-term strat­e­gy. In Deep Blue, to his con­ster­na­tion, he saw instead an “alien intel­li­gence.”

…Deep Blue’s cre­ators know its quan­ti­ta­tive supe­ri­or­ity over other chess machines inti­mate­ly, but lack the chess under­stand­ing to share Kas­parov’s deep appre­ci­a­tion of the dif­fer­ence in the qual­ity of its play. I think this dichotomy will show up increas­ingly in com­ing years. Engi­neers who know the mech­a­nism of advanced robots most inti­mately will be the last to admit they have real minds. From the inside, robots will indis­putably be machi­nes, act­ing accord­ing to mechan­i­cal prin­ci­ples, how­ever elab­o­rately lay­ered. Only on the out­side, where they can be appre­ci­ated as a whole, will the impres­sion of intel­li­gence emerge. A human brain, too, does not exhibit the intel­li­gence under a neu­ro­bi­ol­o­gist’s micro­scope that it does par­tic­i­pat­ing in a lively con­ver­sa­tion.

But of course, if we ever suc­ceed in AI, or in reduc­tion­ism in gen­er­al, it must be by reduc­ing Y to ‘just X’. Show­ing that some task requir­ing intel­li­gence can be solved by a well-de­fined algo­rithm with no ‘intel­li­gence’ is pre­cisely what suc­cess must look like! (Other­wise, the ques­tion has been thor­oughly begged & the prob­lem has only been pushed else­where; com­puter chips are made of tran­sis­tors, not espe­cially lit­tle homun­culi.)

“As long as the AI [OA5] can explore, it will learn, given enough time…We just kept wait­ing for the magic to run out. We kept wait­ing to hit a wall, and we never seemed to hit a wall.”

Greg Brock­man

“Give it the com­pute, give it the data, and it will do amaz­ing things. This stuff is like—it’s like alchemy!”

Ilya Sutskever, sum­mer 2019

Hindsight is 20⁄20. Even in 2015, the scaling hypothesis seemed highly dubious: you needed something to scale, after all, and it was all too easy to look at flaws in existing systems and imagine that they would never go away and progress would sigmoid any month now, soon. Like the genomics revolution, where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially & deliver powerful PGSes soon, while sober experts wrung their hands over “missing heritability” & the miraculous complexity of biology & scoffed about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly. Yet, here we are: all honor to the fanatics, shame and humiliation to the critics!22 If only one could go back 10 years, or even 5, to watch every AI researcher’s head explode reading this paper… Unfortunately, few heads appear to be exploding now, because human capacity for hindsight & excuses is boundless (“I can get that much with finetuning, anyway I predicted it all along, how boring”)—and, unfortunately, so it may be for AGI. (If you are still certain that there is near-zero probability of AGI in the next few decades, why? Did you predict—in writing—capabilities like GPT-3? Is this how you expect AI failure to look in the decades beforehand? What specific task, what specific number, would convince you otherwise? How would the world look different than it does now if these crude prototype insect-brain-sized DL systems were not on a path to success?)

Author­ity with­out account­abil­i­ty. What should we think about the experts? Pro­jec­tions of fail­ure were made by emi­nent, respectable, seri­ous peo­ple. They spoke in con­sid­ered tones of why AI hype was exces­sive and might trig­ger an “AI win­ter”, and the fun­da­men­tal flaws of fash­ion­able approaches and why brute force could not work. These state­ments were made rou­tinely in 2014, 2015, 2016… And they were wrong. I am aware of few issu­ing a mea culpa or reflect­ing on it.23 It is a puz­zling fail­ure, and I’ve reflected on it before.

Phat­ic, not pre­dic­tive. There is, how­ev­er, a cer­tain tone of voice the bien pen­sant all speak in, whose sound is the same whether right or wrong; a tone shared with many state­ments in Jan­u­ary to March of this year; a tone we can also find in a 1940 Sci­en­tific Amer­i­can arti­cle author­i­ta­tively titled, , which advised the reader to not be con­cerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which cer­tain sci­en­tists had stopped talk­ing, rais­ing pub­lic con­cerns; not only could it hap­pen, the British bomb project had already begun, and 5 years later it did hap­pen.)

The iron law of bureau­cra­cy: Cathe­dral goth­ic. This tone of voice is the voice of .
The voice of author­ity insists on calm, and peo­ple not “pan­ick­ing” (the chief of sin­s).
The voice of author­ity assures you that it won’t hap­pen (be­cause it can’t hap­pen).
The voice utters sim­ple argu­ments about why the sta­tus quo will pre­vail, and con­sid­ers only how the wild new idea could fail (and not all the pos­si­ble option­s).
The voice is not, and does not deal in, uncer­tain­ty; things will either hap­pen or they will not, and since it will not hap­pen, there is no need to take any pre­cau­tions (and you should not worry because it can’t hap­pen).
The voice does not believe in draw­ing lines on graphs (it is rank numerol­o­gy).
The voice does not issue any numer­i­cal pre­dic­tions (which could be fal­si­fied).
The voice will not share its source code (for com­pli­cated rea­sons which can­not be explained to the laity).
The voice is opposed to uneth­i­cal things like ran­dom­ized exper­i­ments on vol­un­teers (but will over­look the insult).
The voice does not have a model of the future (be­cause a model implies it does not already know the future).
The voice is con­cerned about its pub­lic image (and unkind gos­sip about it by other speak­ers of the voice).
The voice is always sober, respectable, and cre­den­tialed (the voice would be pleased to write an op-ed for your national mag­a­zine and/or news­pa­per).
The voice speaks, and is not spo­ken to (you can­not ask the voice what objec­tive fact would change its mind).
The voice never changes its mind (un­til it does).
The voice is never sur­prised by events in the world (only dis­ap­point­ed).
The voice advises you to go back to sleep (right now).

When some­one speaks about future pos­si­bil­i­ties, what is the tone of their voice?








  1. Given the num­ber of com­ments on the paper’s arith­metic bench­mark, I should point out that the arith­metic bench­mark appears to greatly under­state GPT-3’s abil­i­ties due to the : even using com­mas markedly improves its 5-digit addi­tion abil­i­ty, for exam­ple. The BPE issue also appears to explain much of the poor per­for­mance on the anagram/shuffling tasks. This is some­thing to keep in mind for any task which requires char­ac­ter-level manip­u­la­tion or under­stand­ing.↩︎

  2. On implicit meta-learn­ing, see: / ()/, , /, /.↩︎

  3. GPT-3 hardly costs more than a few million dollars of compute (as of early 2020), as the extensive scaling research beforehand enabled one training run. (Likewise, T5 was trained only once.) It is cheap to run (pg39: “Even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs.”), while IBM’s (otherwise useless) Deep Blue AI project reputedly cost >$10m for the final iteration (reports of $192m appear to be a confusion with the estimated value of publicity mentioned on pg187 of Hsu’s Behind Deep Blue) and Big Science projects like blow >5000× the funding to mostly fail. (The particle physicists, incidentally, are ≫$24b, based on, presumably, the scientific revolutions & world-changing breakthroughs that the LHC’s >$12b investment produced…)

    GPT-3 could have been done decades ago with global com­put­ing resources & sci­en­tific bud­gets; what could be done with today’s hard­ware & bud­gets that we just don’t know or care to do? There is a hard­ware over­hang. (See also the Whole Brain Emu­la­tion Roadmap & .)↩︎
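    As a quick sanity check on the “few cents” figure quoted above, a minimal sketch; the electricity price is an assumption for illustration, not a number from the paper:

    ```python
    # 0.4 kWh per 100 generated pages (quoted above), at an assumed ~$0.07/kWh rate
    kwh_per_100_pages = 0.4
    usd_per_kwh = 0.07  # assumption for illustration, not from the paper
    print(f"${kwh_per_100_pages * usd_per_kwh:.3f} per 100 pages")  # a few cents
    ```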

  4. Fur­ther, NNs have addi­tional hard­ware over­hangs of their own due to the many orders of mag­ni­tude asym­me­try of train­ing vs run­ning. Trans­fer learn­ing and meta-learn­ing are so much faster than the base­line model train­ing. You can ‘train’ GPT-3 with­out even any gra­di­ent step­s—just exam­ples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it every­where at tiny mar­ginal cost. If you train a mod­el, then as soon as it’s done you get, among other things:

    • the abil­ity to run thou­sands of copies in par­al­lel on the same hard­ware

      • in a con­text like AlphaGo, I esti­mate sev­eral hun­dred ELO strength gains if you reuse the same hard­ware to merely run tree search with exact copies of the orig­i­nal model
    • meta-learning/transfer-learning to any related domain, cut­ting train­ing require­ments by orders of mag­ni­tude

    • model compression/distillation to train stu­dent mod­els which are a frac­tion of the size, FLOPS, or latency (ra­tios vary­ing widely based on task, approach, domain, accept­able per­for­mance degra­da­tion, tar­geted hard­ware etc, but often extreme like 1⁄100th)

    • reuse of the model else­where to instantly power up other mod­els (eg use of text or image embed­dings for a DRL agent)

    • learning-by-doing/ (high­est in infor­ma­tion tech­nolo­gies, and high for DL: Her­nan­dez & Brown 2020), so the next from-scratch model may be much cheap­er.

      For exam­ple: after all the iter­a­tive model archi­tec­ture & game upgrades done while train­ing the first OA5 agent was com­plet­ed, the sec­ond iter­a­tion of OA5, “Rerun”, was trained from scratch. Rerun required only 20% of the train­ing for a “98% win-rate against the final ver­sion of Ope­nAI Five.” As the authors note: “The ideal option would be to run Rerun-­like train­ing from the very start, but this is impos­si­ble—the Ope­nAI Five curve rep­re­sents lessons learned that led to the final code­base, envi­ron­ment, etc., with­out which it would not be pos­si­ble to train Rerun.”

    • base­line for engi­neer­ing much more effi­cient ones by ablat­ing and com­par­ing with the orig­i­nal

  5. Eg a narrow context window severely limits it, and motivates the need for . More broadly, GPT-3 does nothing exotic—no use of or neural architecture search to try to tailor the model, or even decide basic hyperparameters like widths (which, as shows, can make quite a difference even in “well-understood and hand-optimized vanilla architectures”).↩︎

  6. Not even PDFs—so no Google Books, no Arx­iv, no Lib­gen, no Sci-Hub…↩︎

  7. Gen­er­at­ing text from a LM can reveal the pres­ence of knowl­edge, but not its absence, and it is uni­ver­sally agreed that the cur­rent crude heuris­tic meth­ods like top-k can­not pos­si­bly be opti­mal.↩︎

  8. ‘A man is at the doc­tor’s office, and the doc­tor tells him, “I’ve got some good news and some bad news for you.” / The man says, “Well, I can’t take the bad news right now, so give me the good news first.” / The doc­tor says, “Well, the good news is that you have an 18-inch penis.” / The man looks stunned for a moment, and then asks, “What’s the bad news?” / The doc­tor says, “Your brain’s in your dick.”’↩︎

  9. Specif­i­cal­ly: , , , , , , /, , , , , . (An exam­ple of how not to do scal­ing papers is , which, in stark con­trast to the fore­go­ing paper­s—which Thomp­son et al do not men­tion at all!—at­tempts to infer scal­ing not from well-­con­trolled exper­i­ments run by the authors, which yield extremely tight and highly pre­dic­tive curves, but attempts to infer them from occa­sional reported num­bers in highly dis­parate research papers; unsur­pris­ing­ly, their curves barely pre­dict any­thing and seem to be seri­ous over­es­ti­mates any­way.)

    It is note­wor­thy that the pur­suit of large mod­els is dri­ven almost exclu­sively by Ope­nAI & indus­try enti­ties (the lat­ter of which are con­tent with far smaller mod­el­s), and that acad­e­mia has evinced an almost total dis­in­ter­est—dis­gust & anger, even, and denial (one might say “green AI” is green with envy). For all that the scal­ing hypoth­e­sis is ‘obvi­ous’ and scal­ing is ‘pre­dicted’, there is remark­ably lit­tle inter­est in actu­ally doing it. Per­haps we should pay more atten­tion to what peo­ple do rather than what they say.↩︎

  10. Roughly around Chuan Li’s esti­mate, using nom­i­nal list prices with­out dis­counts (which could be steep as the mar­ginal costs of cloud com­pute are sub­stan­tially low­er). The R&D project cost would be much high­er, but is amor­tized over all sub­se­quent mod­els & pro­jects.↩︎

  11. The Man­hat­tan Project cost ~$24b.↩︎

  12. As if we live in a world where grad stu­dents could go to the Moon on a ramen bud­get if we just wished hard enough, or as if “green AI” approaches to try to cre­ate small mod­els with­out going through big mod­els did not look increas­ingly futile and like throw­ing good money after bad, and were not the least green of all AI research…↩︎

  13. One inter­est­ing aspect of image scal­ing exper­i­ments like is that even when per­for­mance is ‘plateau­ing’ on the orig­i­nal task & approach­ing label error, the trans­fer learn­ing con­tin­ues to improve. Appar­ently the inter­nal rep­re­sen­ta­tions, even when ade­quate for mere clas­si­fi­ca­tion and so the score can­not increase more than a small per­cent­age, become more human-­like—be­cause it’s encod­ing or more ? I’ve noticed with lan­guage mod­els, the final frac­tions of a loss appear to make a sub­stan­tial dif­fer­ence to gen­er­ated sam­ple qual­i­ty, per­haps because it is only after all the eas­ier mod­el­ing is fin­ished that the lazy lan­guage model is forced to squeeze out the next bit of per­for­mance by more cor­rectly mod­el­ing more sophis­ti­cated things like log­ic, objects, world-­knowl­edge, etc.↩︎

  14. The numbers here are not exact and are for illustration; because BPEs don’t correspond to anything intuitive, I am going to borrow from my observations watching char-RNNs, and talk about the loss per character instead of per BPE.↩︎

  15. pg210–211, “The Quiet Enemy”, The Legacy of Hiroshima, Teller 1962.↩︎

  16. Another way of interpreting the various papers about how Transformers are actually like RNNs or are is to take that as indicating that what is important about them is not any inherent new capability compared to older architectures, but some lower-level aspect like being more efficiently trainable on contemporary hardware.↩︎

  17. How do these absolute prediction performances compare to humans? It’s hard to say. The only available benchmarks for perplexity for humans/GPT-2/GPT-3 appear to be WebText, Penn Tree Bank (PTB; based on the Wall Street Journal corpus), the One Billion Word benchmark (1BW), and LAMBADA. But coverage is spotty.

    I found no human benchmarks for WebText or Penn Tree Bank, so I can’t compare the human vs GPT-2/GPT-3 perplexities (GPT-2 PTB: 35.7; GPT-3 PTB: 20.5).

    GPT-2 was benchmarked at 43 perplexity on the 1 Billion Word (1BW) benchmark vs a (highly extrapolated) (which interestingly extrapolates, using 2012 LSTM RNNs, that there are “10 to 20 more years of research before human performance is reached”), but that may be an unfair benchmark (“Our model is still significantly worse than prior work on the One Billion Word Benchmark (Chelba et al., 2013). This is likely due to a combination of it being both the largest dataset and having some of the most destructive pre-processing—1BW’s sentence-level shuffling removes all long-range structure.”), and 1BW was dropped from the GPT-3 evaluation due to data contamination (“We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set.”).

    LAMBADA was benchmarked at a GPT-2 perplexity of 8.6, and a GPT-3 perplexity of 3.0 (zero-shot) / 1.92 (few-shot). OpenAI states in their GPT-2 blog post (but not the paper) that human perplexity is 1–2, but provides no sources and I couldn’t find any. (The authors might be guessing based on how LAMBADA was constructed: examples were filtered by whether two independent human raters provided the same right answer, which lower-bounds how good humans must be at predicting the answer.)

    So overall, it looks like the best guess is that GPT-3 continues to have somewhere around twice the absolute error of a human. This implies it will take a large (yet far from impossible) amount of compute to fully close the remaining gap under the current scaling laws. If we irresponsibly extrapolate out the WebText scaling curve further, and assume GPT-3 has twice the error of a human at its current WebText perplexity of 1.73 (so humans are ~0.86), then we need 2.57 · (3.64×10^3 · x)^−0.048 = 0.86, where x = 2.2e6 or 2,200,000× the compute of GPT-3. (This would roughly equal the cost to the USA of invading Iraq.)
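    The extrapolation above can be checked in a few lines of Python (a back-of-the-envelope sketch, assuming the WebText scaling law L(C) = 2.57·C^−0.048 with C in petaflop/s-days, GPT-3’s training compute at ~3.64e3 petaflop/s-days, and a hypothesized human-level loss of ~0.86):

```python
# Sketch: extrapolating the WebText scaling law L(C) = 2.57 * C**-0.048
# (C in petaflop/s-days), assuming GPT-3 used ~3.64e3 petaflop/s-days
# and that human-level loss is ~0.86 (half of GPT-3's 1.73).
gpt3_compute = 3.64e3   # petaflop/s-days (assumed)
target_loss = 0.86      # hypothesized human-level WebText loss

# Solve 2.57 * C**-0.048 = target_loss for C:
required_compute = (target_loss / 2.57) ** (1 / -0.048)
multiplier = required_compute / gpt3_compute
print(f"{multiplier:.1e}")  # -> 2.2e+06, i.e. ~2.2 million x GPT-3's compute
```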

    When is that feasible?

    If we imagine that compute continues to double every 3.4 months, then 2.2e6 would be ~22 doublings away—or 6.3 years, in 2027. Most people believe that compute trend must break down soon, and that sort of prediction is a good reason why!
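    Concretely (a sketch, taking the 3.4-month compute-doubling trend at face value; rounding the doublings up gives the ~22 doublings/6.3 years quoted above):

```python
import math

# Sketch: how many doublings, and how long at one doubling per ~3.4 months,
# does a 2.2e6x compute increase take?
multiplier = 2.2e6
doublings = math.log2(multiplier)   # ~21.1 (round up to ~22)
years = doublings * 3.4 / 12        # ~6 years, landing around 2026-2027
print(f"{doublings:.1f} doublings, {years:.1f} years")
```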

    Going the other direction, Hernandez & Brown 2020’s estimate is that, net of hardware & algorithmic progress, the cost of a fixed level of performance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it’ll cost ~$2.5m around mid-2021, and so on. Similarly, a GPT-human requiring 2.2e6× more compute would presumably cost on the order of $10 trillion in 2020, but after 14 halvings (18 years) would cost ~$1b in 2038.↩︎
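    That cost decay compounds straightforwardly (a sketch using the 16-month halving time and the assumed ~$10t 2020 price tag):

```python
# Sketch: cost of a fixed capability halving every ~16 months
# (Hernandez & Brown 2020), starting from an assumed ~$10t in 2020.
start_cost = 10e12              # USD, assumed
halvings = 14
cost = start_cost / 2**halvings  # ~$0.6b, i.e. on the order of $1b
years = halvings * 16 / 12       # ~18.7 years -> roughly 2038
print(f"${cost/1e9:.1f}b after {years:.1f} years")
```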

  18. Particularly ; I’m not sure about , although given the accuracy of his while founding DeepMind, he probably hasn’t much changed his views that AI will be empowered by the (realized) exponential compute gains or his . (This is consistent with the latest Metaculus forecasts.)↩︎

  19. When faced with the choice between having to admit all their fancy hard work is a dead end, swallow the bitter lesson, and start budgeting tens of millions for compute, or instead writing a disdainful tweet explaining how, “actually, GPT-3 shows that scaling is a dead end, it’s an environmental catastrophe, and it’s just imitation intelligence anyway”—most people will get busy on the tweet!↩︎

  20. A mixture-of-experts model like GShard or an embedding like DynamicEmbedding is not comparable to ‘dense’ models like GPT-3, as it’s always been cheap & easy to train models with billions of ‘parameters’ in some sense, like extremely large embeddings; however, these parameters do little, and are more like a few hundred shallow models glued back-to-back. They probably do not learn the same interesting things that a dense model would with the same nominal parameter count.↩︎

  21. Strikingly, as of 2020, this is still true: eg the only deep learning research I have seen done on were & . (In double-checking Arxiv, I did find one non-STEM paper using Summit resources: , focusing on systems engineering in training a video classification model.)↩︎

  22. Now that GPT-3’s few-shot results have begun to make people like Gary Marcus feel slightly nervous about WinoGrande, they have begun explaining why Winograd schemas are not good measures of commonsense reasoning/intelligence (because intelligence, of course, is whatever AI can’t do yet).↩︎

  23. Feynman: “There are several references to previous flights; the acceptance and success of these flights are taken as evidence of safety. But erosion and blowby are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in the unexpected and not thoroughly understood way. The fact that this danger did not lead to catastrophe before is no guarantee that it will not the next time, unless it is completely understood.”↩︎

  24. Don’t worry: we already have short-shorts & ear- to hedge against fursona inflation. That said, we advise taking a large position in equineties image macro funds to benefit from a flight to quality and herding: it’ll be a bear market for kinky bonds—and that’s no bull.↩︎

  25. Some interesting references on viral evolution: