May 2020 news & ‘On GPT-3’

May 2020 gwern.net newsletter: GPT-3 scaling, implications, deep theory; anime GAN updates, and 1 book review.
newsletter, NN, insight-porn
2019-12-26–2021-01-23 · finished · certainty: possible · importance: 10


May 2020’s is now out; previous, April 2020 (). This is a summary of the revision-history RSS feed, overlapping with my & ; brought to you by my donors on Patreon.

Writings

On GPT-3: Meta-Learning, Scaling, Implications, And Deep Theory

On ( & my followup , compare ; random samples; with real-world demos)

GPT-3, announced by OpenAI in May 2020, is the largest neural network ever trained, by over an order of magnitude. Trained on Internet text data, it is the successor to GPT-2, which had surprised everyone by its natural language understanding & generation ability. To the surprise of most (including myself), this vast increase in size did not run into diminishing or negative returns, as many expected, but the benefits of scale continued to happen as forecasted by OpenAI. These benefits were not merely learning more facts & text than GPT-2, but qualitatively distinct & even more surprising in showing meta-learning: while GPT-2 learned how to do common natural language tasks like text summarization, GPT-3 instead learned how to follow directions and learn new tasks from a few examples. (As a result, GPT-3 outputs & interaction are more fascinating & human-like than GPT-2.)
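
To make “learn new tasks from a few examples” concrete, here is roughly what a few-shot prompt looks like in practice: the task is specified entirely inside the prompt, with no gradient updates or finetuning. The client call below is my recollection of the 2020-era OpenAI API (`openai.Completion.create` with an `engine` name); treat the exact names and parameters as assumptions rather than documentation.

```python
import openai  # assumes the 2020-era OpenAI Python client and an API key are configured

# The 'training set' is just a handful of examples pasted into the prompt:
prompt = """English: Where is the library?
French: Où est la bibliothèque ?
English: I like green apples.
French: J'aime les pommes vertes.
English: The weather is nice today.
French:"""

completion = openai.Completion.create(
    engine="davinci",      # assumed engine name for the 175b model
    prompt=prompt,
    max_tokens=20,
    temperature=0.0,
    stop="\n",
)
print(completion.choices[0].text.strip())  # the model continues the pattern: a French translation
```

The same model, with a different handful of examples, will instead do arithmetic, unscramble anagrams, or answer SAT analogies; the ‘learning’ happens at runtime, inside the forward pass.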

While the immediate applications of GPT-3, like my poetry or humor writings, are nice, the short-term implications of GPT-3 are much more important.

First, while GPT-3 is expensive by conventional DL standards, it is cheap by scientific/commercial/military/government budget standards, and the results indicate that models could be made much larger. Second, models can also be made much more powerful, as GPT is an old approach known to be flawed in both minor & major ways, and far from an ‘ideal’ Transformer. Third, GPT-3’s capabilities come from learning on raw (unsupervised) data; that has long been one of the weakest areas of DL, holding back progress in other areas like reinforcement learning or robotics. Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.

The meta-learning has a longer-term implication: it is a demonstration of the blessings of scale, where problems with simple neural networks vanish, and they become more powerful, more generalizable, more human-like when simply made very large & trained on very large datasets with very large compute—even though those properties are believed to require complicated architectures & fancy algorithms (and this perceived need drives much research). Unsupervised models benefit from this, as training on large corpuses like Internet-scale text presents a myriad of difficult problems to solve; this is enough to drive meta-learning despite GPT not being designed for meta-learning in any way. (This family of phenomena is perhaps driven by neural networks functioning as ensembles of many sub-networks which all average out to an Occam’s razor, which for small data & models learn superficial or memorized parts of the data, but can be forced into true learning by making the problems hard & rich enough; as meta-learners learn amortized Bayesian inference, they build in informative priors when trained over many tasks, and become dramatically more sample-efficient and better at generalization.)

The blessings of scale in turn support a radical theory: an old AI paradigm held by a few pioneers in connectionism (early artificial neural network research) and by more recent deep learning researchers, the scaling hypothesis. The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale. As increasing computational resources permit running such algorithms at the necessary scale, the neural networks will get ever more intelligent.

When? Estimates of Moore’s law-like progress curves decades ago by pioneers like Hans Moravec indicated that it would take until the 2010s for the sufficiently-cheap compute for tiny insect-level prototype systems to be available, and the 2020s for the first sub-human systems to become feasible, and these forecasts are holding up. (Despite this vindication, the scaling hypothesis is so unpopular an idea, and difficult to prove in advance rather than as a fait accompli, that while the GPT-3 results finally drew some public notice after OpenAI enabled limited public access & people could experiment with it live, it is unlikely that many entities will modify their research philosophies, much less kick off an ‘arms race’.)

More concerningly, GPT-3’s scaling curves, unpredicted meta-learning, and success on various anti-AI challenges suggest that in terms of futurology, AI researchers’ forecasts are an emperor sans garments: they have no coherent model of how AI progress happens or why GPT-3 was possible, no idea what specific achievements should cause alarm or where intelligence comes from, and do not learn from any falsified predictions. Their primary concerns appear to be supporting the status quo, placating public concern, and remaining respectable. As such, their comments on AI risk are meaningless: they would make the same public statements if the scaling hypothesis were true or not.

Depending on what investments are made into scaling DL, and how fast compute grows, the 2020s should be quite interesting—sigmoid or singularity?

Meta-Learning

Read The Samples

I strongly encourage anyone interested in GPT-3 to also at least skim OA’s random samples, or better yet, my samples in —reading the paper & looking at some standard benchmark graphs does not give a good feel for what working with GPT-3 is like or the diversity of things it can do which are missed by benchmarks.

Learning to learn. In May 2020, OA released—to remarkably little interest from researchers, no blog post, no media blitz, and little public discussion beyond the snidely dismissive—the long-awaited followup to , one model to rule them all: a 117× larger 175b-parameter model with far more powerful language generation, which lets it solve a wide variety of problems from arithmetic1 to English translation to unscrambling anagrams to SAT analogies—purely from being prompted with text examples, without any specialized training or finetuning whatsoever, merely next-word prediction training on a big Internet text corpus. This implies GPT-3’s attention mechanisms serve as that have “learned to learn” by training on sufficiently varied data2, forcing it to do more than just learn ordinary textual relationships. Like OpenAI’s just weeks ago (itself a remarkable demonstration of scaling in synthesizing raw audio music complete with remarkably realistic voices/instruments), the announcement of GPT-3 appears to have sunk almost without a trace, so I will go into more depth than usual.

Flexing GPT

‘“They are absolutely reasonable. I think that is their distinguishing characteristic. Yes, Mr. Erskine, an absolutely reasonable people. I assure you there is no nonsense about the Americans.” “How dreadful!” cried Lord Henry. “I can stand brute force, but brute reason is quite unbearable. There is something unfair about its use. It is hitting below the intellect.”’

The Picture of Dorian Gray, Oscar Wilde

“Attacks only get better.” 2 years ago, was interestingly useful pretraining and adorable with its “sentiment neuron”. 1 year ago, GPT-2 was impressive with its excellent text generation & finetuning capabilities. This year, GPT-3 is scary because it’s a magnificently obsolete architecture from early 2018 (used mostly for software engineering convenience as the infrastructure has been debugged), which is small & shallow compared to what’s possible34, with a simple uniform architecture5 trained in the dumbest way possible (unidirectional prediction of next text token) on a single impoverished modality (random Internet HTML text dumps6) on tiny data (fits on a laptop), sampled in a dumb way7, its benchmark performance sabotaged by bad prompts & (especially arithmetic & commonsense reasoning), and yet, the first version already manifests crazy runtime meta-learning—and the scaling curves still are not bending! The samples are also better than ever, whether it’s GPT-3 inventing new penis jokes8 or writing (mostly working) about rotating arrays.
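
For concreteness, the array-rotation task is the sort of small programming exercise shown below; this is my own illustration of what such a solution looks like, not GPT-3’s actual output.

```python
def rotate_right(xs, k):
    """Rotate a list right by k positions, e.g. rotate_right([1, 2, 3, 4, 5], 2) -> [4, 5, 1, 2, 3]."""
    if not xs:
        return xs
    k %= len(xs)
    return xs[-k:] + xs[:-k]

print(rotate_right([1, 2, 3, 4, 5], 2))  # [4, 5, 1, 2, 3]
```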

It’s odd that this qualitative leap appears to be largely missed by the standard NLP benchmarks. Nothing in the raw metrics reported on, say, Penn Tree Bank or LAMBADA or WinoGrande would lead you to expect all of this hilarious and creative output; the meta-learning results might, but only if you thought meta-learning was important. This suggests to me that a useful post-GPT-3 contribution would be figuring out how to benchmark these sorts of flexible text generation capabilities (possibly something along the lines of Chollet’s image-based ).

Baking The Cake

Is GPT actually part of AGI—or is the cake a lie? (LeCun 2019)

Not the whole picture, but a big part. Does it set SOTA on every task? No, of course not. But the question is not whether we can lawyerly find any way in which it might not work, but . And there are many ways it might work better (see the “Limitations” section for just a few). Does GPT-3 do anything like steer a robot around SF shooting lasers and rockets at humans⸮ No, of course not. It is ‘just’ a text prediction model, an idiot savant of text; but an idiot savant, we should remember, is only a genetic mutation or bit of brain damage away from a normal human. If RL is the cherry on the top of the supervised learning frosting, and supervised learning is the frosting on top of the unsupervised learning cake, well, it looks like the cake layers are finally rising.

A better GPT-3 lesson.

Scaling still working. I was surprised, as I had expected closer to 100b parameters, and I thought that the performance of ///// suggested that, the scaling papers9 notwithstanding, the scaling curves had started to bend and by 100b, it might be hard to justify further scaling. However, in the latest version of , GPT-3 hits twice that without noticeable change in scaling factors: its scaling continues to be roughly logarithmic/power-law, as it was for much smaller models & as forecast, and it has not hit a regime where gains effectively halt or start to require increases vastly beyond feasibility. That suggests that it would be both possible and useful to head to trillions of parameters (which are still well within available compute & budgets, requiring merely thousands of GPUs & perhaps $10–$100m budgets, assuming no improvements, which of course there will be; see Hernandez & Brown 2020 etc in this issue), and eyeballing the graphs, many benchmarks like the would fall by 10t parameters. The predictability of scaling is striking, and makes scaling models more like statistics than AI. (AI is statistics which does what we want it to but doesn’t work; and statistics is AI which works but doesn’t do what we want.)

GPT-3: not even that much compute—3640 petaflop/s-days, only 2× their estimate for AlphaGo Zero, 1860. (Historical graph modified by myself from .)

Anti-scaling: penny-wise, pound-foolish. GPT-3 is an extraordinarily expensive model by the standards of machine learning: it is estimated that training it may require the annual cost of more machine learning researchers than you can count on one hand (~$5m10), up to $30 of hard drive space to store the model (500–800GB), and multiple pennies of electricity per 100 pages of output (0.4 kWh). Researchers are concerned about the prospects for scaling: can ML afford to run projects which cost more than 0.1 milli-Manhattan-Projects⸮11 Surely it would be too expensive, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100× to a trivial thing like human-like performance in many domains⸮ Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels.12 Nonetheless, I think we can expect further scaling. (10×? No, 10× isn’t cool. You know what’s cool? 100–1000×, trained on a .)
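
Where the ~$5m figure comes from can be reproduced with a quick back-of-envelope calculation; the sustained-throughput and rental-price numbers below are my own assumptions, not figures from the paper.

```python
# Back-of-envelope training cost, assuming ~3640 petaflop/s-days of total compute (Brown et al 2020),
# V100-class GPUs sustaining ~28 teraflop/s in practice, and ~$1.50/hour rental pricing (both assumed).
pfs_days = 3640
sustained_tflops = 28                           # effective throughput per GPU (assumption)
gpu_days = pfs_days * 1000 / sustained_tflops   # 1 petaflop/s-day = 1000 teraflop/s-days
gpu_years = gpu_days / 365
cost_usd = gpu_days * 24 * 1.50
print(f"{gpu_days:,.0f} GPU-days ≈ {gpu_years:.0f} GPU-years ≈ ${cost_usd / 1e6:.1f}m")
# → ~130,000 GPU-days ≈ ~356 GPU-years ≈ ~$4.7m, consistent with the ~$5m estimate above.
```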

Scaling

How far will scaling go? The scaling papers suggest that the leaps we have seen over the past few years are not even halfway there in terms of absolute likelihood loss, never mind what real-world capabilities each additional decrement translates into. The scaling curves are clean; from Kaplan et al 2020:

DL scaling laws: compute, data, model parameters. (Figure 1)

GPT-3 represents ~10³ on this chart, leaving plenty of room for further loss decreases—especially given the uncertainty in extrapolation:

Projecting DL power laws: still room beyond GPT-3.

Lo and behold, the scaling laws continue for GPT-3 models for several orders of magnitude past Kaplan et al 2020; from Brown et al 2020:

GPT-3 continues to scale as predicted. (Note GPT-3’s curve has not ‘bounced’, and it trained for only ~0.5 epochs, see Table 2.2)

If we see such striking gains in halving the validation loss but with so far left to go, what is left to emerge as we third or halve again? How far does this go, exactly? How do we predict what emerges when? Bueller? Bueller? (See also Meena’s perplexity vs human-ness chatbot ratings, GPT-3-written news articles’ probability of fooling humans by parameter count, and GPT-3 model size vs Q&A from .)

Blessings Of Scale

“Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.”

Geoff Hinton

We don’t know how to train NNs. The blessings of scale is the observation that for deep learning, hard problems are easier to solve than easy problems—everything gets better as it gets larger (in contrast to the usual outcome in research, where small things are hard and large things impossible). The bigger the neural net/compute/data/problem, the faster it learns, the better it learns, the stabler it learns, and so on. A problem we can’t solve at all at small n may suddenly become straightforward with millions or billions of n. “NNs are lazy”: they can do far more than we make them do when we push them beyond easy answers & cheap shortcuts. The is the harder and bigger, the better. (Besides GPT-3, one could mention recent progress in semi-supervised learning & the model-based DRL renaissance.)

AlphaGo Zero: ‘just stack moar layers lol!’

Blessings of scale: stability → generalization → meta-learning. GPT-3 is hamstrung by its training & data, but DL enjoys an unreasonably effective —just simply training a big model on a lot of data induces better properties like meta-learning without even the slightest bit of that architecture being built in; and in general, training on more and harder tasks creates ever more human-like performance, generalization, and robustness. The GPT natural-language & programming language models, / for images (and to some degree ), show that simply scaling up models & datasets without any supervision produces results competitive with the best (and most complex) alternatives, using the same simple architecture, gradually passing from superficial surface correlations to more human-like brain activity () and linguistic biases as data increases (eg ). OA5 does not just scale to, but stabilizes at, minibatches of millions due to . OA5-like, stabilizes at large-scale image datasets like JFT-300M & benefits from unusually large minibatches, and VAEs (long an also-ran to GANs or autoregressive models in terms of sharp image generation) catch up if you make them very deep (, ); while classifier CNNs like 13/ or or transfer & with human-like errors14, multimodal learning produces better representations on fewer data (eg /, motivating ), and RNNs can . reaches human-level with hundreds of competing self-players to cover possible strategies. Imitation learning DRL like generalizes at hundreds of tasks to train a deep net. Disentanglement emerges in with sufficiently deep w embeddings, with enough parameters to train raw audio in the aforementioned Jukebox, or in // with enough samples to force factorization. (See also ////.) Training on millions of domain randomizations induced similar implicit meta-learning where during each runtime invocation, the RNN probes its environment and encodes its understanding of robot hand control into its hidden state; and outperforms classical robot planners by scaling 2 orders of magnitude. Or in or , training on hundreds of levels trains agents to solve levels individually, but at thousands of levels, they begin to generalize to unseen levels. demonstrated truly superhuman Go without ‘delusions’ just by training a bigger model on a richer signal & pro-level play without any search—and , for that matter, demonstrated that just training an RNN end-to-end to predict a reward on enough data is enough to obsolete even AlphaZero and learn tree search implicitly (but better). And on and on. DM researcher , discussing their meta-reinforcement learning work where they were surprised to discover meta-learning emerging, and that it did so regardless of which specific architecture they used:

“…it’s something that just happens. In a sense, you can’t avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can’t stop it.”

Pace , why? Why do they transfer and generalize? Why do these blessings of scale exist? Why do we need to train large models when small models provably exist with the same performance? Why do larger models not overfit (though they ) and generalize better than smaller models? What’s up with the whole anyway?

These are all, ahem, deep questions about neural networks and heavily debated, but right now, I would suggest that the answer lies in some mix of the model compression/distillation, , , and (like ) literatures.

Big models work because they encode a dizzyingly vast number of sub-models in an extremely abstract space, representing countless small sub-models (), one of which is likely to solve the problem well, and so ensures the problem is soluble by the overall model. They function as an ensemble: even though there are countless overfit sub-models inside the single big model, they all average out, leading to a preference for simple solutions. This Occam’s razor biases the model towards simple solutions which are flexible enough to gradually expand in complexity to match the data.

However, “neural nets are lazy”: sub-models which memorize pieces of the data, or latch onto superficial features, learn quickest and are the easiest to represent internally. If the model & data & compute are not big or varied enough, the optimization, by the end of the cursory training, will have only led to a sub-model which achieves a low loss but missed important pieces of the desired solution.

On the other hand, GPT-3 is a sufficiently powerful model that its sub-models can do anything from poetry to arithmetic, and it is trained on so much data that those superficial models may do well early on, but gradually fall behind more abstract models; a sub-model which memorizes some of the data is indeed much simpler than a sub-model which encodes genuine arithmetic (a NN can probably memorize tens of thousands of lookup table entries storing examples of addition in the space it would take to encode an abstract algorithm like ‘addition’), but it can’t possibly memorize all the instances of arithmetic (implicit or explicit) in GPT-3’s Internet-scale dataset. If a memorizing sub-model tried to do so, it would become extremely large and penalized. Eventually, after enough examples and enough updates, the simplest ‘arithmetic’ model which accurately predicts the data just is arithmetic. And then the meta-learning, after seeing enough instances of algorithms which vary slightly within each sample, making it hard to learn each task separately, just is learning of more generic algorithms, yielding sub-models which achieve lower loss than the rival sub-models, which either fail to predict well or bloat unacceptably. (GPT-2-1.5b apparently was too small or shallow to ensemble easily over sub-models encoding meta-learning algorithms, or perhaps not trained long enough on enough data to locate the meta-learner models; GPT-3 was.)
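
A toy version of the memorization-vs-algorithm trade-off: the storage cost of a lookup table grows linearly with the number of facts, while the cost of a rule is roughly constant. The specific numbers below are illustrative assumptions, not measurements of GPT-3.

```python
import math

def bits_to_memorize(n_facts, digits=5):
    """Rough cost of storing n 'a + b = c' facts verbatim: ~3*digits decimal digits per fact."""
    return n_facts * 3 * digits * math.log2(10)

bits_for_rule = 10_000  # generous assumed budget for a small learned carry-addition circuit

for n in (10**3, 10**5, 10**7):
    print(f"{n:>10,} facts: {bits_to_memorize(n):>14,.0f} bits to memorize vs ~{bits_for_rule:,} for a rule")
```

Once the training data contains more distinct arithmetic instances than the model can afford to memorize, the ‘rule’ sub-model wins on loss per parameter, which is the paragraph’s point.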

So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.

Scaling Hypothesis

The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly (eg or Hawkins), we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data. More powerful NNs are ‘just’ scaled-up weak NNs, in much the same way that human brains look much like . While I was highly skeptical of scaling hypothesis advocates when I first became interested in AI in 2004–2010 (back when AI was stuck in the doldrums of hopelessly narrow tools and dates like 2028 seemed impossibly far away), since it smacked of numerology and “if you build it they will come” logic (at the time, we certainly didn’t have general algorithms that you could just throw compute at), in 2020, I have to admit, I was wrong and they were right. We built the compute, and the algorithms did come, and the scaling hypothesis has only looked more and more plausible every year since 2010.

Why Does Pretraining Work?

The pretraining thesis goes something like this:

“Figure 1: Envisioned evolution of NLP research through three different eras or curves” (the hypothetical S-curves & progress in natural language modeling; from )

Humans, one might say, are the : we constantly emit large amounts of structured data, which implicitly rely on logic, causality, object permanence, history—all of that good stuff. All of that is implicit and encoded into our writings and videos and ‘data exhaust’. A model learning to predict must learn to understand all of that to get the best performance; as it predicts the easy things which are mere statistical pattern-matching, what’s left are the hard things.

Early on in training, a model learns the crudest levels: that some letters are more frequent than others, that every 5 characters or so there is a space, and so on. It goes from predicting uniformly-distributed bytes to what looks like Base-60 encoding—alphanumeric gibberish. As crude as this may be, it’s enough to make quite a bit of absolute progress: a random predictor needs 8 bits to ‘predict’ a byte/character, but just by at least matching letter and space frequencies, it can almost halve its error to around 5 bits.15 Because it is learning so much from every character, and because the learned frequencies are simple, it can happen so fast that if one is not logging samples frequently, one might not even observe the improvement.
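
The ‘~8 bits down to ~5 bits’ step can be checked with a few lines of arithmetic; the letter frequencies below are rough illustrative values for English, not measurements of any particular corpus.

```python
import math

uniform_bits = math.log2(256)   # a uniform predictor over bytes pays 8 bits per character

# Approximate unigram frequencies for the most common characters (assumed, for illustration):
freqs = {' ': 0.18, 'e': 0.10, 't': 0.07, 'a': 0.065, 'o': 0.06, 'i': 0.058,
         'n': 0.055, 's': 0.052, 'h': 0.05, 'r': 0.048}
other_mass = 1 - sum(freqs.values())
n_other = 70                    # spread the leftover probability over ~70 other printable characters

entropy = -sum(p * math.log2(p) for p in freqs.values()) \
          - other_mass * math.log2(other_mass / n_other)
print(f"{uniform_bits:.1f} bits/char -> {entropy:.1f} bits/char")   # ≈ 8.0 -> ≈ 4.8
```

Just knowing which characters are common, with no notion of words at all, already buys back ~3 bits per character; everything after that gets progressively harder.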

As training progresses, the task becomes more difficult. Now it begins to learn what words actually exist and do not exist. It doesn’t know anything about meaning, but at least now when it’s asked to predict the second half of a word, it can actually do that to some degree, saving it a few more bits. This takes a while because any specific instance will show up only occasionally: a word may not appear in a dozen samples, and there are many thousands of words to learn. With some more work, it has learned that punctuation, pluralization, possessives are all things that exist. Put that together, and it may have progressed again, all the way down to 3–4 bits error per character! (While the progress is gratifyingly fast, it’s still all gibberish, though, make no mistake: a sample may be spelled correctly, but it doesn’t make even a bit of sense.)

But once a model has learned a good English vocabulary and correct formatting/spelling, what’s next? There’s not much juice left in predicting within-words. The next thing is picking up associations among words. What words tend to come first? What words ‘cluster’ and are often used nearby each other? Nautical terms tend to get used a lot with each other in sea stories, and likewise Bible passages, or American history Wikipedia articles, and so on. If the word “Jefferson” is the last word, then “Washington” may not be far away, and it should hedge its bets on predicting that ‘W’ is the next character, and then if it shows up, go all-in on “ashington”. Such bag-of-words approaches still predict badly, but now we’re down to perhaps <3 bits per character.

What next? Does it stop there? Not if there is enough data and the earlier stuff like learning English vocab doesn’t hem the model in by using up its learning ability. Gradually, other words like “President” or “general” or “after” begin to show the model subtle correlations: “Jefferson was President after…” With many such passages, the word “after” begins to serve a use in predicting the next word, and then the use can be broadened.

By this point, the loss is perhaps 2 bits: every additional 0.1 bit decrease comes at a steeper cost and takes more time. However, now the sentences have started to make sense. A sentence like “Jefferson was President after Washington” does in fact mean something (and if occasionally we sample “Washington was President after Jefferson”, well, what do you expect from such an un-converged model). Jarring errors will immediately jostle us out of any illusion about the model’s understanding, and so training continues. (Around here, Markov chain & n-gram models start to fall behind; they can memorize increasingly large chunks of the training corpus, but they can’t solve increasingly critical syntactic tasks like balancing parentheses or quotes, much less start to ascend from syntax to semantics.)

Now training is hard. Even subtler aspects of language must be modeled, such as keeping pronouns consistent. This is hard in part because the model’s errors are becoming rare, and because the relevant pieces of text are increasingly distant and ‘long-range’. As it makes progress, the absolute size of errors shrinks dramatically. Consider the case of associating names with gender pronouns: the difference between “Janelle ate some ice cream, because he likes sweet things like ice cream” and “Janelle ate some ice cream, because she likes sweet things like ice cream” is one no human could fail to notice, and yet, it is a difference of a single letter. If we compared two models, one of which didn’t understand gender pronouns at all and guessed ‘he’/‘she’ purely at random, and one which understood them perfectly and always guessed ‘she’, the second model would attain a lower average error of barely <0.02 bits per character!
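
The <0.02 bits/character figure falls out of a one-line calculation, under the assumption (mine, for illustration) that a gendered pronoun turns up roughly once every 50 characters:

```python
chars_per_pronoun = 50          # assumed frequency of gendered pronouns in running text
bits_saved_per_pronoun = 1.0    # a 50/50 he/she guess costs log2(2) = 1 bit; a perfect guess ~0 bits
print(bits_saved_per_pronoun / chars_per_pronoun)   # 0.02 bits per character, averaged over all text
```

So a capability a human reader would consider glaring is worth only hundredths of a bit once averaged over every character, which is why the last increments of loss hide so much.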

Nevertheless, as training continues, these problems and more, like imitating genres, get solved, and eventually at a loss of 1–2 (where a small char-RNN might converge on a small corpus like Shakespeare or some Project Gutenberg ebooks), we will finally get samples that sound human—at least, for a few sentences. These final samples may convince us briefly, but, aside from issues like repetition loops, even with good samples, the errors accumulate: a sample will state that someone is “alive” and then 10 sentences later, use the word “dead”, or it will digress into an irrelevant argument instead of the expected next argument, or someone will do something physically improbable, or it may just continue for a while without seeming to get anywhere.

All of these errors are far less than <0.02 bits per character; we are now talking not hundredths of bits per character but less than ten-thousandths.

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character. What is in that missing >0.4?

“Yeah, but there’s more to being smart than knowing compression schemes!” “No there’s not!” “Shoot—he knows the secret!!”

Well—everything! Everything that the model misses. While just babbling random words was good enough at the beginning, at the end, it needs to be able to reason its way through the most difficult textual scenarios requiring causality or commonsense reasoning. Every error where the model predicts that ice cream put in a freezer will “melt” rather than “freeze”, every case where the model can’t keep straight whether a person is alive or dead, every time that the model chooses a word that doesn’t help build somehow towards the ultimate conclusion of an ‘essay’, every time that it lacks the theory of mind to compress novel scenes describing the Machiavellian scheming of a dozen individuals at dinner jockeying for power as they talk, every use of logic or abstraction or instructions or Q&A where the model is befuddled and needs more bits to cover up for its mistake where a human would think, understand, and predict. Each of these cognitive breakthroughs allows ever so slightly better prediction of a few relevant texts; nothing less than true understanding will suffice for ideal prediction.

If we trained a model which reached that loss of <0.7, which could predict text indistinguishable from a human, whether in a dialogue or quizzed about ice cream or being tested on SAT analogies or tutored in mathematics, if for every string the model did just as good a job of predicting the next character as you could do, how could we say that it doesn’t truly understand everything? (If nothing else, we could, by definition, replace humans in any kind of text-writing job!)

The last bits are deepest. The implication here is that the final few bits are the most valuable bits, which require the most of what we think of as intelligence. A helpful analogy here might be our actions: for the most part, all humans execute actions equally well. We all pick up a tea mug without dropping it, and can lift our legs to walk down thousands of steps without falling even once. For everyday actions, anybody, of any intelligence, can get enough practice & feedback to do them quite well. Where individuals differ is when they start running into novel choices, rare choices, choices that take seconds but unfold over a lifetime, choices where we will never get any feedback (like after our death). One only has to make a single bad decision, out of a lifetime of millions of discrete decisions, to wind up in jail or dead. A small absolute average improvement in decision quality, if it is in those decisions, may be far more important than its quantity indicates. (Why do humans have such large brains, when animals like chimpanzees do so many ordinary activities seemingly as well with a fraction of the expense? Why is language worthwhile? Perhaps because of considerations like these. We may be at our most human while filling out the paperwork for life insurance.)

Reasons for doubt. The pretraining thesis, while logically impeccable—how is a model supposed to solve all possible trick questions without understanding, just guessing?—never struck me as convincing, an argument admitting neither confutation nor conviction. It feels too much like a magic trick: “here’s some information theory, here’s a human benchmark, here’s how we can encode all tasks as a sequence prediction problem, hey presto, intelligence!” There are lots of algorithms which are Turing-complete or ‘universal’ in some sense; there are lots of algorithms like AIXI which solve AI in some theoretical sense (Schmidhuber & company have many of these cute algorithms such as ‘the fastest possible algorithm for all problems’, with the minor catch of some constant factors which require computers bigger than the universe).

Why think pretraining or sequence modeling is not another one of them? Sure, if the model got a low enough loss, it’d have to be intelligent, but how could you prove that would happen in practice? (Training char-RNNs was fun, but they hadn’t exactly revolutionized deep learning.) It might require more text than exists, countless petabytes of data for all of those subtle factors like logical reasoning to represent enough training signal, amidst all the noise and distractors, to train a model. Or maybe your models are too small to do more than absorb the simple surface-level signals, and you would have to scale them 100 orders of magnitude for it to work, because the scaling curves didn’t cooperate. Or maybe your models are fundamentally broken, and stuff like abstraction requires an entirely different architecture to work at all, and whatever you do, your current models will saturate at poor performance. Or it’ll train, but it’ll spend all its time trying to improve the surface-level modeling, absorbing more and more literal data and facts without ever ascending to the higher planes of cognition as planned. Or…

‘The possibilities of developing an atomic weapon and the desirability of doing it secretly were discussed at a Princeton University conference in which I participated in March 1939… said this rare variety could not be separated from common uranium except by turning the country into a gigantic factory. Bohr was worried that this could be done and that an atomic bomb could be developed—but he hoped that neither could be accomplished. Years later, when Bohr came to Los Alamos, I was prepared to say, “You see . . .” But before I could open my mouth, he said: “You see, I told you it couldn’t be done without turning the whole country into a factory. You have done just that.”’

16

But apparently, it would’ve worked fine. Even RNNs probably would’ve worked—Transformers are nice, but they seem to be mostly about efficiency.17 (Training large RNNs is much more expensive, and doing BPTT over multiple nodes is much harder engineering-wise.) It just required more compute & data than anyone was willing to risk on it until a few true-believers were able to get their hands on a few million dollars of compute.

  1. Q: Did anyone predict, quantitatively, that this would happen where it did?

    A: Not that I know of.

  2. Q: What would future scaled-up models learn?

    GPT-2-1.5b had a cross-entropy WebText validation loss of ~3.3 (based on the perplexity of ~10 in Figure 4, and log2(10) = 3.32). GPT-3 halved that loss to ~1.73 judging from Brown et al 2020 and using the scaling formula (2.57 × (3.64 × 10³)^−0.048). For a hypothetical GPT-4, if the scaling curve continues for another 3 orders or so of compute (100–1000×) before crossing over and hitting harder diminishing returns, the cross-entropy loss will drop to ~1.24 (2.57 × (3.64 × (10³ × 10³))^−0.048). (These numbers are reproduced in the short calculation after this list.)

    If GPT-3 gained so much meta-learning and world knowledge by dropping its absolute loss ~50% when starting from GPT-2’s level, what capabilities would another ~30% improvement over GPT-3 gain? (Cutting the loss that much would still not reach human-level, as far as I can tell.18) What would a drop to ≤1, perhaps using wider context windows or recurrency, gain?

    A: I don’t know.

  3. Q: Does anyone?

    A: Not that I know of.19
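
The cross-entropy estimates in Q2 above follow directly from the Kaplan et al 2020 compute scaling law, L(C) ≈ 2.57 · C^−0.048 with C in petaflop/s-days; a couple of lines verify them (the second input is just the hypothetical 1000× compute figure used above):

```python
def loss(pfs_days):
    """Kaplan et al 2020 compute scaling law: cross-entropy loss vs training compute."""
    return 2.57 * pfs_days ** -0.048

print(round(loss(3.64e3), 2))   # ≈ 1.73 (GPT-3's ~3,640 petaflop/s-days)
print(round(loss(3.64e6), 2))   # ≈ 1.24 (a hypothetical 1000× compute successor)
# GPT-2's ~3.3 comes instead from its reported WebText perplexity: log2(10) ≈ 3.32.
```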

Prospects

In the problem of decoding, the most important information which we can possess is the knowledge that the message which we are reading is not gibberish…In a similar way, when we consider a problem of nature such as that of atomic reactions and atomic explosives, the largest single item of information which we can make public is that they exist. Once a scientist attacks a problem which he knows to have an answer, his entire attitude is changed. He is already some 50% of his way toward that answer…the one secret concerning the atomic bomb which might have been kept and which was given to the public and to all potential enemies without the least inhibition, was that of the possibility of its construction. Take a problem of this importance and assure the scientific world that it has an answer; then both the intellectual ability of the scientists and the existing laboratory facilities are so widely distributed that the quasi-independent realization of the task will be a matter of merely a few years anywhere in the world.

, pg124–125 (emphasis added)

“People who work in machine learning simply didn’t think that neural networks could do much. People didn’t believe large neural networks could be trained…The ideas were all there, the thing that was missing was a lot of supervised data and a lot of compute. Once you have [those two], then there is a third thing that is needed—and that is conviction. Conviction that if you take the right stuff, which already exists, and apply and mix it with a lot of data and a lot of compute, that it will in fact work. And so that was the missing piece.”

Ilya Sutskever

What can we expect from future DL work? Will GPT-3 kickstart an arms race where soon we will be discussing, blasé, what would seem now like ludicrously farfetched schemes, like a bidirectional multimodal Transformer 100× the size trained on 100× the data (video/text/PDFs-as-images/photo/robotics) with supplementary supervised learning as the backbone of a MuZero-like learning+planning DRL agent running on thousands of tasks (such as coding) simultaneously?

The existence of implies that the limiting factor here is less hardware than human: will any organization treat GPT-3 as a Sputnik moment and invest aggressively in scaling programs? Is there a GPT-4-equivalent brewing away inside DeepMind or Google Brain’s TPU pods now? They aren’t stupid, they have the hardware, they have the budgets, they have the people.

But I think they lack a vision. As far as I can tell: they do not have any such thing, because Google Brain & DeepMind do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OA do. Just read through machine learning Twitter to see the disdain for the scaling hypothesis. (A quarter year on from GPT-3 and counting, can you name a single dense model as large as the 17b Turing-NLG—never mind larger than GPT-3?)

Google Brain is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc V. Le’s group occasionally surprises you. They’ll dabble in something like GShard, but mostly because they expect to be likely to be able to deploy it or something like it to production in Google Translate.

DeepMind20 holds what we might call the “weak scaling hypothesis”: they believe that AGI will require us to “find the right algorithms”, effectively replicating a mammalian brain module by module, and that while these modules will be extremely large & expensive by contemporary standards (which is why compute is important, to give us “a more powerful tool with which to hunt for the right algorithms”), they still need to be invented & finetuned piece by piece, with little risk or surprise until the final assembly. Each piece, however, itself can scale: there’s no magical intelligence gland or quantum woo which creates a bright line between humans and, say, chimpanzees or rodents. (As much as we humans extravagantly admire our own capabilities like language or logic, those are relatively minor flourishes on the basic brain—each organism solves the same basic problems, like exploration, long-term memory, learning world-models, associating rewards with specific actions, meta-learning, etc.) As such, once you have a rat-level AGI, a human-level AGI is just more so. (And rats are a lot easier to experiment on.) That is how you get DM contraptions like which throw the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. (See also Sam Altman’s podcast interview comments on how OA’s advantage vs unnamed rivals with more compute is that the lack of compute makes them stay “small and focused”—“for sure” like a startup approach.) When someone seems to have come up with a scalable architecture for cracking a hard problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then is the game plan. They have been biting off and chewing pieces of the brain for a decade, and it’ll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM ‘bet the company’ on any moonshot; Google’s cashflow isn’t going anywhere (nor is DM’s budget), and slow and steady wins the race.

Going beyond that, most other research labs like Tesla or FAIR are irrelevant and uninterested. Chinese AI companies are a question mark: past the language barrier, I seem to discern interest in AGI & little of the reflexive Western opposition, and companies like Baidu occasionally release important research (such as the early scaling paper Hestness et al 2017), but overall, Chinese AI may be overestimated, and they seem to suffer from a kind of Dutch disease—funding for surveillance technology, and for narrow e-commerce niches, is so plentiful that other areas are neglected.

OA, lacking anything like DM’s long-term funding from Google or its enormous headcount, is making a startup-like bet that they know an important truth which is a secret: “the scaling hypothesis is true!” So, simple DRL algorithms like PPO on top of large simple architectures like RNNs or Transformers can emerge, exploiting the blessings of scale, and meta-learn their way to powerful capabilities, enabling further funding for still more compute & scaling, in a virtuous cycle. This is why OA had to revise its corporate form: lacking any enormous endowment or extremely deep-pocketed patron like Google, where does it get the money to scale (or hire machine learning engineer/researchers who can command salaries in the millions)? OA has to earn the necessary money, so in a move like Mozilla Foundation owning Mozilla Corporation (to sell Firefox search engine placement), or the Hershey orphanage owning Hershey Chocolate or the Girl Scouts licensing their cookies, OpenAI switched from a pure nonprofit funded by donations to a nonprofit which owns a for-profit subsidiary/startup, “OpenAI LP”, which can take investments and engage in for-profit activities. OA LP, while controlled by OA, can then shoot for the moon. And if OA is wrong to trust in the , well, they never could compete with DM directly using DM’s favored approach, and were always going to be an also-ran footnote, so they have no regret.

While all of this hypothetically can be replicated relatively easily (never underestimate the amount of tweaking and special sauce it takes) by competitors if they wished (the necessary compute budgets are still trivial in terms of Big Science or other investments like AlphaGo or AlphaStar or Waymo, after all), said competitors lack the very most important thing, which no amount of money or GPUs can ever cure: the courage of their convictions. They are too hidebound and deeply philosophically wrong to ever admit fault and try to overtake OA until it’s too late. How can we talk seriously about any kind of military Manhattan Project when the US military doesn’t even let its developers use Tensorflow or PyTorch, or about government projects in the shadow of coronavirus? This might seem absurd (surely the Bitter Lesson/scaling hypothesis have now earned enough prior probability to be taken seriously and receive major research investments to test how far they can go, especially given how important the implications are), but look at the repeated criticism of OA every time they release a new example of the scaling hypothesis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3… To paraphrase St Augustine, most people’s reaction to the Bitter Lesson or scaling hypothesis is “grant me scale & compute—but not yet”.21

A critical indicator will be whether organizations beyond ‘the usual suspects’ (a Microsoft team has reached , but there is also Nvidia, Salesforce, Allen, Google DM/GB, Connor/EleutherAI, Facebook FAIR) start participating or if they continue to dismiss scaling. At least as of 2020-10-26, 152 days later, no model has come near GPT-3, and indeed, no model has even exceeded Turing-NLG’s 17b.22

Critiquing The Critics

Keeping track. GPT-3 in 2020 makes as good a point as any to take a look back on the past decade. It’s remarkable to reflect that someone who started a PhD because they were excited by these new “ResNets” would still not have finished it by now—that is how recent even resnets are, never mind Transformers, and how rapid the pace of progress is. In 2010, one could easily fit everyone in the world who genuinely believed in deep learning into a moderate-sized conference room (assisted slightly by the fact that 3 of them were busy founding ). Someone interested in machine learning in 2010 might have read about some interesting stuff from weirdo diehard connectionists in recognizing hand-written digits using all of 1–2 million parameters, or some modest neural tweaks to standard voice-recognition hidden Markov models. In 2010, who would have predicted that over the next 10 years, deep learning would undergo a Cambrian explosion causing a mass extinction of alternative approaches throughout machine learning, that models would scale up to 175,000 million parameters, and that these enormous models would just spontaneously develop all these capabilities?

No one. That is, no one aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as , Schmidhuber, Sutskever, Legg, & Amodei? One of the more shocking things about looking back is realizing how unsurprising and easily predicted all of this was if you listened to the right people. In 1998, 22 years ago, Moravec noted that AI research could be deceptive, and hardware limits meant that “intelligent machine research did not make steady progress in its first 50 years, it marked time for 30 of them!”, predicting that as Moore’s law continued, “things will go much faster in the next 50 years than they have in the last 50.” Moravec further observed that part of the reason for rapid progress was the hardware overhang: while supercomputers of the necessary power would exist long before the connectionist revolution began, no one would be allowed to use them, as they would be devoted to ‘more important’ (prestigious) hard STEM work, like “physics simulations” (ie climate simulations & nuclear bombs)23, and “AI research must wait for the power to become more affordable.” Affordable meaning a workstation roughly ~$1,854 ($1,000 in 1998); sufficiently cheap compute to rival a human would arrive sometime in the 2020s, with the 2010s seeing affordable systems in the lizard–mouse range. As it happens, the start of the DL revolution is typically dated to in 2012, by a grad student using 2 GTX 580 3GB GPUs (launch list price of… $657 ($500 in 2010), for a system build cost of perhaps $1,901 ($1,500 in 2012)). 2020 saw GPT-3 arrive, and as discussed before, there are many reasons to expect the cost to fall, in addition to the large hardware compute gains that are being forecast for the 2020s.

The accelerating pace of the last 10 years should wake anyone from their dogmatic slumber and make them sit upright. And there are 28 years left in Moravec’s forecast…

The temptation, that many do not resist so much as revel in, is to give in to a déformation professionnelle and dismiss any model as “just” this or that (“just billions of IF statements” or “just a bunch of multiplications” or “just millions of memorized web pages”), missing the forest for the trees, as Moravec commented of chess engines:

The event was notable for many reasons, but one especially is of interest here. Several times during both matches, Kasparov reported signs of mind in the machine. At times in the second tournament, he worried there might be humans behind the scenes, feeding Deep Blue strategic insights!…In all other chess computers, he reports a mechanical predictability stemming from their undiscriminating but limited lookahead, and absence of long-term strategy. In Deep Blue, to his consternation, he saw instead an “alien intelligence.”

…Deep Blue’s creators know its quantitative superiority over other chess machines intimately, but lack the chess understanding to share Kasparov’s deep appreciation of the difference in the quality of its play. I think this dichotomy will show up increasingly in coming years. Engineers who know the mechanism of advanced robots most intimately will be the last to admit they have real minds. From the inside, robots will indisputably be machines, acting according to mechanical principles, however elaborately layered. Only on the outside, where they can be appreciated as a whole, will the impression of intelligence emerge. A human brain, too, does not exhibit the intelligence under a neurobiologist’s microscope that it does participating in a lively conversation.

But of course, if we ever succeed in AI, or in reductionism in general, it must be by reducing Y to ‘just X’. Showing that some task requiring intelligence can be solved by a well-defined algorithm with no ‘intelligence’ is precisely what success must look like! (Otherwise, the question has been thoroughly begged & the problem has only been pushed elsewhere; computer chips are made of transistors, not especially little homunculi.)

“As long as the AI [OA5] can explore, it will learn, given enough time…We just kept waiting for the magic to run out. We kept waiting to hit a wall, and we never seemed to hit a wall.”

Greg Brockman

“Give it the compute, give it the data, and it will do amazing things. This stuff is like—it’s like alchemy!”

Ilya Sutskever, summer 2019

Hindsight is 20⁄20. Even in 2015, the scaling hypothesis seemed highly dubious: you needed something to scale, after all, and it was all too easy to look at flaws in existing systems and imagine that they would never go away and progress would sigmoid any month now, soon. Like the genomics revolution, where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially & deliver powerful PGSes soon, while sober experts wrung their hands over “missing heritability” & the miraculous complexity of biology & scoffed about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly. Yet, here we are: all honor to the fanatics, shame and humiliation to the critics!24 If only one could go back 10 years, or even 5, to watch every AI researcher’s head explode reading this paper… Unfortunately, few heads appear to be exploding now, because human capacity for hindsight & excuses is boundless (“I can get that much with finetuning, anyway I predicted it all along, how boring”) and, unfortunately, for AGI. (If you are still certain that there is near-zero probability of AGI in the next few decades, why? Did you predict—in writing—capabilities like GPT-3? Is this how you expect AI failure to look in the decades beforehand? What specific task, what specific number, would convince you otherwise? How would the world look different than it does now if these crude prototype insect-brain-sized DL systems were not on a path to success?)

Authority without accountability. What should we think about the experts? Projections of failure were made by eminent, respectable, serious people. They spoke in considered tones of why AI hype was excessive and might trigger an “AI winter”, and of the fundamental flaws of fashionable approaches and why brute force could not work. These statements were made routinely in 2014, 2015, 2016… And they were wrong. I am aware of few issuing a mea culpa or reflecting on it.25 It is a puzzling failure, and I’ve reflected on it before.

Phatic, not predictive. There is, however, a certain tone of voice the bien pensant all speak in, whose sound is the same whether right or wrong; a tone shared with many statements in January to March of this year; a tone we can also find in a 1940 Scientific American article authoritatively titled, , which advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)

The iron law of bu­reau­cra­cy: Cathe­dral goth­ic. This tone of voice is the voice of .
The voice of au­thor­ity in­sists on calm, and peo­ple not “pan­ick­ing” (the chief of sin­s).
The voice of au­thor­ity as­sures you that it won’t hap­pen (be­cause it can’t hap­pen).
The voice ut­ters sim­ple ar­gu­ments about why the sta­tus quo will pre­vail, and con­sid­ers only how the wild new idea could fail (and not all the pos­si­ble op­tion­s).
The voice is not, and does not deal in, un­cer­tain­ty; things will ei­ther hap­pen or they will not, and since it will not hap­pen, there is no need to take any pre­cau­tions (and you should not worry be­cause it can’t hap­pen).
The voice does not be­lieve in draw­ing lines on graphs (it is rank nu­merol­o­gy).
The voice does not is­sue any nu­mer­i­cal pre­dic­tions (which could be fal­si­fied).
The voice will not share its source code (for com­pli­cated rea­sons which can­not be ex­plained to the laity).
The voice is op­posed to un­eth­i­cal things like ran­dom­ized ex­per­i­ments on vol­un­teers (but will over­look the in­sult).
The voice does not have a model of the fu­ture (be­cause a model im­plies it does not al­ready know the fu­ture).
The voice is con­cerned about its pub­lic im­age (and un­kind gos­sip about it by other speak­ers of the voice).
The voice is al­ways sober, re­spectable, and cre­den­tialed (the voice would be pleased to write an op-ed for your na­tional mag­a­zine and/or news­pa­per).
The voice speaks, and is not spo­ken to (you can­not ask the voice what ob­jec­tive fact would change its mind).
The voice never changes its mind (un­til it does).
The voice is never sur­prised by events in the world (only dis­ap­point­ed).
The voice ad­vises you to go back to sleep (right now).

When some­one speaks about fu­ture pos­si­bil­i­ties, what is the tone of their voice?

Media

Books

Fic­tion:

Music

Touhou:

MLP:

Dou­jin:


  1. Given the number of comments on the paper's arithmetic benchmark, I should point out that the arithmetic benchmark appears to greatly understate GPT-3's abilities due to the BPE encoding: even using commas markedly improves its 5-digit addition ability, for example. The BPE issue also appears to explain much of the poor performance on the anagram/shuffling tasks. This is something to keep in mind for any task which requires character-level manipulation or understanding.
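
    A minimal sketch of the issue (assuming the HuggingFace transformers package; its gpt2 tokenizer uses the BPE vocabulary of GPT-2, which GPT-3 reuses):

```python
# Compare how the same number is split into BPE tokens with and without comma
# separators; digit chunks that don't align with place value plausibly hurt
# arithmetic and other character-level tasks.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")  # GPT-2/GPT-3 BPE vocabulary

for s in ["12345 + 67890 =", "12,345 + 67,890 ="]:
    print(s, "->", tok.tokenize(s))
# The comma-separated version breaks the digits into regular <=3-digit groups,
# while the bare version is merged into arbitrary multi-digit chunks.
```

    ↩︎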

  2. On im­plicit meta-learn­ing, see: / ()/, , /, /.↩︎

  3. GPT-3 hardly costs more than a few mil­lion dol­lars of com­pute (as of early 2020) as the ex­ten­sive scal­ing re­search be­fore­hand en­abled one train­ing run, and it is cheap to run (pg39): “Even with the full GPT-3 175B, gen­er­at­ing 100 pages of con­tent from a trained model can cost on the or­der of 0.4 kW-hr, or only a few cents in en­ergy costs.” (Like­wise, T5 was trained only once.)

    For comparison, the was a common academic workhorse due to its extremely low cost, a mere $20,000 in 1970 (~$91,556 in 2020 dollars), while the first cost >$50,000 in 1972 (~$215,341)—expensive for a workstation but a bargain compared to researchers hogging mainframes costing tens of millions. IBM's (otherwise useless) Deep Blue AI project reputedly cost >$5m in 1997 (~$10m) for the final iteration (reports of $100m (~$192m) appear to be a confusion with the estimated value of publicity mentioned in pg187 of Hsu's Behind Deep Blue), and Big Science projects like blow >5000× the funding to mostly fail. (The particle physicists, incidentally, are ≫$24b, based on, presumably, the scientific revolutions & world-changing breakthroughs that the LHC's >$9b investment in 2010 (~$12b) produced, or the $2b spent in 1993 (~$4b) to (not) build the …)

    GPT-3 could have been done decades ago with global com­put­ing re­sources & sci­en­tific bud­gets; what could be done with to­day’s hard­ware & bud­gets that we just don’t know or care to do? There is a hard­ware over­hang. (See also the Whole Brain Em­u­la­tion Roadmap & .)↩︎

  4. Further, NNs have additional hardware overhangs of their own, due to the many-orders-of-magnitude asymmetry between training a model and running it. Transfer learning and meta-learning are vastly cheaper than training the baseline model from scratch; you can 'train' GPT-3 without any gradient steps at all—just examples in the prompt (see the sketch after this list). You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost. If you train a model, then as soon as it's done you get, among other things:

    • the abil­ity to run thou­sands of copies in par­al­lel on the same hard­ware

      • in a con­text like Al­phaGo, I es­ti­mate sev­eral hun­dred ELO strength gains if you reuse the same hard­ware to merely run tree search with ex­act copies of the orig­i­nal model
    • meta-learning/transfer-learning to any re­lated do­main, cut­ting train­ing re­quire­ments by or­ders of mag­ni­tude

    • model compression/distillation to train stu­dent mod­els which are a frac­tion of the size, FLOPS, or la­tency (ra­tios vary­ing widely based on task, ap­proach, do­main, ac­cept­able per­for­mance degra­da­tion, tar­geted hard­ware etc, but often ex­treme like 1⁄100th)

    • reuse of the model else­where to in­stantly power up other mod­els (eg use of text or im­age em­bed­dings for a DRL agent)

    • learning-by-doing/ (high­est in in­for­ma­tion tech­nolo­gies, and high for DL: Her­nan­dez & Brown 2020), so the next from-scratch model may be much cheap­er.

      For ex­am­ple: after all the it­er­a­tive model ar­chi­tec­ture & game up­grades done while train­ing the first OA5 agent was com­plet­ed, the sec­ond it­er­a­tion of OA5, “Re­run”, was trained from scratch. Re­run re­quired only 20% of the train­ing for a “98% win-rate against the fi­nal ver­sion of Ope­nAI Five.” As the au­thors note: “The ideal op­tion would be to run Re­run-like train­ing from the very start, but this is im­pos­si­ble—the Ope­nAI Five curve rep­re­sents lessons learned that led to the fi­nal code­base, en­vi­ron­ment, etc., with­out which it would not be pos­si­ble to train Re­run.”

    • base­line for en­gi­neer­ing much more effi­cient ones by ab­lat­ing and com­par­ing with the orig­i­nal
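
    As a concrete sketch of the 'no gradient steps, just examples' point above: the entire 'training set' is packed into the prompt and the frozen model does the rest. (This assumes the 2020-era openai Python client and an engine name like "davinci"; the spelling-correction task and its few-shot examples are made up purely for illustration.)

```python
# Few-shot "training" of GPT-3 purely by prompting: no fine-tuning, no gradient
# updates—just demonstrations placed in the context window.
import openai

openai.api_key = "sk-..."  # placeholder

prompt = """Correct the spelling of each word.

Input: recieve
Output: receive

Input: definately
Output: definitely

Input: seperate
Output:"""

completion = openai.Completion.create(
    engine="davinci",    # assumed engine name for the base GPT-3 model
    prompt=prompt,
    max_tokens=5,
    temperature=0.0,
    stop="\n",
)
print(completion.choices[0].text.strip())  # hopefully: "separate"
```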

    ↩︎
  5. Eg a narrow context window severely limits it, and motivates the need for . More broadly, GPT-3 does nothing exotic—no use of or neural architecture search to try to tailor the model, or even decide basic hyperparameters like widths (which as shows, can make quite a difference even in "well-understood and hand-optimized vanilla architectures").↩︎

  6. Not even PDFs—so no Google Books, no Arx­iv, no Lib­gen, no Sci-Hub…↩︎

  7. Gen­er­at­ing text from a LM can re­veal the pres­ence of knowl­edge, but not its ab­sence, and it is uni­ver­sally agreed that the cur­rent crude heuris­tic meth­ods like top-k can­not pos­si­bly be op­ti­mal.↩︎

  8. ‘A man is at the doc­tor’s office, and the doc­tor tells him, “I’ve got some good news and some bad news for you.” / The man says, “Well, I can’t take the bad news right now, so give me the good news first.” / The doc­tor says, “Well, the good news is that you have an 18-inch pe­nis.” / The man looks stunned for a mo­ment, and then asks, “What’s the bad news?” / The doc­tor says, “Your brain’s in your dick.”’↩︎

  9. Specifically: , , , , , , , , //, , , , , , , . In particular, sample-efficiency increases with model size up to compute-efficient scaling, and —given long-tailed real-world distributions of data. (An example of how not to do scaling papers is Thompson et al, which, in stark contrast to the foregoing papers—which Thompson et al do not mention at all!—attempts to infer scaling not from well-controlled experiments run by the authors, which yield extremely tight and highly predictive curves, but from occasional reported numbers in highly disparate research papers; unsurprisingly, their curves barely predict anything and seem to be serious overestimates anyway.)

    It is note­wor­thy that the pur­suit of large mod­els is dri­ven al­most ex­clu­sively by Ope­nAI & in­dus­try en­ti­ties (the lat­ter of which are con­tent with far smaller mod­el­s), and that acad­e­mia has evinced an al­most to­tal dis­in­ter­est—dis­gust & anger, even, and de­nial (one might say “green AI” is green with en­vy). For all that the scal­ing hy­poth­e­sis is ‘ob­vi­ous’ and scal­ing is ‘pre­dicted’, there is re­mark­ably lit­tle in­ter­est in ac­tu­ally do­ing it. Per­haps we should pay more at­ten­tion to what peo­ple do rather than what they say.

    For more ML scal­ing re­search, fol­low the sub­red­dit.↩︎

  10. Roughly around Chuan Li’s es­ti­mate, us­ing nom­i­nal list prices with­out dis­counts (which could be steep as the mar­ginal costs of cloud com­pute are sub­stan­tially low­er). The R&D project cost would be much high­er, but is amor­tized over all sub­se­quent mod­els & pro­jects.↩︎

  11. The Manhattan Project cost ~$2b in 1946 (~$24b in 2020 dollars).↩︎

  12. As if we live in a world where grad students could go to the Moon on a ramen budget if we just wished hard enough, or as if "green AI" approaches to try to create small models without going through big models did not look increasingly futile and like throwing good money after bad, and were not the least green of all AI research… To the extent that all cutting-edge AI research ~2010 could be done with grad-student money like $1,000 of hardware in 2010 (~$1,313 in 2020 dollars), where AI research in decades before & after benefited from big iron, that is an indictment of that era, demonstrating what a stagnant dead end that research was, that its techniques were so small-minded and hobbled it could not benefit from the available large-scale compute.↩︎

  13. Fun trivia: BiT is better at predicting (cleaned, corrected) ImageNet labels than the original ImageNet labels are.↩︎

  14. One interesting aspect of image scaling experiments like Djolonga et al 2020 is that even when performance is 'plateauing' on the original task & approaching label error, the transfer learning continues to improve. Apparently the internal representations, even when adequate for mere classification (so the score cannot increase more than a small percentage), become more human-like—because it's encoding or more ? I've noticed with language models that the final fractions of a loss appear to make a substantial difference to generated sample quality, perhaps because it is only after all the easier modeling is finished that the lazy language model is forced to squeeze out the next bit of performance by more correctly modeling more sophisticated things like logic, objects, world-knowledge, etc.↩︎

  15. The numbers here are not exact and are for illustration; because BPEs don't correspond to any intuitive unit, I am going to borrow from my observations watching char-RNNs, and talk about the loss per character instead of per BPE.↩︎

  16. pg210–211, “The Quiet En­emy”, The Legacy of Hi­roshima, Teller 1962.↩︎

  17. An­other way of in­ter­pret­ing the var­i­ous pa­pers about how Trans­form­ers are ac­tu­ally like RNNs or are is to take that as in­di­cat­ing that what is im­por­tant about them is not any in­her­ent new ca­pa­bil­ity com­pared to older ar­chi­tec­tures, but some low­er-level as­pect like be­ing more effi­ciently train­able on con­tem­po­rary hard­ware.↩︎

  18. How do these absolute prediction performances compare to humans? It's hard to say. The only available benchmarks for perplexity for humans/GPT-2/GPT-3 appear to be WebText, Penn Tree Bank (PTB; based on the ), the 1 Billion Word Benchmark (1BW), and LAMBADA. But coverage is spotty.

    I found no hu­man bench­marks for Web­Text or Penn Tree Bank, so I can’t com­pare the hu­man vs GPT-2/GPT-3 per­plex­i­ties (GPT-2 PTB: 35.7; GPT-3 PTB: 20.5).

    GPT-2 was bench­marked at 43 per­plex­ity on the 1 Bil­lion Word (1BW) bench­mark vs a (highly ex­trap­o­lat­ed) (which in­ter­est­ingly ex­trap­o­lates, us­ing 2012 LSTM RNNs, that “10 to 20 more years of re­search be­fore hu­man per­for­mance is reached”), but that may be an un­fair bench­mark (“Our model is still sig­nifi­cantly worse than prior work on the One Bil­lion Word Bench­mark (). This is likely due to a com­bi­na­tion of it be­ing both the largest dataset and hav­ing some of the most de­struc­tive pre-pro­cess­ing—1B­W’s sen­tence level shuffling re­moves all long-range struc­ture.”) and 1BW was dropped from the GPT-3 eval­u­a­tion due to data con­t­a­m­i­na­tion (“We omit the 4 Wikipedi­a-re­lated tasks in that work be­cause they are en­tirely con­tained in our train­ing data, and we also omit the one-bil­lion word bench­mark due to a high frac­tion of the dataset be­ing con­tained in our train­ing set.”).

    LAMBADA was benchmarked at a GPT-2 perplexity of 8.6, and a GPT-3 perplexity of 3.0 (zero-shot) / 1.92 (few-shot). OpenAI claims in their GPT-2 blog post (but not the paper) that human perplexity is 1–2, but provides no sources and I couldn't find any. (The authors might be guessing based on how LAMBADA was constructed: examples were filtered by whether two independent human raters provided the same right answer, which lower-bounds how good humans must be at predicting the answer.)

    So overall, it looks like the best guess is that GPT-3 continues to have somewhere around twice the absolute error of a human. This implies it will take a large (yet far from impossible) amount of compute to fully close the remaining gap under the current scaling laws. If we irresponsibly extrapolate out the WebText scaling curve further, and assume GPT-3 has twice the error of a human at its current WebText perplexity of 1.73 (so humans are ~0.86), then we need 2.57 · (3.64·10³ · x)^−0.048 = 0.86, which gives x ≈ 2.2e6, or 2,200,000× the compute of GPT-3. (This would roughly equal the cost to the USA of invading Iraq.)
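
    A quick back-of-the-envelope check of that extrapolation (a minimal sketch: 2.57 & −0.048 are the WebText scaling-law fit, 3.64·10³ is GPT-3's training compute in petaflop/s-days, and x is the required multiple of GPT-3's compute):

```python
# Solve 2.57 * (3.64e3 * x)**(-0.048) = 0.86 for x, the multiple of GPT-3's
# training compute needed to reach an assumed 'human' WebText loss of ~0.86.
import math

a, b = 2.57, -0.048      # WebText scaling-law fit: loss = a * C**b (C in petaflop/s-days)
gpt3_compute = 3.64e3    # GPT-3's training compute in petaflop/s-days
human_loss = 0.86        # assumed: roughly half of GPT-3's 1.73

x = (human_loss / a) ** (1 / b) / gpt3_compute
print(f"x ≈ {x:.2e} × GPT-3's compute")              # ≈ 2.2e6
print(f"≈ {math.log2(x):.0f} doublings of compute")  # ≈ 21–22 doublings
```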

    When is that fea­si­ble?

    If we imag­ine that , then 2.2e6 would be 22 dou­blings away—or 6.3 years, in 2027. Most peo­ple be­lieve that com­pute trend must break down soon, and that sort of pre­dic­tion is a good rea­son why!

    Go­ing the other di­rec­tion, Her­nan­dez & Brown 2020’s es­ti­mate is that, net of hard­ware & al­go­rith­mic pro­gress, the cost of a fixed level of per­for­mance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it’ll cost $2.5m around mid-2021, and so on. Sim­i­lar­ly, a GPT-human re­quir­ing 2.2e6× more com­pute would pre­sum­ably cost on the or­der of $10 tril­lion in 2020, but after 14 halv­ings (18 years) would cost $1b in 2038.↩︎

  19. As of De­cem­ber 2020, half a year lat­er, al­most no re­searcher has been will­ing to go on record as say­ing what spe­cific ca­pa­bil­i­ties they pre­dict fu­ture 1t, 10t, or 100t mod­els will have or not have, and at what size which miss­ing ca­pa­bil­i­ties will emerge—just as no one is on record suc­cess­fully pre­dict­ing GPT-2 or GPT-3’s spe­cific ca­pa­bil­i­ties.↩︎

  20. Par­tic­u­larly ; I’m not sure about Shane Leg­g’s , al­though given the ac­cu­racy of his while found­ing Deep­Mind, he prob­a­bly has­n’t much changed his views that AI will be em­pow­ered by the (re­al­ized) ex­po­nen­tial com­pute gains or his . (This is con­sis­tent with the lat­est Metac­u­lus fore­casts.)↩︎

  21. When faced with the choice be­tween hav­ing to ad­mit all their fancy hard work is a dead­-end, swal­low the bit­ter lesson, and start bud­get­ing tens of mil­lions of com­pute, or in­stead writ­ing a dis­dain­ful tweet ex­plain­ing how, “ac­tu­ally, GPT-3 shows that scal­ing is a dead end, it’s an en­vi­ron­men­tal cat­a­stro­phe, and it’s just im­i­ta­tion in­tel­li­gence any­way”—most peo­ple will get busy on the tweet!↩︎

  22. A mix­ture-of-ex­pert model like GShard or an em­bed­ding like Dy­nam­icEm­bed­ding is not com­pa­ra­ble to ‘dense’ mod­els like GPT-3, as it’s al­ways been cheap & easy to train mod­els with bil­lions of ‘pa­ra­me­ters’ in some sense, like ex­tremely large em­bed­dings; how­ev­er, these pa­ra­me­ters do lit­tle, and are more like a few hun­dred shal­low mod­els glued back­-to-back. They prob­a­bly do not learn the same in­ter­est­ing things that a dense model would with the same nom­i­nal pa­ra­me­ter count.↩︎

  23. Strik­ing­ly, as of 2020, this is still true: eg the only deep learn­ing re­search I have seen done on were & . (In dou­ble-check­ing Arx­iv, I did find one non-STEM pa­per us­ing Sum­mit re­sources: , fo­cus­ing on sys­tems en­gi­neer­ing in train­ing a video clas­si­fi­ca­tion mod­el.)↩︎

  24. Now that GPT-3's few-shot and have begun to make people like Gary Marcus feel slightly nervous about WinoGrande, they have begun finding reasons why Winograd schemas aren't good measures of commonsense reasoning/intelligence (because intelligence, of course, is whatever AI can't do yet).↩︎

  25. Feyn­man: “There are sev­eral ref­er­ences to pre­vi­ous flights; the ac­cep­tance and suc­cess of these flights are taken as ev­i­dence of safe­ty. But ero­sion and blowby are not what the de­sign ex­pect­ed. They are warn­ings that some­thing is wrong. The equip­ment is not op­er­at­ing as ex­pect­ed, and there­fore there is a dan­ger that it can op­er­ate with even wider de­vi­a­tions in the un­ex­pected and not thor­oughly un­der­stood way. The fact that this dan­ger did not lead to cat­a­stro­phe be­fore is no guar­an­tee that it will not the next time, un­less it is com­pletely un­der­stood.”↩︎

  26. Don’t wor­ry: we al­ready have short­-shorts & ear- to hedge against fur­sona in­fla­tion. That said, we ad­vise tak­ing a large po­si­tion in equineties im­age macro funds to ben­e­fit from a flight to qual­ity and herd­ing: it’ll be a bear mar­ket for kinky bond­s—and that’s no bull.↩︎

  27. Some in­ter­est­ing ref­er­ences on vi­ral evo­lu­tion:

    ↩︎