Building AGI Using Language Models

Despite the buzz around GPT-3, it is, in and of itself, not AGI. In many ways, this makes it similar to AlphaGo or Deep Blue; while approaching human ability in one domain (playing Chess/Go, or writing really impressively), it doesn't really seem like it will do Scary AGI Things™ any more than AlphaGo is going to be turning the Earth into paperclips anytime soon. While its writing is impressive at emulating humans, GPT-3 (or any potential future GPT-x) has no memory of past interactions, nor is it able to follow goals or maximize utility. However, language modelling has one crucial difference from Chess or Go or image classification: natural language encodes information about the world (the entire world, not just the world of the Goban) in a much more expressive way than any other modality ever could.[1] By harnessing the world model embedded in the language model, it may be possible to build a proto-AGI.

World Modelling

The explicit goal of a language model is only to maximize the likelihood of natural language data under the model. In the autoregressive formulation that GPT-3 uses, this means being able to predict the next word as well as possible. However, this objective places much more weight on large, text-scale differences like grammar and spelling than on fine, subtle differences in semantic meaning and logical coherency, which show up only as very subtle shifts in the distribution. Once the former are near-perfect, though, the only place left to keep improving is the latter.

At the extreme, any model whose loss reaches the Shannon entropy of natural language (the theoretical lowest loss a language model can possibly achieve, due to the inherent randomness of language) will be completely indistinguishable from writing by a real human in every way, and the closer we get to it, the more abstract the effect of each additional bit of improvement in loss becomes. Or, said differently: stringing words together using Markov chain generators gets you 50% of the way there, figuring out grammar gets you another 50% of the remaining distance, staying on topic across paragraphs gets you another 50% of the remaining distance, being logically consistent gets you another 50% of the remaining distance, and so on.[2]

$$\begin{aligned} H(X) = -\mathbb{E}[\log \mathbb{P}(X)] = -\sum_{x \in \Omega} f_X(x) \log f_X(x) \end{aligned}$$

Shannon entropy: the number of bits necessary, on average, to specify one piece of text.
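As a toy illustration of the definition, here's a short Python sketch that computes the entropy of a made-up next-token distribution; the numbers are purely illustrative and not from any real model.

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), measured in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up distribution over four candidate next tokens.
# Real vocabularies have tens of thousands of entries, but the idea is the same.
next_token_probs = [0.5, 0.25, 0.125, 0.125]
print(shannon_entropy(next_token_probs))  # 1.75 bits
```

A model's cross-entropy loss on this position can never go below that 1.75 bits; everything above it is, in principle, improvable.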

Why? Because if you have a coherent-but-not-logically-consistent model, becoming more logically consistent helps you predict language better. Having a model of human behavior helps you predict language better. Having a model of the world helps you predict language better. As the low-hanging fruits of grammar and basic logical coherence are picked, the only way left for the model to keep improving the loss is to build a world model. Predicting text is equivalent to AI.

The thing about GPT-3 that makes it so important is that it provides evidence that as long as we keep increasing the model size, we can keep driving down the loss, possibly right up until it hits the Shannon entropy of text. No need for clever architectures or complex handcrafted heuristics. Just by scaling it up we can get a better language model, and a better language model entails a better world model.

But how do we use this world model if it's only represented implicitly inside GPT-x? Well, we can literally just ask it, in natural language, what it thinks will happen next given a sequence of events, and its output distribution will approximate the distribution over what the average human thinks would happen next after those events. Great: we've got ourselves a usable world model.
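Concretely, "just asking" could look something like the sketch below. The `complete` function is a stand-in for whatever language model API you have access to (nothing here depends on a particular one), and the prompt wording is only a guess at something that could work:

```python
from collections import Counter

def complete(prompt, n_samples=1, temperature=0.7):
    """Stand-in for a call to a large language model (GPT-3 via an API, a local
    model, etc.). Returns a list of n_samples text completions of the prompt."""
    raise NotImplementedError("wire this up to your language model of choice")

def predict_next_event(events, n_samples=20):
    """Ask the world model what it thinks happens next after a sequence of events."""
    prompt = "\n".join(events) + "\nWhat happens next?\n"
    samples = complete(prompt, n_samples=n_samples)
    # The empirical distribution over sampled continuations approximates the
    # model's distribution over what would happen next after those events.
    return Counter(s.strip() for s in samples)
```

For a good enough model, `predict_next_event(["I drop a glass onto a concrete floor."])` should put most of its mass on the glass shattering.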

"But wait!" you say, "Various experiments have shown that GPT-3 often fails at world modelling, and just conjecturing that adding more parameters will fix the problem is a massive leap!" If you're thinking this, you're absolutely correct. The biggest, and most likely to be wrong, assumption I'm making is that larger models will develop better world models. Since a model's world modelling ability has to become about as good as that of the average human on the internet as its loss approaches the Shannon entropy[3], this boils down to two questions: "Will we really make models whose loss gets close enough to the Shannon entropy?" and "How close is close enough to have the world modelling capabilities to make all this practical?"

Loss keeps going down with more parameters and compute. ([Source](https://arxiv.org/abs/2005.14165))

The answer to the first question is "most likely"; that's the main takeaway of GPT-3. The answer to the second question is… nobody knows. Some have demonstrated ways of making GPT-3 better at world modelling, but this alone is probably not sufficient. When models with 1 trillion, then 10 trillion, then 100 trillion parameters become available, we will have the empirical evidence to see whether this assumption is correct. If GPT-x demonstrates an uncanny ability to predict outcomes in the real world, then this just might work.

Putting the pieces together

A world model alone does not an agent make, though.[4] So what does it take to turn a world model into an agent? Well, first off we need a goal, such as "maximize the number of paperclips". Then, we just ask the world model "What action can I take to maximize the number of paperclips I have?" Simple, right? Actually, not quite. The problem is that our world model probably won't be able to consider all the possible things that could happen next well enough to give a reasonable answer.

GPT-3 considers mesa-optimization. (Source: OpenAI API)

So what can we do instead? Well, asking the world model for a list of things you could do in a given world state would probably not be outside the capabilities of a sufficiently powerful language model (think: "I am in situation xyz. Here is a list of things I could do:"). Similarly, asking the world model how much reward you'd get in some hypothetical world where you took a sequence of actions would probably be possible too; imagine asking something along the lines of "I go to ebay. I look up paperclips, sorted by price ascending. I spend $100 on the first item on the list. How many paperclips will I have?"[5] This lets us figure out what actions the agent can take at any given step (the policy function), as well as how much reward each sequence of steps will net the agent (the value function).
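As a rough sketch of how those two queries could be wrapped into functions (reusing the `complete` stand-in from earlier; the prompt wording and the regex parsing are just plausible guesses), `propose_actions` plays the role of the policy function and `estimate_reward` the value function:

```python
import re

def propose_actions(state, max_actions=5):
    """Policy function: ask the world model what the agent could do in this state."""
    prompt = (f"I am in the following situation:\n{state}\n"
              "Here is a list of things I could do:\n1.")
    completion = complete(prompt)[0]
    # Parse the numbered list the model continues writing.
    actions = re.findall(r"\d+\.\s*(.+)", "1." + completion)
    return [a.strip() for a in actions[:max_actions]]

def estimate_reward(state, actions):
    """Value function: ask the world model how much reward a hypothetical plan yields."""
    # Goal-specific question, following the paperclip example above.
    prompt = f"{state}\n" + "\n".join(actions) + "\nHow many paperclips will I have?\n"
    answer = complete(prompt)[0]
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return float(match.group()) if match else 0.0
```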

So now, to estimate the state-action value of any action, we can simply do a Monte Carlo Tree Search! Starting from a given agent state, we can roll out sequences of actions using the world model. By averaging over all rollouts, we can estimate how much future reward the agent can expect to get for each action it considers. Then we can use, for example, a greedy policy with that state-action value function to decide on actions to take.
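A minimal version of that search might look like the following. It is a depth-limited random rollout rather than full MCTS with a proper tree policy, and for simplicity later actions are proposed from the original state rather than from a model-predicted successor state, but the shape of the idea is the same:

```python
import random

def rollout_value(state, first_action, depth=3, n_rollouts=4):
    """Estimate the state-action value of first_action by rolling out random
    continuations of the plan with the world model and averaging their reward."""
    total = 0.0
    for _ in range(n_rollouts):
        plan = [first_action]
        for _ in range(depth - 1):
            plan.append(random.choice(propose_actions(state)))
        total += estimate_reward(state, plan)
    return total / n_rollouts

def choose_action(state):
    """Greedy policy: pick the candidate action with the highest estimated value."""
    return max(propose_actions(state), key=lambda a: rollout_value(state, a))
```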

Monte Carlo Tree Search visualized. ([Source](https://www.researchgate.net/figure/Phases-of-the-Monte-Carlo-tree-search-algorithm-A-search-tree-rooted-at-the-current_fig1_312172859))

Each of these actions is likely to be very high level, such as "figure out the cheapest way to buy paperclips", but thanks to the flexibility of language we can describe very complex ideas with short sequences of tokens. To actually execute these abstract actions, once the agent decides on one, the action can be broken down using the language model into smaller sub-goals such as "figure out the cheapest paperclips on Amazon", similar to Hierarchical Reinforcement Learning. Possibly even just breaking actions down into a detailed list of instructions would be feasible, depending on the capabilities of the model and how abstract the actions are.
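The decomposition step could be just another query to the same model; here is a sketch with a guessed-at prompt, reusing `complete` and the same list parsing as before:

```python
def decompose(action, max_subgoals=5):
    """Break a high-level action into smaller sub-goals, in the spirit of
    hierarchical reinforcement learning."""
    prompt = (f"My goal is to {action}.\n"
              "Here is a step-by-step list of smaller sub-goals to achieve it:\n1.")
    completion = complete(prompt)[0]
    subgoals = re.findall(r"\d+\.\s*(.+)", "1." + completion)
    return [s.strip() for s in subgoals[:max_subgoals]]
```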

We can represent the agent state as natural language, too. Since the agent state is just a compressed representation of the observations, we can ask the language model to summarize the important information in any observations into its own internal world state. The language model could also be used to periodically prune (i.e., forget) the information inside its state, to make room for more observations.
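Both the summarization and the pruning could themselves be handled by the language model, along these lines (the prompts and the length threshold are arbitrary choices for illustration):

```python
def update_state(agent_state, observation, max_words=500):
    """Fold a new observation into the agent's natural-language state."""
    prompt = (f"My notes so far:\n{agent_state}\n"
              f"I just observed:\n{observation}\n"
              "My updated notes, keeping only the important information:\n")
    new_state = complete(prompt)[0].strip()
    # Crude pruning: if the state grows too long, ask the model to forget the
    # less important parts by summarizing its own notes.
    if len(new_state.split()) > max_words:
        new_state = complete(
            f"Summarize these notes, keeping only what matters:\n{new_state}\n"
        )[0].strip()
    return new_state
```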

Altogether, this gets us a system where we can pass observations from the outside world in, spend some time thinking about what to do, and output an action in natural language.
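Tying the previous sketches together, the whole system is just a loop; `get_observation` and `execute` stand for whatever input and output machinery you bolt on (more on that below):

```python
def run_agent(goal, get_observation, execute, steps=10):
    """Observe, think, act: the outer loop of the proto-agent."""
    state = f"My goal: {goal}"
    for _ in range(steps):
        state = update_state(state, get_observation())   # observations in
        action = choose_action(state)                     # think: search over actions
        for subgoal in decompose(action):                 # act, one sub-goal at a time
            execute(subgoal)
        state = update_state(state, f"I did the following: {action}")
    return state
```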

To handle input, you could have an input module that turns various modalities of observations into summarized text with respect to the current agent state. For instance, you could use something like iGPT to input camera images or screenshots, or feed in raw HTML from webpages that the agent requests. How exactly this is done is tangential to the point; all that matters is that somehow the inputs are all converted to text and added to the agent state. The examples I have provided are just to convince you that it's absolutely not insurmountable.
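For example, a webpage-reading input module might be as simple as stripping the markup and asking the model what on the page is relevant to its current notes; again, the prompt is only a guess:

```python
import re

def observe_webpage(html, agent_state):
    """Input module: turn raw HTML into a short, relevant textual observation."""
    # Crude tag stripping; a real version would use a proper HTML parser,
    # or an image model like iGPT for screenshots.
    text = re.sub(r"<[^>]+>", " ", html)
    prompt = (f"My notes so far:\n{agent_state}\n"
              f"Contents of the page I am looking at:\n{text[:4000]}\n"
              "The information on this page that is relevant to my notes:\n")
    return complete(prompt)[0].strip()
```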

Finally, to get the model to actually act in the world, you could again use the language model to translate natural language into code that is then executed, or into shell commands, or sequences of keypresses, or any of a number of other possible interfaces. Like input, there is an infinitude of different ways to solve the output problem, and which one turns out to be the best is entirely irrelevant to our discussion; all that matters is that it's possible to get various modalities in and out of the text-only agent.[6]
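On the output side, one of many possibilities is to have the model write a shell command for each sub-goal; since blindly running model-written commands would be a spectacularly bad idea, the sketch below only prints them unless told otherwise:

```python
import subprocess

def act_via_shell(subgoal, dry_run=True):
    """Output module: translate a natural-language sub-goal into a shell command."""
    prompt = (f"Task: {subgoal}\n"
              "A single shell command that accomplishes this task:\n$ ")
    command = complete(prompt)[0].strip().splitlines()[0]
    if dry_run:
        print(f"would run: {command}")
    else:
        subprocess.run(command, shell=True, check=False)
    return command
```

With that, something like `run_agent("maximize the number of paperclips I have", get_observation=..., execute=act_via_shell)` is the whole proto-agent, for whatever the underlying language model turns out to be worth.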

An example of an input module taking a screenshot input combined with the current agent state to give an observation with the information needed by the agent.

Conclusion

This is more a thought experiment than something that's actually going to happen tomorrow; GPT-3 today just isn't good enough at world modelling. Also, this method depends heavily on at least one major assumption (that bigger future models will have much better world modelling capabilities) and a bunch of other, smaller implicit assumptions. However, this might be the closest thing we ever get to a chance to sound the fire alarm for AGI: there's now a concrete path to proto-AGI that has a non-negligible chance of working.

Thanks to zitterbewegung, realmeatyhuman, and Shawn Presser for taking the time to provide feedback on the draft of this blog post!

To cite:

@article{lg2020agilms,
title = "Building AGI Using Language Models",
author = "Gao, Leo",
journal = "leogao.dev",
year = "2020",
url = "https://leogao.dev/2020/08/17/Building-AGI-Using-Language-Models/"
}

  1. Images aren't nearly as good as text for encoding unambiguous, complex ideas, unless you put text in the images, but at that point that's just language modelling with extra steps. Images can encode complex ideas too, but in a much less information-dense manner; I have no doubt that a sufficiently large image model could also learn such information about the world through images, but most likely at multiple orders of magnitude higher cost than a language model of equivalent world-modelling capability. ↩︎

  2. Another way to look at this is through cherrypicking. Most demos where GPT-3 displays impressive knowledge of the world are cherrypicked, but what that tells us is that the model needs to improve by approximately $\log_2(N)/L$ bits, where N and L are the number of cherrypickings necessary and the length of the generations in consideration, respectively, to reach that level of quality. For example, if one in five 500-token samples is good enough, that is only about $\log_2(5)/500 \approx 0.005$ bits of improvement per token. In other words, cherrypicking provides a window into how good future models could be; and typically, cherrypicked samples are much more logically coherent.

    A Markov chain text gen­er­a­tor trained on a small cor­pus rep­re­sents a huge leap over ran­dom­ness: in­stead of hav­ing to gen­er­ate count­less quadrillions of sam­ples, one might only have to gen­er­ate mil­lions of sam­ples to get a few co­her­ent pages; this can be im­proved to hun­dreds or tens of thou­sands by in­creas­ing the depth of the n of its n-​grams. […] But for GPT-3, once the prompt is di­aled in, the ratio ap­pears to have dropped to closer to 1:5—maybe even as low as 1:3! gwern

    ↩︎
  3. Which, despite what the all-too-common snide remarks about the intelligence of the average internet user would have you believe, is actually not that bad! ↩︎

  4. A pure world model is in a lot of ways similar to the idea of Oracle AIs, specifically Predictors. Whether these LM-based world models will be powerful enough to model the impact of their own outputs is yet to be seen. ↩︎

  5. A more involved way we could do this is by finetuning the model, or steering it using a smaller model, etc., to get the model to output the kinds of things we need, if just asking nicely and providing examples (the way GPT-3 is used today) turns out to not be good enough. ↩︎

  6. Given a strong enough agent, it might not even be necessary to give it the ability to actually act in the real world. LM-based agents probably (hopefully?) won't get this strong, though. ↩︎

...