Why Tool AIs Want to Be Agent AIs

AIs limited to pure computation (Tool AIs) supporting humans will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) which act on their own and meta-learn, because all problems are reinforcement-learning problems.
decision-theory, statistics, NN, computer-science, transhumanism, AI, Bayes, insight-porn
2016-09-07–2018-08-28 finished certainty: likely importance: 9

Autonomous AI systems (Agent AIs) trained using reinforcement learning can do harm when they take wrong actions, especially superintelligent Agent AIs. One solution would be to eliminate their agency by not giving AIs the ability to take actions, confining them to purely informational or inferential tasks such as classification or prediction (Tool AIs), and have all actions be approved & executed by humans, giving equivalently superintelligent results without the risk.

I argue that this is not an effective solution for two major reasons. First, Agent AIs will by definition be better at actions than Tool AIs, giving an economic advantage. Second, Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as long-term memories or external software or large databases or the Internet, and how best to acquire new data. All of these actions will result in Agent AIs more intelligent than Tool AIs, in addition to their greater economic competitiveness. Thus, Tool AIs will be inferior to Agent AIs in both actions and intelligence, implying use of Tool AIs is an even more highly unstable equilibrium than previously argued, as users of Agent AIs will be able to outcompete them on two dimensions (and not just one).

One proposed solution to AI risk is to suggest that AIs could be limited purely to supervised/unsupervised learning, and not given access to any sort of capability that can directly affect the outside world such as robotic arms. In this framework, AIs are treated purely as mathematical functions mapping data to an output such as a classification probability, similar to a logistic or linear model but far more complex; most deep learning neural networks like ImageNet image classification convolutional neural networks (CNNs) would qualify. The gains from AI then come from training the AI and then asking it many questions whose answers humans then review & implement in the real world as desired. So an AI might be trained on a large dataset of chemical structures labeled by whether they turned out to be a useful drug in humans and asked to classify new chemical structures as useful or non-useful; then doctors would run the actual medical trials on the drug candidates and decide whether to use them in patients etc. Or an AI might look like Google Maps/Waze: it answers your questions about how best to drive places better than any human could, but it does not control any traffic lights country-wide to optimize traffic flows nor will it run a self-driving car to get you there. This theoretically avoids any possible runaway of AIs into malignant or uncaring actors who harm humanity by satisfying dangerous utility functions and developing instrumental drives. After all, if they can’t take any actions, how can they do anything that humans do not approve of?

Two vari­a­tions on this lim­it­ing or box­ing theme are

  1. Oracle AI: Nick Bostrom, in Superintelligence (pg145–158), notes that while oracle AIs can be easily ‘boxed’ and in some cases like P/NP problems the answers can be cheaply checked or random subsets expensively verified, there are several issues with oracle AIs:

    • the AI’s defi­n­i­tion of ‘resources’ or ‘stay­ing inside the box’ can change as it learns more about the world (on­to­log­i­cal crises)
    • responses might manipulate users into asking easy (and useless) problems
    • making changes in the world can make it easier to answer questions about it, by simplifying or controlling it (“All processes that are stable we shall predict. All processes that are unstable we shall control.”)
    • even a successfully boxed and safe oracle or tool AI can be misused[1]
  2. Tool AI (the term, as “tool mode” or “tool AGI”, was coined by Holden Karnof­sky in a July 2011 dis­cus­sion & elab­o­rated on in a May 2013 essay, but the idea has prob­a­bly been pro­posed before). To quote Karnof­sky:

    Google Map­s—by which I mean the com­plete soft­ware pack­age includ­ing the dis­play of the map itself—­does not have a “util­ity” that it seeks to max­i­mize. (One could fit a util­ity func­tion to its actions, as to any set of actions, but there is no sin­gle “para­me­ter to be max­i­mized” dri­ving its oper­a­tions.)

    Google Maps (as I under­stand it) con­sid­ers mul­ti­ple pos­si­ble routes, gives each a score based on fac­tors such as dis­tance and likely traffic, and then dis­plays the best-s­cor­ing route in a way that makes it eas­ily under­stood by the user. If I don’t like the route, for what­ever rea­son, I can change some para­me­ters and con­sider a differ­ent route. If I like the route, I can print it out or email it to a friend or send it to my phone’s nav­i­ga­tion appli­ca­tion. Google Maps has no sin­gle para­me­ter it is try­ing to max­i­mize; it has no rea­son to try to “trick” me in order to increase its util­i­ty. In short, Google Maps is not an agent, tak­ing actions in order to max­i­mize a util­ity para­me­ter. It is a tool, gen­er­at­ing infor­ma­tion and then dis­play­ing it in a user-friendly man­ner for me to con­sid­er, use and export or dis­card as I wish.

    Every soft­ware appli­ca­tion I know of seems to work essen­tially the same way, includ­ing those that involve (spe­cial­ized) arti­fi­cial intel­li­gence such as Google Search, Siri, Wat­son, Rybka, etc. Some can be put into an “agent mode” (as Wat­son was on Jeop­ardy) but all can eas­ily be set up to be used as “tools” (for exam­ple, Wat­son can sim­ply dis­play its top can­di­date answers to a ques­tion, with the score for each, with­out speak­ing any of them.)…Tool-AGI is not “trapped” and it is not Unfriendly or Friend­ly; it has no moti­va­tions and no dri­ving util­ity func­tion of any kind, just like Google Maps. It scores differ­ent pos­si­bil­i­ties and dis­plays its con­clu­sions in a trans­par­ent and user-friendly man­ner, as its instruc­tions say to do; it does not have an over­ar­ch­ing “want,” and so, as with the spe­cial­ized AIs described above, while it may some­times “mis­in­ter­pret” a ques­tion (thereby scor­ing options poorly and rank­ing the wrong one #1) there is no rea­son to expect inten­tional trick­ery or manip­u­la­tion when it comes to dis­play­ing its results.

    …An­other way of putting this is that a “tool” has an under­ly­ing instruc­tion set that con­cep­tu­ally looks like: “(1) Cal­cu­late which action A would max­i­mize para­me­ter P, based on exist­ing data set D. (2) Sum­ma­rize this cal­cu­la­tion in a user-friendly man­ner, includ­ing what Action A is, what likely inter­me­di­ate out­comes it would cause, what other actions would result in high val­ues of P, etc.” An “agent,” by con­trast, has an under­ly­ing instruc­tion set that con­cep­tu­ally looks like: “(1) Cal­cu­late which action, A, would max­i­mize para­me­ter P, based on exist­ing data set D. (2) Exe­cute Action A.” In any AI where (1) is sep­a­ra­ble (by the pro­gram­mers) as a dis­tinct step, (2) can be set to the “tool” ver­sion rather than the “agent” ver­sion, and this sep­a­ra­bil­ity is in fact present with most/all mod­ern soft­ware. Note that in the “tool” ver­sion, nei­ther step (1) nor step (2) (nor the com­bi­na­tion) con­sti­tutes an instruc­tion to max­i­mize a para­me­ter—to describe a pro­gram of this kind as “want­ing” some­thing is a cat­e­gory error, and there is no rea­son to expect its step (2) to be decep­tive…This is impor­tant because an AGI run­ning in tool mode could be extra­or­di­nar­ily use­ful but far more safe than an AGI run­ning in agent mode. In fact, if devel­op­ing “Friendly AI” is what we seek, a tool-AGI could likely be help­ful enough in think­ing through this prob­lem as to ren­der any pre­vi­ous work on “Friend­li­ness the­ory” moot.

    …Is a tool-AGI pos­si­ble? I believe that it is, and fur­ther­more that it ought to be our default pic­ture of how AGI will work

    There are sim­i­lar gen­eral issues with Tool AIs as with Ora­cle AIs:

    • a human checking each result is no guarantee of safety; even Homer nods. An extremely dangerous or subtly dangerous answer might slip through; Stuart Armstrong notes that the summary may simply not mention the important (to humans) downside to a suggestion, or frame it in the most attractive light possible. The more a Tool AI is used, or trusted by users, the less checking will be done of its answers before the user mindlessly implements them.
    • an intelligent, never mind superintelligent, Tool AI will have built-in search processes and planners which may be quite intelligent themselves, and which, in ‘planning how to plan’, may discover dangerous instrumental drives that the sub-planning process then executes.[2]
    • devel­op­ing a Tool AI in the first place might require another AI, which itself is dan­ger­ous

Ora­cle AIs remain mostly hypo­thet­i­cal because it’s unclear how to write such util­ity func­tions. The sec­ond approach, Tool AI, is just an extrap­o­la­tion of cur­rent sys­tems but has two major prob­lems aside from the already iden­ti­fied ones which cast doubt on Karnof­sky’s claims that Tool AIs would be “extra­or­di­nar­ily use­ful” & that we should expect future AGIs to resem­ble Tool AIs rather than Agent AIs.


First and most commonly pointed out, agent AIs are more economically competitive as they can replace tool AIs (as in the case of YouTube upgrading from to [3]) or ‘humans in the loop’.[4] In any sort of process, Amdahl’s law notes that as steps get optimized, the optimization does less and less as the output becomes dominated by the slowest step: if a step only takes 10% of the time or resources, then even infinite optimization of that step down to zero time/resources cuts the total time/resources by no more than 10% (a speedup of at most ~1.11×). So if a human overseeing a, say, high-frequency trading (HFT) algorithm accounts for 50% of the latency in decisions, then the HFT algorithm will never run more than twice as fast as it does now, which is a crippling disadvantage. (Hence, the debacle is not too surprising: no profitable HFT firm could afford to put too many humans into its loops, so when something does go wrong, it can be difficult for humans to figure out the problem & intervene before the losses mount.) As the AI gets better, the gain from replacing the human increases greatly, and may well justify replacing them with an AI inferior in many other respects but superior in some key aspect like cost or speed.
This could also apply to error rates. In airline accidents, human error now causes the overwhelming majority of accidents due to humans’ presence as overseers of the autopilots, and it’s unclear that a human pilot represents a net safety gain. In ‘advanced chess’, grandmasters initially chose most moves and used the chess AI for checking for tactical errors and blunders; through the late ’90s and early ’00s this transitioned to human players (not even grandmasters) turning over most playing to the chess AI but contributing a great deal of win performance by picking & choosing which of several AI-suggested moves to use. As the chess AIs improved, at some point around 2007 victories increasingly came from the humans making mistakes which the opposing chess AI could exploit, even mistakes as trivial as ‘misclicks’ (on the computer screen); now in advanced chess, human contribution has decreased to largely preparing the chess AIs’ opening books & looking for novel opening moves which their chess AI can be better prepared for.
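The bottleneck arithmetic behind the 10% and 50% figures above (commonly known as Amdahl's law) can be checked in a few lines. This is a minimal sketch of the standard formula, not code from any system discussed here; the function name `amdahl_speedup` and its parameter names are illustrative assumptions.

```python
import math


def amdahl_speedup(fraction_optimized: float, step_speedup: float) -> float:
    """Overall speedup when a step that takes `fraction_optimized` of the
    total time is made `step_speedup` times faster (math.inf = free)."""
    if not 0.0 <= fraction_optimized <= 1.0:
        raise ValueError("fraction_optimized must be in [0, 1]")
    if step_speedup <= 0:
        raise ValueError("step_speedup must be positive")
    # Time left over: the untouched part plus the shrunken optimized part.
    remaining = (1.0 - fraction_optimized) + fraction_optimized / step_speedup
    return math.inf if remaining == 0 else 1.0 / remaining


# A step that is 10% of the total, optimized down to zero cost.
small_step = amdahl_speedup(0.10, math.inf)
# Everything except a human who accounts for 50% of latency, made free.
human_bound = amdahl_speedup(0.50, math.inf)
print(f"{small_step:.3f}x, {human_bound:.1f}x")
```

The second call is the HFT case: however fast the algorithm becomes, a human holding half the latency caps the whole loop at a 2x speedup.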

At some point, there is not much point to keep­ing the human in the loop at all since they have lit­tle abil­ity to check the AI choices and become ‘deskilled’ (think GPS direc­tions), cor­rect­ing less than they screw up and demon­strat­ing that tool­ness is no guar­an­tee of safety nor respon­si­ble use. (Hence the old joke: “the fac­tory of the future will be run by a man and a dog; the dog will be there to keep the man away from the fac­tory con­trols.”) For a suc­cess­ful autonomous pro­gram, just keep­ing up with growth alone makes it diffi­cult to keep humans in the loop; the US drone war­fare pro­gram has become such a cen­tral tool of US war­fare that the US Air Force finds it extremely diffi­cult to hire & retain enough human pilots over­see­ing its drones, and there are indi­ca­tions that oper­a­tional pres­sures are slowly erod­ing the human con­trol & turn­ing them into rub­ber­stamps, and for all its protes­ta­tions that it would always keep a human in the deci­sion-mak­ing loop, the Pen­ta­gon is, unsur­pris­ing­ly, inevitably, slid­ing towards fully autonomous drone war­fare as the next tech­no­log­i­cal step to main­tain mil­i­tary supe­ri­or­ity over Rus­sia & Chi­na. (See “Meet The New Mav­er­icks: An Inside Look At Amer­i­ca’s Drone Train­ing Pro­gram”; “Future is assured for death-deal­ing, life-sav­ing drones”; “Sam Alt­man’s Man­i­fest Des­tiny”; “The Pen­tagon’s ‘Ter­mi­na­tor Conun­drum’: Robots That Could Kill on Their Own”; “Attack of the Killer Robots”)

Fun­da­men­tal­ly, autonomous agent AIs are what we and the free mar­ket want; every­thing else is a sur­ro­gate or irrel­e­vant loss func­tion. We don’t want low log-loss error on Ima­geNet, we want to refind a par­tic­u­lar per­sonal pho­to; we don’t want excel­lent advice on which stock to buy for a few microsec­onds, we want a money pump spit­ting cash at us; we don’t want a drone to tell us where Osama bin Laden was an hour ago (but not now), we want to have killed him on sight; we don’t want good advice from Google Maps about what route to drive to our des­ti­na­tion, we want to be at our des­ti­na­tion with­out doing any dri­ving etc. Idio­syn­cratic sit­u­a­tions, legal reg­u­la­tion, fears of tail risks from very bad sit­u­a­tions, wor­ries about cor­re­lated or sys­tem­atic fail­ures (like hack­ing a drone fleet), and so on may slow or stop the adop­tion of Agent AIs—but the pres­sure will always be there.

So for this reason alone, we expect Agent AIs to be systematically preferred over Tool AIs unless they’re considerably worse.


Agent AIs will also be chosen over Tool AIs because Tool AIs, quite aside from not being what anyone wants and being severely penalized by free markets (or by any competitive scenario in which multiple agents choose whether to use a Tool AI or an Agent AI), also suffer from the problem that the best Tool AI’s performance/intelligence will be equal to or worse than the best Agent AI, probably worse, and possibly much worse. Bostrom notes that “Such ‘creative’ [dangerous] plans come into view when the [Tool AI] software’s cognitive abilities reach a sufficiently high level.”; we might reverse this to say that to make the Tool AI reach a sufficiently high level, we must put such creativity in view. (A linear model may be extremely safe & predictable, but it would be hopeless to expect everyone to use them instead of neural networks.)

An Agent AI clearly ben­e­fits from being a bet­ter Tool AI, so it can bet­ter under­stand its envi­ron­ment & inputs; but less intu­itive­ly, any Tool AI ben­e­fits from agen­ti­ness. An Agent AI has the poten­tial, often real­ized in prac­tice, to out­per­form any Tool AI: it can get bet­ter results with less com­pu­ta­tion, less data, less man­ual design, less post-pro­cess­ing of its out­puts, on harder domains.

(Triv­ial proof: Agent AIs are super­sets of Tool AIs—an Agent AI, by not tak­ing any actions besides com­mu­ni­ca­tion or ran­dom choice, can reduce itself to a Tool AI; so in cases where actions are unhelp­ful, it per­forms the same as the Tool AI, and when actions can help, it can per­form bet­ter; hence, an Agent AI can always match or exceed a Tool AI. At least, assum­ing suffi­cient data that in the envi­ron­ments where actions are not help­ful, it can learn to stop act­ing, and in the ones where they are, it has a dis­tant enough hori­zon to pay for the explo­ration. Of course, you might agree with this but sim­ply believe that intel­li­gence-wise, Agent AIs == Tool AIs.)

Every suffi­ciently hard prob­lem is a rein­force­ment learn­ing prob­lem.

More seri­ous­ly, not all data is cre­ated equal. Not all data points are equally valu­able to learn from, require equal amounts of com­pu­ta­tion, should be treated iden­ti­cal­ly, should inspire iden­ti­cal fol­lowup data sam­pling, or actions. Infer­ence and learn­ing can be much more effi­cient if the algo­rithm can choose how to com­pute on what data with which actions.

There is no hard Cartesian boundary such that control of the environment is irrelevant to the algorithm and vice-versa and its computation can be carried out without regard to the environment; there are simply many layers between the core of the algorithm and the furthest part of the environment, and the more layers that the algorithm can model & control, the more it can do. Consider Google Maps/Waze.[5] On the surface they are ‘merely’ Tool AIs which produce lists of possible routes which would optimize certain requirements; but the entire point of such Tool AIs (and all large-scale Tool AIs) is that countless drivers will act on them (what’s the point of getting driving directions if you don’t then drive?), and this will greatly change traffic patterns as drivers become appendages of the ‘Tool’ AI, potentially making driving in an area much worse through their errors or myopic per-driver optimization (and far from being a theoretical curiosity, GPS, Google Maps, and Waze are regularly accused of that in many places, especially Los Angeles).

This is a highly general point which can be applied on many levels. This point often arises in classical statistics/decision theory where adaptive techniques can greatly outperform fixed-sample techniques for both inference and actions/losses (eg in numerical integration): a sequential trial testing a hypothesis can often terminate after a fraction of the equivalent fixed-sample trial’s sample size (and/or loss); an adaptive multi-armed bandit will have much lower regret than any non-adaptive solution, but it will also be inferentially better at estimating which arm is best and what the performance of that arm is (see the ‘best-arm problem’: Audibert et al 2010, Gabillon et al 2011, Mellor 2014, Jamieson & Nowak 2014), and an adaptive design can, by a constant factor (gains of 50% or more are possible compared to naive designs like even allocation; McClelland 1997), minimize total variance by focusing on unexpectedly difficult-to-estimate arms (while a fixed-sample trial can be seen as ideal for when one values precise estimates of all arms equally and they have equal variance, which is usually not the case); even a design fixed in advance rather than simple randomization can be seen as reflecting this benefit (avoiding the potential for imbalance in allocation across arms by deciding in advance the sequence of ‘actions’ taken in collecting samples). Another example comes from the ‘power of two choices’ in queueing, where selecting the best of 2 possible queues to wait in rather than selecting 1 queue at random improves the expected maximum delay from O(log n / log log n) to O(log log n) instead (and interestingly, almost all the gain comes from being able to make any choice at all, going from 1 to 2; choosing from 3 or more queues adds only some constant-factor gains).
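The two-queue example at the end of the paragraph above can be illustrated with a small simulation. This is only a sketch under a simplifying assumption: queues are modeled as static bins receiving one job each in turn (the usual balls-into-bins idealization), with no service or departures. The names `max_load` and `average_max_load` are made up for this illustration.

```python
import random


def max_load(n_bins: int, n_jobs: int, choices: int, rng: random.Random) -> int:
    """Place each job in the least-loaded of `choices` randomly picked bins
    and return the longest resulting queue."""
    loads = [0] * n_bins
    for _ in range(n_jobs):
        candidates = [rng.randrange(n_bins) for _ in range(choices)]
        best = min(candidates, key=loads.__getitem__)  # the one 'action'
        loads[best] += 1
    return max(loads)


def average_max_load(n: int, choices: int, trials: int = 20, seed: int = 0) -> float:
    """Average the longest queue over several seeded trials with n bins and n jobs."""
    rng = random.Random(seed)
    return sum(max_load(n, n, choices, rng) for _ in range(trials)) / trials


for d in (1, 2, 3):
    print(d, "choice(s):", average_max_load(2000, d))
```

Running it should show a large drop in the longest queue going from 1 choice to 2, and a much smaller further drop from 2 to 3, which is the qualitative pattern the paragraph describes.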

The wide variety of uses of action is a major theme in recent work in AI (specifically, deep learning/neural networks) research and increasingly key to achieving the best performance on inferential tasks as well as reinforcement learning/optimization/agent-y tasks. Although these advantages apply to most AI paradigms, because of the power and wide variety of tasks NNs get applied to, and their sophisticated architectures, we can see the pervasive advantage of agentiness much more clearly than in narrower contexts like biostatistics.

Actions for intelligence

Rough­ly, we can try to cat­e­go­rize the differ­ent kinds of agen­ti­ness by the ‘level’ of the NN they work on. There are:

  1. actions inter­nal to a com­pu­ta­tion:

    • inputs
    • inter­me­di­ate states
    • access­ing the exter­nal ‘envi­ron­ment’
    • amount of com­pu­ta­tion
    • enforc­ing constraints/finetuning qual­ity of out­put
    • chang­ing the loss func­tion applied to out­put
  2. actions inter­nal to train­ing the NN:

    • the gra­di­ent itself
    • size & direc­tion of gra­di­ent descent steps on each para­me­ter
    • over­all gra­di­ent descent learn­ing rate and learn­ing rate sched­ule
    • choice of data sam­ples to train on
  3. inter­nal to the dataset

    • active learn­ing
    • opti­mal exper­i­ment design
  4. inter­nal to the NN design step

    • hyper­pa­ra­me­ter opti­miza­tion
    • NN archi­tec­ture
  5. inter­nal to inter­ac­tion with envi­ron­ment

    • adap­tive exper­i­ment / mul­ti­-armed ban­dit / explo­ration for rein­force­ment learn­ing

Actions internal to a computation

Inside a spe­cific NN, while com­put­ing the out­put for an input ques­tion, a NN can make choices about how to han­dle it.

It can choose what parts of the input to run most of its computations on, while throwing away or computing less on other parts of the input which are less relevant to the output, using “attention mechanisms” (eg Olah & Carter 2016, Bellver et al 2016, Xu 2015, Larochelle & Hinton 2010, Mnih et al 2014, Kaiser & Bengio 2016). Attention mechanisms are responsible for many increases in performance, but especially improvements in RNNs’ ability to do sequence-to-sequence translation by revisiting important parts of the sequence, image generation and captioning, and in CNNs’ ability to recognize images by focusing on ambiguous or small parts of the image, even for adversarial examples. They are a major trend in deep learning, as it is often the case that some parts of the input are more important than others, and attention enables both global & local operations to be learned, with increasingly too many examples of attention to list (with a trend as of 2018 towards using attention as the major or only construct).
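To make "choosing which parts of the input to compute on" concrete, here is a minimal sketch of one common soft-attention form, scaled dot-product attention, in plain Python. It is an illustration only and not the mechanism of any particular paper cited above; the function names and the toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query.

    Returns (weights, output): the weights sum to 1, and the output is the
    weighted average of the value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Two input positions; the query matches the first key far better, so
# almost all of the weight (and the output) comes from the first value.
weights, output = attend([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]])
print(weights, output)
```

The weights play the role of a soft 'action': rather than picking one input position, the network spreads a fixed budget of attention across positions, which keeps the choice differentiable.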

Many designs can be interpreted as using attention. The bidirectional RNN also often used in natural language translation doesn’t explicitly use attention mechanisms but is believed to help by giving the RNN a second look at the sequence. Indeed, so universal that it often goes without mention is that the LSTM/GRU mechanism which improves almost all RNNs is itself a kind of attention mechanism: the LSTM cells learn which parts of the hidden state/history are important and should be kept, and whether and when the memories should be forgotten and fresh memories loaded into the LSTM cells. While LSTM RNNs are the default for sequence tasks, they have occasionally been beaten by feedforward neural networks using internal attention or “self-attention”, like the Transformer architecture (eg Vaswani et al 2017).

Extending attention, a NN can choose not just which parts of an input to look at multiple times, but also how long to keep computing on it, “adaptive computation” (eg Silver et al 2016b, Teerapittayanon et al 2017): so it iteratively spends more computation on hard parts of a problem within a given computational budget.[6] Neural ODEs are an interesting example of models which are sort of like adaptive RNNs in that they can be run repeatedly by the ODE solver, adaptively, to refine their output to a target accuracy, and the ODE solver can be considered a kind of agent as well.
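The control pattern described above (keep refining until the answer is good enough or the budget runs out) can be sketched without any neural network at all. The example below uses Newton's iteration for square roots purely as a stand-in for a learned refinement step; it is an analogy for the halting rule, not an implementation of any cited adaptive-computation method, and `adaptive_sqrt` is a hypothetical name.

```python
def adaptive_sqrt(x: float, tolerance: float = 1e-9, max_steps: int = 50):
    """Refine an estimate of sqrt(x) until the relative residual is within
    `tolerance` or the step budget is exhausted.

    Returns (estimate, steps_used)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0, 0
    estimate = max(x, 1.0)  # crude initial guess
    steps = 0
    # Halting rule: stop when confident enough OR out of budget.
    while steps < max_steps and abs(estimate * estimate - x) > tolerance * x:
        estimate = 0.5 * (estimate + x / estimate)  # one refinement step
        steps += 1
    return estimate, steps


for x in (1.0, 2.0, 1e6):
    value, used = adaptive_sqrt(x)
    print(x, value, used)
```

Easy inputs halt almost immediately while harder ones consume more of the budget, so total computation tracks difficulty instead of being fixed in advance.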

Attention generally doesn’t change the nature of the computation aside from the necessity of actions over the input, but actions can be used to bring in different computing paradigms. For example, the entire field of “differentiable neural computer”/“neural Turing machines” or “neural stack machines” or “neural GPUs” or most designs with some sort of scalable external memory mechanism larger than LSTMs depends on figuring out a clever way to backpropagate through the action of memory accesses or using reinforcement learning techniques like REINFORCE for training the non-differentiable actions. And such a memory is like a database which is constructed on the fly per-problem, so it’ll help with database queries & information retrieval & knowledge graphs. An intriguing variant on this idea of ‘querying’ resources is mixture-of-experts NN architectures (Shazeer et al 2016). Jeff Dean (Google Brain) asks where we should use RL techniques in our OSes, networks, and computations these days, and answers that RL should be used for: program placement on servers (Mirhoseini et al 2018), Bloom filters for databases, search query candidates, compiler settings, quantum computer control, datacenter & server cooling controllers… Dean asks “Where Else Could We Use Learning?”, and replies:

Any­where We’re Using Heuris­tics To Make a Deci­sion!

  • Com­pil­ers: instruc­tion sched­ul­ing, reg­is­ter allo­ca­tion, loop nest par­al­leliza­tion strate­gies, …
  • Net­work­ing: TCP win­dow size deci­sions, back­off for retrans­mits, data com­pres­sion, …
  • Operating systems: process scheduling, buffer cache insertion/replacement [eg Lagar-Cavilla et al 2019], file system prefetching [or memory allocation], …
  • Job scheduling systems: which tasks/VMs to co-locate on same machine, which tasks to pre-empt, …
  • ASIC design: test case selection, …

Any­where We’ve Punted to a User-Tun­able Per­for­mance Option! Many pro­grams have huge num­bers of tun­able com­mand-line flags, usu­ally not changed from their defaults (--eventmanager_threads=16 --bigtable_scheduler_batch_size=8 --mapreduce_merge_memory=134217728 --lexicon_cache_size=1048576 --storage_server_rpc_freelist_size=128 …)

Meta-learn every­thing. ML:

  • learn­ing place­ment deci­sions
  • learn­ing fast ker­nel imple­men­ta­tions
  • learn­ing opti­miza­tion update rules
  • learn­ing input pre­pro­cess­ing pipeline steps
  • learn­ing acti­va­tion func­tions
  • learn­ing model archi­tec­tures for spe­cific device types, or that are fast for infer­ence on mobile device X, learn­ing which pre-trained com­po­nents to reuse, …

Com­puter architecture/datacenter net­work­ing design:

  • learning best design properties by exploring design space automatically (via simulator)

Finally, one interesting variant on this theme is treating an inferential or generative problem as a reinforcement learning problem in a sort of environment with global rewards. Many times the standard loss function is inapplicable, or the important things are global, or the task is not really well-defined enough (in an “I know it when I see it” sense for the human) to nail down as a simple differentiable loss with predefined labels such as in an image classification problem; in these cases, one cannot do standard supervised training to minimize the loss but must start using reinforcement learning to directly optimize a reward, treating outputs such as classification labels as ‘actions’ which may eventually result in a reward. For example, in a char-RNN generative text model trained by predicting a character conditional on the previous, one can generate reasonable text samples by picking the most likely next character and occasionally a less likely character for diversity, but one can generate higher quality samples by exploring longer sequences with beam search or nucleus sampling, and one can improve generation further by adding utility functions for global properties & applying RL algorithms such as Monte Carlo tree search (MCTS) for training or runtime maximization of an overall trait like translation/summarization quality (sequence-to-sequence problems in general) or winning or program writing (eg Jaques et al 2016, Norouzi et al 2016, He et al 2016, Bello et al 2017, Lewis et al 2017). Most exotically, the loss function can itself be a sort of action/RL setting: consider the close connections between actor-critic reinforcement learning, synthetic gradients, and game-theory-based generative adversarial networks (GANs; Zhu et al 2017).
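The difference between always picking the most likely next character and sampling from a restricted set can be shown with a toy next-character distribution. The sketch below implements greedy choice and a simple form of nucleus (top-p) sampling; the probabilities are invented for illustration and no real language model is involved.

```python
import random


def greedy_choice(probs: dict) -> str:
    """Always pick the single most likely next token."""
    return max(probs, key=probs.get)


def nucleus_sample(probs: dict, top_p: float, rng: random.Random) -> str:
    """Sample from the smallest set of most-likely tokens whose total
    probability reaches `top_p`, ignoring the unlikely tail."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens = [token for token, _ in nucleus]
    weights = [p for _, p in nucleus]
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-character distribution after some prefix.
next_char = {"e": 0.6, "a": 0.3, "z": 0.1}
rng = random.Random(0)
print(greedy_choice(next_char))
print([nucleus_sample(next_char, 0.8, rng) for _ in range(10)])
```

With `top_p=0.8` the nucleus is {e, a}: samples vary between the two plausible characters but the unlikely "z" is never emitted. Each draw is still a local decision; optimizing a global property of the whole sequence is where the search and RL methods mentioned above come in.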

Actions internal to training

The training of a NN by gradient descent might seem to be independent of any considerations of ‘actions’, but it turns out to be another domain where you can go “what if we treated this as a reinforcement learning problem?” and it’s actually useful. Specifically, gradient descent requires selection of which data to put into a minibatch, how large a change to make to parameters in general based on the error in the current minibatch (the learning rate hyperparameter), or how much to update each individual parameter each minibatch (perhaps having some neurons which get tweaked much less than others). Actions are things like selecting 1 out of n possible minibatches to do gradient descent on, or selecting 1 out of n possible learning rates with the learning rate increasing/decreasing over time (eg Bello et al 2017, Fu et al 2016, Xu et al 2016, Jaderberg et al 2016, Fan et al 2016; prioritized traces, prioritized experience replay, boosting, hard-negative mining, prioritizing hard samples, learning internal normalizations).
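One of the simplest 'actions' listed above is choosing which samples go into the next minibatch. The sketch below draws a minibatch with probability proportional to each sample's most recent loss raised to a power `alpha`, loosely in the spirit of prioritized sampling; it is a simplified illustration (sampling with replacement, no importance-weight correction), and the function name and parameters are assumptions.

```python
import random


def prioritized_minibatch(losses, batch_size, alpha, rng):
    """Pick sample indices with probability proportional to loss ** alpha.

    alpha = 0 recovers uniform sampling; larger alpha focuses harder on
    high-loss ('hard') samples. Indices are drawn with replacement."""
    priorities = [(loss + 1e-8) ** alpha for loss in losses]
    return rng.choices(range(len(losses)), weights=priorities, k=batch_size)


# Nine easy samples and one hard sample (index 9).
recent_losses = [0.01] * 9 + [5.0]
rng = random.Random(0)
focused = prioritized_minibatch(recent_losses, 1000, alpha=1.0, rng=rng)
uniform = prioritized_minibatch(recent_losses, 1000, alpha=0.0, rng=rng)
print(focused.count(9), uniform.count(9))
```

In a real training loop the losses would be refreshed as the model improves, so the sampler keeps shifting effort to whatever is currently hardest.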

Actions internal to data selection

We have previously looked at sampling from existing datasets: training on hard samples, and so on. One problem with existing datasets is that they can be inefficient: perhaps they have class imbalance problems where some kinds of data are overrepresented and what is really needed for improved performance is more of the other kinds of data. An image classification CNN doesn’t need 99 dog photos & 1 cat photo, it wants 50 dog photos & 50 cat photos. (Quite aside from the fact that there’s not enough information to classify other cat photos based on just 1 exemplar, the CNN will simply learn to always classify photos as ‘dog’.) One can try to fix this by rebalancing the dataset, or by changing the loss function to make classifying the minority class correctly much more valuable than classifying the majority class.
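The loss-reweighting fix mentioned at the end of the paragraph above can be sketched for the 99-dog/1-cat example. Inverse-frequency class weights are one common choice, used here as an assumption rather than as the author's method; the helper names are illustrative.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes count more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_log_loss(labels, prob_of_true_label, weights):
    """Weighted average negative log-likelihood.

    prob_of_true_label[i] is the probability the model gave to the correct
    label of sample i."""
    total_weight = sum(weights[label] for label in labels)
    total = sum(-weights[label] * math.log(p)
                for label, p in zip(labels, prob_of_true_label))
    return total / total_weight


labels = ["dog"] * 99 + ["cat"]
# A lazy classifier that says 'dog' with 99% confidence for every photo.
lazy_probs = [0.99] * 99 + [0.01]

uniform = {"dog": 1.0, "cat": 1.0}
balanced = inverse_frequency_weights(labels)
print(weighted_log_loss(labels, lazy_probs, uniform))
print(weighted_log_loss(labels, lazy_probs, balanced))
```

Under uniform weights the always-'dog' classifier looks excellent; under the balanced weights its single cat mistake dominates the loss, which is the pressure needed to stop it from ignoring the minority class.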

Even better is if the NN can somehow ask for new data, be given additional/corrected data when it makes a mistake, or even create new data (possibly based on old data). This leads us to active learning: given possible additional datapoints (such as a large pool of unlabeled datapoints), the NN can ask for the datapoint which it will learn the most from (eg Islam 2016, Gal 2016). One could, for example, train a RL agent to query a search engine and select the most useful images/videos for learning a classification task (eg from YouTube). We can think of it as a little analogous to how kids[7] ask parents not random questions, but ones they’re most unsure about, with the most implications one way or another. Settles 2010 discusses the practical advantages to machine learning algorithms of careful choice of data points to learn from or ‘label’, and gives some of the known theoretical results on how large the benefits can be: on a toy problem, the number of samples needed to reach an error rate ε decreasing from O(1/ε) to O(log(1/ε)), or in a Bayesian setting, a decrease from O(d/ε) to O(d·log(1/ε)) (with d the VC dimension). Active learning also connects back, from a machine learning perspective, to some of the statistical areas covering the benefits of adaptive/sequential trials: optimal experiments query the most uncertain aspects, which the most can be learned from.
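The size of the gap between passive and active label collection can be seen in a standard toy setting, assumed here for illustration (the toy problem in the cited survey may differ in detail): learning a threshold on the interval [0, 1], where every point at or above the threshold is labeled positive. A passive learner labels random points; an active learner always queries the midpoint of the region it is still unsure about. Function names are hypothetical.

```python
import random


def active_queries(threshold: float, epsilon: float) -> int:
    """Labels needed to localize the threshold within epsilon by bisection."""
    low, high, queries = 0.0, 1.0, 0
    while high - low > epsilon:
        point = (low + high) / 2  # query where uncertainty is greatest
        queries += 1
        if point >= threshold:    # oracle says 'positive'
            high = point
        else:                     # oracle says 'negative'
            low = point
    return queries


def passive_queries(threshold: float, epsilon: float, rng: random.Random,
                    max_queries: int = 1_000_000) -> int:
    """Labels needed when the points to label are drawn uniformly at random."""
    low, high, queries = 0.0, 1.0, 0
    while high - low > epsilon and queries < max_queries:
        point = rng.random()
        queries += 1
        if point >= threshold:
            high = min(high, point)
        else:
            low = max(low, point)
    return queries


rng = random.Random(0)
print(active_queries(0.37, 0.001), passive_queries(0.37, 0.001, rng))
```

Bisection needs about log2(1/ε) labels (10 for ε = 0.001), while random labeling typically needs on the order of 1/ε, which is the logarithmic-versus-linear contrast described above.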

Actions internal to NN design

“I sus­pect that less than 10 years from now, all of the DL training/architecture tricks that came from the arXiv fire­hose over 2015–2019 will have been entirely super­seded by auto­mated search tech­niques. The future: no alche­my, just clean APIs, and quite a bit of com­pute.”

François Chollet, 2019-01-07

Moving on to more familiar territory, we have hyperparameter optimization, using random search or grid search or Bayesian optimization to try training a possible NN, observe interim and final performance, and look for better hyperparameters. But if “hyperparameters are parameters we don’t know how to learn yet”, then we can see the rest of neural network architecture design as being hyperparameters too: what is the principled difference between setting a dropout rate and setting the number of NN layers? Or between setting a learning rate schedule and the width of NN layers or the number of convolutions or what kind of pooling operators are used? There is none; they are all hyperparameters, just that usually we feel it is too difficult for hyperparameter optimization algorithms to handle many options, so we limit them to a small set of key hyperparameters and use “grad student descent” to handle the rest of the design. So… what if we used powerful algorithms (viz. neural networks) to design compiled code, neural activations, units like LSTMs, or entire architectures (Castronovo 2016, Ravi & Larochelle 2017, Anonymous 2017, Anonymous 2018, Gupta & Tan 2019, among others)?
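The point that architecture choices are just more hyperparameters can be sketched with plain random search over a single joint search space. Everything here is an illustrative assumption: the parameter names, their ranges, and above all `validation_loss`, which is a made-up stand-in for actually training a network and measuring its validation loss:

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; 'architecture' and 'hyperparameter' choices
    come from the same search space with no special distinction."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform
        "dropout": rng.uniform(0.0, 0.7),
        "n_layers": rng.randint(1, 12),
        "width": rng.choice([32, 64, 128, 256, 512]),
        "pooling": rng.choice(["max", "average"]),
    }

def validation_loss(config):
    """HYPOTHETICAL stand-in for 'train a NN with this config and report its
    validation loss'; the formula is invented so the example runs instantly."""
    return (
        (math.log10(config["learning_rate"]) + 3) ** 2
        + (config["dropout"] - 0.3) ** 2
        + 0.05 * (config["n_layers"] - 6) ** 2
        + 0.1 * abs(math.log2(config["width"]) - 7)
        + (0.0 if config["pooling"] == "max" else 0.1)
    )

def random_search(n_trials, seed=0):
    """Sample n_trials configurations and return the best one plus all trials."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=validation_loss), trials

best, trials = random_search(200)
print(best, validation_loss(best))
```

The optimizer never needs to know which keys a human would call “architecture”; replacing random search with a learned controller changes how configurations are proposed, not the shape of the problem.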

The logical extension of these “neural networks all the way down” papers is that an actor like Google/Baidu/Facebook/MS could effectively turn NNs into a black box: a user/developer uploads through an API a dataset of input/output pairs of a specified type and a monetary loss function, and a top-level NN running on a large GPU cluster starts autonomously optimizing over architectures & hyperparameters for the NN design which balances GPU cost and the monetary loss, interleaved with further optimization over the thousands of previously submitted tasks, sharing its learning across all of the datasets/loss functions/architectures/hyperparameters, and the original user simply submits future data through the API for processing by the best NN so far. (Google and Facebook have already taken steps toward this using distributed hyperparameter optimization services which benefit from transfer learning across tasks; Google Vizier/HyperTune, FBLearner Flow.)

Actions external to the agent

Finally, we come to actions in environments which aren’t purely virtual. Adaptive experiments, multi-armed bandits, reinforcement learning etc will outperform any purely supervised learning. For example, AlphaGo was first trained as a pure supervised-learning Tool AI, predicting next moves of human Go games in a dataset, but that was only a prelude to the self-play, which boosted it from professional player to superhuman level; aside from replacing loss functions (a classification loss like log loss vs victory), the AlphaGo NNs were able to explore tactics and positions that never appeared in the original human dataset. The rewards can also help turn an unsupervised problem (what is the structure or label of each frame of a video game?) into more of a semi-supervised problem by providing some sort of meaningful summary: the reward. A DQN Arcade Learning Environment (ALE) agent will, without any explicit image classification, learn to recognize & predict objects in a game which are relevant to achieving a high score.
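A minimal sketch of why adaptive allocation beats a fixed design, using a two-armed Bernoulli bandit. The arm payoff probabilities, the pull budget, and the choice of epsilon-greedy (picked only because it is the simplest adaptive rule, not because it is a good one) are all assumptions for illustration:

```python
import random

def run_uniform(arm_probs, n_pulls, rng):
    """Non-adaptive design: alternate between arms regardless of outcomes."""
    total = 0
    for t in range(n_pulls):
        arm = t % len(arm_probs)
        total += int(rng.random() < arm_probs[arm])
    return total

def run_epsilon_greedy(arm_probs, n_pulls, rng, epsilon=0.1):
    """Adaptive design: usually pull the arm with the best observed mean,
    exploring at random with probability epsilon (or while any arm is untried)."""
    pulls = [0] * len(arm_probs)
    wins = [0] * len(arm_probs)
    total = 0
    for _ in range(n_pulls):
        if rng.random() < epsilon or 0 in pulls:
            arm = rng.randrange(len(arm_probs))
        else:
            arm = max(range(len(arm_probs)), key=lambda a: wins[a] / pulls[a])
        reward = int(rng.random() < arm_probs[arm])
        pulls[arm] += 1
        wins[arm] += reward
        total += reward
    return total

arm_probs = [0.2, 0.8]  # hypothetical payoff rates, unknown to both designs
n_pulls = 5000
uniform_reward = run_uniform(arm_probs, n_pulls, random.Random(0))
adaptive_reward = run_epsilon_greedy(arm_probs, n_pulls, random.Random(0))
print(uniform_reward, adaptive_reward)
```

The fixed design spends half of its budget on the inferior arm forever; the adaptive one lets its current beliefs steer where the next sample comes from, which is the essential difference between an agent and a predictor fed a static dataset.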


So to put it con­crete­ly: CNNs with adap­tive com­pu­ta­tions will be com­pu­ta­tion­ally faster for a given accu­racy rate than fixed-it­er­a­tion CNNs, CNNs with atten­tion clas­sify bet­ter than CNNs with­out atten­tion, CNNs with focus over their entire dataset will learn bet­ter than CNNs which only get fed ran­dom images, CNNs which can ask for spe­cific kinds of images do bet­ter than those query­ing their dataset, CNNs which can trawl through Google Images and locate the most infor­ma­tive one will do bet­ter still, CNNs which access rewards from their user about whether the result was use­ful will deliver more rel­e­vant results, CNNs whose hyper­pa­ra­me­ters are auto­mat­i­cally opti­mized by an RL algo­rithm (and pos­si­bly trained directly by a NN) will per­form bet­ter than CNNs with hand­writ­ten hyper­pa­ra­me­ters, CNNs whose archi­tec­ture as well as stan­dard hyper­pa­ra­me­ters are designed by RL agents will per­form bet­ter than hand­writ­ten CNNs… and so on. (It’s actions all the way down.)

The drawback to all this is that the implementation difficulty is higher, the sample efficiency can be better or worse (individual parts will have greater sample-efficiency but data will be used up training the additional flexibility of other parts), and the computation requirements for training can be much higher; but the asymptotic performance is better, and the gap probably grows as GPUs & datasets get bigger and tasks get more difficult & valuable in the real world.

Why You Shouldn’t Be A Tool

Why does treat­ing all these lev­els as deci­sion or rein­force­ment learn­ing prob­lems help so much?

One answer is that most points are not near any decision boundary, or are highly predictable and contribute little information. These points need not be computed extensively, nor trained on much, nor collected further, so optimizing exploration can often lead to prediction/classification/inference gains. If a particular combination of variables is already being predicted with high accuracy (perhaps because it’s common), adding even an infinite number of additional samples will do little; one sample from an unsampled region far away from the previous samples may be dramatically informative. A model trained on purely supervised data collected from humans or experts may have huge gaping holes in its understanding, because most of its data will be collected from routine use and will not sample many regions of state-space, leading to well-known brittleness and bizarre extrapolations, caused by precisely the fact that the humans/experts avoid the dumbest & most catastrophic mistakes and those situations are not represented in the dataset at all! (Thus, a Tool AI might be ‘safe’ in the sense that it is not an agent, but very unsafe because it is dumb as soon as it goes outside of routine use.) Such flaws in the discriminative model would be exposed quickly in any kind of real world or competitive setting or by RL training.⁸ You need the right data, not more data. (“39. Re graphics: A picture is worth 10K words - but only those to describe the picture. Hardly any sets of 10K words can be adequately described with pictures.”)

Another answer is the “curse of dimensionality”: in many environments, the tree of possible actions and subsequent rewards grows exponentially, so any sequence of actions over more than a few timesteps is increasingly unlikely to ever be sampled, and sparse rewards will be increasingly unlikely to be observed. Even if an important trajectory is executed at random and a reward obtained, it will be equally unlikely to ever be executed again, whereas some sort of RL agent, whose beliefs affect its choice of actions, can sample the important trajectory repeatedly, and rapidly converge on an estimate of its high value and continue exploring more deeply.
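A back-of-the-envelope sketch of that exponential blowup, under deliberately simple assumptions: a uniformly random policy, a fixed number of actions per timestep, and exactly one rewarded action sequence. The choice of 4 actions is arbitrary:

```python
from fractions import Fraction

def random_hit_probability(n_actions, horizon):
    """Chance that a uniformly random policy executes one specific
    action sequence of the given length."""
    return Fraction(1, n_actions) ** horizon

def expected_episodes_until_hit(n_actions, horizon):
    """Mean of the geometric distribution: 1/p episodes until the first hit."""
    return 1 / random_hit_probability(n_actions, horizon)

for horizon in (1, 5, 10, 20):
    print(horizon, expected_episodes_until_hit(4, horizon))
```

Each additional timestep multiplies the expected wait by the number of actions, and a second random hit costs just as much as the first, because random sampling retains nothing about where the reward was found.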

A dataset of randomly generated sequences of robot arm movements intended to grip an object would likely include no rewards (successful grips) at all, because it requires a long sequence of finely calibrated arm movements; with no successes, how could the Tool AI learn to manipulate an arm? It must be able to make progress by testing its best arm movement sequence candidate, then learn from that and test the better arm movement, and so on, until it succeeds. Without any rewards or ability to home in on good actions, only the initial states will be observed and progress will be extremely slow compared to an agent who can take actions and explore novel parts of the environment (eg the problem of sparse-reward games in the Arcade Learning Environment: because of reward sparsity, an epsilon-greedy agent might as well not be an agent compared to some better method of exploring, like density-estimation-based exploration).

Or imagine training a Go program by creating a large dataset of randomly generated Go boards, then evaluating each possible move’s value by playing out a game between random agents from it; this would not work nearly as well as training on actual human-generated board positions which target the vanishingly small set of high-quality games & moves. An agent’s exploration homes in on the exponentially shrinking optimal area of the movement tree based on its current knowledge, discarding the enormous space of bad possible moves. In contrast, a Tool AI cannot lift itself up by its bootstraps. It merely gives its best guess on the static current dataset, and that’s that. If you don’t like the results, you can gather more data, but it probably won’t help that much because you’ll give it more of what it already has.

Hence, being a secret agent is much bet­ter than being a tool.

See Also

  1. Super­in­tel­li­gence, pg148:

    Even if the ora­cle itself works exactly as intend­ed, there is a risk that it would be mis­used. One obvi­ous dimen­sion of this prob­lem is that an ora­cle AI would be a source of immense power which could give a deci­sive strate­gic advan­tage to its oper­a­tor. This power might be ille­git­i­mate and it might not be used for the com­mon good. Another more sub­tle but no less impor­tant dimen­sion is that the use of an ora­cle could be extremely dan­ger­ous for the oper­a­tor her­self. Sim­i­lar wor­ries (which involve philo­soph­i­cal as well as tech­ni­cal issues) arise also for other hypo­thet­i­cal castes of super­in­tel­li­gence. We will explore them more thor­oughly in Chap­ter 13. Suffice it here to note that the pro­to­col deter­min­ing which ques­tions are asked, in which sequence, and how the answers are reported and dis­sem­i­nated could be of great sig­nifi­cance. One might also con­sider whether to try to build the ora­cle in such a way that it would refuse to answer any ques­tion in cases where it pre­dicts that its answer­ing would have con­se­quences clas­si­fied as cat­a­strophic accord­ing to some rough-and-ready cri­te­ria.

  2. Super­in­tel­li­gence, pg152–153, pg158:

    With advances in arti­fi­cial intel­li­gence, it would become pos­si­ble for the pro­gram­mer to offload more of the cog­ni­tive labor required to fig­ure out how to accom­plish a given task. In an extreme case, the pro­gram­mer would sim­ply spec­ify a for­mal cri­te­rion of what counts as suc­cess and leave it to the AI to find a solu­tion. To guide its search, the AI would use a set of pow­er­ful heuris­tics and other meth­ods to dis­cover struc­ture in the space of pos­si­ble solu­tions. It would keep search­ing until it found a solu­tion that sat­is­fied the suc­cess cri­te­ri­on…Rudi­men­tary forms of this approach are quite widely deployed today…A sec­ond place where trou­ble could arise is in the course of the soft­ware’s oper­a­tion. If the meth­ods that the soft­ware uses to search for a solu­tion are suffi­ciently sophis­ti­cat­ed, they may include pro­vi­sions for man­ag­ing the search process itself in an intel­li­gent man­ner. In this case, the machine run­ning the soft­ware may begin to seem less like a mere tool and more like an agent. Thus, the soft­ware may start by devel­op­ing a plan for how to go about its search for a solu­tion. The plan may spec­ify which areas to explore first and with what meth­ods, what data to gath­er, and how to make best use of avail­able com­pu­ta­tional resources. In search­ing for a plan that sat­is­fies the soft­ware’s inter­nal cri­te­rion (such as yield­ing a suffi­ciently high prob­a­bil­ity of find­ing a solu­tion sat­is­fy­ing the user-spec­i­fied cri­te­rion within the allot­ted time), the soft­ware may stum­ble on an unortho­dox idea. For instance, it might gen­er­ate a plan that begins with the acqui­si­tion of addi­tional com­pu­ta­tional resources and the elim­i­na­tion of poten­tial inter­rupters (such as human beings). Such “cre­ative” plans come into view when the soft­ware’s cog­ni­tive abil­i­ties reach a suffi­ciently high lev­el. 
    When the software puts such a plan into action, an existential catastrophe may ensue… The apparent safety of a tool-AI, meanwhile, may be illusory. In order for tools to be versatile enough to substitute for superintelligent agents, they may need to deploy extremely powerful internal search and planning processes. Agent-like behaviors may arise from such processes as an unplanned consequence. In that case, it would be better to design the system to be an agent in the first place, so that the programmers can more easily see what criteria will end up determining the system’s output.

  3. As the lead author put it, the benefit lies not simply in better prediction but in superior consideration of the downstream effects of all recommendations, which are ignored by predictive models: this produced “The largest single launch improvement in YouTube for two years” because “We can really lead the users toward a different state, versus recommending content that is familiar”.

  4. Super­in­tel­li­gence, pg151:

    It might be thought that by expand­ing the range of tasks done by ordi­nary soft­ware, one could elim­i­nate the need for arti­fi­cial gen­eral intel­li­gence. But the range and diver­sity of tasks that a gen­eral intel­li­gence could profitably per­form in a mod­ern econ­omy is enor­mous. It would be infea­si­ble to cre­ate spe­cial-pur­pose soft­ware to han­dle all of those tasks. Even if it could be done, such a project would take a long time to carry out. Before it could be com­plet­ed, the nature of some of the tasks would have changed, and new tasks would have become rel­e­vant. There would be great advan­tage to hav­ing soft­ware that can learn on its own to do new tasks, and indeed to dis­cover new tasks in need of doing. But this would require that the soft­ware be able to learn, rea­son, and plan, and to do so in a pow­er­ful and robustly cross-do­main man­ner. In other words, it would need gen­eral intel­li­gence. Espe­cially rel­e­vant for our pur­poses is the task of soft­ware devel­op­ment itself. There would be enor­mous prac­ti­cal advan­tages to being able to auto­mate this. Yet the capac­ity for rapid self­-im­prove­ment is just the crit­i­cal prop­erty that enables a seed AI to set off an intel­li­gence explo­sion.

  5. While Google Maps was used as a paradigmatic example of a Tool AI, it’s not clear how hard this can be pushed, even if we exclude the road system itself: Google Maps/Waze is, of course, trying to maximize something, namely traffic & ad revenue. Google Maps, like any Google property, is doubtless constantly running experiments on its users to optimize for maximum usage; its users are constantly feeding in data about routes & traffic conditions to Google Maps/Waze through the website interface & smartphone GPS/WiFi geographic logs; and to the extent that users act on the information (which many do blindly) & increase/decrease their use of Google Maps, Google Maps will get feedback after changing the real world (sometimes to the intense frustration of those affected, who try to manipulate it back)… Is Google Maps/Waze a Tool AI or a large-scale Agent AI?

    It is embedded in a real-world environment, it has a clear reward function in terms of website traffic, and it has a wide set of actions it continuously explores with randomization from various sources; even though it was designed to be a Tool AI, from an abstract perspective, one would have to consider it to have evolved into an Agent AI due to its commercial context and use in real-world actions, whether Google likes it or not. We might consider Google Maps to be a “secret agent”: it is not a Tool AI but an Agent AI with a hidden & highly opaque reward function. This is probably not an ideal situation.

  6. If the NN is trained to minimize error alone, it’ll simply spend as much time as possible on every problem; so a cost is imposed on each iteration to encourage it to finish as soon as it has a good answer, and learn to finish sooner. And how do we decide what costs to impose on the NN for deciding whether to loop another time or emit its current best guess as good enough? Well, that’ll depend on the cost of GPUs and the economic activity and the utility of results for the humans…

  7. Kyunghyun Cho, 2015:

    One ques­tion I remem­ber came from Tiele­man. He asked the pan­elists about their opin­ions on active learning/exploration as an option for effi­cient unsu­per­vised learn­ing. Schmid­hu­ber and Mur­phy respond­ed, and before I reveal their respon­se, I really liked it. In short (or as much as I’m cer­tain about my mem­o­ry), active explo­ration will hap­pen nat­u­rally as the con­se­quence of reward­ing bet­ter expla­na­tion of the world. Knowl­edge of the sur­round­ing world and its accu­mu­la­tion should be reward­ed, and to max­i­mize this reward, an agent or an algo­rithm will active explore the sur­round­ing area (even with­out super­vi­sion.) Accord­ing to Mur­phy, this may reflect how babies learn so quickly with­out much super­vis­ing sig­nal or even with­out much unsu­per­vised sig­nal (their way of active explo­ration com­pen­sates the lack of unsu­per­vised exam­ples by allow­ing a baby to col­lect high qual­ity unsu­per­vised exam­ples.)

  8. An example here might be the use of ‘ladders’ or ‘mirroring’ in Go: models trained in a purely supervised fashion on a dataset of Go games can have serious difficulty responding to a ladder or mirror because those strategies are so bad that no human would play them in the dataset. Once the Tool AI has been forced ‘off-policy’, its predictions & inferences may become garbage because it’s never seen anything like those states before; an agent will be better off because it’ll have been forced into them by exploration or adversarial training and have learned the proper responses. This sort of bad behavior leads to quadratically increasing regret with passing time: Ross & Bagnell 2010.