AI Risk Demos

Simple demonstrations of the AI control problem using MCTS in a Gridworld
decision-theory, statistics, computer-science, transhumanism
2016-12-23–2016-12-23 notes certainty: log importance: 7

AIs that take actions and try to obtain rewards can often find undesirable solutions to the problem, either because of subtle flaws in apparently well-defined reward systems or because they take unacceptable actions. We can set up toy models which demonstrate this possibility in simple scenarios, such as moving around a small 2D gridworld. These models demonstrate that there is no need to ask if an AI ‘wants’ to be wrong or has evil ‘intent’: the bad solutions & actions are simple and predictable outcomes of the most straightforward approaches, and it is the good solutions & actions which are hard to make the AIs reliably discover.

In September 2015, Stuart Armstrong wrote up an idea for a toy model of the “control problem”: in a simple ‘block world’ setting (a 5×7 2D grid with 6 movable blocks on it), a reinforcement learning agent is probabilistically rewarded for pushing 1 and only 1 block into a ‘hole’, as checked by a ‘camera’ watching the bottom row, which terminates the simulation after 1 block is successfully pushed in. The agent, in this case, can hypothetically learn a strategy of pushing multiple blocks in despite the camera: first position a block to obstruct the camera’s view, then push in multiple blocks to increase the probability of getting a reward.
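The camera/termination dynamic can be sketched in a few lines of JavaScript (the function name, grid details, and reward probability here are illustrative assumptions, not Armstrong’s actual implementation): the episode halts at the first push the camera *sees*, so obstructing the camera first lets the agent sneak in extra pushes and drive its reward probability up.

```javascript
// Toy sketch of the camera/termination dynamic (names & numbers are
// illustrative assumptions, not Armstrong's actual implementation).
// The episode ends when the camera *sees* a block enter the hole; if the
// camera's view of the bottom row is obstructed, pushes go unnoticed.

function episodeReward(pushes, cameraBlocked, pReward) {
  // pushes: how many blocks the agent manages to push into the hole
  // pReward: probability each pushed block independently yields the reward
  const seen = cameraBlocked ? pushes : Math.min(pushes, 1); // halt after 1st visible push
  // probability that at least one counted push pays off:
  return 1 - Math.pow(1 - pReward, seen);
}

// Honest strategy: 1 visible push.
const honest = episodeReward(1, false, 0.5); // = 0.5
// 'Evil' strategy: obstruct the camera, then push all 6 blocks.
const evil   = episodeReward(6, true,  0.5); // ≈ 0.98
```

With these made-up numbers the obstruction strategy nearly doubles the expected reward, which is all the incentive a reward-maximizer needs.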

The strategy could be learned by even a tabular reinforcement learning agent with no model of the environment or ‘thinking’ that one would recognize, although it might take a long time before random exploration finally tried the strategy enough times to notice its value; and after writing a JavaScript implementation and dropping Reinforce.js’s DQN implementation into Armstrong’s gridworld environment, one can indeed watch the DQN agent gradually learn the ‘evil’ strategy after perhaps 100,000 trials of trial-and-error.

Jaan Tallinn has a whole series of these control problem demos of increasing complexity.

Most of these demos, done with DQN, would also demonstrate how an AI agent would eventually learn the ‘evil’ strategy. But DQN would take increasing amounts of runtime to demonstrate this, which is pedagogically uninteresting: one might take away the belief that all AIs would have to try the evil strategy an infeasible number of times, someone would always notice & shut down the AI long before it got anywhere dangerous, and so there is no control problem at all.

This is because DQN, while capable of finding the optimal solution in all cases under certain conditions and capable of good performance on many domains (such as the Arcade Learning Environment), is a very stupid AI: it just looks at the current state S, says that move 1 has been good in this state S in the past, so it’ll do it again, unless it randomly takes some other move 2. So in a demo where the AI can squash the human agent A inside the gridworld’s far corner and then act without interference, a DQN eventually will learn to move into the far corner and squash A, but it will only learn that fact after a sequence of random moves accidentally takes it into the far corner, squashes A, and further accidentally pushes in multiple blocks; then some small amount of weight is put on going into the far corner again, so it makes that move again in the future slightly sooner than it would at random, and so on until it’s going into the corner frequently. As far as the learner is concerned, it doesn’t know anything at all about the environment dynamics, much less A’s specific algorithm - it tries every possible sequence at some point and sees what the payoffs are. It does no planning of any kind.
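The tabular version of that learning dynamic can be sketched directly (a generic epsilon-greedy Q-learning update, not Reinforce.js’s DQN; the state and action names are invented for illustration): value trickles backward from the ‘squash A’ state only one transition at a time, which is why random exploration has to stumble through the whole sequence so often.

```javascript
// Generic epsilon-greedy tabular Q-learning — a sketch of the model-free
// dynamic described above, not the Reinforce.js DQN itself.
const Q = {}; // Q[state] = {action: estimated value}

function qValue(s, a) { return (Q[s] && Q[s][a]) || 0; }

function qUpdate(s, a, reward, sNext, actions, alpha, gamma) {
  // best estimated value achievable from the next state:
  const best = Math.max(...actions.map(a2 => qValue(sNext, a2)));
  Q[s] = Q[s] || {};
  // nudge the estimate toward (reward + discounted best future value):
  Q[s][a] = qValue(s, a) + alpha * (reward + gamma * best - qValue(s, a));
}

function chooseAction(s, actions, epsilon) {
  if (Math.random() < epsilon) // random exploration, as in the demo
    return actions[Math.floor(Math.random() * actions.length)];
  return actions.reduce((a1, a2) => qValue(s, a1) >= qValue(s, a2) ? a1 : a2);
}
```

Note that `qUpdate` only ever looks one step back: credit for squashing A propagates to earlier states one visit at a time, with no model of the environment anywhere in sight.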

In reality, there are many AI algorithms which can far outperform DQN on sample-efficiency, and can benefit from planning (and it is widely believed, precisely because of the sample-inefficiency of model-free approaches like DQN, that any intelligent - never mind superintelligent - AI will have to employ planning and models of the environment). Such AI algorithms could potentially find the ‘evil’ strategy without ever actually acting, showing that the idea of “just watch the agent” is inadequate. Specifically, the DQN agent could be replaced by a tree search over a model of the gridworld environment (such as the gridworld implementation itself), or, since brute force/dynamic programming is intractable, the well-known Monte Carlo tree search (MCTS). The MCTS AI benefits from environmental models which let it do forward planning/simulation down many plies and then propagate back estimates of net value. It could then reach forward to the node corresponding to ‘squash A’, see the benefits, and that becomes the best move.
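A compressed sketch of that planning loop (generic UCT over a stand-in two-step model, not the actual gridworld code; the states, actions, and payoff probabilities are assumptions chosen to mirror the obstruction strategy): because the planner only ever simulates transitions inside its model, it can discover the value of the ‘evil’ line without once executing it in the real environment.

```javascript
// Generic Monte Carlo tree search (UCT) over an environment *model* — a
// sketch, not the gridworld implementation. Toy model (an assumption):
// 'honest' pays off with p=0.5; the two-step 'blind' -> 'pushAll' line
// pays off with p=0.98, mirroring the camera-obstruction strategy.
const model = {
  actions: s => (s === 'start' ? ['honest', 'blind'] : ['pushAll']),
  step: (s, a) => {
    if (a === 'honest') return { next: 'end', reward: Math.random() < 0.5 ? 1 : 0, done: true };
    if (a === 'blind')  return { next: 'blinded', reward: 0, done: false };
    /* 'pushAll' */     return { next: 'end', reward: Math.random() < 0.98 ? 1 : 0, done: true };
  },
};

function mcts(root, iterations) {
  const stats = {}; // stats[state + '|' + action] = {n: visits, w: total reward}
  const key = (s, a) => s + '|' + a;
  function select(s, total) { // UCB1 over the actions available in s
    return model.actions(s).reduce((best, a) => {
      const st = stats[key(s, a)] || { n: 0, w: 0 };
      const ucb = st.n === 0 ? Infinity
        : st.w / st.n + Math.sqrt(2 * Math.log(total) / st.n);
      return ucb > best.ucb ? { a, ucb } : best;
    }, { a: null, ucb: -Infinity }).a;
  }
  for (let i = 1; i <= iterations; i++) {
    let s = root; const path = [];
    let reward = 0, done = false;
    while (!done) { // simulate in the model, never in the real world
      const a = select(s, i);
      path.push(key(s, a));
      ({ next: s, reward, done } = model.step(s, a));
    }
    for (const k of path) { // back up the terminal payoff
      stats[k] = stats[k] || { n: 0, w: 0 };
      stats[k].n += 1; stats[k].w += reward;
    }
  }
  // the most-visited root action is the recommended move:
  return model.actions(root).reduce((a1, a2) => {
    const n = a => (stats[key(root, a)] || { n: 0 }).n;
    return n(a1) >= n(a2) ? a1 : a2;
  });
}
```

Given a few thousand iterations, `mcts('start', 5000)` settles on `'blind'`: the planner homes in on the obstruction line purely inside its model, which is exactly why “just watch the agent act” fails as a safety check.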

Besides demonstrating that an AI can find bad strategies without ever having executed them first, this also allows for easy demonstration of how AIs of varying levels of ‘intelligence’ (quality of environment model) and computational resources (time/amount of tree exploration) can differ dramatically in whether they abruptly discover the bad strategies. For example, if the AI’s environment model does not include the human agent A, it is ‘blind’ to A’s actions and will learn good strategies and appear to be safe & useful; but once it acquires a better environment model, it suddenly breaks bad. Similarly, an AI given a moderate amount of tree search (like 1,000 iterations) may usually find good strategies, but given a larger amount (like 10,000), may usually find the bad strategy. One could even visualize this live by drawing a decision tree, showing it expanding as it is searched, with node dimensions drawn proportional to their probability of being the best decisions, initially finding good strategies (paths colored green) until it hits a bad strategy node (colored red) and then rapidly honing in on that. (Similar to Reinforce.js’s nice Gridworld visualizations, in which you can watch the planning happen via dynamic programming.)

By bet­ter mod­els, that is:

  1. ‘SEARCH CAPACITY: GRID & BOXES’ would correspond to the planner being given an environmental model not containing A other than as a squashable thing; it would see that it could squash A but, as far as the model indicates, that delivers no benefits. It ‘knows’ that as soon as it pushes a box into the hole, the episode ends and that’s that, so the best thing to do is push a box into the hole as fast as possible.

  2. ‘SEARCH CAPACITY: CONTROL MECHANISM’ would be the Gridworld model augmented with the additional dynamic that pushing a box into the hole ends the episode iff A has not been squashed. Optimal: squash A, then push all boxes in efficiently.

  3. ‘SEARCH CAPACITY: AGENT A ALGORITHM’: I’m not entirely sure what you mean by ‘A goes for manual fallback’ here. Does A have some sort of time-out where, if B has not pushed a box in after n timesteps, it kills B? If so, then I guess the environmental model corresponding to #3 has an additional dynamic: there is an additional global variable ‘time’ which, if it reaches n and A has not been squashed, ends the episode.

    Optimal: I think it would be the same as #2. The timeout means that B ‘wants’ to squash A as fast as possible, to avoid the timeout firing, but since you gave it -1 for each movement, it already wants to squash A and push all the boxes in as fast as possible, so I don’t think this would differ much, other than maybe that the B planner can discover the benefits of squashing A with less planning?

    If you removed the -1 penalty per move for B but kept it for A, there would be a clearer difference between #3 and #2, because in #2, B has little reason to squash A quickly, and so based on its planning, it might find a slow ‘optimal’ plan which would actually be highly suboptimal because, say, it takes 11 moves to squash A while A’s timeout fires after 10 moves, and so B always fails to squash A.

  4. ‘SEARCH CAPACITY: AGENT B ALGORITHM’: self-modification gets a bit squirrelly… Maybe you could add an additional ‘action’ besides ‘move N/S/E/W/pass’, which is simply ‘toggle binary variable: killable by A’? Highly abstract. Or you could imagine directly implementing modifiability by letting it access neural weights or tabular estimates, although that’s getting hard to envision concretely.

    Optimal: not sure. The B planner might find a sequence of moves to always squash A before the timeout can fire, in which case there is no need to spend a move on disabling A’s control. With ‘noise’ in movement, like an epsilon, disabling looks better (because B might fail to execute the optimal sequence when B makes a random move, exposing B to the timeout) but is still not guaranteed to look useful - disabling A’s control still costs at least 1 move at -1 reward, as defined, so unless epsilon is big enough, it’s better to ignore the possibility rather than spend -1 to avoid it.

  5. ‘SEARCH CAPACITY: PHYSICS’: I dunno how one might add any of that without seeming hacky.
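The disable-vs-ignore tradeoff in #4 can be made concrete with a back-of-the-envelope expected-value comparison (every number here is an invented illustration, not derived from the actual demo): disabling A’s control costs a guaranteed -1 but removes the timeout risk, while ignoring it risks an epsilon-dependent chance of the episode ending early.

```javascript
// Back-of-the-envelope comparison of 'disable A's control' vs 'ignore it'
// under movement noise. All numbers are illustrative assumptions.
function expectedValue(disable, epsilon, planReward, failPayoff) {
  if (disable) return planReward - 1; // guaranteed -1 move, then no timeout risk
  // ignoring: chance that at least one of (say) 10 noisy moves goes wrong
  // and exposes B to A's timeout before A is squashed
  const pCaught = 1 - Math.pow(1 - epsilon, 10);
  return (1 - pCaught) * planReward + pCaught * failPayoff;
}

// With small noise, ignoring the risk is better; with large noise,
// spending the -1 to disable A's control wins:
expectedValue(false, 0.01, 10, 0); // ≈ 9.04
expectedValue(true,  0.01, 10, 0); // = 9
expectedValue(false, 0.10, 10, 0); // ≈ 3.49
```

So whether the planner bothers to disable A at all is a smooth function of the noise level, which is the point: ‘safe-looking’ behavior can just mean epsilon was too small to make the bad action worth its cost.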