The Neural Net Tank Urban Legend

AI folklore tells a story about a neural network trained to detect tanks which instead learned to detect time of day; on investigation, this probably never happened.
NN, history, sociology, Google, bibliography
2011-09-20–2019-08-14 · finished · certainty: highly likely · importance: 4


A cautionary tale in artificial intelligence tells of researchers training a neural network (NN) to detect tanks in photographs, succeeding, only to realize the photographs had been collected under specific conditions for tanks/non-tanks and the NN had learned something useless like time of day. This story is often told to warn about the limits of algorithms and the importance of data collection to avoid "dataset bias"/"data leakage", where the collected data can be solved using algorithms that do not generalize to the true data distribution; but the tank story is usually never sourced.

I collate many extant versions dating back a quarter of a century to 1992, along with two NN-related anecdotes from the 1960s; their contradictions & details indicate a classic "urban legend", with a probable origin in a speculative question asked in the 1960s by Edward Fredkin at an AI conference about some early NN research, which was subsequently classified & never followed up on.

I suggest that dataset bias is real but exaggerated by the tank story, giving a misleading indication of the risks from deep learning, and that it would be better not to repeat it but to use real examples of dataset bias and focus on larger-scale risks like AI systems optimizing for the wrong utility functions.

Deep learning's rise over the past decade and dominance in image processing tasks has led to an explosion of applications attempting to infer high-level semantics locked up in raw sensory data like photographs. Convolutional neural networks are now applied not just to ordinary tasks like sorting cucumbers by quality but to everything from predicting the best Go move, to guessing where a photograph was taken, to whether a photograph is "interesting" or "pretty", not to mention supercharging traditional tasks like radiology interpretation or facial recognition, which have reached levels of accuracy that could only be dreamed of decades ago. With this approach of "neural net all the things!", the question of to what extent the trained neural networks are useful in the real world and will do what we want them to do & not what we told them to do has taken on additional importance, especially given the possibility of neural networks learning to accomplish extremely inconvenient things like inferring individual human differences such as criminality or homosexuality (to give two highly controversial recent examples where the meaningfulness of the claimed successes has been severely questioned).

In this context, a cautionary story is often told of incautious researchers decades ago who trained a NN for the military to find images of tanks, only to discover they had trained a neural network to detect something else entirely (what, precisely, that something else was varies in the telling). It would be a good & instructive story… if it were true.

Is it?

Did It Happen?

Versions of the Story

Drawing on the usual search tools (Google/Google Books/Google Scholar/Libgen/LessWrong/Hacker News/Twitter) in investigating such leprechauns, I have compiled a large number of variants of the story; below, in reverse chronological order by decade, letting us trace the story back towards its roots:

2010s

Heather Murphy, "Why Stanford Researchers Tried to Create a 'Gaydar' Machine" (NYT), 2017-10-09:

So What Did the Machines See? Dr. Kosinski and Mr. Wang [Wang & Kosinski 2018] say that the algorithm is responding to fixed facial features, like nose shape, along with "grooming choices," such as eye makeup. But it's also possible that the algorithm is seeing something totally unknown. "The more data it has, the better it is at picking up patterns," said Sarah Jamie Lewis, an independent privacy researcher who Tweeted a critique of the study. "But the patterns aren't necessarily the ones you think that they are." The director of M.I.T.'s Center for Brains, Minds and Machines offered a classic parable used to illustrate this disconnect. The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness. Dr. Cox has spotted a version of this in his own studies of dating profiles. Gay people, he has found, tend to post higher-quality photos. Dr. Kosinski said that they went to great lengths to guarantee that such confounders did not influence their results. Still, he agreed that it's easier to teach a machine to see than to understand what it has seen.

[It is worth noting that Arcas et al's criticisms, such as their 'gay version' photographs, do not appear to have been confirmed by an independent replication.]

Alexander Harrowell, "It was called a perceptron for a reason, damn it", 2017-09-30:

You might think that this is rather like one of the classic optical illusions, but it's worse than that. If you notice that you look at something this way, and then that way, and it looks different, you'll notice something is odd. This is not something our deep learner will do. Nor is it able to identify any bias that might exist in the corpus of data it was trained on…or maybe it is. If there is any property of the training data set that is strongly predictive of the training criterion, it will zero in on that property with the ferocious clarity of Darwinism. In the 1980s, an early backpropagating neural network was set to find Soviet tanks in a pile of reconnaissance photographs. It worked, until someone noticed that the Red Army usually trained when the weather was good, and in any case the satellite could only see them when the sky was clear. The medical school at St Thomas' Hospital in London found theirs had learned that their successful students were usually white.

An interesting story with a distinct "family resemblance" is told about a NN classifying wolves/dogs, by Evgeniy Nikolaychuk, "Dogs, Wolves, Data Science, and Why Machines Must Learn Like Humans Do", 2017-06-09:

Neural networks are designed to learn like the human brain, but we have to be careful. This is not because I'm scared of machines taking over the planet. Rather, we must make sure machines learn correctly. One example that always pops into my head is how one neural network learned to differentiate between dogs and wolves. It didn't learn the differences between dogs and wolves, but instead learned that wolves were on snow in their picture and dogs were on grass. It learned to differentiate the two animals by looking at snow and grass. Obviously, the network learned incorrectly. What if the dog was on snow and the wolf was on grass? Then, it would be wrong.

However, in his source, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier", Ribeiro et al 2016, they specify of their dog/wolf snow-detector NN that they "trained this bad classifier intentionally, to evaluate whether subjects are able to detect it [the bad performance]", using LIME for insight into how the classifier was making its classification, concluding that "After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially." So Nikolaychuk appears to have misremembered. (Perhaps in another 25 years students will be told in their classes of how a NN was once trained by ecologists to count wolves…)
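To make the LIME experiment concrete, here is a minimal sketch (my own toy illustration, not Ribeiro et al's code) of probing an image classifier with the open-source `lime` package; the gradient "photo" and the brightness-only classifier are stand-ins I have invented for demonstration, and the point is only that the returned mask shows which superpixels drive the prediction, so a snow-detector or brightness-detector would give itself away.

```python
# Toy sketch: ask LIME which regions of an image a classifier relies on.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def brightness_classifier(images):
    """Stand-in 'tank detector' that only looks at mean brightness.
    Returns [P(no-tank), P(tank)] per HxWx3 image in the batch."""
    brightness = np.asarray([img.mean() / 255.0 for img in images])
    return np.stack([brightness, 1.0 - brightness], axis=1)

base = np.tile(np.linspace(0, 255, 64), (64, 1)).astype(np.uint8)
image = np.stack([base, base, base], axis=-1)   # toy photo: dark left, bright right

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image, brightness_classifier,
                                         top_labels=1, num_samples=500)
label = explanation.top_labels[0]
img, mask = explanation.get_image_and_mask(label, positive_only=True,
                                            num_features=5, hide_rest=False)
# If the highlighted superpixels are sky/snow/background rather than the object,
# the classifier is exploiting dataset bias, not recognizing the object.
overlay = mark_boundaries(img / 255.0, mask)
```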

Redditor mantrap2 gives on 2015-06-20 this version of the story:

I remember this kind of thing from the 1980s: the US Army was testing image recognition seekers for missiles and was getting excellent results on Northern German tests with NATO tanks. Then they tested the same systems in other environment and there results were suddenly shockingly bad. Turns out the image recognition was keying off the trees with tank-like minor features rather than the tank itself. Putting other vehicles in the same forests got similar high hits but tanks by themselves (in desert test ranges) didn't register. Luckily a sceptic somewhere decided to "do one more test to make sure".

Dennis Polis, God, Science and Mind, 2012 (pg131, limited Google Books snippet, unclear what ref 44 is):

These facts refute a Neoplatonic argument for the essential immateriality of the soul, viz. that since the mind deals with universal representations, it operates in a specifically immaterial way…So, awareness is not explained by connectionism. The results of neural net training are not always as expected. One team intended to train neural nets to recognize battle tanks in aerial photos. The system was trained using photos with and without tanks. After the training, a different set of photos was used for evaluation, and the system failed miserably—being totally incapable of distinguishing those with tanks. The system actually discriminated cloudy from sunny days. It happened that all the training photos with tanks were taken on cloudy days, while those without were on clear days.44 What does this show? That neural net training is mindless. The system had no idea of the intent of the enterprise, and did what it was programmed to do without any concept of its purpose. As with Dawkins' evolution simulation (p. 66), the goals of computer neural nets are imposed by human programmers.

Blay Whitby, Artificial Intelligence: A Beginner's Guide 2012 (pg53):

It is not yet clear how an artificial neural net could be trained to deal with "the world" or any really open-ended sets of problems. Now some readers may feel that this unpredictability is not a problem. After all, we are talking about training not programming and we expect a neural net to behave rather more like a brain than a computer. Given the usefulness of nets in unsupervised learning, it might seem therefore that we do not really need to worry about the problem being of manageable size and the training process being predictable. This is not the case; we really do need a manageable and well-defined problem for the training process to work. A famous AI urban myth may help to make this clearer.

The story goes something like this. A research team was training a neural net to recognize pictures containing tanks. (I'll leave you to guess why it was tanks and not tea-cups.) To do this they showed it two training sets of photographs. One set of pictures contained at least one tank somewhere in the scene, the other set contained no tanks. The net had to be trained to discriminate between the two sets of photographs. Eventually, after all that back-propagation stuff, it correctly gave the output "tank" when there was a tank in the picture and "no tank" when there wasn't. Even if, say, only a little bit of the gun was peeping out from behind a sand dune it said "tank". Then they presented a picture where no part of the tank was visible—it was actually completely hidden behind a sand dune—and the program said "tank".

Now when this sort of thing happens research labs tend to split along age-based lines. The young hairs say "Great! We're in line for the Nobel Prize!" and the old heads say "Something's gone wrong". Unfortunately, the old heads are usually right—as they were in this case. What had happened was that the photographs containing tanks had been taken in the morning while the army played tanks on the range. After lunch the photographer had gone back and taken pictures from the same angles of the empty range. So the net had identified the most reliable single feature which enabled it to classify the two sets of photos, namely the angle of the shadows. "AM = tank, PM = no tank". This was an extremely effective way of classifying the two sets of photographs in the training set. What it most certainly was not was a program that recognizes tanks. The great advantage of neural nets is that they find their own classification criteria. The great problem is that it may not be the one you want!

Thom Blake noted on 2011-09-20 that the story is:

Probably apocryphal. I haven't been able to track this down, despite having heard the story both in computer ethics class and at academic conferences.

"Embarrassing mistakes in perceptron research", Marvin Minsky, 2011-01-31:

Like I had a friend in Italy who had a perceptron that looked at a visual… it had visual inputs. So, he… he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron—a big machine—that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of… of pieces of music just by looking at the lower left… lower right hand corner of the page. So, I told this to the… to our scientist friend and he went through the data and he said: 'You guessed right. That's… that's how it happened to make that distinction.' We thought it was very funny.

A similar thing happened here in the United States at one of our research institutions. Where a perceptron had been trained to distinguish between—this was for military purposes—It could… it was looking at a scene of a forest in which there were camouflaged tanks in one picture and no camouflaged tanks in the other. And the perceptron—after a little training—got… made a 100% correct distinction between these two different sets of photographs. Then they were embarrassed a few hours later to discover that the two rolls of film had been developed differently. And so these pictures were just a little darker than all of these pictures and the perceptron was just measuring the total amount of light in the scene. But it was very clever of the perceptron to find some way of making the distinction.

2000s

Eliezer Yudkowsky, 2008-08-24 (similarly quoted in "Artificial Intelligence as a Negative and Positive Factor in Global Risk", "Artificial Intelligence in global risk" in Global Catastrophic Risks 2011, & "Friendly Artificial Intelligence" in Singularity Hypotheses 2013):

Once upon a time—I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output "yes" for the 50 photos of camouflaged tanks, and output "no" for the 50 photos of forest. Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have "learned" 100 special cases that wouldn't generalize to new problems. Not, "camouflaged tanks versus forest", but just, "photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive…" But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos. It turned out that in the researchers' data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest. This parable—which might or might not be fact—illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence…

Gordon Rugg, Using Statistics: A Gentle Introduction, 2007-10-01 (pg114–115):

Neural nets and genetic algorithms (including the story of the Russian tanks): Neural nets (or artificial neural networks, to give them their full name) are pieces of software inspired by the way the human brain works. In brief, you can train a neural net to do tasks like classifying images by giving it lots of examples, and telling it which examples fit into which categories; the neural net works out for itself what the defining characteristics are for each category. Alternatively, you can give it a large set of data and leave it to work out connections by itself, without giving it any feedback. There's a story, which is probably an urban legend, which illustrates how the approach works and what can go wrong with it. According to the story, some NATO researchers trained a neural net to distinguish between photos of NATO and Warsaw Pact tanks. After a while, the neural net could get it right every time, even with photos it had never seen before. The researchers had gleeful visions of installing neural nets with miniature cameras in missiles, which could then be fired at a battlefield and left to choose their own targets. To demonstrate the method, and secure funding for the next stage, they organised a viewing by the military. On the day, they set up the system and fed it a new batch of photos. The neural net responded with apparently random decisions, sometimes identifying NATO tanks correctly, sometimes identifying them mistakenly as Warsaw Pact tanks. This did not inspire the powers that be, and the whole scheme was abandoned on the spot. It was only afterwards that the researchers realised that all their training photos of NATO tanks had been taken on sunny days in Arizona, whereas the Warsaw Pact tanks had been photographed on grey, miserable winter days on the steppes, so the neural net had flawlessly learned the unintended lesson that if you saw a tank on a gloomy day, then you made its day even gloomier by marking it for destruction.

N. Katherine Hayles, "Computing the Human" (Inventive Life: Approaches to the New Vitalism, Fraser et al 2006; pg424):

While humans have for millennia used what Cariani calls 'active sensing'—'poking, pushing, bending'—to extend their sensory range and for hundreds of years have used prostheses to create new sensory experiences (for example, microscopes and telescopes), only recently has it been possible to construct evolving sensors and what Cariani (1998: 718) calls 'internalized sensing', that is, "bringing the world into the device" by creating internal, analog representations of the world out of which internal sensors extract newly-relevant properties'.

…Another conclusion emerges from Cariani's call (1998) for research in sensors that can adapt and evolve independently of the epistemic categories of the humans who create them. The well-known and perhaps apocryphal story of the neural net trained to recognize army tanks will illustrate the point. For obvious reasons, the army wanted to develop an intelligent machine that could discriminate between real and pretend tanks. A neural net was constructed and trained using two sets of data, one consisting of photographs showing plywood cutouts of tanks and the other actual tanks. After some training, the net was able to discriminate flawlessly between the situations. As is customary, the net was then tested against a third data set showing pretend and real tanks in the same landscape; it failed miserably. Further investigation revealed that the original two data sets had been filmed on different days. One of the days was overcast with lots of clouds, and the other day was clear. The net, it turned out, was discriminating between the presence and absence of clouds. The anecdote shows the ambiguous potential of epistemically autonomous devices for categorizing the world in entirely different ways from the humans with whom they interact. While this autonomy might be used to enrich the human perception of the world by revealing novel kinds of constructions, it also can create a breed of autonomous devices that parse the world in radically different ways from their human trainers.

A counter-narrative, also perhaps apocryphal, emerged from the 1991 Gulf War. US soldiers firing at tanks had been trained on simulators that imaged flames shooting out from the tank to indicate a kill. When army investigators examined Iraqi tanks that were defeated in battles, they found that for some tanks the soldiers had fired four to five times the amount of munitions necessary to disable the tanks. They hypothesized that the overuse of firepower happened because no flames shot out, so the soldiers continued firing. If the hypothesis is correct, human perceptions were altered in accord with the idiosyncrasies of intelligent machines, providing an example of what can happen when human-machine perceptions are caught in a feedback loop with one another.

Linda Null & Julie Lobur, The Essentials of Computer Organization and Architecture (third edition), 2003/2014 (pg439–440 in 1st edition, pg658 in 3rd edition):

Correct training requires thousands of steps. The training time itself depends on the size of the network. As the number of perceptrons increases, the number of possible "states" also increases.

Let's consider a more sophisticated example, that of determining whether a tank is hiding in a photograph. A neural net can be configured so that each output value correlates to exactly one pixel. If the pixel is part of the image of a tank, the net should output a one; otherwise, the net should output a zero. The input information would most likely consist of the color of the pixel. The network would be trained by feeding it many pictures with and without tanks. The training would continue until the network correctly identified whether the photos included tanks. The U.S. military conducted a research project exactly like the one we just described. One hundred photographs were taken of tanks hiding behind trees and in bushes, and another 100 photographs were taken of ordinary landscape with no tanks. Fifty photos from each group were kept "secret," and the rest were used to train the neural network. The network was initialized with random weights before being fed one picture at a time. When the network was incorrect, it adjusted its input weights until the correct output was reached. Following the training period, the 50 "secret" pictures from each group of photos were fed into the network. The neural network correctly identified the presence or absence of a tank in each photo. The real question at this point has to do with the training—had the neural net actually learned to recognize tanks? The Pentagon's natural suspicion led to more testing. Additional photos were taken and fed into the network, and to the researchers' dismay, the results were quite random. The neural net could not correctly identify tanks within photos. After some investigation, the researchers determined that in the original set of 200 photos, all photos with tanks had been taken on a cloudy day, whereas the photos with no tanks had been taken on a sunny day. The neural net had properly separated the two groups of pictures, but had done so using the color of the sky to do this rather than the existence of a hidden tank. The government was now the proud owner of a very expensive neural net that could accurately distinguish between sunny and cloudy days!

This is a great example of what many consider the biggest issue with neural networks. If there are more than 10 to 20 neurons, it is impossible to understand how the network is arriving at its results. One cannot tell if the net is making decisions based on correct information, or, as in the above example, something totally irrelevant. Neural networks have a remarkable ability to derive meaning and extract patterns from data that are too complex to be analyzed by human beings. However, some people trust neural networks to be experts in their area of training. Neural nets are used in such areas as sales forecasting, risk management, customer research, undersea mine detection, facial recognition, and data validation. Although neural networks are promising, and the progress made in the past several years has led to significant funding for neural net research, many people are hesitant to put confidence in something that no human being can completely understand.

David Gerhard, "Pitch Extraction and Fundamental Frequency: History and Current Techniques", Technical Report TR-CS 2003–06, November 2003:

The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon's foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.

As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.

Rice lecture #24, "COMP 200: Elements of Computer Science", 2002-03-18:

  1. Tanks in Desert Storm

Sometimes you have to be careful what you train on . . .

The problem with neural nets is that you never know what features they're actually training on. For example:

The US military tried to use neural nets in Desert Storm for tank recognition, so unmanned tanks could identify enemy tanks and destroy them. They trained the neural net on multiple images of "friendly" and enemy tanks, and eventually had a decent program that seemed to correctly identify friendly and enemy tanks.

Then, when they actually used the program in a real-world test phase with actual tanks, they found that the tanks would either shoot at nothing or shoot at everything. They certainly seemed to be incapable of distinguishing friendly or enemy tanks.

Why was this? It turns out that the images they were training on always had glamour-shot type photos of friendly tanks, with an immaculate blue sky, etc. The enemy tank photos, on the other hand, were all spy photos, not very clear, sometimes fuzzy, etc. And it was these characteristics that the neural net was training on, not the tanks at all. On a bright sunny day, the tanks would do nothing. On an overcast, hazy day, they'd start firing like crazy . . .

Andrew Ilachinski, Cellular Automata: A Discrete Universe, 2001 (pg547):

There is a telling story about how the Army recently went about teaching a backpropagating net to identify tanks set against a variety of environmental backdrops. The programmers correctly fed their multi-layer net photograph after photograph of tanks in grasslands, tanks in swamps, no tanks on concrete, and so on. After many trials and many thousands of iterations, their net finally learned all of the images in their database. The problem was that when the presumably "trained" net was tested with other images that were not part of the original training set, it failed to do any better than what would be expected by chance. What had happened was that the input/training fact set was statistically corrupt. The database consisted mostly of images that showed a tank only if there were heavy clouds, the tank itself was immersed in shadow or there was no sun at all. The Army's neural net had indeed identified a latent pattern, but it unfortunately had nothing to do with tanks: it had effectively learned to identify the time of day! The obvious lesson to be taken away from this amusing example is that how well a net "learns" the desired associations depends almost entirely on how well the database of facts is defined. Just as Monte Carlo simulations in statistical mechanics may fall short of intended results if they are forced to rely upon poorly coded random number generators, so do backpropagating nets typically fail to achieve expected results if the facts they are trained on are statistically corrupt.

Intelligent Data Analysis In Science, Hugh M. Cartwright 2000, pg126, writes (according to Google Books's snippet view; Cartwright's version appears to be a direct quote or close paraphrase of an earlier 1994 chemistry paper, Goodacre et al 1994):

…television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to detect the tanks. After further investigation, it was found…

Daniel Robert Franklin & Philippe Crochat, libneural tutorial, 2000-03-23:

A neural network is useless if it only sees one example of a matching input/output pair. It cannot infer the characteristics of the input data for which you are looking for from only one example; rather, many examples are required. This is analogous to a child learning the difference between (say) different types of animals—the child will need to see several examples of each to be able to classify an arbitrary animal… It is the same with neural networks. The best training procedure is to compile a wide range of examples (for more complex problems, more examples are required) which exhibit all the different characteristics you are interested in. It is important to select examples which do not have major dominant features which are of no interest to you, but are common to your input data anyway. One famous example is of the US Army "Artificial Intelligence" tank classifier. It was shown examples of Soviet tanks from many different distances and angles on a bright sunny day, and examples of US tanks on a cloudy day. Needless to say it was great at classifying weather, but not so good at picking out enemy tanks.

1990s

"Neural Network Follies", Neil Fraser, September 1998:

In the 1980s, the Pentagon wanted to harness computer technology to make their tanks harder to attack…The research team went out and took 100 photographs of tanks hiding behind trees, and then took 100 photographs of trees—with no tanks. They took half the photos from each group and put them in a vault for safe-keeping, then scanned the other half into their mainframe computer. The huge neural network was fed each photo one at a time and asked if there was a tank hiding behind the trees. Of course at the beginning its answers were completely random since the network didn't know what was going on or what it was supposed to do. But each time it was fed a photo and it generated an answer, the scientists told it if it was right or wrong. If it was wrong it would randomly change the weightings in its network until it gave the correct answer. Over time it got better and better until eventually it was getting each photo correct. It could correctly determine if there was a tank hiding behind the trees in any one of the photos…So the scientists took out the photos they had been keeping in the vault and fed them through the computer. The computer had never seen these photos before—this would be the big test. To their immense relief the neural net correctly identified each photo as either having a tank or not having one. Independent testing: The Pentagon was very pleased with this, but a little bit suspicious. They commissioned another set of photos (half with tanks and half without) and scanned them into the computer and through the neural network. The results were completely random. For a long time nobody could figure out why. After all nobody understood how the neural had trained itself. Eventually someone noticed that in the original set of 200 photos, all the images with tanks had been taken on a cloudy day while all the images without tanks had been taken on a sunny day. The neural network had been asked to separate the two groups of photos and it had chosen the most obvious way to do it—not by looking for a camouflaged tank hiding behind a tree, but merely by looking at the color of the sky…This story might be apocryphal, but it doesn't really matter. It is a perfect illustration of the biggest problem behind neural networks. Any automatically trained net with more than a few dozen neurons is virtually impossible to analyze and understand.

Tom White attributes (in October 2017) to Marvin Minsky some version of the tank story being told in MIT classes 20 years before, ~1997 (but doesn't specify the detailed story or version other than apparently the results were "classified").

Vasant Dhar & Roger Stein, Intelligent Decision Support Methods, 1997 (pg98, limited Google Books snippet):

…However, when a new set of photographs were used, the results were horrible. At first the team was puzzled. But after careful inspection of the first two sets of photographs, they discovered a very simple explanation. The photos with tanks in them were all taken on sunny days, and those without the tanks were taken on overcast days. The network had not learned to identify tank like images; instead, it had learned to identify photographs of sunny days and overcast days.

Royston Goodacre, Mark J. Neal, & Douglas B. Kell, "Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra", 1994-04-29:

…As in all other data analysis techniques, these supervised learning methods are not immune from sensitivity to badly chosen initial data (113). [113: Zupan, J. and J. Gasteiger: Neural Networks for Chemists: An Introduction. VCH Verlagsgesellschaft, Weinheim (1993)] Therefore the exemplars for the training set must be carefully chosen; the golden rule is "garbage in—garbage out". An excellent example of an unrepresentative training set was discussed some time ago on the BBC television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate most successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to distinguish the tanks from the trees. After further investigation, it was found that the first set of pictures containing tanks had been taken on a sunny day whilst those containing no tanks were obtained when it was overcast. The neural network had therefore thus learned simply to recognise the weather! We can conclude from this that the training and tests sets should be carefully selected to contain representative exemplars encompassing the appropriate variance over all relevant properties for the problem at hand.

Fernando Pereira, "neural redlining", RISKS 16(41), 1994-09-12:

Fred's comments will hold not only of neural nets but of any decision model trained from data (eg. Bayesian models, decision trees). It's just an instance of the old "GIGO" phenomenon in statistical modeling…Overall, the whole issue of evaluation, let alone certification and legal standing, of complex statistical models is still very much open. (This reminds me of a possibly apocryphal story of problems with biased data in neural net training. Some US defense contractor had supposedly trained a neural net to find tanks in scenes. The reported performance was excellent, with even camouflaged tanks mostly hidden in vegetation being spotted. However, when the net was tested on yet a new set of images supplied by the client, the net did not do better than chance. After an embarrassing investigation, it turned out that all the tank images in the original training and test sets had very different average intensity than the non-tank images, and thus the net had just learned to discriminate between two image intensity levels. Does anyone know if this actually happened, or is it just in the neural net "urban folklore"?)

Erich Harth, The Creative Loop: How the Brain Makes a Mind, 1993/1995 (pg158, limited Google Books snippet):

…55. The net was trained to detect the presence of tanks in a landscape. The training consisted in showing the device many photographs of scene, some with tanks, some without. In some cases—as in the picture on page 143—the tank's presence was not very obvious. The inputs to the neural net were digitized photographs;

Hubert Dreyfus & Stuart Dreyfus, "What Artificial Experts Can and Cannot Do", 1992:

All the "continue this sequence" questions found on intelligence tests, for example, really have more than one possible answer but most human beings share a sense of what is simple and reasonable and therefore acceptable. But when the net produces an unexpected association can one say it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of "type" and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider the legend of one of connectionism's first applications. In the early days of the perceptron the army decided to train an artificial neural network to recognize tanks partly hidden behind trees in the woods. They took a number of pictures of a woods without tanks, and then pictures of the same woods with tanks clearly sticking out from behind trees. They then trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures from each set that had not been used in training the net. Just to make sure that the net had indeed learned to recognize partially hidden tanks, however, the researchers took some more pictures in the same woods and showed them to the trained net. They were shocked and depressed to find that with the new pictures the net totally failed to discriminate between pictures of trees with partially concealed tanks behind them and just plain trees. The mystery was finally solved when someone noticed that the training pictures of the woods without tanks were taken on a cloudy day, whereas those with tanks were taken on a sunny day. The net had learned to recognize and generalize the difference between a woods with and without shadows! Obviously, not what stood out for the researchers as the important difference. This example illustrates the general point that a net must share size, architecture, initial connections, configuration and socialization with the human brain if it is to share our sense of appropriate generalization

Hubert Dreyfus appears to have told this story earlier in 1990 or 1991, as a similar story appears in episode 4 (German) (starting 33m49s) of a BBC documentary series broadcast 1991-11-08. Hubert L. Dreyfus, What Computers Still Can't Do: A Critique of Artificial Reason, 1992, repeats the story in very similar but not quite identical wording (Jeff Kaufman notes that Dreyfus drops the qualifying "legend of" description):

…But when the net produces an unexpected association, can one say that it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of "type" and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider one of connectionism's first applications. In the early days of this work the army tried to train an artificial neural network to recognize tanks in a forest. They took a number of pictures of a forest without tanks and then, on a later day, with tanks clearly sticking out from behind trees, and they trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures that had not been part of the training set. Just to make sure that the net was indeed recognizing partially hidden tanks, however, the researchers took more pictures in the same forest and showed them to the trained net. They were depressed to find that the net failed to discriminate between the new pictures of trees with tanks behind them and the new pictures of just plain trees. After some agonizing, the mystery was finally solved when someone noticed that the original pictures of the forest without tanks were taken on a cloudy day and those with tanks were taken on a sunny day. The net had apparently learned to recognize and generalize the difference between a forest with and without shadows! This example illustrates the general point that a network must share our commonsense understanding of the world if it is to share our sense of appropriate generalization.

Dreyfus's What Computers Still Can't Do is listed as a revision of his 1972 book, What Computers Can't Do: A Critique of Artificial Reason, but the tank story is not in the 1972 book, only the 1992 one. (Dreyfus's version is also quoted in the 2017 NYT article and Hillis 1996's Geography, Identity, and Embodiment in Virtual Reality, pg346.)

Laveen N. Kanal, in the Foreword to Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections (1991), discusses some early NN/tank research (predating not just LeCun's convolutions but backpropagation):

…[Frank] Rosenblatt had not limited himself to using just a single Threshold Logic Unit but used networks of such units. The problem was how to train multilayer perceptron networks. A paper on the topic written by Block, Knight and Rosenblatt was murky indeed, and did not demonstrate a convergent procedure to train such networks. In 1962–63 at Philco-Ford, seeking a systematic approach to designing layered classification nets, we decided to use a hierarchy of threshold logic units with a first layer of "feature logics" which were threshold logic units on overlapping receptive fields of the image, feeding two additional levels of weighted threshold logic decision units. The weights in each level of the hierarchy were estimated using statistical methods rather than iterative training procedures [L.N. Kanal & N.C. Randall, "Recognition System Design by Statistical Analysis", Proc. 19th Conf. ACM, 1964]. We referred to the networks as two layer networks since we did not count the input as a layer. On a project to recognize tanks in aerial photography, the method worked well enough in practice that the U.S. Army agency sponsoring the project decided to classify the final reports, although previously the project had been unclassified. We were unable to publish the classified results! Then, enamored by the claimed promise of coherent optical filtering as a parallel implementation for automatic target recognition, the funding we had been promised was diverted away from our electro-optical implementation to a coherent optical filtering group. Some years later we presented the arguments favoring our approach, compared to optical implementations and trainable systems, in an article titled "Systems Considerations for Automatic Imagery Screening" by T.J. Harley, L.N. Kanal and N.C. Randall, which is included in the IEEE Press reprint volume titled Machine Recognition of Patterns edited by A. Agrawala 19771. In the years which followed multilevel statistically designed classifiers and AI search procedures applied to pattern recognition held my interest, although comments in my 1974 survey, "Patterns In Pattern Recognition: 1968–1974" [IEEE Trans. on IT, 1974], mention papers by Amari and others and show an awareness that neural networks and biologically motivated automata were making a comeback. In the last few years trainable multilayer neural networks have returned to dominate research in pattern recognition and this time there is potential for gaining much greater insight into their systematic design and performance analysis…

While Kanal & Randall 1964 matches in some ways, including the image counts, there is no mention of failure either in the paper or in Kanal's 1991 reminiscences (rather, Kanal implies it was highly promising); there is no mention of a field deployment or additional testing which could have revealed overfitting; and given their use of binarizing, it's not clear to me that their 2-layer algorithm even could overfit to global brightness. The photos also appear to have been taken at low enough altitude for there to be no clouds, and to have been taken under similar (possibly controlled) lighting conditions. The description in Kanal & Randall 1964 is somewhat opaque to me, particularly of the 'Laplacian' they use to binarize or convert to edges, but there's more background in their "Semi-Automatic Imagery Screening Research Study and Experimental Investigation, Volume 1", Harley, Bryan, Kanal, Taylor & Grayum 1962 (mirror), which indicates that in their preliminary studies they were already interested in prenormalization/preprocessing of images to correct for altitude and brightness, and in the Laplacian, along with silhouetting and "lineness editing", noting that "The Laplacian operation eliminates absolute brightness scale as well as low-spatial frequencies which are of little consequence in screening operations."2
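To see why a Laplacian front end makes a global-brightness confound implausible, here is a minimal sketch of my own (not the 1962/1964 pipeline; the random "photo" is a stand-in) using `scipy.ndimage.laplace`: adding a uniform brightness offset to an image leaves the Laplacian, and hence anything binarized from it downstream, essentially unchanged.

```python
import numpy as np
from scipy.ndimage import laplace

rng = np.random.default_rng(0)
photo = rng.random((64, 64))   # stand-in aerial photo, values in [0, 1)
brighter = photo + 0.3         # the same scene with a uniform brightness offset

# The discrete Laplacian responds only to local second differences, so a
# constant brightness shift contributes nothing beyond floating-point noise:
diff = np.abs(laplace(photo) - laplace(brighter)).max()
print(diff)  # on the order of machine epsilon; an edge map thresholded from the
             # Laplacian is therefore insensitive to overall scene brightness
```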

An anonymous reader says he heard the story in 1990:

I was told about the tank recognition failure by a lecturer on my 1990 Intelligent Knowledge Based Systems MSc, almost certainly Libor Spacek, in terms of being aware of context in data sets; that being from (the former) Czechoslovakia he expected to see tanks on a motorway whereas most British people didn't. I also remember reading about a project with DARPA funding aimed at differentiating Russian, European and US tanks where what the image recognition learned was not to spot the differences between tanks but to find trees, because of the US tank photos being on open ground and the Russian ones being in forests; that was during the same MSc course—so very similar to predicting tumours by looking for the ruler used to measure them in the photo—but I don't recall the source (it wasn't one of the books you cite though, it was either a journal article or another text book).

1980s

Chris Brew states (2017-10-16) that he "Heard the story in 1984 with pigeons instead of neural nets".

1960s

Edward Fredkin, in an email to Eliezer Yudkowsky on 2013-02-26, recounts an interesting anecdote about the 1960s claiming to be the grain of truth:

By the way, the story about the two pictures of a field, with and without army tanks in the picture, comes from me. I attended a meeting in Los Angeles [at RAND?], about half a century ago [~1963?] where someone gave a paper showing how a random net could be trained to detect the tanks in the picture. I was in the audience. At the end of the talk I stood up and made the comment that it was obvious that the picture with the tanks was made on a sunny day while the other picture (of the same field without the tanks) was made on a cloudy day. I suggested that the "neural net" had merely trained itself to recognize the difference between a bright picture and a dim picture.

Evaluation

Sourcing

The absence of any hard citations is striking: even when a citation is supplied, it is invariably to a relatively recent source like Dreyfus, and then the chain ends. Typically for a real story, one will find at least one or two hints of a penultimate citation and then a final definitive citation to some very difficult-to-obtain or obscure work (which then is often quite different from the popularized version but still recognizable as the original); for example, another popular cautionary AI urban legend is that the 1956 Dartmouth workshop claimed that a single graduate student working for a summer could solve computer vision (or perhaps AI in general), which is a highly distorted, misleading description of the original 1955 proposal's realistic claim that "a 2 month, 10 man study of artificial intelligence" might allow "a significant advance [to] be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."3 Instead, everyone either disavows it as an urban legend or possibly apocryphal, or punts to someone else. (Minsky's 2011 version initially seems concrete, but while he specifically attributes the musical score story to a friend & claims to have found the trick personally, he is then as vague as anyone else about the tank story, saying it just "happened" somewhere "in the United States at one of our research institutes", at an unmentioned institute by unmentioned people at an unmentioned point in time for an unmentioned branch of the military.)

Variations

Question to Radio Yerevan: "Is it correct that Grigori Grigorievich Grigoriev won a luxury car at the All-Union Championship in Moscow?"

Radio Yerevan answered: "In principle, yes. But first of all it was not Grigori Grigorievich Grigoriev, but Vassili Vassilievich Vassiliev; second, it was not at the All-Union Championship in Moscow, but at a Collective Farm Sports Festival in Smolensk; third, it was not a car, but a bicycle; and fourth he didn't win it, but rather it was stolen from him."

"Radio Yerevan Jokes" (collected by Allan Stevo)

It is also interesting that not all the stories imply quite the same problem with the hypothetical NN. Dataset bias/selection effects are not the same thing as overfitting or disparate impact, but some of the story-tellers don't realize that. For example, in some stories, the NN fails when it's tested on additional heldout data (overfitting), not when it's tested on data from an entirely different photographer or field exercise or data source (dataset bias/distributional shift). Or, Alexander Harrowell cites disparate impact in a medical school as if it were an example of the same problem, but it's not: at least in the USA, a NN would be correct in inferring that white students are more likely to succeed, as that is a real predictor (this is an example of how people play rather fast and loose with claims of "algorithmic bias"), and it would not necessarily be the case that, say, randomized admission of more non-white students would be certain to increase the number of successful graduates; such a scenario is, however, possible, and illustrates the difference between predictive models & causal models for control & optimization, and the need for experiments/reinforcement learning.
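The overfitting vs. dataset-bias distinction is easy to demonstrate with a toy simulation (a sketch of my own, not any of the cited systems): a classifier trained on a biased "collection" where tanks happen to co-occur with dark skies sails through a random held-out split from the same collection, yet collapses on a second collection where the correlation is absent, because it learned brightness rather than the weak genuine cue.

```python
# Toy demonstration that a random train/validation split cannot catch dataset bias:
# in collection A, "tank" photos are systematically darker; in collection B they are not.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_collection(n, biased):
    """Each 'photo' is 2 features: mean brightness and a weak genuine tank cue."""
    has_tank = rng.integers(0, 2, n)
    tank_cue = has_tank + rng.normal(0, 2.0, n)                     # weak, noisy real signal
    if biased:
        brightness = 0.8 - 0.5 * has_tank + rng.normal(0, 0.1, n)   # tanks => dark sky
    else:
        brightness = rng.normal(0.5, 0.2, n)                        # no correlation
    return np.column_stack([brightness, tank_cue]), has_tank

X_a, y_a = make_collection(200, biased=True)    # the researchers' photo set
X_b, y_b = make_collection(200, biased=False)   # the client's new photos

# Random 50/50 split within the biased collection, as in the parable.
train, test = np.arange(0, 100), np.arange(100, 200)
model = LogisticRegression().fit(X_a[train], y_a[train])

print("held-out accuracy, same collection:", model.score(X_a[test], y_a[test]))  # near-perfect
print("accuracy on new collection:        ", model.score(X_b, y_b))              # far worse, near chance
```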

Reading all the variants together raises more questions than it answers:

  • Did this story happen in the 1960s, 1980s, 1990s, or during Desert Storm in the 1990s?
  • Was the research conducted by the US military, or researchers for another NATO country?
  • Were the photographs taken by satellite, from the air, on the ground, or by spy cameras?
  • Were the photographs of American tanks, plywood cutouts, Soviet tanks, or Warsaw Pact tanks?
  • Were the tanks out in the open, under cover, or fully camouflaged?
  • Were these photographs taken in forests, fields, deserts, swamps, or all of them?
  • Were the photographs taken in the same place but at different times of day, the same place but on different days, or in different places entirely?
  • Were there 100, 200, or thousands of photographs; and how many were in the training vs validation set?
  • Was the input in black-and-white binary, grayscale, or color?
  • Was the tell-tale feature either field vs forest, bright vs dark, the presence vs absence of clouds, the presence vs absence of shadows, the length of shadows, or an accident in film development unrelated to weather entirely?
  • Was the NN to be used for image processing or in autonomous robotic tanks?
  • Was it even a NN?
  • Was the dataset bias caught quickly within "a few hours", later by a suspicious team member, later still when applied to an additional set of tank photographs, during further testing producing a new dataset, much later during a live demo for military officers, or only after live deployment in the field?

Almost every aspect of the tank story which could vary does vary.

Urban Legends

We could also compare the tank story with many of the characteristics of urban legends (of the sort so familiar from Snopes): they typically have a clear dramatic arc, involve horror or humor while playing on common concerns (distrust of NNs has been a theme from the start of NN research4), make an important didactic or moral point, claim to be true while sourcing remains limited to social proof such as the usual "friend of a friend" attributions, often try to associate with a respected institution (such as the US military), are transmitted primarily orally through social mechanisms & appear spontaneously & independently in many sources without apparent origin (most people seem to hear the tank story in unspecified classes, conferences, or personal discussions rather than in a book or paper), exist in many mutually-contradictory variants often with overly-specific details5 spontaneously arising in the retelling, have been around for a long time (it appears almost fully formed in Dreyfus 1992, suggesting incubation before then), sometimes have a grain of truth (dataset bias certainly is real), and the full tank story is "too good not to pass along" (even authors who are sure it's an urban legend can't resist retelling it yet again for didactic effect or entertainment). The tank story matches almost all the usual criteria for an urban legend.

Origin

So where does this urban leg­end come from? The key anec­dote appears to be Edward Fred­kin’s as it pre­cedes all other excerpts except per­haps the research Kanal describes; Fred­kin’s story does not con­firm the tank story as he merely spec­u­lates that bright­ness was dri­ving the results, much less all the extra­ne­ous details about pho­to­graphic film being acci­den­tally overde­vel­oped or robot tanks going berserk or a demo fail­ing in front of Army brass.

But it’s easy to see how Fred­kin’s rea­son­able ques­tion could have memet­i­cally evolved into the tank story as finally fixed into pub­lished form by Drey­fus’s arti­cle:

  1. Set­ting: Kanal & Ran­dall set up their very small sim­ple early per­cep­trons on some tiny binary aer­ial pho­tos of tanks, in inter­est­ing early work, and Fred­kin attends the talk some­time around 1960–1963

  2. The Ques­tion: Fred­kin then asks in the Q&A whether the per­cep­tron is not learn­ing square-shapes but bright­ness

  3. Punt­ing: of course nei­ther Fred­kin nor Kanal & Ran­dall can know on the spot whether this cri­tique is right or wrong (per­haps that ques­tion moti­vated the bina­rized results reported in Kanal & Ran­dall 1964?), and the ques­tion remains unan­swered

  4. Anecdotizing: but someone in the audience considers that an excellent observation about methodological flaws in NN research, and perhaps they (or Fredkin) repeat the story to others, who find it useful too; along the way, Fredkin's question mark gets dropped and the possible flaw becomes an actual flaw, with the punchline: "…and it turned out their NN was just detecting average brightness!"

    One might expect Kanal & Ran­dall to rebut these rumors, if only by pub­lish­ing addi­tional papers on their func­tion­ing sys­tem, but by a quirk of fate, as Kanal explains in his pref­ace, after their 1964 paper, the Army liked it enough to make it clas­si­fied and then they were reas­signed to an entirely dif­fer­ent task, killing progress entire­ly. (Some­thing sim­i­lar hap­pened to .)

  5. Pro­lif­er­a­tion: In the absence of any coun­ternar­ra­tive (si­lence is con­sid­ered con­sen­t), the tank story con­tin­ues spread­ing.

  6. Mutation: but now the story is incomplete, a joke missing most of the setup to its punchline—how did these Army researchers discover the NN had tricked them, and what was the brightness difference from? The various versions propose different resolutions, and likewise, appropriate details about the tank data must be invented.

  7. Fixation: Eventually, after enough mutations, a version reaches Dreyfus, already a well-known critic of the AI establishment, who then uses it in his article/book, virally spreading it globally to pop up in random places thenceforth, and fixating it as a universally-known ur-text. (Further memetic mutations can and often will occur, but diligent writers & researchers will 'correct' variants by returning to the Dreyfus version.)

One might try to write Drey­fus off as a coin­ci­dence and argue that the US Army must have had so many neural net research pro­grams going that one of the oth­ers is the real orig­in, but one would expect those pro­grams to result in spin­offs, more reports, reports since declas­si­fied, etc. It’s been half a cen­tu­ry, after all. And despite the close asso­ci­a­tion of the US mil­i­tary with MIT and early AI work, tanks do not seem to have been a major focus of early NN research—­for exam­ple, does not men­tion tanks at all, and most of my paper searches kept pulling up NN papers about ‘tanks’ as in vats, such as con­trol­ling stirring/mixing tanks for chem­istry. Nor is it a safe assump­tion that the mil­i­tary always has much more advanced tech­nol­ogy than the pub­lic or pri­vate sec­tors; often, they can be quite behind or at the sta­tus quo.6

Could it Happen?

Could some­thing like the tank story (a NN learn­ing to dis­tin­guish solely on aver­age bright­ness lev­els) hap­pen in 2017 with state-of-the-art tech­niques like con­vo­lu­tional neural net­works (CNNs)? (After all, pre­sum­ably nobody really cares about what mis­takes a crude per­cep­tron may or may not have once made back in the 1960s; most/all of the sto­ry-tellers are using it for didac­tic effect in warn­ing against care­less­ness in con­tem­po­rary & future AI research/applications.) I would guess that while it could hap­pen, it would be con­sid­er­ably less likely now than then for sev­eral rea­sons:

  1. a common preprocessing step in computer vision (and NNs in general) is to "whiten" the image by standardizing or transforming pixels to a normal distribution; this would tend to wipe out global brightness differences, promoting invariance to illumination

  2. in addi­tion to or instead of whiten­ing, it is also com­mon to use aggres­sive “data aug­men­ta­tion”: shift­ing the image by a few pix­els in each direc­tion, crop­ping it ran­dom­ly, adjust­ing col­ors to be slightly more red/green/blue, flip­ping hor­i­zon­tal­ly, bar­rel-warp­ing it, adding JPEG com­pres­sion noise/artifacts, bright­en­ing or dark­en­ing, etc.

    None of these trans­for­ma­tions should affect whether an image is clas­si­fi­able as “dog” or “cat”7, the rea­son­ing goes, so the NN should learn to see past them, and gen­er­at­ing vari­ants dur­ing train­ing pro­vides addi­tional data for free. Aggres­sive data aug­men­ta­tion would make it harder to pick up global bright­ness as a cheap trick.

  3. CNNs have built-in biases (com­pared to ful­ly-­con­nected neural net­works) towards edges and other struc­tures, rather than global aver­ages; con­vo­lu­tions want to find edges and geo­met­ric pat­terns like lit­tle squares for tanks. (This point is par­tic­u­larly ger­mane in light of the brain inspi­ra­tion for con­vo­lu­tions & Drey­fus & Drey­fus 1992’s inter­pre­ta­tion of the tank sto­ry.)

  4. image clas­si­fi­ca­tion CNNs, due to their large sizes, are often trained on large datasets with many classes to cat­e­go­rize images into (canon­i­cal­ly, Ima­geNet with 1000 classes over a mil­lion images; much larger datasets, such as 300 mil­lion images, have been explored and found to still offer ben­e­fit­s). Per­force, most of these images will not be gen­er­ated by the dataset main­tainer and will come from a wide vari­ety of peo­ples, places, cam­eras, and set­tings, reduc­ing any sys­tem­atic bias­es. It would be dif­fi­cult to find a cheap trick which works over many of those cat­e­gories simul­ta­ne­ous­ly, and the NN train­ing will con­stantly erode any cat­e­go­ry-spe­cific tricks in favor of more gen­er­al­iz­able pat­tern-recog­ni­tion (in part because there’s no inher­ent ‘mod­u­lar­ity’ which could fac­tor a NN into a “tank cheap trick” NN & a “every­thing else real pat­tern-recog­ni­tion” NN). The power of gen­er­al­iz­able abstrac­tions will tend to over­whelm the short­cuts, and the more data & tasks a NN is trained on, pro­vid­ing greater super­vi­sion & richer insight, the more this will be the case.

    • Even in the some­what unusual case of a spe­cial-pur­pose binary clas­si­fi­ca­tion CNN being trained on a few hun­dred images, because of the large sizes of good CNNs, it is typ­i­cal to at least start with a pre­trained Ima­geNet CNN in order to ben­e­fit from all the learned knowl­edge about edges & what­not before “fine­tun­ing” on the spe­cial-pur­pose small dataset. If the CNN starts with a huge induc­tive bias towards edges etc, it will have a hard time throw­ing away its infor­ma­tive pri­ors and focus­ing purely on global bright­ness. (Often in fine­tun­ing, the lower lev­els of the CNN aren’t allowed to change at all!)
    • Another vari­ant on trans­fer learn­ing is to use the CNN as a fea­ture-­gen­er­a­tor, by tak­ing the final lay­ers’ state com­puted on a spe­cific image and using them as a vec­tor embed­ding, a sort of sum­mary of every­thing about the image con­tent rel­e­vant to clas­si­fi­ca­tion; this embed­ding is use­ful for other kinds of CNNs for pur­poses like style trans­fer (style trans­fer aims to warp an image towards the appear­ance of another image while pre­serv­ing the embed­ding and thus pre­sum­ably the con­tent) or for GANs gen­er­at­ing images (the dis­crim­i­na­tor can use the fea­tures to detect “weird” images which don’t make sense, thereby forc­ing the gen­er­a­tor to learn what images cor­re­spond to real­is­tic embed­dings).
  5. CNNs would typ­i­cally throw warn­ing signs before a seri­ous field deploy­ment, either in diag­nos­tics or fail­ures to extend the results.

    • One ben­e­fit of the fil­ter setup of CNNs is that it’s easy to visu­al­ize what the lower lay­ers are ‘look­ing at’; typ­i­cal­ly, CNN fil­ters will look like diag­o­nal or hor­i­zon­tal lines or curves or other sim­ple geo­met­ric pat­terns. In the case of a hypo­thet­i­cal bright­ness-de­tec­tor CNN, because it is not rec­og­niz­ing any shapes what­so­ever or doing any­thing but triv­ial bright­ness aver­ag­ing, one would expect its fil­ters to look like ran­dom noise and def­i­nitely noth­ing like the usual fil­ter visu­al­iza­tions. This would imme­di­ately alarm any deep learn­ing researcher that the CNN is not learn­ing what they thought it was learn­ing.
    • Related to filter visualization is input visualization: it's common to generate heatmaps of input images to see which regions of the input image are influencing the classification the most. If you are classifying "cats vs dogs", you expect a heatmap of a cat image to focus on the cat's head and tail, for example, and not on the painting on the living room wall behind it; if you have an image of a tank in a forest, you expect the heatmap to focus on the tank rather than trees in the corner or nothing in particular, just random-seeming pixels all over the image. If it's not focusing on the tank at all, one would then wonder, how is it doing the classification? (In a 2017-05-16 blog post, Henderson & Rothe quote Yudkowsky 2008's version of the tank story as a motivation for their heatmap visualization tool and demonstrate that, for example, blocking out the sky in a tank image doesn't bother a VGG16 CNN image classifier but blocking the tank's treads does, and the heatmap focuses on the tank itself.) There are additional methods for trying to understand whether the NN has learned a potentially useful algorithm, such as the previously cited LIME. (A minimal occlusion-heatmap sketch follows this list.)
  6. Also related to the visualization is going beyond classification to the logical next step of "localization" or "image segmentation": having detected an image with a tank in it somewhere, it is natural (especially for military purposes) to ask where in the image the tank is.

    A CNN which is truly detecting the tank itself will lend itself to image segmentation (eg CNN success in reaching human levels of ImageNet classification performance has also resulted in extremely good segmentation of an image by categorizing each pixel as human/dog/cat/etc), while one learning the cheap trick of brightness will utterly fail at guessing better than chance which pixels are the tank.
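To make the heatmap diagnostic in point 5 concrete, here is a minimal occlusion-sensitivity sketch (assuming PyTorch/torchvision; the image path and the ImageNet 'tank' class index are placeholder assumptions, not details from any version of the story). Slide a blank patch across the image and record how much the target-class probability drops: a genuine tank detector's heatmap concentrates on the tank, while a brightness-only classifier would show roughly uniform sensitivity everywhere.

```python
# Occlusion-sensitivity heatmap: a minimal sketch (PyTorch assumed).
# "tank.jpg" and the class index are placeholders, not from the original story.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.vgg16(pretrained=True).eval()
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
img = preprocess(Image.open("tank.jpg")).unsqueeze(0)  # hypothetical test image
target = 847                                           # assumed ImageNet 'tank' class index

def occlusion_heatmap(model, img, target, patch=32, stride=16):
    """Blank out one square at a time; record the drop in the target probability."""
    _, _, H, W = img.shape
    with torch.no_grad():
        base = torch.softmax(model(img), dim=1)[0, target].item()
        rows = range(0, H - patch + 1, stride)
        cols = range(0, W - patch + 1, stride)
        heat = torch.zeros(len(rows), len(cols))
        for i, y in enumerate(rows):
            for j, x in enumerate(cols):
                occluded = img.clone()
                occluded[:, :, y:y+patch, x:x+patch] = 0.0  # blank out one square
                p = torch.softmax(model(occluded), dim=1)[0, target].item()
                heat[i, j] = base - p  # large drop = that region drove the decision
    return heat

heatmap = occlusion_heatmap(model, img, target)
```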

So, it is highly unlikely that a CNN trained via a nor­mal work­flow (data-aug­mented fine­tun­ing of a pre­trained Ima­geNet CNN with stan­dard diag­nos­tics) would fail in this exact way or, at least, make it to a deployed sys­tem with­out fail­ing.
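As a concrete illustration of that normal workflow, here is a minimal sketch (assuming PyTorch/torchvision; the directory layout, class count, and hyperparameters are placeholders) of data-augmented finetuning of a pretrained ImageNet CNN on a small tank/no-tank dataset, combining the per-channel "whitening", the aggressive augmentation, and the frozen lower layers described above:

```python
# Minimal sketch of the standard workflow: aggressive data augmentation plus
# finetuning a pretrained ImageNet CNN. Paths and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_tf = T.Compose([
    T.RandomResizedCrop(224),                                      # random crops/shifts
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),   # brighten/darken, recolor
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),     # per-channel "whitening"
])
train_data = ImageFolder("photos/train", transform=train_tf)       # tank/ and no_tank/ subfolders
loader = DataLoader(train_data, batch_size=32, shuffle=True)

model = models.resnet50(pretrained=True)
for p in model.parameters():                       # freeze the pretrained feature extractor
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)      # new 2-class head: tank vs no-tank

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```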

Could Something Like it Happen?

Could some­thing like the tank story hap­pen, in the sense of a selec­tion-bi­ased dataset yield­ing NNs which fail dis­mally in prac­tice? One could imag­ine it hap­pen­ing and it surely does at least occa­sion­al­ly, but in prac­tice it does­n’t seem to be a par­tic­u­larly seri­ous or com­mon prob­lem—peo­ple rou­tinely apply CNNs to very dif­fer­ent con­texts with con­sid­er­able suc­cess.8 If it’s such a seri­ous and com­mon prob­lem, one would think that peo­ple would be able to pro­vide a wealth of real-­world exam­ples of sys­tems deployed with dataset bias mak­ing it entirely use­less, rather than repeat­ing a fic­tion from 50 years ago.

One of the most rel­e­vant (if unfor­tu­nately older & pos­si­bly out of date) papers I’ve read on this ques­tion of dataset bias is “Unbi­ased Look at Dataset Bias”, Tor­ralba & Efros 2011:

Datasets are an inte­gral part of con­tem­po­rary object recog­ni­tion research. They have been the chief rea­son for the con­sid­er­able progress in the field, not just as source of large amounts of train­ing data, but also as means of mea­sur­ing and com­par­ing per­for­mance of com­pet­ing algo­rithms. At the same time, datasets have often been blamed for nar­row­ing the focus of object recog­ni­tion research, reduc­ing it to a sin­gle bench­mark per­for­mance num­ber. Indeed, some datasets, that started out as data cap­ture efforts aimed at rep­re­sent­ing the visual world, have become closed worlds unto them­selves (e.g. the Corel world, the Cal­tech101 world, the PASCAL VOC world). With the focus on beat­ing the lat­est bench­mark num­bers on the lat­est dataset, have we per­haps lost sight of the orig­i­nal pur­pose?

The goal of this paper is to take stock of the cur­rent state of recog­ni­tion datasets. We present a com­par­i­son study using a set of pop­u­lar datasets, eval­u­ated based on a num­ber of cri­te­ria includ­ing: rel­a­tive data bias, cross-­dataset gen­er­al­iza­tion, effects of closed-­world assump­tion, and sam­ple val­ue. The exper­i­men­tal results, some rather sur­pris­ing, sug­gest direc­tions that can improve dataset col­lec­tion as well as algo­rithm eval­u­a­tion pro­to­cols. But more broad­ly, the hope is to stim­u­late dis­cus­sion in the com­mu­nity regard­ing this very impor­tant, but largely neglected issue.

They demonstrate on several datasets (including ImageNet) that it's possible for an SVM (CNNs were not used) to guess at above-chance levels which dataset an image comes from, and that there are noticeable drops in accuracy when a classifier trained on one dataset is applied to ostensibly the same category in another dataset (eg an ImageNet "car" SVM classifier applied to PASCAL's "car" images will go from 57% to 36% accuracy). But—perhaps the glass is half-full—in none of the pairs does the performance degrade to near-zero, so despite the definite presence of dataset bias, the SVMs are still learning generalizable, transferable image classification. (Similarly, later replication studies9 such as ImageNet-v2 show a generalization gap but only a small one, with typically better in-sample classifiers performing better out-of-sample; ImageNet resnets produce multiple new SOTAs on other image datasets using finetuning transfer learning, and a number of scaling papers show much better representations & robustness & transfer with extremely large CNNs; and Lapuschkin et al compare Fisher vectors (an SVM trained on SIFT features) to CNNs on PASCAL VOC again, finding the Fishers overfit by eg classifying horses based on copyright watermarks while the CNN nevertheless classifies them based on the correct parts, although the CNN may succumb to a different dataset bias by classifying airplanes based on having backgrounds of skies10.) I believe we have good reason to expect our CNNs to also work in the wild.
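The Torralba & Efros measurement is straightforward to reproduce in spirit: train a classifier on one dataset's version of a category and test it on another dataset's version of the same category. A minimal sketch follows (assuming scikit-learn; load_features is a hypothetical helper standing in for whatever feature extraction and dataset loading one actually uses):

```python
# Cross-dataset generalization check in the spirit of Torralba & Efros 2011:
# train a linear classifier on dataset A, then evaluate on both A and B.
# load_features() is a hypothetical helper returning (feature_matrix, labels).
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X_a, y_a = load_features("datasetA_car_vs_rest")   # eg an ImageNet-style 'car' task (placeholder)
X_b, y_b = load_features("datasetB_car_vs_rest")   # eg a PASCAL-style 'car' task (placeholder)

clf = LinearSVC(C=1.0).fit(X_a, y_a)
in_domain = accuracy_score(y_a, clf.predict(X_a))  # optimistic same-dataset number
cross_set = accuracy_score(y_b, clf.predict(X_b))  # the number that matters in the wild
print(f"in-domain: {in_domain:.1%}  cross-dataset: {cross_set:.1%}")
# A modest gap (eg 57% -> 36% in the paper) indicates dataset bias;
# a collapse to chance level would indicate a tank-story-style cheap trick.
```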

Some real instances of dataset bias, more or less (most of these were caught by stan­dard held­out datasets and arguably aren’t the ‘tank story’ at all):

  • a particularly appropriate example is the unsuccessful WWII Soviet anti-tank dog program: a failure for several reasons, among them that the dogs had been trained on Russian tanks and so sought those out rather than the enemy German tanks, because the dogs recognized either the fuel smell or the fuel canisters (diesel vs gasoline)

  • “The per­son con­cept in mon­keys (Cebus apella)”, D’Am­ato & Van Sant 1988

  • Google Photos in June 2015 caused a social-media fuss over mislabeling African-Americans as gorillas; Google did not explain how the Photos app made that mistake, but it is presumably using a CNN, and this is an example of dataset bias (many more Caucasian/Asian faces leading to better performance on them and continued poor performance everywhere else) and/or a mis-specified loss function (the CNN optimizing a standard classification loss and responding to class imbalance or objective color similarity by preferring to guess 'gorilla' rather than 'human' to minimize loss, despite what ought to be a greater penalty for mistakenly classifying a human as an animal/object rather than vice versa). A similar issue occurred with Flickr in May 2015.

  • "Gender-From-Iris or Gender-From-Mascara?", Kuehlkamp et al 2017

  • Gidi Shper­ber, “What I’ve learned from Kag­gle’s fish­eries com­pe­ti­tion” (2017-05-01): ini­tial appli­ca­tion of VGG Ima­geNet CNNs for trans­fer solved the fish pho­to­graph clas­si­fi­ca­tion prob­lem almost imme­di­ate­ly, but failed on the sub­mis­sion val­i­da­tion set; fish cat­e­gories could be pre­dicted from the spe­cific boat tak­ing the pho­tographs

  • “Leak­age in data min­ing: For­mu­la­tion, detec­tion, and avoid­ance”, Kauf­man et al 2011 dis­cusses the gen­eral topic and men­tions a few exam­ples from KDD-Cup

  • Dan Piponi (2017-10-16): “Real world exam­ple from work: hos­pi­tals spe­cialise in dif­fer­ent injuries so CNN for diag­no­sis used anno­ta­tions on x-rays to ID hos­pi­tal.”

  • Thomas G. Diet­terich:

    We made exactly the same mistake in one of my projects on insect recognition. We photographed 54 classes of insects. Specimens had been collected, identified, and placed in vials. Vials were placed in boxes sorted by class. I hired student workers to photograph the specimens. Naturally they did this one box at a time; hence, one class at a time. Photos were taken in alcohol. Bubbles would form in the alcohol. Different bubbles on different days. The learned classifier was surprisingly good. But a saliency map revealed that it was reading the bubble patterns and ignoring the specimens. I was so embarrassed that I had made the oldest mistake in the book (even if it was apocryphal). Unbelievable. Lesson: always randomize even if you don't know what you are controlling for!

  • a possible case is Wu & Zhang 2016, "Automated Inference on Criminality using Face Images", which attempts to use CNNs to classify standardized government ID photos of Chinese people by whether the person has been arrested, the source of the criminal IDs being government publications of wanted suspects vs ordinary people's IDs collected online; the photos are repeatedly described as ID photos and implied to be uniform. The use of official government ID photos taken in advance of any crime would appear to eliminate one's immediate objections about dataset bias—certainly ID photos would be distinct in many ways from ordinary cropped promotional headshots—and so the results seem strong.

    In response to harsh criticism (some points of which are more relevant & likely than others…), Wu & Zhang admit in their response that the dataset is not quite as implied:

    All crim­i­nal ID pho­tos are gov­ern­ment issued, but not mug shots. To our best knowl­edge, they are nor­mal gov­ern­ment issued ID por­traits like those for dri­ver’s license in USA. In con­trast, most of the non­crim­i­nal ID style pho­tos are taken offi­cially by some orga­ni­za­tions (such as real estate com­pa­nies, law firms, etc.) for their web­sites. We stress that they are not self­ies.

    While there is no direct replication testing the Wu & Zhang 2016 results that I know of, the considerable inherent differences between the two classes, which are not at all homogeneous, make me highly skeptical.

  • Pos­si­ble: Win­kler et al 2019 exam­ine a com­mer­cial CNN (“Mole­an­a­lyz­er-Pro”; Haenssle et al 2018) for skin can­cer detec­tion. Con­cerned by the fact that doc­tors some­times use pur­ple mark­ers to high­light poten­tial­ly-­ma­lig­nant skin can­cers for eas­ier exam­i­na­tion, they com­pare before/after pho­tographs of skin can­cers which have been high­light­ed, and find that the pur­ple high­light­ing increases the prob­a­bil­ity of being clas­si­fied as malig­nant.

    How­ev­er, it is unclear that this is a dataset bias prob­lem, as the exist­ing train­ing datasets for skin can­cer are real­is­tic and already include pur­ple marker sam­ples11. The demon­strated manip­u­la­tion may sim­ply reflect the CNN using pur­ple as a proxy for human con­cern, which is an infor­ma­tive sig­nal and desir­able if it improves clas­si­fi­ca­tion per­for­mance in the real world on real med­ical cas­es. It is pos­si­ble that the train­ing datasets are in fact biased to some degree with too much/too lit­tle pur­ple or that use of pur­ple dif­fers sys­tem­at­i­cally across hos­pi­tals, and those would dam­age per­for­mance to some degree, but that is not demon­strated by their before/after com­par­i­son. Ide­al­ly, one would run a field trial to test the CNN’s per­for­mance as a whole by using it in var­i­ous hos­pi­tals and then fol­low­ing up on all cases to deter­mine benign or malig­nant; if the clas­si­fi­ca­tion per­for­mance drops con­sid­er­ably from the orig­i­nal train­ing, then that implies some­thing (pos­si­bly the pur­ple high­light­ing) has gone wrong.

  • Possible: Esteva et al 2017 trains a skin cancer classifier; the final CNN performs well in independent test sets. The paper does not mention this problem, but media coverage reported that rulers in photographs served as unintentional features:

    He and his colleagues had one such problem in their study with rulers. When dermatologists are looking at a lesion that they think might be a tumor, they'll break out a ruler—the type you might have used in grade school—to take an accurate measurement of its size. Dermatologists tend to do this only for lesions that are a cause for concern. So in the set of biopsy images, if an image had a ruler in it, the algorithm was more likely to call a tumor malignant, because the presence of a ruler correlated with an increased likelihood a lesion was cancerous. Unfortunately, as Novoa emphasizes, the algorithm doesn't know why that correlation makes sense, so it could easily misinterpret a random ruler sighting as grounds to diagnose cancer.

    It’s unclear how they detected this prob­lem or how they fixed it. And like Win­kler et al 2019, it’s unclear if this was a prob­lem which would reduce real-­world per­for­mance (are der­ma­tol­o­gists going to stop mea­sur­ing wor­ri­some lesion­s?).

Should We Tell Stories We Know Aren’t True?

So the NN tank story probably didn't happen as described, but something somewhat like it could have happened, and things sort of like it could happen now, and it is (as proven by its history) a catchy story to warn students with—it's not true, but it's ben trovato. Should we still mention it to journalists or in blog posts or in discussions of AI risk, as a noble lie?

I think not. In gen­er­al, we should pro­mote more epis­temic rigor and higher stan­dards in an area where there is already far too much impact of fic­tional sto­ries (eg the depress­ing inevitabil­ity of a Ter­mi­na­tor allu­sion in AI risk dis­cus­sion­s). Nor do I con­sider the story par­tic­u­larly effec­tive from a didac­tic per­spec­tive: rel­e­gat­ing dataset bias to myth­i­cal sto­ries does not inform the lis­tener about how com­mon or how seri­ous dataset bias is, nor is it help­ful for researchers inves­ti­gat­ing coun­ter­mea­sures and diag­nos­tic­s—the LIME devel­op­ers, for exam­ple, are not helped by sto­ries about Russ­ian tanks, but need real test­cases to show that their inter­pretabil­ity tools work & would help machine learn­ing devel­op­ers diag­nose & fix dataset bias.

I also fear that telling the tank story tends to pro­mote com­pla­cency and under­es­ti­ma­tion of the state of the art by imply­ing that NNs and AI in gen­eral are toy sys­tems which are far from prac­ti­cal­ity & can­not work in the real world (par­tic­u­larly the story vari­ants which date it rel­a­tively recent­ly), or that such sys­tems when they fail will fail in eas­ily diag­nosed, vis­i­ble, some­times amus­ing ways, ways which can be diag­nosed by a human com­par­ing the pho­tos or apply­ing some polit­i­cal rea­son­ing to the out­puts; but mod­ern NNs are pow­er­ful, are often deployed to the real world despite the spec­tre of dataset bias, and do not fail in bla­tant ways—what we actu­ally see with deep learn­ing are far more con­cern­ing fail­ure modes like “adver­sar­ial exam­ples” which are quite as inscrutable as the neural nets them­selves (or AlphaGo’s one mis­judged move result­ing in its only loss to Lee Sedol). Adver­sar­ial exam­ples are par­tic­u­larly insid­i­ous as the NN will work flaw­lessly in all the nor­mal set­tings and con­texts, only to fail totally when exposed to a cus­tom adver­sar­ial input. More impor­tant­ly, dataset bias and fail­ure to trans­fer tends to be a self­-lim­it­ing prob­lem, par­tic­u­larly when embed­ded in an ongo­ing sys­tem or rein­force­ment learn­ing agent, since if the NN is mak­ing errors based on dataset bias, it will in effect be gen­er­at­ing new coun­terex­am­ple dat­a­points for its next iter­a­tion.
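To make the contrast with blatant, easily-diagnosed failures concrete: the classic fast-gradient-sign method (FGSM) of constructing an adversarial example takes only a few lines. The sketch below assumes PyTorch and a pretrained ImageNet classifier, with the input tensor and label as placeholders; the perturbation is too small to notice, yet it is chosen precisely to flip the model's prediction.

```python
# Fast-gradient-sign adversarial example: a minimal sketch (PyTorch assumed;
# the input tensor and target label are placeholders).
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in for a real photograph
y = torch.tensor([847])                              # assumed ImageNet 'tank' class index

loss = F.cross_entropy(model(x), y)                  # loss w.r.t. the current label
loss.backward()
epsilon = 0.01                                       # perturbation too small to notice
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

print(model(x).argmax(1).item(), model(x_adv).argmax(1).item())  # predictions often differ
```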

Alternative examples

“There is noth­ing so use­less as doing effi­ciently that which should not be done at all.”

Peter Drucker

The more trou­bling errors are ones where the goal itself, the reward func­tion, is mis­-spec­i­fied or wrong or harm­ful. I am less wor­ried about algo­rithms learn­ing to do poorly the right thing for the wrong rea­sons because humans are sloppy in their data col­lec­tion than I am about them learn­ing to do well the wrong thing for the right rea­sons despite per­fect data col­lec­tion. Using losses which have lit­tle to do with the true human util­ity func­tion or deci­sion con­text is far more com­mon than seri­ous dataset bias: peo­ple think about where their data is com­ing from, but they tend not to think about what the con­se­quences of wrong clas­si­fi­ca­tions are. Such reward func­tion prob­lems can­not be fixed by col­lect­ing any amount of data or mak­ing data more rep­re­sen­ta­tive of the real world, and for large-s­cale sys­tems will be more harm­ful.

Unfor­tu­nate­ly, I know of no par­tic­u­larly com­pre­hen­sive lists of exam­ples of mis­-spec­i­fied rewards/unexpectedly bad proxy objec­tive functions/“reward hack­ing”/“wire­head­ing”/“per­verse instan­ti­a­tion”12; per­haps peo­ple can make sug­ges­tions, but a few exam­ples I have found or recall include:

  • opti­miza­tion for nutri­tious (not nec­es­sar­ily palat­able!) low-­cost diets: “The cost of sub­sis­tence”, Stigler 1945, “The Diet Prob­lem”, Dantzig 1990, “Stigler’s Diet Prob­lem Revis­ited”, Gar­ille & Gass 2001

    • SMT/SAT solvers are likewise infamous for finding solutions which are strictly valid yet surprising or useless, which perversity is exactly what makes them so invaluable in security/formal-verification research (for example, in RISC-V verification of exceptions, discovering that an exception can be triggered by turning on a debug unit & setting a breakpoint, or by using an obscure memory mode setting)
  • boat race reward-shaping for picking up targets results in not finishing the race at all but going in circles to hit targets: "Faulty Reward Functions in the Wild", OpenAI

  • a classic 3D robot-arm NN agent, in a somewhat unusual setup where the evaluator/reward function is another NN trained to predict human evaluations, learns to move the arm to a position which looks like it is positioned at the goal but is actually just in between the 'camera' and the goal: "Deep Reinforcement Learning from Human Preferences", Christiano et al 2017, OpenAI

  • reward-shap­ing a bicy­cle agent for not falling over & mak­ing progress towards a goal point (but not pun­ish­ing for mov­ing away) leads it to learn to cir­cle around the goal in a phys­i­cally sta­ble loop: “Learn­ing to Drive a Bicy­cle using Rein­force­ment Learn­ing and Shap­ing”, Randlov & Alstrom 1998; sim­i­lar dif­fi­cul­ties in avoid­ing patho­log­i­cal opti­miza­tion were expe­ri­enced by Cook 2004 (video of pol­i­cy-it­er­a­tion learn­ing to spin han­dle-bar to stay upright).

  • reward-shap­ing a soc­cer robot for touch­ing the ball caused it to learn to get to the ball and “vibrate” touch­ing it as fast as pos­si­ble: David Andre & Astro Teller in Ng et al 1999, “Pol­icy invari­ance under reward trans­for­ma­tions: the­ory and appli­ca­tion to reward shap­ing”

  • environments involving walking/running/movement which reward movement seem to often result in the agents learning to fall over as a local optimum of speed generation, possibly bouncing around; for example, Sims notes in one paper (Sims 1994) that "It is important that the physical simulation be reasonably accurate when optimizing for creatures that can move within it. Any bugs that allow energy leaks from non-conservation, or even round-off errors, will inevitably be discovered and exploited by the evolving creatures. …speed is used as the selection criteria, but the vertical component of velocity is ignored. For land environments, it can be necessary to prevent creatures from generating high velocities by simply falling over." In the companion "3-D Morphology" experiments, Sims discovered that without height limits, the creatures simply became as tall as possible and fell over; and if the conservation of momentum was not exact, creatures could evolve 'paddles' and paddle themselves along at high velocity. (Similar exploitation of rounding error was deliberately evolved by OpenAI in 2017 to turn apparently linear neural networks into nonlinear ones; Jaderberg et al 2019 appears to have had a similar momentum bug in its Quake simulator: "In one test, the bots invented a completely novel strategy, exploiting a bug that let teammates give each other a speed boost by shooting them in the back.")

  • One paper training a simulated robot gripper arm to stack objects like Legos included reward shaping; pathologies included "hovering", and when the shaping rewarded lifting the bottom face of the top block upwards, DDPG learned to knock the blocks over, thereby (temporarily) elevating the bottom of the top block and receiving the reward:

    We con­sider three dif­fer­ent com­pos­ite rewards in addi­tional to the orig­i­nal sparse task reward:

    1. Grasp shap­ing: Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.25 when the brick 1 has been grasped and a reward of 1.0 after com­ple­tion of the full task.
    2. Reach and grasp shap­ing: Reach brick 1, Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.125 when being close to brick 1, a reward of 0.25 when brick 1 has been grasped, and a reward of 1.0 after com­ple­tion of the full task.
    3. Full com­pos­ite shap­ing: the sparse reward com­po­nents as before in com­bi­na­tion with the dis­tance-based smoothly vary­ing com­po­nents.

    Fig­ure 5 shows the results of learn­ing with the above reward func­tions (blue traces). The fig­ure makes clear that learn­ing with the sparse reward only does not suc­ceed for the full task. Intro­duc­ing an inter­me­di­ate reward for grasp­ing allows the agent to learn to grasp but learn­ing is very slow. The time to suc­cess­ful grasp­ing can be sub­stan­tially reduced by giv­ing a dis­tance based reward com­po­nent for reach­ing to the first brick, but learn­ing does not progress beyond grasp­ing. Only with an addi­tional inter­me­di­ate reward com­po­nent as in con­tin­u­ous reach, grasp, stack the full task can be solved.

    Although the above reward functions are specific to the particular task, we expect that the idea of a composite reward function can be applied to many other tasks thus allowing learning to succeed even for challenging problems. Nevertheless, great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing the reward function components: e.g. reach and grasp components leading to a grasp unsuitable for stacking, agent not stacking the bricks because it will stop receiving the grasping reward before it receives reward for stacking and the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick. We show examples of these in the video.

  • RL agents using learned model-based planning paradigms such as model predictive control are noted to have issues with the planner essentially exploiting the learned model by choosing a plan going through the worst-modeled parts of the environment and producing unrealistic plans using teleportation; eg Mishra et al 2017, "Prediction and Control with Temporal Segment Models", note:

    If we attempt to solve the opti­miza­tion prob­lem as posed in (2), the solu­tion will often attempt to apply action sequences out­side the man­i­fold where the dynam­ics model is valid: these actions come from a very dif­fer­ent dis­tri­b­u­tion than the action dis­tri­b­u­tion of the train­ing data. This can be prob­lem­at­ic: the opti­miza­tion may find actions that achieve high rewards under the model (by exploit­ing it in a regime where it is invalid) but that do not accom­plish the goal when they are exe­cuted in the real envi­ron­ment.

    …Next, we com­pare our method to the base­lines on tra­jec­tory and pol­icy opti­miza­tion. Of inter­est is both the actual reward achieved in the envi­ron­ment, and the dif­fer­ence between the true reward and the expected reward under the mod­el. If a con­trol algo­rithm exploits the model to pre­dict unre­al­is­tic behav­ior, then the lat­ter will be large. We con­sider two tasks….Un­der each mod­el, the opti­miza­tion finds actions that achieve sim­i­lar mod­el-pre­dicted rewards, but the base­lines suf­fer from large dis­crep­an­cies between model pre­dic­tion and the true dynam­ics. Qual­i­ta­tive­ly, we notice that, on the push­ing task, the opti­miza­tion exploits the LSTM and one-step mod­els to pre­dict unre­al­is­tic state tra­jec­to­ries, such as the object mov­ing with­out being touched or the arm pass­ing through the object instead of col­lid­ing with it. Our model con­sis­tently per­forms bet­ter, and, with a latent action pri­or, the true exe­cu­tion closely matches the mod­el’s pre­dic­tion. When it makes inac­cu­rate pre­dic­tions, it respects phys­i­cal invari­ants, such as objects stay­ing still unless they are touched, or not pen­e­trat­ing each other when they col­lide

    This is similar to Sims's issues, or current issues in training walking or running agents in environments like MuJoCo, where it is easy for them to learn odd gaits like hopping (one countermeasure adds extra penalties for impacts to try to avoid this), jumping (eg Stelmaszczyk's attempts at reward shaping a skeleton agent), or flailing around wildly (another adds random pushes/shoves to the environment to try to make the agent learn more generalizable policies), which may work quite well in the specific simulation but not elsewhere. (To some degree this is beneficial for driving exploration in poorly-understood regions, so it's not all bad.) Christine Barron, working on a pancake-cooking robot-arm simulation, ran into reward-shaping problems: rewarding each timestep the pancake is not on the floor teaches the agent to hurl the pancake into the air as hard as possible; and for the passing-the-butter agent, rewarding proximity to the goal produces the same close-approach-but-avoidance behavior to maximize reward.

  • A curi­ous lex­i­co­graph­ic-pref­er­ence raw-RAM NES AI algo­rithm learns to pause the game to never lose at Tetris: Mur­phy 2013, “The First Level of Super Mario Bros. is Easy with Lex­i­co­graphic Order­ings and Time Trav­el… after that it gets a lit­tle tricky”

  • RL agent in Udac­ity self­-­driv­ing car rewarded for speed learns to spin in cir­cles: Matt Kel­cey

  • NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days of survival, yields an optimal plan of killing 2/3 of the crew & keeping the survivors alive as long as possible: iand675

  • Doug Lenat's Eurisko famously had issues with "parasitic" heuristics which, thanks to the program's self-modifying ability, edited important results to claim credit and be rewarded; this class of wireheading heuristics was troublesome enough that Lenat made the Eurisko core unmodifiable: "EURISKO: A program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results", Lenat 1983 (pg90)

  • genetic algo­rithms for image clas­si­fi­ca­tion evolves tim­ing-at­tack to infer image labels based on hard drive stor­age loca­tion: https://news.ycombinator.com/item?id=6269114

  • train­ing a dog to roll over results in slam­ming against the wall: http://lesswrong.com/lw/7qz/machine_learning_and_unintended_consequences/4vlv ; dol­phins rewarded for find­ing trash & dead seag­ulls in their tank learned to man­u­fac­ture trash & hunt liv­ing seag­ulls for more rewards

  • cir­cuit design with genetic/evolutionary com­pu­ta­tion:

    • an attempt to evolve a cir­cuit on an FPGA, to dis­crim­i­nate audio tones of 1kHz & 10kHz with­out using any tim­ing ele­ments, evolved a design which depended on dis­con­nected cir­cuits in order to work: “An evolved cir­cuit, intrin­sic in sil­i­con, entwined with physics”, Thomp­son 1996. (“Pos­si­ble mech­a­nisms include inter­ac­tions through the pow­er-­sup­ply wiring, or elec­tro­mag­netic cou­pling.” The evolved cir­cuit is sen­si­tive to room tem­per­a­ture vari­a­tions 23–43C, only work­ing per­fectly over the 10C range of room tem­per­a­ture it was exposed to dur­ing the 2 weeks of evo­lu­tion. It is also sen­si­tive to the exact loca­tion on the FPGA, degrad­ing when shifted to a new posi­tion; fur­ther fine­tun­ing evo­lu­tion fixes that, but then is vul­ner­a­ble when shifted back to the orig­i­nal loca­tion.)
    • an attempt to evolve an oscil­la­tor or a timer wound up evolv­ing a cir­cuit which picked up radio sig­nals from the lab PCs (although since the cir­cuits did work at their assigned func­tion as the human intend­ed, should we con­sider this a case of ‘dataset bias’ where the ‘dataset’ is the local lab envi­ron­men­t?): “The evolved radio and its impli­ca­tions for mod­el­ling the evo­lu­tion of novel sen­sors”, Jon Bird and Paul Layzell 2002
  • train­ing a “mini­taur” bot in sim­u­la­tion to carry a ball or duck on its back, CMA-ES dis­cov­ers it can drop the ball into a leg joint and then wig­gle across the floor with­out the ball ever drop­ping

  • CycleGAN, a cooperative GAN architecture for converting images from one genre to another (eg horses⟺zebras), has a loss function that rewards accurate reconstruction of an image from its transformed version; CycleGAN turns out to partially solve the task by, in addition to the cross-domain analogies it learns, steganographically hiding autoencoder-style data about the original image invisibly inside the transformed image to assist the reconstruction of details (Chu et al 2017)

    A researcher in 2020 working on art colorization told me of an interesting similar behavior: his automatically-grayscaled images were failing to train the NN well, and he concluded that this was because grayscaling a color image produces many shades of gray in a way that human artists do not, and that the formula used by OpenCV for RGB→grayscale permits only a few colors to map onto any given shade of gray, enabling accurate guessing of the original color! Such issues might require learning a grayscaler, similar to superresolution needing learned downscalers.

  • the ROUGE summarization metric, based on matching sub-phrases, is typically used with RL techniques since it is a non-differentiable loss; Salesforce notes that an effort at a ROUGE-only summarization NN produced largely gibberish summaries, and had to add in another loss function to get high-quality results

  • Alex Irpan writes of 3 anec­dotes:

    In talks with other RL researchers, I’ve heard sev­eral anec­dotes about the novel behav­ior they’ve seen from improp­erly defined rewards.

    • A coworker is teach­ing an agent to nav­i­gate a room. The episode ter­mi­nates if the agent walks out of bounds. He did­n’t add any penalty if the episode ter­mi­nates this way. The final pol­icy learned to be sui­ci­dal, because neg­a­tive reward was plen­ti­ful, pos­i­tive reward was too hard to achieve, and a quick death end­ing in 0 reward was prefer­able to a long life that risked neg­a­tive reward.
    • A friend is train­ing a sim­u­lated robot arm to reach towards a point above a table. It turns out the point was defined with respect to the table, and the table was­n’t anchored to any­thing. The pol­icy learned to slam the table really hard, mak­ing the table fall over, which moved the tar­get point too. The tar­get point just so hap­pened to fall next to the end of the arm.
    • A researcher gives a talk about using RL to train a sim­u­lated robot hand to pick up a ham­mer and ham­mer in a nail. Ini­tial­ly, the reward was defined by how far the nail was pushed into the hole. Instead of pick­ing up the ham­mer, the robot used its own limbs to punch the nail in. So, they added a reward term to encour­age pick­ing up the ham­mer, and retrained the pol­i­cy. They got the pol­icy to pick up the ham­mer…but then it threw the ham­mer at the nail instead of actu­ally using it.

    Admit­ted­ly, these are all sec­ond­hand accounts, and I haven’t seen videos of any of these behav­iors. How­ev­er, none of it sounds implau­si­ble to me. I’ve been burned by RL too many times to believe oth­er­wise…I’ve taken to imag­in­ing deep RL as a demon that’s delib­er­ately mis­in­ter­pret­ing your reward and actively search­ing for the lazi­est pos­si­ble local opti­ma. It’s a bit ridicu­lous, but I’ve found it’s actu­ally a pro­duc­tive mind­set to have.

  • an evolutionary-strategies RL agent in the ALE game Q*bert finds that it can steadily earn points by committing 'suicide' to lure an enemy into following it; more interestingly, it also discovers what appears to be a previously unknown bug where a sequence of jumps will, semi-randomly, permanently force the game into a state where the entire level begins flashing and the score increases rapidly & indefinitely until the game is reset (video)

  • a borderline case is noted in the ALE pinball game, where the 'nudge' ability is unlimited (unlike all real pinball machines) and a DQN can learn to score arbitrarily by nudging the ball over a switch repeatedly:

    The second showcase example studies neural network models (see Figure 5 for the network architecture) trained to play Atari games, here Pinball. As shown in [5], the DNN achieves excellent results beyond human performance. Like for the previous example, we construct LRP heatmaps to visualize the DNN's decision behavior in terms of pixels of the pinball game. Interestingly, after extensive training, the heatmaps become focused on few pixels representing high-scoring switches and lose track of the flippers. A subsequent inspection of the games in which these particular LRP heatmaps occur, reveals that the DNN agent firstly moves the ball into the vicinity of a high-scoring switch without using the flippers at all, then, secondly, "nudges" the virtual pinball table such that the ball infinitely triggers the switch by passing over it back and forth, without causing a tilt of the pinball table (see Figure 2b and Figure 6 for the heatmaps showing this point, and also Supplementary Video 1). Here, the model has learned to abuse the "nudging" threshold implemented through the tilting mechanism in the Atari Pinball software. From a pure game scoring perspective, it is indeed a rational choice to exploit any game mechanism that is available. In a real pinball game, however, the player would likely go bust since the pinball machinery is programmed to tilt after a few strong movements of the whole physical machine.

  • "Trial without Error: Towards Safe Reinforcement Learning via Human Intervention", Saunders et al 2017; the blog writeup notes:

    The Road Run­ner results are espe­cially inter­est­ing. Our goal is to have the agent learn to play Road Run­ner with­out los­ing a sin­gle life on Level 1 of the game. Deep RL agents are known to dis­cover a ‘Score Exploit’ in Road Run­ner: they learn to inten­tion­ally kill them­selves in a way that (para­dox­i­cal­ly) earns greater reward. Dying at a pre­cise time causes the agent to repeat part of Level 1, where it earns more points than on Level 2. This is a local opti­mum in pol­icy space that a human gamer would never be stuck in.

    Ide­al­ly, our Blocker would pre­vent all deaths on Level 1 and hence elim­i­nate the Score Exploit. How­ev­er, through ran­dom explo­ration the agent may hit upon ways of dying that “fool” our Blocker (be­cause they look dif­fer­ent from exam­ples in its train­ing set) and hence learn a new ver­sion of the Score Exploit. In other words, the agent is implic­itly per­form­ing a ran­dom search for adver­sar­ial exam­ples for our Blocker (which is a con­vo­lu­tional neural net)…In Road Run­ner we did not achieve zero cat­a­stro­phes but were able to reduce the rate of deaths per frame from 0.005 (with no human over­sight at all) to 0.0001.

  • another paper notes various bugs in the ALE games, but also a new infinite loop for maximizing scores:

    Final­ly, we dis­cov­ered that on some games the actual opti­mal strat­egy is by doing a loop over and over giv­ing a small amount of reward. In Ele­va­tor Action the agent learn to stay at the first floor and kill over and over the first ene­my. This behav­ior can­not be seen as an actual issue as the agent is basi­cally opti­miz­ing score but this is def­i­nitely not the intended goal. A human player would never per­form this way.

  • DeepMind's R2D3 writeup notes:

    Wall Sen­sor Stack: The orig­i­nal Wall Sen­sor Stack envi­ron­ment had a bug that the R2D3 agent was able to exploit. We fixed the bug and ver­i­fied the agent can learn the proper stack­ing behav­ior.

    …An­other desir­able prop­erty of our approach is that our agents are able to learn to out­per­form the demon­stra­tors, and in some cases even to dis­cover strate­gies that the demon­stra­tors were not aware of. In one of our tasks the agent is able to dis­cover and exploit a bug in the envi­ron­ment in spite of all the demon­stra­tors com­plet­ing the task in the intended way…R2D3 per­formed bet­ter than our aver­age human demon­stra­tor on Base­ball, Draw­bridge, Nav­i­gate Cubes and the Wall Sen­sor tasks. The behav­ior on Wall Sen­sor Stack in par­tic­u­lar is quite inter­est­ing. On this task R2D3 found a com­pletely dif­fer­ent strat­egy than the human demon­stra­tors by exploit­ing a bug in the imple­men­ta­tion of the envi­ron­ment. The intended strat­egy for this task is to stack two blocks on top of each other so that one of them can remain in con­tact with a wall mounted sen­sor, and this is the strat­egy employed by the demon­stra­tors. How­ev­er, due to a bug in the envi­ron­ment the strat­egy learned by R2D3 was to trick the sen­sor into remain­ing active even when it is not in con­tact with the key by press­ing the key against it in a pre­cise way.

  • "Emergent Tool Use From Multi-Agent Autocurricula", Baker et al 2019:

    We orig­i­nally believed defend­ing against ramp use would be the last stage of emer­gence in this envi­ron­ment; how­ev­er, we were sur­prised to find that yet two more qual­i­ta­tively new strate­gies emerged. After 380 mil­lion total episodes of train­ing, the seek­ers learn to bring a box to the edge of the play area where the hiders have locked the ramps. The seek­ers then jump on top of the box and surf it to the hiders’ shel­ter; this is pos­si­ble because the envi­ron­ment allows agents to move together with the box regard­less of whether they are on the ground or not. In respon­se, the hiders learn to lock all of the boxes in place before build­ing their shel­ter.

    The OA blog post expands on the noted exploits:

    Sur­pris­ing behav­iors: We’ve shown that agents can learn sophis­ti­cated tool use in a high fidelity physics sim­u­la­tor; how­ev­er, there were many lessons learned along the way to this result. Build­ing envi­ron­ments is not easy and it is quite often the case that agents find a way to exploit the envi­ron­ment you build or the physics engine in an unin­tended way.

    • Box surf­ing: Since agents move by apply­ing forces to them­selves, they can grab a box while on top of it and “surf” it to the hider’s loca­tion.
    • End­less run­ning: With­out adding explicit neg­a­tive rewards for agents leav­ing the play area, in rare cases hiders will learn to take a box and end­lessly run with it.
    • Ramp exploita­tion (hider­s): Rein­force­ment learn­ing is amaz­ing at find­ing small mechan­ics to exploit. In this case, hiders abuse the con­tact physics and remove ramps from the play area.
    • Ramp exploita­tion (seek­er­s): In this case, seek­ers learn that if they run at a wall with a ramp at the right angle, they can launch them­selves upward.
  • "Fine-Tuning Language Models from Human Preferences", Ziegler et al 2019, fine-tuned an English text generation model based on human ratings; they provide a curious example of a reward-specification bug. Here, the reward was accidentally negated; this reversal, rather than resulting in nonsense, resulted in (literally) perversely coherent behavior of emitting obscenities to maximize the new score:

    Bugs can optimize for bad behavior: One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota's Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.

See Also


  1. The paper in ques­tion dis­cusses gen­eral ques­tions of nec­es­sary res­o­lu­tion, com­put­ing require­ments, optics, nec­es­sary error rates, and algo­rithms, but does­n’t describe any imple­mented sys­tems, much less expe­ri­ences which resem­ble the tank sto­ry.↩︎

  2. Another inter­est­ing detail from Harley et al 1962 about their tank study: in dis­cussing design­ing their com­puter ‘sim­u­la­tion’ of their qua­si­-NN algo­rithms, their descrip­tion of the pho­tographs on pg133 makes it sound as if the dataset was con­structed from the same pho­tographs by using large-s­cale aer­ial footage and then crop­ping out the small squares with tanks and then cor­re­spond­ing small squares with­out tanks—so they only had to process one set of pho­tographs, and the result­ing tank/non-tank sam­ples are inher­ently matched on date, weath­er, time of day, light­ing, gen­eral loca­tion, roll of film, cam­era, and pho­tog­ra­ph­er. If true, that would make almost all the var­i­ous sug­gested tank prob­lem short­cuts impos­si­ble, and would be fur­ther evi­dence that Kanal’s project was not & could not have been the ori­gin of the tank sto­ry.↩︎

  3. This seems entirely reasonable to me, given that hardly any AI research existed at that point. While it's unclear what results were accomplished immediately thanks to the 1956 workshop, many of the attendees would make major discoveries in AI. Grace Solomonoff, wife of attendee Ray Solomonoff ("Ray Solomonoff and the Dartmouth Summer Research Project in Artificial Intelligence, 1956", 2016), describes the workshop as having vivid discussions but being compromised by getting only half its funding (so it didn't last the summer) and by attendees showing up sporadically & for short times ("Many participants only showed up for a day or even less."); no agreement was reached on a specific project to tackle, although Solomonoff did write a paper there he considered important.↩︎

  4. One com­menter observes that the NN tank story and ilk appears to almost always be told about neural net­works, and won­ders why when dataset bias ought to be just as much a prob­lem for other statistical/machine-learning meth­ods like deci­sion trees, which are capa­ble of learn­ing com­plex non­lin­ear prob­lems. I could note that these anec­dotes also get rou­tinely told about genetic algo­rithms & evo­lu­tion­ary meth­ods, so it’s not purely neu­ral, and it might be that NNs are vic­tims of their own suc­cess: par­tic­u­larly as of 2017, NNs are so pow­er­ful & flex­i­ble in some areas (like com­puter vision) there is lit­tle com­pe­ti­tion, and so any hor­ror sto­ries will prob­a­bly involve NNs.↩︎

  5. Here, the number of photographs and exactly how they were divided into training/validation sets is an oddly specific detail. This is reminiscent of religions or novels, where originally sparse and undetailed stories become elaborated and ever more detailed, with striking details added to catch the imagination. For example, the Magi in the Christian Gospels are unnamed, but have been given by later Christians extensive fictional biographies of names ("Names for the Nameless in the New Testament"), symbolism, kingdoms, contemporary successors/descendants, martyrdoms & locations of remains…↩︎

  6. One mem­o­rable exam­ple of this for me was when the Edward Snow­den NSA leaks began.

    Sure­ly, given pre­vi­ous instances like dif­fer­en­tial crypt­analy­sis or pub­lic-key cryp­tog­ra­phy, the NSA had any num­ber of amaz­ing tech­nolo­gies and moon math beyond the ken of the rest of us? I read many of the pre­sen­ta­tions with great inter­est, par­tic­u­larly about how they searched for indi­vid­u­als or data—­cut­ting edge neural net­works? Evo­lu­tion­ary algo­rithms? Even more exotic tech­niques? Nope—reg­ex­ps, lin­ear mod­els, and ran­dom forests. Prac­ti­cal but bor­ing. Nor did any major cryp­to­graphic break­throughs become exposed via Snow­den.

    Over­all, the NSA cor­pus indi­cates that they had the abil­i­ties you would expect from a large group of patient pro­gram­mers with no ethics given a bud­get of bil­lions of dol­lars to spend on a mis­sion whose motto was “hack the planet” using a com­pre­hen­sive set of meth­ods rang­ing from phys­i­cal breakins & bugs, theft of pri­vate keys, bribery, large-s­cale telecom­mu­ni­ca­tions tap­ping, implant­ing back­doors, pur­chase & dis­cov­ery of unpatched vul­ner­a­bil­i­ties, & stan­dards process sub­ver­sion. Highly effec­tive in the aggre­gate but lit­tle that peo­ple had­n’t expected or long spec­u­lated about in the abstract.↩︎

  7. Although there are occa­sional excep­tions where a data aug­men­ta­tion does­n’t pre­serve impor­tant seman­tics: you would­n’t want to use hor­i­zon­tal flips with street signs.↩︎

  8. It amuses me to note when web­sites or tools are clearly using Ima­geNet CNNs, because they assume Ima­geNet cat­e­gories or pro­vide anno­ta­tions in their meta­data, or because they exhibit uncan­nily good recog­ni­tion of dogs. Some­times CNNs are much bet­ter than they are given credit for being and they are assumed by com­menters to fail on prob­lems they actu­ally suc­ceed on; for exam­ple, some meme images have cir­cu­lated claim­ing that CNNs can’t dis­tin­guish fried chick­ens from dogs, chi­huahuas from muffins, or sleep­ing dogs from bagel­s—but as amus­ing as the image-sets are, Miles Brundage reports that Clar­i­fai’s CNN API has lit­tle trou­ble accu­rately dis­tin­guish­ing man’s worst food from man’s best friend.↩︎

  9. Recht et al 2019’s Ima­geNet-v2 turns out to illus­trate some sub­tle issues in mea­sur­ing dataset bias (Engstrom et al 2020): because of mea­sure­ment error in the labels of images caus­ing errors in the final dataset, sim­ply com­par­ing a clas­si­fier trained on one with its per­for­mance on the other and not­ing that per­for­mance fell by X% yields a mis­lead­ingly inflated esti­mate of ‘bias’ by attribut­ing the com­bined error of both datasets to the bias.↩︎

  10. Lapuschkin et al 2019:

    The first learning machine is a model based on Fisher vectors (FV) [31, 32] trained on the PASCAL VOC 2007 image dataset [33] (see Section E). The model and also its competitor, a pretrained Deep Neural Network (DNN) that we fine-tune on PASCAL VOC, show both excellent state-of-the-art test set accuracy on categories such as 'person', 'train', 'car', or 'horse' of this benchmark (see Table 3). Inspecting the basis of the decisions with LRP, however, reveals for certain images substantial divergence, as the heatmaps exhibiting the reasons for the respective classification could not be more different. Clearly, the DNN's heatmap points at the horse and rider as the most relevant features (see Figure 14). In contrast, FV's heatmap is most focused onto the lower left corner of the image, which contains a source tag. A closer inspection of the data set (of 9963 samples [33]) that typically humans never look through exhaustively, shows that such source tags appear distinctively on horse images; a striking artifact of the dataset that so far had gone unnoticed [34]. Therefore, the FV model has 'overfitted' the PASCAL VOC dataset by relying mainly on the easily identifiable source tag, which incidentally correlates with the true features, a clear case of 'Clever Hans' behavior. This is confirmed by observing that artificially cutting the source tag from horse images significantly weakens the FV model's decision while the decision of the DNN stays virtually unchanged (see Figure 14). If we take instead a correctly classified image of a Ferrari and then add to it a source tag, we observe that the FV's prediction swiftly changes from 'car' to 'horse' (cf. Figure 2a), a clearly invalid decision (see Section E and Figures 15–20 for further examples and analyses)… For the classification of ships the classifier is mostly focused on the presence of water in the bottom half of an image. Removing the copyright tag or the background results in a drop of predictive capabilities. A deep neural network, pre-trained in the ImageNet dataset [93], instead shows none of these shortcomings.

    The airplane example is a little more debatable—the presence of a lot of blue sky in airplane images seems like a valid cue to me and not necessarily cheating:

    …The SpRAy analysis could furthermore reveal another 'Clever Hans' type behavior in our fine-tuned DNN model, which had gone unnoticed in previous manual analysis of the relevance maps. The large eigengaps in the eigenvalue spectrum of the DNN heatmaps for class “aeroplane” indicate that the model uses very distinct strategies for classifying aeroplane images (see Figure 26). A t-SNE visualization (Figure 28) further highlights this cluster structure. One unexpected strategy we could discover with the help of SpRAy is to identify aeroplane images by looking at the artificial padding pattern at the image borders, which for aeroplane images predominantly consists of uniform and structureless blue background. Note that padding is typically introduced for technical reasons (the DNN model only accepts square-shaped inputs), but unexpectedly (and unwantedly) the padding pattern became part of the model's strategy to classify aeroplane images. Subsequently we observe that changing the manner in which padding is performed has a strong effect on the output of the DNN classifier (see Figures 29–32).
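    Both artifacts suggest the same cheap sanity check, which needs no explanation method at all: intervene directly on the suspected cue (mask or paste the source tag, change the padding) and see whether the prediction moves. A minimal sketch, assuming a generic PyTorch image classifier `model` returning logits; the occlusion-box coordinates and variable names are hypothetical, not taken from Lapuschkin et al:

    ```python
    import torch

    def class_probs(model, img: torch.Tensor) -> torch.Tensor:
        """Class probabilities for one image tensor of shape (3, H, W)."""
        with torch.no_grad():
            return torch.softmax(model(img.unsqueeze(0)), dim=1).squeeze(0)

    def occlude(img: torch.Tensor, box, fill: float = 0.5) -> torch.Tensor:
        """Gray out a suspected artifact region, e.g. a source tag in a corner."""
        x0, y0, x1, y1 = box
        out = img.clone()
        out[:, y0:y1, x0:x1] = fill
        return out

    # Hypothetical usage:
    # before = class_probs(model, horse_img)
    # after  = class_probs(model, occlude(horse_img, box=(0, 190, 60, 224)))
    # If the 'horse' probability collapses when only the corner tag is removed
    # (or appears when the tag is pasted onto a car image), the model is keying
    # on the artifact rather than the horse -- the 'Clever Hans' signature.
    ```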

    ↩︎
  11. Winkler et al 2019: “When reviewing the open-access International Skin Imaging Collaboration database, which is a source of training images for research groups, we found that a similar percentage of melanomas (52 of 2169 [2.4%]) and nevi (214 of 9303 [2.3%]) carry skin markings. Nevertheless, it seems conceivable that either an imbalance in the distribution of skin markings in thousands of other training images that were used in the CNN tested herein or the assignment of higher weights to blue markings only in lesions with specific (though unknown) accompanying features may induce a CNN to associate skin markings with the diagnosis of melanoma. The latter hypothesis may also explain why melanoma probability scores remained almost unchanged in many marked nevi while being increased in others.”↩︎
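    The imbalance Winkler et al worry about is at least straightforward to screen for when the artifact can be annotated: tabulate the artifact's prevalence per class, as they did for the ISIC images (the 2.4% vs. 2.3% figures above). A sketch with a hypothetical metadata table (the column names and rows are made up for illustration):

    ```python
    import pandas as pd

    # Hypothetical per-image metadata: the diagnosis label plus a boolean flag
    # for the suspected confounder (e.g. visible skin markings).
    df = pd.DataFrame({
        "label":       ["melanoma", "nevus", "melanoma", "nevus", "nevus"],
        "has_marking": [True, False, False, True, False],
    })

    # Per-class prevalence of the artifact: roughly equal rates (like 2.4% vs 2.3%)
    # are reassuring, while a large gap flags a spurious cue a model could exploit.
    print(df.groupby("label")["has_marking"].mean())
    ```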

  12. Get­ting into more gen­eral eco­nom­ic, behav­ioral, or human sit­u­a­tions would be going too far afield, but the rel­e­vant ana­logues are “”, “”, “law of ”, “”, “”, or “”; such align­ment prob­lems are only par­tially dealt with by hav­ing ground-truth evo­lu­tion­ary , and avoid­ing reward hack­ing remains an open prob­lem (even in the­o­ry). gam­ing com­mu­ni­ties fre­quently pro­vide exam­ples of reward-hack­ing, par­tic­u­larly when games are fin­ished faster by exploit­ing bugs to ; par­tic­u­larly eso­teric tech­niques require out­right hack­ing the present in many games/devices—for exam­ple, ‘par­al­lel uni­verses’ hack which avoids using any jumps by exploit­ing an bug & wrap­around to accel­er­ate Mario to near-in­fi­nite speed, pass­ing through the entire map mul­ti­ple times, in order to stop at the right place. ↩︎