The Neural Net Tank Urban Legend

AI folklore tells a story about a neural network trained to detect tanks which instead learned to detect time of day; on investigation, this probably never happened.
NN, history, sociology, Google, bibliography
2011-09-20–2019-08-14; finished; certainty: highly likely; importance: 4


A cautionary tale in artificial intelligence tells of researchers training a neural network (NN) to detect tanks in photographs, succeeding, only to realize that the photographs had been collected under specific conditions for tanks/non-tanks and the NN had learned something useless like time of day. This story is often told to warn about the limits of algorithms and the importance of data collection in avoiding “dataset bias”/“data leakage”, where the collected data can be solved using algorithms that do not generalize to the true data distribution; but the tank story itself is almost never sourced.

I collate many extant versions dating back a quarter of a century to 1992, along with two NN-related anecdotes from the 1960s; their contradictions & details indicate a classic “urban legend”, with a probable origin in a speculative question asked in the 1960s by Edward Fredkin at an AI conference about some early NN research, which was subsequently classified & never followed up on.

I suggest that dataset bias is real but exaggerated by the tank story, which gives a misleading indication of the risks from deep learning, and that it would be better not to repeat it but instead to use real examples of dataset bias and focus on larger-scale risks like AI systems optimizing for the wrong utility functions.

Deep learning’s rise over the past decade and dominance in image processing tasks has led to an explosion of applications attempting to infer high-level semantics locked up in raw sensory data like photographs. Convolutional neural networks are now applied not just to ordinary tasks like sorting cucumbers by quality but to everything from predicting the best Go move, to inferring where a photograph was taken, to judging whether a photograph is “interesting” or “pretty”, not to mention supercharging traditional tasks like radiology interpretation or facial recognition, which have reached levels of accuracy that could only be dreamed of decades ago. With this approach of “neural net all the things!”, the question of to what extent the trained neural networks are useful in the real world and will do what we want them to do & not what we told them to do has taken on additional importance, especially given the possibility of neural networks learning to accomplish extremely inconvenient things like inferring individual human differences such as criminality or homosexuality (to give two highly controversial recent examples where the meaningfulness of claimed successes has been severely questioned).

In this con­text, a cau­tion­ary story is often told of in­cau­tious re­searchers decades ago who trained a NN for the mil­i­tary to find im­ages of tanks, only to dis­cover they had trained a neural net­work to de­tect some­thing else en­tirely (what, pre­cise­ly, that some­thing else was varies in the telling). It would be a good & in­struc­tive sto­ry… if it were true.

Is it?

Did It Happen?

Versions of the Story

Drawing on the usual search resources (Google/Google Books/Google Scholar/Libgen/LessWrong/Hacker News/Twitter) in investigating leprechauns, I have compiled a large number of variants of the story, presented below in reverse chronological order by decade, letting us trace the story back towards its roots:

2010s

Heather Mur­phy, “Why Stan­ford Re­searchers Tried to Cre­ate a ‘Gay­dar’ Ma­chine” (NYT), 2017-10-09:

So What Did the Ma­chines See? Dr. Kosin­ski and Mr. Wang [Wang & Kosin­ski 2018] say that the al­go­rithm is re­spond­ing to fixed fa­cial fea­tures, like nose shape, along with “groom­ing choic­es,” such as eye make­up. But it’s also pos­si­ble that the al­go­rithm is see­ing some­thing to­tally un­known. “The more data it has, the bet­ter it is at pick­ing up pat­terns,” said Sarah Jamie Lewis, an in­de­pen­dent pri­vacy re­searcher who Tweeted a cri­tique of the study. “But the pat­terns aren’t nec­es­sar­ily the ones you think that they are.” , the di­rec­tor of M.I.T.’s Cen­ter for Brains, Minds and Ma­chi­nes, offered a clas­sic para­ble used to il­lus­trate this dis­con­nect. The Army trained a pro­gram to differ­en­ti­ate Amer­i­can tanks from Russ­ian tanks with 100% ac­cu­ra­cy. Only later did an­a­lysts re­al­ized that the Amer­i­can tanks had been pho­tographed on a sunny day and the Russ­ian tanks had been pho­tographed on a cloudy day. The com­puter had learned to de­tect bright­ness. Dr. Cox has spot­ted a ver­sion of this in his own stud­ies of dat­ing pro­files. Gay peo­ple, he has found, tend to post high­er-qual­ity pho­tos. Dr. Kosin­ski said that they went to great lengths to guar­an­tee that such con­founders did not in­flu­ence their re­sults. Still, he agreed that it’s eas­ier to teach a ma­chine to see than to un­der­stand what it has seen.

[It is worth noting that Arcas et al’s criticisms, such as their ‘gay version’ photographs, do not appear to have been confirmed by an independent replication.]

Alexan­der Har­row­ell, “It was called a per­cep­tron for a rea­son, damn it”, 2017-09-30:

You might think that this is rather like one of the clas­sic op­ti­cal il­lu­sions, but it’s worse than that. If you no­tice that you look at some­thing this way, and then that way, and it looks differ­ent, you’ll no­tice some­thing is odd. This is not some­thing our deep learner will do. Nor is it able to iden­tify any bias that might ex­ist in the cor­pus of data it was trained on…or maybe it is. If there is any prop­erty of the train­ing data set that is strongly pre­dic­tive of the train­ing cri­te­ri­on, it will zero in on that prop­erty with the fe­ro­cious clar­ity of Dar­win­ism. In the 1980s, an early back­prop­a­gat­ing neural net­work was set to find So­viet tanks in a pile of re­con­nais­sance pho­tographs. It worked, un­til some­one no­ticed that the Red Army usu­ally trained when the weather was good, and in any case the satel­lite could only see them when the sky was clear. The med­ical school at St Thomas’ Hos­pi­tal in Lon­don found theirs had learned that their suc­cess­ful stu­dents were usu­ally white.

An in­ter­est­ing story with a dis­tinct “fam­ily re­sem­blance” is told about a NN clas­si­fy­ing wolves/dogs, by Evgeniy Niko­lay­chuk, “Dogs, Wolves, Data Sci­ence, and Why Ma­chines Must Learn Like Hu­mans Do”, 2017-06-09:

Neural net­works are de­signed to learn like the hu­man brain, but we have to be care­ful. This is not be­cause I’m scared of ma­chines tak­ing over the plan­et. Rather, we must make sure ma­chines learn cor­rect­ly. One ex­am­ple that al­ways pops into my head is how one neural net­work learned to differ­en­ti­ate be­tween dogs and wolves. It did­n’t learn the differ­ences be­tween dogs and wolves, but in­stead learned that wolves were on snow in their pic­ture and dogs were on grass. It learned to differ­en­ti­ate the two an­i­mals by look­ing at snow and grass. Ob­vi­ous­ly, the net­work learned in­cor­rect­ly. What if the dog was on snow and the wolf was on grass? Then, it would be wrong.

However, in his source, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier” (Ribeiro et al 2016), the authors specify of their dog/wolf snow-detector NN that they “trained this bad classifier intentionally, to evaluate whether subjects are able to detect it [the bad performance]”, using LIME for insight into how the classifier was making its classification, concluding that “After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially.” So Nikolaychuk appears to have misremembered. (Perhaps in another 25 years students will be told in their classes of how a NN was once trained by ecologists to count wolves…)

Redditor mantrap2 gives this version of the story on 2015-06-20:

I re­mem­ber this kind of thing from the 1980s: the US Army was test­ing im­age recog­ni­tion seek­ers for mis­siles and was get­ting ex­cel­lent re­sults on North­ern Ger­man tests with NATO tanks. Then they tested the same sys­tems in other en­vi­ron­ment and there re­sults were sud­denly shock­ingly bad. Turns out the im­age recog­ni­tion was key­ing off the trees with tank-like mi­nor fea­tures rather than the tank it­self. Putting other ve­hi­cles in the same forests got sim­i­lar high hits but tanks by them­selves (in desert test ranges) did­n’t reg­is­ter. Luck­ily a scep­tic some­where de­cided to “do one more test to make sure”.

Den­nis Polis, God, Sci­ence and Mind, 2012 (pg131, lim­ited Google Books snip­pet, un­clear what ref 44 is):

These facts re­fute a Neo­pla­tonic ar­gu­ment for the es­sen­tial im­ma­te­ri­al­ity of the soul, viz. that since the mind deals with uni­ver­sal rep­re­sen­ta­tions, it op­er­ates in a specifi­cally im­ma­te­r­ial way…­So, aware­ness is not ex­plained by con­nec­tion­ism. The re­sults of neural net train­ing are not al­ways as ex­pect­ed. One team in­tended to train neural nets to rec­og­nize bat­tle tanks in aer­ial pho­tos. The sys­tem was trained us­ing pho­tos with and with­out tanks. After the train­ing, a differ­ent set of pho­tos was used for eval­u­a­tion, and the sys­tem failed mis­er­ably—be­ing to­tally in­ca­pable of dis­tin­guish­ing those with tanks. The sys­tem ac­tu­ally dis­crim­i­nated cloudy from sunny days. It hap­pened that all the train­ing pho­tos with tanks were taken on cloudy days, while those with­out were on clear days.44 What does this show? That neural net train­ing is mind­less. The sys­tem had no idea of the in­tent of the en­ter­prise, and did what it was pro­grammed to do with­out any con­cept of its pur­pose. As with Dawkins’ evo­lu­tion sim­u­la­tion (p. 66), the goals of com­puter neural nets are im­posed by hu­man pro­gram­mers.

Blay Whit­by, Ar­ti­fi­cial In­tel­li­gence: A Be­gin­ner’s Guide 2012 (pg53):

It is not yet clear how an ar­ti­fi­cial neural net could be trained to deal with “the world” or any re­ally open-ended sets of prob­lems. Now some read­ers may feel that this un­pre­dictabil­ity is not a prob­lem. After all, we are talk­ing about train­ing not pro­gram­ming and we ex­pect a neural net to be­have rather more like a brain than a com­put­er. Given the use­ful­ness of nets in un­su­per­vised learn­ing, it might seem there­fore that we do not re­ally need to worry about the prob­lem be­ing of man­age­able size and the train­ing process be­ing pre­dictable. This is not the case; we re­ally do need a man­age­able and well-de­fined prob­lem for the train­ing process to work. A fa­mous AI ur­ban myth may help to make this clear­er.

The story goes some­thing like this. A re­search team was train­ing a neural net to rec­og­nize pic­tures con­tain­ing tanks. (I’ll leave you to guess why it was tanks and not tea-cup­s.) To do this they showed it two train­ing sets of pho­tographs. One set of pic­tures con­tained at least one tank some­where in the scene, the other set con­tained no tanks. The net had to be trained to dis­crim­i­nate be­tween the two sets of pho­tographs. Even­tu­al­ly, after all that back­-prop­a­ga­tion stuff, it cor­rectly gave the out­put “tank” when there was a tank in the pic­ture and “no tank” when there was­n’t. Even if, say, only a lit­tle bit of the gun was peep­ing out from be­hind a sand dune it said “tank”. Then they pre­sented a pic­ture where no part of the tank was vis­i­ble—it was ac­tu­ally com­pletely hid­den be­hind a sand dune—and the pro­gram said “tank”.

Now when this sort of thing hap­pens re­search labs tend to split along age-based lines. The young hairs say “Great! We’re in line for the No­bel Prize!” and the old heads say “Some­thing’s gone wrong”. Un­for­tu­nate­ly, the old heads are usu­ally right—as they were in this case. What had hap­pened was that the pho­tographs con­tain­ing tanks had been taken in the morn­ing while the army played tanks on the range. After lunch the pho­tog­ra­pher had gone back and taken pic­tures from the same an­gles of the empty range. So the net had iden­ti­fied the most re­li­able sin­gle fea­ture which en­abled it to clas­sify the two sets of pho­tos, namely the an­gle of the shad­ows. “AM = tank, PM = no tank”. This was an ex­tremely effec­tive way of clas­si­fy­ing the two sets of pho­tographs in the train­ing set. What it most cer­tainly was not was a pro­gram that rec­og­nizes tanks. The great ad­van­tage of neural nets is that they find their own clas­si­fi­ca­tion cri­te­ria. The great prob­lem is that it may not be the one you want!

Thom Blake notes on 2011-09-20 that the story is:

Prob­a­bly apoc­ryphal. I haven’t been able to track this down, de­spite hav­ing heard the story both in com­puter ethics class and at aca­d­e­mic con­fer­ences.

“Em­bar­rass­ing mis­takes in per­cep­tron re­search”, Mar­vin Min­sky, 2011-01-31:

Like I had a friend in Italy who had a per­cep­tron that looked at a vi­su­al… it had vi­sual in­puts. So, he… he had scores of mu­sic writ­ten by Bach of chorales and he had scores of chorales writ­ten by mu­sic stu­dents at the lo­cal con­ser­va­to­ry. And he had a per­cep­tron—a big ma­chine—that looked at these and those and tried to dis­tin­guish be­tween them. And he was able to train it to dis­tin­guish be­tween the mas­ter­pieces by Bach and the pretty good chorales by the con­ser­va­tory stu­dents. Well, so, he showed us this data and I was look­ing through it and what I dis­cov­ered was that in the lower left hand cor­ner of each page, one of the sets of data had sin­gle whole notes. And I think the ones by the stu­dents usu­ally had four quar­ter notes. So that, in fact, it was pos­si­ble to dis­tin­guish be­tween these two classes of… of pieces of mu­sic just by look­ing at the lower left… lower right hand cor­ner of the page. So, I told this to the… to our sci­en­tist friend and he went through the data and he said: ‘You guessed right. That’s… that’s how it hap­pened to make that dis­tinc­tion.’ We thought it was very fun­ny.

A sim­i­lar thing hap­pened here in the United States at one of our re­search in­sti­tu­tions. Where a per­cep­tron had been trained to dis­tin­guish be­tween—this was for mil­i­tary pur­pos­es—It could… it was look­ing at a scene of a for­est in which there were cam­ou­flaged tanks in one pic­ture and no cam­ou­flaged tanks in the oth­er. And the per­cep­tron—after a lit­tle train­ing—­got… made a 100% cor­rect dis­tinc­tion be­tween these two differ­ent sets of pho­tographs. Then they were em­bar­rassed a few hours later to dis­cover that the two rolls of film had been de­vel­oped differ­ent­ly. And so these pic­tures were just a lit­tle darker than all of these pic­tures and the per­cep­tron was just mea­sur­ing the to­tal amount of light in the scene. But it was very clever of the per­cep­tron to find some way of mak­ing the dis­tinc­tion.

2000s

Eliezer Yud­kowsky, 2008-08-24 (sim­i­larly quoted in “Ar­ti­fi­cial In­tel­li­gence as a Neg­a­tive and Pos­i­tive Fac­tor in Global Risk”, “Ar­ti­fi­cial In­tel­li­gence in global risk” in Global Cat­a­strophic Risks 2011, & “Friendly Ar­ti­fi­cial In­tel­li­gence” in Sin­gu­lar­ity Hy­pothe­ses 2013):

Once upon a time—I’ve seen this story in sev­eral ver­sions and sev­eral places, some­times cited as fact, but I’ve never tracked down an orig­i­nal source—once upon a time, I say, the US Army wanted to use neural net­works to au­to­mat­i­cally de­tect cam­ou­flaged en­emy tanks. The re­searchers trained a neural net on 50 pho­tos of cam­ou­flaged tanks amid trees, and 50 pho­tos of trees with­out tanks. Us­ing stan­dard tech­niques for su­per­vised learn­ing, the re­searchers trained the neural net­work to a weight­ing that cor­rectly loaded the train­ing set—out­put “yes” for the 50 pho­tos of cam­ou­flaged tanks, and out­put “no” for the 50 pho­tos of for­est. Now this did not prove, or even im­ply, that new ex­am­ples would be clas­si­fied cor­rect­ly. The neural net­work might have “learned” 100 spe­cial cases that would­n’t gen­er­al­ize to new prob­lems. Not, “cam­ou­flaged tanks ver­sus for­est”, but just, “pho­to-1 pos­i­tive, pho­to-2 neg­a­tive, pho­to-3 neg­a­tive, pho­to-4 pos­i­tive…” But wise­ly, the re­searchers had orig­i­nally taken 200 pho­tos, 100 pho­tos of tanks and 100 pho­tos of trees, and had used only half in the train­ing set. The re­searchers ran the neural net­work on the re­main­ing 100 pho­tos, and with­out fur­ther train­ing the neural net­work clas­si­fied all re­main­ing pho­tos cor­rect­ly. Suc­cess con­firmed! The re­searchers handed the fin­ished work to the Pen­tagon, which soon handed it back, com­plain­ing that in their own tests the neural net­work did no bet­ter than chance at dis­crim­i­nat­ing pho­tos. It turned out that in the re­searchers’ data set, pho­tos of cam­ou­flaged tanks had been taken on cloudy days, while pho­tos of plain for­est had been taken on sunny days. The neural net­work had learned to dis­tin­guish cloudy days from sunny days, in­stead of dis­tin­guish­ing cam­ou­flaged tanks from empty for­est. This para­ble—which might or might not be fac­t—il­lus­trates one of the most fun­da­men­tal prob­lems in the field of su­per­vised learn­ing and in fact the whole field of Ar­ti­fi­cial In­tel­li­gence…

Gor­don Rugg, Us­ing Sta­tis­tics: A Gen­tle In­tro­duc­tion, 2007-10-01 (pg114–115):

Neural nets and ge­netic al­go­rithms (in­clud­ing the story of the Russ­ian tanks): Neural nets (or ar­ti­fi­cial neural net­works, to give them their full name) are pieces of soft­ware in­spired by the way the hu­man brain works. In brief, you can train a neural net to do tasks like clas­si­fy­ing im­ages by giv­ing it lots of ex­am­ples, and telling it which ex­am­ples fit into which cat­e­gories; the neural net works out for it­self what the defin­ing char­ac­ter­is­tics are for each cat­e­go­ry. Al­ter­na­tive­ly, you can give it a large set of data and leave it to work out con­nec­tions by it­self, with­out giv­ing it any feed­back. There’s a sto­ry, which is prob­a­bly an ur­ban leg­end, which il­lus­trates how the ap­proach works and what can go wrong with it. Ac­cord­ing to the sto­ry, some NATO re­searchers trained a neural net to dis­tin­guish be­tween pho­tos of NATO and War­saw Pact tanks. After a while, the neural net could get it right every time, even with pho­tos it had never seen be­fore. The re­searchers had glee­ful vi­sions of in­stalling neural nets with minia­ture cam­eras in mis­siles, which could then be fired at a bat­tle­field and left to choose their own tar­gets. To demon­strate the method, and se­cure fund­ing for the next stage, they or­gan­ised a view­ing by the mil­i­tary. On the day, they set up the sys­tem and fed it a new batch of pho­tos. The neural net re­sponded with ap­par­ently ran­dom de­ci­sions, some­times iden­ti­fy­ing NATO tanks cor­rect­ly, some­times iden­ti­fy­ing them mis­tak­enly as War­saw Pact tanks. This did not in­spire the pow­ers that be, and the whole scheme was aban­doned on the spot. It was only after­wards that the re­searchers re­alised that all their train­ing pho­tos of NATO tanks had been taken on sunny days in Ari­zona, whereas the War­saw Pact tanks had been pho­tographed on grey, mis­er­able win­ter days on the steppes, so the neural net had flaw­lessly learned the un­in­tended les­son that if you saw a tank on a gloomy day, then you made its day even gloomier by mark­ing it for de­struc­tion.

N. Kather­ine Hayles, “Com­put­ing the Hu­man” (In­ven­tive Life: Ap­proaches to the New Vi­tal­ism, Fraser et al 2006; pg424):

While hu­mans have for mil­len­nia used what Car­i­ani calls ‘ac­tive sens­ing’—‘pok­ing, push­ing, bend­ing’—to ex­tend their sen­sory range and for hun­dreds of years have used pros­the­ses to cre­ate new sen­sory ex­pe­ri­ences (for ex­am­ple, mi­cro­scopes and tele­scopes), only re­cently has it been pos­si­ble to con­struct evolv­ing sen­sors and what Car­i­ani (1998: 718) calls ‘in­ter­nal­ized sens­ing’, that is, “bring­ing the world into the de­vice” by cre­at­ing in­ter­nal, ana­log rep­re­sen­ta­tions of the world out of which in­ter­nal sen­sors ex­tract new­ly-rel­e­vant prop­er­ties’.

…An­other con­clu­sion emerges from Car­i­an­i’s call (1998) for re­search in sen­sors that can adapt and evolve in­de­pen­dently of the epis­temic cat­e­gories of the hu­mans who cre­ate them. The well-known and per­haps apoc­ryphal story of the neural net trained to rec­og­nize army tanks will il­lus­trate the point. For ob­vi­ous rea­sons, the army wanted to de­velop an in­tel­li­gent ma­chine that could dis­crim­i­nate be­tween real and pre­tend tanks. A neural net was con­structed and trained us­ing two sets of data, one con­sist­ing of pho­tographs show­ing ply­wood cutouts of tanks and the other ac­tual tanks. After some train­ing, the net was able to dis­crim­i­nate flaw­lessly be­tween the sit­u­a­tions. As is cus­tom­ary, the net was then tested against a third data set show­ing pre­tend and real tanks in the same land­scape; it failed mis­er­ably. Fur­ther in­ves­ti­ga­tion re­vealed that the orig­i­nal two data sets had been filmed on differ­ent days. One of the days was over­cast with lots of clouds, and the other day was clear. The net, it turned out, was dis­crim­i­nat­ing be­tween the pres­ence and ab­sence of clouds. The anec­dote shows the am­bigu­ous po­ten­tial of epis­tem­i­cally au­tonomous de­vices for cat­e­go­riz­ing the world in en­tirely differ­ent ways from the hu­mans with whom they in­ter­act. While this au­ton­omy might be used to en­rich the hu­man per­cep­tion of the world by re­veal­ing novel kinds of con­struc­tions, it also can cre­ate a breed of au­tonomous de­vices that parse the world in rad­i­cally differ­ent ways from their hu­man train­ers.

A coun­ter-nar­ra­tive, also per­haps apoc­ryphal, emerged from the 1991 Gulf War. US sol­diers fir­ing at tanks had been trained on sim­u­la­tors that im­aged flames shoot­ing out from the tank to in­di­cate a kill. When army in­ves­ti­ga­tors ex­am­ined Iraqi tanks that were de­feated in bat­tles, they found that for some tanks the sol­diers had fired four to five times the amount of mu­ni­tions nec­es­sary to dis­able the tanks. They hy­poth­e­sized that the overuse of fire­power hap­pened be­cause no flames shot out, so the sol­diers con­tin­ued fir­ing. If the hy­poth­e­sis is cor­rect, hu­man per­cep­tions were al­tered in ac­cord with the idio­syn­crasies of in­tel­li­gent ma­chi­nes, pro­vid­ing an ex­am­ple of what can hap­pen when hu­man-ma­chine per­cep­tions are caught in a feed­back loop with one an­oth­er.

Linda Null & Julie Lobur, The Es­sen­tials of Com­puter Or­ga­ni­za­tion and Ar­chi­tec­ture (third edi­tion), 2003/2014 (pg439–440 in 1st edi­tion, pg658 in 3rd edi­tion):

Cor­rect train­ing re­quires thou­sands of steps. The train­ing time it­self de­pends on the size of the net­work. As the num­ber of per­cep­trons in­creas­es, the num­ber of pos­si­ble “states” also in­creas­es.

Let’s con­sider a more so­phis­ti­cated ex­am­ple, that of de­ter­min­ing whether a tank is hid­ing in a pho­to­graph. A neural net can be con­fig­ured so that each out­put value cor­re­lates to ex­actly one pix­el. If the pixel is part of the im­age of a tank, the net should out­put a one; oth­er­wise, the net should out­put a ze­ro. The in­put in­for­ma­tion would most likely con­sist of the color of the pix­el. The net­work would be trained by feed­ing it many pic­tures with and with­out tanks. The train­ing would con­tinue un­til the net­work cor­rectly iden­ti­fied whether the pho­tos in­cluded tanks. The U.S. mil­i­tary con­ducted a re­search project ex­actly like the one we just de­scribed. One hun­dred pho­tographs were taken of tanks hid­ing be­hind trees and in bush­es, and an­other 100 pho­tographs were taken of or­di­nary land­scape with no tanks. Fifty pho­tos from each group were kept “se­cret,” and the rest were used to train the neural net­work. The net­work was ini­tial­ized with ran­dom weights be­fore be­ing fed one pic­ture at a time. When the net­work was in­cor­rect, it ad­justed its in­put weights un­til the cor­rect out­put was reached. Fol­low­ing the train­ing pe­ri­od, the 50 “se­cret” pic­tures from each group of pho­tos were fed into the net­work. The neural net­work cor­rectly iden­ti­fied the pres­ence or ab­sence of a tank in each pho­to. The real ques­tion at this point has to do with the train­ing—had the neural net ac­tu­ally learned to rec­og­nize tanks? The Pen­tagon’s nat­ural sus­pi­cion led to more test­ing. Ad­di­tional pho­tos were taken and fed into the net­work, and to the re­searchers’ dis­may, the re­sults were quite ran­dom. The neural net could not cor­rectly iden­tify tanks within pho­tos. After some in­ves­ti­ga­tion, the re­searchers de­ter­mined that in the orig­i­nal set of 200 pho­tos, all pho­tos with tanks had been taken on a cloudy day, whereas the pho­tos with no tanks had been taken on a sunny day. The neural net had prop­erly sep­a­rated the two groups of pic­tures, but had done so us­ing the color of the sky to do this rather than the ex­is­tence of a hid­den tank. The gov­ern­ment was now the proud owner of a very ex­pen­sive neural net that could ac­cu­rately dis­tin­guish be­tween sunny and cloudy days!

This is a great ex­am­ple of what many con­sider the biggest is­sue with neural net­works. If there are more than 10 to 20 neu­rons, it is im­pos­si­ble to un­der­stand how the net­work is ar­riv­ing at its re­sults. One can­not tell if the net is mak­ing de­ci­sions based on cor­rect in­for­ma­tion, or, as in the above ex­am­ple, some­thing to­tally ir­rel­e­vant. Neural net­works have a re­mark­able abil­ity to de­rive mean­ing and ex­tract pat­terns from data that are too com­plex to be an­a­lyzed by hu­man be­ings. How­ev­er, some peo­ple trust neural net­works to be ex­perts in their area of train­ing. Neural nets are used in such ar­eas as sales fore­cast­ing, risk man­age­ment, cus­tomer re­search, un­der­sea mine de­tec­tion, fa­cial recog­ni­tion, and data val­i­da­tion. Al­though neural net­works are promis­ing, and the progress made in the past sev­eral years has led to sig­nifi­cant fund­ing for neural net re­search, many peo­ple are hes­i­tant to put con­fi­dence in some­thing that no hu­man be­ing can com­pletely un­der­stand.

David Ger­hard, “Pitch Ex­trac­tion and Fun­da­men­tal Fre­quen­cy: His­tory and Cur­rent Tech­niques”, Tech­ni­cal Re­port TR-CS 2003–06, No­vem­ber 2003:

The choice of the di­men­sion­al­ity and do­main of the in­put set is cru­cial to the suc­cess of any con­nec­tion­ist mod­el. A com­mon ex­am­ple of a poor choice of in­put set and test data is the Pen­tagon’s foray into the field of ob­ject recog­ni­tion. This story is prob­a­bly apoc­ryphal and many differ­ent ver­sions ex­ist on-line, but the story de­scribes a true diffi­culty with neural nets.

As the story goes, a net­work was set up with the in­put be­ing the pix­els in a pic­ture, and the out­put was a sin­gle bit, yes or no, for the ex­is­tence of an en­emy tank hid­den some­where in the pic­ture. When the train­ing was com­plete, the net­work per­formed beau­ti­ful­ly, but when ap­plied to new data, it failed mis­er­ably. The prob­lem was that in the test data, all of the pic­tures that had tanks in them were taken on cloudy days, and all of the pic­tures with­out tanks were taken on sunny days. The neural net was iden­ti­fy­ing the ex­is­tence or non-ex­is­tence of sun­shine, not tanks.

Rice lec­ture #24, “COMP 200: El­e­ments of Com­puter Sci­ence”, 2002-03-18:

  1. Tanks in Desert Storm

Some­times you have to be care­ful what you train on . . .

The prob­lem with neural nets is that you never know what fea­tures they’re ac­tu­ally train­ing on. For ex­am­ple:

The US mil­i­tary tried to use neural nets in Desert Storm for tank recog­ni­tion, so un­manned tanks could iden­tify en­emy tanks and de­stroy them. They trained the neural net on mul­ti­ple im­ages of “friendly” and en­emy tanks, and even­tu­ally had a de­cent pro­gram that seemed to cor­rectly iden­tify friendly and en­emy tanks.

Then, when they ac­tu­ally used the pro­gram in a re­al-world test phase with ac­tual tanks, they found that the tanks would ei­ther shoot at noth­ing or shoot at every­thing. They cer­tainly seemed to be in­ca­pable of dis­tin­guish­ing friendly or en­emy tanks.

Why was this? It turns out that the im­ages they were train­ing on al­ways had glam­our-shot type pho­tos of friendly tanks, with an im­mac­u­late blue sky, etc. The en­emy tank pho­tos, on the other hand, were all spy pho­tos, not very clear, some­times fuzzy, etc. And it was these char­ac­ter­is­tics that the neural net was train­ing on, not the tanks at all. On a bright sunny day, the tanks would do noth­ing. On an over­cast, hazy day, they’d start fir­ing like crazy . . .

An­drew Ilachin­ski, Cel­lu­lar Au­tomata: A Dis­crete Uni­verse, 2001 (pg547):

There is an telling story about how the Army re­cently went about teach­ing a back­prop­a­gat­ing net to iden­tify tanks set against a va­ri­ety of en­vi­ron­men­tal back­drops. The pro­gram­mers cor­rectly fed their mul­ti­-layer net pho­to­graph after pho­to­graph of tanks in grass­lands, tanks in swamps, no tanks on con­crete, and so on. After many tri­als and many thou­sands of it­er­a­tions, their net fi­nally learned all of the im­ages in their data­base. The prob­lem was that when the pre­sum­ably “trained” net was tested with other im­ages that were not part of the orig­i­nal train­ing set, it failed to do any bet­ter than what would be ex­pected by chance. What had hap­pened was that the input/training fact set was sta­tis­ti­cally cor­rupt. The data­base con­sisted mostly of im­ages that showed a tank only if there were heavy clouds, the tank it­self was im­mersed in shadow or there was no sun at all. The Army’s neural net had in­deed iden­ti­fied a la­tent pat­tern, but it un­for­tu­nately had noth­ing to do with tanks: it had effec­tively learned to iden­tify the time of day! The ob­vi­ous les­son to be taken away from this amus­ing ex­am­ple is that how well a net “learns” the de­sired as­so­ci­a­tions de­pends al­most en­tirely on how well the data­base of facts is de­fined. Just as Monte Carlo sim­u­la­tions in sta­tis­ti­cal me­chan­ics may fall short of in­tended re­sults if they are forced to rely upon poorly coded ran­dom num­ber gen­er­a­tors, so do back­prop­a­gat­ing nets typ­i­cally fail to achieve ex­pected re­sults if the facts they are trained on are sta­tis­ti­cally cor­rupt.

In­tel­li­gent Data Analy­sis In Sci­ence, Hugh M. Cartwright 2000, pg126, writes (ac­cord­ing to Google Book­s’s snip­pet view; Cartwright’s ver­sion ap­pears to be a di­rect quote or close para­phrase of an ear­lier 1994 chem­istry pa­per, Goodacre et al 1994):

…tele­vi­sion pro­gramme ; a neural net­work was trained to at­tempt to dis­tin­guish tanks from trees. Pic­tures were taken of for­est scenes lack­ing mil­i­tary hard­ware and of sim­i­lar but per­haps less bu­colic land­scapes which also con­tained more-or-less cam­ou­flaged bat­tle tanks. A neural net­work was trained with these in­put data and found to differ­en­ti­ate suc­cess­fully be­tween tanks and trees. How­ev­er, when a new set of pic­tures was analysed by the net­work, it failed to de­tect the tanks. After fur­ther in­ves­ti­ga­tion, it was found…

Daniel Robert Franklin & Philippe Crochat, libneural tu­to­r­ial, 2000-03-23:

A neural net­work is use­less if it only sees one ex­am­ple of a match­ing input/output pair. It can­not in­fer the char­ac­ter­is­tics of the in­put data for which you are look­ing for from only one ex­am­ple; rather, many ex­am­ples are re­quired. This is anal­o­gous to a child learn­ing the differ­ence be­tween (say) differ­ent types of an­i­mal­s—the child will need to see sev­eral ex­am­ples of each to be able to clas­sify an ar­bi­trary an­i­mal… It is the same with neural net­works. The best train­ing pro­ce­dure is to com­pile a wide range of ex­am­ples (for more com­plex prob­lems, more ex­am­ples are re­quired) which ex­hibit all the differ­ent char­ac­ter­is­tics you are in­ter­ested in. It is im­por­tant to se­lect ex­am­ples which do not have ma­jor dom­i­nant fea­tures which are of no in­ter­est to you, but are com­mon to your in­put data any­way. One fa­mous ex­am­ple is of the US Army “Ar­ti­fi­cial In­tel­li­gence” tank clas­si­fi­er. It was shown ex­am­ples of So­viet tanks from many differ­ent dis­tances and an­gles on a bright sunny day, and ex­am­ples of US tanks on a cloudy day. Need­less to say it was great at clas­si­fy­ing weath­er, but not so good at pick­ing out en­emy tanks.

1990s

“Neural Net­work Fol­lies”, Neil Fraser, Sep­tem­ber 1998:

In the 1980s, the Pen­ta­gon wanted to har­ness com­puter tech­nol­ogy to make their tanks harder to at­tack­…The re­search team went out and took 100 pho­tographs of tanks hid­ing be­hind trees, and then took 100 pho­tographs of trees—with no tanks. They took half the pho­tos from each group and put them in a vault for safe-keep­ing, then scanned the other half into their main­frame com­put­er. The huge neural net­work was fed each photo one at a time and asked if there was a tank hid­ing be­hind the trees. Of course at the be­gin­ning its an­swers were com­pletely ran­dom since the net­work did­n’t know what was go­ing on or what it was sup­posed to do. But each time it was fed a photo and it gen­er­ated an an­swer, the sci­en­tists told it if it was right or wrong. If it was wrong it would ran­domly change the weight­ings in its net­work un­til it gave the cor­rect an­swer. Over time it got bet­ter and bet­ter un­til even­tu­ally it was get­ting each photo cor­rect. It could cor­rectly de­ter­mine if there was a tank hid­ing be­hind the trees in any one of the pho­to­s…So the sci­en­tists took out the pho­tos they had been keep­ing in the vault and fed them through the com­put­er. The com­puter had never seen these pho­tos be­fore—this would be the big test. To their im­mense re­lief the neural net cor­rectly iden­ti­fied each photo as ei­ther hav­ing a tank or not hav­ing one. In­de­pen­dent test­ing: The Pen­ta­gon was very pleased with this, but a lit­tle bit sus­pi­cious. They com­mis­sioned an­other set of pho­tos (half with tanks and half with­out) and scanned them into the com­puter and through the neural net­work. The re­sults were com­pletely ran­dom. For a long time no­body could fig­ure out why. After all no­body un­der­stood how the neural had trained it­self. Even­tu­ally some­one no­ticed that in the orig­i­nal set of 200 pho­tos, all the im­ages with tanks had been taken on a cloudy day while all the im­ages with­out tanks had been taken on a sunny day. The neural net­work had been asked to sep­a­rate the two groups of pho­tos and it had cho­sen the most ob­vi­ous way to do it—not by look­ing for a cam­ou­flaged tank hid­ing be­hind a tree, but merely by look­ing at the color of the sky…This story might be apoc­ryphal, but it does­n’t re­ally mat­ter. It is a per­fect il­lus­tra­tion of the biggest prob­lem be­hind neural net­works. Any au­to­mat­i­cally trained net with more than a few dozen neu­rons is vir­tu­ally im­pos­si­ble to an­a­lyze and un­der­stand.

Tom White at­trib­utes (in Oc­to­ber 2017) to Mar­vin Min­sky some ver­sion of the tank story be­ing told in MIT classes 20 years be­fore, ~1997 (but does­n’t spec­ify the de­tailed story or ver­sion other than ap­par­ently the re­sults were “clas­si­fied”).

Vas­ant Dhar & Roger Stein, In­tel­li­gent De­ci­sion Sup­port Meth­ods, 1997 (pg98, lim­ited Google Books snip­pet):

…How­ev­er, when a new set of pho­tographs were used, the re­sults were hor­ri­ble. At first the team was puz­zled. But after care­ful in­spec­tion of the first two sets of pho­tographs, they dis­cov­ered a very sim­ple ex­pla­na­tion. The pho­tos with tanks in them were all taken on sunny days, and those with­out the tanks were taken on over­cast days. The net­work had not learned to iden­tify tank like im­ages; in­stead, it had learned to iden­tify pho­tographs of sunny days and over­cast days.

Roys­ton Goodacre, Mark J. Neal, & Dou­glas B. Kell, “Quan­ti­ta­tive Analy­sis of Mul­ti­vari­ate Data Us­ing Ar­ti­fi­cial Neural Net­works: A Tu­to­r­ial Re­view and Ap­pli­ca­tions to the De­con­vo­lu­tion of Py­rol­y­sis Mass Spec­tra”, 1994-04-29:

…As in all other data analy­sis tech­niques, these su­per­vised learn­ing meth­ods are not im­mune from sen­si­tiv­ity to badly cho­sen ini­tial data (113). [113: Zu­pan, J. and J. Gasteiger: Neural Net­works for Chemists: An In­tro­duc­tion. VCH Ver­lags­ge­sellschaft, Wein­heim (1993)] There­fore the ex­em­plars for the train­ing set must be care­fully cho­sen; the golden rule is “garbage in­—­garbage out”. An ex­cel­lent ex­am­ple of an un­rep­re­sen­ta­tive train­ing set was dis­cussed some time ago on the BBC tele­vi­sion pro­gramme Hori­zon; a neural net­work was trained to at­tempt to dis­tin­guish tanks from trees. Pic­tures were taken of for­est scenes lack­ing mil­i­tary hard­ware and of sim­i­lar but per­haps less bu­colic land­scapes which also con­tained more-or-less cam­ou­flaged bat­tle tanks. A neural net­work was trained with these in­put data and found to differ­en­ti­ate most suc­cess­fully be­tween tanks and trees. How­ev­er, when a new set of pic­tures was analysed by the net­work, it failed to dis­tin­guish the tanks from the trees. After fur­ther in­ves­ti­ga­tion, it was found that the first set of pic­tures con­tain­ing tanks had been taken on a sunny day whilst those con­tain­ing no tanks were ob­tained when it was over­cast. The neural net­work had there­fore thus learned sim­ply to recog­nise the weath­er! We can con­clude from this that the train­ing and tests sets should be care­fully se­lected to con­tain rep­re­sen­ta­tive ex­em­plars en­com­pass­ing the ap­pro­pri­ate vari­ance over all rel­e­vant prop­er­ties for the prob­lem at hand.

Fer­nando Pereira, “neural redlin­ing”, RISKS 16(41), 1994-09-12:

Fred’s com­ments will hold not only of neural nets but of any de­ci­sion model trained from data (eg. Bayesian mod­els, de­ci­sion trees). It’s just an in­stance of the old “GIGO” phe­nom­e­non in sta­tis­ti­cal mod­el­ing…Over­all, the whole is­sue of eval­u­a­tion, let alone cer­ti­fi­ca­tion and le­gal stand­ing, of com­plex sta­tis­ti­cal mod­els is still very much open. (This re­minds me of a pos­si­bly apoc­ryphal story of prob­lems with bi­ased data in neural net train­ing. Some US de­fense con­trac­tor had sup­pos­edly trained a neural net to find tanks in scenes. The re­ported per­for­mance was ex­cel­lent, with even cam­ou­flaged tanks mostly hid­den in veg­e­ta­tion be­ing spot­ted. How­ev­er, when the net was tested on yet a new set of im­ages sup­plied by the client, the net did not do bet­ter than chance. After an em­bar­rass­ing in­ves­ti­ga­tion, it turned out that all the tank im­ages in the orig­i­nal train­ing and test sets had very differ­ent av­er­age in­ten­sity than the non-tank im­ages, and thus the net had just learned to dis­crim­i­nate be­tween two im­age in­ten­sity lev­els. Does any­one know if this ac­tu­ally hap­pened, or is it just in the neural net “ur­ban folk­lore”?)

Erich Harth, The Cre­ative Loop: How the Brain Makes a Mind, 1993/1995 (pg158, lim­ited Google Books snip­pet):

…55. The net was trained to de­tect the pres­ence of tanks in a land­scape. The train­ing con­sisted in show­ing the de­vice many pho­tographs of scene, some with tanks, some with­out. In some cas­es—as in the pic­ture on page 143—the tank’s pres­ence was not very ob­vi­ous. The in­puts to the neural net were dig­i­tized pho­tographs;

Hubert Dreyfus & Stuart Dreyfus, “What Artificial Experts Can and Cannot Do”, 1992:

All the “con­tinue this se­quence” ques­tions found on in­tel­li­gence tests, for ex­am­ple, re­ally have more than one pos­si­ble an­swer but most hu­man be­ings share a sense of what is sim­ple and rea­son­able and there­fore ac­cept­able. But when the net pro­duces an un­ex­pected as­so­ci­a­tion can one say it has failed to gen­er­al­ize? One could equally well say that the net has all along been act­ing on a differ­ent de­fi­n­i­tion of “type” and that that differ­ence has just been re­vealed. For an amus­ing and dra­matic case of cre­ative but un­in­tel­li­gent gen­er­al­iza­tion, con­sider the leg­end of one of con­nec­tion­is­m’s first ap­pli­ca­tions. In the early days of the per­cep­tron the army de­cided to train an ar­ti­fi­cial neural net­work to rec­og­nize tanks partly hid­den be­hind trees in the woods. They took a num­ber of pic­tures of a woods with­out tanks, and then pic­tures of the same woods with tanks clearly stick­ing out from be­hind trees. They then trained a net to dis­crim­i­nate the two classes of pic­tures. The re­sults were im­pres­sive, and the army was even more im­pressed when it turned out that the net could gen­er­al­ize its knowl­edge to pic­tures from each set that had not been used in train­ing the net. Just to make sure that the net had in­deed learned to rec­og­nize par­tially hid­den tanks, how­ev­er, the re­searchers took some more pic­tures in the same woods and showed them to the trained net. They were shocked and de­pressed to find that with the new pic­tures the net to­tally failed to dis­crim­i­nate be­tween pic­tures of trees with par­tially con­cealed tanks be­hind them and just plain trees. The mys­tery was fi­nally solved when some­one no­ticed that the train­ing pic­tures of the woods with­out tanks were taken on a cloudy day, whereas those with tanks were taken on a sunny day. The net had learned to rec­og­nize and gen­er­al­ize the differ­ence be­tween a woods with and with­out shad­ows! Ob­vi­ous­ly, not what stood out for the re­searchers as the im­por­tant differ­ence. This ex­am­ple il­lus­trates the gen­eral point that a net must share size, ar­chi­tec­ture, ini­tial con­nec­tions, con­fig­u­ra­tion and so­cial­iza­tion with the hu­man brain if it is to share our sense of ap­pro­pri­ate gen­er­al­iza­tion

Hubert Dreyfus appears to have told this story earlier, in 1990 or 1991, as a similar story appears in episode 4 (German; starting 33m49s) of a BBC documentary series broadcast 1991-11-08. Hubert L. Dreyfus, What Computers Still Can’t Do: A Critique of Artificial Reason, 1992, repeats the story in very similar but not quite identical wording (Jeff Kaufman notes that Dreyfus drops the qualifying “legend of” description):

…But when the net pro­duces an un­ex­pected as­so­ci­a­tion, can one say that it has failed to gen­er­al­ize? One could equally well say that the net has all along been act­ing on a differ­ent de­fi­n­i­tion of “type” and that that differ­ence has just been re­vealed. For an amus­ing and dra­matic case of cre­ative but un­in­tel­li­gent gen­er­al­iza­tion, con­sider one of con­nec­tion­is­m’s first ap­pli­ca­tions. In the early days of this work the army tried to train an ar­ti­fi­cial neural net­work to rec­og­nize tanks in a for­est. They took a num­ber of pic­tures of a for­est with­out tanks and then, on a later day, with tanks clearly stick­ing out from be­hind trees, and they trained a net to dis­crim­i­nate the two classes of pic­tures. The re­sults were im­pres­sive, and the army was even more im­pressed when it turned out that the net could gen­er­al­ize its knowl­edge to pic­tures that had not been part of the train­ing set. Just to make sure that the net was in­deed rec­og­niz­ing par­tially hid­den tanks, how­ev­er, the re­searchers took more pic­tures in the same for­est and showed them to the trained net. They were de­pressed to find that the net failed to dis­crim­i­nate be­tween the new pic­tures of trees with tanks be­hind them and the new pic­tures of just plain trees. After some ag­o­niz­ing, the mys­tery was fi­nally solved when some­one no­ticed that the orig­i­nal pic­tures of the for­est with­out tanks were taken on a cloudy day and those with tanks were taken on a sunny day. The net had ap­par­ently learned to rec­og­nize and gen­er­al­ize the differ­ence be­tween a for­est with and with­out shad­ows! This ex­am­ple il­lus­trates the gen­eral point that a net­work must share our com­mon­sense un­der­stand­ing of the world if it is to share our sense of ap­pro­pri­ate gen­er­al­iza­tion.

Drey­fus’s What Com­put­ers Still Can’t Do is listed as a re­vi­sion of his 1972 book, What Com­put­ers Can’t Do: A Cri­tique of Ar­ti­fi­cial Rea­son, but the tank story is not in the 1972 book, only the 1992 one. (Drey­fus’s ver­sion is also quoted in the 2017 NYT ar­ti­cle and Hillis 1996’s Ge­og­ra­phy, Iden­ti­ty, and Em­bod­i­ment in Vir­tual Re­al­ity, pg346.)

Laveen N. Kanal, in the Foreword to Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections (1991), discusses some early NN/tank research (predating not just LeCun’s convolutions but backpropagation itself):

…[Frank] Rosen­blatt had not lim­ited him­self to us­ing just a sin­gle Thresh­old Logic Unit but used net­works of such units. The prob­lem was how to train mul­ti­layer per­cep­tron net­works. A pa­per on the topic writ­ten by Block, Knight and Rosen­blatt was murky in­deed, and did not demon­strate a con­ver­gent pro­ce­dure to train such net­works. In 1962–63 at Philco-Ford, seek­ing a sys­tem­atic ap­proach to de­sign­ing lay­ered clas­si­fi­ca­tion nets, we de­cided to use a hi­er­ar­chy of thresh­old logic units with a first layer of “fea­ture log­ics” which were thresh­old logic units on over­lap­ping re­cep­tive fields of the im­age, feed­ing two ad­di­tional lev­els of weighted thresh­old logic de­ci­sion units. The weights in each level of the hi­er­ar­chy were es­ti­mated us­ing sta­tis­ti­cal meth­ods rather than it­er­a­tive train­ing pro­ce­dures [L.N. Kanal & N.C. Ran­dall, “Recog­ni­tion Sys­tem De­sign by Sta­tis­ti­cal Analy­sis”, Proc. 19th Conf. ACM, 1964]. We re­ferred to the net­works as two layer net­works since we did not count the in­put as a lay­er. On a project to rec­og­nize tanks in aer­ial pho­tog­ra­phy, the method worked well enough in prac­tice that the U.S. Army agency spon­sor­ing the project de­cided to clas­sify the fi­nal re­ports, al­though pre­vi­ously the project had been un­clas­si­fied. We were un­able to pub­lish the clas­si­fied re­sults! Then, en­am­ored by the claimed promise of co­her­ent op­ti­cal fil­ter­ing as a par­al­lel im­ple­men­ta­tion for au­to­matic tar­get recog­ni­tion, the fund­ing we had been promised was di­verted away from our elec­tro-op­ti­cal im­ple­men­ta­tion to a co­her­ent op­ti­cal fil­ter­ing group. Some years later we pre­sented the ar­gu­ments fa­vor­ing our ap­proach, com­pared to op­ti­cal im­ple­men­ta­tions and train­able sys­tems, in an ar­ti­cle ti­tled “Sys­tems Con­sid­er­a­tions for Au­to­matic Im­agery Screen­ing” by T.J. Harley, L.N. Kanal and N.C. Ran­dall, which is in­cluded in the IEEE Press reprint vol­ume ti­tled Ma­chine Recog­ni­tion of Pat­terns edited by A. Agrawala 19771. In the years which fol­lowed mul­ti­level sta­tis­ti­cally de­signed clas­si­fiers and AI search pro­ce­dures ap­plied to pat­tern recog­ni­tion held my in­ter­est, al­though com­ments in my 1974 sur­vey, “Pat­terns In Pat­tern Recog­ni­tion: 1968–1974” [IEEE Trans. on IT, 1974], men­tion pa­pers by Amari and oth­ers and show an aware­ness that neural net­works and bi­o­log­i­cally mo­ti­vated au­tomata were mak­ing a come­back. In the last few years train­able mul­ti­layer neural net­works have re­turned to dom­i­nate re­search in pat­tern recog­ni­tion and this time there is po­ten­tial for gain­ing much greater in­sight into their sys­tem­atic de­sign and per­for­mance analy­sis…

While Kanal & Ran­dall 1964 matches in some ways, in­clud­ing the im­age counts, there is no men­tion of fail­ure ei­ther in the pa­per or Kanal’s 1991 rem­i­nis­cences (rather, Kanal im­plies it was highly promis­ing), there is no men­tion of a field de­ploy­ment or ad­di­tional test­ing which could have re­vealed over­fit­ting, and given their use of bi­na­riz­ing, it’s not clear to me that their 2-layer al­go­rithm even could over­fit to global bright­ness; the pho­tos also ap­pear to have been taken at low enough al­ti­tude for there to be no clouds, and to be taken un­der sim­i­lar (pos­si­bly con­trolled) light­ing con­di­tions. The de­scrip­tion in Kanal & Ran­dall 1964 is some­what opaque to me, par­tic­u­larly of the ‘Lapla­cian’ they use to bi­na­rize or con­vert to edges, but there’s more back­ground in their “Semi­-Au­to­matic Im­agery Screen­ing Re­search Study and Ex­per­i­men­tal In­ves­ti­ga­tion, Vol­ume 1”, Harley, Bryan, Kanal, Tay­lor & Grayum 1962 (mir­ror), which in­di­cates that in their pre­lim­i­nary stud­ies they were al­ready in­ter­ested in prenormalization/preprocessing im­ages to cor­rect for al­ti­tude and bright­ness, and the Lapla­cian, along with sil­hou­et­ting and “line­ness edit­ing”, not­ing that “The Lapla­cian op­er­a­tion elim­i­nates ab­solute bright­ness scale as well as low-s­pa­tial fre­quen­cies which are of lit­tle con­se­quence in screen­ing op­er­a­tions.”2
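
To make the brightness-invariance point concrete, here is a minimal sketch (assuming NumPy only; an illustration of the general property, not Kanal & Randall’s actual pipeline) of why a Laplacian preprocessing stage discards a global brightness cue: adding a constant brightness offset to an image leaves its Laplacian unchanged, so a classifier fed Laplacian features cannot key on overall scene brightness.

```python
# Toy demonstration: a discrete Laplacian responds to local structure (edges),
# not to uniform brightness shifts such as "sunny vs. cloudy" illumination.
import numpy as np

def laplacian(img):
    """2D discrete Laplacian using the standard 4-neighbor kernel."""
    out = np.zeros_like(img, dtype=float)
    out[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1] +
                       img[1:-1, :-2] + img[1:-1, 2:] -
                       4 * img[1:-1, 1:-1])
    return out

rng = np.random.default_rng(0)
scene = rng.random((64, 64))     # stand-in for an aerial photograph
brighter = scene + 0.3           # the same scene under brighter illumination

print(np.abs(brighter - scene).mean())                       # ~0.3: raw pixels differ everywhere
print(np.abs(laplacian(brighter) - laplacian(scene)).max())  # ~0.0: Laplacians are identical
```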

An anony­mous reader says he heard the story in 1990:

I was told about the tank recog­ni­tion fail­ure by a lec­turer on my 1990 In­tel­li­gent Knowl­edge Based Sys­tems MSc, al­most cer­tainly Li­bor Spacek, in terms of be­ing aware of con­text in data sets; that be­ing from (the for­mer) Czecho­slo­va­kia he ex­pected to see tanks on a mo­tor­way whereas most British peo­ple did­n’t. I also re­mem­ber read­ing about a project with DARPA fund­ing aimed at differ­en­ti­at­ing Rus­sian, Eu­ro­pean and US tanks where what the im­age recog­ni­tion learned was not to spot the differ­ences be­tween tanks but to find trees, be­cause of the US tank pho­tos be­ing on open ground and the Russ­ian ones be­ing in forests; that was dur­ing the same MSc course—so very sim­i­lar to pre­dict­ing tu­mours by look­ing for the ruler used to mea­sure them in the pho­to—but I don’t re­call the source (it was­n’t one of the books you cite though, it was ei­ther a jour­nal ar­ti­cle or an­other text book).

1980s

Chris Brew states (2017-10-16) that he “Heard the story in 1984 with pi­geons in­stead of neural nets”.

1960s

Edward Fredkin, in an email to Eliezer Yudkowsky on 2013-02-26, recounts an interesting anecdote about the 1960s, claimed to be the grain of truth:

By the way, the story about the two pic­tures of a field, with and with­out army tanks in the pic­ture, comes from me. I at­tended a meet­ing in Los An­ge­les [at RAND?], about half a cen­tury ago [~1963?] where some­one gave a pa­per show­ing how a ran­dom net could be trained to de­tect the tanks in the pic­ture. I was in the au­di­ence. At the end of the talk I stood up and made the com­ment that it was ob­vi­ous that the pic­ture with the tanks was made on a sunny day while the other pic­ture (of the same field with­out the tanks) was made on a cloudy day. I sug­gested that the “neural net” had merely trained it­self to rec­og­nize the differ­ence be­tween a bright pic­ture and a dim pic­ture.

Evaluation

Sourcing

The absence of any hard citations is striking: even when a citation is supplied, it is invariably to a relatively recent source like Dreyfus, and then the chain ends. Typically for a real story, one will find at least one or two hints of a penultimate citation and then a final definitive citation to some very difficult-to-obtain or obscure work (which then is often quite different from the popularized version but still recognizable as the original); for example, another popular cautionary AI urban legend is that the 1956 Dartmouth workshop claimed that a single graduate student working for a summer could solve computer vision (or perhaps AI in general), which is a highly distorted & misleading description of the original 1955 proposal’s realistic suggestion that with “a 2 month, 10 man study of artificial intelligence”, “a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”3 Instead, everyone either disavows it as an urban legend or possibly apocryphal, or punts to someone else. (Minsky’s 2011 version initially seems concrete, but while he specifically attributes the musical score story to a friend & claims to have found the trick personally, he is then as vague as anyone else about the tank story, saying it just “happened” somewhere “in the United States at one of our research institutions”, at an unmentioned institute by unmentioned people at an unmentioned point in time for an unmentioned branch of the military.)

Variations

Ques­tion to Ra­dio Yere­van: “Is it cor­rect that Grig­ori Grig­orievich Grig­oriev won a lux­ury car at the Al­l-U­nion Cham­pi­onship in Moscow?”

Ra­dio Yere­van an­swered: “In prin­ci­ple, yes. But first of all it was not Grig­ori Grig­orievich Grig­oriev, but Vas­sili Vas­silievich Vas­siliev; sec­ond, it was not at the Al­l-U­nion Cham­pi­onship in Moscow, but at a Col­lec­tive Farm Sports Fes­ti­val in Smolen­sk; third, it was not a car, but a bi­cy­cle; and fourth he did­n’t win it, but rather it was stolen from him.”

“Ra­dio Yere­van Jokes” (col­lected by Al­lan Stevo)

It is also interesting that not all the stories imply quite the same problem with the hypothetical NN. Dataset bias/selection effects is not the same thing as overfitting or disparate impact, but some of the storytellers don’t realize that. For example, in some stories, the NN fails when it’s tested on additional heldout data (overfitting), not when it’s tested on data from an entirely different photographer or field exercise or data source (dataset bias/distributional shift). Or, Alexander Harrowell cites disparate impact in a medical school as if it were an example of the same problem, but it’s not: at least in the USA, a NN would be correct in inferring that white students are more likely to succeed, as that is a real predictor (this would be an example of how people play rather fast and loose with claims of “algorithmic bias”), and it would not necessarily be the case that, say, randomized admission of more non-white students would be certain to increase the number of successful graduates; such a scenario is, however, possible, and it illustrates the difference between predictive models & causal models for control & optimization, and the need for experiments/reinforcement learning.
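
To make the distinction concrete, here is a toy simulation (assuming NumPy & scikit-learn; the “brightness” and “tank” features are invented stand-ins, not data from any of the stories above). A classifier trained on a collection where the label is confounded with brightness passes a held-out split from that same collection, so no overfitting is apparent, yet it collapses to roughly chance on a fresh collection where the confound is broken:

```python
# Toy simulation of dataset bias vs. overfitting: a held-out split from the
# biased collection looks fine; a new, unconfounded collection does not.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_photos(n, confounded):
    tank = rng.integers(0, 2, n)                       # ground truth: tank present?
    if confounded:                                     # tanks photographed only on dark/cloudy days
        brightness = np.where(tank == 1, 0.3, 0.7) + rng.normal(0, 0.05, n)
    else:                                              # new shoot: brightness unrelated to tanks
        brightness = rng.uniform(0.2, 0.8, n)
    tank_cue = tank * 0.1 + rng.normal(0, 0.5, n)      # weak, noisy "real" tank feature
    return np.column_stack([brightness, tank_cue]), tank

X_biased, y_biased = make_photos(200, confounded=True)
X_train, y_train = X_biased[:100], y_biased[:100]       # training set
X_heldout, y_heldout = X_biased[100:], y_biased[100:]   # held out from the same biased source
X_new, y_new = make_photos(200, confounded=False)       # photographs from a different source

clf = LogisticRegression().fit(X_train, y_train)
print("held-out, same biased source:", clf.score(X_heldout, y_heldout))  # ~1.0
print("new unconfounded source:     ", clf.score(X_new, y_new))          # ~0.5, near chance
```

Only splitting the evaluation by data source (here, by the `confounded` flag) rather than by a random train/test partition exposes the problem, which is exactly the distinction that several of the story variants blur.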

A read of all the vari­ants to­gether raises more ques­tions than it an­swers:

  • Did this story hap­pen in the 1960s, 1980s, 1990s, or dur­ing Desert Storm in the 1990s?
  • Was the re­search con­ducted by the US mil­i­tary, or re­searchers for an­other NATO coun­try?
  • Were the pho­tographs taken by satel­lite, from the air, on the ground, or by spy cam­eras?
  • Were the pho­tographs of Amer­i­can tanks, ply­wood cutouts, So­viet tanks, or War­saw Pact tanks?
  • Were the tanks out in the open, un­der cov­er, or fully cam­ou­flaged?
  • Were these pho­tographs taken in forests, fields, deserts, swamps, or all of them?
  • Were the photographs taken in the same place but at different times of day, in the same place but on different days, or in different places entirely?
  • Were there 100, 200, or thou­sands of pho­tographs; and how many were in the train­ing vs val­i­da­tion set?
  • Was the in­put in black­-and-white bi­na­ry, grayscale, or col­or?
  • Was the tel­l-tale fea­ture ei­ther field vs forest, bright vs dark, the pres­ence vs ab­sence of clouds, the pres­ence vs ab­sence of shad­ows, the length of shad­ows, or an ac­ci­dent in film de­vel­op­ment un­re­lated to weather en­tire­ly?
  • Was the NN to be used for im­age pro­cess­ing or in au­tonomous ro­botic tanks?
  • Was it even a NN?
  • Was the dataset bias caught quickly within “a few hours”, later by a sus­pi­cious team mem­ber, later still when ap­plied to an ad­di­tional set of tank pho­tographs, dur­ing fur­ther test­ing pro­duc­ing a new dataset, much later dur­ing a live demo for mil­i­tary offi­cers, or only after live de­ploy­ment in the field?

Al­most every as­pect of the tank story which could vary does vary.

Urban Legends

We could also compare the tank story with many of the characteristics of urban legends (of the sort so familiar from Snopes): they typically have a clear dramatic arc; involve horror or humor while playing on common concerns (distrust of NNs has been a theme from the start of NN research4); make an important didactic or moral point; claim to be true while sourcing remains limited to social proof such as the usual "friend of a friend" attributions; often try to associate themselves with a respected institution (such as the US military); are transmitted primarily orally through social mechanisms & appear spontaneously & independently in many sources without apparent origin (most people seem to hear the tank story in unspecified classes, conferences, or personal discussions rather than in a book or paper); exist in many mutually-contradictory variants, often with overly-specific details5 spontaneously arising in the retelling; have been around for a long time (it appears almost fully formed in Dreyfus 1992, suggesting incubation before then); sometimes have a grain of truth (dataset bias certainly is real); and are "too good not to pass along" (even authors who are sure the tank story is an urban legend can't resist retelling it yet again for didactic effect or entertainment). The tank story matches almost all the usual criteria for an urban legend.

Origin

So where does this urban legend come from? The key anecdote appears to be Edward Fredkin's, as it precedes all other excerpts except perhaps the research Kanal describes; and Fredkin's story does not confirm the tank story—he merely speculates that brightness was driving the results—much less all the extraneous details about photographic film being accidentally overdeveloped or robot tanks going berserk or a demo failing in front of Army brass.

But it’s easy to see how Fred­kin’s rea­son­able ques­tion could have memet­i­cally evolved into the tank story as fi­nally fixed into pub­lished form by Drey­fus’s ar­ti­cle:

  1. Set­ting: Kanal & Ran­dall set up their very small sim­ple early per­cep­trons on some tiny bi­nary aer­ial pho­tos of tanks, in in­ter­est­ing early work, and Fred­kin at­tends the talk some­time around 1960–1963

  2. The Ques­tion: Fred­kin then asks in the Q&A whether the per­cep­tron is not learn­ing square-shapes but bright­ness

  3. Punt­ing: of course nei­ther Fred­kin nor Kanal & Ran­dall can know on the spot whether this cri­tique is right or wrong (per­haps that ques­tion mo­ti­vated the bi­na­rized re­sults re­ported in Kanal & Ran­dall 1964?), and the ques­tion re­mains unan­swered

  4. Anecdotizing: but someone in the audience considers that an excellent observation about methodological flaws in NN research, and perhaps they (or Fredkin) repeat the story to others, who find it useful too; and along the way, Fredkin's question mark gets dropped and the possible flaw becomes an actual flaw, with the punchline: "…and it turned out their NN was just detecting average brightness!"

    One might expect Kanal & Randall to rebut these rumors, if only by publishing additional papers on their functioning system, but by a quirk of fate, as Kanal explains in his preface, after their 1964 paper the Army liked the work enough to classify it, and then they were reassigned to an entirely different task, killing progress entirely. (Something similar happened to .)

  5. Pro­lif­er­a­tion: In the ab­sence of any coun­ternar­ra­tive (si­lence is con­sid­ered con­sen­t), the tank story con­tin­ues spread­ing.

  6. Mutation: but now the story is incomplete, a joke missing most of the setup to its punchline—how did these Army researchers discover the NN had tricked them, and what was the brightness difference from? The various versions propose different resolutions, and likewise, appropriate details about the tank data must be invented.

  7. Fixation: Eventually, after enough mutations, a version reaches Dreyfus, already a well-known critic of the AI establishment, who then uses it in his article/book, virally spreading it globally to pop up in random places thenceforth, and fixating it as a universally-known ur-text. (Further memetic mutations can and often will occur, but diligent writers & researchers will 'correct' variants by returning to the Dreyfus version.)

One might try to write Dreyfus off as a coincidence and argue that the US Army must have had so many neural net research programs going that one of the others is the real origin, but one would expect those programs to result in spinoffs, more reports, reports since declassified, etc. It's been half a century, after all. And despite the close association of the US military with MIT and early AI work, tanks do not seem to have been a major focus of early NN research—for example, one standard overview of the era does not mention tanks at all, and most of my paper searches kept pulling up NN papers about 'tanks' as in vats, such as controlling stirring/mixing tanks for chemistry. Nor is it a safe assumption that the military always has much more advanced technology than the public or private sectors; often, it is behind or merely at the status quo.6

Could it Happen?

Could some­thing like the tank story (a NN learn­ing to dis­tin­guish solely on av­er­age bright­ness lev­els) hap­pen in 2017 with state-of-the-art tech­niques like con­vo­lu­tional neural net­works (CNNs)? (After all, pre­sum­ably no­body re­ally cares about what mis­takes a crude per­cep­tron may or may not have once made back in the 1960s; most/all of the sto­ry-tellers are us­ing it for di­dac­tic effect in warn­ing against care­less­ness in con­tem­po­rary & fu­ture AI research/applications.) I would guess that while it could hap­pen, it would be con­sid­er­ably less likely now than then for sev­eral rea­sons:

  1. a common preprocessing step in computer vision (and NNs in general) is to "whiten" the image by standardizing or transforming pixel values toward a normal distribution; this would tend to wipe out global brightness levels, promoting invariance to illumination

  2. in ad­di­tion to or in­stead of whiten­ing, it is also com­mon to use ag­gres­sive “data aug­men­ta­tion”: shift­ing the im­age by a few pix­els in each di­rec­tion, crop­ping it ran­dom­ly, ad­just­ing col­ors to be slightly more red/green/blue, flip­ping hor­i­zon­tal­ly, bar­rel-warp­ing it, adding JPEG com­pres­sion noise/artifacts, bright­en­ing or dark­en­ing, etc.

    None of these transformations should affect whether an image is classifiable as "dog" or "cat"7, the reasoning goes, so the NN should learn to see past them, and generating variants during training provides additional data for free. Aggressive data augmentation would make it harder to pick up global brightness as a cheap trick. (A minimal sketch of such a normalization & augmentation pipeline appears after this list.)

  3. CNNs have built-in bi­ases (com­pared to ful­ly-con­nected neural net­works) to­wards edges and other struc­tures, rather than global av­er­ages; con­vo­lu­tions want to find edges and geo­met­ric pat­terns like lit­tle squares for tanks. (This point is par­tic­u­larly ger­mane in light of the brain in­spi­ra­tion for con­vo­lu­tions & Drey­fus & Drey­fus 1992’s in­ter­pre­ta­tion of the tank sto­ry.)

  4. im­age clas­si­fi­ca­tion CNNs, due to their large sizes, are often trained on large datasets with many classes to cat­e­go­rize im­ages into (canon­i­cal­ly, Im­a­geNet with 1000 classes over a mil­lion im­ages; much larger datasets, such as 300 mil­lion im­ages, have been ex­plored and found to still offer ben­e­fit­s). Per­force, most of these im­ages will not be gen­er­ated by the dataset main­tainer and will come from a wide va­ri­ety of peo­ples, places, cam­eras, and set­tings, re­duc­ing any sys­tem­atic bi­as­es. It would be diffi­cult to find a cheap trick which works over many of those cat­e­gories si­mul­ta­ne­ous­ly, and the NN train­ing will con­stantly erode any cat­e­go­ry-spe­cific tricks in fa­vor of more gen­er­al­iz­able pat­tern-recog­ni­tion (in part be­cause there’s no in­her­ent ‘mod­u­lar­ity’ which could fac­tor a NN into a “tank cheap trick” NN & a “every­thing else real pat­tern-recog­ni­tion” NN). The power of gen­er­al­iz­able ab­strac­tions will tend to over­whelm the short­cuts, and the more data & tasks a NN is trained on, pro­vid­ing greater su­per­vi­sion & richer in­sight, the more this will be the case.

    • Even in the somewhat unusual case of a special-purpose binary classification CNN being trained on a few hundred images, because of the large sizes of good CNNs, it is typical to at least start with a pretrained ImageNet CNN in order to benefit from all the learned knowledge about edges & whatnot before "finetuning" on the special-purpose small dataset (see the finetuning sketch after this list). If the CNN starts with a huge inductive bias towards edges etc., it will have a hard time throwing away its informative priors and focusing purely on global brightness. (Often in finetuning, the lower levels of the CNN aren't allowed to change at all!)
    • An­other vari­ant on trans­fer learn­ing is to use the CNN as a fea­ture-gen­er­a­tor, by tak­ing the fi­nal lay­ers’ state com­puted on a spe­cific im­age and us­ing them as a vec­tor em­bed­ding, a sort of sum­mary of every­thing about the im­age con­tent rel­e­vant to clas­si­fi­ca­tion; this em­bed­ding is use­ful for other kinds of CNNs for pur­poses like style trans­fer (style trans­fer aims to warp an im­age to­wards the ap­pear­ance of an­other im­age while pre­serv­ing the em­bed­ding and thus pre­sum­ably the con­tent) or for GANs gen­er­at­ing im­ages (the dis­crim­i­na­tor can use the fea­tures to de­tect “weird” im­ages which don’t make sense, thereby forc­ing the gen­er­a­tor to learn what im­ages cor­re­spond to re­al­is­tic em­bed­dings).
  5. CNNs would typically show warning signs before a serious field deployment, either in diagnostics or in failures to extend the results.

    • One ben­e­fit of the fil­ter setup of CNNs is that it’s easy to vi­su­al­ize what the lower lay­ers are ‘look­ing at’; typ­i­cal­ly, CNN fil­ters will look like di­ag­o­nal or hor­i­zon­tal lines or curves or other sim­ple geo­met­ric pat­terns. In the case of a hy­po­thet­i­cal bright­ness-de­tec­tor CNN, be­cause it is not rec­og­niz­ing any shapes what­so­ever or do­ing any­thing but triv­ial bright­ness av­er­ag­ing, one would ex­pect its fil­ters to look like ran­dom noise and defi­nitely noth­ing like the usual fil­ter vi­su­al­iza­tions. This would im­me­di­ately alarm any deep learn­ing re­searcher that the CNN is not learn­ing what they thought it was learn­ing.
    • Related to filter visualization is input visualization: it's common to generate heatmaps of input images to see which regions of the input image are influencing the classification the most. If you are classifying "cats vs dogs", you expect a heatmap of a cat image to focus on the cat's head and tail, for example, and not on the painting on the living room wall behind it; if you have an image of a tank in a forest, you expect the heatmap to focus on the tank rather than on trees in the corner or on nothing in particular, just random-seeming pixels all over the image. If it's not focusing on the tank at all, how is it doing the classification?, one would then wonder. (Picasso (blog), Henderson & Rothe 2017-05-16, quote Yudkowsky 2008's version of the tank story as a motivation for their heatmap visualization tool and demonstrate that, for example, blocking out the sky in a tank image doesn't bother a VGG16 CNN image classifier but blocking the tank's treads does, and the heatmap focuses on the tank itself.) There are additional methods for trying to understand whether the NN has learned a potentially useful algorithm, such as the previously cited LIME. (A crude occlusion-sensitivity sketch of the heatmap idea appears after this list.)
  6. Also related to the visualization is going beyond classification to the logical next step of "localization" or "image segmentation": having detected an image with a tank in it somewhere, it is natural (especially for military purposes) to ask where in the image the tank is.

    A CNN which is truly detecting the tank itself will lend itself to image segmentation (eg CNN success in reaching human levels of ImageNet classification performance has also resulted in extremely good segmentation of images, categorizing each pixel as human/dog/cat/etc), while one learning the cheap trick of brightness will utterly fail at guessing better than chance which pixels are the tank.
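
To make the first two points concrete, here is a minimal sketch (in PyTorch/torchvision; the crop size, jitter strengths, and normalization statistics are illustrative standard choices, not a prescription) of the kind of normalization & data-augmentation pipeline that tends to erase a global-brightness shortcut:

```python
import torchvision.transforms as T

# Illustrative preprocessing/augmentation pipeline of the kind described above.
train_transforms = T.Compose([
    T.RandomResizedCrop(224),                # random crop & rescale
    T.RandomHorizontalFlip(),                # mirror left/right
    T.ColorJitter(brightness=0.4,            # randomly brighten/darken,
                  contrast=0.4,              # change contrast,
                  saturation=0.4),           # and saturation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # per-channel standardization:
                std=[0.229, 0.224, 0.225]),  # wipes out average brightness
])
```

Similarly, a hypothetical binary tank/non-tank classifier built the usual way would start from a pretrained backbone with its low-level filters frozen; this is a sketch of the generic recipe, not of any particular historical system:

```python
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)     # ImageNet-pretrained backbone
                                             # (newer torchvision spells this `weights=...`)

# Finetuning variant: freeze the pretrained layers so their edge/texture priors
# can't be discarded, and train only a new 2-class head.
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 2)   # only this layer will train

# Feature-extraction variant: use the penultimate activations as an embedding.
model.eval()
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
with torch.no_grad():
    embedding = feature_extractor(torch.randn(1, 3, 224, 224)).flatten(1)  # 1×2048 vector
```

And the simplest version of the heatmap diagnostic is occlusion sensitivity: slide a gray patch over the image and see where the prediction changes. A crude sketch (`occlusion_heatmap` is a hypothetical helper; any classifier taking a standard (1, 3, H, W) input would do):

```python
import torch

def occlusion_heatmap(model, image, target_class, patch=16, stride=16):
    """Record how much the target-class probability drops when each region is
    blanked out; if the 'tank' probability only depends on the sky, or on
    nothing in particular, something is wrong."""
    model.eval()
    _, _, H, W = image.shape
    with torch.no_grad():
        base = torch.softmax(model(image), dim=1)[0, target_class].item()
    heatmap = torch.zeros(H // stride, W // stride)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            occluded = image.clone()
            occluded[:, :, i:i+patch, j:j+patch] = 0.5      # gray square
            with torch.no_grad():
                p = torch.softmax(model(occluded), dim=1)[0, target_class].item()
            heatmap[i // stride, j // stride] = base - p    # big drop = important region
    return heatmap
```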

So, it is highly un­likely that a CNN trained via a nor­mal work­flow (data-aug­mented fine­tun­ing of a pre­trained Im­a­geNet CNN with stan­dard di­ag­nos­tics) would fail in this ex­act way or, at least, make it to a de­ployed sys­tem with­out fail­ing.

Could Something Like it Happen?

Could something like the tank story happen, in the sense of a selection-biased dataset yielding NNs which fail dismally in practice? One could imagine it happening, and it surely does at least occasionally, but in practice it doesn't seem to be a particularly serious or common problem—people routinely apply CNNs to very different contexts with considerable success.8 If it were such a serious and common problem, one would think that people would be able to provide a wealth of real-world examples of deployed systems rendered entirely useless by dataset bias, rather than repeating a fiction from 50 years ago.

One of the most rel­e­vant (if un­for­tu­nately older & pos­si­bly out of date) pa­pers I’ve read on this ques­tion of dataset bias is “Un­bi­ased Look at Dataset Bias”, Tor­ralba & Efros 2011:

Datasets are an in­te­gral part of con­tem­po­rary ob­ject recog­ni­tion re­search. They have been the chief rea­son for the con­sid­er­able progress in the field, not just as source of large amounts of train­ing data, but also as means of mea­sur­ing and com­par­ing per­for­mance of com­pet­ing al­go­rithms. At the same time, datasets have often been blamed for nar­row­ing the fo­cus of ob­ject recog­ni­tion re­search, re­duc­ing it to a sin­gle bench­mark per­for­mance num­ber. In­deed, some datasets, that started out as data cap­ture efforts aimed at rep­re­sent­ing the vi­sual world, have be­come closed worlds unto them­selves (e.g. the Corel world, the Cal­tech101 world, the PASCAL VOC world). With the fo­cus on beat­ing the lat­est bench­mark num­bers on the lat­est dataset, have we per­haps lost sight of the orig­i­nal pur­pose?

The goal of this pa­per is to take stock of the cur­rent state of recog­ni­tion datasets. We present a com­par­i­son study us­ing a set of pop­u­lar datasets, eval­u­ated based on a num­ber of cri­te­ria in­clud­ing: rel­a­tive data bi­as, cross-dataset gen­er­al­iza­tion, effects of closed-world as­sump­tion, and sam­ple val­ue. The ex­per­i­men­tal re­sults, some rather sur­pris­ing, sug­gest di­rec­tions that can im­prove dataset col­lec­tion as well as al­go­rithm eval­u­a­tion pro­to­cols. But more broad­ly, the hope is to stim­u­late dis­cus­sion in the com­mu­nity re­gard­ing this very im­por­tant, but largely ne­glected is­sue.

They demonstrate on several datasets (including ImageNet) that it is possible for an SVM (CNNs were not used) to guess at above-chance levels which dataset an image comes from, and that there are noticeable drops in accuracy when a classifier trained on one dataset is applied to ostensibly the same category in another dataset (eg an ImageNet "car" SVM classifier applied to PASCAL's "car" images will go from 57% to 36% accuracy). But—perhaps the glass is half-full—in none of the pairs does the performance degrade to near-zero, so despite the definite presence of dataset bias, the SVMs are still learning generalizable, transferable image classification. (Similarly, later replication studies9 show a generalization gap, but only a small one, with the better in-sample classifiers typically also performing better out-of-sample; ImageNet resnets produce multiple new SOTAs on other image datasets using finetuning transfer learning, and a number of scaling papers show much better representations & robustness & transfer with extremely large CNNs; and a comparison of Fisher vectors (an SVM trained on SIFT features) to CNNs on PASCAL VOC finds the Fisher vectors overfit by eg classifying horses based on copyright watermarks while the CNN nevertheless classifies them based on the correct parts, although the CNN may succumb to a different dataset bias by classifying airplanes based on backgrounds of skies10.) I believe we have good reason to expect our CNNs to also work in the wild.
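
The flavor of their cross-dataset comparison can be conveyed with a toy sketch (synthetic features standing in for image descriptors; `make_dataset` is a hypothetical helper and the numbers are illustrative, not theirs): train a linear classifier on one "dataset", then test it on another "dataset" whose features carry a dataset-specific bias, and compare within-dataset vs cross-dataset accuracy.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_dataset(n, bias):
    """Synthetic stand-in for 'the same category in two datasets': the class
    signal is identical, but each dataset adds its own offset to the features
    (capture bias, processing differences, photographer habits, etc.)."""
    X = rng.normal(size=(n, 50))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X + bias, y

X_a, y_a = make_dataset(2000, bias=0.0)   # plays the role of "ImageNet cars"
X_b, y_b = make_dataset(2000, bias=0.8)   # plays the role of "PASCAL cars"

clf = LinearSVC().fit(X_a[:1000], y_a[:1000])
print("within-dataset accuracy:", accuracy_score(y_a[1000:], clf.predict(X_a[1000:])))
print("cross-dataset accuracy: ", accuracy_score(y_b, clf.predict(X_b)))
```

In this toy setup the accuracy drops noticeably across the dataset boundary but stays well above chance, which is the qualitative pattern Torralba & Efros report.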

Some real in­stances of dataset bi­as, more or less (most of these were caught by stan­dard held­out datasets and ar­guably aren’t the ‘tank story’ at al­l):

  • a particularly appropriate example is the unsuccessful Soviet anti-tank dog program of WWII: a failure for several reasons, among them that the dogs had been trained on Russian tanks and so sought those out rather than the enemy German tanks, because the dogs recognized either the fuel smell or the fuel canisters (diesel vs gasoline)

  • “The per­son con­cept in mon­keys (Ce­bus apella)”, D’Am­ato & Van Sant 1988

  • Google Photos in June 2015 caused a social-media fuss over mislabeling African-Americans as gorillas; Google did not explain how the Photos app made that mistake, but it is presumably using a CNN, and this is an example of either dataset bias (many more Caucasian/Asian faces leading to better performance on them and continued poor performance everywhere else) and/or a mis-specified loss function (the CNN optimizing a standard classification loss and responding to class imbalance or objective color similarity by preferring to guess 'gorilla' rather than 'human' to minimize loss, despite what ought to be a greater penalty for mistakenly classifying a human as an animal/object rather than vice versa). A similar issue occurred with Flickr in May 2015.

  • "Gender-from-Iris or Gender-from-Mascara?", Kuehlkamp et al 2017

  • Gidi Shperber, "What I've learned from Kaggle's fisheries competition" (2017-05-01): initial application of VGG ImageNet CNNs for transfer learning solved the fish photograph classification problem almost immediately, but failed on the submission validation set; fish categories could be predicted from the specific boat taking the photographs (a grouped train/test split of the kind sketched after this list would have flagged this)

  • “Leak­age in data min­ing: For­mu­la­tion, de­tec­tion, and avoid­ance”, Kauf­man et al 2011 dis­cusses the gen­eral topic and men­tions a few ex­am­ples from KDD-Cup

  • Dan Piponi (2017-10-16): “Real world ex­am­ple from work: hos­pi­tals spe­cialise in differ­ent in­juries so CNN for di­ag­no­sis used an­no­ta­tions on x-rays to ID hos­pi­tal.”

  • Thomas G. Di­et­terich:

    We made exactly the same mistake in one of my projects on insect recognition. We photographed 54 classes of insects. Specimens had been collected, identified, and placed in vials. Vials were placed in boxes sorted by class. I hired student workers to photograph the specimens. Naturally they did this one box at a time; hence, one class at a time. Photos were taken in alcohol. Bubbles would form in the alcohol. Different bubbles on different days. The learned classifier was surprisingly good. But a saliency map revealed that it was reading the bubble patterns and ignoring the specimens. I was so embarrassed that I had made the oldest mistake in the book (even if it was apocryphal). Unbelievable. Lesson: always randomize even if you don't know what you are controlling for!

  • a possible case is Wu & Zhang 2016, "Automated Inference on Criminality using Face Images", which attempts to use CNNs to classify standardized government ID photos of Chinese people by whether the person has been arrested, the source of the criminal IDs being government publications of wanted suspects vs ordinary people's IDs collected online; the photos are repeatedly described as ID photos and implied to be uniform. The use of official government ID photos taken in advance of any crime would appear to eliminate one's immediate objections about dataset bias—certainly ID photos would be distinct in many ways from ordinary cropped promotional headshots—and so the results seem strong.

    In response to harsh criticism (some points of which are more relevant & likely than others…), Wu & Zhang admit in their response that the dataset is not quite as implied:

    All crim­i­nal ID pho­tos are gov­ern­ment is­sued, but not mug shots. To our best knowl­edge, they are nor­mal gov­ern­ment is­sued ID por­traits like those for dri­ver’s li­cense in USA. In con­trast, most of the non­crim­i­nal ID style pho­tos are taken offi­cially by some or­ga­ni­za­tions (such as real es­tate com­pa­nies, law firms, etc.) for their web­sites. We stress that they are not selfies.

    While there is no di­rect repli­ca­tion test­ing the Wu & Zhang 2016 re­sults that I know of, the in­her­ent con­sid­er­able differ­ences be­tween the two class­es, which are not ho­moge­nous at all, make me highly skep­ti­cal.

  • Pos­si­ble: Win­kler et al 2019 ex­am­ine a com­mer­cial CNN (“Mole­an­a­lyz­er-Pro”; Haenssle et al 2018) for skin can­cer de­tec­tion. Con­cerned by the fact that doc­tors some­times use pur­ple mark­ers to high­light po­ten­tial­ly-ma­lig­nant skin can­cers for eas­ier ex­am­i­na­tion, they com­pare before/after pho­tographs of skin can­cers which have been high­light­ed, and find that the pur­ple high­light­ing in­creases the prob­a­bil­ity of be­ing clas­si­fied as ma­lig­nant.

    How­ev­er, it is un­clear that this is a dataset bias prob­lem, as the ex­ist­ing train­ing datasets for skin can­cer are re­al­is­tic and al­ready in­clude pur­ple marker sam­ples11. The demon­strated ma­nip­u­la­tion may sim­ply re­flect the CNN us­ing pur­ple as a proxy for hu­man con­cern, which is an in­for­ma­tive sig­nal and de­sir­able if it im­proves clas­si­fi­ca­tion per­for­mance in the real world on real med­ical cas­es. It is pos­si­ble that the train­ing datasets are in fact bi­ased to some de­gree with too much/too lit­tle pur­ple or that use of pur­ple differs sys­tem­at­i­cally across hos­pi­tals, and those would dam­age per­for­mance to some de­gree, but that is not demon­strated by their before/after com­par­i­son. Ide­al­ly, one would run a field trial to test the CNN’s per­for­mance as a whole by us­ing it in var­i­ous hos­pi­tals and then fol­low­ing up on all cases to de­ter­mine be­nign or ma­lig­nant; if the clas­si­fi­ca­tion per­for­mance drops con­sid­er­ably from the orig­i­nal train­ing, then that im­plies some­thing (pos­si­bly the pur­ple high­light­ing) has gone wrong.

  • Possible: Esteva et al 2017 train a skin cancer classifier; the final CNN performs well in independent test sets. The paper does not mention this problem, but media coverage reported that rulers in photographs served as unintentional features:

    He and his colleagues had one such problem in their study with rulers. When dermatologists are looking at a lesion that they think might be a tumor, they'll break out a ruler—the type you might have used in grade school—to take an accurate measurement of its size. Dermatologists tend to do this only for lesions that are a cause for concern. So in the set of biopsy images, if an image had a ruler in it, the algorithm was more likely to call a tumor malignant, because the presence of a ruler correlated with an increased likelihood a lesion was cancerous. Unfortunately, as Novoa emphasizes, the algorithm doesn't know why that correlation makes sense, so it could easily misinterpret a random ruler sighting as grounds to diagnose cancer.

    It’s un­clear how they de­tected this prob­lem or how they fixed it. And like Win­kler et al 2019, it’s un­clear if this was a prob­lem which would re­duce re­al-world per­for­mance (are der­ma­tol­o­gists go­ing to stop mea­sur­ing wor­ri­some le­sion­s?).

Should We Tell Stories We Know Aren’t True?

So the NN tank story probably didn't happen as described, but something somewhat like it could have happened and things sort of like it could happen now, and it is (as proven by its history) a catchy story to warn students with—it's not true, but it's ben trovato. Should we still mention it to journalists or in blog posts or in discussions of AI risk, as a noble lie?

I think not. In gen­er­al, we should pro­mote more epis­temic rigor and higher stan­dards in an area where there is al­ready far too much im­pact of fic­tional sto­ries (eg the de­press­ing in­evitabil­ity of a Ter­mi­na­tor al­lu­sion in AI risk dis­cus­sion­s). Nor do I con­sider the story par­tic­u­larly effec­tive from a di­dac­tic per­spec­tive: rel­e­gat­ing dataset bias to myth­i­cal sto­ries does not in­form the lis­tener about how com­mon or how se­ri­ous dataset bias is, nor is it help­ful for re­searchers in­ves­ti­gat­ing coun­ter­mea­sures and di­ag­nos­tic­s—the LIME de­vel­op­ers, for ex­am­ple, are not helped by sto­ries about Russ­ian tanks, but need real test­cases to show that their in­ter­pretabil­ity tools work & would help ma­chine learn­ing de­vel­op­ers di­ag­nose & fix dataset bias.

I also fear that telling the tank story tends to promote complacency and underestimation of the state of the art, by implying that NNs and AI in general are toy systems which are far from practicality & cannot work in the real world (particularly in the story variants which date it relatively recently), or that such systems, when they fail, will fail in easily diagnosed, visible, sometimes amusing ways—ways which can be diagnosed by a human comparing the photos or applying some political reasoning to the outputs. But modern NNs are powerful, are often deployed to the real world despite the spectre of dataset bias, and do not fail in blatant ways; what we actually see with deep learning are far more concerning failure modes like "adversarial examples", which are quite as inscrutable as the neural nets themselves (or AlphaGo's one misjudged move resulting in its only loss to Lee Sedol). Adversarial examples are particularly insidious because the NN will work flawlessly in all the normal settings and contexts, only to fail totally when exposed to a custom adversarial input. More importantly, dataset bias and failure to transfer tend to be a self-limiting problem, particularly when embedded in an ongoing system or reinforcement learning agent, since if the NN is making errors based on dataset bias, it will in effect be generating new counterexample datapoints for its next iteration.
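
For readers who have not encountered them, the core of the adversarial-example phenomenon fits in a few lines; this is a generic sketch of the classic fast-gradient-sign attack (Goodfellow et al 2014), not tied to any particular deployed system, and `fgsm` is simply an illustrative helper:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    """Fast Gradient Sign Method: nudge every pixel by ±eps in whichever
    direction most increases the loss. The change is imperceptible to a human
    but can flip the classifier's output on an otherwise normal image
    (assumes inputs scaled to [0, 1])."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```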

Alternative examples

“There is noth­ing so use­less as do­ing effi­ciently that which should not be done at all.”

Pe­ter Drucker

The more trou­bling er­rors are ones where the goal it­self, the re­ward func­tion, is mis­-spec­i­fied or wrong or harm­ful. I am less wor­ried about al­go­rithms learn­ing to do poorly the right thing for the wrong rea­sons be­cause hu­mans are sloppy in their data col­lec­tion than I am about them learn­ing to do well the wrong thing for the right rea­sons de­spite per­fect data col­lec­tion. Us­ing losses which have lit­tle to do with the true hu­man util­ity func­tion or de­ci­sion con­text is far more com­mon than se­ri­ous dataset bi­as: peo­ple think about where their data is com­ing from, but they tend not to think about what the con­se­quences of wrong clas­si­fi­ca­tions are. Such re­ward func­tion prob­lems can­not be fixed by col­lect­ing any amount of data or mak­ing data more rep­re­sen­ta­tive of the real world, and for large-s­cale sys­tems will be more harm­ful.
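
A deliberately trivial sketch of the kind of mis-specified shaped reward that recurs in the examples below (all numbers invented, and `shaped_return` is just an illustrative helper): give +1 for any step that moves closer to the goal, no penalty for moving away, and +10 for finishing. An optimizer of this objective prefers oscillating forever to actually reaching the goal.

```python
def shaped_return(actions, start=0, goal=5, horizon=100):
    """Total shaped reward for repeating a fixed action cycle on a 1-D corridor."""
    pos, total = start, 0.0
    for t in range(horizon):
        new = pos + actions[t % len(actions)]
        if abs(goal - new) < abs(goal - pos):
            total += 1.0                 # shaping bonus for getting closer
        pos = new
        if pos == goal:
            return total + 10.0          # true task reward for finishing
    return total

print("walk straight to the goal:", shaped_return([+1]))       # 5 + 10 = 15
print("oscillate and never finish:", shaped_return([+1, -1]))  # 50, and the task is never done
```

This is essentially the failure mode of the bicycle and boat-race examples below; one standard fix is to make the shaping potential-based (as in the Ng et al 1999 paper cited below), so that back-and-forth movement nets zero bonus.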

Un­for­tu­nate­ly, I know of no par­tic­u­larly com­pre­hen­sive lists of ex­am­ples of mis­-spec­i­fied rewards/unexpectedly bad proxy ob­jec­tive functions/“re­ward hack­ing”/“wire­head­ing”/“per­verse in­stan­ti­a­tion”12; per­haps peo­ple can make sug­ges­tions, but a few ex­am­ples I have found or re­call in­clude:

  • op­ti­miza­tion for nu­tri­tious (not nec­es­sar­ily palat­able!) low-cost di­ets: “The cost of sub­sis­tence”, Stigler 1945, “The Diet Prob­lem”, Dantzig 1990, “Stigler’s Diet Prob­lem Re­vis­ited”, Gar­ille & Gass 2001

    • SMT/SAT solvers are likewise infamous for finding strictly valid yet surprising or useless solutions, which perversity is exactly what makes them so invaluable in security/formal-verification research (for example, in RISC-V verification of exceptions, discovering that it can trigger an exception by turning on a debug unit & setting a breakpoint, or using an obscure memory mode setting)
  • boat race reward-shaping for picking up targets results in not finishing the race at all but going in circles to hit targets: "Faulty Reward Functions in the Wild", OpenAI

  • a classic 3D robot-arm NN agent, in a somewhat unusual setup where the evaluator/reward function is another NN trained to predict human evaluations, learns to move the arm to a position which looks like it is positioned at the goal but is actually just in between the 'camera' and the goal: "Deep Reinforcement Learning from Human Preferences", Christiano et al 2017, OpenAI

  • re­ward-shap­ing a bi­cy­cle agent for not falling over & mak­ing progress to­wards a goal point (but not pun­ish­ing for mov­ing away) leads it to learn to cir­cle around the goal in a phys­i­cally sta­ble loop: “Learn­ing to Drive a Bi­cy­cle us­ing Re­in­force­ment Learn­ing and Shap­ing”, Randlov & Al­strom 1998; sim­i­lar diffi­cul­ties in avoid­ing patho­log­i­cal op­ti­miza­tion were ex­pe­ri­enced by Cook 2004 (video of pol­i­cy-it­er­a­tion learn­ing to spin han­dle-bar to stay up­right).

  • re­ward-shap­ing a soc­cer ro­bot for touch­ing the ball caused it to learn to get to the ball and “vi­brate” touch­ing it as fast as pos­si­ble: David An­dre & As­tro Teller in Ng et al 1999, “Pol­icy in­vari­ance un­der re­ward trans­for­ma­tions: the­ory and ap­pli­ca­tion to re­ward shap­ing”

  • environments involving walking/running/movement which reward movement seem to often result in the agents learning to fall over as a local optimum of speed generation, possibly bouncing around; for example, Sims notes in one paper (Sims 1994) that "It is important that the physical simulation be reasonably accurate when optimizing for creatures that can move within it. Any bugs that allow energy leaks from non-conservation, or even round-off errors, will inevitably be discovered and exploited by the evolving creatures. …speed is used as the selection criteria, but the vertical component of velocity is ignored. For land environments, it can be necessary to prevent creatures from generating high velocities by simply falling over." In the "3-D Morphology" work, Sims discovered that without height limits, the creatures just became as tall as possible and fell over; and if the conservation of momentum was not exact, creatures could evolve 'paddles' and paddle themselves at high velocity. (Similar exploitation of rounding error was evolved by OpenAI in 2017 to turn apparently linear neural networks into nonlinear ones; Jaderberg et al 2019 appears to have had a similar momentum bug in its Quake simulator: "In one test, the bots invented a completely novel strategy, exploiting a bug that let teammates give each other a speed boost by shooting them in the back.")

  • one paper training a simulated robot gripper arm to stack objects like Legos included reward shaping; pathologies included "hovering", and for a reward-shaping term for lifting the bottom face of the top block upwards, DDPG learned to knock the blocks over, thereby (temporarily) elevating the bottom of the top block and receiving the reward:

    We con­sider three differ­ent com­pos­ite re­wards in ad­di­tional to the orig­i­nal sparse task re­ward:

    1. Grasp shap­ing: Grasp brick 1 and Stack brick 1, i.e. the agent re­ceives a re­ward of 0.25 when the brick 1 has been grasped and a re­ward of 1.0 after com­ple­tion of the full task.
    2. Reach and grasp shap­ing: Reach brick 1, Grasp brick 1 and Stack brick 1, i.e. the agent re­ceives a re­ward of 0.125 when be­ing close to brick 1, a re­ward of 0.25 when brick 1 has been grasped, and a re­ward of 1.0 after com­ple­tion of the full task.
    3. Full com­pos­ite shap­ing: the sparse re­ward com­po­nents as be­fore in com­bi­na­tion with the dis­tance-based smoothly vary­ing com­po­nents.

    Fig­ure 5 shows the re­sults of learn­ing with the above re­ward func­tions (blue traces). The fig­ure makes clear that learn­ing with the sparse re­ward only does not suc­ceed for the full task. In­tro­duc­ing an in­ter­me­di­ate re­ward for grasp­ing al­lows the agent to learn to grasp but learn­ing is very slow. The time to suc­cess­ful grasp­ing can be sub­stan­tially re­duced by giv­ing a dis­tance based re­ward com­po­nent for reach­ing to the first brick, but learn­ing does not progress be­yond grasp­ing. Only with an ad­di­tional in­ter­me­di­ate re­ward com­po­nent as in con­tin­u­ous reach, grasp, stack the full task can be solved.

    Although the above reward functions are specific to the particular task, we expect that the idea of a composite reward function can be applied to many other tasks thus allowing learning to succeed even for challenging problems. Nevertheless, great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing the reward function components: e.g. reach and grasp components leading to a grasp unsuitable for stacking, agent not stacking the bricks because it will stop receiving the grasping reward before it receives reward for stacking and the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick. We show examples of these in the video.

  • RL agents using learned model-based planning paradigms such as model predictive control are noted to have issues with the planner essentially exploiting the learned model by choosing a plan going through the worst-modeled parts of the environment and producing unrealistic plans using teleportation, eg Mishra et al 2017, "Prediction and Control with Temporal Segment Models", who note:

    If we at­tempt to solve the op­ti­miza­tion prob­lem as posed in (2), the so­lu­tion will often at­tempt to ap­ply ac­tion se­quences out­side the man­i­fold where the dy­nam­ics model is valid: these ac­tions come from a very differ­ent dis­tri­b­u­tion than the ac­tion dis­tri­b­u­tion of the train­ing da­ta. This can be prob­lem­at­ic: the op­ti­miza­tion may find ac­tions that achieve high re­wards un­der the model (by ex­ploit­ing it in a regime where it is in­valid) but that do not ac­com­plish the goal when they are ex­e­cuted in the real en­vi­ron­ment.

    …Next, we com­pare our method to the base­lines on tra­jec­tory and pol­icy op­ti­miza­tion. Of in­ter­est is both the ac­tual re­ward achieved in the en­vi­ron­ment, and the differ­ence be­tween the true re­ward and the ex­pected re­ward un­der the mod­el. If a con­trol al­go­rithm ex­ploits the model to pre­dict un­re­al­is­tic be­hav­ior, then the lat­ter will be large. We con­sider two tasks….Un­der each mod­el, the op­ti­miza­tion finds ac­tions that achieve sim­i­lar mod­el-pre­dicted re­wards, but the base­lines suffer from large dis­crep­an­cies be­tween model pre­dic­tion and the true dy­nam­ics. Qual­i­ta­tive­ly, we no­tice that, on the push­ing task, the op­ti­miza­tion ex­ploits the LSTM and one-step mod­els to pre­dict un­re­al­is­tic state tra­jec­to­ries, such as the ob­ject mov­ing with­out be­ing touched or the arm pass­ing through the ob­ject in­stead of col­lid­ing with it. Our model con­sis­tently per­forms bet­ter, and, with a la­tent ac­tion pri­or, the true ex­e­cu­tion closely matches the mod­el’s pre­dic­tion. When it makes in­ac­cu­rate pre­dic­tions, it re­spects phys­i­cal in­vari­ants, such as ob­jects stay­ing still un­less they are touched, or not pen­e­trat­ing each other when they col­lide

    This is similar to Sims's issues, or to current issues in training walking or running agents in environments like MuJoCo, where it is easy for them to learn odd gaits like hopping (one approach adds extra penalties for impacts to try to avoid this), jumping (eg Stelmaszczyk's attempts at reward-shaping a skeleton agent), or flailing around wildly (others add random pushes/shoves to the environment to try to make the agent learn more generalizable policies), which may work quite well in the specific simulation but not elsewhere. (To some degree this is beneficial for driving exploration in poorly-understood regions, so it's not all bad.) Christine Barron, working on a pancake-cooking robot-arm simulation, ran into reward-shaping problems: rewarding for each timestep without the pancake on the floor teaches the agent to hurl the pancake into the air as hard as possible; and for the passing-the-butter agent, rewarding for getting close to the goal produces the same close-approach-but-avoidance behavior to maximize reward.

  • A cu­ri­ous lex­i­co­graph­ic-pref­er­ence raw-RAM NES AI al­go­rithm learns to pause the game to never lose at Tetris: Mur­phy 2013, “The First Level of Su­per Mario Bros. is Easy with Lex­i­co­graphic Or­der­ings and Time Trav­el… after that it gets a lit­tle tricky”

  • RL agent in Udac­ity self­-driv­ing car re­warded for speed learns to spin in cir­cles: Matt Kel­cey

  • NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days of survival, yields an optimal plan of killing 2/3 of the crew & keeping the surviving crew alive as long as possible: iand675

  • Doug Lenat's Eurisko famously had issues with "parasitic" heuristics which, thanks to the system's self-modifying ability, edited important results to claim credit and be rewarded; this class of wireheading heuristics was troublesome enough that Lenat made the Eurisko core unmodifiable: "EURISKO: A program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results", Lenat 1983 (pg90)

  • genetic algorithms for image classification evolve a timing attack to infer image labels based on hard drive storage location: https://news.ycombinator.com/item?id=6269114

  • train­ing a dog to roll over re­sults in slam­ming against the wall: http://lesswrong.com/lw/7qz/machine_learning_and_unintended_consequences/4vlv ; dol­phins re­warded for find­ing trash & dead seag­ulls in their tank learned to man­u­fac­ture trash & hunt liv­ing seag­ulls for more re­wards

  • cir­cuit de­sign with genetic/evolutionary com­pu­ta­tion:

    • an at­tempt to evolve a cir­cuit on an FPGA, to dis­crim­i­nate au­dio tones of 1kHz & 10kHz with­out us­ing any tim­ing el­e­ments, evolved a de­sign which de­pended on dis­con­nected cir­cuits in or­der to work: “An evolved cir­cuit, in­trin­sic in sil­i­con, en­twined with physics”, Thomp­son 1996. (“Pos­si­ble mech­a­nisms in­clude in­ter­ac­tions through the pow­er-sup­ply wiring, or elec­tro­mag­netic cou­pling.” The evolved cir­cuit is sen­si­tive to room tem­per­a­ture vari­a­tions 23–43C, only work­ing per­fectly over the 10C range of room tem­per­a­ture it was ex­posed to dur­ing the 2 weeks of evo­lu­tion. It is also sen­si­tive to the ex­act lo­ca­tion on the FPGA, de­grad­ing when shifted to a new po­si­tion; fur­ther fine­tun­ing evo­lu­tion fixes that, but then is vul­ner­a­ble when shifted back to the orig­i­nal lo­ca­tion.)
    • an at­tempt to evolve an os­cil­la­tor or a timer wound up evolv­ing a cir­cuit which picked up ra­dio sig­nals from the lab PCs (although since the cir­cuits did work at their as­signed func­tion as the hu­man in­tend­ed, should we con­sider this a case of ‘dataset bias’ where the ‘dataset’ is the lo­cal lab en­vi­ron­men­t?): “The evolved ra­dio and its im­pli­ca­tions for mod­el­ling the evo­lu­tion of novel sen­sors”, Jon Bird and Paul Layzell 2002
  • when training a "minitaur" bot in simulation to carry a ball or duck on its back, CMA-ES discovers it can drop the ball into a leg joint and then wiggle across the floor without the ball ever dropping

  • CycleGAN, a cooperative GAN architecture for converting images from one genre to another (eg horses⟺zebras), has a loss function that rewards accurate reconstruction of images from their transformed versions; CycleGAN turns out to partially solve the task by, in addition to the cross-domain analogies it learns, steganographically hiding autoencoder-style data about the original image invisibly inside the transformed image to assist the reconstruction of details (Chu et al 2017)

    A researcher in 2020 working on art colorization told me of an interesting similar behavior: his automatically-grayscaled images were failing to train the NN well, and he concluded that this was because grayscaling a color image produces many shades of gray in a way that human artists do not, and that the formula used by OpenCV for RGB → grayscale permits only a few colors to map onto any given shade of gray, enabling accurate guessing of the original color! Such issues might require learning a grayscaler, similar to superresolution needing learned downscalers.

  • the ROUGE summarization metric, based on matching sub-phrases, is typically used with RL techniques since it is a non-differentiable loss; Salesforce (Paulus et al 2017) notes that an effort at a ROUGE-only summarization NN produced largely gibberish summaries, and they had to add in another loss function to get high-quality results

  • Alex Ir­pan writes of 3 anec­dotes:

    In talks with other RL re­searchers, I’ve heard sev­eral anec­dotes about the novel be­hav­ior they’ve seen from im­prop­erly de­fined re­wards.

    • A coworker is teach­ing an agent to nav­i­gate a room. The episode ter­mi­nates if the agent walks out of bounds. He did­n’t add any penalty if the episode ter­mi­nates this way. The fi­nal pol­icy learned to be sui­ci­dal, be­cause neg­a­tive re­ward was plen­ti­ful, pos­i­tive re­ward was too hard to achieve, and a quick death end­ing in 0 re­ward was prefer­able to a long life that risked neg­a­tive re­ward.
    • A friend is train­ing a sim­u­lated ro­bot arm to reach to­wards a point above a table. It turns out the point was de­fined with re­spect to the ta­ble, and the ta­ble was­n’t an­chored to any­thing. The pol­icy learned to slam the ta­ble re­ally hard, mak­ing the ta­ble fall over, which moved the tar­get point too. The tar­get point just so hap­pened to fall next to the end of the arm.
    • A re­searcher gives a talk about us­ing RL to train a sim­u­lated ro­bot hand to pick up a ham­mer and ham­mer in a nail. Ini­tial­ly, the re­ward was de­fined by how far the nail was pushed into the hole. In­stead of pick­ing up the ham­mer, the ro­bot used its own limbs to punch the nail in. So, they added a re­ward term to en­cour­age pick­ing up the ham­mer, and re­trained the pol­i­cy. They got the pol­icy to pick up the ham­mer…but then it threw the ham­mer at the nail in­stead of ac­tu­ally us­ing it.

    Ad­mit­ted­ly, these are all sec­ond­hand ac­counts, and I haven’t seen videos of any of these be­hav­iors. How­ev­er, none of it sounds im­plau­si­ble to me. I’ve been burned by RL too many times to be­lieve oth­er­wise…I’ve taken to imag­in­ing deep RL as a de­mon that’s de­lib­er­ately mis­in­ter­pret­ing your re­ward and ac­tively search­ing for the lazi­est pos­si­ble lo­cal op­ti­ma. It’s a bit ridicu­lous, but I’ve found it’s ac­tu­ally a pro­duc­tive mind­set to have.

  • an evolutionary-strategies RL agent in the ALE game Q*bert finds that it can steadily earn points by committing 'suicide' to lure an enemy into following it; more interestingly, it also discovers what appears to be a previously unknown bug where a sequence of jumps will, semi-randomly, permanently force the game into a state where the entire level begins flashing and the score increases rapidly & indefinitely until the game is reset (video)

  • Lapuschkin et al note a borderline case in the ALE pinball game, where the 'nudge' ability is unlimited (unlike in all real pinball machines) and a DQN can learn to score arbitrarily by nudging the table so that the ball passes over a switch repeatedly:

    The second showcase example studies neural network models (see Figure 5 for the network architecture) trained to play Atari games, here Pinball. As shown in [5], the DNN achieves excellent results beyond human performance. Like for the previous example, we construct LRP heatmaps to visualize the DNN's decision behavior in terms of pixels of the pinball game. Interestingly, after extensive training, the heatmaps become focused on few pixels representing high-scoring switches and lose track of the flippers. A subsequent inspection of the games in which these particular LRP heatmaps occur, reveals that DNN agent firstly moves the ball into the vicinity of a high-scoring switch without using the flippers at all, then, secondly, "nudges" the virtual pinball table such that the ball infinitely triggers the switch by passing over it back and forth, without causing a tilt of the pinball table (see Figure 2b and Figure 6 for the heatmaps showing this point, and also Supplementary Video 1). Here, the model has learned to abuse the "nudging" threshold implemented through the tilting mechanism in the Atari Pinball software. From a pure game scoring perspective, it is indeed a rational choice to exploit any game mechanism that is available. In a real pinball game, however, the player would likely go bust since the pinball machinery is programmed to tilt after a few strong movements of the whole physical machine.

  • "Trial without Error: Towards Safe Reinforcement Learning via Human Intervention", Saunders et al 2017; the blog writeup notes:

    The Road Run­ner re­sults are es­pe­cially in­ter­est­ing. Our goal is to have the agent learn to play Road Run­ner with­out los­ing a sin­gle life on Level 1 of the game. Deep RL agents are known to dis­cover a ‘Score Ex­ploit’ in Road Run­ner: they learn to in­ten­tion­ally kill them­selves in a way that (para­dox­i­cal­ly) earns greater re­ward. Dy­ing at a pre­cise time causes the agent to re­peat part of Level 1, where it earns more points than on Level 2. This is a lo­cal op­ti­mum in pol­icy space that a hu­man gamer would never be stuck in.

    Ide­al­ly, our Blocker would pre­vent all deaths on Level 1 and hence elim­i­nate the Score Ex­ploit. How­ev­er, through ran­dom ex­plo­ration the agent may hit upon ways of dy­ing that “fool” our Blocker (be­cause they look differ­ent from ex­am­ples in its train­ing set) and hence learn a new ver­sion of the Score Ex­ploit. In other words, the agent is im­plic­itly per­form­ing a ran­dom search for ad­ver­sar­ial ex­am­ples for our Blocker (which is a con­vo­lu­tional neural net)…In Road Run­ner we did not achieve zero cat­a­stro­phes but were able to re­duce the rate of deaths per frame from 0.005 (with no hu­man over­sight at all) to 0.0001.

  • another paper notes various bugs in the ALE games, but also a new infinite loop for maximizing scores:

    Fi­nal­ly, we dis­cov­ered that on some games the ac­tual op­ti­mal strat­egy is by do­ing a loop over and over giv­ing a small amount of re­ward. In El­e­va­tor Ac­tion the agent learn to stay at the first floor and kill over and over the first en­e­my. This be­hav­ior can­not be seen as an ac­tual is­sue as the agent is ba­si­cally op­ti­miz­ing score but this is defi­nitely not the in­tended goal. A hu­man player would never per­form this way.

  • DeepMind's R2D3 writeup notes:

    Wall Sen­sor Stack: The orig­i­nal Wall Sen­sor Stack en­vi­ron­ment had a bug that the R2D3 agent was able to ex­ploit. We fixed the bug and ver­i­fied the agent can learn the proper stack­ing be­hav­ior.

    …An­other de­sir­able prop­erty of our ap­proach is that our agents are able to learn to out­per­form the demon­stra­tors, and in some cases even to dis­cover strate­gies that the demon­stra­tors were not aware of. In one of our tasks the agent is able to dis­cover and ex­ploit a bug in the en­vi­ron­ment in spite of all the demon­stra­tors com­plet­ing the task in the in­tended way…R2D3 per­formed bet­ter than our av­er­age hu­man demon­stra­tor on Base­ball, Draw­bridge, Nav­i­gate Cubes and the Wall Sen­sor tasks. The be­hav­ior on Wall Sen­sor Stack in par­tic­u­lar is quite in­ter­est­ing. On this task R2D3 found a com­pletely differ­ent strat­egy than the hu­man demon­stra­tors by ex­ploit­ing a bug in the im­ple­men­ta­tion of the en­vi­ron­ment. The in­tended strat­egy for this task is to stack two blocks on top of each other so that one of them can re­main in con­tact with a wall mounted sen­sor, and this is the strat­egy em­ployed by the demon­stra­tors. How­ev­er, due to a bug in the en­vi­ron­ment the strat­egy learned by R2D3 was to trick the sen­sor into re­main­ing ac­tive even when it is not in con­tact with the key by press­ing the key against it in a pre­cise way.

  • "Emergent Tool Use From Multi-Agent Autocurricula", Baker et al 2019:

    We orig­i­nally be­lieved de­fend­ing against ramp use would be the last stage of emer­gence in this en­vi­ron­ment; how­ev­er, we were sur­prised to find that yet two more qual­i­ta­tively new strate­gies emerged. After 380 mil­lion to­tal episodes of train­ing, the seek­ers learn to bring a box to the edge of the play area where the hiders have locked the ramps. The seek­ers then jump on top of the box and surf it to the hiders’ shel­ter; this is pos­si­ble be­cause the en­vi­ron­ment al­lows agents to move to­gether with the box re­gard­less of whether they are on the ground or not. In re­spon­se, the hiders learn to lock all of the boxes in place be­fore build­ing their shel­ter.

    The OA blog post ex­pands on the noted ex­ploits:

    Sur­pris­ing be­hav­iors: We’ve shown that agents can learn so­phis­ti­cated tool use in a high fi­delity physics sim­u­la­tor; how­ev­er, there were many lessons learned along the way to this re­sult. Build­ing en­vi­ron­ments is not easy and it is quite often the case that agents find a way to ex­ploit the en­vi­ron­ment you build or the physics en­gine in an un­in­tended way.

    • Box surfing: Since agents move by ap­ply­ing forces to them­selves, they can grab a box while on top of it and “surf” it to the hider’s lo­ca­tion.
    • End­less run­ning: With­out adding ex­plicit neg­a­tive re­wards for agents leav­ing the play area, in rare cases hiders will learn to take a box and end­lessly run with it.
    • Ramp ex­ploita­tion (hider­s): Re­in­force­ment learn­ing is amaz­ing at find­ing small me­chan­ics to ex­ploit. In this case, hiders abuse the con­tact physics and re­move ramps from the play area.
    • Ramp ex­ploita­tion (seek­er­s): In this case, seek­ers learn that if they run at a wall with a ramp at the right an­gle, they can launch them­selves up­ward.
  • "Fine-Tuning Language Models from Human Preferences", Ziegler et al 2019, fine-tuned an English text-generation model based on human ratings; they provide a curious example of a reward-specification bug. Here, the reward was accidentally negated; this reversal, rather than resulting in nonsense, resulted in (literally) perversely coherent behavior of emitting obscenities to maximize the new score:

    Bugs can op­ti­mize for bad be­hav­ior: One of our code refac­tors in­tro­duced a bug which flipped the sign of the re­ward. Flip­ping the re­ward would usu­ally pro­duce in­co­her­ent text, but the same bug also flipped the sign of the KL penal­ty. The re­sult was a model which op­ti­mized for neg­a­tive sen­ti­ment while pre­serv­ing nat­ural lan­guage. Since our in­struc­tions told hu­mans to give very low rat­ings to con­tin­u­a­tions with sex­u­ally ex­plicit text, the model quickly learned to out­put only con­tent of this form. This bug was re­mark­able since the re­sult was not gib­ber­ish but max­i­mally bad out­put. The au­thors were asleep dur­ing the train­ing process, so the prob­lem was no­ticed only once train­ing had fin­ished. A mech­a­nism such as Toy­ota’s cord could have pre­vented this, by al­low­ing any la­beler to stop a prob­lem­atic train­ing process.

See Also


  1. The pa­per in ques­tion dis­cusses gen­eral ques­tions of nec­es­sary res­o­lu­tion, com­put­ing re­quire­ments, op­tics, nec­es­sary er­ror rates, and al­go­rithms, but does­n’t de­scribe any im­ple­mented sys­tems, much less ex­pe­ri­ences which re­sem­ble the tank sto­ry.↩︎

  2. An­other in­ter­est­ing de­tail from Harley et al 1962 about their tank study: in dis­cussing de­sign­ing their com­puter ‘sim­u­la­tion’ of their qua­si­-NN al­go­rithms, their de­scrip­tion of the pho­tographs on pg133 makes it sound as if the dataset was con­structed from the same pho­tographs by us­ing large-s­cale aer­ial footage and then crop­ping out the small squares with tanks and then cor­re­spond­ing small squares with­out tanks—so they only had to process one set of pho­tographs, and the re­sult­ing tank/non-tank sam­ples are in­her­ently matched on date, weath­er, time of day, light­ing, gen­eral lo­ca­tion, roll of film, cam­era, and pho­tog­ra­ph­er. If true, that would make al­most all the var­i­ous sug­gested tank prob­lem short­cuts im­pos­si­ble, and would be fur­ther ev­i­dence that Kanal’s project was not & could not have been the ori­gin of the tank sto­ry.↩︎

  3. This seems entirely reasonable to me, given that hardly any AI research existed at that point. While it's unclear what results were accomplished immediately thanks to the 1956 workshop, many of the attendees would make major discoveries in AI. Attendee Ray Solomonoff's wife, Grace Solomonoff ("Ray Solomonoff and the Dartmouth Summer Research Project in Artificial Intelligence, 1956", 2016), describes the workshop as having vivid discussions but being compromised by getting only half its funding (so it didn't last the summer) and by attendees showing up sporadically & for short times ("Many participants only showed up for a day or even less."); no agreement was reached on a specific project to try to tackle, although Solomonoff did write a paper there he considered important.↩︎

  4. One com­menter ob­serves that the NN tank story and ilk ap­pears to al­most al­ways be told about neural net­works, and won­ders why when dataset bias ought to be just as much a prob­lem for other statistical/machine-learning meth­ods like de­ci­sion trees, which are ca­pa­ble of learn­ing com­plex non­lin­ear prob­lems. I could note that these anec­dotes also get rou­tinely told about ge­netic al­go­rithms & evo­lu­tion­ary meth­ods, so it’s not purely neu­ral, and it might be that NNs are vic­tims of their own suc­cess: par­tic­u­larly as of 2017, NNs are so pow­er­ful & flex­i­ble in some ar­eas (like com­puter vi­sion) there is lit­tle com­pe­ti­tion, and so any hor­ror sto­ries will prob­a­bly in­volve NNs.↩︎

  5. Here, the number of photographs and exactly how they were divided into training/validation sets is an oddly specific detail. This is reminiscent of religions or novels, where originally sparse and undetailed stories become elaborated and ever more detailed, with striking details added to catch the imagination. For example, the Magi in the Christian Gospels are unnamed, but have been given by later Christians extensive fictional biographies of names ("Names for the Nameless in the New Testament"), symbolism, kingdoms, contemporary successors/descendants, martyrdoms & locations of remains…↩︎

  6. One mem­o­rable ex­am­ple of this for me was when the Ed­ward Snow­den NSA leaks be­gan.

    Sure­ly, given pre­vi­ous in­stances like differ­en­tial crypt­analy­sis or pub­lic-key cryp­tog­ra­phy, the NSA had any num­ber of amaz­ing tech­nolo­gies and moon math be­yond the ken of the rest of us? I read many of the pre­sen­ta­tions with great in­ter­est, par­tic­u­larly about how they searched for in­di­vid­u­als or data—­cut­ting edge neural net­works? Evo­lu­tion­ary al­go­rithms? Even more ex­otic tech­niques? Nope—reg­ex­ps, lin­ear mod­els, and ran­dom forests. Prac­ti­cal but bor­ing. Nor did any ma­jor cryp­to­graphic break­throughs be­come ex­posed via Snow­den.

    Overall, the NSA corpus indicates that they had the abilities you would expect from a large group of patient programmers with no ethics, given a budget of billions of dollars to spend on a mission whose motto was “hack the planet”: a comprehensive set of methods ranging from physical break-ins & bugs, theft of private keys, bribery, large-scale telecommunications tapping, implanted backdoors, and purchase & discovery of unpatched vulnerabilities, to standards-process subversion. Highly effective in the aggregate, but little that people hadn’t expected or long speculated about in the abstract.↩︎

  7. Although there are occasional exceptions where a data augmentation doesn’t preserve important semantics: you wouldn’t want to use horizontal flips with street signs.
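
    As a concrete illustration (my own hypothetical example; the footnote names no library, and torchvision here is simply an assumption): an augmentation pipeline for generic objects can safely include horizontal flips, while a pipeline for orientation-sensitive classes such as street signs or text should leave them out.

    ```python
    from torchvision import transforms

    # Augmentations for classes whose semantics survive mirroring (cats, cars, scenery).
    generic_augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),  # left-right mirroring is label-preserving here
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
    ])

    # The same pipeline minus the flip, for orientation-sensitive classes
    # (street signs, printed text, clock faces), where mirroring can change
    # or destroy the label.
    sign_augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
    ])
    ```

    ↩︎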

  8. It amuses me to note when web­sites or tools are clearly us­ing Im­a­geNet CNNs, be­cause they as­sume Im­a­geNet cat­e­gories or pro­vide an­no­ta­tions in their meta­data, or be­cause they ex­hibit un­can­nily good recog­ni­tion of dogs. Some­times CNNs are much bet­ter than they are given credit for be­ing and they are as­sumed by com­menters to fail on prob­lems they ac­tu­ally suc­ceed on; for ex­am­ple, some meme im­ages have cir­cu­lated claim­ing that CNNs can’t dis­tin­guish fried chick­ens from dogs, chi­huahuas from muffins, or sleep­ing dogs from bagel­s—but as amus­ing as the im­age-sets are, Miles Brundage re­ports that Clar­i­fai’s CNN API has lit­tle trou­ble ac­cu­rately dis­tin­guish­ing man’s worst food from man’s best friend.↩︎

  9. Recht et al 2019’s ImageNet-v2 turns out to illustrate some subtle issues in measuring dataset bias (Engstrom et al 2020): because measurement error in the image labels introduces errors into the final dataset, simply taking a classifier trained on one dataset, noting that its performance on the other falls by X%, and calling that drop ‘bias’ yields a misleadingly inflated estimate, since it attributes the combined error of both datasets to the bias.
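
    A toy calculation (with made-up numbers of my own, not the actual analysis of Engstrom et al 2020) shows how test-set label error alone can masquerade as ‘dataset bias’: a classifier with identical true accuracy on both distributions still shows an apparent drop if the replication test set is noisier.

    ```python
    # Toy illustration with invented numbers: no real distribution shift,
    # yet the naive accuracy gap is non-zero purely from label error.
    true_accuracy = 0.90    # assume identical true accuracy on both test distributions
    label_error_v1 = 0.02   # fraction of mislabeled images in the original test set
    label_error_v2 = 0.05   # fraction of mislabeled images in the replication test set

    # Crude scoring model: a correct prediction is marked wrong whenever the label is
    # wrong (and a wrong prediction essentially never matches the wrong label by chance).
    measured_v1 = true_accuracy * (1 - label_error_v1)  # = 0.882
    measured_v2 = true_accuracy * (1 - label_error_v2)  # = 0.855

    print(f"apparent 'bias' gap: {measured_v1 - measured_v2:.3f}")  # 0.027, ~2.7 points
    ```

    ↩︎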

  10. La­puschkin et al 2019:

    The first learning machine is a model based on Fisher vectors (FV) [31, 32] trained on the PASCAL VOC 2007 image dataset [33] (see Section E). The model and also its competitor, a pretrained Deep Neural Network (DNN) that we fine-tune on PASCAL VOC, show both excellent state-of-the-art test set accuracy on categories such as ‘person’, ‘train’, ‘car’, or ‘horse’ of this benchmark (see Table 3). Inspecting the basis of the decisions with LRP, however, reveals for certain images substantial divergence, as the heatmaps exhibiting the reasons for the respective classification could not be more different. Clearly, the DNN’s heatmap points at the horse and rider as the most relevant features (see Figure 14). In contrast, FV’s heatmap is most focused onto the lower left corner of the image, which contains a source tag. A closer inspection of the data set (of 9963 samples [33]) that typically humans never look through exhaustively, shows that such source tags appear distinctively on horse images; a striking artifact of the dataset that so far had gone unnoticed [34]. Therefore, the FV model has ‘overfitted’ the PASCAL VOC dataset by relying mainly on the easily identifiable source tag, which incidentally correlates with the true features, a clear case of ‘Clever Hans’ behavior. This is confirmed by observing that artificially cutting the source tag from horse images significantly weakens the FV model’s decision while the decision of the DNN stays virtually unchanged (see Figure 14). If we take instead a correctly classified image of a Ferrari and then add to it a source tag, we observe that the FV’s prediction swiftly changes from ‘car’ to ‘horse’ (cf. Figure 2a) a clearly invalid decision (see Section E and Figures 15–20 for further examples and analyses)… For the classification of ships the classifier is mostly focused on the presence of water in the bottom half of an image. Removing the copyright tag or the background results in a drop of predictive capabilities. A deep neural network, pre-trained in the ImageNet dataset [93], instead shows none of these shortcomings.

    The air­plane ex­am­ple is a lit­tle more de­bat­able—the pres­ence of a lot of blue sky in air­plane im­ages seems like a valid cue to me and not nec­es­sar­ily cheat­ing:

    …The SpRAy analysis could furthermore reveal another ‘Clever Hans’ type behavior in our fine-tuned DNN model, which had gone unnoticed in previous manual analysis of the relevance maps. The large eigengaps in the eigenvalue spectrum of the DNN heatmaps for class “aeroplane” indicate that the model uses very distinct strategies for classifying aeroplane images (see Figure 26). A t-SNE visualization (Figure 28) further highlights this cluster structure. One unexpected strategy we could discover with the help of SpRAy is to identify aeroplane images by looking at the artificial padding pattern at the image borders, which for aeroplane images predominantly consists of uniform and structureless blue background. Note that padding is typically introduced for technical reasons (the DNN model only accepts square shaped inputs), but unexpectedly (and unwantedly) the padding pattern became part of the model’s strategy to classify aeroplane images. Subsequently we observe that changing the manner in which padding is performed has a strong effect on the output of the DNN classifier (see Figures 29–32).
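
    The Ferrari intervention described above (pasting a source tag onto a correctly classified image and watching the prediction flip to ‘horse’) generalizes into a simple sensitivity test for such shortcut features. A minimal sketch, assuming a generic PyTorch image classifier `model` and a cropped watermark image `tag.png` (both hypothetical stand-ins, not artifacts from Lapuschkin et al 2019):

    ```python
    import torch
    from PIL import Image
    from torchvision import transforms

    preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

    def top1(model, img):
        """Top-1 class index for a PIL image under a generic classifier."""
        with torch.no_grad():
            logits = model(preprocess(img).unsqueeze(0))
        return logits.argmax(dim=1).item()

    def tag_sensitivity(model, img_path, tag_path="tag.png", corner=(0, 0)):
        """Compare predictions with and without a pasted 'source tag' watermark.

        If pasting the tag flips predictions on many test images, the model is
        keying on the tag (Clever Hans behavior) rather than on the object itself.
        """
        img = Image.open(img_path).convert("RGB")
        before = top1(model, img)
        tagged = img.copy()
        tagged.paste(Image.open(tag_path).convert("RGB"), corner)
        after = top1(model, tagged)
        return before, after, before != after
    ```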

    ↩︎
  11. Winkler et al 2019: “When reviewing the open-access International Skin Imaging Collaboration database, which is a source of training images for research groups, we found that a similar percentage of melanomas (52 of 2169 [2.4%]) and nevi (214 of 9303 [2.3%]) carry skin markings. Nevertheless, it seems conceivable that either an imbalance in the distribution of skin markings in thousands of other training images that were used in the CNN tested herein or the assignment of higher weights to blue markings only in lesions with specific (though unknown) accompanying features may induce a CNN to associate skin markings with the diagnosis of melanoma. The latter hypothesis may also explain why melanoma probability scores remained almost unchanged in many marked nevi while being increased in others.”↩︎

  12. Getting into more general economic, behavioral, or human situations would be going too far afield, but the relevant analogues are familiar concepts such as the “law of unintended consequences”; such alignment problems are only partially dealt with by having ground-truth evolutionary fitness, and avoiding reward hacking remains an open problem (even in theory). Video-game speedrunning communities frequently provide examples of reward-hacking, particularly when games are finished faster by exploiting bugs to skip ahead; particularly esoteric techniques require outright hacking the “weird machines” present in many games/devices—for example, the Super Mario 64 ‘parallel universes’ hack, which avoids using any jumps by exploiting an integer-overflow bug & wraparound to accelerate Mario to near-infinite speed, passing through the entire map multiple times, in order to stop at the right place.↩︎