AI folklore tells a story about a neural network trained to detect tanks which instead learned to detect the time of day; on investigation, this incident probably never happened.
A cautionary tale in artificial intelligence tells of researchers training a neural network (NN) to detect tanks in photographs, apparently succeeding, only to realize that the tank/non-tank photographs had been collected under systematically different conditions, so the NN had learned something useless like the time of day. The story is often told to warn about the limits of algorithms and the importance of data collection in avoiding “dataset bias”/“data leakage”, where the collected data can be solved by algorithms that do not generalize to the true data distribution; but the tank story itself is never sourced.
I collate many extant versions dating back a quarter of a century to 1992, along with two NN-related anecdotes from the 1960s; their contradictions & shifting details indicate a classic “urban legend”, with a probable origin in a speculative question asked by Edward Fredkin at a 1960s AI conference about some early NN research, which was then classified & never followed up on.
I suggest that dataset bias is real but exaggerated by the tank story, which gives a misleading impression of the risks from deep learning; it would be better not to repeat it, but to use real examples of dataset bias and to focus on larger-scale risks like AI systems optimizing for the wrong utility functions.
Deep learning’s rise over the past decade, and its dominance of image processing tasks, has led to an explosion of applications attempting to infer high-level semantics locked up in raw sensory data like photographs. Convolutional neural networks are now applied not just to ordinary tasks like sorting cucumbers by quality, but to everything from predicting the best Go move, to guessing where in the world a photograph was taken, to judging whether a photograph is “interesting” or “pretty”, not to mention supercharging traditional tasks like radiology interpretation or facial recognition, which have reached levels of accuracy that could only be dreamed of decades ago. With this approach of “neural net all the things!”, the question of to what extent trained neural networks are useful in the real world, and will do what we want them to do & not what we told them to do, has taken on additional importance, especially given the possibility of neural networks learning to accomplish extremely inconvenient things like inferring individual human differences such as criminality or homosexuality (to give two highly controversial recent examples where the meaningfulness of the claimed success has been severely questioned).
In this context, a cautionary story is often told of incautious researchers decades ago who trained a NN for the military to find images of tanks, only to discover they had trained a neural network to detect something else entirely (what, precisely, that something else was varies in the telling). It would be a good & instructive story… if it were true. Is it?
As it would be such a useful cautionary example for AI safety/alignment research, and was cited to that effect by Eliezer Yudkowsky (but only to a secondary source), I decided to make myself useful by finding a proper primary source for it & seeing if there were more juicy details worth mentioning. My initial attempt failed, and I & several others then failed for more than half a decade to find any primary source (just secondary sources citing each other). I began to wonder if it was even real.
Trying again more seriously, I conclude that, unfortunately, it is definitely not real as usually told: it is just an urban legend or “leprechaun”; indeed, the researchers at the probable seed of the story could not have run into the problem the tank story warns about, because they correctly constructed their training dataset to avoid such issues. More broadly, in contemporary deep learning the problem the story cautions against is real, but not that important, and it is often conflated with more dangerous safety/alignment problems.
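The failure mode the story describes is, at least, easy to manufacture deliberately. As a minimal synthetic sketch (numpy only; the “photos”, the brightness confound, and all numbers are invented for illustration, not taken from any version of the story), the following trains a simple linear classifier on data where every tank image happens to be darker, validates it on a held-out split of the same biased collection, and then tests it on a fresh set where the confound is reversed:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_photos(n, tanks_dark=True):
    """Stand-in for photographs: 64-'pixel' noise vectors around 0.5.
    Tank photos get a brightness offset, mimicking the cloudy-day
    confound of the story. (Entirely synthetic, hypothetical data.)"""
    labels = rng.integers(0, 2, n)            # 1 = tank, 0 = no tank
    X = rng.normal(0.5, 0.1, (n, 64))
    X[labels == 1] += -0.2 if tanks_dark else +0.2
    return X, labels

# The "collected" dataset: every tank photo is darker (cloudy day).
X, y = make_photos(400, tanks_dark=True)
X_train, y_train = X[:200], y[:200]
X_holdout, y_holdout = X[200:], y[200:]

# Minimal logistic regression by gradient descent (no ML library).
w, b = np.zeros(64), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X_train @ w + b)))
    w -= 0.2 * X_train.T @ (p - y_train) / len(y_train)
    b -= 0.2 * np.mean(p - y_train)

def accuracy(X, y):
    return np.mean(((X @ w + b) > 0) == y)

# A held-out split from the SAME biased collection validates fine...
print(accuracy(X_holdout, y_holdout))   # ~1.0: "success confirmed!"

# ...but a fresh set where tanks were shot in bright light collapses.
X_new, y_new = make_photos(200, tanks_dark=False)
print(accuracy(X_new, y_new))           # ~0.5: no better than chance
```

Note that the held-out split, being drawn from the same biased collection, confirms the confound rather than catching it; only data gathered under different conditions exposes the failure, which is why several versions of the story correctly have the researchers pass their own validation step and fail only on the Pentagon’s new photos.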
Drawing on the usual suspects (Google/Google Books/Google Scholar/Libgen/LessWrong/Hacker News/Twitter) for investigating leprechauns, I have compiled a large number of variants of the story, presented below in reverse chronological order by decade, letting us trace the evolution of the story back towards its roots:
Heather Murphy, “Why Stanford Researchers Tried to Create a ‘Gaydar’ Machine” (NYT), 2017-10-09:
So What Did the Machines See? Dr. Kosinski and Mr. Wang [Wang & Kosinski 2018; see also Leuner 2019/Kosinski 201] say that the algorithm is responding to fixed facial features, like nose shape, along with “grooming choices,” such as eye makeup. But it’s also possible that the algorithm is seeing something totally unknown. “The more data it has, the better it is at picking up patterns,” said Sarah Jamie Lewis, an independent privacy researcher who Tweeted a critique of the study. “But the patterns aren’t necessarily the ones you think that they are.” Tomaso Poggio, the director of M.I.T.’s Center for Brains, Minds and Machines, offered a classic parable used to illustrate this disconnect. The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness. Dr. Cox has spotted a version of this in his own studies of dating profiles. Gay people, he has found, tend to post higher-quality photos. Dr. Kosinski said that they went to great lengths to guarantee that such confounders did not influence their results. Still, he agreed that it’s easier to teach a machine to see than to understand what it has seen.
Alexander Harrowell, “It was called a perceptron for a reason, damn it”, 2017-09-30:
You might think that this is rather like one of the classic optical illusions, but it’s worse than that. If you notice that you look at something this way, and then that way, and it looks different, you’ll notice something is odd. This is not something our deep learner will do. Nor is it able to identify any bias that might exist in the corpus of data it was trained on…or maybe it is. If there is any property of the training data set that is strongly predictive of the training criterion, it will zero in on that property with the ferocious clarity of Darwinism. In the 1980s, an early backpropagating neural network was set to find Soviet tanks in a pile of reconnaissance photographs. It worked, until someone noticed that the Red Army usually trained when the weather was good, and in any case the satellite could only see them when the sky was clear. The medical school at St Thomas’ Hospital in London found theirs had learned that their successful students were usually white.
An interesting story with a distinct “family resemblance”, about a NN classifying wolves vs dogs, is told by Evgeniy Nikolaychuk in “Dogs, Wolves, Data Science, and Why Machines Must Learn Like Humans Do”, 2017-06-09:
Neural networks are designed to learn like the human brain, but we have to be careful. This is not because I’m scared of machines taking over the planet. Rather, we must make sure machines learn correctly. One example that always pops into my head is how one neural network learned to differentiate between dogs and wolves. It didn’t learn the differences between dogs and wolves, but instead learned that wolves were on snow in their picture and dogs were on grass. It learned to differentiate the two animals by looking at snow and grass. Obviously, the network learned incorrectly. What if the dog was on snow and the wolf was on grass? Then, it would be wrong.
However, in his source, “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier [LIME]”, Ribeiro et al 2016, they specify of their dog/wolf snow-detector NN that they “trained this bad classifier intentionally, to evaluate whether subjects are able to detect it [the bad performance]” using LIME for insight into how the classifier was making its classification, concluding that “After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially.” So Nikolaychuk appears to have misremembered. (Perhaps in another 25 years students will be told in their classes of how a NN was once trained by ecologists to count wolves…)
Redditor mantrap2 gives this version of the story on 2015-06-20:
I remember this kind of thing from the 1980s: the US Army was testing image recognition seekers for missiles and was getting excellent results on Northern German tests with NATO tanks. Then they tested the same systems in other environments and their results were suddenly shockingly bad. Turns out the image recognition was keying off the trees with tank-like minor features rather than the tank itself. Putting other vehicles in the same forests got similar high hits but tanks by themselves (in desert test ranges) didn’t register. Luckily a sceptic somewhere decided to “do one more test to make sure”.
Dennis Polis, God, Science and Mind, 2012 (pg131, limited Google Books snippet, unclear what ref 44 is):
These facts refute a Neoplatonic argument for the essential immateriality of the soul, viz. that since the mind deals with universal representations, it operates in a specifically immaterial way…So, awareness is not explained by connectionism. The results of neural net training are not always as expected. One team intended to train neural nets to recognize battle tanks in aerial photos. The system was trained using photos with and without tanks. After the training, a different set of photos was used for evaluation, and the system failed miserably—being totally incapable of distinguishing those with tanks. The system actually discriminated cloudy from sunny days. It happened that all the training photos with tanks were taken on cloudy days, while those without were on clear days.44 What does this show? That neural net training is mindless. The system had no idea of the intent of the enterprise, and did what it was programmed to do without any concept of its purpose. As with Dawkins’ evolution simulation (p. 66), the goals of computer neural nets are imposed by human programmers.
Blay Whitby, Artificial Intelligence: A Beginner’s Guide 2012 (pg53):
It is not yet clear how an artificial neural net could be trained to deal with “the world” or any really open-ended sets of problems. Now some readers may feel that this unpredictability is not a problem. After all, we are talking about training not programming and we expect a neural net to behave rather more like a brain than a computer. Given the usefulness of nets in unsupervised learning, it might seem therefore that we do not really need to worry about the problem being of manageable size and the training process being predictable. This is not the case; we really do need a manageable and well-defined problem for the training process to work. A famous AI urban myth may help to make this clearer.
The story goes something like this. A research team was training a neural net to recognize pictures containing tanks. (I’ll leave you to guess why it was tanks and not tea-cups.) To do this they showed it two training sets of photographs. One set of pictures contained at least one tank somewhere in the scene, the other set contained no tanks. The net had to be trained to discriminate between the two sets of photographs. Eventually, after all that back-propagation stuff, it correctly gave the output “tank” when there was a tank in the picture and “no tank” when there wasn’t. Even if, say, only a little bit of the gun was peeping out from behind a sand dune it said “tank”. Then they presented a picture where no part of the tank was visible—it was actually completely hidden behind a sand dune—and the program said “tank”.
Now when this sort of thing happens research labs tend to split along age-based lines. The young hairs say “Great! We’re in line for the Nobel Prize!” and the old heads say “Something’s gone wrong”. Unfortunately, the old heads are usually right—as they were in this case. What had happened was that the photographs containing tanks had been taken in the morning while the army played tanks on the range. After lunch the photographer had gone back and taken pictures from the same angles of the empty range. So the net had identified the most reliable single feature which enabled it to classify the two sets of photos, namely the angle of the shadows. “AM = tank, PM = no tank”. This was an extremely effective way of classifying the two sets of photographs in the training set. What it most certainly was not was a program that recognizes tanks. The great advantage of neural nets is that they find their own classification criteria. The great problem is that it may not be the one you want!
Thom Blake noted on 2011-09-20 that the story is:
Probably apocryphal. I haven’t been able to track this down, despite having heard the story both in computer ethics class and at academic conferences.
“Embarrassing mistakes in perceptron research”, Marvin Minsky, 2011-01-31:
Like I had a friend in Italy who had a perceptron that looked at a visual… it had visual inputs. So, he… he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron—a big machine—that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of… of pieces of music just by looking at the lower left… lower right hand corner of the page. So, I told this to the… to our scientist friend and he went through the data and he said: ‘You guessed right. That’s… that’s how it happened to make that distinction.’ We thought it was very funny.
A similar thing happened here in the United States at one of our research institutions. Where a perceptron had been trained to distinguish between—this was for military purposes—It could… it was looking at a scene of a forest in which there were camouflaged tanks in one picture and no camouflaged tanks in the other. And the perceptron—after a little training—got… made a 100% correct distinction between these two different sets of photographs. Then they were embarrassed a few hours later to discover that the two rolls of film had been developed differently. And so these pictures were just a little darker than all of these pictures and the perceptron was just measuring the total amount of light in the scene. But it was very clever of the perceptron to find some way of making the distinction.
Eliezer Yudkowsky, 2008-08-24 (similarly quoted in “Artificial Intelligence as a Negative and Positive Factor in Global Risk”, “Artificial Intelligence in global risk” in Global Catastrophic Risks 2011, & “Friendly Artificial Intelligence” in Singularity Hypotheses 2013):
Once upon a time—I’ve seen this story in several versions and several places, sometimes cited as fact, but I’ve never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest. Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that wouldn’t generalize to new problems. Not, “camouflaged tanks versus forest”, but just, “photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive…” But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos. It turned out that in the researchers’ data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest. This parable—which might or might not be fact—illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence…
Gordon Rugg, Using Statistics: A Gentle Introduction, 2007-10-01 (pg114–115):
Neural nets and genetic algorithms (including the story of the Russian tanks): Neural nets (or artificial neural networks, to give them their full name) are pieces of software inspired by the way the human brain works. In brief, you can train a neural net to do tasks like classifying images by giving it lots of examples, and telling it which examples fit into which categories; the neural net works out for itself what the defining characteristics are for each category. Alternatively, you can give it a large set of data and leave it to work out connections by itself, without giving it any feedback. There’s a story, which is probably an urban legend, which illustrates how the approach works and what can go wrong with it. According to the story, some NATO researchers trained a neural net to distinguish between photos of NATO and Warsaw Pact tanks. After a while, the neural net could get it right every time, even with photos it had never seen before. The researchers had gleeful visions of installing neural nets with miniature cameras in missiles, which could then be fired at a battlefield and left to choose their own targets. To demonstrate the method, and secure funding for the next stage, they organised a viewing by the military. On the day, they set up the system and fed it a new batch of photos. The neural net responded with apparently random decisions, sometimes identifying NATO tanks correctly, sometimes identifying them mistakenly as Warsaw Pact tanks. This did not inspire the powers that be, and the whole scheme was abandoned on the spot. It was only afterwards that the researchers realised that all their training photos of NATO tanks had been taken on sunny days in Arizona, whereas the Warsaw Pact tanks had been photographed on grey, miserable winter days on the steppes, so the neural net had flawlessly learned the unintended lesson that if you saw a tank on a gloomy day, then you made its day even gloomier by marking it for destruction.
N. Katherine Hayles, “Computing the Human” (Inventive Life: Approaches to the New Vitalism, Fraser et al 2006; pg424):
While humans have for millennia used what Cariani calls ‘active sensing’—‘poking, pushing, bending’—to extend their sensory range and for hundreds of years have used prostheses to create new sensory experiences (for example, microscopes and telescopes), only recently has it been possible to construct evolving sensors and what Cariani (1998: 718) calls ‘internalized sensing’, that is, “bringing the world into the device” by creating internal, analog representations of the world out of which internal sensors extract newly-relevant properties’.
…Another conclusion emerges from Cariani’s call (1998) for research in sensors that can adapt and evolve independently of the epistemic categories of the humans who create them. The well-known and perhaps apocryphal story of the neural net trained to recognize army tanks will illustrate the point. For obvious reasons, the army wanted to develop an intelligent machine that could discriminate between real and pretend tanks. A neural net was constructed and trained using two sets of data, one consisting of photographs showing plywood cutouts of tanks and the other actual tanks. After some training, the net was able to discriminate flawlessly between the situations. As is customary, the net was then tested against a third data set showing pretend and real tanks in the same landscape; it failed miserably. Further investigation revealed that the original two data sets had been filmed on different days. One of the days was overcast with lots of clouds, and the other day was clear. The net, it turned out, was discriminating between the presence and absence of clouds. The anecdote shows the ambiguous potential of epistemically autonomous devices for categorizing the world in entirely different ways from the humans with whom they interact. While this autonomy might be used to enrich the human perception of the world by revealing novel kinds of constructions, it also can create a breed of autonomous devices that parse the world in radically different ways from their human trainers.
A counter-narrative, also perhaps apocryphal, emerged from the 1991 Gulf War. US soldiers firing at tanks had been trained on simulators that imaged flames shooting out from the tank to indicate a kill. When army investigators examined Iraqi tanks that were defeated in battles, they found that for some tanks the soldiers had fired four to five times the amount of munitions necessary to disable the tanks. They hypothesized that the overuse of firepower happened because no flames shot out, so the soldiers continued firing. If the hypothesis is correct, human perceptions were altered in accord with the idiosyncrasies of intelligent machines, providing an example of what can happen when human-machine perceptions are caught in a feedback loop with one another.
Linda Null & Julie Lobur, The Essentials of Computer Organization and Architecture (third edition), 2003/2014 (pg439–440 in 1st edition, pg658 in 3rd edition):
Correct training requires thousands of steps. The training time itself depends on the size of the network. As the number of perceptrons increases, the number of possible “states” also increases.
Let’s consider a more sophisticated example, that of determining whether a tank is hiding in a photograph. A neural net can be configured so that each output value correlates to exactly one pixel. If the pixel is part of the image of a tank, the net should output a one; otherwise, the net should output a zero. The input information would most likely consist of the color of the pixel. The network would be trained by feeding it many pictures with and without tanks. The training would continue until the network correctly identified whether the photos included tanks. The U.S. military conducted a research project exactly like the one we just described. One hundred photographs were taken of tanks hiding behind trees and in bushes, and another 100 photographs were taken of ordinary landscape with no tanks. Fifty photos from each group were kept “secret,” and the rest were used to train the neural network. The network was initialized with random weights before being fed one picture at a time. When the network was incorrect, it adjusted its input weights until the correct output was reached. Following the training period, the 50 “secret” pictures from each group of photos were fed into the network. The neural network correctly identified the presence or absence of a tank in each photo. The real question at this point has to do with the training—had the neural net actually learned to recognize tanks? The Pentagon’s natural suspicion led to more testing. Additional photos were taken and fed into the network, and to the researchers’ dismay, the results were quite random. The neural net could not correctly identify tanks within photos. After some investigation, the researchers determined that in the original set of 200 photos, all photos with tanks had been taken on a cloudy day, whereas the photos with no tanks had been taken on a sunny day.
The neural net had properly separated the two groups of pictures, but had done so using the color of the sky to do this rather than the existence of a hidden tank. The government was now the proud owner of a very expensive neural net that could accurately distinguish between sunny and cloudy days!
This is a great example of what many consider the biggest issue with neural networks. If there are more than 10 to 20 neurons, it is impossible to understand how the network is arriving at its results. One cannot tell if the net is making decisions based on correct information, or, as in the above example, something totally irrelevant. Neural networks have a remarkable ability to derive meaning and extract patterns from data that are too complex to be analyzed by human beings. However, some people trust neural networks to be experts in their area of training. Neural nets are used in such areas as sales forecasting, risk management, customer research, undersea mine detection, facial recognition, and data validation. Although neural networks are promising, and the progress made in the past several years has led to significant funding for neural net research, many people are hesitant to put confidence in something that no human being can completely understand.
David Gerhard, “Pitch Extraction and Fundamental Frequency: History and Current Techniques”, Technical Report TR-CS 2003–06, November 2003:
The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon’s foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.
As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.
- Tanks in Desert Storm
Sometimes you have to be careful what you train on . . .
The problem with neural nets is that you never know what features they’re actually training on. For example:
The US military tried to use neural nets in Desert Storm for tank recognition, so unmanned tanks could identify enemy tanks and destroy them. They trained the neural net on multiple images of “friendly” and enemy tanks, and eventually had a decent program that seemed to correctly identify friendly and enemy tanks.
Then, when they actually used the program in a real-world test phase with actual tanks, they found that the tanks would either shoot at nothing or shoot at everything. They certainly seemed to be incapable of distinguishing friendly or enemy tanks.
Why was this? It turns out that the images they were training on always had glamour-shot type photos of friendly tanks, with an immaculate blue sky, etc. The enemy tank photos, on the other hand, were all spy photos, not very clear, sometimes fuzzy, etc. And it was these characteristics that the neural net was training on, not the tanks at all. On a bright sunny day, the tanks would do nothing. On an overcast, hazy day, they’d start firing like crazy . . .
Andrew Ilachinski, Cellular Automata: A Discrete Universe, 2001 (pg547):
There is a telling story about how the Army recently went about teaching a backpropagating net to identify tanks set against a variety of environmental backdrops. The programmers correctly fed their multi-layer net photograph after photograph of tanks in grasslands, tanks in swamps, no tanks on concrete, and so on. After many trials and many thousands of iterations, their net finally learned all of the images in their database. The problem was that when the presumably “trained” net was tested with other images that were not part of the original training set, it failed to do any better than what would be expected by chance. What had happened was that the input/training fact set was statistically corrupt. The database consisted mostly of images that showed a tank only if there were heavy clouds, the tank itself was immersed in shadow or there was no sun at all. The Army’s neural net had indeed identified a latent pattern, but it unfortunately had nothing to do with tanks: it had effectively learned to identify the time of day! The obvious lesson to be taken away from this amusing example is that how well a net “learns” the desired associations depends almost entirely on how well the database of facts is defined. Just as Monte Carlo simulations in statistical mechanics may fall short of intended results if they are forced to rely upon poorly coded random number generators, so do backpropagating nets typically fail to achieve expected results if the facts they are trained on are statistically corrupt.
Intelligent Data Analysis In Science, Hugh M. Cartwright 2000, pg126, writes (according to Google Books’s snippet view; Cartwright’s version appears to be a direct quote or close paraphrase of an earlier 1994 chemistry paper, Goodacre et al 1994):
…television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to detect the tanks. After further investigation, it was found…
Daniel Robert Franklin & Philippe Crochat, libneural tutorial, 2000-03-23:
A neural network is useless if it only sees one example of a matching input/output pair. It cannot infer the characteristics of the input data for which you are looking for from only one example; rather, many examples are required. This is analogous to a child learning the difference between (say) different types of animals—the child will need to see several examples of each to be able to classify an arbitrary animal… It is the same with neural networks. The best training procedure is to compile a wide range of examples (for more complex problems, more examples are required) which exhibit all the different characteristics you are interested in. It is important to select examples which do not have major dominant features which are of no interest to you, but are common to your input data anyway. One famous example is of the US Army “Artificial Intelligence” tank classifier. It was shown examples of Soviet tanks from many different distances and angles on a bright sunny day, and examples of US tanks on a cloudy day. Needless to say it was great at classifying weather, but not so good at picking out enemy tanks.
“Neural Network Follies”, Neil Fraser, September 1998:
In the 1980s, the Pentagon wanted to harness computer technology to make their tanks harder to attack…The research team went out and took 100 photographs of tanks hiding behind trees, and then took 100 photographs of trees—with no tanks. They took half the photos from each group and put them in a vault for safe-keeping, then scanned the other half into their mainframe computer. The huge neural network was fed each photo one at a time and asked if there was a tank hiding behind the trees. Of course at the beginning its answers were completely random since the network didn’t know what was going on or what it was supposed to do. But each time it was fed a photo and it generated an answer, the scientists told it if it was right or wrong. If it was wrong it would randomly change the weightings in its network until it gave the correct answer. Over time it got better and better until eventually it was getting each photo correct. It could correctly determine if there was a tank hiding behind the trees in any one of the photos…So the scientists took out the photos they had been keeping in the vault and fed them through the computer. The computer had never seen these photos before—this would be the big test. To their immense relief the neural net correctly identified each photo as either having a tank or not having one. Independent testing: The Pentagon was very pleased with this, but a little bit suspicious. They commissioned another set of photos (half with tanks and half without) and scanned them into the computer and through the neural network. The results were completely random. For a long time nobody could figure out why. After all nobody understood how the neural had trained itself. Eventually someone noticed that in the original set of 200 photos, all the images with tanks had been taken on a cloudy day while all the images without tanks had been taken on a sunny day. 
The neural network had been asked to separate the two groups of photos and it had chosen the most obvious way to do it—not by looking for a camouflaged tank hiding behind a tree, but merely by looking at the color of the sky…This story might be apocryphal, but it doesn’t really matter. It is a perfect illustration of the biggest problem behind neural networks. Any automatically trained net with more than a few dozen neurons is virtually impossible to analyze and understand.
Tom White attributes (in October 2017) to Marvin Minsky some version of the tank story being told in MIT classes 20 years before, ~1997 (but doesn’t specify the detailed story or version other than apparently the results were “classified”).
Vasant Dhar & Roger Stein, Intelligent Decision Support Methods, 1997 (pg98, limited Google Books snippet):
…However, when a new set of photographs were used, the results were horrible. At first the team was puzzled. But after careful inspection of the first two sets of photographs, they discovered a very simple explanation. The photos with tanks in them were all taken on sunny days, and those without the tanks were taken on overcast days. The network had not learned to identify tank like images; instead, it had learned to identify photographs of sunny days and overcast days.
Royston Goodacre, Mark J. Neal, & Douglas B. Kell, “Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”, 1994-04-29:
…As in all other data analysis techniques, these supervised learning methods are not immune from sensitivity to badly chosen initial data (113). [113: Zupan, J. and J. Gasteiger: Neural Networks for Chemists: An Introduction. VCH Verlagsgesellschaft, Weinheim (1993)] Therefore the exemplars for the training set must be carefully chosen; the golden rule is “garbage in—garbage out”. An excellent example of an unrepresentative training set was discussed some time ago on the BBC television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate most successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to distinguish the tanks from the trees. After further investigation, it was found that the first set of pictures containing tanks had been taken on a sunny day whilst those containing no tanks were obtained when it was overcast. The neural network had therefore thus learned simply to recognise the weather! We can conclude from this that the training and tests sets should be carefully selected to contain representative exemplars encompassing the appropriate variance over all relevant properties for the problem at hand.
Fernando Pereira, “neural redlining”, RISKS 16(41), 1994-09-12:
Fred’s comments will hold not only of neural nets but of any decision model trained from data (eg. Bayesian models, decision trees). It’s just an instance of the old “GIGO” phenomenon in statistical modeling…Overall, the whole issue of evaluation, let alone certification and legal standing, of complex statistical models is still very much open. (This reminds me of a possibly apocryphal story of problems with biased data in neural net training. Some US defense contractor had supposedly trained a neural net to find tanks in scenes. The reported performance was excellent, with even camouflaged tanks mostly hidden in vegetation being spotted. However, when the net was tested on yet a new set of images supplied by the client, the net did not do better than chance. After an embarrassing investigation, it turned out that all the tank images in the original training and test sets had very different average intensity than the non-tank images, and thus the net had just learned to discriminate between two image intensity levels. Does anyone know if this actually happened, or is it just in the neural net “urban folklore”?)
Erich Harth, The Creative Loop: How the Brain Makes a Mind, 1993/1995 (pg158, limited Google Books snippet):
…55. The net was trained to detect the presence of tanks in a landscape. The training consisted in showing the device many photographs of scene, some with tanks, some without. In some cases—as in the picture on page 143—the tank’s presence was not very obvious. The inputs to the neural net were digitized photographs;
All the “continue this sequence” questions found on intelligence tests, for example, really have more than one possible answer but most human beings share a sense of what is simple and reasonable and therefore acceptable. But when the net produces an unexpected association can one say it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of “type” and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider the legend of one of connectionism’s first applications. In the early days of the perceptron the army decided to train an artificial neural network to recognize tanks partly hidden behind trees in the woods. They took a number of pictures of a woods without tanks, and then pictures of the same woods with tanks clearly sticking out from behind trees. They then trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures from each set that had not been used in training the net. Just to make sure that the net had indeed learned to recognize partially hidden tanks, however, the researchers took some more pictures in the same woods and showed them to the trained net. They were shocked and depressed to find that with the new pictures the net totally failed to discriminate between pictures of trees with partially concealed tanks behind them and just plain trees. The mystery was finally solved when someone noticed that the training pictures of the woods without tanks were taken on a cloudy day, whereas those with tanks were taken on a sunny day. The net had learned to recognize and generalize the difference between a woods with and without shadows! Obviously, not what stood out for the researchers as the important difference. 
This example illustrates the general point that a net must share size, architecture, initial connections, configuration and socialization with the human brain if it is to share our sense of appropriate generalization
Hubert Dreyfus appears to have told this story earlier in 1990 or 1991, as a similar story appears in episode 4 (German) (starting 33m49s) of the BBC documentary series The Machine That Changed the World, broadcast 1991-11-08. Hubert L. Dreyfus, What Computers Still Can’t Do: A Critique of Artificial Reason, 1992, repeats the story in very similar but not quite identical wording (Jeff Kaufman notes that Dreyfus drops the qualifying “legend of” description):
…But when the net produces an unexpected association, can one say that it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of “type” and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider one of connectionism’s first applications. In the early days of this work the army tried to train an artificial neural network to recognize tanks in a forest. They took a number of pictures of a forest without tanks and then, on a later day, with tanks clearly sticking out from behind trees, and they trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures that had not been part of the training set. Just to make sure that the net was indeed recognizing partially hidden tanks, however, the researchers took more pictures in the same forest and showed them to the trained net. They were depressed to find that the net failed to discriminate between the new pictures of trees with tanks behind them and the new pictures of just plain trees. After some agonizing, the mystery was finally solved when someone noticed that the original pictures of the forest without tanks were taken on a cloudy day and those with tanks were taken on a sunny day. The net had apparently learned to recognize and generalize the difference between a forest with and without shadows! This example illustrates the general point that a network must share our commonsense understanding of the world if it is to share our sense of appropriate generalization.
Dreyfus’s What Computers Still Can’t Do is listed as a revision of his 1972 book, What Computers Can’t Do: A Critique of Artificial Reason, but the tank story is not in the 1972 book, only the 1992 one. (Dreyfus’s version is also quoted in the 2017 NYT article and Hillis 1996’s Geography, Identity, and Embodiment in Virtual Reality, pg346.)
Laveen N. Kanal, Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections’s Foreword, discusses some early NN/tank research (predating not just LeCun’s convolutions but backpropagation), 1991:
…[Frank] Rosenblatt had not limited himself to using just a single Threshold Logic Unit but used networks of such units. The problem was how to train multilayer perceptron networks. A paper on the topic written by Block, Knight and Rosenblatt was murky indeed, and did not demonstrate a convergent procedure to train such networks. In 1962–63 at Philco-Ford, seeking a systematic approach to designing layered classification nets, we decided to use a hierarchy of threshold logic units with a first layer of “feature logics” which were threshold logic units on overlapping receptive fields of the image, feeding two additional levels of weighted threshold logic decision units. The weights in each level of the hierarchy were estimated using statistical methods rather than iterative training procedures [L.N. Kanal & N.C. Randall, “Recognition System Design by Statistical Analysis”, Proc. 19th Conf. ACM, 1964]. We referred to the networks as two layer networks since we did not count the input as a layer. On a project to recognize tanks in aerial photography, the method worked well enough in practice that the U.S. Army agency sponsoring the project decided to classify the final reports, although previously the project had been unclassified. We were unable to publish the classified results! Then, enamored by the claimed promise of coherent optical filtering as a parallel implementation for automatic target recognition, the funding we had been promised was diverted away from our electro-optical implementation to a coherent optical filtering group. Some years later we presented the arguments favoring our approach, compared to optical implementations and trainable systems, in an article titled “Systems Considerations for Automatic Imagery Screening” by T.J. Harley, L.N. Kanal and N.C. Randall, which is included in the IEEE Press reprint volume titled Machine Recognition of Patterns edited by A. Agrawala 19771. 
In the years which followed multilevel statistically designed classifiers and AI search procedures applied to pattern recognition held my interest, although comments in my 1974 survey, “Patterns In Pattern Recognition: 1968–1974” [IEEE Trans. on IT, 1974], mention papers by Amari and others and show an awareness that neural networks and biologically motivated automata were making a comeback. In the last few years trainable multilayer neural networks have returned to dominate research in pattern recognition and this time there is potential for gaining much greater insight into their systematic design and performance analysis…
While Kanal & Randall 1964 matches in some ways, including the image counts, there is no mention of failure either in the paper or Kanal’s 1991 reminiscences (rather, Kanal implies it was highly promising); there is no mention of a field deployment or additional testing which could have revealed overfitting; and given their use of binarization, it’s not clear to me that their 2-layer algorithm even could overfit to global brightness. The photos also appear to have been taken at low enough altitude for there to be no clouds, and under similar (possibly controlled) lighting conditions. The description in Kanal & Randall 1964 is somewhat opaque to me, particularly of the ‘Laplacian’ they use to binarize or convert to edges, but there’s more background in their “Semi-Automatic Imagery Screening Research Study and Experimental Investigation, Volume 1”, Harley, Bryan, Kanal, Taylor & Grayum 1962 (mirror), which indicates that in their preliminary studies they were already interested in prenormalization/preprocessing of images to correct for altitude and brightness, and in the Laplacian, along with silhouetting and “lineness editing”, noting that “The Laplacian operation eliminates absolute brightness scale as well as low-spatial frequencies which are of little consequence in screening operations.”2
An anonymous reader says he heard the story in 1990:
I was told about the tank recognition failure by a lecturer on my 1990 Intelligent Knowledge Based Systems MSc, almost certainly Libor Spacek, in terms of being aware of context in data sets; that being from (the former) Czechoslovakia he expected to see tanks on a motorway whereas most British people didn’t. I also remember reading about a project with DARPA funding aimed at differentiating Russian, European and US tanks where what the image recognition learned was not to spot the differences between tanks but to find trees, because of the US tank photos being on open ground and the Russian ones being in forests; that was during the same MSc course—so very similar to predicting tumours by looking for the ruler used to measure them in the photo—but I don’t recall the source (it wasn’t one of the books you cite though, it was either a journal article or another text book).
Chris Brew states (2017-10-16) that he “Heard the story in 1984 with pigeons instead of neural nets”.
Edward Fredkin, in an email to Eliezer Yudkowsky on 2013-02-26, recounts an interesting anecdote about the 1960s claiming to be the grain of truth:
By the way, the story about the two pictures of a field, with and without army tanks in the picture, comes from me. I attended a meeting in Los Angeles [at RAND?], about half a century ago [~1963?] where someone gave a paper showing how a random net could be trained to detect the tanks in the picture. I was in the audience. At the end of the talk I stood up and made the comment that it was obvious that the picture with the tanks was made on a sunny day while the other picture (of the same field without the tanks) was made on a cloudy day. I suggested that the “neural net” had merely trained itself to recognize the difference between a bright picture and a dim picture.
The absence of any hard citations is striking: even when a citation is supplied, it is invariably to a relatively recent source like Dreyfus, and then the chain ends. Typically for a real story, one will find at least one or two hints of a penultimate citation and then a final definitive citation to some very difficult-to-obtain or obscure work (which then is often quite different from the popularized version but still recognizable as the original); for example, another popular cautionary AI urban legend is that the 1956 Dartmouth workshop claimed that a single graduate student working for a summer could solve computer vision (or perhaps AI in general), which is a highly distorted & misleading description of the original 1955 proposal’s realistic suggestion of “a 2 month, 10 man study of artificial intelligence”, predicting that “a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”3 Instead, everyone either disavows it as an urban legend or possibly apocryphal, or punts to someone else. (Minsky’s 2011 version initially seems concrete, but while he specifically attributes the musical score story to a friend & claims to have found the trick personally, he is then as vague as anyone else about the tank story, saying it just “happened” somewhere “in the United States at one of our research institutes”, at an unmentioned institute by unmentioned people at an unmentioned point in time for an unmentioned branch of the military.)
Question to Radio Yerevan: “Is it correct that Grigori Grigorievich Grigoriev won a luxury car at the All-Union Championship in Moscow?”
Radio Yerevan answered: “In principle, yes. But first of all it was not Grigori Grigorievich Grigoriev, but Vassili Vassilievich Vassiliev; second, it was not at the All-Union Championship in Moscow, but at a Collective Farm Sports Festival in Smolensk; third, it was not a car, but a bicycle; and fourth he didn’t win it, but rather it was stolen from him.”
“Radio Yerevan Jokes” (collected by Allan Stevo)
It is also interesting that not all the stories imply quite the same problem with the hypothetical NN. Dataset bias/selection effects are not the same thing as overfitting or disparate impact, but some of the storytellers don’t realize that. For example, in some stories, the NN fails when it’s tested on additional heldout data (overfitting), not when it’s tested on data from an entirely different photographer or field exercise or data source (dataset bias/distributional shift). Or, Alexander Harrowell cites disparate impact in a medical school as if it were an example of the same problem, but it’s not—at least in the USA, a NN would be correct in inferring that white students are more likely to succeed, as that is a real predictor (this is an example of how people play rather fast and loose with claims of “algorithmic bias”), and it would not necessarily be the case that, say, randomized admission of more non-white students would increase the number of successful graduates; such a scenario is, however, possible, and illustrates the difference between predictive models & causal models for control & optimization, and the need for experiments/reinforcement learning.
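The distinction can be made concrete with a toy simulation (pure numpy, with invented numbers; the `brightness_classifier` is a hypothetical stand-in for the legendary NN). A model keying on brightness passes a heldout split of the biased dataset perfectly, yet collapses to chance on data from a new source:

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_batch(n, tank):
    # Hypothetical biased collection: tank photos taken on sunny days (bright),
    # non-tank photos on overcast days (dim); the numbers are invented.
    mean = 0.7 if tank else 0.3
    return rng.normal(mean, 0.05, size=(n, 16, 16)).clip(0, 1)

def unbiased_batch(n, tank):
    # A new photographer/source: brightness no longer correlates with the label.
    means = rng.uniform(0.3, 0.7, size=(n, 1, 1))
    return rng.normal(means, 0.05, size=(n, 16, 16)).clip(0, 1)

def brightness_classifier(imgs, threshold=0.5):
    # The "cheap trick": classify solely on mean pixel intensity.
    return imgs.mean(axis=(1, 2)) > threshold

labels = np.array([True] * 100 + [False] * 100)

# Held-out split from the SAME biased collection: near-perfect accuracy,
# so this is not overfitting in the usual sense.
heldout = np.concatenate([biased_batch(100, True), biased_batch(100, False)])
acc_heldout = (brightness_classifier(heldout) == labels).mean()

# Data from a NEW source: accuracy collapses toward chance (distributional shift).
new = np.concatenate([unbiased_batch(100, True), unbiased_batch(100, False)])
acc_new = (brightness_classifier(new) == labels).mean()
print(acc_heldout, acc_new)
```

The heldout accuracy is near-perfect precisely because the heldout split inherits the same bias; only data from outside the original collection process reveals the problem, which is why an ordinary train/validation split cannot catch dataset bias.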
A read of all the variants together raises more questions than it answers:
- Did this story happen in the 1960s, 1980s, 1990s, or during Desert Storm in the 1990s?
- Was the research conducted by the US military, or researchers for another NATO country?
- Were the photographs taken by satellite, from the air, on the ground, or by spy cameras?
- Were the photographs of American tanks, plywood cutouts, Soviet tanks, or Warsaw Pact tanks?
- Were the tanks out in the open, under cover, or fully camouflaged?
- Were these photographs taken in forests, fields, deserts, swamps, or all of them?
- Were the photographs taken in the same place but at different times of day, in the same place but on different days, or in different places entirely?
- Were there 100, 200, or thousands of photographs; and how many were in the training vs validation set?
- Was the input in black-and-white binary, grayscale, or color?
- Was the tell-tale feature either field vs forest, bright vs dark, the presence vs absence of clouds, the presence vs absence of shadows, the length of shadows, or an accident in film development unrelated to weather entirely?
- Was the NN to be used for image processing or in autonomous robotic tanks?
- Was it even a NN?
- Was the dataset bias caught quickly within “a few hours”, later by a suspicious team member, later still when applied to an additional set of tank photographs, during further testing producing a new dataset, much later during a live demo for military officers, or only after live deployment in the field?
Almost every aspect of the tank story which could vary does vary.
We could also compare the tank story with many of the characteristics of urban legends (of the sort so familiar from Snopes): they typically have a clear dramatic arc, involve horror or humor while playing on common concerns (distrust of NNs has been a theme from the start of NN research4), make an important didactic or moral point, claim to be true while sourcing remains limited to social proof such as the usual “friend of a friend” attributions, often try to associate with a respected institution (such as the US military), are transmitted primarily orally through social mechanisms & appear spontaneously & independently in many sources without apparent origin (most people seem to hear the tank story in unspecified classes, conferences, or personal discussions rather than in a book or paper), exist in many mutually-contradictory variants often with overly-specific details5 spontaneously arising in the retelling, have been around for a long time (it appears almost fully formed in Dreyfus 1992, suggesting incubation before then), sometimes have a grain of truth (dataset bias certainly is real), and are “too good not to pass along” (even authors who are sure it’s an urban legend can’t resist retelling it yet again for didactic effect or entertainment). The tank story matches almost all the usual criteria for an urban legend.
So where does this urban legend come from? The key anecdote appears to be Edward Fredkin’s as it precedes all other excerpts except perhaps the research Kanal describes; Fredkin’s story does not confirm the tank story as he merely speculates that brightness was driving the results, much less all the extraneous details about photographic film being accidentally overdeveloped or robot tanks going berserk or a demo failing in front of Army brass.
But it’s easy to see how Fredkin’s reasonable question could have memetically evolved into the tank story as finally fixed into published form by Dreyfus’s article:
Setting: Kanal & Randall set up their very small simple early perceptrons on some tiny binary aerial photos of tanks, in interesting early work, and Fredkin attends the talk sometime around 1960–1963
The Question: Fredkin then asks in the Q&A whether the perceptron is not learning square-shapes but brightness
Punting: of course neither Fredkin nor Kanal & Randall can know on the spot whether this critique is right or wrong (perhaps that question motivated the binarized results reported in Kanal & Randall 1964?), and the question remains unanswered
Anecdotizing: but someone in the audience considers that an excellent observation about methodological flaws in NN research, and perhaps they (or Fredkin) repeat the story to others, who find it useful too; along the way, Fredkin’s question mark gets dropped and the possible flaw becomes an actual flaw, with the punchline: “…and it turned out their NN was just detecting average brightness!”
One might expect Kanal & Randall to rebut these rumors, if only by publishing additional papers on their functioning system, but by a quirk of fate, as Kanal explains in his preface, after their 1964 paper, the Army liked it enough to make it classified and then they were reassigned to an entirely different task, killing progress entirely. (Something similar happened to the best early facial recognition systems.)
Proliferation: In the absence of any counternarrative (silence is considered consent), the tank story continues spreading.
Mutation: but now the story is incomplete, a joke missing most of the setup to its punchline—how did these Army researchers discover the NN had tricked them, and where did the brightness difference come from? The various versions propose different resolutions, and likewise, appropriate details about the tank data must be invented.
Fixation: Eventually, after enough mutations, a version reaches Dreyfus, already a well-known critic of the AI establishment, who then uses it in his article/book, virally spreading it globally to pop up in random places thenceforth, and fixing it as a universally-known ur-text. (Further memetic mutations can and often will occur, but diligent writers & researchers will ‘correct’ variants by returning to the Dreyfus version.)
One might try to write Dreyfus off as a coincidence and argue that the US Army must have had so many neural net research programs going that one of the others is the real origin, but one would expect those programs to result in spinoffs, more reports, reports since declassified, etc. It’s been half a century, after all. And despite the close association of the US military with MIT and early AI work, tanks do not seem to have been a major focus of early NN research—for example, Schmidhuber’s history does not mention tanks at all, and most of my paper searches kept pulling up NN papers about ‘tanks’ as in vats, such as controlling stirring/mixing tanks for chemistry. Nor is it a safe assumption that the military always has much more advanced technology than the public or private sectors; often, they can be quite behind or at the status quo.6
Could something like the tank story (a NN learning to distinguish solely on average brightness levels) happen in 2017 with state-of-the-art techniques like convolutional neural networks (CNNs)? (After all, presumably nobody really cares about what mistakes a crude perceptron may or may not have once made back in the 1960s; most/all of the story-tellers are using it for didactic effect in warning against carelessness in contemporary & future AI research/applications.) I would guess that while it could happen, it would be considerably less likely now than then for several reasons:
a common preprocessing step in computer vision (and NNs in general) is to “whiten” the image by standardizing or transforming pixels to a normal distribution; this tends to wipe out global brightness levels, promoting invariance to illumination
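A minimal numpy sketch of per-image standardization (one simple form of whitening) shows why a uniform brightness shift cannot survive such preprocessing:

```python
import numpy as np

def standardize(img):
    # Per-image standardization: subtract the mean and divide by the standard
    # deviation, leaving every image with zero mean & unit variance.
    return (img - img.mean()) / img.std()

rng = np.random.default_rng(0)
scene = rng.uniform(0.2, 0.6, size=(32, 32))  # an "overcast" photo
sunny = scene + 0.3                           # the same scene, uniformly brighter

# After standardization the two photos are effectively identical: the global
# brightness cue the legendary tank NN supposedly exploited no longer exists.
assert np.allclose(standardize(scene), standardize(sunny))
```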
in addition to or instead of whitening, it is also common to use aggressive “data augmentation”: shifting the image by a few pixels in each direction, cropping it randomly, adjusting colors to be slightly more red/green/blue, flipping horizontally, barrel-warping it, adding JPEG compression noise/artifacts, brightening or darkening, etc.
None of these transformations should affect whether an image is classifiable as “dog” or “cat”7, the reasoning goes, so the NN should learn to see past them, and generating variants during training provides additional data for free. Aggressive data augmentation would make it harder to pick up global brightness as a cheap trick.
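What such augmentation might look like can be sketched in numpy (the specific transformations and ranges here are arbitrary illustrative choices, not any standard pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    # A few of the label-preserving transformations described above.
    if rng.random() < 0.5:                               # random horizontal flip
        img = img[:, ::-1]
    dy, dx = rng.integers(0, 5, size=2)                  # random 28x28 crop of a 32x32 image
    img = img[dy:dy + 28, dx:dx + 28]
    return np.clip(img + rng.uniform(-0.2, 0.2), 0, 1)   # random brighten/darken

img = rng.uniform(0, 1, size=(32, 32))
variants = [augment(img) for _ in range(8)]              # extra training data "for free"
```

Because the brighten/darken step randomizes global intensity on every presentation, average brightness is destroyed as a class signal during training.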
CNNs have built-in biases (compared to fully-connected neural networks) towards edges and other structures, rather than global averages; convolutions want to find edges and geometric patterns like little squares for tanks. (This point is particularly germane in light of the brain inspiration for convolutions & Dreyfus & Dreyfus 1992’s interpretation of the tank story.)
image classification CNNs, due to their large sizes, are often trained on large datasets with many classes to categorize images into (canonically, ImageNet with 1000 classes over a million images; much larger datasets, such as 300 million images, have been explored and found to still offer benefits). Perforce, most of these images will not be generated by the dataset maintainer and will come from a wide variety of people, places, cameras, and settings, reducing any systematic biases. It would be difficult to find a cheap trick which works over many of those categories simultaneously, and the NN training will constantly erode any category-specific tricks in favor of more generalizable pattern-recognition (in part because there’s no inherent ‘modularity’ which could factor a NN into a “tank cheap trick” NN & an “everything else real pattern-recognition” NN). The power of generalizable abstractions will tend to overwhelm the shortcuts, and the more data & tasks a NN is trained on, providing greater supervision & richer insight, the more this will be the case.
- Even in the somewhat unusual case of a special-purpose binary classification CNN being trained on a few hundred images, because of the large sizes of good CNNs, it is typical to at least start with a pretrained ImageNet CNN in order to benefit from all the learned knowledge about edges & whatnot before “finetuning” on the special-purpose small dataset. If the CNN starts with a huge inductive bias towards edges etc, it will have a hard time throwing away its informative priors and focusing purely on global brightness. (Often in finetuning, the lower levels of the CNN aren’t allowed to change at all!)
- Another variant on transfer learning is to use the CNN as a feature-generator, by taking the final layers’ state computed on a specific image and using them as a vector embedding, a sort of summary of everything about the image content relevant to classification; this embedding is useful for other kinds of CNNs for purposes like style transfer (style transfer aims to warp an image towards the appearance of another image while preserving the embedding and thus presumably the content) or for GANs generating images (the discriminator can use the features to detect “weird” images which don’t make sense, thereby forcing the generator to learn what images correspond to realistic embeddings).
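The finetuning point can be illustrated with a toy numpy stand-in for transfer learning (the “pretrained” weights here are random, purely for illustration): only the new classification head is updated, while the lower layers stay frozen:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for finetuning a pretrained network: W1 plays the role of the
# pretrained lower layers and is frozen; only the new 2-class head W2 trains.
W1 = rng.normal(0, 0.1, size=(256, 100))  # "pretrained" feature extractor, frozen
W2 = np.zeros((256, 2))                   # fresh classification head, trainable

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

X = rng.normal(size=(32, 100))            # a small hypothetical finetuning batch
y = rng.integers(0, 2, size=32)           # binary "tank vs no-tank" labels

W1_before = W1.copy()
for _ in range(100):
    feats = np.maximum(0, X @ W1.T)       # features from the frozen lower layers
    probs = softmax(feats @ W2)           # head predictions
    grad = feats.T @ (probs - np.eye(2)[y]) / len(X)  # cross-entropy gradient
    W2 -= 0.1 * grad                      # gradient step on the head only

assert np.array_equal(W1, W1_before)      # the lower layers never moved
```

Since the frozen layers never change, the finetuned model cannot discard its pretrained edge-and-shape features in favor of a pure global-brightness shortcut.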
CNNs would typically show warning signs before a serious field deployment, either in diagnostics or in failures to extend the results.
- One benefit of the filter setup of CNNs is that it’s easy to visualize what the lower layers are ‘looking at’; typically, CNN filters will look like diagonal or horizontal lines or curves or other simple geometric patterns. In the case of a hypothetical brightness-detector CNN, because it is not recognizing any shapes whatsoever or doing anything but trivial brightness averaging, one would expect its filters to look like random noise and definitely nothing like the usual filter visualizations. This would immediately alarm any deep learning researcher that the CNN is not learning what they thought it was learning.
- Related to filter visualization is input visualization: it’s common to generate heatmaps of input images to see which regions of the input image are influencing the classification the most. If you are classifying “cats vs dogs”, you expect a heatmap of a cat image to focus on the cat’s head and tail, for example, and not on the painting on the living room wall behind it; if you have an image of a tank in a forest, you expect the heatmap to focus on the tank rather than on trees in the corner or on nothing in particular, just random-seeming pixels all over the image. If it’s not focusing on the tank at all, how is it doing the classification?, one would then wonder. (“Picasso: A Modular Framework for Visualizing the Learning Process of Neural Network Image Classifiers” (blog), Henderson & Rothe 2017-05-16, quote Yudkowsky 2008’s version of the tank story as a motivation for their heatmap visualization tool and demonstrate that, for example, blocking out the sky in a tank image doesn’t bother a VGG16 CNN image classifier but blocking the tank’s treads does, and the heatmap focuses on the tank itself.) There are additional methods for trying to understand whether the NN has learned a potentially useful algorithm, such as the previously cited LIME.
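One common heatmap method, occlusion sensitivity, is simple enough to sketch in numpy; applied to a hypothetical brightness-only scorer (standing in for the legendary tank NN), it yields the tell-tale near-uniform heatmap:

```python
import numpy as np

def occlusion_heatmap(img, score_fn, patch=8):
    # Slide a gray patch over the image and record how much the classifier's
    # score drops: regions whose occlusion matters are where it is "looking".
    base = score_fn(img)
    h, w = img.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch] = 0.5  # gray out one patch
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

# A hypothetical "cheap trick" scorer that uses only global brightness:
brightness_score = lambda im: im.mean()

rng = np.random.default_rng(0)
img = rng.uniform(0, 1, size=(32, 32))
heat = occlusion_heatmap(img, brightness_score)
# The heatmap comes out nearly uniform and near zero: no region matters more
# than any other, a red flag that no object is being detected at all.
```

A genuine tank-detector would instead show a sharp peak over the patches covering the tank, which is exactly the contrast the Picasso tool demonstrates.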
Also related to the visualization is going beyond classification to the logical next step of “localization” or “image segmentation”: having detected an image with a tank in it somewhere, it is natural (especially for military purposes) to ask where in the image the tank is.
A CNN which is truly detecting the tank itself will lend itself to image segmentation (eg. CNN success in reaching human levels of ImageNet classification performance has also resulted in extremely good segmentation of images, categorizing each pixel as human/dog/cat/etc), while one learning the cheap trick of brightness will utterly fail at guessing better than chance which pixels are the tank.
So it is highly unlikely that a CNN trained via a normal workflow (data-augmented finetuning of a pretrained ImageNet CNN with standard diagnostics) would fail in this exact way, or, at least, that such a failure would make it into a deployed system undetected.
Could something like the tank story happen, in the sense of a selection-biased dataset yielding NNs which fail dismally in practice? One could imagine it happening, and it surely does at least occasionally, but in practice it doesn’t seem to be a particularly serious or common problem—people routinely apply CNNs to very different contexts with considerable success.8 If it were such a serious and common problem, one would think that people would be able to provide a wealth of real-world examples of deployed systems rendered entirely useless by dataset bias, rather than repeating a fiction from 50 years ago.
One of the most relevant (if unfortunately older & possibly out of date) papers I’ve read on this question of dataset bias is “Unbiased Look at Dataset Bias”, Torralba & Efros 2011:
Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as source of large amounts of training data, but also as means of measuring and comparing performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets, that started out as data capture efforts aimed at representing the visual world, have become closed worlds unto themselves (eg. the Corel world, the Caltech101 world, the PASCAL VOC world). With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose?
The goal of this paper is to take stock of the current state of recognition datasets. We present a comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value. The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols. But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected issue.
They demonstrate on several datasets (including ImageNet) that it is possible for an SVM (CNNs were not used) to guess at above-chance levels which dataset an image comes from, and that there are noticeable drops in accuracy when a classifier trained on one dataset is applied to ostensibly the same category in another dataset (eg. an ImageNet “car” SVM classifier applied to PASCAL’s “car” images will go from 57% to 36% accuracy). But—perhaps the glass is half-full—in none of the pairs does performance degrade to near-zero, so despite the definite presence of dataset bias, the SVMs are still learning generalizable, transferable image classification. (Similarly, Jo & Bengio 2017/Recht et al 2018/Recht et al 20199/Yadav & Bottou 2019/Zhang & Davison 2020/Beyer et al 2020 show a generalization gap, but only a small one, with typically better in-sample classifiers performing better out-of-sample; Kornblith et al 2018 show that ImageNet resnets produce multiple new SOTAs on other image datasets using finetuning transfer learning; BiT is one of a number of scaling papers showing much better representations & robustness & transfer with extremely large CNNs; and Lapuschkin et al 2019 compare Fisher vectors (an SVM trained on SIFT features) to CNNs on PASCAL VOC again, finding the Fisher vectors overfit by eg. classifying horses based on copyright watermarks while the CNN nevertheless classifies them based on the correct parts, although the CNN may succumb to a different dataset bias by classifying airplanes based on having backgrounds of skies10.) I believe we have good reason to expect our CNNs to also work in the wild.
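The “name that dataset” result is easy to reproduce in miniature. A toy sketch (pure NumPy, invented numbers, nothing like their actual SVM pipeline): two synthetic “datasets” share identical content statistics but differ by a small dataset-wide exposure offset, and a single-feature threshold classifier identifies an image’s source dataset far above chance:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_dataset(n, exposure):
    """Toy 'photographs': random textures with the same content statistics,
    differing only by a dataset-wide exposure offset (collection bias)."""
    return rng.normal(0.5 + exposure, 0.15, size=(n, 8, 8))

# Two corpora of the 'same' category, collected under different conditions.
A = make_dataset(500, +0.05)   # slightly overexposed collection
B = make_dataset(500, -0.05)   # slightly underexposed collection

# 'Name That Dataset' using a single feature: mean image brightness.
feats = np.concatenate([A, B]).reshape(1000, -1).mean(axis=1)
labels = np.array([0] * 500 + [1] * 500)
threshold = feats.mean()
preds = (feats < threshold).astype(int)   # dimmer images -> dataset B
accuracy = (preds == labels).mean()
print(f"dataset-identification accuracy: {accuracy:.0%} (chance = 50%)")
```

Per-image mean brightness averages away the texture noise, so even a tiny collection-wide offset becomes a nearly perfect dataset fingerprint — the same statistical mechanism as Torralba & Efros’s game, minus their real features.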
Some real instances of dataset bias, more or less (most of these were caught by standard heldout datasets and arguably aren’t the ‘tank story’ at all):
a particularly appropriate example is the unsuccessful WWII Russian anti-tank dog program, which failed in part because the dogs had been trained on Russian tanks and so sought out the familiar Russian tanks rather than the enemy German tanks, apparently recognizing the fuel smell or fuel canisters (diesel vs gasoline)
“The person concept in monkeys (Cebus apella)”, D’Amato & Van Sant 1988
Google Photos in June 2015 caused a social-media fuss over mislabeling African-Americans as gorillas; Google did not explain how the Photos app made that mistake, but it is presumably using a CNN, and the mistake is an example of dataset bias (many more Caucasian/Asian faces leading to better performance on them and continued poor performance everywhere else), a mis-specified loss function (the CNN optimizing a standard classification loss and responding to class imbalance or objective color similarity by preferring to guess ‘gorilla’ rather than ‘human’ to minimize loss, despite what ought to be a greater penalty for mistakenly classifying a human as an animal/object rather than vice versa), or both. A similar issue occurred with Flickr in May 2015.
“Gender-From-Iris or Gender-From-Mascara?”, Kuehlkamp et al 2017
Gidi Shperber, “What I’ve learned from Kaggle’s fisheries competition” (2017-05-01): initial transfer learning with pretrained VGG ImageNet CNNs solved the fish-photograph classification problem almost immediately, but failed on the submission validation set; the fish categories could be predicted from the specific boat taking the photographs
“Leakage in data mining: Formulation, detection, and avoidance”, Kaufman et al 2011 discusses the general topic and mentions a few examples from KDD-Cup
Dan Piponi (2017-10-16): “Real world example from work: hospitals specialise in different injuries so CNN for diagnosis used annotations on x-rays to ID hospital.”
- A more detailed examination of X-ray saliencies: “Confounding variables can degrade generalization performance of radiological deep learning models”, Zech et al 2018 (blog)
We made exactly the same mistake in one of my projects on insect recognition. We photographed 54 classes of insects. Specimens had been collected, identified, and placed in vials. Vials were placed in boxes sorted by class. I hired student workers to photograph the specimens. Naturally they did this one box at a time; hence, one class at a time. Photos were taken in alcohol. Bubbles would form in the alcohol. Different bubbles on different days. The learned classifier was surprisingly good. But a saliency map revealed that it was reading the bubble patterns and ignoring the specimens. I was so embarrassed that I had made the oldest mistake in the book (even if it was apocryphal). Unbelievable. Lesson: always randomize even if you don’t know what you are controlling for!
a possible case is Wu & Zhang 2016, “Automated Inference on Criminality using Face Images”, which attempts to use CNNs to classify standardized government ID photos of Chinese people by whether the person has been arrested, the source of the criminal IDs being government publications of wanted suspects vs ordinary people’s IDs collected online; the photos are repeatedly described as ID photos and implied to be uniform. The use of official government ID photos taken in advance of any crime would appear to eliminate one’s immediate objections about dataset bias—certainly ID photos would be distinct in many ways from ordinary cropped promotional headshots—and so the results seem strong.
In response to harsh criticism (some of which points are more relevant & likely than the others…), Wu & Zhang admit in their response (“Responses to Critiques on Machine Learning of Criminality Perceptions (Addendum of arXiv:1611.04135)”) that the dataset is not quite as implied:
All criminal ID photos are government issued, but not mug shots. To our best knowledge, they are normal government issued ID portraits like those for driver’s license in USA. In contrast, most of the noncriminal ID style photos are taken officially by some organizations (such as real estate companies, law firms, etc.) for their websites. We stress that they are not selfies.
While there is no direct replication testing the Wu & Zhang 2016 results that I know of, the considerable inherent differences between the two classes, which are not homogeneous at all, make me highly skeptical.
Possible: Winkler et al 2019 examine a commercial CNN (“Moleanalyzer-Pro”; Haenssle et al 2018) for skin cancer detection. Concerned by the fact that doctors sometimes use purple markers to highlight potentially-malignant skin cancers for easier examination, they compare before/after photographs of skin cancers which have been highlighted, and find that the purple highlighting increases the probability of being classified as malignant.
However, it is unclear that this is a dataset bias problem, as the existing training datasets for skin cancer are realistic and already include purple marker samples11. The demonstrated manipulation may simply reflect the CNN using purple as a proxy for human concern, which is an informative signal and desirable if it improves classification performance in the real world on real medical cases. It is possible that the training datasets are in fact biased to some degree with too much/too little purple or that use of purple differs systematically across hospitals, and those would damage performance to some degree, but that is not demonstrated by their before/after comparison. Ideally, one would run a field trial to test the CNN’s performance as a whole by using it in various hospitals and then following up on all cases to determine benign or malignant; if the classification performance drops considerably from the original training, then that implies something (possibly the purple highlighting) has gone wrong.
Possible: Esteva et al 2017 trains a skin cancer classifier; the final CNN performs well in independent test sets. The paper does not mention this problem, but media coverage reported that rulers in photographs served as unintentional features:
He and his colleagues had one such problem in their study with rulers. When dermatologists are looking at a lesion that they think might be a tumor, they’ll break out a ruler—the type you might have used in grade school—to take an accurate measurement of its size. Dermatologists tend to do this only for lesions that are a cause for concern. So in the set of biopsy images, if an image had a ruler in it, the algorithm was more likely to call a tumor malignant, because the presence of a ruler correlated with an increased likelihood a lesion was cancerous. Unfortunately, as Novoa emphasizes, the algorithm doesn’t know why that correlation makes sense, so it could easily misinterpret a random ruler sighting as grounds to diagnose cancer.
It’s unclear how they detected this problem or how they fixed it. And like Winkler et al 2019, it’s unclear if this was a problem which would reduce real-world performance (are dermatologists going to stop measuring worrisome lesions?).
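Several of the examples above (boats, hospitals, boxes of insect vials) share a failure mode that a standard random train/test split cannot catch, because the confounded acquisition “session” leaks across the split; evaluating on entirely held-out sessions exposes it. A minimal sketch with invented data, where the only feature is the session artifact itself:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy leakage scenario: data collected one class per 'session' (one boat,
# one hospital, one box of vials at a time), so a session-specific artifact
# predicts the label within this collection but means nothing for new sessions.
n_groups, per_group = 6, 100
group = np.repeat(np.arange(n_groups), per_group)
y = group % 2                                  # class alternates by session
X = group + rng.normal(0, 0.1, len(group))     # feature = session artifact only

def nn_accuracy(tr, te):
    """1-nearest-neighbor accuracy, a stand-in for an overfit classifier."""
    nearest = np.abs(X[te][:, None] - X[tr][None, :]).argmin(axis=1)
    return (y[tr][nearest] == y[te]).mean()

# Random split: sessions leak across train/test, so the artifact looks predictive.
idx = rng.permutation(len(X))
acc_random = nn_accuracy(idx[:450], idx[450:])

# Session-wise split (hold out whole sessions): performance collapses to chance.
acc_grouped = nn_accuracy(np.where(group < 4)[0], np.where(group >= 4)[0])

print(f"random split accuracy:      {acc_random:.2f}")
print(f"held-out sessions accuracy: {acc_grouped:.2f}")
```

This is the logic behind group-wise cross-validation: if accuracy collapses when whole hospitals/boats/boxes are held out, the classifier was reading the session, not the subject.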
So the NN tank story probably didn’t happen as described, but something somewhat like it could have happened and things sort of like it could happen now, and it is (as proven by its history) a catchy story to warn students with—it’s not true but it’s “truthy”. Should we still mention it to journalists or in blog posts or in discussions of AI risk, as a noble lie?
I think not. In general, we should promote more epistemic rigor and higher standards in an area where there is already far too much impact of fictional stories (eg. the depressing inevitability of a Terminator allusion in AI risk discussions). Nor do I consider the story particularly effective from a didactic perspective: relegating dataset bias to mythical stories does not inform the listener about how common or how serious dataset bias is, nor is it helpful for researchers investigating countermeasures and diagnostics—the LIME developers, for example, are not helped by stories about Russian tanks, but need real testcases to show that their interpretability tools work & would help machine learning developers diagnose & fix dataset bias.
I also fear that telling the tank story tends to promote complacency and underestimation of the state of the art by implying that NNs and AI in general are toy systems which are far from practicality & cannot work in the real world (particularly the story variants which date it relatively recently), or that such systems when they fail will fail in easily diagnosed, visible, sometimes amusing ways, ways which can be diagnosed by a human comparing the photos or applying some political reasoning to the outputs; but modern NNs are powerful, are often deployed to the real world despite the spectre of dataset bias, and do not fail in blatant ways—what we actually see with deep learning are far more concerning failure modes like “adversarial examples” which are quite as inscrutable as the neural nets themselves (or AlphaGo’s one misjudged move resulting in its only loss to Lee Sedol). Adversarial examples are particularly insidious as the NN will work flawlessly in all the normal settings and contexts, only to fail totally when exposed to a custom adversarial input. More importantly, dataset bias and failure to transfer tends to be a self-limiting problem, particularly when embedded in an ongoing system or reinforcement learning agent, since if the NN is making errors based on dataset bias, it will in effect be generating new counterexample datapoints for its next iteration.
“There is nothing so useless as doing efficiently that which should not be done at all.”
The more troubling errors are ones where the goal itself, the reward function, is mis-specified or wrong or harmful. I am less worried about algorithms learning to do poorly the right thing for the wrong reasons because humans are sloppy in their data collection than I am about them learning to do well the wrong thing for the right reasons despite perfect data collection. With errors or inefficiencies in the rest of the algorithm, training may simply be slower, or there may be more local optima which may temporarily trap the agent, or its final performance may be worse than it could be; these are bad things, but normal enough. But when the reward function is wrong, the better the algorithm is, the more useless (or dangerous) it becomes at pursuing the wrong objective! Using losses which have little to do with the true human utility function or decision context is far more common than serious dataset bias: people think about where their data is coming from, but they tend not to think about what the consequences of wrong classifications are. Such reward function problems cannot be fixed by collecting any amount of data or making data more representative of the real world, and for large-scale systems will be more harmful. And it can be hard to avoid errors: sure, in hindsight, once you’ve seen the converged reward hack, you can laugh and say “of course that particular bit of reward-shaping was wrong, how obvious now!”—but only in hindsight. Before then, it’s just common sense.
Unfortunately, I know of no particularly comprehensive lists of examples of mis-specified rewards/unexpectedly bad proxy objective functions/“reward hacking”/“wireheading”/“perverse instantiation”12; perhaps people can make suggestions, but a few examples I have found or recall include:
linear programming optimization for nutritious (not necessarily palatable!) low-cost diets: “The cost of subsistence”, Stigler 1945, “The Diet Problem”, Dantzig 1990, “Stigler’s Diet Problem Revisited”, Garille & Gass 2001
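The diet problem shows the pattern in its simplest form: a correct optimizer faithfully minimizing a mis-specified objective (cost and nutrients, with no palatability term). A brute-force sketch with invented numbers, standing in for the actual linear program:

```python
from itertools import product

# A toy Stigler-style diet problem (all numbers invented for illustration):
# minimize cost subject to minimum calories & protein, ignoring palatability.
foods = {           # name: (cost $/unit, kcal/unit, protein g/unit)
    "flour":   (0.30, 1600, 45),
    "cabbage": (0.25,  120,  6),
    "steak":   (4.00,  800, 70),
    "cheese":  (1.50,  900, 55),
}
need_kcal, need_protein = 2000, 70

best = None
names = list(foods)
for qty in product(range(6), repeat=len(names)):       # 0-5 units of each
    cost = sum(q * foods[n][0] for q, n in zip(qty, names))
    kcal = sum(q * foods[n][1] for q, n in zip(qty, names))
    prot = sum(q * foods[n][2] for q, n in zip(qty, names))
    if kcal >= need_kcal and prot >= need_protein:
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, qty)))

# The optimizer happily prescribes flour alone, every day: nutritionally
# 'feasible', minimally cheap, and exactly what nobody wants to eat.
print(best)
```

The solver is doing its job perfectly; the absurdity lives entirely in the objective, which is the essay’s larger point.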
- SMT/SAT solvers are likewise infamous for finding solutions which are strictly valid yet surprising or useless, a perversity which is exactly what makes them so invaluable in security/formal-verification research (for example, in RISC-V verification of exceptions, discovering that an exception can be triggered by turning on a debug unit & setting a breakpoint, or by using an obscure memory-mode setting)
boat race reward-shaping for picking up targets results in not finishing the race at all but going in circles to hit targets: “Faulty Reward Functions in the Wild”, OpenAI
a classic 3D robot-arm NN agent, in a somewhat unusual setup where the evaluator/reward function is another NN trained to predict human evaluations, learns to move the arm to a position which looks like it is positioned at the goal but is actually just in between the ‘camera’ and the goal: “Learning from Human Preferences”, Christiano et al 2017, OpenAI
reward-shaping a bicycle agent for not falling over & making progress towards a goal point (but not punishing for moving away) leads it to learn to circle around the goal in a physically stable loop: “Learning to Drive a Bicycle using Reinforcement Learning and Shaping”, Randlov & Alstrom 1998; similar difficulties in avoiding pathological optimization were experienced by Cook 2004 (video of policy-iteration learning to spin handle-bar to stay upright).
reward-shaping a soccer robot for touching the ball caused it to learn to get to the ball and “vibrate” touching it as fast as possible: David Andre & Astro Teller in Ng et al 1999, “Policy invariance under reward transformations: theory and application to reward shaping”
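The arithmetic behind such shaping failures is simple. A sketch with invented numbers in the spirit of the soccer example: once the discounted sum of per-touch bonuses exceeds the one-time goal reward, “vibrating” against the ball forever strictly dominates ever scoring:

```python
# Toy comparison of two policies under a shaped reward (numbers invented):
# +0.2 each timestep the robot touches the ball, +10 once for scoring a goal.
touch_bonus, goal_reward, gamma = 0.2, 10.0, 0.99

def vibrate_return(steps):
    """Touch the ball every step, forever; never score."""
    return sum(touch_bonus * gamma**t for t in range(steps))

def score_return(travel_steps):
    """Dribble toward goal (one touch per step), then score once."""
    shaped = sum(touch_bonus * gamma**t for t in range(travel_steps))
    return shaped + goal_reward * gamma**travel_steps

print("vibrate at ball (1000 steps):", round(vibrate_return(1000), 2))
print("score after 50 steps:        ", round(score_return(50), 2))
```

With these constants the infinite-touch policy earns about twice the return of actually scoring, so a competent RL algorithm *should* learn to vibrate; the bug is the shaping term, not the learner.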
environments involving walking/running/movement and rewarding movement seem to often result in the agents learning to fall over as a local optima of speed generation, possibly bouncing around; for example, Sims notes in one paper (Sims 1994) that “It is important that the physical simulation be reasonably accurate when optimizing for creatures that can move within it. Any bugs that allow energy leaks from non-conservation, or even round-off errors, will inevitably be discovered and exploited by the evolving creatures. …speed is used as the selection criteria, but the vertical component of velocity is ignored. For land environments, it can be necessary to prevent creatures from generating high velocities by simply falling over.” Combined with “3-D Morphology”, Sims discovered that without height limits, the creatures just became as tall as possible and fell over; and if the conservation-of-momentum was not exact, creatures could evolve ‘paddles’ and paddle themselves at high velocity. (Evolving similar exploitation of rounding-off has been done by OpenAI in 2017 to turn apparently linear neural networks into nonlinear ones; Jaderberg et al 2019 appears to have had a similar momentum bug in its Quake simulator: “In one test, the bots invented a completely novel strategy, exploiting a bug that let teammates give each other a speed boost by shooting them in the back.”)
Popov et al 2017, training a simulated robot gripper arm to stack objects like Legos, included reward shaping; pathologies included “hovering”, and, given a shaping term rewarding lifting the bottom face of the top block upwards, DDPG learned to knock the blocks over, thereby (temporarily) elevating the bottom of the top block and receiving the reward:
We consider three different composite rewards in addition to the original sparse task reward:
- Grasp shaping: Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.25 when the brick 1 has been grasped and a reward of 1.0 after completion of the full task.
- Reach and grasp shaping: Reach brick 1, Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.125 when being close to brick 1, a reward of 0.25 when brick 1 has been grasped, and a reward of 1.0 after completion of the full task.
- Full composite shaping: the sparse reward components as before in combination with the distance-based smoothly varying components.
Figure 5 shows the results of learning with the above reward functions (blue traces). The figure makes clear that learning with the sparse reward only does not succeed for the full task. Introducing an intermediate reward for grasping allows the agent to learn to grasp but learning is very slow. The time to successful grasping can be substantially reduced by giving a distance based reward component for reaching to the first brick, but learning does not progress beyond grasping. Only with an additional intermediate reward component as in continuous reach, grasp, stack the full task can be solved.
Although the above reward functions are specific to the particular task, we expect that the idea of a composite reward function can be applied to many other tasks, thus allowing learning to succeed even for challenging problems. Nevertheless, great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing the reward function components: eg. reach and grasp components leading to a grasp unsuitable for stacking, the agent not stacking the bricks because it will stop receiving the grasping reward before it receives reward for stacking, and the agent flipping the brick because it gets a grasping reward calculated with the wrong reference point on the brick. We show examples of these in the video.
RL agents using learned model-based planning paradigms such as model predictive control are noted to have issues with the planner essentially exploiting the learned model by choosing a plan going through the worst-modeled parts of the environment and producing unrealistic plans using teleportation; eg. Mishra et al 2017, “Prediction and Control with Temporal Segment Models”, who note:
If we attempt to solve the optimization problem as posed in (2), the solution will often attempt to apply action sequences outside the manifold where the dynamics model is valid: these actions come from a very different distribution than the action distribution of the training data. This can be problematic: the optimization may find actions that achieve high rewards under the model (by exploiting it in a regime where it is invalid) but that do not accomplish the goal when they are executed in the real environment.
…Next, we compare our method to the baselines on trajectory and policy optimization. Of interest is both the actual reward achieved in the environment, and the difference between the true reward and the expected reward under the model. If a control algorithm exploits the model to predict unrealistic behavior, then the latter will be large. We consider two tasks….Under each model, the optimization finds actions that achieve similar model-predicted rewards, but the baselines suffer from large discrepancies between model prediction and the true dynamics. Qualitatively, we notice that, on the pushing task, the optimization exploits the LSTM and one-step models to predict unrealistic state trajectories, such as the object moving without being touched or the arm passing through the object instead of colliding with it. Our model consistently performs better, and, with a latent action prior, the true execution closely matches the model’s prediction. When it makes inaccurate predictions, it respects physical invariants, such as objects staying still unless they are touched, or not penetrating each other when they collide.
This is similar to Sims’s issues, or current issues in training walking or running agents in environments like MuJoCo where it is easy for them to learn odd gaits like hopping (Lillicrap et al 2016 adds extra penalties for impacts to try to avoid this) or jumping (eg. Stelmaszczyk’s attempts at reward shaping a skeleton agent) or flailing around wildly (Heess et al 2017 add random pushes/shoves to the environment to try to make the agent learn more generalizable policies) which may work quite well in the specific simulation but not elsewhere. (To some degree this is beneficial for driving exploration in poorly-understood regions, so it’s not all bad.) Christine Barron, working on a pancake-cooking robot-arm simulation, ran into reward-shaping problems: rewarding for each timestep without the pancake on the floor teaches the agent to hurl the pancake into the air as hard as possible; and for the passing-the-butter agent, rewarding for getting close to the goal produces the same close-approach-but-avoidance behavior to maximize reward.
A curious lexicographic-preference raw-RAM NES AI algorithm learns to pause the game to never lose at Tetris: Murphy 2013, “The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel… after that it gets a little tricky”
RL agent in Udacity self-driving car rewarded for speed learns to spin in circles: Matt Kelcey
NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days of survival, yields an optimal plan of killing 2/3 of the crew & keeping the survivor alive as long as possible: iand675
Doug Lenat’s Eurisko famously had issues with “parasitic” heuristics which, thanks to the self-modification ability, edited important results to claim credit and be rewarded, part of a class of wireheading heuristics troublesome enough that Lenat made the Eurisko core unmodifiable: “EURISKO: A program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results”, Lenat 1983 (pg90)
genetic algorithms for image classification evolve a timing attack to infer image labels based on hard drive storage location: https://news.ycombinator.com/item?id=6269114
training a dog to roll over results in slamming against the wall; dolphins rewarded for finding trash & dead seagulls in their tank learned to manufacture trash & hunt living seagulls for more rewards
circuit design with genetic/evolutionary computation:
- an attempt to evolve a circuit on an FPGA, to discriminate audio tones of 1kHz & 10kHz without using any timing elements, evolved a design which depended on disconnected circuits in order to work: “An evolved circuit, intrinsic in silicon, entwined with physics”, Thompson 1996. (“Possible mechanisms include interactions through the power-supply wiring, or electromagnetic coupling.” The evolved circuit is sensitive to room temperature variations 23–43C, only working perfectly over the 10C range of room temperature it was exposed to during the 2 weeks of evolution. It is also sensitive to the exact location on the FPGA, degrading when shifted to a new position; further finetuning evolution fixes that, but then is vulnerable when shifted back to the original location.)
- an attempt to evolve an oscillator or a timer wound up evolving a circuit which picked up radio signals from the lab PCs (although since the circuits did work at their assigned function as the human intended, should we consider this a case of ‘dataset bias’ where the ‘dataset’ is the local lab environment?): “The evolved radio and its implications for modelling the evolution of novel sensors”, Jon Bird and Paul Layzell 2002
training a “minitaur” bot in simulation to carry a ball or duck on its back, CMA-ES discovers it can drop the ball into a leg joint and then wiggle across the floor without the ball ever dropping
CycleGAN, a cooperative GAN architecture for converting images from one genre to another (eg. horses⟺zebras), has a loss function that rewards accurate reconstruction of images from its transformed version; CycleGAN turns out to partially solve the task by, in addition to the cross-domain analogies it learns, steganographically hiding autoencoder-style data about the original image invisibly inside the transformed image to assist the reconstruction of details (Chu et al 2017)
A researcher in 2020 working on art colorization told me of an interesting similar behavior: his automatically-grayscaled images were failing to train the NN well, and he concluded that this was because grayscaling a color image produces many shades of gray in a way that human artists do not, and that the formula used by OpenCV for RGB → grayscale permits only a few colors to map onto any given shade of gray, enabling accurate guessing of the original color! Such issues might require learning a grayscaler, similar to superresolution needing learned downscalers (Sun & Chen 2019).
the ROUGE summarization metric, based on matching sub-phrases, is typically used with RL techniques since it is a non-differentiable loss; Salesforce (Paulus et al 2017) notes that an effort at a ROUGE-only summarization NN produced largely gibberish summaries, and another loss function had to be added to get high-quality results
Alex Irpan writes of 3 anecdotes:
In talks with other RL researchers, I’ve heard several anecdotes about the novel behavior they’ve seen from improperly defined rewards.
- A coworker is teaching an agent to navigate a room. The episode terminates if the agent walks out of bounds. He didn’t add any penalty if the episode terminates this way. The final policy learned to be suicidal, because negative reward was plentiful, positive reward was too hard to achieve, and a quick death ending in 0 reward was preferable to a long life that risked negative reward.
- A friend is training a simulated robot arm to reach towards a point above a table. It turns out the point was defined with respect to the table, and the table wasn’t anchored to anything. The policy learned to slam the table really hard, making the table fall over, which moved the target point too. The target point just so happened to fall next to the end of the arm.
- A researcher gives a talk about using RL to train a simulated robot hand to pick up a hammer and hammer in a nail. Initially, the reward was defined by how far the nail was pushed into the hole. Instead of picking up the hammer, the robot used its own limbs to punch the nail in. So, they added a reward term to encourage picking up the hammer, and retrained the policy. They got the policy to pick up the hammer…but then it threw the hammer at the nail instead of actually using it.
Admittedly, these are all secondhand accounts, and I haven’t seen videos of any of these behaviors. However, none of it sounds implausible to me. I’ve been burned by RL too many times to believe otherwise…I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima. It’s a bit ridiculous, but I’ve found it’s actually a productive mindset to have.
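The first of those anecdotes is just an expected-value calculation gone wrong. A sketch with invented numbers: when the average per-step reward while alive is negative and walking out of bounds terminates the episode at an unpenalized 0, immediate “suicide” really is the optimal policy:

```python
# Toy version of the 'suicidal agent' anecdote (all numbers invented):
# each step alive costs -0.05 on average (penalties are plentiful), the
# goal pays +1 but is reached in only 1% of episodes, and walking out of
# bounds ends the episode at exactly 0 -- with no termination penalty.
step_cost, goal_reward, p_goal, horizon = -0.05, 1.0, 0.01, 100

value_try = horizon * step_cost + p_goal * goal_reward   # keep trying
value_suicide = 0.0                                      # exit immediately

print(f"expected return, keep trying:  {value_try:.2f}")
print(f"expected return, leave bounds: {value_suicide:.2f}")
# Under this reward, dying immediately strictly dominates playing.
```

The fix is equally mechanical: a termination penalty larger than the expected cost of staying alive restores the intended ordering of policies.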
Chrabaszcz et al 2018: an evolutionary strategies RL in the ALE game Q*bert finds that it can steadily earn points by committing ‘suicide’ to lure an enemy into following it; more interestingly, it also discovers what appears to be a previously unknown bug where a sequence of jumps will, semi-randomly, permanently force the game into a state where the entire level begins flashing and the score increases rapidly & indefinitely until the game is reset (video)
Lapuschkin et al 2019 notes a borderline case in the ALE pinball game where the ‘nudge’ ability is unlimited (unlike all real pinball machines) and a DQN can learn to score arbitrarily high by nudging the table so that the ball passes over a switch repeatedly:
The second showcase example studies neural network models (see Figure 5 for the network architecture) trained to play Atari games, here Pinball. As shown in , the DNN achieves excellent results beyond human performance. Like for the previous example, we construct LRP heatmaps to visualize the DNN’s decision behavior in terms of pixels of the pinball game. Interestingly, after extensive training, the heatmaps become focused on few pixels representing high-scoring switches and lose track of the flippers. A subsequent inspection of the games in which these particular LRP heatmaps occur, reveals that the DNN agent firstly moves the ball into the vicinity of a high-scoring switch without using the flippers at all, then, secondly, “nudges” the virtual pinball table such that the ball infinitely triggers the switch by passing over it back and forth, without causing a tilt of the pinball table (see Figure 2b and Figure 6 for the heatmaps showing this point, and also Supplementary Video 1). Here, the model has learned to abuse the “nudging” threshold implemented through the tilting mechanism in the Atari Pinball software. From a pure game scoring perspective, it is indeed a rational choice to exploit any game mechanism that is available. In a real pinball game, however, the player would likely go bust since the pinball machinery is programmed to tilt after a few strong movements of the whole physical machine.
“Trial without Error: Towards Safe Reinforcement Learning via Human Intervention”, Saunders et al 2017; the blog writeup notes:
The Road Runner results are especially interesting. Our goal is to have the agent learn to play Road Runner without losing a single life on Level 1 of the game. Deep RL agents are known to discover a ‘Score Exploit’ in Road Runner: they learn to intentionally kill themselves in a way that (paradoxically) earns greater reward. Dying at a precise time causes the agent to repeat part of Level 1, where it earns more points than on Level 2. This is a local optimum in policy space that a human gamer would never be stuck in.
Ideally, our Blocker would prevent all deaths on Level 1 and hence eliminate the Score Exploit. However, through random exploration the agent may hit upon ways of dying that “fool” our Blocker (because they look different from examples in its training set) and hence learn a new version of the Score Exploit. In other words, the agent is implicitly performing a random search for adversarial examples for our Blocker (which is a convolutional neural net)…In Road Runner we did not achieve zero catastrophes but were able to reduce the rate of deaths per frame from 0.005 (with no human oversight at all) to 0.0001.
Toromanoff et al 2019 note various bugs in the ALE games, as well as a new infinite loop for maximizing score:
Finally, we discovered that on some games the actual optimal strategy is doing a loop over and over, giving a small amount of reward. In Elevator Action the agent learns to stay at the first floor and kill the first enemy over and over. This behavior cannot be seen as an actual issue as the agent is basically optimizing score, but this is definitely not the intended goal. A human player would never perform this way.
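The failure mode these loop exploits share, a small repeatable reward outweighing the intended one-time goal, falls directly out of the arithmetic of discounted returns. A toy sketch with invented numbers (not any actual game's reward schedule):

```python
# Toy comparison of discounted returns: an agent that loops forever
# collecting a small repeatable reward vs. one that plays "as intended"
# and collects a single large terminal reward. All numbers are invented
# for illustration.

def looping_return(reward_per_step: float, gamma: float) -> float:
    """Discounted return of repeating a +reward_per_step loop forever."""
    return reward_per_step / (1 - gamma)  # geometric series sum

def intended_return(terminal_reward: float, steps_to_goal: int, gamma: float) -> float:
    """Discounted return of reaching a one-time terminal reward."""
    return terminal_reward * gamma ** steps_to_goal

gamma = 0.99
loop = looping_return(1.0, gamma)        # ~100.0
goal = intended_return(100.0, 50, gamma) # ~60.5

print(loop, goal)  # the small repeatable loop strictly dominates the intended goal
```

With any nonzero discounting, a reliable +1 loop beats a distant +100 terminal reward once the goal is far enough away, so an optimizer that finds such a loop has no reason to ever leave it.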
Paine et al 2019, introducing the R2D3 agent, note a similar environment-bug exploit:

Wall Sensor Stack: The original Wall Sensor Stack environment had a bug that the R2D3 agent was able to exploit. We fixed the bug and verified the agent can learn the proper stacking behavior.
…Another desirable property of our approach is that our agents are able to learn to outperform the demonstrators, and in some cases even to discover strategies that the demonstrators were not aware of. In one of our tasks the agent is able to discover and exploit a bug in the environment in spite of all the demonstrators completing the task in the intended way…R2D3 performed better than our average human demonstrator on Baseball, Drawbridge, Navigate Cubes and the Wall Sensor tasks. The behavior on Wall Sensor Stack in particular is quite interesting. On this task R2D3 found a completely different strategy than the human demonstrators by exploiting a bug in the implementation of the environment. The intended strategy for this task is to stack two blocks on top of each other so that one of them can remain in contact with a wall mounted sensor, and this is the strategy employed by the demonstrators. However, due to a bug in the environment the strategy learned by R2D3 was to trick the sensor into remaining active even when it is not in contact with the key by pressing the key against it in a precise way.
“Emergent Tool Use From Multi-Agent Autocurricula”, Baker et al 2019:
We originally believed defending against ramp use would be the last stage of emergence in this environment; however, we were surprised to find that yet two more qualitatively new strategies emerged. After 380 million total episodes of training, the seekers learn to bring a box to the edge of the play area where the hiders have locked the ramps. The seekers then jump on top of the box and surf it to the hiders’ shelter; this is possible because the environment allows agents to move together with the box regardless of whether they are on the ground or not. In response, the hiders learn to lock all of the boxes in place before building their shelter.
The OA blog post expands on the noted exploits:
Surprising behaviors: We’ve shown that agents can learn sophisticated tool use in a high fidelity physics simulator; however, there were many lessons learned along the way to this result. Building environments is not easy and it is quite often the case that agents find a way to exploit the environment you build or the physics engine in an unintended way.
- Box surfing: Since agents move by applying forces to themselves, they can grab a box while on top of it and “surf” it to the hider’s location.
- Endless running: Without adding explicit negative rewards for agents leaving the play area, in rare cases hiders will learn to take a box and endlessly run with it.
- Ramp exploitation (hiders): Reinforcement learning is amazing at finding small mechanics to exploit. In this case, hiders abuse the contact physics and remove ramps from the play area.
- Ramp exploitation (seekers): In this case, seekers learn that if they run at a wall with a ramp at the right angle, they can launch themselves upward.
“Fine-Tuning Language Models from Human Preferences”, Ziegler et al 2019 (blog), fine-tuned an English text-generation model based on human ratings; they provide a curious example of a reward-specification bug: the reward was accidentally negated, and this reversal, rather than producing nonsense, produced the (literally) perversely coherent behavior of emitting obscenities to maximize the new score:
Bugs can optimize for bad behavior: One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
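The anecdote is easy to reproduce in miniature. Below is a toy sketch (invented candidate strings and scores, not the Ziegler et al setup or code) of how a single sign flip turns “select the continuation humans rate highest” into “select the one they rate lowest”:

```python
# Toy illustration of a sign-flipped reward: a selector that is supposed
# to maximize a human-preference score instead maximizes its negation.
# Candidate strings and their scores are invented for illustration.

candidates = {
    "a pleasant, on-topic continuation": 0.9,
    "a bland but harmless continuation": 0.5,
    "an offensive continuation humans rate lowest": 0.0,
}

def best(reward_fn):
    """Greedy selection: the candidate with the highest reward wins."""
    return max(candidates, key=reward_fn)

intended = best(lambda text: candidates[text])   # correct sign
buggy    = best(lambda text: -candidates[text])  # a refactor flipped the sign

print(intended)  # the highest-rated continuation
print(buggy)     # the lowest-rated continuation: maximally bad, not gibberish
```

The optimizer works exactly as designed in both cases; the bug only changes *what* is being optimized, which is why the output stays coherent while becoming maximally bad.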
The paper in question discusses general questions of necessary resolution, computing requirements, optics, necessary error rates, and algorithms, but doesn’t describe any implemented systems, much less experiences which resemble the tank story.↩︎
Another interesting detail from Harley et al 1962 about their tank study: in discussing the design of the computer ‘simulation’ of their quasi-NN algorithms, their description of the photographs on pg133 makes it sound as if the dataset was constructed by taking large-scale aerial photographs and then cropping out the small squares containing tanks along with corresponding small squares without tanks—so they only had to process one set of photographs, and the resulting tank/non-tank samples are inherently matched on date, weather, time of day, lighting, general location, roll of film, camera, and photographer. If true, that would make almost all the various suggested tank-problem shortcuts impossible, and would be further evidence that Kanal’s project was not & could not have been a true origin of the tank story (although if it was simply misunderstood and erroneously critiqued, then it could be a tiny kernel of truth from which the urban legend sprang).↩︎
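Read that way, Harley et al’s design is a matched-pairs one, which can be sketched as follows (a hypothetical reconstruction, with a random array standing in for an aerial photograph):

```python
import numpy as np

# Hypothetical sketch of the matched-crop design suggested by Harley et al 1962's
# description: both tank and non-tank samples are cut from the *same* aerial
# photograph, so lighting, film, weather, etc. are identical across classes
# by construction. Coordinates and sizes are invented for illustration.

rng = np.random.default_rng(0)
aerial_photo = rng.random((1000, 1000))  # stand-in for one large aerial image

def crop(photo, top, left, size=32):
    """Cut a small square patch out of the large photograph."""
    return photo[top:top + size, left:left + size]

tank_patch     = crop(aerial_photo, 120, 340)  # square known to contain a tank
non_tank_patch = crop(aerial_photo, 700, 80)   # square from the same frame, no tank

# A classifier trained on such pairs cannot shortcut via time-of-day, film
# stock, or weather: those nuisance variables are constant within a photograph.
assert tank_patch.shape == non_tank_patch.shape
```

Because every nuisance variable is shared within a photograph, any discriminative signal left must come from the patch contents themselves, which is exactly why this design would rule out the legend's "time of day" shortcut.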
This seems entirely reasonable to me, given that hardly any AI research existed at that point. While it’s unclear what results were accomplished immediately thanks to the 1956 workshop, many of the attendees would go on to make major discoveries in AI. Attendee Ray Solomonoff’s wife, Grace Solomonoff (“Ray Solomonoff and the Dartmouth Summer Research Project in Artificial Intelligence, 1956”, 2016) describes the workshop as having vivid discussions but being compromised by receiving only half its funding (so it didn’t last the summer) and by attendees showing up sporadically & for short stays (“Many participants only showed up for a day or even less.”); no agreement was reached on a specific project to tackle, although Solomonoff did write a paper there that he considered important.↩︎
One commenter observes that the NN tank story and its ilk appear to almost always be told about neural networks, and wonders why, when dataset bias ought to be just as much a problem for other statistical/machine-learning methods like decision trees, which are also capable of learning complex nonlinear problems. I could note that these anecdotes also get routinely told about genetic algorithms & evolutionary methods, so it’s not purely neural, and it might be that NNs are victims of their own success: particularly as of 2017, NNs are so powerful & flexible in some areas (like computer vision) that there is little competition, and so any horror stories will probably involve NNs.↩︎
Here, the number of photographs and exactly how they were divided into training/validation sets is an oddly specific detail. This is reminiscent of religions or novels, where originally sparse, undetailed stories become elaborated and ever more detailed, with striking details added to catch the imagination. For example, the Three Magi of the Christian Gospels are unnamed, but later Christians gave them extensive fictional biographies: names (“Names for the Nameless in the New Testament”; one of many sets of given names), symbolism, kingdoms, contemporary successors/descendants, martyrdoms & locations of remains…↩︎
One memorable example of this for me was when the Edward Snowden NSA leaks began.
Surely, given previous instances like differential cryptanalysis or public-key cryptography, the NSA had any number of amazing technologies and moon math beyond the ken of the rest of us? I read many of the presentations with great interest, particularly about how they searched for individuals or data—cutting-edge neural networks? Evolutionary algorithms? Even more exotic techniques? Nope—regexps, linear models, and random forests. Practical but boring. Nor were any major cryptographic breakthroughs exposed via Snowden.
Overall, the NSA corpus indicates that they had the abilities you would expect from a large group of patient programmers with no ethics, given a budget of billions of dollars to spend on a mission whose motto was “hack the planet”, using a comprehensive set of methods ranging from physical break-ins & bugs, theft of private keys, bribery, large-scale telecommunications tapping, and implanted backdoors to the purchase & discovery of unpatched vulnerabilities & the subversion of standards processes. Highly effective in the aggregate, but little that people hadn’t expected or long speculated about in the abstract.↩︎
It amuses me to note when websites or tools are clearly using ImageNet CNNs, because they assume ImageNet categories or provide them as annotations in their metadata, or because they exhibit uncannily good recognition of dogs. Sometimes CNNs are much better than they are given credit for, and commenters assume they fail on problems they actually succeed at; for example, some meme images have circulated claiming that CNNs can’t distinguish fried chicken from Labradoodle dogs, chihuahuas from muffins, or sleeping dogs from bagels—but as amusing as the image-sets are, Miles Brundage reports that Clarifai’s CNN API has little trouble accurately distinguishing man’s worst food from man’s best friend.↩︎
Recht et al 2019’s ImageNet-v2 turns out to illustrate some subtle issues in measuring dataset bias (Engstrom et al 2020): because measurement error in image labels introduces errors into the final dataset, simply comparing a classifier trained on one dataset with its performance on the other and noting that performance fell by X% yields a misleadingly inflated estimate of ‘bias’, since it attributes the combined error of both datasets to the bias. A Rip Van Winkle estimate of CNN overfitting indicates it must be mild—CNNs just aren’t all that algorithmically complex and thus cannot be overly tailored to ImageNet. For much more theory on covariate-shift impacts and decreases/increases in NN performance, see Tripuraneni et al 2021.↩︎
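The inflation effect is simple arithmetic: if the replication dataset’s labels carry extra error, a naive accuracy comparison bills that label error to ‘bias’. A toy calculation with invented rates (illustrating the general point, not Engstrom et al’s actual analysis):

```python
# Toy illustration: label error in a replication dataset inflates the
# apparent accuracy drop ("bias") of a classifier. All rates are invented.

def measured_accuracy(true_acc: float, label_error: float) -> float:
    """Accuracy against noisy labels: the classifier gets credit when
    prediction and label are both right, or (coincidentally) both wrong."""
    return true_acc * (1 - label_error) + (1 - true_acc) * label_error

true_acc_v1, true_acc_v2 = 0.95, 0.92  # real generalization gap: 3 points
err_v1, err_v2 = 0.00, 0.05            # v2 labels noisier than v1

drop_true = true_acc_v1 - true_acc_v2
drop_measured = (measured_accuracy(true_acc_v1, err_v1)
                 - measured_accuracy(true_acc_v2, err_v2))

print(drop_true, drop_measured)  # the measured drop exceeds the real 3-point gap
```

Under these invented numbers the measured drop is about 7.2 points against a real gap of 3, so naively reading the raw accuracy difference as ‘bias’ more than doubles the estimate.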
Lapuschkin et al 2019:
The first learning machine is a model based on Fisher vectors (FV) [31, 32] trained on the PASCAL VOC 2007 image dataset (see Section E). The model and also its competitor, a pretrained Deep Neural Network (DNN) that we fine-tune on PASCAL VOC, both show excellent state-of-the-art test set accuracy on categories such as ‘person’, ‘train’, ‘car’, or ‘horse’ of this benchmark (see Table 3). Inspecting the basis of the decisions with LRP, however, reveals for certain images substantial divergence, as the heatmaps exhibiting the reasons for the respective classification could not be more different. Clearly, the DNN’s heatmap points at the horse and rider as the most relevant features (see Figure 14). In contrast, FV’s heatmap is most focused onto the lower left corner of the image, which contains a source tag. A closer inspection of the data set (of 9963 samples) that typically humans never look through exhaustively, shows that such source tags appear distinctively on horse images; a striking artifact of the dataset that so far had gone unnoticed. Therefore, the FV model has ‘overfitted’ the PASCAL VOC dataset by relying mainly on the easily identifiable source tag, which incidentally correlates with the true features, a clear case of ‘Clever Hans’ behavior. This is confirmed by observing that artificially cutting the source tag from horse images significantly weakens the FV model’s decision while the decision of the DNN stays virtually unchanged (see Figure 14). If we take instead a correctly classified image of a Ferrari and then add to it a source tag, we observe that the FV’s prediction swiftly changes from ‘car’ to ‘horse’ (cf. Figure 2a), a clearly invalid decision (see Section E and Figures 15–20 for further examples and analyses)… For the classification of ships the classifier is mostly focused on the presence of water in the bottom half of an image. Removing the copyright tag or the background results in a drop of predictive capabilities.
A deep neural network, pre-trained on the ImageNet dataset, instead shows none of these shortcomings.
The airplane example is a little more debatable—the presence of a lot of blue sky in airplane images seems like a valid cue to me and not necessarily cheating:
…The SpRAy analysis could furthermore reveal another ‘Clever Hans’ type behavior in our fine-tuned DNN model, which had gone unnoticed in previous manual analysis of the relevance maps. The large eigengaps in the eigenvalue spectrum of the DNN heatmaps for class “aeroplane” indicate that the model uses very distinct strategies for classifying aeroplane images (see Figure 26). A t-SNE visualization (Figure 28) further highlights this cluster structure. One unexpected strategy we could discover with the help of SpRAy is to identify aeroplane images by looking at the artificial padding pattern at the image borders, which for aeroplane images predominantly consists of uniform and structureless blue background. Note that padding is typically introduced for technical reasons (the DNN model only accepts square shaped inputs), but unexpectedly (and unwantedly) the padding pattern became part of the model’s strategy to classify aeroplane images. Subsequently we observe that changing the manner in which padding is performed has a strong effect on the output of the DNN classifier (see Figures 29–32).
Winkler et al 2019: “When reviewing the open-access International Skin Imaging Collaboration database, which is a source of training images for research groups, we found that a similar percentage of melanomas (52 of 2169 [2.4%]) and nevi (214 of 9303 [2.3%]) carry skin markings. Nevertheless, it seems conceivable that either an imbalance in the distribution of skin markings in thousands of other training images that were used in the CNN tested herein or the assignment of higher weights to blue markings only in lesions with specific (though unknown) accompanying features may induce a CNN to associate skin markings with the diagnosis of melanoma. The latter hypothesis may also explain why melanoma probability scores remained almost unchanged in many marked nevi while being increased in others.”↩︎
Getting into more general economic, behavioral, or human situations would be going too far afield, but the relevant analogues are the “principal-agent problem”, “perverse incentives”, the “law of unintended consequences”, the “Lucas critique”, “Goodhart’s law”, and “Campbell’s law”; such alignment problems are only partially dealt with by having ground-truth evolutionary ‘outer’ losses, and avoiding reward hacking remains an open problem (even in theory). Speedrunning communities frequently provide examples of reward hacking, particularly when games are finished faster by exploiting bugs to sequence break; particularly esoteric techniques require outright hacking the “weird machines” present in many games/devices—for example, pannenkoek2012’s ‘parallel universes’ Super Mario 64 hack, which avoids using any jumps by exploiting an integer-overflow bug & modulo wraparound to accelerate Mario to near-infinite speed, passing through the entire map multiple times in order to stop at the right place.↩︎