[R] Adversarial Examples Aren't Bugs, They're Features

jklaise · 2019-05-08T22:18:17+00:00

[deleted]

Superdooper234yf6 · 2019-05-09T04:27:17+00:00

Can I see these non robust features? Can you generate some examples so I can try to catch a glimpse of them. I want to see what dark matter features look like...

gwern · 2019-05-08T17:37:18+00:00

[deleted]

aboveaveragebatman · 2019-05-08T19:18:00+00:00

This is quite a unique interpretation and can directly be related to adversarial training which improves generalisation ability by including adversarial examples in the training set. I have a couple of questions. (1) By this interpretation, would you say that adversarial examples are an inherent weakness of representation learning, and that we would need some new form of learning capable of learning these "adversarial features" without being explicitly added to the training set? (2) Also do you think that there might be ways of extracting much more meaningful information from the network to increase robustness rather than adversrarial training the model?

jklaise · 2019-05-08T20:31:02+00:00

Very interesting work!

I note one thing, if adversarial training works as well as it can on robust features (as defined by humans), it should be able to approach human performance on these datasets, indeed it's interesting to see from Figure 2.b that the adversarially trained model on CIFAR-10 achieves ~91-92% accuracy on the test set, not far from Karpathy's manual attempt on a subset of the test set (94%).

It would be interesting to have more rigorous experiments of this - for example, what if it turns out that adversarially trained networks cannot match human performance on other datasets? Does that mean that the dataset is unusual (e.g. has a relatively large amount of non-robust features) or the model is not sophisticated enough to be as good as humans by only observing robust features? Or maybe humans do use non-robust features when making their classification decisions it's just not perceptible to the decision maker.

Reiinakano · 2019-05-09T14:06:33+00:00

Very cool paper and nice blog post. I like the story of planet Erm :)

I find particularly interesting your graph showing correlation between ability to pick up nonrobust features VS adversarial transferability and how VGG is so far back the other models. VGG is also special in style transfer because other architectures don't work as well as VGG without some sort of parameterization trick (https://distill.pub/2018/differentiable-parameterizations). I think your results might give an alternative explanation of why. Since VGG is unable to capture non robust features, when using it for perceptual loss, it actually looks more correct to humans! I wonder if there is something in VGG that's closer to human priors than other SOTA architectures.

radarsat1 · 2019-05-08T21:21:52+00:00

I'd like to see someone take a frequency-based perspective on adversarial examples and these "robust" features. I really suspect that if there are robust features that are "imperceptible" it is because they are really small and high-frequency, and generally not something that a human will focus on to identify global shape.

I'm not sure where I'm going with that, but it seems to me that for example that a trained network could easily focus on eg. a cat's stripe patterns rather than their "cat shape" to help identify it apart from dogs. That's of course just an example, I mean that maybe similar things happen in a more general sense. So how can shape over pattern be emphasized for classification? It would seem to have something to do with size invariance, which itself is related to sensitivity to spatial frequencies.

yldedly · 2019-05-08T18:25:03+00:00

Very cool! I'm a little confused about the implication that adversarial examples are purely human-centric. What would happen if you did your first experiment on a new, separate test set, rather than the original one? Would the non-robust features still help generalize to the new test set, or are they specific to the distribution of the original training and test set?

avaxzat · 2019-05-08T21:15:22+00:00

This is reminiscent of the FDROP algorithm by Globerson & Roweis (2006). Ideally, a good classifier should perform well even after deletion of such "non-robust" features.

FliesMoreCeilings · 2019-05-08T22:07:20+00:00

Interesting stuff. Do you think this suggests there is some sort of genuine information that our brains are somehow completely missing out on? And if so, is that information also present in regular light, or is it more of an artifact of the way pixels work?

Taleuntum · 2019-05-10T23:10:52+00:00

How is it possible that the classifier in my head ignores some useful features? After a few billion years of training I wouldn't expect to be beaten by some toddler computer model. Isn't it more probable that the non-robust features are features of the data-gathering process only? Eg: to photograph insects, you set your camera to macro mode which may introduce some subtle, (to human eyes) imperceptible perturbation compared to the non-macro mode which is not filtered out by the quality-assurance process in camera factories, because of the same reason: it is imperceptible to the human eye.

I'm a complete layman, sorry if my question is idiotic.

fisforfoxes · 2019-05-08T22:33:50+00:00

Appreciate the well written blog and interesting paper.

One aspect I was curious about, was are these imperceptible features a function of the actual probability distribution of the class, or merely a result of the dataset upon which the model was trained?

For instance, are these same features present if a model was trained on half of cifar/imagenet, and another model was trained on the other half, and then the imperceptible features for the adversarial atack were generated using one model and then tested on another.

Or a model (resnet50) trained on imagenet and another on another image dataset.

What is the universality of these imperceptible features?

MohKohn · 2019-05-09T00:55:20+00:00

Interesting result, this is definitely going to be percolating for a while. As /u/kindlyBasket mentions with the tank example, when we use non-interpretable features, it's difficult to eliminate the possibility that we've accidentally found spurious statistical trends to our data. Do you have any thoughts on how to determine whether any of these non-robust features are still "true" features, i.e. ones we should generally expect to see in contexts where we would actually use dnns?

Towram · 2019-05-09T08:56:59+00:00

Does the transferability part explains that in order to build an adversarial example for one NN, one doesn't need access to the prediction of the NN but only the training data would be enough?

Seerdecker · 2019-05-08T21:20:30+00:00

Nice paper. From figure 2b, it looks like the robust features are less useful overall than the non-robust features for standard accuracy. This is somewhat surprising to me. If this is actually the case and you combine that property with the tendency of neural networks to rely largely on the most predictive features, then you get that neural networks will naturally tend to evolve non-robust features in priority. Does that conclusion hold?

EDIT

I don't think that conclusion actually holds. My mistake is considering that the "non-robust" dataset contains only non-robust features, like in section 3.2. That's not the case here. The "non-robust" dataset actually contains both the robust and non-robust features evolved by the standard classifier.

anarchistruler · 2019-05-09T10:03:59+00:00

I really like the experiments done in the paper, I think it's especially interesting that we can train a generalizing model only from adversarial examples!

However, I'm not sure I would interpret these results the same way. First, I want to note that features as defined in the paper can also be the trained classifiers themselves. This already proves that there are non-robust, generalizable features by itself (from the existence of non-robust classifiers).

Also, I don't agree that the results imply new issues with interpretability. Adversarial examples were already an issue for interpretability before and are hard (or even impossible) to explain to a human. Like adversarial examples, non-robust features don't seem to be a problem for naturally occuring images (as the classifier can still generalize), in the sense that they are unlikely to change the prediction for an image (sampled from a "natural" distribution).

Lastly, one could also argue that adversarial examples and what's shown in the paper are artifacts of the CNN architectures we're using today. Then the paper shows that these artifacts are enough to train a CNN without needing correctly labeled "natural" data. In that interpretation, the non-robust features are not particularly important for a "correct" model (whatever that would mean).

PeterPrinciplePro · 2019-05-09T10:16:45+00:00

Great work. Goodfellow et al. leaped to the conclusion that adversarial examples are an out-of-distribution issue, while they are actually an issue of weighting perceptible and imperceptible features. Imperceptible (high frequency?) features get a lot of importance in computating the outputs which is counter-intuitive to humans because they presumably have a prior to ignore those features.

I cannot quite follow your conclusion that imperceptible features would be helpful for generalization though. You have only proven that one can use them for training to great effect, but it is also known that NNs are somewhat brittle and make wrong predictions with high confidence, which may as well be due to imperceptible features in novel input being similar to some previously seen classes. You have not proven that this is not the case.

An interesting question seems to be whether deep learning can learn well at all without clinging to such imperceptible features. It has often been suggested that neural networks mostly focus on textures, which seems to be very related.

cpury · 2019-05-09T12:55:53+00:00

That's a great blog post! Thanks. It would be amazing if every paper would be accompanied by such a well-written and easy-to-understand text.

The whole concept of "meaningful non-robust features" is new to me, and blowing my mind. Are there any intuitive examples of what such a feature could be? E.g. maybe some subtle pattern of hair that is overlaying a cat and that we ignore? Or are they really so imperceptible to us that there is no example we can understand?

ianismean · 2019-05-09T13:17:01+00:00

Pain is not a bug, it's a feature.

andrew_ilyas · 2019-05-09T15:32:50+00:00

No problem! I’m not sure if we can certainly conclude that though. For example, it could be that the natural data distribution does have these features (after all, these datasets are just pictures of natural objects), but humans have just learned/evolved/whatever to not be able to see these features. In that sense, since classifiers definitely can see these features, we need to rethink how we train classifiers to be invariant to things that humans are definitely invariant to. Happy to discuss this more!

nian_si · 2019-05-10T04:05:45+00:00

Very interesting paper! But I have 2 questions,

I think you did experiments in multi-class setting, right? All your definitions and theory are in the binary setting, which somehow confused me.
Robust features are those that preserve under small perturbations, while non-robust features change drastically under such perturbations? How do you explain such a phenomenon? In other words, although robust and non-robust features are both features, there are also differences between good and bad features even without human's assessment.

Thanks!

TotesMessenger · 2019-05-10T21:58:14+00:00

I'm a bot, bleep, bloop. Someone has linked to this thread from another place on reddit:

^{If you follow any of the above links, please respect the rules of reddit and don't vote in the other threads.} ^(Info ^/ ^Contact)

SadPaperMachine · 2019-05-11T13:02:46+00:00

Does this theory also applies to binarized data, such as MNIST?

MNIST seems hard to encode any sort of non-robust features?

Also, how do you explain the adversarial examples generated by spatial transformation [1]? Is it also a part of the non-robust features?

[1] "Spatially Transformed Adversarial Examples" https://arxiv.org/abs/1801.02612

nian_si · 2019-05-20T17:57:49+00:00

How would your method in section 3.2 perform in the untargeted setting? Specifically, find untargeted adversarial examples and train a network using those examples and their "wrong" labels (output from the standard network).

zergling103 · 2019-05-23T09:27:38+00:00

When you say "adversarial examples...are instead actually meaningful but imperceptible features of the data distribution", you say that these patterns are not unique to the data it was trained on in that they transfer to the test set. But, often times an adversarial perturbations are so tiny that they would be quantized away by the 24-bit RGB format (where each color channel has only 255 steps of granularity). With this being the case, how could non-robust features be part of the dataset when image data is not saved in a format that could represent them (at least not within individual images)?

Would a network trained on an endless live feed of real-world imagery (such as that we are naturally exposed to, with all its variations in pose, lighting, etc.) be sensitive to adversarial perturbations? If so, it would follow with your claims that non-robust features are patterns that would be found in the real world (assuming we can rule out camera or compression artifacts as a cause); they are objectively there in that being sensitive to these patterns would give you a measurable competative edge at detecting, for example, prey. Or that they could be exploited by prey to act as camoflauge. Perhaps this warrants research as to whether or not animals exploit non-robust features in nature as it seems to run contrary to our experiences: You wouldn't expect a bright red insect on green leaf to remain hidden from a bird because it had a particular pattern of noisy fuzz on it that caused the bird to misclassify it as a dangerous animal. In cases where animals do use camo or disguises, they arguably use what could be defined as "robust" features: false eye dots on a butterfly's wing; the color, shape and texture of a leaf; etc.

hillhe2019 · 2019-06-21T07:49:01+00:00

Hi, firstly sorry for my weak literal skill..still deeply inspired by your research, but according to 'We will construct a training set for Dr via a one-to-one mapping x -> xr from the original training set for D ...‘, i guess you actually 'generated' a Dnr which is from your alternative regime where examples labeled('mapped') corrensponding to the semantically new rules-to represent the adv corrensponces, therefore i assume that your methods will eventually meets overfitting- on a reverse fashion...

And since your alternative network probably has even worse non-linear nature, your method will ask for larger scope of independent distributions for training otherwise it will never be well fitted inversely, to a specific quesition domain.

sia_rezaei · 2019-07-13T18:14:52+00:00

I am trying to clarify my understanding. Is the following statement correct? In section 3.2 the perturbation method must be amplifying the non-robust features that are already in the dataset. That is when perturbing a dog image to be classified as a cat, you are amplifying non-robust cat features that are already present in the cat images of the dataset.

In other words, the features that are added by the perturbation are not just any features that make the dog image to be classified as a cat, but are non-robust features that are found in the cat images in the dataset.

learning-new-thingz · 2019-08-07T22:41:49+00:00

I really enjoyed this work, and it definitely puts adversarial examples into more perspective for me. I have two questions, which are slightly different in nature:

It seems like you put all features into three bins:

Robust features - Ones that are useful for generalization and also robust to perturbations in x.
Non-robust but still good features - Useful for generalization but not robust to perturbations .... in the paper you define these by saying that the sign of the correlation with the labels flip.
Bad features - Not useful for generalization or robustness.. artifacts of the training set. Like the ones that dropout/l2 regularization etc. try to prevent.

If this is correct, my question is why the discrete division of features in this way rather than a more continuous characterization? Would it not be possible that the correlation does not flip but the normalized magnitude reduces or something? Or that features are robust for a certain norm ball but non-robust when the ball becomes bigger?

As a budding researcher, I am really curious about the line of investigation that led you to this exploration. If you would be kind enough to reveal what prompted this hypothesis, I would be very interested to hear.

PackedSnacks · 2019-05-08T23:55:22+00:00

[Edit: I guess my question is, if they are features why are we trying to hard to remove them?]

You write:

"Overall, attaining models that are robust and interpretable will require explicitly encoding human priors into the training process."

Sure, but since when in machine learning do we base the entire evaluation of a model prior by how well the model satisfies the prior. Smoothness has long been pitched as an important model prior well before the deep learning boom, but methods which motivate smoothness never based their entire evaluation of the model by measuring how smooth the model was. You would show the smoothness was useful for something, like better generalization or out-of-distribution robustness. So why then in the adversarial example literature do you evaluate the lp-robustness prior with just lp-robustness?

Don't get me wrong, I think there are some interesting ideas in this paper. But you haven't demonstrated that the adversarially trained model is more "robust". The "robust" model is just be using the remaining features in way that also lacks robustness (e.g. the severe drop in performance in fog for the adversarially trained models observed in https://arxiv.org/abs/1901.10513).

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS