all 115 comments

[–][deleted]  (23 children)

[deleted]

    [–][deleted]  (16 children)

    [deleted]

      [–][deleted]  (15 children)

      [deleted]

        [–]jklaise 6 points7 points  (7 children)

        To be fair (and if I'm not misinterpreting this), Figure 2.b shows that robust features are by far the biggest contributor to model accuracy. Given that, it might still be sensible to seek sparse (i.e. human-interpretable) instances for interpretability. I do agree that there is a fundamental trade-off between explanations being human-interpretable and faithful to the model's decision-making process; the real question is whether there is a good balance between the two, e.g. is an explanation method still worthwhile if it's not faithful 10% of the time?

        [–]andrew_ilyas[S] 7 points8 points  (6 children)

        I'm not sure why 2b would indicate that robust features are more important, actually. Figure 2b shows that if I restrict my classifier to only *see* robust features, it will use these to make classifications and yield a robust classifier. It doesn't really say anything about the case where both robust and non-robust features are available to the classifier.

        EDIT: I think a relevant experiment to look at would be Appendix D.6, where we show that for the experiment described in 3.2, it seems as though the non-robust features actually led to better generalization than the robust features for a standard model (top row of the table)---happy to explain the experiment more if needed!

        [–]jklaise 1 point2 points  (5 children)

        What I meant was that the marginal gain in accuracy from adding non-robust features to a robust-feature dataset (either via adversarial training on the original dataset or the robustified dataset as in Fig 2.b) is much smaller than the other way round (as a corollary of the results in 3.2), or have I misinterpreted this?

        EDIT: I think that in the case of D_rand what I said holds - because robust features are "absent" (in expectation uncorrelated with the class), the CIFAR accuracy of 63.3% can be purely attributed to the non-robust features, but the marginal gain in accuracy by "adding" robust features back in would be much bigger (~30%) than the other way round (5-15% from Fig. 2b). Now the case of D_det is certainly more interesting, as there are robust features present and anti-correlated with the class, and the generalizability is pretty remarkable. What I would also like to see is how it would look if the roles of the non-robust and robust features in D_det were flipped - what would the accuracy look like if the non-robust features pointed towards a different class? If the classifier trained on this "flipped" D_det achieved higher accuracy, I would say that again the robust features would seem more important.

        [–]andrew_ilyas[S] 1 point2 points  (1 child)

        I am not sure I am fully interpreting your point correctly, but in general the accuracy of adversarially trained classifiers is pretty dependent on the size of the perturbation set (in this case, the radius of the l2 ball that you want to be robust in). Figure (2b) is for epsilon=0.25, but you can easily pick some larger but still imperceptible epsilon where robust classifiers get much worse accuracy. So in that sense it is hard to make comparisons directly based on (2b). In appendix D.6, we introduce non-robust features of norm eps=0.5 and these look like they generalize pretty well. While we can't be 100% certain that non-robust features are universally more important, I don't think we can conclusively claim that robust features are more important either.

        [–]jklaise 0 points1 point  (0 children)

        I am certainly no expert in the area so apologies for any misinterpretation on my part! I will have to spend more time with the paper for sure, but I agree that it's probably hard to claim conclusively which (if any) set of features are "more important" especially as it's tricky to disentangle them in the first place.

        [–]andrew_ilyas[S] 1 point2 points  (2 children)

        Just saw your edit: so the interesting thing is, what you suggest for D_det is actually exactly what we try to capture in Appendix D.6: in particular, since the classes are consistently mislabeled, a classifier relying on robust features should generalize to standard relabeled CIFAR (x, y+1)---(I can explain this further if needed). Interestingly, our results show that the generalization to (x, y) CIFAR is higher than the generalization to (x, y+1) CIFAR---suggesting that the classifier is paying more attention to the non-robust features than the robust ones.

        [–]jklaise 0 points1 point  (1 child)

        I might be missing something here as I can't seem to find this in the paper: what is the deterministic rule for picking t as a function of y for constructing D_det? Is it the same relabelling rule, i.e. t=y+1 (mod C)? In that case this makes sense to me, and the lack of generalizability is then a very cool result.

        [–]andrew_ilyas[S] 1 point2 points  (0 children)

        Yep! In appendix D.6 we are testing the generalization of D_det on standard CIFAR labeled correctly, vs. standard CIFAR labeled using the relabeling rule used to make D_det (but without changing the images). Non-robust features help you with the former and hurt the latter, and vice-versa for robust features.
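
        As a rough sketch (hypothetical helper names, not our released code), the relabeling rule and the D.6 check look something like this:

```python
# Hypothetical sketch of the D_det construction and the Appendix D.6 check
# discussed above; `targeted_attack`, `standard_model`, and `new_model` are
# placeholders, not the authors' released code.
import torch

NUM_CLASSES = 10  # CIFAR-10

def relabel(y, num_classes=NUM_CLASSES):
    # Deterministic relabeling rule: t = y + 1 (mod C)
    return (y + 1) % num_classes

def build_d_det(x, y, standard_model, targeted_attack):
    # Perturb each image slightly toward its target class t and label it t.
    # The non-robust features of x_adv then point to t, while the (largely
    # unchanged) robust features still point to the original class y.
    t = relabel(y)
    x_adv = targeted_attack(standard_model, x, target=t)
    return x_adv, t

@torch.no_grad()
def d6_check(new_model, x_test, y_test):
    # new_model was trained on D_det; compare its generalization to the
    # correctly labeled test set vs. the deterministically relabeled one.
    pred = new_model(x_test).argmax(dim=1)
    acc_true = (pred == y_test).float().mean().item()                # credits non-robust features
    acc_relabeled = (pred == relabel(y_test)).float().mean().item()  # credits robust features
    return acc_true, acc_relabeled
```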

        [–]maxToTheJ -1 points0 points  (6 children)

        > yes. if these non-robust features are truly important as this paper demonstrates

        Define “important”?

        Aren't those non-robust features more about helping your particular classifier (which is also defined by the way it was trained) avoid spots that make it less effective at generalization?

        [–][deleted]  (5 children)

        [deleted]

          [–]maxToTheJ 0 points1 point  (4 children)

          I still don't see how it isn't possible to have a set of human-interpretable features which are useful and human-uninterpretable features that are useful. The “None” in your comment implies one has to take away from the other?

          [–][deleted]  (3 children)

          [deleted]

            [–]maxToTheJ -1 points0 points  (2 children)

            > so how can we believe that they are actually doing what they claim?

            How can you believe the inverse claim?

            [–][deleted]  (1 child)

            [deleted]

              [–]romansocks 0 points1 point  (0 children)

              Is there a strict definition of 'explain' that I don't know? I don't mean to be trite but it seems that we manage to explain lots of things that we understand with less than 90% accuracy

              [–]SedditorX 12 points13 points  (3 children)

              The snark here is a bit stupid.

              The proper, non-childish takeaway is that more work is needed on both modeling and interpretation.

              Be constructive. The opposite is cheap but contributes nothing.

              [–]dunomaybe 1 point2 points  (1 child)

              Sorry, what does interpretability mean?

              [–]MohKohn 4 points5 points  (0 children)

              roughly, the way that the classification occurs is interpretable by a human. For example, if certain pixels are more important than others, or if the presence of a particular edge matters more. One of the big problems of DNNs is that what exactly the classification hangs on is very difficult to pin down well. This paper is claiming, among other things, that the important features are too strange and statistical in nature to be explained in English.

              [–]Superdooper234yf6 8 points9 points  (2 children)

              Can I see these non-robust features? Can you generate some examples so I can try to catch a glimpse of them? I want to see what dark matter features look like...

              [–]ryanbuck_ 0 points1 point  (0 children)

              Underrated comment

              [–]HaohanWang 0 points1 point  (0 children)

              Maybe you will want to see my recent paper then; we noticed one example of the non-robust features from another perspective. Someone also posted it on Reddit here

              [–][deleted]  (9 children)

              [deleted]

                [–]gwern 7 points8 points  (8 children)

                The tank story isn't real, though.

                If I'm understanding this paper correctly, and going off OP's further comments here, the point is that these 'non-robust' features are real: they exist in the held-out test set and, unlike the apocryphal tank story where the tank detector supposedly failed the instant it was tested on some new field data, really do predict in the wild, predict across architectures, and predict across newly-collected datasets as well ("3.3: Transferability can arise from non-robust features"). ie if you collected some brand-new images to test your non-robust CIFAR-10 model, it'd work just fine.

                [–][deleted]  (3 children)

                [deleted]

                  [–]gwern 7 points8 points  (2 children)

                  I know what I linked in my writeup. And the reason I linked that fish detector was that it demonstrated how the tank story might not happen: if you read to the end, you find that the Kagglers' overfitting was detected by the... heldout dataset. That's not the tank story (which is about failing in the real world on freshly collected data which lacks a systemic dataset bias), but quite the opposite. Generic overfitting is not news.

                  [–]dfarber 1 point2 points  (1 child)

                  No, it's exactly the same as in the apocryphal tank story. Kaggle sucks at dataset curation and their "holdout" was from a different distribution than the train/val, just like in the tank story.

                  [–]gwern 2 points3 points  (0 children)

                  That doesn't seem to be the case. The description is of non-iid photos, which were clustered by boats, and the heldout set was simply pulled from additional boats. That's not a 'different distribution': it's all photos of fish nested within boats. They just overfit by pseudoreplication (ironically, fishery data is one of the common examples of 'pseudoreplication'...).

                  [–]MohKohn 1 point2 points  (3 children)

                  The transferability is about whether the adversarial examples transfer to different models, not whether they transfer to different datasets. At least, that is what they demonstrate in the paper; perhaps one of the other papers they cite demonstrates that they are quite robust to different datasets.

                  Good to know the tank example is not the one to reference from here on out though.

                  [–]gwern 0 points1 point  (2 children)

                  > At least, that is what they demonstrate in the paper; perhaps one of the other papers they cite demonstrates that they are quite robust to different datasets.

                  If they aren't robust to different datasets, you'd better tell OP:

                  > 3.3 Transferability can arise from non-robust features

                  > One of the most intriguing properties of adversarial examples is that they transfer across models with different architectures and independently sampled training sets [Sze+14; PMG16; CRP19].

                  [–]MohKohn 2 points3 points  (1 child)

                  I think you're misreading that (it's a bit of a subtle point, I also misread it as the stronger claim on my first readthrough). What they're saying there is that no matter how you do the train-test split, you get adversarial examples. To speak to the references specifically:

                  • PMG16 only works with MNIST.
                  • Sze+14 is the paper that started the field, and demonstrates that the concept works on several datasets, but doesn't transfer the exact examples
                  • CRP19 demonstrates that the examples transfer from dnns to linear models on the same dataset (haven't yet read it, so there's some other stuff going on, probably)

                  [–]andrewilyas 2 points3 points  (0 children)

                  Yes, thank you for the clarification! Our model can't really say anything about transfer between different datasets, as it is possible that there are higher-order "features" (e.g. "eyes", or "line") that are composed of these non-robust features that in turn transfer across datasets---I'm not actually aware of any papers that have done a thorough study of transfer across datasets though.

                  When we say independently sampled, we're talking about the split.

                  [–]aboveaveragebatman 5 points6 points  (4 children)

                  This is quite a unique interpretation and can directly be related to adversarial training, which improves generalisation ability by including adversarial examples in the training set. I have a couple of questions. (1) By this interpretation, would you say that adversarial examples are an inherent weakness of representation learning, and that we would need some new form of learning capable of learning these "adversarial features" without being explicitly added to the training set? (2) Also do you think that there might be ways of extracting much more meaningful information from the network to increase robustness rather than adversarially training the model?

                  [–]andrew_ilyas[S] 10 points11 points  (3 children)

                  So actually, it's been shown that in the normal sample regime, adversarial training actually *hurts* generalization (https://arxiv.org/abs/1805.12152). This makes perfect sense under our model, since you're essentially stopping the model from learning any non-robust features (even if those features are actually useful!). As for the questions:

                  (1) I would not say this is a weakness of representation learning so much as a weakness of our current methods for representation learning. In particular, our failure to encode meaningful priors into the training process and ensure that classifiers don't pick up features we don't like.

                  (2) Getting better adversarial robustness is still very much an open problem!

                  [–]Inefraspa 1 point2 points  (1 child)

                  I’m not sure this makes sense. Usually, stopping the model from learning non-robust features can help with generalization (dropout, convolutions, data augmentation by translations/flips/rotations).

                  [–]andrew_ilyas[S] 11 points12 points  (0 children)

                  So that is in the case where non-robust features are not actually features, but kind of just artifacts on the training set. What we are showing is that the non-robust features corresponding to adversarial examples are actually *generalizing features,* so stopping a classifier from learning them actually hurts generalization (whereas DA, dropout, etc. are designed to prevent spurious overfitting)

                  [–]aboveaveragebatman 0 points1 point  (0 children)

                  Thanks for the reply! A slightly off-topic question here since you are involved in this domain: What is the state of the art defense against adversaries on ImageNet? I have skimmed through some papers and even saw the robustml website (which stated that 4 defenses on ImageNet were broken) but I am unable to find a definitive answer. I can only see that Madry's adversarial training has been verified as the white box defense that Nicolas Carlini was unable to break, but that was on CIFAR-10.

                  [–]jklaise 4 points5 points  (2 children)

                  Very interesting work!

                  I note one thing: if adversarial training works as well as it can on robust features (as defined by humans), it should be able to approach human performance on these datasets. Indeed, it's interesting to see from Figure 2.b that the adversarially trained model on CIFAR-10 achieves ~91-92% accuracy on the test set, not far from Karpathy's manual attempt on a subset of the test set (94%).

                  It would be interesting to have more rigorous experiments on this - for example, what if it turns out that adversarially trained networks cannot match human performance on other datasets? Does that mean that the dataset is unusual (e.g. has a relatively large amount of non-robust features), or that the model is not sophisticated enough to be as good as humans by only observing robust features? Or maybe humans do use non-robust features when making their classification decisions; it's just not perceptible to the decision maker.

                  [–]andrew_ilyas[S] 5 points6 points  (1 child)

                  Thank you!

                  This is quite interesting, and very related to another recent paper from some of my coauthors: https://arxiv.org/abs/1805.12152. I would note, however, that there are several reasons why we might not be able to get robust neural networks that are as accurate as humans aside from humans using non-robust features:

                  - Neural networks might not actually be able to express some of the features that humans use, due to architecture, capacity etc.

                  - We might need more data for adversarial training: papers have shown (e.g. https://arxiv.org/abs/1804.11285) that adversarial training accuracy doesn't seem to "plateau" for the number of samples we have in our datasets, so it's likely that more samples will give more robustness.

                  [–]ianismean 1 point2 points  (0 children)

                  "express some of the features that humans use, due to architecture, capacity etc." -- Is the claim here that you cannot represent a robust classifier using any of the current architectures? I highly doubt that is true. Just training them seems to be hard (and starting with the right inductive biases/priors).

                  [–]Reiinakano 5 points6 points  (2 children)

                  Very cool paper and nice blog post. I like the story of planet Erm :)

                  I find particularly interesting your graph showing the correlation between ability to pick up non-robust features vs. adversarial transferability, and how VGG is so far behind the other models. VGG is also special in style transfer because other architectures don't work as well as VGG without some sort of parameterization trick (https://distill.pub/2018/differentiable-parameterizations). I think your results might give an alternative explanation of why. Since VGG is unable to capture non-robust features, when using it for perceptual loss, it actually looks more correct to humans! I wonder if there is something in VGG that's closer to human priors than other SOTA architectures.

                  [–]andrew_ilyas[S] 1 point2 points  (1 child)

                  Thanks for the kind comment, and a really interesting point! I hadn’t thought of that/made that connection before, it would be interesting to see if it can be tested.

                  [–]gwern 0 points1 point  (0 children)

                  Presumably you could test it by looking at the other successful style transfer architectures: if they too pick up fewer nonrobust features after the necessary modifications, and the modified versions perform worse on non-styletransfer tasks, that is evidence for this theory that good perceptual losses must avoid nonrobust features.

                  This would be, aside from helping shed considerable light on what was going on with style transfer, an important finding on its own given everything else we might use perceptual losses for, like pretraining GANs or for computing good embedding features - implying it's really important to make them use only robust features for the downstream results to be any good.

                  [–]radarsat1 7 points8 points  (4 children)

                  I'd like to see someone take a frequency-based perspective on adversarial examples and these "robust" features. I really suspect that if there are robust features that are "imperceptible" it is because they are really small and high-frequency, and generally not something that a human will focus on to identify global shape.

                  I'm not sure where I'm going with that, but it seems to me, for example, that a trained network could easily focus on e.g. a cat's stripe patterns rather than their "cat shape" to help identify it apart from dogs. That's of course just an example, I mean that maybe similar things happen in a more general sense. So how can shape over pattern be emphasized for classification? It would seem to have something to do with size invariance, which itself is related to sensitivity to spatial frequencies.

                  [–]samtrano 1 point2 points  (2 children)

                  > So how can shape over pattern be emphasized for classification?

                  Would blurring the images help?

                  [–]Pentabarfnord 2 points3 points  (0 children)

                  See this paper: https://openreview.net/forum?id=Bygh9j09KX (they trained on a style-transferred version of ImageNet).

                  [–]radarsat1 0 points1 point  (0 children)

                  I was thinking that an interesting measure would be how classification suffers as a function of blur kernel size (i.e. frequency cut-off), or something similar. It would be similar to shrinking the image, I suppose, but without actually changing spatial sizes, which allows keeping the same architecture.

                  But not sure how that plays into the ideas presented in the paper under discussion... perhaps looking for "robust features" that are robust to different bandpass filters?
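
                  Something like the following sketch is what I have in mind (just an illustration; `model` and `test_loader` stand in for whatever CIFAR classifier and data loader you already have):

```python
# Rough sketch of the accuracy-vs-blur measurement suggested above (not from
# the paper). `model` and `test_loader` are assumed PyTorch objects.
from scipy.ndimage import gaussian_filter
import torch

def blur_batch(x, sigma):
    # Gaussian-blur each image in an (N, C, H, W) batch; sigma acts as a
    # low-pass cutoff: larger sigma removes more high-frequency content.
    blurred = gaussian_filter(x.cpu().numpy(), sigma=(0, 0, sigma, sigma))
    return torch.from_numpy(blurred).to(x.device)

@torch.no_grad()
def accuracy_vs_blur(model, test_loader, sigmas=(0.0, 0.5, 1.0, 2.0, 4.0)):
    results = {}
    for sigma in sigmas:
        correct, total = 0, 0
        for x, y in test_loader:
            x_in = blur_batch(x, sigma) if sigma > 0 else x
            pred = model(x_in).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
        results[sigma] = correct / total
    return results  # accuracy as a function of the frequency cutoff
```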

                  [–]Nimitz14 1 point2 points  (0 children)

                  You're completely right, there was a recent paper which showed that CNNs use texture and not shapes to classify images. Which makes total sense given how CNNs work (feature maps only take small patches of an image into account).

                  [–]yldedly 3 points4 points  (13 children)

                  Very cool! I'm a little confused about the implication that adversarial examples are purely human-centric. What would happen if you did your first experiment on a new, separate test set, rather than the original one? Would the non-robust features still help generalize to the new test set, or are they specific to the distribution of the original training and test set?

                  [–]andrew_ilyas[S] 4 points5 points  (12 children)

                  The test set already consists of unseen samples, so it's unclear what testing on a different test set would give---the fact that it generalizes to the test set indicates that the adversarial examples are useful in the distributional sense, and not just on finite samples.

                  When we say human-defined, we mean that without humans in the loop, there is no good reason why a classifier should distinguish between robust and non-robust features---it just cares about classification accuracy. It is humans that come in and say that the l2/linf/lp metric is important and that classifiers should not use features that change too much within an lp ball. You can think of this as basically: robustness is defined by the perturbation set, and it is humans that decide the lp perturbation set is meaningful.

                  [–]Fedzbar 1 point2 points  (0 children)

                  Very very interesting perspective. Thank you for sharing this!

                  [–]Pfohlol 1 point2 points  (8 children)

                  I interpret this question as, "how well do these non-robust features generalize under domain shift?". Here, the CIFAR-10 test set is still drawn from the same distribution as the training set. We generally have some notion that some types of features should transfer to related domains and problems despite distributional differences (see literature on domain adaptation). Do you have any thoughts on whether robust and non-robust features, as you define them, are more likely to generalize in this setting?

                  [–]andrew_ilyas[S] 3 points4 points  (7 children)

                  Yeah, this is an interesting question! I don't actually have a good intuition for which features are more robust to covariate shift---our paper only tries to establish that they work on the true distribution. This would be an interesting thing to look at though!

                  [–]yldedly 0 points1 point  (6 children)

                  As far as I can see, your experiment doesn't distinguish between generalization from training to test set (i.e. an estimate of generalization error), and generalization to the true distribution. It's possible, if not likely, that the non-robust features are specific to the original training and test set. It's easy to check, by varying the test set size. Even though the model never sees the samples in the test set, it can still learn features that don't generalize beyond it - when we select the model that minimizes test error, it's based on the assumption that the test error is a good estimate of the generalization error. But it's presumably easier for a model to learn features that generalize to a single unseen finite sample, than features that generalize to all possible test sets. Or am I missing something?

                  [–]andrew_ilyas[S] 0 points1 point  (5 children)

                  We don’t perform any cross validation type stuff in selecting what models to train, so performance on the test set is exactly generalization.

                  [–]yldedly 0 points1 point  (4 children)

                  But performance on the test set is only the generalization in the limit of infinite data... If your test set consisted of one observation, and you stop training your net when the test error is smallest (no cross validation), surely you wouldn't expect it to generalize to the true distribution?

                  [–]andrew_ilyas[S] 0 points1 point  (3 children)

                  Sure but you could make the same argument about any supervised learning algorithm whose performance is measured on a test set. E.g. how do we know if ResNets trained on the training set generalize? I think the community has generally accepted that performance on a 10k-image test set is a good indicator of generalization ability. While in general collecting bigger test sets and getting better estimates of generalization error is a worthwhile direction, it seems kind of orthogonal to what's in this paper (i.e. I can't follow an argument that says that the classifiers trained in this paper are overfitting to the test set, but not normal classifiers trained on the training set---they are using the same amount of information).

                  [–]yldedly 0 points1 point  (2 children)

                  I am saying that normal classifiers, with enough capacity, will overfit to the test set; I don't think that's especially controversial. You're right that the community doesn't usually worry about test error variance. But you claim that non-robust features, which are meaningless to the human eye, generalize as well as robust features that we recognize as e.g. belonging to all cats. Extraordinary claims require extraordinary evidence. You produce a dataset which has robust features pointing to a changed label, and non-robust features pointing to the original label, and show that it still performs on the original test set. How do you know these unusual (to the human eye) features generalize as well as robust ones? Since there are far more parameter settings that generalize to a finite test set, than those that generalize to the true distribution, overfitting to the test set seems like the null hypothesis of choice. You have only shown that they generalize to the original test set, so you can't reject it. If you did the experiment on varying test set sizes, it would strengthen your claim, imo.

                  [–]andrew_ilyas[S] 2 points3 points  (1 child)

                  I'm not sure I'm convinced by your argument, but as a sanity check I just tested it on CIFAR-10.1 (https://github.com/modestyachts/CIFAR-10.1) which was an independently collected CIFAR dataset meant to test whether standard networks were overfitting to the test set (in this paper https://arxiv.org/abs/1806.00451). We still get non-trivial generalization performance on this new set for the experiments described in section 3.1 and 3.2. Hopefully this addresses the concern!

                  [–]davidyan9 0 points1 point  (0 children)

                  I’m wondering about other animals with intelligent object recognition abilities (dogs, chimpanzees I guess). How would they classify an adversarial example? If they would do the same as humans, then non-robust features may not be entirely human-defined.

                  [–]yldedly -1 points0 points  (0 children)

                  I was thinking that maybe the non-robust features generalize from the training set to the test set, but not to the true data distribution. Maybe there's a relation between reliance on non-robust features and the size of the test set, making adversarial examples a subtle form of overfitting to the test set? I honestly don't know much about adversarial examples, but it seems like a useful thing to check on general grounds.

                  [–]avaxzat 3 points4 points  (1 child)

                  This is reminiscent of the FDROP algorithm by Globerson & Roweis (2006). Ideally, a good classifier should perform well even after deletion of such "non-robust" features.

                  [–]andrew_ilyas[S] 1 point2 points  (0 children)

                  I agree! And thank you for the reference!

                  [–]FliesMoreCeilings 2 points3 points  (4 children)

                  Interesting stuff. Do you think this suggests there is some sort of genuine information that our brains are somehow completely missing out on? And if so, is that information also present in regular light, or is it more of an artifact of the way pixels work?

                  [–]andrew_ilyas[S] 2 points3 points  (3 children)

                  Yes, our thesis is that there is lots of genuinely useful information that we cannot see. I'm actually not sure why we don't see these features though (and I suspect this might be more of a study of humans than of ML models)!

                  [–]FliesMoreCeilings 1 point2 points  (2 children)

                  How large are the pixel changes in the non-robust features exactly? If classifications can flip based on tiny color differences alone I could imagine our eyes simply aren't sensitive enough to find these patterns.

                  [–]andrew_ilyas[S] 0 points1 point  (1 child)

                  they are really small! the norm of the perturbation is bounded to 0.5 in most of our experiments, which means that you can only change a single pixel by 1/2 (pixels are scaled to [0,1]), or change every pixel by less than 0.01. Definitely imperceptible, actually.
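
                  For concreteness, a quick back-of-the-envelope check of those numbers (assuming CIFAR-10's 3x32x32 shape):

```python
# Back-of-the-envelope check of the l2 budget mentioned above:
# eps = 0.5 with pixels scaled to [0, 1] and CIFAR-10 images of shape 3x32x32.
import math

eps = 0.5
n_values = 3 * 32 * 32  # 3072 color values per image

single_pixel_change = eps                     # spend the whole budget on one value
per_pixel_change = eps / math.sqrt(n_values)  # spread the budget evenly over all values

print(single_pixel_change, round(per_pixel_change, 4))  # 0.5, ~0.009
```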

                  [–]gwern 2 points3 points  (0 children)

                  My mind is definitely kind of blown that these tiny adversarial perturbations are actually perturbing 'real' features; but on the other hand, we live in a world where Eulerian flow lets you measure heartrate and other things from tiny pixel-level aspects of normal camera footage despite being completely invisible to humans, so maybe this should not be too surprising. (Paging Peter Watts...)

                  [–]Taleuntum 2 points3 points  (1 child)

                  How is it possible that the classifier in my head ignores some useful features? After a few billion years of training I wouldn't expect to be beaten by some toddler computer model. Isn't it more probable that the non-robust features are features of the data-gathering process only? Eg: to photograph insects, you set your camera to macro mode which may introduce some subtle, (to human eyes) imperceptible perturbation compared to the non-macro mode which is not filtered out by the quality-assurance process in camera factories, because of the same reason: it is imperceptible to the human eye.

                  I'm a complete layman, sorry if my question is idiotic.

                  [–]lahwran_ 0 points1 point  (0 children)

                  From this thread, my current thinking is that the human vision system optimizes perception primarily for compression and prediction, rather than classification - classification is a subtask of many things humans do, but my current thinking is that it's a secondary objective rather than a primary one. I've been thinking for a while that someone should plug together a really big actual-compression autoencoder network, multi-objective with GAN training on the generator, multi-objective with classifier training the compressor side. Unfortunately I don't have the resources to do this myself at the moment.

                  [–]fisforfoxes 1 point2 points  (3 children)

                  Appreciate the well written blog and interesting paper.

                  One aspect I was curious about: are these imperceptible features a function of the actual probability distribution of the class, or merely a result of the dataset upon which the model was trained?

                  For instance, are these same features present if a model was trained on half of CIFAR/ImageNet, and another model was trained on the other half, and then the imperceptible features for the adversarial attack were generated using one model and then tested on the other?

                  Or a model (resnet50) trained on imagenet and another on another image dataset.

                  What is the universality of these imperceptible features?

                  [–]andrew_ilyas[S] 1 point2 points  (2 children)

                  The universality we are looking at here is just over the data distribution, so in this case, all CIFAR-10 images. That means that splitting 1/2-1/2 would work (and indeed adversarial examples tend to transfer from one half of a dataset to the other half), but we can't say anything about the universality of these features beyond the distribution (i.e. between datasets).

                  Crucially though, this is very different from overfitting, since we show that these features can be found using only the training set and generalize to a completely unseen test set---so it really is about the distribution, rather than about the specific training set.

                  [–]fisforfoxes 0 points1 point  (1 child)

                  Interesting to think about. I would be curious to know the performance and effects of various experiments on two completely different comprehensive image datasets, as surely the more robust features would be universal in some sense, but at what level would the imperceptible features translate across datasets, and how effective would the adversarial perturbation be across the datasets/models.

                  [–]andrew_ilyas[S] 0 points1 point  (0 children)

                  I agree, this would definitely be something interesting to look into in future work!

                  [–]MohKohn 1 point2 points  (1 child)

                  Interesting result, this is definitely going to be percolating for a while. As /u/kindlyBasket mentions with the tank example, when we use non-interpretable features, it's difficult to eliminate the possibility that we've accidentally found spurious statistical trends to our data. Do you have any thoughts on how to determine whether any of these non-robust features are still "true" features, i.e. ones we should generally expect to see in contexts where we would actually use dnns?

                  [–]andrew_ilyas[S] 2 points3 points  (0 children)

                  Actually the entire thesis of our paper is that these are not simply spurious correlations on the training set, and are actually helpful on the distribution (as they help you generalize to unseen data). In that sense, these are indeed true features. An interesting future direction would be to see if these features are also useful for different distributions—but we haven’t looked into this too much.

                  [–]Towram 1 point2 points  (0 children)

                  Does the transferability part explain that, in order to build an adversarial example for one NN, one doesn't need access to the predictions of the NN, and the training data alone would be enough?

                  [–]Seerdecker 0 points1 point  (3 children)

                  Nice paper. From figure 2b, it looks like the robust features are less useful overall than the non-robust features for standard accuracy. This is somewhat surprising to me. If this is actually the case and you combine that property with the tendency of neural networks to rely largely on the most predictive features, then you get that neural networks will naturally tend to evolve non-robust features in priority. Does that conclusion hold?

                  EDIT

                  I don't think that conclusion actually holds. My mistake is considering that the "non-robust" dataset contains only non-robust features, like in section 3.2. That's not the case here. The "non-robust" dataset actually contains both the robust and non-robust features evolved by the standard classifier.

                  [–]andrew_ilyas[S] 1 point2 points  (2 children)

                  Thank you! I'm not sure if that is the right conclusion to draw from Figure (2b), since in that case we are actually restricting the classifier to *only see* robust features. We did try to investigate the strength of robust vs non-robust features in Appendix D.6 and the results did suggest that perhaps non-robust features are actually more important (happy to explain this more in depth if you want)!

                  So, right conclusion but different figure I think :)

                  [–]Seerdecker 0 points1 point  (1 child)

                  Thanks. I realized this as you were typing your answer (see the edit) :-)

                  [–]andrew_ilyas[S] 1 point2 points  (0 children)

                  Nice :) Hopefully D.6 sheds some more light on this too.

                  [–]anarchistruler 0 points1 point  (3 children)

                  I really like the experiments done in the paper, I think it's especially interesting that we can train a generalizing model only from adversarial examples!

                  However, I'm not sure I would interpret these results the same way. First, I want to note that features as defined in the paper can also be the trained classifiers themselves. This already proves that there are non-robust, generalizable features by itself (from the existence of non-robust classifiers).

                  Also, I don't agree that the results imply new issues with interpretability. Adversarial examples were already an issue for interpretability before and are hard (or even impossible) to explain to a human. Like adversarial examples, non-robust features don't seem to be a problem for naturally occurring images (as the classifier can still generalize), in the sense that they are unlikely to change the prediction for an image (sampled from a "natural" distribution).

                  Lastly, one could also argue that adversarial examples and what's shown in the paper are artifacts of the CNN architectures we're using today. Then the paper shows that these artifacts are enough to train a CNN without needing correctly labeled "natural" data. In that interpretation, the non-robust features are not particularly important for a "correct" model (whatever that would mean).

                  [–]andrew_ilyas[S] 1 point2 points  (2 children)

                  Thanks for the comment and questions! Answers below:

                  1. Yes, features can be trained classifiers, and thus the existence of both classes of features is guaranteed (since in turn, robust classifiers can also be robust features)---this was actually our sanity check for existence of a dichotomy. However, the existence of a feature is different from a network actually using it. In fact, as far as I know, all of the previous work treated adversarial examples as weird statistical phenomena, training-set overfitting, etc. So, even though our point of view seems pretty natural in hindsight, this was not so clear initially (even to us). Our point, which seems to be in agreement with you, is that adversarial examples actually have a really natural interpretation as generalizing features, and are not just meaningless changes to the input.
                  2. I would actually disagree about the interpretability point. Before this, one could argue (and in fact, many do argue) that adversarial examples pose an *optimization* problem for interpretability---that is, the weird brittleness of neural networks prevents you from being able to find the "true features" that it is using. What we are showing here is that in fact, adversarial examples *are* some of the true features, and are useful beyond flipping the classification for a single image, thus post-hoc methods for finding better-looking features are actually purposefully hiding features that the model uses to make classifications.
                  3. I don't think our paper disagrees with this last point at all! Our point is kind of: these examples might be CNN artifacts, or results of overparameterization, etc., but there is not really a fundamental reason why we shouldn't use these features, given that they exist. Thus, it could be that we can design architectures that are unable to even see these features (in a similar fashion to humans)---this would correspond exactly to "building in priors into classifier training," which is what our paper aims to suggest.

                  [–]anarchistruler 0 points1 point  (1 child)

                  Thanks for your answer :)

                  Yes, I guess my disagreement was mostly based on how we interpret what is a "true feature" or what it means to "use a feature". This is not really clear to me yet, how to interpret that...

                  But for point 2 again: I still don't think that not showing these non-robust features is really a problem for interpretability in most cases, because they cause differing predictions only for adversarial examples, which are unlikely in reality. There is an infinite amount of features anyway, and choosing a subset of these will always leave out some interesting insights. The question is really what the goals of having an explanation of the model are, and I think in many cases ignoring these non-robust features is not problematic.

                  [–]andrew_ilyas[S] 0 points1 point  (0 children)

                  Glad I could clear up some of the points! For (2), I think whether one believes it’s a problem is sort of orthogonal to what our statement is. All we are trying to show is that these are not optimization barriers or nuisances; they are actually features. So when we design interpretability techniques to get around them, we are really hiding something, and not just doing better optimization (now, whether we think it’s good or bad to hide these is a different subject of debate :) )

                  [–]PeterPrinciplePro 0 points1 point  (4 children)

                  Great work. Goodfellow et al. leaped to the conclusion that adversarial examples are an out-of-distribution issue, while they are actually an issue of weighting perceptible and imperceptible features. Imperceptible (high frequency?) features get a lot of importance in computing the outputs, which is counter-intuitive to humans because they presumably have a prior to ignore those features.

                  I cannot quite follow your conclusion that imperceptible features would be helpful for generalization though. You have only proven that one can use them for training to great effect, but it is also known that NNs are somewhat brittle and make wrong predictions with high confidence, which may as well be due to imperceptible features in novel input being similar to some previously seen classes. You have not proven that this is not the case.

                  An interesting question seems to be whether deep learning can learn well at all without clinging to such imperceptible features. It has often been suggested that neural networks mostly focus on textures, which seems to be very related.

                  [–]andrew_ilyas[S] 1 point2 points  (3 children)

                  Thank you for the kind comment! I have answered the questions below, let me know if they need further clarification.

                  Our claim that they are helpful for generalization is supported by the fact that models trained on adversarial examples actually generalize and do well on unseen samples from the regular test set (3.2). If adversarial examples were just random brittleness artifacts, this would not happen.

                  Since we are defining a feature as anything helpful for generalization, adversarial examples correspond to features by definition.

                  As for your last point, one can think of robust training as trying to accomplish this very task! If we really want to accomplish this though, we need to come up with better notions of robustness than just the lp ball (for example, robustness to texture as you mention, or robustness to rotation, etc). This is a great question and definitely an important research direction.

                  [–]PeterPrinciplePro 0 points1 point  (2 children)

                  > Since we are defining a feature as anything helpful for generalization, adversarial examples correspond to features by definition.

                  I see. It makes sense based on this definition. But is it not conceivable that feature detectors of subtle features may more likely be elicited by random noise or far-out-of-distribution inputs, leading to highly confident wrong predictions? Then, on balance, imperceptible features may harm generalizability more than they help. That was my point.

                  [–]andrew_ilyas[S] 1 point2 points  (1 child)

                  Yes, that makes sense—I think our paper is showing that these features actually help generalization on the distribution (so by definition, out of distribution inputs are not considered). I think you are referring to the phenomenon of covariate shift—there, we still don’t have a good intuition of how non robust features work (an interesting future direction though!)

                  [–]PeterPrinciplePro 0 points1 point  (0 children)

                  Right, I meant covariate shift. Thanks for your answers.

                  [–]cpury 0 points1 point  (2 children)

                  That's a great blog post! Thanks. It would be amazing if every paper would be accompanied by such a well-written and easy-to-understand text.

                  The whole concept of "meaningful non-robust features" is new to me, and blowing my mind. Are there any intuitive examples of what such a feature could be? E.g. maybe some subtle pattern of hair that is overlaying a cat and that we ignore? Or are they really so imperceptible to us that there is no example we can understand?

                  [–]andrew_ilyas[S] 2 points3 points  (1 child)

                  Thank you so much! Glad you enjoyed the blog post.

                  The concept was really surprising to us too! In general though, non-robust features are defined by the perturbation set that you're trying to be robust to. So if my perturbation set is a really small lp ball, then chances are you can't really see the patterns corresponding to non-robust features. On the other hand, for different perturbation sets these things can be quite visible---as a kind of exaggerated/contrived example, if you want to be robust to changes in background (your adversarial perturbation is the ability to change the background), then "sky" is actually a non-robust feature, since you can flip this within your perturbation set, even if "sky" is genuinely useful for telling apart say, planes vs dogs. Hopefully that makes things clearer, otherwise happy to elaborate further!

                  [–]cpury 0 points1 point  (0 children)

                  Thank you! I will need some time to process and think about this. The sky example definitely helps.

                  Is it safe to conclude from this that the real, natural data distribution does not contain adversarial examples? In other words, unless someone crafts them on purpose (to mess with us humans), they are not a problem?

                  [–]ianismean 0 points1 point  (0 children)

                  Pain is not a bug, it's a feature.

                  [–]andrew_ilyas[S] 0 points1 point  (0 children)

                  No problem! I’m not sure if we can certainly conclude that though. For example, it could be that the natural data distribution does have these features (after all, these datasets are just pictures of natural objects), but humans have just learned/evolved/whatever to not be able to see these features. In that sense, since classifiers definitely can see these features, we need to rethink how we train classifiers to be invariant to things that humans are definitely invariant to. Happy to discuss this more!

                  [–]nian_si 0 points1 point  (4 children)

                  Very interesting paper! But I have 2 questions,

                  1. I think you did the experiments in a multi-class setting, right? All your definitions and theory are in the binary setting, which somehow confused me.

                  2. Robust features are those that are preserved under small perturbations, while non-robust features change drastically under such perturbations? How do you explain such a phenomenon? In other words, although robust and non-robust features are both features, there are also differences between good and bad features even without a human's assessment.

                  Thanks!

                  [–]nian_si 0 points1 point  (0 children)

                  Also, it seems like magic that standard training on the D_rand also generalizes to the original test set, since the classifier has no clue about how to "flip back".

                  [–]andrew_ilyas[S] 0 points1 point  (2 children)

                  Thanks for the comment!

                  1. It’s really straightforward to generalize the definitions to multiclass (just a matter of making scalars into vectors; see the sketch below), and it’s a pretty well-known argument in math. We just used binary classification to make things clearer.

                  2. The idea is that there are infinite features F, and then the dichotomy between robust and non-robust is created when we choose a perturbation set \Delta. Thus, since humans choose the perturbation set \Delta, the dichotomy is basically human determined.
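
                  Roughly, the binary-case definitions look like this (paraphrased, not verbatim from the paper):

```latex
% Binary-case definitions, paraphrased (labels y in {-1, +1}, feature f,
% data distribution D, perturbation set \Delta):
\[
\text{$f$ is $\rho$-useful:}\qquad
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\,y\cdot f(x)\,\big]\;\ge\;\rho,
\]
\[
\text{$f$ is $\gamma$-robustly useful:}\qquad
\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\inf_{\delta\in\Delta(x)}\,y\cdot f(x+\delta)\Big]\;\ge\;\gamma.
\]
% A useful, non-robust feature is \rho-useful for some \rho > 0 but not
% \gamma-robustly useful for any \gamma \ge 0. The multiclass version just
% replaces the scalar score y * f(x) with a per-class score.
```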

                  [–]nian_si 0 points1 point  (1 child)

                  Another thing is about theorem 1 in section 4. It seems that L_adv - L = -d < 0 if C = 0, which is odd. And could you please explain more on how your theorems relate to the existence of non-robust features?

                  [–]andrew_ilyas[S] 0 points1 point  (0 children)

                  Sorry, theorem 1 has a typo, and it should be for any C > \sigma_{min}(\Sigma^*)---will fix this in the next revision, thanks!

                  So, informally, in the theory section you have some input space with a weird unknown geometry (Mahalanobis distance)---you can think of that geometry as being the geometry induced by the features. The point is that a huge misalignment between this distance and the l2 metric corresponds exactly to a non-robust feature, since it literally means that "moving a little in l2 distance means moving a lot in feature distance." The more misaligned the metrics are, the higher your vulnerability.
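
                  As a toy illustration of that misalignment (not a formal statement from the paper):

```latex
% Toy illustration: feature-space ("Mahalanobis-like") distance vs. l2.
\[
d_{\Sigma}(x,\,x+\delta)\;=\;\sqrt{\delta^{\top}\Sigma^{-1}\delta},
\qquad
\|\delta\|_{2}\;=\;\sqrt{\delta^{\top}\delta}.
\]
% If \delta lies along an eigenvector of \Sigma with a tiny eigenvalue
% \lambda_{\min}, then
\[
d_{\Sigma}(x,\,x+\delta)\;=\;\frac{\|\delta\|_{2}}{\sqrt{\lambda_{\min}}},
\]
% so a perturbation that is tiny in l2 can be huge in feature distance as
% \lambda_{\min} \to 0: that direction behaves like a non-robust feature.
```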


                  [–]SadPaperMachine 0 points1 point  (0 children)

                  Does this theory also apply to binarized data, such as MNIST?

                  It seems hard for MNIST to encode any sort of non-robust features?

                  Also, how do you explain the adversarial examples generated by spatial transformation [1]? Is it also a part of the non-robust features?

                  [1] "Spatially Transformed Adversarial Examples" https://arxiv.org/abs/1801.02612

                  [–]nian_si 0 points1 point  (1 child)

                  How would your method in section 3.2 perform in the untargeted setting? Specifically, find untargeted adversarial examples and train a network using those examples and their "wrong" labels (output from the standard network).

                  [–]andrew_ilyas[S] 0 points1 point  (0 children)

                  Hi sorry for such a late response (I only realized I missed some comments because I got a reddit notification :P).

                  Yep, the experiment still works with untargeted attacks.
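
                  Concretely, the untargeted variant would look something like this (hypothetical attack/model names, not our exact code):

```python
# Sketch of the untargeted variant discussed above (placeholders throughout;
# not the authors' code). The relabeled dataset uses whatever wrong class the
# standard network is pushed into by an untargeted attack.
import torch

def build_untargeted_dataset(x, y, standard_model, untargeted_attack):
    # Find a small untargeted perturbation that flips the standard model's
    # prediction, then adopt that (wrong) prediction as the new label.
    x_adv = untargeted_attack(standard_model, x)
    with torch.no_grad():
        y_new = standard_model(x_adv).argmax(dim=1)
    keep = y_new != y  # drop examples where the attack failed to change the label
    return x_adv[keep], y_new[keep]

# A fresh network trained on (x_adv, y_new) is then evaluated on the original,
# correctly labeled test set, as in Section 3.2.
```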

                  [–]zergling103 0 points1 point  (0 children)

                  When you say "adversarial examples...are instead actually meaningful but imperceptible features of the data distribution", you say that these patterns are not unique to the data it was trained on in that they transfer to the test set. But often, adversarial perturbations are so tiny that they would be quantized away by the 24-bit RGB format (where each color channel has only 255 steps of granularity). With this being the case, how could non-robust features be part of the dataset when image data is not saved in a format that could represent them (at least not within individual images)?

                  Would a network trained on an endless live feed of real-world imagery (such as that we are naturally exposed to, with all its variations in pose, lighting, etc.) be sensitive to adversarial perturbations? If so, it would follow from your claims that non-robust features are patterns that would be found in the real world (assuming we can rule out camera or compression artifacts as a cause); they are objectively there, in that being sensitive to these patterns would give you a measurable competitive edge at detecting, for example, prey. Or that they could be exploited by prey to act as camouflage. Perhaps this warrants research as to whether or not animals exploit non-robust features in nature, as it seems to run contrary to our experiences: you wouldn't expect a bright red insect on a green leaf to remain hidden from a bird because it had a particular pattern of noisy fuzz on it that caused the bird to misclassify it as a dangerous animal. In cases where animals do use camo or disguises, they arguably use what could be defined as "robust" features: false eye dots on a butterfly's wing; the color, shape and texture of a leaf; etc.

                  [–]hillhe2019 0 points1 point  (0 children)

                  Hi, firstly sorry for my weak English skills... I'm still deeply inspired by your research, but according to 'We will construct a training set for Dr via a one-to-one mapping x -> xr from the original training set for D ...', I guess you actually 'generated' a Dnr from your alternative regime where examples are labeled ('mapped') according to semantically new rules to represent the adversarial correspondences; therefore I assume that your method will eventually meet overfitting, in a reverse fashion...

                  And since your alternative network probably has an even worse non-linear nature, your method will ask for a larger scope of independent distributions for training, otherwise it will never be well fitted, inversely, to a specific question domain.

                  [–]sia_rezaei 0 points1 point  (0 children)

                  I am trying to clarify my understanding. Is the following statement correct? In section 3.2 the perturbation method must be amplifying the non-robust features that are already in the dataset. That is when perturbing a dog image to be classified as a cat, you are amplifying non-robust cat features that are already present in the cat images of the dataset.

                  In other words, the features that are added by the perturbation are not just any features that make the dog image to be classified as a cat, but are non-robust features that are found in the cat images in the dataset.

                  [–]learning-new-thingz 0 points1 point  (1 child)

                  I really enjoyed this work, and it definitely puts adversarial examples into more perspective for me. I have two questions, which are slightly different in nature:

                  1. It seems like you put all features into three bins:
                  • Robust features - Ones that are useful for generalization and also robust to perturbations in x.
                  • Non-robust but still good features - Useful for generalization but not robust to perturbations; in the paper you define these by saying that the sign of the correlation with the labels flips.
                  • Bad features - Not useful for generalization or robustness; artifacts of the training set, like the ones that dropout/l2 regularization etc. try to prevent.

                  If this is correct, my question is: why the discrete division of features in this way rather than a more continuous characterization? Would it not be possible that the correlation does not flip but its normalized magnitude merely shrinks? Or that features are robust for a certain norm ball but non-robust when the ball becomes bigger?

                  2. As a budding researcher, I am really curious about the line of investigation that led you to this exploration. If you would be kind enough to reveal what prompted this hypothesis, I would be very interested to hear it.

                  [–]andrew_ilyas[S] 0 points1 point  (0 children)

                  1. Yes, I think you understood correctly! The reason we care about "flipping" is that for any given image, what we conceptually care about is which class a feature is "pointing to," i.e. which class this feature adds evidence for. That said, note that the robust/non-robust feature dichotomy is actually determined by the human-chosen perturbation budget \epsilon. This dichotomy is a natural way to think about things because, for any given choice of epsilon, there are some features that we can still use (the robust ones, which still provide evidence for the right class regardless of perturbation), and some features that will actually hurt us if we use them (even if they might help if epsilon were smaller). I've put the rough formal definitions at the end of this comment.
                  2. In this case, we were partly thinking along the lines of previous work from our lab: https://arxiv.org/pdf/1805.12152.pdf, which provides a theoretical setting where non-robust features arise. When we thought about it more, we realized that if this theoretical model were correct, then adversarial examples should be "features" rather than the "bugs" we typically took them for. We actually designed our experiments (in particular Section 3.2) to see if this conceptual model held up (we were initially *very* skeptical). Surprisingly, it did! So we came up with a tighter conceptual model and tried to build a series of experiments that would test whether the model was predictive. Hopefully this helped!
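
                  For reference, the definitions in question, roughly and up to notation (binary labels y \in {-1, +1}, \Delta(x) the allowed perturbation set, e.g. an l2 ball of radius \epsilon):

                      \text{$\rho$-useful: } \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, y \cdot f(x) \,\big] \ge \rho,
                      \qquad
                      \text{$\gamma$-robustly useful: } \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \,\Big] \ge \gamma .

                  A useful, non-robust feature is then one that is \rho-useful for some \rho > 0 but not \gamma-robustly useful for any \gamma >= 0; usefulness itself is a continuous quantity, and the dichotomy only appears once you fix \Delta(x).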

                  [–]PackedSnacks 0 points1 point  (9 children)

                  [Edit: I guess my question is, if they are features, why are we trying so hard to remove them?]

                  You write:

                  "Overall, attaining models that are robust and interpretable will require explicitly encoding human priors into the training process."

                  Sure, but since when in machine learning do we base the entire evaluation of a model prior on how well the model satisfies that prior? Smoothness was pitched as an important model prior well before the deep learning boom, but methods that motivate smoothness never based their entire evaluation of the model on measuring how smooth the model was. You would show that the smoothness was useful for something, like better generalization or out-of-distribution robustness. So why then, in the adversarial example literature, do you evaluate the lp-robustness prior with just lp-robustness?

                  Don't get me wrong, I think there are some interesting ideas in this paper. But you haven't demonstrated that the adversarially trained model is more "robust". The "robust" model is just using the remaining features in a way that also lacks robustness (e.g. the severe drop in performance in fog for the adversarially trained models observed in https://arxiv.org/abs/1901.10513).

                  [–]andrew_ilyas[S] 1 point2 points  (5 children)

                  I would refer you to our results about interpretability—while lp robustness is not the only important prior, we show that having this robustness is essential to having machine learning models that are more interpretable. In particular, classifiers that are not lp robust will use features that are not human-perceptible.

                  In that sense, I think that your claim that “smoothness is only useful for smoothness” is incorrect.

                  Now, I agree that finding better notions of robustness is an important problem, and specifically as we attain these better notions, we’ll be able to eliminate even more non-robust features. But our work is showing precisely that unless you encode these priors, models are not going to be interpretable.

                  [–]PackedSnacks 0 points1 point  (4 children)

                  I'm not arguing that smoothness is only useful for smoothness. If adversarial example papers motivate smoothness for interpretability and focus the paper on interpretability, as in your work, then that's fantastic. But you write in your introduction that "From this point of view, it is natural to treat adversarial robustness as a goal that can be disentangled and pursued independently from maximizing accuracy". This seems to suggest that you want smoothness to be the goal, not interpretability.

                  On the interpretability side, did you investigate any relationship with the DeepViz techniques? The visualizations performed on trained models seem similar to your robust dataset: https://distill.pub/2017/feature-visualization/.

                  Did you try testing the naturally trained model on the robust data distribution? Maybe the naturally trained model uses both the "non-robust" features and the "robust" features. It could be that it is using both, and that feature viz is restricting the optimization to correlate along what you call the "robust" direction.

                  [–]andrew_ilyas[S] 0 points1 point  (3 children)

                  Glad to hear the interpretability implications are interesting!

                  I think the introduction is rather clear? There is a phenomenon of adversarial vulnerability (whether you believe it is an interesting phenomenon or not is up to you, but it is a sufficiently interesting one to us and the community—see for example https://arxiv.org/abs/1801.02774). We explain it, and then show that this explanation has implications for interpretability.

                  Regarding the link you posted, one of the implications I mentioned earlier is that these methods are giving us nice pictures at the expense of somewhat misleading us about what the model is actually depending on (the non-robust features), so it's debatable whether the resulting images are semantically meaningful with respect to how the model is making its decisions.

                  [–]PackedSnacks 0 points1 point  (2 children)

                  Does the naturally trained model depend only on the non-robust features? What happens if you measure the accuracy of the clean model on the robust data distribution? If it's still accurate on the robust distribution, then I think what you are showing is that deepviz fails to identify all of the relevant features the model is using, but it could be that the features it does visualize are still relevant to classification. That would be a different conclusion than the visualization completely misleading us.

                  [–]andrew_ilyas[S] 0 points1 point  (1 child)

                  Our claim is never that the natural model *only* uses non-robust features, but clearly (as adversarial examples show) these non-robust features are important enough to flip the classification, so I would say that not presenting them among the features at all is not representative of what the model is actually learning.

                  [–]PackedSnacks 0 points1 point  (0 children)

                  I'm skeptical about this measure of feature importance. Take, for example, logistic regression on 100 features that have each been normalized to have unit variance. One simple interpretability method might be to visualize the top features based on the magnitude of the model coefficients. The model technically depends on all the features with non-zero weight, but the features with small coefficients intuitively do not matter as much for classification. What you are doing is analogous to assigning unusually high values to these less significant features and then pointing out that the interpretability method didn't identify them. I don't think you can conclude from that analysis that the interpretability method isn't representative of what the model is learning.

                  If you want to argue that these features are important, you should remove them and measure the performance drop of the resulting model.
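
                  (Roughly the kind of comparison I mean, as a quick sklearn sketch on synthetic data; the dataset, numbers and cutoff choices here are purely illustrative:)

                      import numpy as np
                      from sklearn.datasets import make_classification
                      from sklearn.linear_model import LogisticRegression
                      from sklearn.model_selection import train_test_split

                      # 100 unit-variance features; rank them by |coefficient|, then "remove"
                      # (zero out) the low-ranked ones and measure the accuracy drop.
                      X, y = make_classification(n_samples=5000, n_features=100, n_informative=20, random_state=0)
                      X = (X - X.mean(axis=0)) / X.std(axis=0)
                      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

                      clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
                      order = np.argsort(-np.abs(clf.coef_[0]))      # features ranked by |coef|

                      for k in (100, 20, 10):                        # keep only the top-k features
                          X_ablate = X_test.copy()
                          X_ablate[:, order[k:]] = 0.0               # ablate the less important features
                          print(k, clf.score(X_ablate, y_test))

                  (A small drop when the low-|coef| features are zeroed out would support the visualization being representative; a large drop would support your point.)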

                  [–]andrew_ilyas[S] 0 points1 point  (2 children)

                  Replying to the edit: This is a neat question! In our paper, we aren't arguing that we *have to* remove them per se, but seeing as how one of the goals of ML is to get a human-meaningful yet accurate description of what a model depends on, these features seem like a fundamental barrier.

                  If indeed all we care about is high accuracy and we don't really care about human-meaningful yet accurate explanations, then these features can (and should) be used.

                  [–]PackedSnacks 1 point2 points  (1 child)

                  Makes sense, though it's difficult to define what it means to have accurately "interpreted" the model.

                  Forgive the rest of my comment; overall I found the work really insightful. Glad we are in agreement about the need for better robustness metrics.

                  [–]andrew_ilyas[S] 1 point2 points  (0 children)

                  Yeah, we definitely agree! We need both better notions of robustness, and better definitions of interpretability in general. And no worries, hard to convey these sorts of things over reddit :) thanks for the kind comments.