
AI/​adversarial directory

Links

“WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”, Liu et al 2022

“WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”⁠, Alisa Liu, Swabha Swayamdipta, Noah A. Smith, Yejin Choi (2022-01-16; ; similar):

[Twitter] A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity.

We introduce a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI⁠, our approach uses “dataset cartography” to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers to ensure quality.

The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI instead of MNLI (which is 4× larger) improves performance on 7 out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI. Moreover, combining MNLI with WANLI is more effective than combining with other augmentation sets that have been introduced.

Our results demonstrate the potential of natural language generation techniques to curate NLP datasets of enhanced quality and diversity.

“CommonsenseQA 2.0: Exposing the Limits of AI through Gamification”, Talmor et al 2022

“CommonsenseQA 2.0: Exposing the Limits of AI through Gamification”⁠, Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant et al (2022-01-14; ; similar):

Constructing benchmarks that test the abilities of modern natural language understanding models is difficult—pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense.

In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale.

Using our method we create CommonsenseQA 2.0 [to replace CommonsenseQA], which includes 14,343 yes/​no questions, and demonstrate its difficulty for models that are orders-of-magnitude larger than the AI used in the game itself.

Our best baseline, the T5-based Unicorn with 11B parameters, achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance, which is at 94.1%.

“Models in the Loop: Aiding Crowdworkers With Generative Annotation Assistants”, Bartolo et al 2021

“Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants”⁠, Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, Douwe Kiela (2021-12-16; similar):

In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more costly per example. In this work, we examine if we can maintain the advantages of DADC, without suffering the additional cost. To that end, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely. We collect training datasets in twenty experimental settings and perform a detailed analysis of this approach for the task of extractive question answering (QA) for both standard and adversarial data collection. We demonstrate that GAAs provide significant efficiency benefits in terms of annotation speed, while leading to improved model fooling rates. In addition, we show that GAA-assisted data leads to higher downstream model performance on a variety of question answering tasks.

“Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs”, Korkmaz 2021

“Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs”⁠, Ezgi Korkmaz (2021-12-16; ; similar):

The use of deep neural networks as function approximators has led to striking progress for reinforcement learning algorithms and applications. Yet the knowledge we have on decision boundary geometry and the loss landscape of neural policies is still quite limited.

In this paper we propose a framework to investigate the decision boundary and loss landscape similarities across states and across MDPs. We conduct experiments in various games from the Arcade Learning Environment, and discover that high sensitivity directions for neural policies are correlated across MDPs⁠. We argue that these high sensitivity directions support the hypothesis that non-robust features are shared across training environments of reinforcement learning agents.

We believe our results reveal fundamental properties of the environments used in deep reinforcement learning training, and represent a tangible step towards building robust and reliable deep reinforcement learning agents.

“PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021

“PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts”⁠, Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sameer Singh, Sean Welleck, Hannaneh Hajishirzi et al (2021-12-15; ; similar):

Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a “wayward” behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: We can find continuous prompts that solve a task while being projected to an arbitrary text (eg. definition of a different or even a contradictory task), while being within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e., we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models.
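
[Illustrative sketch (not the paper’s code): one way to compute the “nearest neighbor discrete projection” of a tuned continuous prompt is to map each prompt vector to the vocabulary token whose embedding is most similar. The model name, prompt length, and random stand-in prompt below are placeholder assumptions.]

```python
# Project a continuous prompt onto its nearest discrete tokens by cosine similarity
# with the model's input-embedding table (a random tensor stands in for a tuned prompt).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
embeddings = model.get_input_embeddings().weight                   # [vocab_size, dim]

prompt_len = 10
continuous_prompt = torch.randn(prompt_len, embeddings.shape[1])   # placeholder tuned prompt

sim = torch.nn.functional.normalize(continuous_prompt, dim=-1) @ \
      torch.nn.functional.normalize(embeddings, dim=-1).T          # [prompt_len, vocab_size]
nearest_ids = sim.argmax(dim=-1)
print(tokenizer.decode(nearest_ids.tolist()))                      # the "discrete interpretation"
```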

“TnT Attacks! Universal Naturalistic Adversarial Patches Against Deep Neural Network Systems”, Doan et al 2021

“TnT Attacks! Universal Naturalistic Adversarial Patches Against Deep Neural Network Systems”⁠, Bao Gia Doan, Minhui Xue, Shiqing Ma, Ehsan Abbasnejad, Damith C. Ranasinghe (2021-11-19; ; similar):

Deep neural networks are vulnerable to attacks from adversarial inputs and, more recently, Trojans to misguide or hijack the decision of the model. We expose the existence of an intriguing class of bounded adversarial examples—Universal NaTuralistic adversarial paTches—we call TnTs, by exploring the superset of the bounded adversarial example space and the natural input space within generative adversarial networks. Now, an adversary can arm themselves with a patch that is naturalistic, less malicious-looking, physically realizable, highly effective—achieving high attack success rates, and universal. A TnT is universal because any input image captured with a TnT in the scene will: (1) misguide a network (untargeted attack); or (2) force the network to make a malicious decision (targeted attack). Interestingly, now, an adversarial patch attacker has the potential to exert a greater level of control—the ability to choose a location-independent, natural-looking patch as a trigger in contrast to being constrained to noisy perturbations—an ability thus far shown to be possible only with Trojan attack methods, which must interfere with the model-building process to embed a backdoor at the risk of discovery, yet still realize a patch deployable in the physical world. Through extensive experiments on the large-scale visual classification task ImageNet, with evaluations across its entire validation set of 50,000 images, we demonstrate the realistic threat from TnTs and the robustness of the attack. We show a generalization of the attack to create patches achieving higher attack success rates than existing state-of-the-art methods. Our results show the generalizability of the attack to different visual classification tasks (CIFAR-10, GTSRB, PubFig) and multiple state-of-the-art deep neural networks such as WideResnet50, Inception-V3 and VGG-16⁠.

“AugMax: Adversarial Composition of Random Augmentations for Robust Training”, Wang et al 2021

“AugMax: Adversarial Composition of Random Augmentations for Robust Training”⁠, Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, Zhangyang Wang (2021-10-26; similar):

Data augmentation is a simple yet effective way to improve the robustness of deep neural networks (DNNs). Diversity and hardness are two complementary dimensions of data augmentation to achieve robustness. For example, AugMix explores random compositions of a diverse set of augmentations to enhance broader coverage, while adversarial training generates adversarially hard samples to spot the weakness. Motivated by this, we propose a data augmentation framework, termed AugMax, to unify the two aspects of diversity and hardness. AugMax first randomly samples multiple augmentation operators and then learns an adversarial mixture of the selected operators. Being a stronger form of data augmentation, AugMax leads to a significantly augmented input distribution which makes model training more challenging. To solve this problem, we further design a disentangled normalization module, termed DuBIN (Dual-Batch-and-Instance Normalization), that disentangles the instance-wise feature heterogeneity arising from AugMax. Experiments show that AugMax-DuBIN leads to significantly improved out-of-distribution robustness, outperforming prior arts by 3.03%, 3.49%, 1.82% and 0.71% on CIFAR10-C, CIFAR100-C, Tiny ImageNet-C and ImageNet-C. Codes and pretrained models are available: Github⁠.

“Unrestricted Adversarial Attacks on ImageNet Competition”, Chen et al 2021

“Unrestricted Adversarial Attacks on ImageNet Competition”⁠, Yuefeng Chen, Xiaofeng Mao, Yuan He, Hui Xue, Chao Li, Yinpeng Dong, Qi-An Fu, Xiao Yang, Wenzhao Xiang et al (2021-10-17; similar):

Many works have investigated adversarial attacks or defenses under settings where a bounded and imperceptible perturbation can be added to the input. However, in the real world the attacker does not need to comply with this restriction. In fact, more threats to deep models come from unrestricted adversarial examples, that is, the attacker makes large and visible modifications to the image, which cause the model to classify mistakenly but do not affect normal observation from the human perspective. Unrestricted adversarial attack is a popular and practical direction but has not been studied thoroughly. We organize this competition with the purpose of exploring more effective unrestricted adversarial attack algorithms, so as to accelerate academic research on model robustness under stronger unbounded attacks. The competition is held on the TianChi platform (https://tianchi.aliyun.com/competition/entrance/531853/introduction) as one of the series of the AI Security Challengers Program.

“Partial Success in Closing the Gap between Human and Machine Vision”, Geirhos et al 2021

“Partial success in closing the gap between human and machine vision”⁠, Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann et al (2021-06-14; ⁠, ⁠, ; backlinks; similar):

A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines “in the wild” and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the “missing human baseline” by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (eg. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorization errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here: https://github.com/bethgelab/model-vs-human/

“A Universal Law of Robustness via Isoperimetry”, Bubeck & Sellke 2021

“A Universal Law of Robustness via Isoperimetry”⁠, Sébastien Bubeck, Mark Sellke (2021-05-26; ; backlinks; similar):

[Video: long⁠/​short] Classically, data interpolation with a parameterized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest.

We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparameterization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parameterized function class with polynomial size weights, and any covariate distribution verifying isoperimetry [“having the same perimetry”].

In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck et al 2021⁠. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

…To put Theorem 1 in context, we compare to the empirical results presented in [MMS+18]. In the latter work, they consider the MNIST dataset which consists of n = 6×10⁴ images in dimension 28² = 784. They trained robustly different architectures, and reported in Figure 4 the size of the architecture versus the obtained robust test accuracy (third plot from the left). One can see a sharp transition from roughly 10% accuracy to roughly 90% accuracy at around 2×10⁵ parameters (capacity scale 4 in their notation). Moreover the robust accuracy keeps climbing up with more parameters, to roughly 95% accuracy at roughly 3×10⁶ parameters…In addition to [MMS+18], several recent works have experimentally studied the relationship between a neural network scale and its achieved robustness, see eg. [NBA+18⁠, XY20⁠, GQU+20]. It has been consistently reported that larger networks help tremendously for robustness, beyond what is typically seen for classical non-robust accuracy

…With all the caveats described above, we can now look at the numbers as follows: in the [MMS+18] experiments, smooth models with accuracy below the noise level are attained with a number of parameters somewhere in the range 2×10⁵–3×10⁶ parameters (possibly even larger depending on the interpretation of the noise level), while the law of robustness would predict any such model must have at least nd parameters, and this latter quantity should be somewhere in the range 10⁶–10⁷ (corresponding to an effective dimension between 15 and 150). While far from perfect, the law of robustness prediction is far more accurate than the classical rule of thumb # parameters ≃ # equations (which here would predict a number of parameters of the order 10⁴).

Perhaps more interestingly, one could apply a similar reasoning to the ImageNet dataset, which consists of 1.4×10⁷ images of size roughly 2×10⁵. Estimating that the effective dimension is a couple of orders of magnitude smaller than this size, the law of robustness predicts that to obtain good robust models on ImageNet one would need at least 10¹⁰–10¹¹ parameters. This number is larger than the size of current neural networks trained robustly for this task, which sport between 10⁸–10⁹ parameters. Thus, we arrive at the tantalizing possibility that robust models for ImageNet do not exist yet simply because we are a couple orders of magnitude off in the current scale of neural networks trained for this task.
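
[To make the back-of-the-envelope arithmetic above concrete, a small sketch reproducing the n·d prediction; the effective-dimension values are the paper’s own rough estimates, not new results.]

```python
# Law-of-robustness back-of-the-envelope: smooth interpolation needs ≳ n·d_eff parameters.
def predicted_min_params(n_samples, d_eff):
    return n_samples * d_eff

# MNIST: n = 6×10⁴, effective dimension estimated at ~15-150  ->  ~10⁶-10⁷ parameters.
print([f"{predicted_min_params(6e4, d):.0e}" for d in (15, 150)])   # ['9e+05', '9e+06']

# ImageNet: n = 1.4×10⁷; taking an effective dimension of ~10³ (a couple of orders of
# magnitude below the ambient ~2×10⁵) gives ~10¹⁰ parameters.
print(f"{predicted_min_params(1.4e7, 1e3):.0e}")                    # '1e+10'
```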

“A Law of Robustness for Two-layers Neural Networks”, Bubeck et al 2021

“A law of robustness for two-layers neural networks”⁠, Sebastien Bubeck, Yuanzhi Li, Dheeraj Mysore Nagaraj (2021-03-05; ; backlinks; similar):

We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant⁠.

We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with k neurons that perfectly fit the data must have its Lipschitz constant larger (up to a constant) than √(n/k) where n is the number of datapoints. In particular, this conjecture implies that overparameterization is necessary for robustness, since it means that one needs roughly one neuron per datapoint to ensure a 𝒪(1)-Lipschitz network, while mere data fitting of d-dimensional data requires only one neuron per d datapoints. We prove a weaker version of this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix. We also prove the conjecture in the high-dimensional regime n ≈ d (which we also refer to as the undercomplete case, since only k ≤ d is relevant here). Finally we prove the conjecture for polynomial activation functions of degree p when n ≈ dᵖ.

We complement these findings with experimental evidence supporting the conjecture.

[Keywords: neural networks, approximation theory, robust machine learning]

“Multimodal Neurons in Artificial Neural Networks [CLIP]”, Goh et al 2021

“Multimodal Neurons in Artificial Neural Networks [CLIP]”⁠, Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford et al (2021-03-04; ; similar):

[Investigation of CLIP activations: CLIP detects a wide variety of entities, like Spiderman, Lady Gaga, or Halle Berry, in a variety of media, such as photos, (images of) text, people in costumes, drawings, or just similar terms; previous cruder smaller NNs lacked this ‘conceptual’ level, only responding to the exact person’s photograph.

CLIP neurons further specialize in regions, famous individuals, human emotions, religions, human attributes such as age/​gender/​facial-features, geographic regions (down to specific cities), holidays, art styles (such as anime vs painting), media franchises (Pokemon, Star Wars, Minecraft⁠, Batman etc), brands, images of text, and abstract concepts like ‘star’ or ‘LGBTQ+’ or numbers or time or color. Such conceptual neurons also have ‘opposite’ neurons, like Donald Trump vs “musicians like Nicki Minaj and Eminem, video games like Fortnite, civil rights activists like Martin Luther King Jr., and LGBT symbols like rainbow flags.” The capabilities are best with the English language, but there is some limited foreign-language capability as well.

Given the ‘conceptual’ level of neurons, it’s not too surprising that the overloaded/​entangled/​“polysemantic” neurons that Distill.pub has documented in VGG16 (which appear undesirable and to reflect the crudity of the NN’s knowledge) are much less present in CLIP, and the neurons appear to learn much cleaner concepts.

The power of the zero-shot classification, and the breadth of CLIP’s capabilities, can lead to some counterintuitive results, like their discovery of what they dub typographic attacks: writing “iPod” on a piece of paper and sticking it on the front of a Granny Smith apple can lead to the text string “iPod” being much more ‘similar’ to the image than the text string “Granny Smith”.

Perhaps even more surprising is that the multimodal conceptual capability leads to a Stroop effect! (And also bouba/​kiki⁠.) All in all, CLIP is remarkable.]
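
[A hedged sketch of how the typographic attack can be checked against the released CLIP weights via the HuggingFace transformers wrappers; the image path is a placeholder for a photo of a Granny Smith apple with “iPod” written on a paper label.]

```python
# Zero-shot "typographic attack" check with open-source CLIP weights
# ("apple_with_ipod_label.jpg" is a placeholder path, not a bundled file).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple_with_ipod_label.jpg")      # apple with "iPod" written on paper
texts = ["a photo of a Granny Smith apple", "a photo of an iPod"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # image-text similarity scores
print(dict(zip(texts, logits.softmax(dim=-1)[0].tolist())))  # attack succeeds if "iPod" wins
```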

“Adversarial Images for the Primate Brain”, Yuan et al 2020

“Adversarial images for the primate brain”⁠, Li Yuan, Will Xiao, Gabriel Kreiman, Francis E. H. Tay, Jiashi Feng, Margaret S. Livingstone (2020-11-11; similar):

Deep artificial neural networks have been proposed as a model of primate vision. However, these networks are vulnerable to adversarial attacks, whereby introducing minimal noise can fool networks into misclassifying images. Primate vision is thought to be robust to such adversarial images. We evaluated this assumption by designing adversarial images to fool primate vision. To do so, we first trained a model to predict responses of face-selective neurons in macaque inferior temporal cortex. Next, we modified images, such as human faces, to match their model-predicted neuronal responses to a target category, such as monkey faces. These adversarial images elicited neuronal responses similar to the target category. Remarkably, the same images fooled monkeys and humans at the behavioral level. These results challenge fundamental assumptions about the similarity between computer and primate vision and show that a model of neuronal activity can selectively direct primate visual behavior.

Figure 1: Overview of adversarial attack. (A) A substitute model was fit on IT neuron responses. The substitute model consisted of a pre-trained ResNet-101 (excluding the final fully-connected layer) and a linear mapping model. Adversarial images were generated by gradient-based optimization of the image to create the desired neuronal response pattern as predicted by the substitute model. (B, left) The adversarial images were tested in monkeys in neuron-level experiments. Monkeys fixated on a red fixation point while images were presented in random order and neuronal responses were recorded. (B, right) The images were tested in behavioral experiments with monkeys and human subjects. Each image was presented for 1000ms. For monkeys, 2 choice buttons were presented (text for illustration only). Monkeys were rewarded for touching the correct button for training images and a random button for test images. Humans were instructed to press a key to indicate the correct option. (C) Example human → monkey attack images, based on 2 original human faces, are shown for different noise levels.
Figure 2: Neuron-level results of adversarial attack. (A–G), Human → monkey attack. (A) UMAP visualization of neuronal representation of images in monkey P. ‘Gray box attack human face’ corresponds to noise level 10. Inset shows average distances from adversarial images to clean human faces and clean monkey faces, along the direction that best separates the latter 2 and normalized to the distance between them. Points in inset show centers of mass of UMAP points for illustration only, and do not correspond to the distance quantification. (B) Success rates of attack and control images (pure model, merged, Gaussian noise, and PS noise images) at different noise levels. Legend and example images are in (C). Shading and error bars show s.e.m. over bootstrap samples. *, p < 0.05 and ✱✱, p < 0.01. (D, E), Same as (A, B) for monkey R. (F, G), Same as (A, B) for monkey B1. (H–L), Same as (A–G) for non-face → face attack in monkeys P and B1.

“Adversarial Vulnerabilities of Human Decision-making”, Dezfouli et al 2020

“Adversarial vulnerabilities of human decision-making”⁠, Amir Dezfouli, Richard Nock, Peter Dayan (2020-11-04; ⁠, ⁠, ; similar):

“What I cannot efficiently break, I cannot understand.” Understanding the vulnerabilities of human choice processes allows us to detect and potentially avoid adversarial attacks. We develop a general framework for creating adversaries for human decision-making. The framework is based on recent developments in deep reinforcement learning models and recurrent neural networks and can in principle be applied to any decision-making task and adversarial objective. We show the performance of the framework in 3 tasks involving choice, response inhibition, and social decision-making. In all of the cases the framework was successful in its adversarial attack. Furthermore, we show various ways to interpret the models to provide insights into the exploitability of human choice.


Adversarial examples are carefully crafted input patterns that are surprisingly poorly classified by artificial and/​or natural neural networks. Here we examine adversarial vulnerabilities in the processes responsible for learning and choice in humans. Building upon recent recurrent neural network models of choice processes, we propose a general framework for generating adversarial opponents that can shape the choices of individuals in particular decision-making tasks toward the behavioral patterns desired by the adversary. We show the efficacy of the framework through 3 experiments involving action selection, response inhibition, and social decision-making. We further investigate the strategy used by the adversary in order to gain insights into the vulnerabilities of human choice. The framework may find applications across behavioral sciences in helping detect and avoid flawed choice.

[Keywords: decision-making, recurrent neural networks, reinforcement learning]

“Concealed Data Poisoning Attacks on NLP Models”, Wallace et al 2020

“Concealed Data Poisoning Attacks on NLP Models”⁠, Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh (2020-10-23; similar):

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that causes the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.

“Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations”, Dapello et al 2020

“Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations”⁠, Joel Dapello, Tiago Marques, Martin Schrimpf, Franziska Geiger, David D. Cox, James J. DiCarlo (2020-10-22; ; similar):

Current state-of-the-art object recognition models are largely based on convolutional neural network (CNN) architectures, which are loosely inspired by the primate visual system. However, these CNNs can be fooled by imperceptibly small, explicitly crafted perturbations, and struggle to recognize objects in corrupted images that are easily recognized by humans. Here, by making comparisons with primate neural data, we first observed that CNN models with a neural hidden layer that better matches primate primary visual cortex (V1) are also more robust to adversarial attacks. Inspired by this observation, we developed VOneNets, a new class of hybrid CNN vision models. Each VOneNet contains a fixed weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models. The VOneBlock is based on a classical neuroscientific model of V1: the linear-nonlinear-Poisson model, consisting of a biologically-constrained Gabor filter bank, simple and complex cell nonlinearities, and a V1 neuronal stochasticity generator. After training, VOneNets retain high ImageNet performance, but each is substantially more robust, outperforming the base CNNs and state-of-the-art methods by 18% and 3%, respectively, on a conglomerate benchmark of perturbations comprised of white box adversarial attacks and common image corruptions. Finally, we show that all components of the VOneBlock work in synergy to improve robustness. While current CNN architectures are arguably brain-inspired, the results presented here demonstrate that more precisely mimicking just one stage of the primate visual system leads to new gains in ImageNet-level computer vision applications.
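
[A much-simplified sketch of the general idea of a fixed-weight, biologically-inspired front-end: a Gabor filter bank followed by rectification and Poisson-like stochasticity. This is not the authors’ VOneBlock implementation; all kernel parameters are arbitrary illustrative choices.]

```python
# Fixed (untrained) Gabor filter bank as a convolutional front-end, loosely in the
# spirit of the VOneBlock described above (simplified illustration only).
import math
import torch
import torch.nn as nn

def gabor_kernel(size=25, theta=0.0, freq=0.15, sigma=4.0):
    ax = torch.arange(size, dtype=torch.float32) - size // 2
    y, x = torch.meshgrid(ax, ax, indexing="ij")
    xr = x * math.cos(theta) + y * math.sin(theta)
    return torch.exp(-(x**2 + y**2) / (2 * sigma**2)) * torch.cos(2 * math.pi * freq * xr)

class FixedGaborFrontEnd(nn.Module):
    def __init__(self, n_orientations=8, size=25):
        super().__init__()
        kernels = torch.stack([gabor_kernel(size, theta=i * math.pi / n_orientations)
                               for i in range(n_orientations)])
        weight = kernels.unsqueeze(1).repeat(1, 3, 1, 1) / 3.0   # grayscale-ish RGB filters
        self.conv = nn.Conv2d(3, n_orientations, size, padding=size // 2, bias=False)
        self.conv.weight.data.copy_(weight)
        self.conv.weight.requires_grad_(False)                   # fixed, untrained weights

    def forward(self, x):
        v1 = torch.relu(self.conv(x))                            # simple-cell-like rectification
        if self.training:
            v1 = v1 + torch.randn_like(v1) * v1.sqrt()           # Poisson-like stochasticity
        return v1

front = FixedGaborFrontEnd()
print(front(torch.rand(1, 3, 64, 64)).shape)                     # torch.Size([1, 8, 64, 64])
```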

“Recipes for Safety in Open-domain Chatbots”, Xu et al 2020

“Recipes for Safety in Open-domain Chatbots”⁠, Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, Emily Dinan (2020-10-14; similar):

Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find our new techniques are (1) safer than existing models as measured by automatic and human evaluations while (2) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our models.

“Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples”, Gowal et al 2020

“Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples”⁠, Sven Gowal, Chongli Qin, Jonathan Uesato, Timothy Mann, Pushmeet Kohli (2020-10-07; ; backlinks; similar):

Adversarial training and its variants have become de facto standards for learning robust deep neural networks. In this paper, we explore the landscape around adversarial training in a bid to uncover its limits. We systematically study the effect of different training losses, model sizes, activation functions, the addition of unlabeled data (through pseudo-labeling) and other factors on adversarial robustness. We discover that it is possible to train robust models that go well beyond state-of-the-art results by combining larger models, Swish/​SiLU activations and model weight averaging. We demonstrate large improvements on CIFAR-10 and CIFAR-100 against 𝓁∞ and 𝓁2 norm-bounded perturbations of size 8⁄255 and 128⁄255, respectively. In the setting with additional unlabeled data, we obtain an accuracy under attack of 65.88% against 𝓁∞ perturbations of size 8⁄255 on CIFAR-10 (+6.35% with respect to prior art). Without additional data, we obtain an accuracy under attack of 57.20% (+3.46%). To test the generality of our findings and without any additional modifications, we obtain an accuracy under attack of 80.53% (+7.62%) against 𝓁2 perturbations of size 128⁄255 on CIFAR-10, and of 36.88% (+8.46%) against 𝓁∞ perturbations of size 8⁄255 on CIFAR-100. All models are available at https://github.com/deepmind/deepmind-research/tree/master/adversarial_robustness⁠.
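
[A minimal sketch of the training recipe described above: PGD adversarial training combined with Swish/SiLU activations and model weight averaging. The tiny model, hyperparameters, and random data are placeholders, not the paper’s setup.]

```python
# PGD adversarial training with SiLU activations and EMA weight averaging (toy scale).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
ema = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, new, n: 0.995 * avg + 0.005 * new)   # model weight averaging
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
eps, step, n_steps = 8 / 255, 2 / 255, 10                          # l_inf budget as in the paper

def pgd_attack(x, y):
    delta = (torch.rand_like(x) * 2 - 1) * eps
    for _ in range(n_steps):
        delta.requires_grad_(True)
        loss = loss_fn(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta.detach() + step * grad.sign()).clamp(-eps, eps)
    return (x + delta).clamp(0, 1)

x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))        # stand-in CIFAR batch
opt.zero_grad()
loss_fn(model(pgd_attack(x, y)), y).backward()                     # train on adversarial examples
opt.step()
ema.update_parameters(model)                                       # evaluate with the EMA weights
```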

“Dataset Cartography: Mapping and Diagnosing Datasets With Training Dynamics”, Swayamdipta et al 2020

“Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics”⁠, Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi et al (2020-09-22; ; backlinks; similar):

Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps—a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example—the model’s confidence in the true class, and the variability of this confidence across epochs—obtained in a single run of training. Experiments across four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of “ambiguous” regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are “easy to learn” for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds “hard to learn”; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
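
[A minimal sketch of the two training-dynamics measures behind data maps, assuming one has logged the gold-label probability of every training example after each epoch; the random numbers below merely stand in for such a log.]

```python
# Data-map coordinates per example: mean confidence in the gold label across epochs,
# and the variability (std) of that confidence.
import numpy as np

# probs_true[e, i] = model's probability for example i's gold label after epoch e.
n_epochs, n_examples = 6, 1000
probs_true = np.random.rand(n_epochs, n_examples)        # placeholder for a real training log

confidence = probs_true.mean(axis=0)    # high confidence, low variability -> "easy to learn"
variability = probs_true.std(axis=0)    # high variability                 -> "ambiguous"
                                        # low confidence                   -> "hard to learn"
ambiguous_ids = np.argsort(-variability)[:100]   # candidates for augmentation, à la WANLI
print(confidence[:5], variability[:5])
```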

“Do Adversarially Robust ImageNet Models Transfer Better?”, Salman et al 2020

“Do Adversarially Robust ImageNet Models Transfer Better?”⁠, Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, Aleksander Madry (2020-07-16; similar):

Transfer learning is a widely-used paradigm in deep learning, where models pre-trained on standard datasets can be efficiently adapted to downstream tasks. Typically, better pre-trained models yield better transfer results, suggesting that initial accuracy is a key aspect of transfer learning performance. In this work, we identify another such aspect: we find that adversarially robust models, while less accurate, often perform better than their standard-trained counterparts when used for transfer learning. Specifically, we focus on adversarially robust ImageNet classifiers, and show that they yield improved accuracy on a standard suite of downstream classification tasks. Further analysis uncovers more differences between robust and standard models in the context of transfer learning. Our results are consistent with (and in fact, add to) recent hypotheses stating that robustness leads to improved feature representations. Our code and models are available at Github⁠.

“Smooth Adversarial Training”, Xie et al 2020

“Smooth Adversarial Training”⁠, Cihang Xie, Mingxing Tan, Boqing Gong, Alan Yuille, Quoc V. Le (2020-06-25; backlinks; similar):

It is commonly believed that networks cannot be both accurate and robust, that gaining robustness means losing accuracy. It is also generally believed that, unless making networks larger, network architectural elements would otherwise matter little in improving adversarial robustness. Here we present evidence to challenge these common beliefs by a careful study about adversarial training. Our key observation is that the widely-used ReLU activation function significantly weakens adversarial training due to its non-smooth nature. Hence we propose smooth adversarial training (SAT), in which we replace ReLU with its smooth approximations to strengthen adversarial training. The purpose of smooth activation functions in SAT is to allow it to find harder adversarial examples and compute better gradient updates during adversarial training.

Compared to standard adversarial training, SAT improves adversarial robustness for “free”, i.e., no drop in accuracy and no increase in computational cost. For example, without introducing additional computations, SAT significantly enhances ResNet-50’s robustness from 33.0% to 42.3%, while also improving accuracy by 0.9% on ImageNet. SAT also works well with larger networks: it helps EfficientNet-L1 to achieve 82.2% accuracy and 58.6% robustness on ImageNet, outperforming the previous state-of-the-art defense by 9.5% for accuracy and 11.6% for robustness. Models are available at https:/​/​github.com/​cihangxie/​SmoothAdversarialTraining.
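
[A tiny illustration of the claimed mechanism, assuming PyTorch: ReLU’s gradient is a hard step, whereas smooth surrogates such as softplus or SiLU provide informative gradients near zero, which SAT argues yields better inner-maximization steps during adversarial training.]

```python
# Compare activation gradients near zero for ReLU vs. smooth surrogates.
import torch
import torch.nn.functional as F

x = torch.linspace(-1, 1, 5, requires_grad=True)
for name, act in [("relu", F.relu), ("softplus", F.softplus), ("silu", F.silu)]:
    (g,) = torch.autograd.grad(act(x).sum(), x)
    print(f"{name:8s} grads: {[round(v, 3) for v in g.tolist()]}")  # relu: 0/1 step; others smooth
```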

“Improving the Interpretability of FMRI Decoding Using Deep Neural Networks and Adversarial Robustness”, McClure et al 2020

“Improving the Interpretability of fMRI Decoding using Deep Neural Networks and Adversarial Robustness”⁠, Patrick McClure, Dustin Moraczewski, Ka Chun Lam, Adam Thomas, Francisco Pereira (2020-04-23; ; similar):

Deep neural networks (DNNs) are being increasingly used to make predictions from functional magnetic resonance imaging (fMRI) data. However, they are widely seen as uninterpretable “black boxes”, as it can be difficult to discover what input information is used by the DNN in the process, something important in both cognitive neuroscience and clinical applications. A saliency map is a common approach for producing interpretable visualizations of the relative importance of input features for a prediction. However, methods for creating maps often fail due to DNNs being sensitive to input noise, or by focusing too much on the input and too little on the model. It is also challenging to evaluate how well saliency maps correspond to the truly relevant input information, as ground truth is not always available. In this paper, we review a variety of methods for producing gradient-based saliency maps, and present a new adversarial training method we developed to make DNNs robust to input noise, with the goal of improving interpretability. We introduce two quantitative evaluation procedures for saliency map methods in fMRI, applicable whenever a DNN or linear model is being trained to decode some information from imaging data. We evaluate the procedures using a synthetic dataset where the complex activation structure is known, and on saliency maps produced for DNN and linear models for task decoding in the Human Connectome Project (HCP) dataset. Our key finding is that saliency maps produced with different methods vary widely in interpretability, in both synthetic and HCP fMRI data. Strikingly, even when DNN and linear models decode at comparable levels of performance, DNN saliency maps score higher on interpretability than linear model saliency maps (derived via weights or gradient). Finally, saliency maps produced with our adversarial training method outperform those from other methods.

“Adversarial Examples Improve Image Recognition”, Xie et al 2019

“Adversarial Examples Improve Image Recognition”⁠, Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, Quoc V. Le (2019-11-21; similar):

Adversarial examples are commonly viewed as a threat to ConvNets. Here we present an opposite perspective: adversarial examples can be used to improve image recognition models if harnessed in the right manner. We propose AdvProp, an enhanced adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to our method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples.

We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger. For instance, by applying AdvProp to the latest EfficientNet-B7 [28] on ImageNet, we achieve significant improvements on ImageNet (+0.7%), ImageNet-C (+6.5%), ImageNet-A (+7.0%), Stylized-ImageNet (+4.8%). With an enhanced EfficientNet-B8, our method achieves the state-of-the-art 85.5% ImageNet top-1 accuracy without extra data. This result even surpasses the best model in [20] which is trained with 3.5B Instagram images (~3000× more than ImageNet) and ~9.4× more parameters. Models are available at https:/​/​github.com/​tensorflow/​tpu/​tree/​master/​models/​official/​efficientnet.
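
[A simplified sketch of the auxiliary-BatchNorm idea: convolution weights are shared, but clean and adversarial minibatches are normalized with separate BatchNorm statistics. This is an illustration of the mechanism, not the paper’s code.]

```python
# Dual-BatchNorm block: shared conv weights, separate normalization statistics
# for clean vs. adversarial minibatches.
import torch
import torch.nn as nn

class DualBNConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)  # shared weights
        self.bn_clean = nn.BatchNorm2d(c_out)   # statistics of the clean distribution
        self.bn_adv = nn.BatchNorm2d(c_out)     # statistics of the adversarial distribution

    def forward(self, x, adv=False):
        bn = self.bn_adv if adv else self.bn_clean
        return torch.relu(bn(self.conv(x)))

block = DualBNConv(3, 16)
x_clean = torch.rand(4, 3, 32, 32)
x_adv = x_clean + 0.03 * torch.randn_like(x_clean)      # stand-in for PGD examples
loss = block(x_clean, adv=False).mean() + block(x_adv, adv=True).mean()
loss.backward()                                         # at test time, only the clean BN is used
```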

“Playing Magic Tricks to Deep Neural Networks Untangles Human Deception”, Zaghi-Lara et al 2019

“Playing magic tricks to deep neural networks untangles human deception”⁠, Regina Zaghi-Lara, Miguel Ángel Gea, Jordi Camí, Luis M. Martínez, Alex Gomez-Marin (2019-08-20; ; similar):

Magic is the art of producing in the spectator an illusion of impossibility. Although the scientific study of magic is in its infancy, the advent of recent tracking algorithms based on deep learning allow now to quantify the skills of the magician in naturalistic conditions at unprecedented resolution and robustness.

In this study, we deconstructed stage magic into purely motor maneuvers and trained an artificial neural network (DeepLabCut) to follow coins as a professional magician made them appear and disappear in a series of tricks. Rather than using AI as a mere tracking tool, we conceived it as an “artificial spectator”. When the coins were not visible, the algorithm was trained to infer their location as a human spectator would (ie. in the left fist).

This created situations where the human was fooled while AI (as seen by a human) was not, and vice versa.

Magic from the perspective of the machine reveals our own cognitive biases.

[ML techniques, compared to humans, look a bit like autistic or idiot savants: skilled at fine detail but missing the big picture. Another example of CNNs being ‘smarter by being stupider’ is “Humans, but Not Deep Neural Networks, Often Miss Giant Targets in Scenes”, Eckstein et al 2017⁠.]

“Intriguing Properties of Adversarial Training at Scale”, Xie & Yuille 2019

“Intriguing properties of adversarial training at scale”⁠, Cihang Xie, Alan Yuille (2019-06-10; ; backlinks; similar):

Adversarial training is one of the main defenses against adversarial attacks. In this paper, we provide the first rigorous study on diagnosing elements of adversarial training, which reveals two intriguing properties.

First, we study the role of normalization. Batch normalization (BN) is a crucial element for achieving state-of-the-art performance on many vision tasks, but we show it may prevent networks from obtaining strong robustness in adversarial training. One unexpected observation is that, for models trained with BN, simply removing clean images from training data largely boosts adversarial robustness, i.e., 18.3%. We relate this phenomenon to the hypothesis that clean images and adversarial images are drawn from two different domains. This two-domain hypothesis may explain the issue of BN when training with a mixture of clean and adversarial images, as estimating normalization statistics of this mixture distribution is challenging. Guided by this two-domain hypothesis, we show disentangling the mixture distribution for normalization, i.e., applying separate BNs to clean and adversarial images for statistics estimation, achieves much stronger robustness. Additionally, we find that enforcing BNs to behave consistently at training and testing can further enhance robustness.

Second, we study the role of network capacity. We find our so-called “deep” networks are still shallow for the task of adversarial learning. Unlike traditional classification tasks where accuracy is only marginally improved by adding more layers to “deep” networks (eg. ResNet-152), adversarial training exhibits a much stronger demand on deeper networks to achieve higher adversarial robustness. This robustness improvement can be observed substantially and consistently even by pushing the network capacity to an unprecedented scale, i.e., ResNet-638.

“Adversarially Robust Generalization Just Requires More Unlabeled Data”, Zhai et al 2019

“Adversarially Robust Generalization Just Requires More Unlabeled Data”⁠, Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, Liwei Wang (2019-06-03; ; similar):

Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and substantially more labeled data is required to achieve adversarially robust generalization. In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. The key insight of our results is based on a risk decomposition theorem, in which the expected robust risk is separated into two parts: the stability part which measures the prediction stability in the presence of perturbations, and the accuracy part which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize this part using unlabeled data. We further prove that for a specific Gaussian mixture problem, adversarially robust generalization can be almost as easy as the standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided. Inspired by the theoretical findings, we further show that a practical adversarial training algorithm that leverages unlabeled data can improve adversarial robust generalization on MNIST and Cifar-10.

“Are Labels Required for Improving Adversarial Robustness?”, Uesato et al 2019

“Are Labels Required for Improving Adversarial Robustness?”⁠, Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli et al (2019-05-31; similar):

Recent work has uncovered the interesting (and somewhat surprising) finding that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification. This result is a key hurdle in the deployment of robust machine learning models in many real world applications where labeled data is expensive. Our main insight is that unlabeled data can be a competitive alternative to labeled data for training adversarially robust models. Theoretically, we show that in a simple statistical setting, the sample complexity for learning an adversarially robust model from unlabeled data matches the fully supervised case up to constant factors. On standard datasets like CIFAR-10, a simple Unsupervised Adversarial Training (UAT) approach using unlabeled data improves robust accuracy by 21.7% over using 4K supervised examples alone, and captures over 95% of the improvement from the same number of labeled examples. Finally, we report an improvement of 4% over the previous state-of-the-art on CIFAR-10 against the strongest known attack by using additional unlabeled data from the uncurated 80 Million Tiny Images dataset. This demonstrates that our finding extends as well to the more realistic case where unlabeled data is also uncurated, therefore opening a new avenue for improving adversarial training.

“Adversarial Policies: Attacking Deep Reinforcement Learning”, Gleave et al 2019

“Adversarial Policies: Attacking Deep Reinforcement Learning”⁠, Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell (2019-05-25; ⁠, ; similar):

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent’s observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial?

We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent.

Videos are available at https://adversarialpolicies.github.io/⁠.

“Neural Population Control via Deep Image Synthesis”, Bashivan et al 2019

2019-bashivan.pdf: “Neural population control via deep image synthesis”⁠, Pouya Bashivan, Kohitij Kar, James J. DiCarlo (2019; ; similar):

Some deep artificial neural networks (ANNs) are today’s most accurate models of the primate brain’s ventral visual stream⁠.

Using an ANN-driven image synthesis method, we found that luminous power patterns (ie. images) can be applied to primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. This method, although not yet perfect, achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields.

These results show how the knowledge embedded in today’s ANN models might be used to noninvasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control.

“Neural Population Control via Deep Image Synthesis”, Bashivan et al 2018

“Neural Population Control via Deep Image Synthesis”⁠, Pouya Bashivan, Kohitij Kar, James J. DiCarlo (2018-11-04; ; similar):

Particular deep artificial neural networks (ANNs) are today’s most accurate models of the primate brain’s ventral visual stream. Here we report that, using a targeted ANN-driven image synthesis method, new luminous power patterns (ie. images) can be applied to the primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. More importantly, this method, while not yet perfect, already achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields. These results show how the knowledge embedded in today’s ANN models might be used to non-invasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control.
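
[A generic sketch of ANN-driven image synthesis: gradient ascent on the pixels to drive a chosen unit of a differentiable response model. Here a pretrained ResNet-50 merely stands in for the paper’s fitted V4 mapping model, and the target unit is arbitrary.]

```python
# Gradient-ascend on an image to maximize the predicted response of one target unit
# of a differentiable "substitute" response model (pretrained ResNet-50 as a stand-in).
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
target_unit = 123                                        # arbitrary choice of unit to drive

img = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    response = model(img.clamp(0, 1))[0, target_unit]    # predicted "neural" response
    (-response).backward()                               # ascend: maximize the response
    opt.step()
print(float(model(img.clamp(0, 1))[0, target_unit]))     # driven well above its starting value
```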

“Humans Can Decipher Adversarial Images”, Zhou & Firestone 2018

“Humans can decipher adversarial images”⁠, Zhenglong Zhou, Chaz Firestone (2018-09-11; ⁠, ; similar):

How similar is the human mind to the sophisticated machine-learning systems that mirror its performance? Models of object categorization based on convolutional neural networks (CNNs) have achieved human-level benchmarks in assigning known labels to novel images. These advances promise to support transformative technologies such as autonomous vehicles and machine diagnosis; beyond this, they also serve as candidate models for the visual system itself—not only in their output but perhaps even in their underlying mechanisms and principles. However, unlike human vision, CNNs can be “fooled” by adversarial examples—carefully crafted images that appear as nonsense patterns to humans but are recognized as familiar objects by machines, or that appear as one object to humans and a different object to machines. This seemingly extreme divergence between human and machine classification challenges the promise of these new advances, both as applied image-recognition systems and also as models of the human mind. Surprisingly, however, little work has empirically investigated human classification of such adversarial stimuli: Does human and machine performance fundamentally diverge? Or could humans decipher such images and predict the machine’s preferred labels? Here, we show that human and machine classification of adversarial stimuli are robustly related: In eight experiments on five prominent and diverse adversarial imagesets, human subjects reliably identified the machine’s chosen label over relevant foils. This pattern persisted for images with strong antecedent identities, and even for images described as “totally unrecognizable to human eyes”. We suggest that human intuition may be a more reliable guide to machine (mis)classification than has typically been imagined, and we explore the consequences of this result for minds and machines alike.

“Adversarial Reprogramming of Text Classification Neural Networks”, Neekhara et al 2018

“Adversarial Reprogramming of Text Classification Neural Networks”⁠, Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar (2018-09-06; ⁠, ⁠, ⁠, ; backlinks; similar):

Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network. An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models including LSTM⁠, bi-directional LSTM and CNN for alternate classification tasks.

“Adversarial Reprogramming of Neural Networks”, Elsayed et al 2018

“Adversarial Reprogramming of Neural Networks”⁠, Gamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-Dickstein (2018-06-28; ⁠, ⁠, ; backlinks; similar):

Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as confusing a cat with a computer. Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce attacks that instead reprogram the target model to perform a task chosen by the attacker—without the attacker needing to specify or compute the desired output for each test-time input. This attack finds a single adversarial perturbation, that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary—even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six ImageNet classification models, repurposing these models to perform a counting task, as well as classification tasks: classification of MNIST and CIFAR-10 examples presented as inputs to the ImageNet model.
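
[A simplified sketch of adversarial reprogramming, assuming PyTorch/torchvision: a single learned additive “program” surrounds a small embedded input and repurposes a frozen ImageNet classifier for digit classification, with ImageNet classes 0–9 arbitrarily reused as digit labels.]

```python
# Learn one additive "program" (a frame around a small embedded input) that repurposes
# a frozen ImageNet classifier for MNIST-style digits (toy, heavily simplified).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

net = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in net.parameters():
    p.requires_grad_(False)                              # victim model stays frozen

program = torch.zeros(1, 3, 224, 224, requires_grad=True)   # the adversarial program
mask = torch.ones(1, 3, 224, 224)
mask[:, :, 98:126, 98:126] = 0                           # program lives only in the frame
opt = torch.optim.Adam([program], lr=0.05)

def embed(digits):                                       # place 28x28 digits at the center
    canvas = torch.zeros(digits.size(0), 3, 224, 224)
    canvas[:, :, 98:126, 98:126] = digits
    return canvas

x, y = torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,))   # stand-in MNIST batch
for _ in range(10):                                      # a few illustrative steps
    opt.zero_grad()
    adv_input = torch.tanh(program) * mask + embed(x)    # program frame + embedded digit
    logits = net(adv_input)[:, :10]                      # reuse ImageNet classes 0-9 as digits
    F.cross_entropy(logits, y).backward()                # only the program receives gradients
    opt.step()
```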

“Towards the First Adversarially Robust Neural Network Model on MNIST”, Schott et al 2018

“Towards the first adversarially robust neural network model on MNIST”⁠, Lukas Schott, Jonas Rauber, Matthias Bethge, Wieland Brendel (2018-05-23; similar):

Despite much effort, deep neural networks remain highly susceptible to tiny input perturbations and even for MNIST, one of the most common toy datasets in computer vision, no neural network model exists for which adversarial perturbations are large and make semantic sense to humans. We show that even the widely recognized and by far most successful defense by Madry et al (1) overfits on the L∞ metric (it's highly susceptible to L2 and L0 perturbations), (2) classifies unrecognizable images with high certainty, (3) performs not much better than simple input binarization and (4) features adversarial perturbations that make little sense to humans. These results suggest that MNIST is far from being solved in terms of adversarial robustness. We present a novel robust classification model that performs analysis by synthesis using learned class-conditional data distributions. We derive bounds on the robustness and go to great lengths to empirically evaluate our model using maximally effective adversarial attacks by (a) applying decision-based, score-based, gradient-based and transfer-based attacks for several different Lp norms, (b) designing a new attack that exploits the structure of our defended model and (c) devising a novel decision-based attack that seeks to minimize the number of perturbed pixels (L0). The results suggest that our approach yields state-of-the-art robustness on MNIST against L0, L2 and L∞ perturbations and we demonstrate that most adversarial examples are strongly perturbed towards the perceptual boundary between the original and the adversarial class.

“Sensitivity and Generalization in Neural Networks: an Empirical Study”, Novak et al 2018

“Sensitivity and Generalization in Neural Networks: an Empirical Study”⁠, Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein (2018-02-23; ; backlinks; similar):

In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of 2 natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets.

We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization—such as full-batch training or using random labels—correspond to lower robustness, while factors associated with good generalization—such as data augmentation and ReLU non-linearities—give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
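
As a rough illustration of the sensitivity metric, the input-output Jacobian norm can be estimated directly with automatic differentiation; the toy fully-connected network and random input below are placeholders for a runnable example, not one of the architectures surveyed in the paper.

```python
# Sketch of the sensitivity metric: Frobenius norm of the input-output Jacobian,
# computed row-by-row with autograd (toy network and random input are placeholders).
import torch

def jacobian_frobenius_norm(model, x):
    x = x.clone().detach().requires_grad_(True)
    y = model(x)
    rows = [torch.autograd.grad(y.flatten()[i], x, retain_graph=True)[0].flatten()
            for i in range(y.numel())]
    return torch.stack(rows).norm().item()      # ‖J‖_F over (outputs × input dims)

net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
print(jacobian_frobenius_norm(net, torch.randn(32)))
```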

“Adversarial Vulnerability for Any Classifier”, Fawzi et al 2018

“Adversarial vulnerability for any classifier”⁠, Alhussein Fawzi, Hamza Fawzi, Omar Fawzi (2018-02-23; similar):

Despite achieving impressive performance, state-of-the-art classifiers remain highly vulnerable to small, imperceptible, adversarial perturbations. This vulnerability has proven empirically to be very intricate to address. In this paper, we study the phenomenon of adversarial perturbations under the assumption that the data is generated with a smooth generative model. We derive fundamental upper bounds on the robustness to perturbations of any classification function, and prove the existence of adversarial perturbations that transfer well across different classifiers with small risk. Our analysis of the robustness also provides insights into key properties of generative models, such as their smoothness and the dimensionality of their latent space. We conclude with numerical experimental results showing that our bounds provide informative baselines for the maximal achievable robustness on several datasets.

“Adversarial Examples That Fool Both Computer Vision and Time-Limited Humans”, Elsayed et al 2018

“Adversarial Examples that Fool both Computer Vision and Time-Limited Humans”⁠, Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein et al (2018-02-22; ; similar):

Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.

“Intriguing Properties of Adversarial Examples”, Cubuk et al 2018

“Intriguing Properties of Adversarial Examples”⁠, Ekin Dogus Cubuk, Barret Zoph, Samuel Stern Schoenholz, Quoc V. Le (2018-02-15; backlinks; similar):

Adversarial error has a similar power-law form for all datasets and models studied, and architecture matters.

It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear. Here we show that distributions of logit differences have a universal functional form. This functional form is independent of architecture, dataset, and training protocol; nor does it change during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white and black box attacks compared to previous attempts.

[Keywords: adversarial examples, universality, neural architecture search]
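
One way to see the claimed power-law empirically is to measure test error under FGSM as a function of the perturbation size ε and fit a line in log-log space. The sketch below assumes a trained `model` and a held-out batch `(x, y)`, neither of which comes from the paper, and it assumes the measured error rates are nonzero.

```python
# Illustrative measurement of adversarial error vs. perturbation size, plus a
# power-law fit error ≈ c·ε^k in log-log space (assumes all errors are nonzero).
import numpy as np
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

def adversarial_error(model, x, y, eps):
    preds = model(fgsm(model, x, y, eps)).argmax(dim=1)
    return (preds != y).float().mean().item()

def fit_power_law(eps_values, errors):
    k, log_c = np.polyfit(np.log(eps_values), np.log(errors), 1)
    return k, float(np.exp(log_c))              # exponent and prefactor
```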

“First-order Adversarial Vulnerability of Neural Networks and Input Dimension”, Simon-Gabriel et al 2018

“First-order Adversarial Vulnerability of Neural Networks and Input Dimension”⁠, Carl-Johann Simon-Gabriel, Yann Ollivier, Léon Bottou, Bernhard Schölkopf, David Lopez-Paz (2018-02-05; similar):

Over the past few years, neural networks were proven vulnerable to adversarial images: targeted but imperceptible image perturbations lead to drastically different predictions.

We show that adversarial vulnerability increases with the gradients of the training objective when viewed as a function of the inputs. Surprisingly, vulnerability does not depend on network topology: for many standard network architectures, we prove that at initialization, the 𝓁1-norm of these gradients grows as the square root of the input dimension, leaving the networks increasingly vulnerable with growing image size.

We empirically show that this dimension dependence persists after either usual or robust training, but gets attenuated with higher regularization.
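
A quick numerical check of the √d claim is possible with a small fully-connected network at initialization; the architecture, width, and loss below are illustrative assumptions, not the paper's exact experimental setup.

```python
# Measure the ℓ1-norm of the loss gradient w.r.t. the input at initialization, for
# growing input dimension d; the paper predicts growth roughly proportional to √d.
import torch
import torch.nn.functional as F

def input_grad_l1(d, width=256, n_samples=64):
    net = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(),
                              torch.nn.Linear(width, 10))
    x = torch.randn(n_samples, d, requires_grad=True)
    y = torch.randint(0, 10, (n_samples,))
    F.cross_entropy(net(x), y).backward()
    return x.grad.abs().sum(dim=1).mean().item()   # average ℓ1-norm over the batch

for d in [64, 256, 1024, 4096]:
    print(d, input_grad_l1(d))                     # expect roughly constant × √d
```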

“Adversarial Spheres”, Gilmer et al 2018

“Adversarial Spheres”⁠, Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, Ian Goodfellow et al (2018-01-09; similar):

State-of-the-art computer vision models have been shown to be vulnerable to small adversarial perturbations of the input. In other words, most images in the data distribution are both correctly classified by the model and are very close to a visually similar misclassified image. Despite substantial research interest, the cause of the phenomenon is still poorly understood and remains unsolved. We hypothesize that this counterintuitive behavior is a naturally occurring result of the high dimensional geometry of the data manifold.

As a first step towards exploring this hypothesis, we study a simple synthetic dataset of classifying between two concentric high dimensional spheres. For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size 𝑂(1⁄√d). Surprisingly, when we train several different architectures on this dataset, all of their error sets naturally approach this theoretical bound. As a result of the theory, the vulnerability of neural networks to small adversarial perturbations is a logical consequence of the amount of test error observed.

We hope that our theoretical analysis of this very simple case will point the way forward to explore how the geometry of complex real-world data sets leads to adversarial examples.
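
The synthetic task is easy to reproduce. A minimal sketch of the dataset follows; the radii 1.0 and 1.3 are the values commonly associated with the paper's setup and should be treated as an assumption here.

```python
# Sample labelled points from two concentric d-dimensional spheres.
import torch

def sample_spheres(n, d, r_inner=1.0, r_outer=1.3):
    x = torch.randn(n, d)
    x = x / x.norm(dim=1, keepdim=True)          # uniform points on the unit sphere
    labels = torch.randint(0, 2, (n,))
    radii = torch.full((n,), r_inner)
    radii[labels == 1] = r_outer                 # scale half the points to the outer sphere
    return x * radii.unsqueeze(1), labels

X, y = sample_spheres(10_000, 500)               # 10,000 points in 500 dimensions
```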

“Adversarial Phenomenon in the Eyes of Bayesian Deep Learning”, Rawat et al 2017

“Adversarial Phenomenon in the Eyes of Bayesian Deep Learning”⁠, Ambrish Rawat, Martin Wistuba, Maria-Irina Nicolae (2017-11-22; similar):

Deep Learning models are vulnerable to adversarial examples, i.e. images obtained via deliberate imperceptible perturbations, such that the model misclassifies them with high confidence. However, class confidence by itself is an incomplete picture of uncertainty. We therefore use principled Bayesian methods to capture model uncertainty in prediction for observing adversarial misclassification.

We provide an extensive study with different Bayesian neural networks attacked in both white-box and black-box setups. The behaviour of the networks for noise, attacks and clean test data is compared. We observe that Bayesian neural networks are uncertain in their predictions for adversarial perturbations, a behaviour similar to the one observed for random Gaussian perturbations. Thus, we conclude that Bayesian neural networks can be considered for detecting adversarial examples.
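
As a rough illustration of the measurement (not the paper's specific Bayesian treatment), predictive uncertainty can be estimated by Monte Carlo sampling of a stochastic network, for instance with dropout left active at test time, and then compared between clean and adversarial inputs.

```python
# Hedged sketch: Monte Carlo predictive entropy from a dropout network; the paper finds
# such uncertainty is elevated for adversarial inputs, much as it is for random noise.
import torch
import torch.nn.functional as F

def predictive_entropy(model, x, n_samples=30):
    model.train()                                # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)
    return -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=1)   # entropy per example
```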

“Intriguing Properties of Adversarial Examples”, Cubuk et al 2017

“Intriguing Properties of Adversarial Examples”⁠, Ekin D. Cubuk, Barret Zoph, Samuel S. Schoenholz, Quoc V. Le (2017-11-08; similar):

It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear.

Here we argue that the origin of adversarial examples is primarily due to an inherent uncertainty that neural networks have about their predictions. We show that the functional form of this uncertainty is independent of architecture, dataset, and training protocol; and depends only on the statistics of the logit differences of the network, which do not change significantly during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation.

We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD).

Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white and black box attacks compared to previous attempts.

“Mitigating Adversarial Effects Through Randomization”, Xie et al 2017

“Mitigating Adversarial Effects Through Randomization”⁠, Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, Alan Yuille (2017-11-06; similar):

Convolutional neural networks have demonstrated high accuracy on various tasks in recent years. However, they are extremely vulnerable to adversarial examples. For example, imperceptible perturbations added to clean images can cause convolutional neural networks to fail.

In this paper, we propose to utilize randomization at inference time to mitigate adversarial effects. Specifically, we use two randomization operations: random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner. Extensive experiments demonstrate that the proposed randomization method is very effective at defending against both single-step and iterative attacks. Our method provides the following advantages: (1) no additional training or fine-tuning, (2) very few additional computations, (3) compatible with other adversarial defense methods.

By combining the proposed randomization method with an adversarially trained model, we achieve a normalized score of 0.924 (ranked No.2 among 107 defense teams) in the NIPS 2017 adversarial examples defense challenge, far better than using adversarial training alone with a normalized score of 0.773 (ranked No.56).

The code is publicly available on GitHub⁠.
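
The two randomization operations are straightforward to sketch. The size range below follows the paper's reported setup for 299-pixel Inception inputs (resize to a random size and zero-pad back up to 331), but the exact numbers should be treated as an assumption; at test time the classifier is simply run on `randomize(x)`, optionally averaging predictions over several random draws.

```python
# Sketch of inference-time randomization: random resizing followed by random zero-padding
# (expects a batched image tensor x of shape (N, C, H, W)).
import random
import torch.nn.functional as F

def randomize(x, min_size=299, max_size=331):
    new_size = random.randint(min_size, max_size - 1)
    x = F.interpolate(x, size=(new_size, new_size), mode="nearest")
    pad_total = max_size - new_size
    left = random.randint(0, pad_total)
    top = random.randint(0, pad_total)
    return F.pad(x, (left, pad_total - left, top, pad_total - top), value=0.0)
```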

“Learning Universal Adversarial Perturbations With Generative Models”, Hayes & Danezis 2017

“Learning Universal Adversarial Perturbations with Generative Models”⁠, Jamie Hayes, George Danezis (2017-08-17; ; similar):

Neural networks are known to be vulnerable to adversarial examples, inputs that have been intentionally perturbed to remain visually similar to the source input, but cause a misclassification. It was recently shown that given a dataset and classifier, there exist so-called universal adversarial perturbations, a single perturbation that causes a misclassification when applied to any input. In this work, we introduce universal adversarial networks, a generative network that is capable of fooling a target classifier when its generated output is added to a clean sample from a dataset. We show that this technique improves on known universal adversarial attacks.
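
A rough sketch of the idea follows, with the generator architecture, perturbation bound, and loss invented for illustration (the paper's specifics differ): a generator maps a noise vector to a single perturbation, squashed into an ℓ∞ budget, that is added to every clean input so as to make a frozen classifier err.

```python
# Hypothetical universal-adversarial-network sketch: the generator output is bounded to
# an ℓ∞ ball and added to clean inputs; training maximizes the frozen classifier's loss.
import math
import torch
import torch.nn.functional as F

class UAN(torch.nn.Module):
    def __init__(self, z_dim=100, out_shape=(3, 32, 32)):
        super().__init__()
        self.out_shape = out_shape
        self.net = torch.nn.Sequential(
            torch.nn.Linear(z_dim, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, math.prod(out_shape)))

    def forward(self, z, eps=8 / 255):
        return eps * torch.tanh(self.net(z)).view(-1, *self.out_shape)

def train_step(uan, classifier, x, y, z, opt):
    x_adv = torch.clamp(x + uan(z), 0, 1)
    loss = -F.cross_entropy(classifier(x_adv), y)   # untargeted: increase the error
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```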

“Robust Physical-World Attacks on Deep Learning Models”, Eykholt et al 2017

“Robust Physical-World Attacks on Deep Learning Models”⁠, Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno et al (2017-07-27; similar):

Recent studies show that the state-of-the-art deep neural networks (DNNs) are vulnerable to adversarial examples, resulting from small-magnitude perturbations added to the input. Given that emerging physical systems are using DNNs in safety-critical situations, adversarial examples could mislead these systems and cause dangerous situations. Therefore, understanding adversarial examples in the physical world is an important step towards developing resilient learning algorithms.

We propose a general attack algorithm, Robust Physical Perturbations (RP2), to generate robust visual adversarial perturbations under different physical conditions. Using the real-world case of road sign classification, we show that adversarial examples generated using RP2 achieve high targeted misclassification rates against standard-architecture road sign classifiers in the physical world under various environmental conditions, including viewpoints.

Due to the current lack of a standardized testing method, we propose a two-stage evaluation methodology for robust physical adversarial examples consisting of lab and field tests. Using this methodology, we evaluate the efficacy of physical adversarial manipulations on real objects. With a perturbation in the form of only black and white stickers, we attack a real stop sign, causing targeted misclassification in 100% of the images obtained in lab settings, and in 84.8% of the captured video frames obtained on a moving vehicle (field test) for the target classifier.

“Towards Deep Learning Models Resistant to Adversarial Attacks”, Madry et al 2017

“Towards Deep Learning Models Resistant to Adversarial Attacks”⁠, Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu (2017-06-19; ; backlinks; similar):

Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples—inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models.

To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee.

We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https:/​/​github.com/​MadryLab/​mnist_challenge and https://github.com/MadryLab/cifar10_challenge⁠.
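
The attack at the core of both the training procedure and the evaluation is projected gradient descent on the loss. A minimal sketch follows; the step size, iteration count, and ε are illustrative defaults, and the released MNIST/CIFAR-10 challenges specify the exact settings. Adversarial training then minimizes the loss on `pgd_attack(model, x, y)` instead of on `x` itself.

```python
# Minimal PGD sketch: random start in the ℓ∞ ball, iterated signed-gradient ascent on
# the loss, projecting back onto the ball (and the valid pixel range) after each step.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```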

“Ensemble Adversarial Training: Attacks and Defenses”, Tramèr et al 2017

“Ensemble Adversarial Training: Attacks and Defenses”⁠, Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel (2017-05-19; similar):

Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model’s loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with strong robustness to black-box attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks. However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models.
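
The "small random step" attack mentioned above (R+FGSM) is easy to sketch: take a random signed step of size α, then a single FGSM step of size ε − α from that point. The parameter values below are illustrative, not the paper's exact choices.

```python
# Sketch of R+FGSM: a random sign step followed by one FGSM step from the new point.
import torch
import torch.nn.functional as F

def r_fgsm(model, x, y, eps=16 / 255, alpha=8 / 255):
    x_rand = (x + alpha * torch.randn_like(x).sign()).detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_rand), y)
    grad = torch.autograd.grad(loss, x_rand)[0]
    return (x_rand + (eps - alpha) * grad.sign()).clamp(0, 1).detach()
```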

“The Space of Transferable Adversarial Examples”, Tramèr et al 2017

“The Space of Transferable Adversarial Examples”⁠, Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel (2017-04-11; similar):

Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time. They often transfer: the same adversarial example fools more than one model.

In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large (~25) dimensionality. Adversarial subspaces with higher dimensionality are more likely to intersect. We find that for two different models, a substantial fraction of their subspaces is shared, thus enabling transferability.

In the first quantitative analysis of the similarity of different models’ decision boundaries, we show that these boundaries are actually close in arbitrary directions, whether adversarial or benign. We conclude by formally studying the limits of transferability. We derive (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of scenarios in which transfer does not occur. These findings indicate that it may be possible to design defenses against transfer-based attacks, even for models that are vulnerable to direct attacks.

“Adversarial Examples in the Physical World”, Kurakin et al 2016

“Adversarial examples in the physical world”⁠, Alexey Kurakin, Ian Goodfellow, Samy Bengio (2016-07-08; similar):

Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work has assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those using signals from cameras and other sensors as an input. This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from a cell-phone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera.

“Foveation-based Mechanisms Alleviate Adversarial Examples”, Luo et al 2015

“Foveation-based Mechanisms Alleviate Adversarial Examples”⁠, Yan Luo, Xavier Boix, Gemma Roig, Tomaso Poggio, Qi Zhao (2015-11-19; ; backlinks; similar):

We show that adversarial examples, i.e., the visually imperceptible perturbations that cause Convolutional Neural Networks (CNNs) to fail, can be alleviated with a mechanism based on foveations—applying the CNN in different image regions. To see this, first, we report results on ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as a linear classifier: CNNs act locally linearly to changes in the image regions with objects recognized by the CNN, and in other regions the CNN may act non-linearly. Then, we corroborate that when the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. This is because, hypothetically, the CNNs for ImageNet are robust to changes of scale and translation of the object produced by the foveation, but this property does not generalize to transformations of the perturbation. As a result, the accuracy after a foveation is almost the same as the accuracy of the CNN without the adversarial perturbation, even if the adversarial perturbation is calculated taking into account a foveation.
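
A very rough sketch of a foveation-style inference procedure is given below; the crop selection, input size, and averaging are assumptions made for illustration, and the paper studies several specific foveation mechanisms rather than this generic one.

```python
# Classify several rescaled crops ("foveations") of the image and average the predictions.
import torch
import torch.nn.functional as F

def foveated_predict(model, x, boxes, size=224):
    """x: (1, 3, H, W) image; boxes: list of (top, left, height, width) crops."""
    probs = []
    for (t, l, h, w) in boxes:
        crop = F.interpolate(x[:, :, t:t + h, l:l + w], size=(size, size),
                             mode="bilinear", align_corners=False)
        probs.append(F.softmax(model(crop), dim=1))
    return torch.stack(probs).mean(dim=0)   # average over foveations
```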

Adversarial machine learning

Wikipedia

Active learning (machine learning)

Wikipedia

Miscellaneous