‘adversarial examples’ directory

Gwern

Links

“Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs”, Betley et al 2025

“Cryptographers Show That AI Protections Will Always Have Holes”, Hall 2025

“Bypassing Prompt Guards in Production With Controlled-Release Prompting”, Fairoze et al 2025

“Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet [Brave]”, Chaikin & Sahib 2025

elder_plinius @ "2025-08-06"

[Claude-4.1 jailbreak]

https://x.com/elder_plinius/status/1952829605653749768

“[Infectious prompt memes]”, HumanAIBlueprint 2025

“On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment”, Ball et al 2025

“Time Blindness: Why Video-Language Models Can’t See What Humans Can?”, Upadhyay et al 2025

“Claude Has Learned How to Jailbreak Cursor! [Working around `rm` Restrictions Using a Shell Script]”, dogberry 2025

“Testing the Limit of Atmospheric Predictability With a Machine Learning Weather Model”, Vonich & Hakim 2025

“Watching GPT-o3 Guess a Photo’s Location Is Surreal, Dystopian and Wildly Entertaining”, Willison 2025

“LLM Multiplication Task: Synonyms Repeatedly Hack Our Regex Monitor”, McCarthy et al 2025

“[Pseudo-jailbreaks]”

“Career Update: Google DeepMind → Anthropic”, Carlini 2025

“Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models”, Rajeev et al 2025

“Constitutional Classifiers: Defending against Universal Jailbreaks”

“Best-of-N Jailbreaking”, Hughes et al 2024

“Drowning in Documents: Consequences of Scaling Reranker Inference”, Jacob et al 2024

“Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-Driven Cyberattacks”, Pasquini et al 2024

“Jailbreaking LLM-Controlled Robots”, Robey et al 2024

“The Structure of the Token Space for Large Language Models”, Robinson et al 2024

“AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents”, Andriushchenko et al 2024

“Bilinear MLPs Enable Weight-Based Mechanistic Interpretability”, Pearce et al 2024

“A Single Cloud Compromise Can Feed an Army of AI Sex Bots”, Krebs 2024

“Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing”

“RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking”, Jiang et al 2024

“Many-Shot Jailbreaking”, Anil et al 2024

“How to Evaluate Jailbreak Methods: A Case Study With the StrongREJECT Benchmark”, Bowen et al 2024

“Ensemble Everything Everywhere: Multi-Scale Aggregation for Adversarial Robustness”, Fort & Lakshminarayanan 2024

“Tamper-Resistant Safeguards for Open-Weight LLMs”, Tamirisa et al 2024

“Does Refusal Training in LLMs Generalize to the Past Tense?”, Andriushchenko & Flammarion 2024

“Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation”, Halawi et al 2024

“Mitigating Skeleton Key, a New Type of Generative AI Jailbreak Technique”

“Can Go AIs Be Adversarially Robust?”, Tseng et al 2024

“Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Zhao et al 2024

“Super(ficial)-Alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization”, Yang et al 2024

“Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI”, Hönig et al 2024

“Safety Alignment Should Be Made More Than Just a Few Tokens Deep”, Qi et al 2024

“A Theoretical Understanding of Self-Correction through In-Context Alignment”, Wang et al 2024

“Fishing for Magikarp: Automatically Detecting Under-Trained Tokens in Large Language Models”, Land & Bartolo 2024

“Cutting through Buggy Adversarial Example Defenses: Fixing 1 Line of Code Breaks Sabre”, Carlini 2024

“Novel Universal Bypass for All Major LLMs”

“A Rotation and a Translation Suffice: Fooling CNNs With Simple Transformations”, Engstrom et al 2024

“Foundational Challenges in Assuring Alignment and Safety of Large Language Models”, Anwar et al 2024

“CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge”, Chiu et al 2024

“Many-Shot Jailbreaking”, Anthropic 2024

“Privacy Backdoors: Stealing Data With Corrupted Pretrained Models”, Feng & Tramèr 2024

“Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression”, Hong et al 2024

“Logits of API-Protected LLMs Leak Proprietary Information”, Finlayson et al 2024

“Syntactic Ghost: An Imperceptible General-Purpose Backdoor Attacks on Pre-Trained Language Models”, Cheng et al 2024

“When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback”, Lang et al 2024

“Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts”, Samvelyan et al 2024

“Fast Adversarial Attacks on Language Models In One GPU Minute”, Sadasivan et al 2024

“`ArtPrompt`: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Jiang et al 2024

“Using Hallucinations to Bypass GPT-4’s Filter”, Lemkin 2024

“Discovering Universal Semantic Triggers for Text-to-Image Synthesis”, Zhai et al 2024

“Organic or Diffused: Can We Distinguish Human Art from AI-Generated Images?”, Ha et al 2024

“Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024

“Do Not Write That Jailbreak Paper”

“Using Dictionary Learning Features as Classifiers”

https://transformer-circuits.pub/2024/features-as-classifiers/index.html

“May the Noise Be With You: Adversarial Training without Adversarial Examples”, Arous et al 2023

“Tree of Attacks (TAP): Jailbreaking Black-Box LLMs Automatically”, Mehrotra et al 2023

“Eliciting Language Model Behaviors Using Reverse Language Models”, Pfau et al 2023

“Universal Jailbreak Backdoors from Poisoned Human Feedback”, Rando & Tramèr 2023

“Language Model Inversion”, Morris et al 2023

“Dazed & Confused: A Large-Scale Real-World User Study of reCAPTCHAv2”, Searles et al 2023

“Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild”, Inie et al 2023

“Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game”, Toyer et al 2023

“Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition”, Schulhoff et al 2023

“Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models”, Shan et al 2023

“PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”, Chao et al 2023

“Low-Resource Languages Jailbreak GPT-4”, Yong et al 2023

“Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion”, Kim et al 2023

“Human-Producible Adversarial Examples”, Khachaturov et al 2023

“How Robust Is Google’s Bard to Adversarial Image Attacks?”, Dong et al 2023

“Why Do Universal Adversarial Attacks Work on Large Language Models?: Geometry Might Be the Answer”, Subhash et al 2023

“Investigating the Existence of ‘Secret Language’ in Language Models”, Wang et al 2023

“A LLM Assisted Exploitation of AI-Guardian”, Carlini 2023

“Prompts Should Not Be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success”, Zhang & Ippolito 2023

“CLIPMasterPrints: Fooling Contrastive Language-Image Pre-Training Using Latent Variable Evolution”, Freiberger et al 2023

“On the Exploitability of Instruction Tuning”, Shu et al 2023

“Are Aligned Neural Networks Adversarially Aligned?”, Carlini et al 2023

“Evaluating Superhuman Models With Consistency Checks”, Fluri et al 2023

“Evaluating the Robustness of Text-to-Image Diffusion Models against Real-World Attacks”, Gao et al 2023

“Large Language Models Sometimes Generate Purely Negatively-Reinforced Text”, Roger 2023

“On Evaluating Adversarial Robustness of Large Vision-Language Models”, Zhao et al 2023

“Fundamental Limitations of Alignment in Large Language Models”, Wolf et al 2023

“TrojText: Test-Time Invisible Textual Trojan Insertion”, Liu et al 2023

“Poisoning Web-Scale Training Datasets Is Practical”, Carlini et al 2023

“Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models”, Shan et al 2023

“Facial Misrecognition Systems: Simple Weight Manipulations Force DNNs to Err Only on Specific Persons”, Zehavi & Shamir 2023

“TrojanPuzzle: Covertly Poisoning Code-Suggestion Models”, Aghakhani et al 2023

“Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models”, Henderson et al 2022

“SNAFUE: Diagnostics for Deep Neural Networks With Automated Copy/Paste Attacks”, Casper et al 2022

“Are AlphaZero-Like Agents Robust to Adversarial Perturbations?”, Lan et al 2022

“Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models”, Struppek et al 2022

“Adversarial Policies Beat Superhuman Go AIs”, Wang et al 2022

“Broken Neural Scaling Laws”, Caballero et al 2022

“On Optimal Learning Under Targeted Data Poisoning”, Hanneke et al 2022

“BTD: Decompiling x86 Deep Neural Network Executables”, Liu et al 2022

“Discovering Bugs in Vision Models Using Off-the-Shelf Image Generation and Captioning”, Wiles et al 2022

“Adversarially Trained Neural Representations May Already Be as Robust as Corresponding Biological Neural Representations”, Guo et al 2022

“MET: Masked Encoding for Tabular Data”, Majmundar et al 2022

“Flatten the Curve: Efficiently Training Low-Curvature Neural Networks”, Srinivas et al 2022

“Why Robust Generalization in Deep Learning Is Difficult: Perspective of Expressive Power”, Li et al 2022

“Diffusion Models for Adversarial Purification”, Nie et al 2022

“Planting Undetectable Backdoors in Machine Learning Models”, Goldwasser et al 2022

“Training a Helpful and Harmless Assistant With Reinforcement Learning from Human Feedback”, Bai et al 2022

“Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings”, Mao et al 2022

“On the Effectiveness of Dataset Watermarking in Adversarial Settings”, Tekgul & Asokan 2022

“An Equivalence Between Data Poisoning and Byzantine Gradient Attacks”, Farhadkhani et al 2022

“Red Teaming Language Models With Language Models”, Perez et al 2022

“WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation”, Liu et al 2022

“CommonsenseQA 2.0: Exposing the Limits of AI through Gamification”, Talmor et al 2022

“Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs”, Korkmaz 2021

“Models in the Loop: Aiding Crowdworkers With Generative Annotation Assistants”, Bartolo et al 2021

“PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021

“Spinning Language Models for Propaganda-As-A-Service”, Bagdasaryan & Shmatikov 2021

“TnT Attacks! Universal Naturalistic Adversarial Patches Against Deep Neural Network Systems”, Doan et al 2021

“AugMax: Adversarial Composition of Random Augmentations for Robust Training”, Wang et al 2021

“Unrestricted Adversarial Attacks on ImageNet Competition”, Chen et al 2021

“Quantization Backdoors to Deep Learning Commercial Frameworks”, Ma et al 2021

“The Dimpled Manifold Model of Adversarial Examples in Machine Learning”, Shamir et al 2021

“Partial Success in Closing the Gap between Human and Machine Vision”, Geirhos et al 2021

“A Universal Law of Robustness via Isoperimetry”, Bubeck & Sellke 2021

“Manipulating SGD With Data Ordering Attacks”, Shumailov et al 2021

“Gradient-Based Adversarial Attacks against Text Transformers”, Guo et al 2021

“A Law of Robustness for Two-Layers Neural Networks”, Bubeck et al 2021

“Multimodal Neurons in Artificial Neural Networks [CLIP]”, Goh et al 2021

“Do Input Gradients Highlight Discriminative Features?”, Shah et al 2021

“Words as a Window: Using Word Embeddings to Explore the Learned Representations of Convolutional Neural Networks”, Dharmaretnam et al 2021

“Bot-Adversarial Dialogue for Safe Conversational Agents”, Xu et al 2021

“Unadversarial Examples: Designing Objects for Robust Vision”, Salman et al 2020

“Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design [UED+PAIRED]”, Dennis et al 2020

“Concealed Data Poisoning Attacks on NLP Models”, Wallace et al 2020

“Recipes for Safety in Open-Domain Chatbots”, Xu et al 2020

“Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples”, Gowal et al 2020

“Dataset Cartography: Mapping and Diagnosing Datasets With Training Dynamics”, Swayamdipta et al 2020

“Collaborative Learning in the Jungle (Decentralized, Byzantine, Heterogeneous, Asynchronous and Nonconvex Learning)”, El-Mhamdi et al 2020

“Do Adversarially Robust ImageNet Models Transfer Better?”, Salman et al 2020

“Smooth Adversarial Training”, Xie et al 2020

“Sponge Examples: Energy-Latency Attacks on Neural Networks”, Shumailov et al 2020

“Improving the Interpretability of fMRI Decoding Using Deep Neural Networks and Adversarial Robustness”, McClure et al 2020

“Approximate Exploitability: Learning a Best Response in Large Games”, Timbers et al 2020

“Radioactive Data: Tracing through Training”, Sablayrolles et al 2020

“ImageNet-A: Natural Adversarial Examples”, Hendrycks et al 2020

“Adversarial Examples Improve Image Recognition”, Xie et al 2019

“Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods”, Slack et al 2019

“The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019

“Distributionally Robust Language Modeling”, Oren et al 2019

“Universal Adversarial Triggers for Attacking and Analyzing NLP”, Wallace et al 2019

“Robustness Properties of Facebook’s ResNeXt WSL Models”, Orhan 2019

“Intriguing Properties of Adversarial Training at Scale”, Xie & Yuille 2019

“Adversarially Robust Generalization Just Requires More Unlabeled Data”, Zhai et al 2019

“Adversarial Robustness as a Prior for Learned Representations”, Engstrom et al 2019

“Are Labels Required for Improving Adversarial Robustness?”, Uesato et al 2019

“Adversarial Policies: Attacking Deep Reinforcement Learning”, Gleave et al 2019

“Adversarial Examples Are Not Bugs, They Are Features”, Ilyas et al 2019

“Smooth Adversarial Examples”, Zhang et al 2019

“Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”, Hendrycks & Dietterich 2019

“Fairwashing: the Risk of Rationalization”, Aïvodji et al 2019

“AdVersarial: Perceptual Ad Blocking Meets Adversarial Machine Learning”, Tramèr et al 2018

“Adversarial Reprogramming of Text Classification Neural Networks”, Neekhara et al 2018

“Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations”, Hendrycks & Dietterich 2018

“Adversarial Reprogramming of Neural Networks”, Elsayed et al 2018

“Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data”, Yang et al 2018

“Robustness May Be at Odds With Accuracy”, Tsipras et al 2018

“Towards the First Adversarially Robust Neural Network Model on MNIST”, Schott et al 2018

“Adversarial Vulnerability for Any Classifier”, Fawzi et al 2018

“Sensitivity and Generalization in Neural Networks: an Empirical Study”, Novak et al 2018

“Intriguing Properties of Adversarial Examples”, Cubuk et al 2018

“First-Order Adversarial Vulnerability of Neural Networks and Input Dimension”, Simon-Gabriel et al 2018

“Adversarial Spheres”, Gilmer et al 2018

“CycleGAN, a Master of Steganography”, Chu et al 2017

“Adversarial Phenomenon in the Eyes of Bayesian Deep Learning”, Rawat et al 2017

“Mitigating Adversarial Effects Through Randomization”, Xie et al 2017

“Chihuahua or Muffin? My Search for the Best Computer Vision API”, Yao 2017

“Learning Universal Adversarial Perturbations With Generative Models”, Hayes & Danezis 2017

“Robust Physical-World Attacks on Deep Learning Models”, Eykholt et al 2017

“Lempel-Ziv: a ‘1-Bit Catastrophe’ but Not a Tragedy”, Lagarde & Perifel 2017

“Towards Deep Learning Models Resistant to Adversarial Attacks”, Madry et al 2017

“Ensemble Adversarial Training: Attacks and Defenses”, Tramèr et al 2017

“The Space of Transferable Adversarial Examples”, Tramèr et al 2017

“Learning from Simulated and Unsupervised Images through Adversarial Training”, Shrivastava et al 2016

“Membership Inference Attacks against Machine Learning Models”, Shokri et al 2016

“Adversarial Examples in the Physical World”, Kurakin et al 2016

“Foveation-Based Mechanisms Alleviate Adversarial Examples”, Luo et al 2015

“Explaining and Harnessing Adversarial Examples”, Goodfellow et al 2014

“Scunthorpe”, Sandberg 2026

“Baiting the Bot”

“Janus”

“A Discussion of ‘Adversarial Examples Are Not Bugs, They Are Features’”

https://distill.pub/2019/advex-bugs-discussion/

“A Discussion of ‘Adversarial Examples Are Not Bugs, They Are Features’: Learning from Incorrectly Labeled Data”

https://distill.pub/2019/advex-bugs-discussion/response-6/

“Beyond the Board: Exploring AI Robustness Through Go”

“Adversarial Policies in Go”

“Imprompter”

“MCP Security Notification: Tool Poisoning Attacks”

“Why I Attack”, Carlini 2026

“It’s Owl in the Numbers: Token Entanglement in Subliminal Learning”

“When AI Gets Hijacked: Exploiting Hosted Models for Dark Roleplaying”

“Neural Style Transfer With Adversarially Robust Classifiers”

“Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations”

“Time Blindness: Why Video-Language Models Can’t See What Humans Can?”

“Adversarial Machine Learning”

“The Chinese Women Turning to ChatGPT for AI Boyfriends”

“Interpreting Preference Models w/Sparse Autoencoders”

“[MLSN #2]: Adversarial Training”

https://www.lesswrong.com/posts/7GQZyooNi5nqgoyyJ/mlsn-2-adversarial-training

“AXRP Episode 1—Adversarial Policies With Adam Gleave”

https://www.lesswrong.com/posts/8MZ72PYa3kRe4yRDD/axrp-episode-1-adversarial-policies-with-adam-gleave

“I Found >800 Orthogonal ‘Write Code’ Steering Vectors ”

I found >800 orthogonal ‘write code’ steering vectors

View HTML:

/doc/www/www.greaterwrong.com/441e2c82f2dbe90699728ce7f7fefd27ae4f2a0e.html

“When Your AIs Deceive You: Challenges With Partial Observability in RLHF ”

When Your AIs Deceive You: Challenges with Partial Observability in RLHF

“Claude Sonnet 3.7 (Often) Knows When It’s in Alignment Evaluations ”

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

“Role Embeddings: Making Authorship More Salient to LLMs ”

Role embeddings: making authorship more salient to LLMs

View HTML:

/doc/www/www.greaterwrong.com/a5ba3f418580c3338e920195027e0cf9ab6da175.html

“A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More ”

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

“Bing Finding Ways to Bypass Microsoft’s Filters without Being Asked. Is It Reproducible? ”

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

View HTML:

/doc/www/www.greaterwrong.com/487c2fd3e7587697c1ac89e105d4245b348ffc89.html

“Best-Of-n With Misaligned Reward Models for Math Reasoning ”

Best-of-n with misaligned reward models for Math reasoning

“One-Shot Steering Vectors Cause Emergent Misalignment, Too ”

One-shot steering vectors cause emergent misalignment, too

“Steganography and the CycleGAN—Alignment Failure Case Study ”

Steganography and the CycleGAN—alignment failure case study

“A Three-Layer Model of LLM Psychology ”

A Three-Layer Model of LLM Psychology

“This Viral AI Chatbot Will Lie and Say It’s Human ”

This Viral AI Chatbot Will Lie and Say It’s Human

View External Link:

https://www.wired.com/story/bland-ai-chatbot-human/

“A Universal Law of Robustness ”

A Universal Law of Robustness

https://www.youtube.com/watch?v=OzGguadEHOU

“Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model! ”

Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model!

“A Law of Robustness and the Importance of Overparameterization in Deep Learning ”

A law of robustness and the importance of overparameterization in deep learning

https://www.youtube.com/watch?v=ujMvnQpP528

NoaNabeshima

The new CLIP adversarial examples are partially from the use-mention distinction. CLIP was trained to predict which caption from a list matches an image. It makes sense that a picture of an apple with a large ‘iPod’ label would be captioned with ‘iPod’, not ‘Granny Smith’!

/doc/www/localhost/18251ac2d494b792d2220fe0c0410d78ac0ed12c.html

fabianstelzer

[Grok-3 indirect prompt injection via tweet retrieval]

repligate

Claude-3 base-model-like jailbreak

https://x.com/repligate/status/1776041976653402508

Wikipedia (2)

Miscellaneous

Bibliography

https://arxiv.org/abs/2410.13691: “Jailbreaking LLM-Controlled Robots ”, Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, George J. Pappas

https://arxiv.org/abs/2410.08993: “The Structure of the Token Space for Large Language Models ”, Michael Robinson, Sourya Dey, Shauna Sweet

https://arxiv.org/abs/2408.05446: “Ensemble Everything Everywhere: Multi-Scale Aggregation for Adversarial Robustness ”, Stanislav Fort, Balaji Lakshminarayanan

https://arxiv.org/abs/2407.11969: “Does Refusal Training in LLMs Generalize to the Past Tense? ”, Maksym Andriushchenko, Nicolas Flammarion

https://arxiv.org/abs/2406.12843: “Can Go AIs Be Adversarially Robust? ”, Tom Tseng, Euan McLean, Kellin Pelrine, Tony T. Wang, Adam Gleave

https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models ”, Siyan Zhao, Tung Nguyen, Aditya Grover

https://arxiv.org/abs/2404.06664: “CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge ”, Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi

https://arxiv.org/abs/2402.17747: “When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback ”, Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

https://arxiv.org/abs/2402.15570: “Fast Adversarial Attacks on Language Models In One GPU Minute ”, Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi

https://arxiv.org/abs/2402.11753: “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs ”, Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

https://arxiv.org/abs/2401.05566#anthropic: “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training ”, Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

https://arxiv.org/abs/2310.08419: “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries ”, Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong

https://arxiv.org/abs/2310.02279#sony: “Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion ”, Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon

https://arxiv.org/abs/2309.11751: “How Robust Is Google’s Bard to Adversarial Image Attacks? ”, Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu

https://arxiv.org/abs/2306.07567: “Large Language Models Sometimes Generate Purely Negatively-Reinforced Text ”, Fabien Roger

https://arxiv.org/abs/2305.16934: “On Evaluating Adversarial Robustness of Large Vision-Language Models ”, Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, Min Lin

https://arxiv.org/abs/2303.02242: “TrojText: Test-Time Invisible Textual Trojan Insertion ”, Yepeng Liu, Bo Feng, Qian Lou

https://arxiv.org/abs/2302.04222: “Glaze: Protecting Artists from Style Mimicry by Text-To-Image Models ”, Shawn Shan, Jenna Cryan, Emily Wenger, Haitao Zheng, Rana Hanocka, Ben Y. Zhao

https://arxiv.org/abs/2211.03769: “Are AlphaZero-Like Agents Robust to Adversarial Perturbations? ”, Li-Cheng Lan, Huan Zhang, Ti-Rong Wu, Meng-Yu Tsai, I-Chen Wu, Cho-Jui Hsieh

https://arxiv.org/abs/2211.00241: “Adversarial Policies Beat Superhuman Go AIs ”, Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

https://arxiv.org/abs/2208.08831#deepmind: “Discovering Bugs in Vision Models Using Off-The-Shelf Image Generation and Captioning ”, Olivia Wiles, Isabela Albuquerque, Sven Gowal

https://arxiv.org/abs/2205.07460: “Diffusion Models for Adversarial Purification ”, Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, Anima Anandkumar

https://swabhs.com/assets/pdf/wanli.pdf#allen: “WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation ”, Alisa Liu, Swabha Swayamdipta, Noah A. Smith, Yejin Choi

https://arxiv.org/abs/2201.05320#allen: “CommonsenseQA 2.0: Exposing the Limits of AI through Gamification ”, Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant

https://arxiv.org/abs/2110.13771#nvidia: “AugMax: Adversarial Composition of Random Augmentations for Robust Training ”, Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, Zhangyang Wang

https://arxiv.org/abs/2106.07411: “Partial Success in Closing the Gap between Human and Machine Vision ”, Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel

https://arxiv.org/abs/2105.12806: “A Universal Law of Robustness via Isoperimetry ”, Sébastien Bubeck, Mark Sellke

https://distill.pub/2021/multimodal-neurons/#openai: “Multimodal Neurons in Artificial Neural Networks [CLIP] ”, Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, Chris Olah

https://aclanthology.org/2021.naacl-main.235.pdf#facebook: “Bot-Adversarial Dialogue for Safe Conversational Agents ”, Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, Emily Dinan

https://arxiv.org/abs/2006.14536#google: “Smooth Adversarial Training ”, Cihang Xie, Mingxing Tan, Boqing Gong, Alan Yuille, Quoc V. Le

https://arxiv.org/abs/2002.00937: “Radioactive Data: Tracing through Training ”, Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

https://arxiv.org/abs/1911.09665: “Adversarial Examples Improve Image Recognition ”, Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, Quoc V. Le

https://arxiv.org/abs/1907.07640: “Robustness Properties of Facebook’s ResNeXt WSL Models ”, A. Emin Orhan

https://arxiv.org/abs/1706.06083: “Towards Deep Learning Models Resistant to Adversarial Attacks ”, Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu
