<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Norman Mu</title>
    <description></description>
    <link>https://www.normanmu.com/</link>
    <atom:link href="https://www.normanmu.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 15 Apr 2026 15:58:52 -0700</pubDate>
    <lastBuildDate>Wed, 15 Apr 2026 15:58:52 -0700</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>What Do Adversarial Examples Tell Us About Prompt Injections?</title>
        <description>&lt;p&gt;It’s been more than 3 years since the concept of a &lt;a href=&quot;https://simonwillison.net/2022/Sep/12/prompt-injection/&quot;&gt;prompt injection&lt;/a&gt; was first popularized, and in that time AI has gone from “barely coherent conversations” to “autonomous discovery of novel vulnerabilities in the Linux kernel”. But models remain highly credulous and easily misled; &lt;a href=&quot;https://arxiv.org/abs/2603.12277&quot;&gt;a recent paper&lt;/a&gt; was able to reliably circumvent refusals by simply appending specious reasoning traces which justify a given request. Why is prompt injection such a difficult problem to solve? The mixed results from a decade of research on adversarial examples in image classification are instructive.&lt;/p&gt;

&lt;p&gt;An adversarial example is an input intentionally constructed to cause a failure in a machine learning system, such as an email crafted to bypass spam filters. The setting that has drawn the most attention in recent years is that of &lt;a href=&quot;https://arxiv.org/abs/1412.6572&quot;&gt;imperceptible modifications to images&lt;/a&gt; that cause misclassifications by neural networks.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2026-04-15-adversarial-examples-prompt-injections/fgsm_figure.png&quot; alt=&quot;The canonical example of an adversarial example (Goodfellow et al., 2014).&quot; style=&quot;width: 80% ; margin-bottom: 1rem&quot; /&gt;
    &lt;figcaption&gt;The canonical example of an adversarial example (&lt;a href=&quot;https://arxiv.org/abs/1412.6572&quot;&gt;Goodfellow et al., 2014&lt;/a&gt;).&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;Over the years, researchers have studied different instantiations of this problem that vary in the kinds of modifications the adversary is allowed to make. The most widely studied whitebox threat model assumes that (a minimal sketch of such an attack follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The adversary can make bounded perturbations to real input images&lt;/li&gt;
  &lt;li&gt;The adversary has full knowledge of the classifier and defenses (including all weights)&lt;/li&gt;
  &lt;li&gt;The adversary can afford to compute many (thousands of) predictions and gradient steps&lt;/li&gt;
&lt;/ol&gt;
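
&lt;p&gt;As a concrete illustration of these three assumptions, here is a minimal projected gradient descent (PGD) attack sketch in PyTorch; the classifier, the (x, y) batch, and the epsilon budget are placeholders rather than anything tied to a particular paper.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=40):
    # Whitebox, L-infinity bounded attack: query gradients of the model
    # (assumption 2) for many iterations (assumption 3) while keeping the
    # perturbation inside an epsilon ball around the real image (assumption 1).
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            # Ascend the loss, then project back onto the feasible set.
            delta += alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.copy_((x + delta).clamp(0, 1) - x)
        delta.grad.zero_()
    return (x + delta).detach()
&lt;/code&gt;&lt;/pre&gt;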

&lt;p&gt;This threat model maps onto some low-to-moderate security threats in the real world, such as evading content filters or sabotaging self-driving cars, but much of the research in this area over the last decade has been motivated by the intellectual challenge of adversarial examples rather than real-world threats to AI systems (&lt;a href=&quot;https://arxiv.org/abs/1807.06732&quot;&gt;Gilmer et al., 2019&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;There are a number of differing explanations for where adversarial examples come from. The closest-to-consensus view is that because neural network models are less constrained in their processing of information than humans, they often learn to leverage spurious statistical correlations during training (&lt;a href=&quot;https://arxiv.org/abs/1905.02175&quot;&gt;Ilyas et al., 2019&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1811.12231&quot;&gt;Geirhos et al., 2018&lt;/a&gt;). Test accuracy suffers when these correlations no longer hold, due to out-of-sample data or intentional scrambling by an adversary (&lt;a href=&quot;https://arxiv.org/abs/1906.08988&quot;&gt;Yin et al., 2019&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2004.07780&quot;&gt;Geirhos et al., 2020&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The early days of research on adversarial examples were marked by a flurry of clever ideas each claiming large improvements in robustness. Researchers often declared victory prematurely after evaluating against a handful of basic attacks that were not tailored to proposed defenses. For instance, injecting random noise during inference or introducing non-differentiable operations made direct gradient computation impossible, but could also be easily addressed by &lt;a href=&quot;https://arxiv.org/abs/1802.00420&quot;&gt;approximating the gradient&lt;/a&gt;. Many successful careers were built on swatting down spurious claims and establishing careful evaluation protocols (&lt;a href=&quot;https://arxiv.org/abs/1902.06705&quot;&gt;Carlini et al., 2019&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2003.01690&quot;&gt;Croce and Hein, 2020&lt;/a&gt;) that paved the way for standardized benchmarks like &lt;a href=&quot;https://arxiv.org/abs/2010.09670&quot;&gt;RobustBench&lt;/a&gt;.&lt;/p&gt;
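
&lt;p&gt;To make that failure mode concrete, here is a minimal sketch of the straight-through style of gradient approximation, in the spirit of the BPDA technique from the paper linked above. The 8-bit re-quantization “defense” is just an illustrative stand-in for any non-differentiable preprocessing step.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def quantize_defense(x):
    # Illustrative defense: 8-bit re-quantization, whose true gradient is zero
    # almost everywhere and therefore blocks naive gradient-based attacks.
    return torch.round(x * 255.0) / 255.0

class StraightThrough(torch.autograd.Function):
    # Run the defense in the forward pass, but treat it as the identity
    # function in the backward pass so gradients still reach the input.
    @staticmethod
    def forward(ctx, x):
        return quantize_defense(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def attack_loss(model, x, y):
    # Gradients now flow to x as if the defense were absent, so a standard
    # gradient-based attack (such as PGD) proceeds largely unimpeded.
    return torch.nn.functional.cross_entropy(model(StraightThrough.apply(x)), y)
&lt;/code&gt;&lt;/pre&gt;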

&lt;p&gt;The RobustBench leaderboard is the canonical record book for tracking the performance of image classifiers against a standardized battery of whitebox attacks. Technically, the reported numbers are merely upper bounds on model performance, as more effective attacks may be discovered in the future, but the constraint of making small perturbations to existing images is limiting, and there have been years of intensive research into new attack methods. Currently, the best CIFAR-10 classifier stands at 93.7% “clean accuracy” on the original test set, and drops to 73.7% “robust accuracy” when evaluated on the adversarially perturbed test set.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2026-04-15-adversarial-examples-prompt-injections/robustbench.png&quot; alt=&quot;RobustBench (4/15/2026).&quot; style=&quot;width: 100% ; margin-bottom: 1rem&quot; /&gt;
    &lt;figcaption&gt;&lt;a href=&quot;https://robustbench.github.io/&quot;&gt;RobustBench&lt;/a&gt; (4/15/2026).&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;Today, the most robust image classifiers are trained with a more sophisticated version of one of the earliest defenses: adversarial training, or training the model to correctly classify adversarially perturbed inputs which are continually regenerated during training. A critical ingredient is that the perturbations must be freshly computed in each training iteration, since a specific perturbation is easy to defend against. The key differences now are that advances in generative modeling have enabled scaling up the training dataset from CIFAR-10’s original 50K training images to millions of synthetic images, and hardware advances have enabled training much bigger models for much longer. This is the approach taken by the team who trained the model that currently sits at the top of the RobustBench leaderboard (&lt;a href=&quot;https://arxiv.org/abs/2404.09349&quot;&gt;Bartoldson et al., 2024&lt;/a&gt;).&lt;/p&gt;
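
&lt;p&gt;A minimal sketch of what one epoch of adversarial training looks like, reusing the pgd_attack helper sketched earlier; the model, optimizer, and data loader are assumed to exist, and PGD stands in for whichever attack is actually used.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def adversarial_training_epoch(model, loader, optimizer, epsilon=8 / 255):
    model.train()
    for x, y in loader:
        # Freshly compute a perturbation for the current batch and the current
        # weights; training against stale, fixed perturbations is easy to
        # overfit to and gives little robustness.
        x_adv = pgd_attack(model, x, y, epsilon=epsilon)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
&lt;/code&gt;&lt;/pre&gt;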

&lt;p&gt;The same authors conducted a scaling study which trained many models across varying levels of compute and data quality to fit scaling curves and predict what further investments in compute would yield. Eyeballing their curves, a roughly GPT-4 sized training run would achieve roughly 85% robust accuracy, in the vicinity of the human robustness that the same authors estimated in a separate human study.&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; This would cost about $10M at current H100 prices, which is at least ten million times more expensive than training a model to 85% clean accuracy. So it is likely well within our physical capabilities today to train an image classification model that is roughly as robust against adversarial examples as humans, though no one has yet been willing to fund such an effort.&lt;/p&gt;

&lt;p&gt;Back in the realm of LLMs, even a ten-fold increase in the cost of training is daunting. Complicating matters further, the expensive but feasible research results achieved on image classification assume perfect knowledge of test-time attacks, while attackers in the real world offer no such affordances. There are still regular discoveries of novel attacks, as well as attack techniques used by human experts which are difficult to proceduralize with code or LLMs. Such attacks are difficult or impossible to train against directly.&lt;/p&gt;

&lt;p&gt;Much of agent security work today focuses on securing input and output channels — for instance by &lt;a href=&quot;https://arxiv.org/abs/2503.18813&quot;&gt;isolating untrusted data&lt;/a&gt;, &lt;a href=&quot;https://www.anthropic.com/engineering/claude-code-auto-mode&quot;&gt;detecting misbehavior with classifiers&lt;/a&gt;, or &lt;a href=&quot;https://cdn.openai.com/pdf/dd8e7875-e606-42b4-80a1-f824e4e11cf4/prevent-url-data-exfil.pdf&quot;&gt;minimizing the means&lt;/a&gt; through which malicious instructions can be carried out — accepting the vulnerability of models as a given. The security of software systems that undergird AI agents can be more straightforward to reason about and improve upon, and will benefit significantly from further advances in AI coding capabilities. But for now, there are no easy answers for securing agents that consume any channels of untrusted information.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;The authors estimate human performance at around 90%, though this is a generous upper bound since the study subjects were shown images optimized against neural networks, not images optimized against human perception. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 15 Apr 2026 00:00:00 -0700</pubDate>
        <link>https://www.normanmu.com/2026/04/15/adversarial-examples-prompt-injections.html</link>
        <guid isPermaLink="true">https://www.normanmu.com/2026/04/15/adversarial-examples-prompt-injections.html</guid>
        
        
      </item>
    
      <item>
        <title>The Myth of Data Inefficiency in Large Language Models</title>
        <description>&lt;p&gt;A popular assertion among cognitive psychologists is that large language models (LLMs) rely on a “developmentally implausible” form of language learning. There are many points of discordance between AI and human development, but data inefficiency is one of the most commonly discussed. Specifically, it has been claimed that LLMs “receive around four or five orders of magnitude more language data than human children” (Frank, 2023).&lt;/p&gt;

&lt;p&gt;In concrete terms: the 7 billion parameter Llama-2 model was pre-trained on 2 trillion tokens, while the 100 million token BabyLM dataset of Warstadt et al. (2023) has been proposed as a more developmentally plausible amount of “pre-training data” seen by adolescent humans.&lt;/p&gt;

&lt;p&gt;The data inefficiency argument misses two key details, which together suggest that LLMs may not be so far behind humans in data efficiency. First, it neglects an important consequence of LLM scaling laws: to achieve the same level of language fluency, larger models require significantly less data than smaller models. This is an easy point to miss if one were only following AI headlines, which have fixated on the major commercial labs’ strategies of massively scaling up compute, data, and model size.&lt;/p&gt;

&lt;p&gt;The foundational scaling law papers by Kaplan et al. (2020) and Hoffmann et al. (2022) are primarily motivated by the question: given a fixed compute budget, how big a model should be trained to achieve the lowest loss? To further improve the loss, more compute can be spent either on a bigger model or on training for longer, so a principled approach is needed for balancing these tradeoffs.&lt;/p&gt;

&lt;p&gt;Kaplan et al. (2020) point out in their Figure 2 (pg. 4) that larger models learn more quickly, though with the sparsely labeled X-axis it’s hard to get a sense of the exact relationship between data and parameters. Fortunately for us, the data from a similar set of experiments in Hoffmann et al. (2022) has been extracted by a team of researchers in a replication study (Besiroglu et al., 2024), and we can plot another version showing dataset size as the dependent variable:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2025-02-14-data-inefficiency-llms/plot.png&quot; alt=&quot;Iso-loss contours as a function of model size (parameters) and training tokens.&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;Iso-loss contours as a function of model size (parameters) and training tokens. Plotting code &lt;a href=&quot;https://gist.github.com/normster/8b97a0f74a5f5a60ba8bee8d7790f1b3&quot;&gt;here&lt;/a&gt;.&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;In the context of language learning, models like Llama-2 7B are already able to achieve a level of basic language understanding roughly on par with humans. Remember, we’re just looking at the acquisition of language and not the mastery of all intellectual activities involving language. So how much data would a brain-sized LLM need? A crude comparison on the basis of matching LLM parameter count to synapse count suggests that Llama-2 7B is ~14,000x smaller than the brain, which is generally thought to contain ~100T synapses. Plugging this into the revised scaling law by Besiroglu et al. (2024) gives us a dataset of ~60B tokens, which is 34x smaller than the 2T tokens needed for Llama-2 7B. A brain-sized (i.e., 100T parameters) LLM is quite feasible with today’s hardware: &lt;a href=&quot;https://chatgpt.com/share/67ae31ca-8fac-8007-a6bc-62fff1c0679b&quot;&gt;10 days of training with 10,000 GB200 GPUs&lt;/a&gt;.&lt;/p&gt;
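
&lt;p&gt;A back-of-the-envelope version of this calculation, using the Chinchilla parametric loss and the approximate constants reported in the Besiroglu et al. (2024) replication (the exact values, and the synapse-to-parameter analogy itself, should be treated as rough assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Chinchilla parametric form: L(N, D) = E + A / N**alpha + B / D**beta,
# with constants (approximate) from the Besiroglu et al. (2024) replication.
E, A, B, alpha, beta = 1.82, 482.0, 2085.4, 0.3478, 0.3658

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def tokens_for_loss(target_loss, n_params):
    # Solve L(N, D) = target_loss for D at a fixed parameter count N.
    data_term = target_loss - E - A / n_params**alpha
    return (B / data_term) ** (1.0 / beta)

llama2_7b_loss = loss(7e9, 2e12)  # Llama-2 7B: 7B parameters, 2T tokens
brain_tokens = tokens_for_loss(llama2_7b_loss, 100e12)  # 100T-parameter model
print(brain_tokens / 1e9)  # roughly 60 (billion tokens), ~34x less than 2T
&lt;/code&gt;&lt;/pre&gt;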

&lt;p&gt;Moreover, a fixed data budget can be stretched much further by training for multiple randomly-ordered passes, or “epochs”. Muennighoff et al. (2023) find that text data can be reused up to four times without incurring noticeable model degradation, and increasing the compute budget allows for even more aggressive data reuse. So even with a totally ordinary model architecture (Transformer) and training algorithm (gradient descent), basic language understanding can likely be reached with ~15B tokens, and fewer still if you’re willing to spend more compute on more passes over the training dataset. This already brings LLMs within 2 orders of magnitude of humans, a 100x to 1000x smaller gap than typically claimed.&lt;/p&gt;
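
&lt;p&gt;Continuing the sketch above, the multi-epoch adjustment is a single division:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Muennighoff et al. (2023): up to ~4 passes over the same data cost little,
# so the unique-token requirement shrinks by roughly another factor of 4.
unique_tokens = brain_tokens / 4
print(unique_tokens / 1e9)  # roughly 15 (billion tokens)
&lt;/code&gt;&lt;/pre&gt;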

&lt;p&gt;More fundamentally, the data inefficiency argument also ignores hundreds of thousands of years of evolutionary development. Developmental psychologists have written extensively on the importance of non-learned “core knowledge” (Spelke, 2007) for enabling within-lifetime learning of language. Machine learning researchers have similar concepts of “inductive bias” or “statistical priors”, which researchers often specify by hand for their computational models. But Nature was afforded no such help in establishing human core knowledge! Instead, this evolved across the millennia that separate our species from earlier hominids.&lt;/p&gt;

&lt;p&gt;By some accounts, language and culture co-evolved with our species starting from 200,000 years ago, which constitutes roughly 10,000 generations. Before an individual human has seen a single token of language, the information contained within their genetic code has already “experienced” the language tokens seen by each of their roughly 20,000 direct ancestors (two parents per generation) before those ancestors reproduced. It is only by ignoring this ancestral data that one is able to reach the conclusion that LLMs are vastly less data efficient than humans.&lt;/p&gt;

&lt;p&gt;Our early ancestors likely used language less frequently than we do today, so assuming 100M tokens (i.e., what the BabyLM organizers cite as plausible for a modern 13 year old) for each ancestor, we arrive at 2T tokens total within one’s direct lineage. Also critical to the process of natural selection is the rest of the population: roughly 100B humans have existed throughout history, each learning from even more language data.&lt;/p&gt;
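
&lt;p&gt;For what it’s worth, the 2T figure falls out of a few multiplications (the 20-year generation time is an assumption implied by the 10,000-generation estimate):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;generations = 200_000 / 20   # ~200k years of language, ~20-year generations
ancestors = 2 * generations  # two parents per generation in a direct lineage
lineage_tokens = ancestors * 100e6  # BabyLM-style 100M tokens per lifetime
print(lineage_tokens / 1e12)  # 2.0 (trillion tokens in one direct lineage)
&lt;/code&gt;&lt;/pre&gt;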

&lt;p&gt;To summarize, our very rough estimates hold that an LLM the size of a human brain would only require pre-training for ~15B tokens to reach the language fluency level of Llama-2 7B, whereas the development of human language learning has already consumed at least 2T tokens in across-lifetime learning. And as DeepSeek showed with the R1 model, developing additional skills such as mathematical reasoning requires several orders of magnitude less data than pre-training. Perhaps this shouldn’t be too surprising, given that evolution is a zeroth order algorithm while gradient descent is a more powerful first order algorithm.&lt;/p&gt;

&lt;p&gt;The pursuit of radically more data efficient language model training is motivated by a flawed premise which incorrectly assesses current progress in AI. Instead, researchers studying both artificial and human intelligence need to consider the possibility that deep learning might already be as powerful as, if not more powerful than, human learning. Two implications of this: 1) LLMs are a better model for human language learning than many researchers currently believe, and 2) closely emulating human-like patterns of learning and thinking is within grasp.&lt;/p&gt;

&lt;hr class=&quot;section-divider&quot; /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;Besiroglu, T., et al. (2024). &lt;em&gt;Chinchilla Scaling: A Replication Attempt&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Frank, M. C. (2023). &lt;em&gt;Bridging the data gap between children and large language models&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Hoffmann, J., et al. (2022). &lt;em&gt;Training Compute-Optimal Large Language Models&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Kaplan, J., et al. (2020). &lt;em&gt;Scaling Laws for Neural Language Models&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Muennighoff, N., et al. (2023). &lt;em&gt;Scaling Data-Constrained Language Models&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Spelke, E. S. (2007). &lt;em&gt;Core knowledge&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Warstadt, A., et al. (2023). &lt;em&gt;Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora&lt;/em&gt;.&lt;/p&gt;
</description>
        <pubDate>Fri, 14 Feb 2025 00:00:00 -0800</pubDate>
        <link>https://www.normanmu.com/2025/02/14/data-inefficiency-llms.html</link>
        <guid isPermaLink="true">https://www.normanmu.com/2025/02/14/data-inefficiency-llms.html</guid>
        
        
      </item>
    
      <item>
        <title>Adversarial Patches for Deep Neural Networks</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;A lot of academic attention has been focused on the topic of adversarial attacks against neural networks recently, and for good reason. Because we still lack a thorough understanding of why neural networks work well, the failure modes of neural networks are quite surprising&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. By injecting noise which is imperceptible to human vision, it is possible to fool a network into misclassifying an input image with very high confidence. The classic attack called the Fast Gradient Sign Method (FGSM) is shown below and can be implemented in just a few lines of code.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/fgsm.png&quot; alt=&quot;Fast Gradient Sign Method (https://arxiv.org/pdf/1412.6572.pdf)&quot; style=&quot;width: 80% ; margin-bottom: 1rem&quot; /&gt;
    &lt;figcaption&gt;Fast Gradient Sign Method (https://arxiv.org/pdf/1412.6572.pdf)&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;There are already plenty of resources explaining the basics of neural networks and backpropagation, so I’ll just give a quick summary here. The standard formulation of training a neural network is that of minimizing a loss function defined with respect to the weights/parameters of the neural network. In practice this is usually an empirical loss function, i.e. the loss of the network’s predictions averaged over a finite and fixed dataset:&lt;/p&gt;

\[\arg \min_{\theta} \frac{1}{N}\sum_{i=1}^N{l(x_i, y_i; \theta)}\]

&lt;p&gt;for $\theta$ the parameters of the network, $l(x, y; \theta)$ the loss function defined on the parameters of the network with respect to a training input $x$ and label $y$.&lt;/p&gt;

&lt;p&gt;The gold standard of optimization algorithms remains Stochastic Gradient Descent (SGD), where we iteratively update the parameters of the network in the opposite direction of the gradient with respect to the parameters. Because we choose the architecture of a neural network and its loss function to be differentiable, it is also possible to differentiate the loss function with respect to the input $x_i$.&lt;/p&gt;

&lt;p&gt;In order to attack a neural network, we can just flip the original optimization objective on its head. In an untargeted attack, we wish to reduce the overall accuracy of the network and can simply maximize the same loss function. In a targeted attack, we wish to convince the network that an input image belongs to an arbitrary class of our choosing. We also add a constraint bounding the maximum norm of the perturbation, and our optimization problem for a simple untargeted attack becomes:&lt;/p&gt;

\[\arg \max_{\delta} \quad l(x + \delta, y; \theta)\\
\text{subject to} \quad \|\delta\|_p &amp;lt; \epsilon\]
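
&lt;p&gt;As promised, the single-step FGSM solution to this problem fits in a few lines of PyTorch; the model and a labeled batch (x, y) are assumed to be given, and epsilon is an arbitrary illustrative budget.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def fgsm(model, x, y, epsilon=8 / 255):
    # One gradient step in the direction that increases the loss, with the
    # perturbation bounded in L-infinity norm by epsilon.
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
&lt;/code&gt;&lt;/pre&gt;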

&lt;p&gt;A key assumption of many adversarial attacks is that a small perturbation does not affect the semantic content of the input, which generally holds in the domain of natural images. In the domain of natural language, however, it has been well documented by comedians on the internet that changing a single word or even punctuation mark can drastically alter the meaning of a sentence.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/comma.jpg&quot; alt=&quot;(knowyourmeme.com)&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;(knowyourmeme.com)&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;There are a wide variety of algorithms designed specifically for this inverted optimization problem, and they work quite well. In both of these scenarios, we assume full knowledge of both the architecture and the weights of the neural network. These are known as “whitebox” attacks, as opposed to “blackbox” attacks. Due to the transferability of adversarial examples between neural networks, however, it is possible to create adversarial attacks even without knowledge of the architecture and weights by training a similar architecture on a similar dataset. Surprisingly, whitebox attacks on this voodoo doll network transfer well to the original blackbox target.&lt;/p&gt;

&lt;h1 id=&quot;methods&quot;&gt;Methods&lt;/h1&gt;

&lt;p&gt;In any case, most previously published results deal with the case of attacking a fixed input. Since the adversarial perturbations are computed with respect to a particular input, they lose their efficacy when applied to other inputs or when slightly transformed. Is it at all possible, then, to create adversarial perturbations that work across a variety of transformations? It turns out that the naive approach of just optimizing the average adversarial objective function across transformations works quite well. Our untargeted attack from above becomes:&lt;/p&gt;

\[\arg \max_{\delta} \quad \mathbb{E}_{t \sim T, x \sim X} [l(t(x + \delta), y; \theta)]\\
\text{subject to} \quad \|\delta\|_p &amp;lt; \epsilon\]

&lt;p&gt;for $T$ the distribution of transformations and $X$ the distribution of inputs. In practice the expectation is approximated by an average over a random sample of transformations and inputs.&lt;/p&gt;

&lt;p&gt;This is the main insight behind the paper &lt;a href=&quot;https://arxiv.org/pdf/1712.09665.pdf&quot;&gt;Adversarial Patch&lt;/a&gt; which uses the above optimization framework to train a single adversarial patch which can fool a classifier with high probability when applied on top of arbitrary images, under arbitrary transformations. Given a patch $p$ of some fixed size and an image $x$, they define a patch application operator $A(p, x, t)$ which applies the patch to $x$ after rotating, scaling, and translating by transformation parameter $t$. If we implement the rotation, scaling, and translation through a single affine transformation, the entire operator $A$ is differentiable with respect to the patch $p$.&lt;/p&gt;

&lt;p&gt;Replacing the generic loss function above with the log probability of our target class, and the additive noise with our patch application, we arrive at our final optimization objective, which we also optimize with SGD:&lt;/p&gt;

\[\arg \max_{p} \quad \mathbb{E}_{t \sim T, x \sim X} [\log \text{Pr}(\hat{y} \mid A(p, x, t); \theta)]\]
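
&lt;p&gt;Below is a condensed sketch of this optimization in PyTorch. It is not my actual training code (that is linked at the end of the post): the patch is pasted without rotation for brevity, and the model, data loader, and learning rate are stand-ins.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn.functional as F

def apply_patch(x, patch, scale, pos):
    # Simplified differentiable patch application: resize the patch and paste
    # it at location pos. The full A(p, x, t) also rotates the patch, with
    # rotation, scaling, and translation composed into one affine transform
    # (e.g. via F.affine_grid and F.grid_sample) so that the operator stays
    # differentiable with respect to the patch p.
    size = int(scale * x.shape[-1])
    p = F.interpolate(patch.unsqueeze(0), size=(size, size), mode='bilinear')
    out = x.clone()
    r, c = pos
    out[:, :, r:r + size, c:c + size] = p
    return out

model.eval()  # fixed pretrained classifier, e.g. a torchvision ResNet50
patch = torch.rand(3, 64, 64, requires_grad=True)
opt = torch.optim.SGD([patch], lr=1.0)
target_class = 859  # 'toaster' in the standard ImageNet-1k class ordering

for x, _ in loader:  # minibatches of natural images
    scale = float(torch.empty(1).uniform_(0.1, 0.9))
    size = int(scale * x.shape[-1])
    r = int(torch.randint(0, x.shape[-2] - size + 1, (1,)))
    c = int(torch.randint(0, x.shape[-1] - size + 1, (1,)))
    logits = model(apply_patch(x, patch.clamp(0, 1), scale, (r, c)))
    target = torch.full((x.shape[0],), target_class, dtype=torch.long)
    # Maximizing log Pr(target | patched image) is equivalent to minimizing
    # the cross-entropy against the target class.
    loss = F.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
&lt;/code&gt;&lt;/pre&gt;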

&lt;h1 id=&quot;results&quot;&gt;Results&lt;/h1&gt;

&lt;p&gt;I re-implemented the Adversarial Patch attack in PyTorch and trained each patch for 500 SGD iterations at a minibatch size of 16, with the transformation angle in degrees sampled from [-22.5, 22.5], the scale factor sampled from [0.1, 0.9], and the location sampled uniformly at random from all locations where the entire patch remains in the image frame, i.e. no part is clipped. With the target class set to toaster, running the attack on a pretrained ResNet50 neural network gives us the following patch:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/baseline_patch.png&quot; alt=&quot;&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;Our patch looks somewhat like a toaster, though perhaps this is just a coincidence. Repeating the attack with target classes of dalmatian and rattlesnake, we confirm that each patch does indeed bear an abstract resemblance to its target class. In the original paper, the authors conjecture that the adversarial patch attack works by essentially computing a very high saliency version of the underlying target class which drowns out the original class in the image, and our results here are consistent with this hypothesis.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/baseline_patch_dog.png&quot; alt=&quot;Dalmatian patch&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;Dalmatian patch&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/baseline_patch_snake.png&quot; alt=&quot;Rattlesnake patch&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;Rattlesnake patch&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;We can also measure the attack success rate against the size of the patch&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/baseline_graph.png&quot; alt=&quot;&quot; style=&quot;width: 70% ; margin-bottom: 1rem&quot; /&gt;
    &lt;figcaption&gt;&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;Quite intuitively, we can see that as we reduce the size of the patch it becomes more and more difficult to pull off a successful attack. We can mitigate this somewhat by shifting the range of sampled scale factors down to focus on smaller scales, which are also more realistic. Sampling the scale instead from the range 0.05 to 0.5 (which corresponds to 2% to 20% of the image area) indeed gives us improved performance at smaller sizes, but interestingly also costs some performance at larger sizes which don’t appear during training.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/baseline_small_graph.png&quot; alt=&quot;&quot; style=&quot;width: 70% ; margin-bottom: 1rem&quot; /&gt;
    &lt;figcaption&gt;&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;At this point I decided to experiment with adding a regularization term which penalizes large differences between neighboring pixels of the patch, to see if it would produce less noisy patches. I used a variant of total variation:&lt;/p&gt;

\[v_{\text{aniso}}(y) = \sum_{i, j}\|y_{i+1, j} - y_{i,j}\| + \|y_{i, j+1} - y_{i,j}\|\]
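
&lt;p&gt;A direct PyTorch translation of this penalty for a (channels, height, width) patch tensor, continuing the sketch above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def total_variation(patch):
    # Anisotropic total variation: absolute differences between vertically and
    # horizontally adjacent pixels, summed over the whole patch.
    dh = (patch[:, 1:, :] - patch[:, :-1, :]).abs().sum()
    dw = (patch[:, :, 1:] - patch[:, :, :-1]).abs().sum()
    return dh + dw
&lt;/code&gt;&lt;/pre&gt;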

&lt;p&gt;Since the sum of distances between all pairs of neighboring pixels ends up being quite large, I then scaled the total variation term down by 0.0001 before adding it to the previous loss. Re-running the original baseline with this new total variation term gives us the following patch:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/tv_patch.png&quot; alt=&quot;&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;The total variation regularized patch is much smoother and more strongly resembles a real toaster. Total variation regularized patches with target classes of dalmatian and rattlesnake also realistically resemble their respective classes:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/tv_patch_dog.png&quot; alt=&quot;TV-regularized dalmatian patch&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;TV-regularized dalmatian patch&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/tv_patch_snake.png&quot; alt=&quot;TV-regularized rattlesnake patch&quot; style=&quot;width:80% ; margin-bottom: 1rem;&quot; /&gt;
    &lt;figcaption&gt;TV-regularized rattlesnake patch&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;We also see a dramatic increase in attack success rate at 5% of image area, from 2% to 46%, and at 10% of image area, from 67% to 84%. I suspect this is because the baseline patch’s noisy structure gets destroyed when scaled down to 5-10% of the image, whereas the total variation regularized patch avoids this and thus fares better at those sizes.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  
    &lt;img src=&quot;/assets/2019-01-17-adversarial-patches/tv_graph.png&quot; alt=&quot;&quot; style=&quot;width: 70% ; margin-bottom: 1rem&quot; /&gt;
    &lt;figcaption&gt;&lt;/figcaption&gt;
  
&lt;/figure&gt;

&lt;p&gt;To me, the most surprising part of these results is how well the adversarial patch works. With a batch size of 16 and 500 training iterations, each patch sees just 8,000 images. Yet at just 5% of the image area, the patch fools the classifier over 40% of the time! I ran some additional experiments which train for more iterations and anneal the learning rate, and found further, albeit minor, improvements. You can find the code used in these experiments at &lt;a href=&quot;http://github.com/normster/adversarial_patch_pytorch&quot;&gt;http://github.com/normster/adversarial_patch_pytorch&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;In contrast, the failure modes of human vision (more commonly known as &lt;a href=&quot;https://en.wikipedia.org/wiki/Optical_illusion&quot;&gt;optical illusions&lt;/a&gt;) have been more extensively studied and understood. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;If an input image already belongs to the attack target class we technically shouldn’t count these in our success rate, but since ImageNet consists of 1000 more or less balanced classes this represents a less than 0.1% inflation to our success rates. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Thu, 17 Jan 2019 00:00:00 -0800</pubDate>
        <link>https://www.normanmu.com/2019/01/17/adversarial-patches.html</link>
        <guid isPermaLink="true">https://www.normanmu.com/2019/01/17/adversarial-patches.html</guid>
        
        
      </item>
    
  </channel>
</rss>
