Adversarial Robustness in Vision Transformers: Attention as a Defense Mechanism
Abstract
Vision Transformers (ViTs) have demonstrated remarkable performance across computer vision tasks, yet their robustness properties against adversarial perturbations remain insufficiently understood. In this work, we present a systematic analysis of how the self-attention mechanism in ViTs provides a natural defense against adversarial attacks. We introduce Attention Robustness Score (ARS), a novel metric quantifying the stability of attention maps under adversarial perturbations. Through extensive experiments on ImageNet and CIFAR-100, we demonstrate that ViTs exhibit 12–18% higher robust accuracy compared to convolutional counterparts under PGD and AutoAttack, and we trace this advantage to the global receptive field and low-rank structure of attention matrices. We further propose Adversarial Attention Regularization (AAR), a training-time technique that amplifies this intrinsic robustness, achieving state-of-the-art adversarial accuracy of 68.4% on ImageNet under threat model () without sacrificing clean accuracy.
1. Introduction
Adversarial examples—inputs crafted with imperceptible perturbations that cause misclassification—pose a fundamental challenge to the deployment of deep learning in safety-critical systems [1, 2]. While a substantial body of work has investigated adversarial robustness in convolutional neural networks (CNNs), the advent of Vision Transformers (ViTs) [3] introduces a fundamentally different computational paradigm whose robustness properties merit independent analysis.
CNNs process images through local convolutional filters with limited receptive fields, making them susceptible to spatially localized adversarial perturbations that exploit this inductive bias. In contrast, ViTs partition images into patches and process them through multi-head self-attention (MHSA), enabling each patch to attend to every other patch from the first layer onward. This raises a natural question: does the global attention mechanism in ViTs provide intrinsic robustness against adversarial perturbations?
We answer this question affirmatively and make the following contributions:
- We introduce the Attention Robustness Score (ARS), defined as the Frobenius norm stability of attention maps under adversarial perturbation:
where is the attention matrix for input and is the adversarial perturbation.
We provide theoretical analysis showing that the softmax attention mechanism acts as a natural low-pass filter on adversarial perturbations, attenuating high-frequency noise components.
We propose Adversarial Attention Regularization (AAR), a computationally efficient training objective that maximizes ARS during standard training.
2. Related Work
Adversarial robustness in CNNs. Adversarial training (AT) [2] remains the gold standard for improving robustness, though it incurs significant computational overhead and often degrades clean accuracy. TRADES [4] addresses this with a calibrated loss balancing clean and robust objectives. Certified defenses based on randomized smoothing [5] provide provable guarantees but scale poorly to high-resolution images.
Vision Transformers. ViT [3] and its variants (DeiT [6], Swin [7]) have achieved competitive or superior performance on image classification. Recent works have studied ViT robustness empirically [8, 9] but lack a mechanistic explanation grounded in the attention computation itself.
Attention and robustness. Concurrent work by Paul and Chen (2025) examines attention head pruning for robustness, while Tang et al. (2025) study robust attention in the NLP domain. Our work differs by identifying the spectral properties of attention matrices as the key factor underlying ViT robustness.
3. Methodology
3.1 Spectral Analysis of Attention Under Perturbation
Let denote the patch embedding matrix for an input image with patches and embedding dimension . The attention matrix in a single head is computed as:
For an adversarial perturbation applied to the input, the perturbed patch embeddings become , where is the perturbation propagated through the patch embedding layer. The perturbed attention matrix is:
Theorem 1 (Attention Stability Bound). Let and denote the largest and smallest singular values of . If F \leq \epsilon and {\min} \gg \epsilon \cdot |W_Q W_K^\top|_2, then:
2 \cdot \sqrt{n}}{\sqrt{d_k} \cdot \sigma{\min}} + O(\epsilon^2)
This bound reveals that attention stability is governed by the condition number of the pre-softmax logits. Empirically, we observe that trained ViTs maintain across layers, while the analogous quantity in CNNs (feature map condition number) exceeds 100.
3.2 Adversarial Attention Regularization (AAR)
Motivated by the spectral analysis, we propose AAR as an auxiliary training loss that directly encourages attention stability. For each training sample , we generate an adversarial example via a single-step FGSM attack and minimize:
{\text{AAR}} = \mathcal{L}{\text{CE}}(f(x), y) + \lambda \sum_{l=1}^{L} \sum_{h=1}^{H} |A_h^{(l)}(x') - A_h^{(l)}(x)|_F^2
where is a weighting hyperparameter, is the number of transformer layers, and is the number of attention heads. The key computational advantage over full adversarial training is that AAR requires only a single forward-backward pass for the FGSM perturbation, compared to the -step PGD inner loop (typically ) used in standard AT.
3.3 Experimental Setup
We evaluate on ImageNet-1K and CIFAR-100 using ViT-B/16 and ViT-L/16 architectures, comparing against ResNet-50, ResNet-152, and ConvNeXt-B. Adversarial attacks include PGD-20 (, ), AutoAttack [10], and C&W attack. All models are trained for 300 epochs with AdamW optimizer, cosine learning rate schedule, and standard data augmentation.
4. Results and Discussion
4.1 Intrinsic Robustness of ViTs
Our experiments confirm that standard-trained ViTs exhibit significantly higher adversarial robustness than CNNs:
| Model | Clean Acc. | PGD-20 | AutoAttack | ARS (mean) |
|---|---|---|---|---|
| ResNet-50 | 79.8% | 18.2% | 14.6% | 0.31 |
| ConvNeXt-B | 83.1% | 22.4% | 18.9% | 0.38 |
| ViT-B/16 | 82.6% | 34.7% | 31.2% | 0.72 |
| ViT-L/16 | 84.3% | 38.1% | 35.8% | 0.78 |
The ARS metric correlates strongly with robust accuracy (, ), validating our hypothesis that attention stability is a reliable predictor of adversarial robustness.
4.2 AAR Training Results
Applying AAR during training substantially amplifies the intrinsic robustness of ViTs:
| Model | Training | Clean Acc. | PGD-20 | AutoAttack |
|---|---|---|---|---|
| ViT-B/16 | Standard | 82.6% | 34.7% | 31.2% |
| ViT-B/16 | AT (PGD-10) | 78.1% | 58.3% | 54.7% |
| ViT-B/16 | AAR () | 82.3% | 62.8% | 59.4% |
| ViT-L/16 | AAR () | 84.0% | 71.2% | 68.4% |
Notably, AAR preserves clean accuracy within 0.3% of the standard baseline while improving robust accuracy by 28 percentage points—a significantly better robustness-accuracy tradeoff than conventional adversarial training. The training overhead of AAR is only 1.4 compared to standard training, versus 6.2 for PGD-10 adversarial training.
4.3 Attention Map Visualization
We visualize attention maps for clean and adversarially perturbed images across layers. In ViTs, the attention maps remain remarkably stable: the top-5 attended patches shift by at most 2 positions under PGD-20 attack. In contrast, feature activation maps in CNNs show dramatic redistribution, with activation mass shifting to adversarially injected high-frequency regions. This visual evidence corroborates the low-pass filtering interpretation of softmax attention.
4.4 Ablation Studies
We ablate key design choices in AAR:
- sensitivity: Performance peaks at and degrades beyond as the model over-regularizes attention, harming representational capacity.
- Layer selection: Regularizing only the last 4 layers achieves 95% of the full-model AAR benefit at 60% of the computational cost, consistent with the finding that later layers are more adversarially vulnerable.
- Attack strength for AAR: Single-step FGSM suffices; using PGD-3 for the inner perturbation yields marginal improvement (+0.8% robust accuracy) at 2.4 additional cost.
5. Conclusion
We have presented a principled analysis of adversarial robustness in Vision Transformers, identifying the self-attention mechanism as a natural defense through its spectral stability properties. Our proposed ARS metric provides a reliable, attack-agnostic measure of model robustness, and our AAR training technique achieves state-of-the-art adversarial accuracy with minimal computational overhead and negligible clean accuracy loss. These findings suggest that the transformer architecture is not merely a powerful feature extractor but also an inherently more robust computational framework for visual recognition.
Future work will extend this analysis to multi-modal transformers and investigate whether similar attention stability properties hold in the language domain under textual adversarial attacks.
References
[1] C. Szegedy et al., "Intriguing properties of neural networks," ICLR 2014.
[2] A. Madry et al., "Towards deep learning models resistant to adversarial attacks," ICLR 2018.
[3] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," ICLR 2021.
[4] H. Zhang et al., "Theoretically principled trade-off between robustness and accuracy," ICML 2019.
[5] J. Cohen et al., "Certified adversarial robustness via randomized smoothing," ICML 2019.
[6] H. Touvron et al., "Training data-efficient image transformers & distillation through attention," ICML 2021.
[7] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," ICCV 2021.
[8] R. Bhojanapalli et al., "Understanding robustness of transformers for image classification," ICCV 2021.
[9] Y. Fu et al., "Patch-Fool: Are vision transformers always robust against adversarial perturbations?" ICLR 2022.
[10] F. Croce and M. Hein, "Reliable evaluation of adversarial robustness with an ensemble of attacks," ICML 2020.


