| Author: | Dai, Xuelong |
| Title: | Adversarial robustness with diffusion models |
| Advisors: | Xiao, Bin (COMP) |
| Degree: | Ph.D. |
| Year: | 2025 |
| Department: | Department of Computing |
| Pages: | xvii, 124 pages : color illustrations |
| Language: | English |
| Abstract: | Artificial Intelligence (AI) and Deep Learning (DL) have experienced rapid development and widespread industry deployment in recent years. Among the application areas of deep learning, Computer Vision (CV) stands out as one of the most advanced. DL models have achieved performance comparable to human experts across a range of 2D and 3D tasks. However, adversarial attacks pose a significant threat to the further application of DL-based CV techniques. These attacks add small perturbations to input data that do not affect human classification but lead to high-confidence misclassification by the target deep learning network. This challenge highlights the urgent need to evaluate and enhance the adversarial robustness of deep learning models. Diffusion models, a recently proposed class of generative models known for their outstanding performance, have made a significant impact due to their impressive data generation capabilities and user-friendly interfaces. Beyond their excellent generative performance, these models have demonstrated the ability to conduct high-quality adversarial attacks by generating adversarial data, posing a new threat to the security of deep learning models. Consequently, it is important to investigate the attack capabilities of diffusion models under various threat scenarios and to explore strategies for enhancing adversarial robustness against attacks driven by these models.
Firstly, we observe that current adversarial attacks utilizing diffusion models typically employ PGD-like gradients to guide the creation of adversarial examples. However, the generation process of diffusion models should adhere strictly to the learned diffusion process. As a result, these attacks often produce low-quality adversarial examples with limited effectiveness. To address these issues, we introduce AdvDiff, a theoretically provable adversarial attack method that leverages diffusion models. We develop two novel adversarial guidance techniques to sample adversarial examples by following the trained reverse generation process of diffusion models. These guidance techniques are effective and stable, enabling the generation of high-quality, realistic adversarial examples by integrating the gradients of the target classifier in an interpretable manner. Experimental results on the MNIST and ImageNet datasets demonstrate that AdvDiff excels at generating unrestricted adversarial examples, surpassing state-of-the-art unrestricted adversarial attack methods in both attack performance and generation quality.
Secondly, we note that in no-box adversarial scenarios, where the attacker lacks access to both the training dataset and the target model, the performance of existing attack methods is significantly hindered by limited data access and poor inference from the substitute model. To overcome these challenges, we propose a no-box adversarial attack method that leverages the generative and adversarial capabilities of diffusion models. Specifically, our approach generates a synthetic dataset with diffusion models to train a substitute model. We then fine-tune this substitute model using a classification diffusion model, taking model uncertainty into account and incorporating noise augmentation. Finally, we generate adversarial examples from the diffusion models by averaging approximations over the diffusion substitute model with multiple inferences. Extensive experiments on the ImageNet dataset demonstrate that our proposed attack method achieves state-of-the-art performance in both no-box and black-box attack scenarios.
Thirdly, we find that existing adversarial research on 3D point cloud models predominantly focuses on white-box scenarios and struggles to achieve successful transfer attacks on recently developed 3D deep learning models. Moreover, the adversarial perturbations in current 3D attacks often cause noticeable shifts in point coordinates, resulting in unrealistic adversarial examples. To address these challenges, we propose a high-quality adversarial point cloud shape completion method that leverages the generative capabilities of 3D diffusion models. Using partial points as prior knowledge, we generate realistic adversarial examples through shape completion with adversarial guidance. To enhance attack transferability, we exploit the characteristics of 3D point clouds and use model uncertainty, obtained through random down-sampling of point clouds, to improve classification inference. We employ ensemble adversarial guidance to improve transferability across different network architectures. To maintain generation quality, we restrict the adversarial guidance to the critical points of the point clouds, identified by calculating saliency scores. Extensive experiments demonstrate that our proposed attacks outperform state-of-the-art adversarial attack methods against both black-box models and defenses. Our black-box attack establishes a new baseline for evaluating the robustness of various 3D point cloud classification models.
Fourthly, we notice that while current diffusion-based adversarial purification methods offer effective and practical defense against adversarial attacks, they suffer from low time efficiency and limited performance against recently developed unrestricted adversarial attacks. To address these issues, we propose an effective and efficient diffusion-based adversarial purification method that counters both perturbation-based and unrestricted adversarial attacks. Our defense is inspired by the observation that adversarial examples typically lie near the decision boundary and are sensitive to pixel changes. We therefore introduce adversarial anti-aliasing to mitigate adversarial modifications, and propose adversarial super-resolution, which uses prior knowledge from clean datasets to benignly recover images. These approaches require no additional training and are computationally efficient, as they involve no gradient calculations. Extensive experiments against both perturbation-based and unrestricted adversarial methods demonstrate that our defense outperforms state-of-the-art adversarial purification techniques. |
| Rights: | All rights reserved |
| Access: | open access |
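For readers unfamiliar with classifier-guided sampling, the sketch below illustrates the general idea behind the diffusion-based attacks described in the abstract: each DDPM reverse step is nudged by the gradient of the target classifier's log-probability for an adversarial label. The `denoiser` and `classifier` modules, the function name, and the guidance term are illustrative assumptions, not the thesis's AdvDiff formulation, whose guidance terms are derived differently.

```python
# Minimal sketch: classifier-guided adversarial sampling with a DDPM-style
# reverse process. Plain classifier guidance toward an adversarial label is
# used here as a stand-in for the thesis's adversarial guidance.
import torch
import torch.nn.functional as F


def adversarial_guided_sampling(denoiser, classifier, y_adv, betas, shape, scale=1.0):
    """Sample images while nudging each reverse step toward the adversarial label y_adv."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        with torch.no_grad():
            eps = denoiser(x, t_batch)                       # predicted noise
            # Standard DDPM posterior mean computed from the predicted noise.
            mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        # Adversarial guidance: gradient of log p(y_adv | x_t) w.r.t. x_t shifts the mean.
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            log_prob = F.log_softmax(classifier(x_in), dim=-1)
            selected = log_prob[torch.arange(shape[0]), y_adv].sum()
            grad = torch.autograd.grad(selected, x_in)[0]
        mean = mean + scale * betas[t] * grad
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x


# Hypothetical usage with a linear noise schedule:
# x_adv = adversarial_guided_sampling(denoiser, classifier, y_adv,
#                                     betas=torch.linspace(1e-4, 0.02, 1000),
#                                     shape=(1, 3, 32, 32))
```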
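The purification defense in the final contribution can be pictured as a training-free, gradient-free preprocessing step. Below is a minimal sketch, assuming batched images in [0, 1]; anti-aliased downsampling stands in for the adversarial anti-aliasing step, and plain bicubic upsampling stands in for the adversarial super-resolution model that the thesis builds from clean-data priors.

```python
# Minimal sketch of a purification pipeline in the spirit of the abstract:
# low-pass the input to suppress adversarial perturbations, then restore the
# original resolution. The restoration here is simple bicubic upsampling, not
# the thesis's adversarial super-resolution model.
import torch
import torch.nn.functional as F


def purify(x, downscale=0.5):
    """Purify a batch of images x with shape (N, C, H, W), values in [0, 1]."""
    h, w = x.shape[-2:]
    with torch.no_grad():
        # Adversarial anti-aliasing: anti-aliased downsampling removes the
        # high-frequency perturbations that adversarial examples rely on.
        low = F.interpolate(x, scale_factor=downscale, mode="bilinear",
                            antialias=True, align_corners=False)
        # Recovery step: upsample back to the original resolution.
        restored = F.interpolate(low, size=(h, w), mode="bicubic",
                                 align_corners=False)
    return restored.clamp(0.0, 1.0)
```

Because no gradients are computed and no model is trained, this preprocessing adds negligible overhead before the purified image is passed to the target classifier, which is the efficiency advantage the abstract contrasts with diffusion-based purification.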
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13958

