Compliance vulnerabilities and test-time governance in transformers

Dong, Peiran

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Computing	en_US
dc.contributor.advisor	Guo, Jingcai (COMP)	en_US
dc.contributor.advisor	Guo, Song (COMP)	en_US
dc.creator	Dong, Peiran	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/13826	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	en_US
dc.rights	All rights reserved	en_US
dc.title	Compliance vulnerabilities and test-time governance in transformers	en_US
dcterms.abstract	Transformer models have greatly advanced AI applications in areas such as natural language processing and image generation by utilizing their sophisticated architectures for both discriminative and generative tasks. For example, Transformer models trained on large text corpora excel in tasks like semantic analysis and language translation. When integrated into visual models, they also enable text-conditioned image generation.	en_US
dcterms.abstract	However, the increasing deployment of these models has introduced new security risks, particularly compliance vulnerabilities. These vulnerabilities concern whether model outputs continue to meet ethical and regulatory standards, even when faced with malicious attacks. To prevent an AI race that compromises safety and ethical values, it is essential to balance the risks and benefits of deploying AI models.	en_US
dcterms.abstract	This thesis addresses these concerns by focusing on the compliance vulnerabilities of Transformer architectures, particularly backdoor attacks and unsafe content generation. First, we investigate the security risks of backdoor attacks in discriminative models. We introduce a novel backdoor attack method that uses encoding-specific perturbations to trigger malicious behaviors in pre-trained language models. Our research shows that Transformer-based language models can be manipulated to pass off harmful text as benign, allowing it to spread on public platforms undetected. Traditional defenses against backdoor attacks, such as data preprocessing or model fine-tuning, are often expensive. To overcome this, we propose a test-time defense for Vision Transformers (ViTs). By examining output distributions across different ViT blocks, we develop a Directed Term Frequency-Inverse Document Frequency (TF-IDF) based method to detect and classify poisoned inputs effectively. Our approach significantly improves the security and reliability of ViTs against backdoor attacks.	en_US
dcterms.abstract	Generative models with Transformer architectures also face severe compliance risks. Users can generate harmful content, such as violent, infringing, or pornographic material, through text prompts, leading to negative social impacts. To address this, we introduce the PROTORE framework, which ensures safe content generation at test time. This framework employs a "Prototype, Retrieve, and Refine" pipeline to enhance the identification and mitigation of unsafe concepts in generative models. Comprehensive evaluations on various benchmarks demonstrate the effectiveness and scalability of the PROTORE approach in refining generated content.	en_US
dcterms.abstract	In summary, this thesis provides a thorough examination of compliance vulnerabilities in Transformer-based models. Our proposed methodologies and frameworks tackle critical issues in model compliance, laying the groundwork for future research in secure and responsible AI deployment.	en_US
dcterms.extent	xiv, 131 pages : color illustrations	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2025	en_US
dcterms.educationalLevel	Ph.D.	en_US
dcterms.educationalLevel	All Doctorate	en_US
dcterms.LCSH	Artificial intelligence	en_US
dcterms.LCSH	Computer security	en_US
dcterms.LCSH	Cyber intelligence (Computer security)	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.accessRights	open access	en_US

Files in This Item:

File	Description	Size	Format
8336.pdf	For All Users	11.5 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13826