| Author: | Chen, Jinyu |
| Title: | Towards elastic, robust and privacy-preserving AI model serving |
| Advisors: | Xu, Wenchao (COMP) ; Guo, Song (COMP) ; Xiao, Bin (COMP) |
| Degree: | Ph.D. |
| Year: | 2025 |
| Subject: | Artificial intelligence -- Data processing ; Machine learning ; Computer security ; Hong Kong Polytechnic University -- Dissertations |
| Department: | Department of Computing |
| Pages: | xvi, 130 pages : color illustrations |
| Language: | English |
| Abstract: | AI model serving has become a cornerstone of intelligent applications, transforming industries and enhancing daily life through AI-driven services. The emergence of foundation models, such as GPT and Vision Transformers, has revolutionized AI services across diverse domains. These models, with billions of parameters, exhibit remarkable generalization capabilities but introduce substantial computational and deployment challenges, underscoring the need for efficient serving strategies to enable real-world adoption. However, modern AI model serving systems face several critical challenges. First, the rapid growth in model size and complexity incurs significant inference overhead, demanding extensive computational resources and memory bandwidth. Second, the dynamic and unpredictable query loads of AI services lead to severe latency fluctuations and resource contention. Third, user requirements vary significantly in accuracy and response time, calling for serving solutions that adaptively balance efficiency and quality. Additionally, privacy concerns arise when deploying AI models in edge environments, where user data cannot be transmitted directly to centralized servers. To address these challenges, this thesis investigates techniques that enhance elasticity, robustness, and privacy preservation in AI model serving. First, we develop the first elastic serving system specifically designed for Transformer models. Conventional approaches pre-train multiple model variants of different sizes to accommodate diverse service requirements, incurring prohibitive I/O delays and excessive training costs. Instead, we propose a lightweight token adaptation mechanism for elastic Transformer serving: it dynamically adds prompting tokens to improve accuracy and prunes redundant tokens to accelerate inference, thereby making the serving system elastic. 
To further improve serving throughput, our framework integrates an application-aware selective batching strategy and an online token adaptation algorithm, which dynamically adjusts the token allocation scheme in real time. Experimental results demonstrate that our method significantly enhances serving throughput while maintaining high accuracy. Second, while token reduction techniques effectively accelerate inference by dynamically removing redundant tokens, they often introduce unpredictable accuracy degradation under varying reduction ratios, compromising service robustness. To address this challenge, we introduce Prodigy, an elastic and robust Transformer serving system based on token-reduction warm-up. The core idea is to pre-train multiple warmed-up models at different token reduction levels, leveraging the insight that fine-tuning with token reduction significantly enhances inference accuracy. Instead of fine-tuning models for every possible reduction setting, we develop a strategic fine-tuning planner and a model ensemble method that enable robust inference across a wide range of reduction ratios with high efficiency. These approaches substantially improve service quality while reducing the computational and storage costs for fine-tuning. Third, to enable privacy-preserving optimization, we propose a fast multimodal edge inference framework with a selective feature distillation method. Our method selectively distills knowledge from a pre-trained model in the cloud by uploading only feature representations for public data selection, effectively preventing user data leakage. Additionally, we introduce a privacy-preserving feature clustering mechanism that transmits only prototype-based representations of local features, further enhancing security. To accommodate varying communication bandwidths, we design an adaptive feature compression module that efficiently reduces transmission costs. 
Experimental results demonstrate that the proposed framework ensures strong privacy protection, optimizes resource utilization, and maintains high inference accuracy. In summary, this thesis presents a set of innovative techniques to improve the elasticity, robustness, and privacy preservation of AI model serving. Through extensive experiments and evaluations, we demonstrate that the proposed methods significantly enhance serving system performance across diverse real-world scenarios. These contributions pave the way for future advances in scalable AI model deployment, ultimately fostering more intelligent, efficient, and trustworthy AI services for society. |
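The token adaptation mechanism described in the abstract can be illustrated with a minimal NumPy sketch: prepend a few prompt tokens to raise accuracy, then drop the least-salient input tokens to cut inference cost. The function name, the use of embedding norms as a salience proxy, and all shapes are illustrative assumptions, not the thesis's actual implementation (a real system would score tokens with attention weights inside the Transformer).

```python
import numpy as np

def token_adapt(tokens, prompt, keep_ratio):
    """Toy token adaptation for elastic Transformer serving.

    Prepends `prompt` tokens (accuracy knob) and keeps only the
    `keep_ratio` fraction of the most salient input tokens (latency
    knob). Salience here is the L2 norm of each token embedding, a
    hypothetical stand-in for attention-based importance scores.
    """
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    salience = np.linalg.norm(tokens, axis=1)
    # Take the top-n_keep indices, then restore their original order
    # so the token sequence stays positionally consistent.
    keep_idx = np.sort(np.argsort(salience)[::-1][:n_keep])
    return np.concatenate([prompt, tokens[keep_idx]], axis=0)
```

Sliding `keep_ratio` between 0 and 1 trades accuracy for speed at serving time without loading a differently sized model, which is the elasticity the abstract refers to.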
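The token-reduction warm-up idea behind Prodigy can be sketched in a similar spirit: given a few models fine-tuned ("warmed up") at fixed reduction ratios, the server ensembles the logits of those trained nearest the requested ratio instead of fine-tuning at every possible setting. The inverse-distance weighting below is an assumed stand-in for the thesis's model ensemble method, not its actual formula.

```python
import numpy as np

def ensemble_logits(logits_by_ratio, target, n=2):
    """Toy ensemble for Prodigy-style robust serving.

    `logits_by_ratio` maps each warmed-up model's fine-tuning
    token-reduction ratio to the logits it produced for the current
    query. The n models fine-tuned closest to the requested `target`
    ratio are combined, weighted by proximity (hypothetical scheme).
    """
    ratios = sorted(logits_by_ratio, key=lambda r: abs(r - target))[:n]
    w = np.array([1.0 / (abs(r - target) + 1e-6) for r in ratios])
    w /= w.sum()  # normalize weights to sum to 1
    return sum(wi * logits_by_ratio[r] for wi, r in zip(w, ratios))
```

Because only a handful of ratios need fine-tuning, the planner's storage and training costs stay bounded while any intermediate reduction ratio can still be served.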
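The privacy-preserving feature clustering mechanism can be illustrated with plain k-means: only k centroids (prototypes) leave the device, never per-sample feature vectors. Using k-means itself is an assumption made for this sketch; the abstract does not specify the clustering algorithm, and the function name is hypothetical.

```python
import numpy as np

def feature_prototypes(features, k, iters=20, seed=0):
    """Toy prototype-based privacy mechanism for edge inference.

    Clusters local feature vectors with k-means (assumed algorithm)
    and returns only the k centroids, so raw per-sample features are
    never transmitted to the cloud.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from k random local features (copied).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest centroid.
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned features.
        for j in range(k):
            pts = features[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers
```

Transmitting k prototypes instead of all features also cuts upload volume by roughly len(features)/k, which complements the adaptive feature compression module the abstract mentions.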
| Rights: | All rights reserved |
| Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/14150

