Author: Zhou, Qihua
Title: Towards efficient tiny machine learning systems for ubiquitous edge intelligence
Advisors: Guo, Song (COMP)
Degree: Ph.D.
Year: 2023
Subject: Machine learning
Artificial intelligence
Edge computing
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xxii, 189 pages : color illustrations
Language: English
Abstract: Modern machine learning (ML) applications are often deployed in the cloud environment to exploit the computational power of clusters. However, traditional in-cloud computing schemes cannot satisfy the demands of emerging edge intelligence scenarios, including providing personalized models, protecting user privacy, adapting to real-time tasks, and saving resource costs. To overcome the limitations of conventional in-cloud computing, a new trend is to adopt the on-device learning paradigm, which moves the end-to-end ML procedure closer to edge devices. The promising advantages of on-device learning have promoted the rise of Tiny Machine Learning (TinyML) systems, a field that focuses on developing ML algorithms and models for resource-constrained edge devices, e.g., microcontrollers, IoT sensors, and embedded devices. The term "Tiny" highlights the limited processing capacity, memory volume, and energy resources of these devices. As discussed in the research background in §1.1, TinyML has become an important research topic due to the growth of edge intelligence applications, including smart homes, wearables, robotics, and healthcare services. By applying TinyML systems on ubiquitous edge devices, developers and researchers can effectively reduce inference latency, save resource costs, improve user experience, and protect user privacy.
However, implementing a high-performance TinyML system is not easy in practice. We need to dive into the fundamental architecture design and framework implementation from a full-stack system-implementation perspective, including reducing data scale, model complexity, computational overhead, and communication traffic. Aiming at building an efficient TinyML system, we summarize three core challenges of system design and implementation in §1.2. These challenges motivate the design principles of our methodologies, corresponding to the major contributions of this thesis in §1.3. More precisely, after conducting a comprehensive background review of TinyML systems in Chap. 2, we optimize the system design in three aspects: (1) leveraging INT8 quantization-aware training to break computational resource constraints on edge devices in Chap. 3, (2) utilizing hierarchical channel-spatial encoding to alleviate the communication bottleneck during edge-cloud collaboration in Chap. 4, and (3) exploring a patch automatic skip scheme to improve on-device model execution efficiency in Chap. 5.
First, as discussed in Chap. 3, we focus on breaking the constraints of limited resources, alleviating computational overhead, and improving the computational speed of on-device learning. We show that employing 8-bit fixed-point (INT8) quantization in both the forward and backward passes of a deep model is a promising way to enable tiny on-device learning in practice. The key to an efficient quantization-aware training method is to exploit hardware-enabled acceleration while preserving the training quality in each layer. We implement our method in Octo, a lightweight cross-platform system for tiny on-device learning. Experiments on commercial AI chips show that Octo achieves higher training efficiency than state-of-the-art quantization training methods, while delivering substantial processing speedup and memory reduction over full-precision training.
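The core idea of INT8 quantized computation can be illustrated with a minimal sketch: map floating-point operands onto the INT8 range with a per-tensor scale, multiply in the integer domain, and rescale the result. This is a generic symmetric-quantization example for illustration only, not Octo's actual implementation; the function names (`int8_quantize`, `int8_matmul`) are hypothetical.

```python
import numpy as np

np.random.seed(0)

def int8_quantize(x):
    # Symmetric per-tensor quantization: map floats onto [-127, 127].
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    # Quantize both operands, multiply in the integer domain (accumulating
    # in int32 to avoid overflow), then rescale back to floating point.
    qa, sa = int8_quantize(a)
    qb, sb = int8_quantize(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

x = np.random.randn(4, 8).astype(np.float32)  # activations
w = np.random.randn(8, 3).astype(np.float32)  # weights
err = float(np.max(np.abs(int8_matmul(x, w) - x @ w)))
```

The integer matrix multiply is where hardware acceleration pays off: commodity AI chips execute INT8 multiply-accumulate far faster than FP32, while the quantization error (`err` above) stays small for well-scaled tensors.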
Second, as discussed in Chap. 4, we also cover continuous data analytics and video streaming applications. In this setting, improving communication efficiency by reducing traffic size is one of the most crucial issues for realistic deployment. Existing systems mainly compress features at the pixel level and ignore the characteristics of feature structure, which could be further exploited for more efficient compression. In this work, we offer new insights into implementing scalable collaborative learning (CL) systems through hierarchical feature compression, termed Stripe-wise Group Quantization (SGQ). Different from previous unstructured quantization methods, SGQ captures both channel and spatial similarity in pixels, and simultaneously encodes features at these two levels to achieve a much higher compression ratio. Experiments show that SGQ can effectively alleviate the communication bottleneck with much less traffic, while preserving learning accuracy comparable to the original full-precision version. This verifies that SGQ can be applied to a wide spectrum of edge intelligence applications.
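To see why exploiting feature structure helps, consider quantizing a feature map per channel group rather than per pixel: similar channels share one scale, so the transmitted payload is the INT8 tensor plus a handful of scales. This is a simplified grouped-quantization sketch under assumed shapes, not SGQ's actual stripe-wise encoding; `grouped_quantize` and `group_size` are illustrative names.

```python
import numpy as np

np.random.seed(0)

def grouped_quantize(feat, group_size=4):
    # Quantize a (C, H, W) feature map per channel group: channels in a
    # group share one INT8 scale, so only one float per group is sent
    # alongside the int8 payload.
    c, h, w = feat.shape
    assert c % group_size == 0, "channel count must divide into groups"
    groups = feat.reshape(c // group_size, group_size, h, w)
    scales = np.abs(groups).max(axis=(1, 2, 3), keepdims=True) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales

def grouped_dequantize(q, scales, shape):
    # Rescale each group and restore the original (C, H, W) layout.
    return (q.astype(np.float32) * scales).reshape(shape)

feat = np.random.randn(16, 8, 8).astype(np.float32)
q, scales = grouped_quantize(feat)
recon = grouped_dequantize(q, scales, feat.shape)
```

Even this naive grouping cuts the feature payload 4x (INT8 vs. FP32) at a bounded per-element error; SGQ's hierarchical channel-spatial encoding pushes the ratio much further by also exploiting spatial similarity.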
Third, as discussed in Chap. 5, real-time video perception tasks are often challenging on resource-constrained edge devices due to accuracy drop and hardware overhead, where saving computation is the key to performance improvement. Existing methods mainly rely on domain-specific neural chips or previously searched models, which require specialized optimization according to different task properties. These limitations motivate us to design a general and task-independent methodology, called the Patch Automatic Skip Scheme (PASS), which supports diverse video perception settings by decoupling acceleration from tasks. The gist is to capture inter-frame correlations and skip redundant computation at the patch level, where a patch is a non-overlapping square block of the visual input. Experiments show that applying PASS benefits on-device video perception performance, including processing speedup, memory reduction, computation saving, model quality, prediction stability, and environmental adaptation. PASS generalizes to real-time video streams on commodity edge devices, e.g., the NVIDIA Jetson Nano, with efficient performance in realistic deployment.
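The patch-level skipping intuition can be sketched as follows: split consecutive frames into non-overlapping square blocks and recompute only the blocks that changed, reusing cached results elsewhere. This is a minimal inter-frame difference illustration, not PASS's actual skip criterion; the function name `changed_patch_mask` and the threshold `tol` are assumptions for the example.

```python
import numpy as np

def changed_patch_mask(prev, cur, patch=8, tol=0.05):
    # Split both frames into non-overlapping patch x patch blocks and flag
    # blocks whose mean absolute difference exceeds tol. Only flagged
    # patches need recomputation; the rest can reuse cached outputs.
    h, w = cur.shape
    ph, pw = h // patch, w // patch
    diff = np.abs(cur - prev)[:ph * patch, :pw * patch]
    mad = diff.reshape(ph, patch, pw, patch).mean(axis=(1, 3))
    return mad > tol

prev = np.zeros((32, 32), dtype=np.float32)
cur = prev.copy()
cur[8:16, 16:24] = 1.0  # simulate motion inside one 8x8 patch
mask = changed_patch_mask(prev, cur)
```

In a mostly static scene only a few patches are flagged per frame, so the fraction of skipped blocks translates directly into computation and latency savings on the device.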
In summary, TinyML is an emerging technique that paves the last mile of enabling edge intelligence, overcoming the limitations of conventional in-cloud computing, which demands substantial computational capacity and memory. Building an efficient TinyML system requires breaking the constraints of limited resources and alleviating computational overhead. Therefore, this thesis presents a software-hardware synergy for TinyML system implementation. Extensive evaluation on commercial edge devices shows the remarkable performance improvement of our proposed system over existing solutions.
Rights: All rights reserved
Access: open access

Files in This Item:
File: 7148.pdf (for all users) — 9.17 MB, Adobe PDF
