Full metadata record
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.contributor.advisor | Chen, Changwen (COMP) | en_US
dc.contributor.advisor | Lyu, Mingsong (COMP) | en_US
dc.creator | Liu, Dingbang | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/14152 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | en_US
dc.rights | All rights reserved | en_US
dc.title | Energy efficient computing-in-memory accelerators for deep learning | en_US
dcterms.abstract | The exponential data growth in the information era necessitates a re-evaluation of traditional hardware architectures for data-oriented computing. This is particularly critical in the context of emerging deep learning applications, especially in resource-limited edge systems (1, 2). The limitations of pure software-based platforms in handling such vast datasets have become increasingly apparent. To address this challenge, we require scalable hardware solutions capable of both efficient data storage and processing. Overcoming the memory bottleneck is a key hurdle in designing energy-efficient architectures that can support future big data-driven applications. | en_US
dcterms.abstract | To meet edge-computing demands, it is essential to co-optimize algorithms and specialized circuit systems for hardware acceleration. Techniques such as precision quantization, data sparsity processing, and network fine-tuning reduce computational load and algorithm complexity. For instance, (3) leverages network redundancy to quantize neural networks to 8-bit integers, reducing power consumption and improving performance. However, uniform quantization not only fails to eliminate all redundancy but can also degrade output accuracy. The method in (4) uses mixed-precision quantization to maintain Transformer performance at lower computational cost, but its dependence on a static architecture limits adaptability in dynamic environments. Although (5) proposes a layer-wise fine-grained pruning technique for model sparsification, the approach is limited because it does not incorporate cross-layer sensitivity into the optimization. Consequently, it fails to fully exploit the potential of highly efficient low-bit operations (e.g., 2-bit or 1-bit). | en_US
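To make the uniform quantization discussed above concrete, here is a minimal sketch of symmetric INT8 weight quantization (a generic illustration, not the specific method of (3); the function names are chosen here for clarity):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization of a float tensor to INT8."""
    scale = np.max(np.abs(w)) / 127.0  # map the largest magnitude to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from INT8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.003], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Note how the smallest weight collapses to zero: a single uniform step size across the whole tensor is exactly why uniform quantization leaves redundancy and can degrade accuracy, motivating the mixed-precision approaches discussed above.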
dcterms.abstract | Computing-In-Memory (CIM) accelerators integrate storage and computation, offering a path beyond the scaling limits of Moore's law and the bottleneck of the von Neumann architecture. CIM architectures can be categorized into current-based, charge-based, time-based, and digital-domain implementations. Current-based CIM architectures perform multiply-accumulate (MAC) operations by modulating bit-line currents in accordance with Kirchhoff's current law. While such architectures achieve high density, they are susceptible to non-idealities such as process, voltage, and temperature (PVT) variations and read disturbances. Previous current-based CIM designs have mitigated device nonlinearity and PVT effects through additional peripheral circuitry and limited parallelism (6–8). Charge-domain CIM architectures employ charge-accumulation schemes to improve error tolerance and linearity compared with conventional analog approaches (9–11). However, these architectures remain constrained by parasitic effects, which can compromise overall computational performance (12–14). Time-based CIM architectures use programmable delay units to perform computations and accumulate results through delay lines (15). A key advantage of this approach is its superior sensing margin compared with current- or charge-based alternatives, though this comes at the cost of increased latency due to sequential processing. Digital CIM architectures perform multiplication and dimension reduction directly in digital logic (16). Despite their robustness, the significant overhead of full-precision digital circuits currently limits their energy efficiency. | en_US
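The current-domain MAC described above can be illustrated with a small behavioral model (a hypothetical sketch, not any of the cited macros; `i_unit` and `sigma` are illustrative parameters): each activated cell sinks a current proportional to its stored weight, and the bit line sums these currents per Kirchhoff's current law.

```python
import numpy as np

def bitline_mac(weights, inputs, i_unit=1e-6, sigma=0.0, rng=None):
    """Behavioral model of one current-domain CIM column.

    Each activated cell (input = 1) sinks i_unit * weight amperes;
    the bit line sums all cell currents (Kirchhoff's current law).
    sigma models per-cell device variation as a relative std dev.
    """
    rng = rng or np.random.default_rng(0)
    cell_currents = weights * inputs * i_unit
    if sigma > 0:
        # PVT-like mismatch: each cell's current deviates slightly
        cell_currents = cell_currents * (1 + sigma * rng.standard_normal(len(weights)))
    return cell_currents.sum()  # total bit-line current

w = np.array([1, 0, 1, 1])            # binary weights stored in the column
x = np.array([1, 1, 0, 1])            # word-line activations
i_ideal = bitline_mac(w, x)           # ideal: two conducting cells
i_noisy = bitline_mac(w, x, sigma=0.05)
```

The gap between `i_ideal` and `i_noisy` is the analog error that the peripheral calibration circuitry mentioned above must absorb, which is why current-based designs trade parallelism for robustness.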
dcterms.abstract | Static random access memory (SRAM), a conventional memory technology, is renowned for its high speed, high precision, and noise immunity, making it suitable for accuracy-oriented applications. Resistive random access memory (ReRAM) has also emerged as a promising candidate for power-oriented applications: its non-volatile nature minimizes leakage power, and it provides high-density storage. ReRAM's integration of logic and memory functions within a single device offers significant power and area advantages. Moreover, both memory technologies can perform in-memory computation without relying on external I/O operations, making them potential universal memory solutions for future big-data applications. | en_US
dcterms.abstract | In the first work, a fabricated SRAM-based charge-domain CIM macro is presented. This work first employs a neural architecture search (NAS) method to determine layer-wise optimized precisions and sparsities for convolutional neural networks (CNNs). A 144-Kb charge-domain signed mixed-precision (2/4/8-bit) CIM accelerator is then proposed, employing bootstrapped SRAM cells with a nine-transistor, one-capacitor (9T1C) structure and incorporating a bit-level sparsity-aware analog-to-digital converter (ADC). The design not only achieves highly linear parallel accumulation to meet AI computing demands but also implements a hardware-software co-optimization system tailored to specific data characteristics. It is verified on NAS-optimized VGG-16 and ResNet-18 networks with the CIFAR-10 dataset, achieving a measured accuracy of 68.68% at equivalent 4-bit precision while maintaining a high energy efficiency of 135.19 TOPS/W at 2-bit precision. | en_US
dcterms.abstract | The second work presents a comprehensive hardware-algorithm co-design solution. First, we develop a novel ternary weight splitting (TWS) binarization technique that enables Brain-Floating-Point-16×INT1 (BF16×1-b) Transformers, achieving competitive accuracy while dramatically reducing model size compared with full-precision counterparts. Second, we design a fully digital SRAM-based CIM accelerator that integrates bit-parallel SRAM macros within an efficient group-vector systolic architecture, capable of storing one complete BERT-Tiny column with stationary systolic data reuse. Implemented in 28-nm technology, our design requires only 2 KB of SRAM within a 2 mm² area while delivering 6.55 TOPS throughput at 419.74 mW power consumption, achieving state-of-the-art area efficiency of 3.3 TOPS/mm² and normalized energy efficiency of 20.98 TOPS/W for BERT-Tiny model acceleration. | en_US
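One simple way to realize a ternary-to-binary weight split, sketched here as a generic illustration (the thesis's TWS technique may differ in its scaling and training details), is to decompose each ternary value into two binary values whose average reproduces it:

```python
import numpy as np

def ternary_split(w_t):
    """Split a ternary tensor (values in {-1, 0, +1}) into two binary
    tensors b1, b2 (values in {-1, +1}) with (b1 + b2) / 2 == w_t.

    +1 -> (+1, +1); -1 -> (-1, -1); 0 -> (+1, -1), which cancels.
    """
    b1 = np.where(w_t >= 0, 1, -1)
    b2 = np.where(w_t > 0, 1, -1)
    return b1, b2

w_t = np.array([-1, 0, 1, 0])
b1, b2 = ternary_split(w_t)
```

Because each split tensor is strictly binary, a matrix product with ternary weights becomes two 1-bit matrix products, which maps directly onto the kind of bit-parallel digital CIM macro described above.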
dcterms.abstract | In the last work, a ReRAM-based CNN accelerator is designed. Mixed-bit operations from 1 bit to 8 bits are supported by an effective bitwidth-configuration scheme to implement NAS-optimized layer-wise multi-bit CNNs. In addition, column-parallel readout is achieved with excellent energy efficiency through a variation-reduction accumulation mechanism and low-power readout circuits. We further explore systolic data reuse in a ReRAM-based PE array. Experiments are conducted on NAS-optimized ResNet-18. Benchmarks show that the proposed ReRAM accelerator achieves a peak energy efficiency of 2490.32 TOPS/W for 1-bit operations and an average energy efficiency of 479.37 TOPS/W for 1–8-bit operations when evaluating NAS-optimized multi-bitwidth CNNs. Compared with state-of-the-art works, the proposed accelerator shows at least a 14.18× improvement in energy efficiency. | en_US
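The mixed-bit operation described above can be illustrated with a generic bit-serial MAC sketch (a common simplification, not the thesis's specific bitwidth-configuration circuit): multi-bit inputs are streamed one bit plane at a time, and each 1-bit partial result is shifted by its bit position and accumulated, so the same 1-bit datapath serves any precision from 1 to 8 bits.

```python
import numpy as np

def bit_serial_mac(weights, inputs, n_bits):
    """Mixed-bit MAC via bit-serial accumulation: unsigned n_bits-bit
    inputs are streamed one bit plane at a time; each 1-bit partial
    dot product is shifted by its bit position and accumulated."""
    acc = 0
    for b in range(n_bits):
        bit_plane = (inputs >> b) & 1            # b-th bit of every input
        acc += int(np.dot(weights, bit_plane)) << b
    return acc

w = np.array([3, -2, 1])
x = np.array([5, 6, 7], dtype=np.uint8)          # 3-bit unsigned activations
result = bit_serial_mac(w, x, 3)
assert result == int(np.dot(w, x))               # matches the full-precision MAC
```

Lower-precision layers simply issue fewer bit-plane cycles, which is the source of the energy-efficiency gap between the 1-bit peak and the 1–8-bit average figures quoted above.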
dcterms.abstract | In summary, the proposed work is dedicated to the highly efficient computing of deep learning applications. We deploy hardware-software co-optimization methodologies to alleviate computational complexity while maintaining acceptable performance. Specifically, Neural Architecture Search (NAS) is used to automatically design layer-wise mixed-precision networks, enhancing efficiency and minimizing quantization error. A pruning-based methodology is also deployed to introduce sparsity, further reducing computational complexity and creating opportunities for sparse-aware data processing. Additionally, we explore more extreme binarization of Transformer models to boost computational efficiency while maintaining accuracy. | en_US
dcterms.abstract | These neural network optimizations create additional requirements for circuit design. All three of our proposed accelerators are tailored at the macro and architectural levels to accommodate these pre-optimized networks with high energy efficiency. Finally, the proposed work addresses prominent issues in charge-based and digital Computing-In-Memory (CIM) architectures, specifically concerning signal integrity, PVT effects, parallelism, and computing accuracy. | en_US
dcterms.extent | xviii, 136 pages : color illustrations | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2025 | en_US
dcterms.educationalLevel | Ph.D. | en_US
dcterms.educationalLevel | All Doctorate | en_US
dcterms.LCSH | Deep learning (Machine learning) | en_US
dcterms.LCSH | Computer architecture | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.accessRights | open access | en_US

Files in This Item:
File | Description | Size | Format
8606.pdf | For All Users | 19.42 MB | Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/14152