Author: | Sun, Ying |
Title: | Designing effective preconditioned optimizers for deep neural network training |
Advisors: | Zhang, Lei (COMP) |
Degree: | Ph.D. |
Year: | 2024 |
Subject: | Neural networks (Computer science) ; Machine learning ; Computer algorithms ; Hong Kong Polytechnic University -- Dissertations
Department: | Department of Computing |
Pages: | xviii, 122 pages : color illustrations |
Language: | English |
Abstract: | Designing an effective optimizer for training deep neural networks (DNNs) has been under the research spotlight for the past decades, and one of the most effective ways is to design preconditioned optimizers. In this thesis, we focus on developing effective preconditioned optimizers from the following four aspects: a Hessian-based preconditioned approach, a natural-gradient-based preconditioned approach, a preconditioned-gradient adaptive stepsize approach, and an attention-feature-based preconditioned approach for transformer structures. Accordingly, we present four algorithms: Stochastic Gradient Descent with Partial Hessian (SGD-PH), the Newton-Kronecker Factorized Approximate Curvature (NKFAC) algorithm, the Adaptive learning rate method with a Rotation transformation (AdamR), and the Attention-Feature-based Optimizer (AFOpt). The details of these works are as follows.
SGD-PH: In this work, we propose a compound optimizer that combines a second-order optimizer, which uses a precise partial Hessian matrix to update channel-wise parameters, with the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters. We show that the Hessian blocks associated with channel-wise parameters are diagonal and can be extracted directly and precisely by Hessian-free methods. The proposed method, named SGD with Partial Hessian (SGD-PH), inherits the advantages of both first-order and second-order optimizers: compared with first-order optimizers, it exploits a certain amount of Hessian information to assist optimization, while compared with existing second-order optimizers, it retains the good generalization performance of first-order optimizers. Experiments on image classification tasks demonstrate the effectiveness of the proposed optimizer SGD-PH (sketched after the abstract).
NKFAC: This work presents a Newton-Kronecker factorized approximate curvature (NKFAC) algorithm, which incorporates Newton's iteration method for inverting second-order statistics. Since the Fisher information matrix changes little between adjacent iterations, Newton's iteration can be initialized with the inverse obtained at the previous step and, thanks to its fast local convergence, produces accurate results within a few iterations (sketched after the abstract). This reduces computation time while inheriting the properties of second-order optimizers, enabling practical applications. The proposed algorithm is further enhanced with several useful implementation techniques, resulting in state-of-the-art generalization performance without the need for extensive parameter tuning. The efficacy of NKFAC is demonstrated through experiments on various computer vision tasks.
AdamR: In pursuit of a more favorable regret bound, we propose to integrate a rotation transformation into existing adaptive learning rate algorithms. We employ the widely recognized adaptive learning rate method AdamW as the base optimizer and develop a novel optimizer named AdamR. Each iteration computes the modified update in three steps: first, the gradient is transformed by a rotation; second, the standard Adam step is executed in the rotated space; and finally, the update is rotated back to the original space (sketched after the abstract). Experimental results on image classification, object detection and segmentation demonstrate AdamR's superior performance in accelerating training and improving generalization.
AFOpt: Since the attention module is the most critical component of transformers, in this work we consider the gradient descent step in the attention-matrix space and propose a preconditioned optimizer named AFOpt. By converting the gradient step into the attention space, more information from the attention module can be incorporated into the final descent direction, which assists transformer training. Numerical experiments are conducted to verify the effect of the proposed optimizer.
Overall, we propose four preconditioned optimizers in this thesis. Among them, SGD-PH adopts the Hessian information of normalization layers to assist in training DNNs; NKFAC combines Newton's method and practical implementation techniques with KFAC to improve effectiveness and efficiency; AdamR employs an adaptive stepsize method with a rotation transformation to achieve a lower regret bound; and AFOpt performs the attention-matrix-based gradient step to better train transformers. The improvement in generalization performance across DNN training experiments demonstrates their effectiveness. |
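SGD-PH sketch: The following is a minimal illustrative sketch of the partial-Hessian idea, not the thesis's exact implementation. It assumes `channel_params` are parameters whose Hessian block is diagonal (e.g. the affine weights of normalization layers), so a single Hessian-vector product with the all-ones vector recovers that diagonal exactly; those parameters take a Hessian-preconditioned step while all other parameters take a plain SGD step. Function and argument names are assumptions.

```python
import torch

def sgd_ph_step(loss, channel_params, other_params, lr=0.1, damping=1e-4):
    params = list(channel_params) + list(other_params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g_ch, g_other = grads[:len(channel_params)], grads[len(channel_params):]

    # Hessian-vector product with v = 1, restricted to the channel-wise block;
    # because that block is diagonal, the product equals its diagonal.
    ones = [torch.ones_like(p) for p in channel_params]
    h_diag = torch.autograd.grad(g_ch, channel_params, grad_outputs=ones)

    with torch.no_grad():
        for p, g, h in zip(channel_params, g_ch, h_diag):
            p -= lr * g / (h.abs() + damping)   # Hessian-preconditioned step
        for p, g in zip(other_params, g_other):
            p -= lr * g                         # plain SGD step for the rest
```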
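NKFAC sketch: A hedged sketch of the warm-started Newton iteration for matrix inversion (the classical Newton-Schulz form is assumed here) applied to a Kronecker factor of the Fisher information. Because the factors drift slowly between adjacent optimizer steps, the previous inverse is a good starting point and a few iterations suffice; variable names and the iteration count are assumptions.

```python
import torch

def newton_inverse(a_new, x_prev, num_iters=3):
    """Refine x_prev toward a_new^{-1} via Newton's iteration X <- X (2I - A X)."""
    two_eye = 2.0 * torch.eye(a_new.shape[0], device=a_new.device, dtype=a_new.dtype)
    x = x_prev
    for _ in range(num_iters):
        x = x @ (two_eye - a_new @ x)   # quadratically convergent near A^{-1}
    return x

# a_new: current (damped) second-order statistic of a layer;
# x_prev: its inverse from the previous step, reused as the warm start.
```

The warm-started iteration stands in for the explicit inversion or eigendecomposition used in standard KFAC, which is where the reduction in computation time comes from.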
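AdamR sketch: An illustrative sketch of the three-step update described in the abstract: rotate the gradient, take a standard Adam step in the rotated space, and rotate the result back. The rotation matrix `R` is assumed orthogonal and its construction is not shown; decoupled weight decay would be applied to the original parameters as in AdamW.

```python
import torch

def adamr_update(grad, R, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    g_rot = R @ grad                                    # 1) rotate the gradient
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g_rot
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g_rot ** 2
    m_hat = state["m"] / (1 - betas[0] ** state["t"])   # bias-corrected moments
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    step_rot = lr * m_hat / (v_hat.sqrt() + eps)        # 2) standard Adam step
    return R.t() @ step_rot                             # 3) rotate back

# Usage on a flattened gradient g of dimension d, with R a d x d orthogonal matrix:
#   state = {"m": torch.zeros(d), "v": torch.zeros(d), "t": 0}
#   p.data -= adamr_update(g, R, state)
```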
Rights: | All rights reserved |
Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13232