Author: Lai, Songjiang
Title: Deep learning for human action recognition
Advisors: Lam, Kin Man (EIE)
Degree: M.Sc.
Year: 2022
Subject: Computer vision -- Mathematical models; Human activity recognition; Hong Kong Polytechnic University -- Dissertations
Department: Department of Electronic and Information Engineering
Pages: ix, 52 pages : color illustrations
Language: English
Abstract: With the rapid development and wide popularity of deep learning in recent years, the performance of computer vision tasks has greatly improved. The two-stream neural network model, applied to video-based action recognition, has become a hot research topic. In the traditional two-stream convolutional neural network model for action recognition, the inputs to the two branches are the RGB stream and the optical-flow stream, and the model achieves promising performance on human action recognition. However, the two-stream model is computationally expensive, because computing optical flow from a video sequence is intensive. Furthermore, the inputs to the two streams are different, i.e., RGB and optical flow, and the original two-stream model cannot be trained end to end, which complicates training and limits performance. In this research, we adopt the representation flow algorithm proposed by AJ et al. [1], which is based on the TV-L1 [2] model and is similar to the optical-flow algorithm. We replace the traditional optical-flow branch of the egocentric action recognition model proposed by Swathikiran et al. [3] with a representation-flow branch to make the model end-to-end trainable. This greatly reduces the computational cost and the prediction runtime of the new model. We apply the new two-stream model to egocentric action recognition. Moreover, we apply class attention maps (CAMs) to the RGB stream, so that the model pays more attention to the regions correlated with the activities under consideration, which significantly improves the recognition accuracy. We then apply convLSTM for spatio-temporal encoding of the image features with spatial attention. We train and evaluate the proposed model on three data sets: GTEA61, EGTEA GAZE+ and HMDB [4].
Experimental results show that our proposed model achieves the same recognition accuracy as the original egorcnn model with an optical-flow branch on GTEA61, and outperforms it by 0.65% and 0.84% on EGTEA GAZE+ and HMDB, respectively. In terms of speed, the average runtime of our proposed model is 0.1881 s, 0.1503 s, and 0.1459 s on the GTEA61, EGTEA GAZE+ and HMDB data sets, respectively, while the corresponding runtimes of the original model (including the time for extracting optical flow) are 101.6795 s, 25.3799 s, and 203.9958 s. Finally, we conduct ablation studies and discuss the influence of different parameters on the performance of our proposed model, such as the number of layers for representation flow and the number of blocks in the backbone architecture.
Rights: All rights reserved
Access: restricted access

Files in This Item:
File: 6523.pdf
Description: For All Users (off-campus access for PolyU Staff & Students only)
Size: 1.21 MB
Format: Adobe PDF



Please use this identifier to cite or link to this item: