| Author: | Zhang, Hanhao |
| Title: | Deep neural networks for skeleton-based human action recognition |
| Advisors: | Lam, Kin-man Kenneth (EEE) |
| Degree: | M.Sc. |
| Year: | 2025 |
| Department: | Department of Electrical and Electronic Engineering |
| Pages: | ix, 66 pages : color illustrations |
| Language: | English |
| Abstract: | Skeleton-based action recognition has received significant attention in recent years due to its robustness and efficiency in modelling human motion. The task is to classify human actions using only skeletal data, such as joint coordinates and bone connectivity. Traditional approaches, typically based on Convolutional Neural Networks (CNNs) or Graph Convolutional Networks (GCNs), commonly require multiple iterations to capture global dependencies, which can be computationally expensive and inefficient, especially for real-time or customised applications. To address these limitations, we propose a lightweight Transformer-based architecture called GraphViViT. The model follows the Vision Transformer embedding strategy and fuses multiple data types, including absolute joint positions, absolute joint velocities and static joint connection degrees. We introduce spatial and temporal attention biases based on shortest path distances, Euclidean geometry and temporal windows to enhance the model's ability to capture spatial and temporal dependencies in skeleton data. The model adopts a factorised encoder design that processes spatial and temporal features independently, yielding an efficient and scalable structure (an illustrative sketch of this design follows the metadata table below). We evaluate GraphViViT on the popular NTU RGB+D 60 and NTU RGB+D 120 datasets, where it shows competitive performance. On NTU RGB+D 60, it achieves 87.23% (Cross-Subject) and 92.60% (Cross-View) accuracy, outperforming early GCN-based methods such as ST-GCN. On NTU RGB+D 120, it reaches 82.79% (Cross-Subject) and 85.03% (Cross-Set) accuracy, closely matching state-of-the-art models despite using joint-only data. GraphViViT uses a pure Transformer architecture, avoiding the need for complex graph convolutions or additional modules. It requires only 2.18M parameters and 4.54 GFLOPs for 60-class classification, which is significantly lower than most existing methods. An ablation study confirms the effectiveness of the spatial-temporal attention mechanisms, which improve accuracy by at least 1.74%. Our work demonstrates the potential of pure Transformer architectures for skeleton-based action recognition, emulating the operation of traditional CNNs and GCNs while achieving local-global feature integration. Future directions include incorporating multi-modal data (e.g., bone modality, RGB) and extending the framework to other graph-structured tasks. GraphViViT establishes a robust baseline for resource-constrained applications, balancing accuracy with computational efficiency in real-time action recognition systems. |
| Rights: | All rights reserved |
| Access: | restricted access |
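The abstract describes a factorised spatial-temporal Transformer encoder with additive attention biases. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; all module names, dimensions, the bias handling and the pooling choices are assumptions for illustration and are not taken from the thesis.

```python
# Illustrative sketch (not the thesis code): factorised spatial-temporal
# Transformer encoder with an additive bias on the attention logits.
import torch
import torch.nn as nn


class BiasedSelfAttention(nn.Module):
    """Multi-head self-attention with an optional additive bias on the
    attention logits (e.g. derived from shortest-path distances between joints)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, bias=None):                      # x: (B, N, dim), bias: (N, N)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, -1).transpose(1, 2)  # (B, H, N, d)
        k = k.view(B, N, self.heads, -1).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, N, N)
        if bias is not None:
            attn = attn + bias                             # broadcast over batch and heads
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


class FactorisedEncoder(nn.Module):
    """Spatial attention over joints within each frame, then temporal
    attention over frames, followed by a classification head."""

    def __init__(self, dim=64, classes=60):
        super().__init__()
        self.spatial = BiasedSelfAttention(dim)
        self.temporal = BiasedSelfAttention(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.head = nn.Linear(dim, classes)

    def forward(self, x, spatial_bias=None, temporal_bias=None):
        # x: (B, T, V, dim) token embeddings, assumed already fused upstream
        # from joint positions, velocities and connection degrees.
        B, T, V, D = x.shape
        s = x.reshape(B * T, V, D)                            # attend over joints per frame
        s = s + self.spatial(self.norm1(s), spatial_bias)
        t = s.reshape(B, T, V, D).mean(dim=2)                 # pool joints into frame tokens
        t = t + self.temporal(self.norm2(t), temporal_bias)   # attend over frames
        return self.head(t.mean(dim=1))                       # (B, classes)


if __name__ == "__main__":
    model = FactorisedEncoder()
    logits = model(torch.randn(2, 64, 25, 64))  # 2 clips, 64 frames, 25 joints
    print(logits.shape)                         # torch.Size([2, 60])
```

In this sketch the spatial stage attends only over the joints of a single frame and the temporal stage only over frame-level tokens, which is the factorisation that keeps the cost below full joint spatio-temporal attention; the precise bias construction and fusion scheme in GraphViViT are described in the thesis itself.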
Files in This Item:
| File | Description | Size | Format | Action |
|---|---|---|---|---|
| 8322.pdf | For All Users (off-campus access for PolyU Staff & Students only) | 5.04 MB | Adobe PDF | View/Open |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13914

