Author: | Xiang, Wangmeng |
Title: | Towards efficient and reliable human activity understanding |
Advisors: | Zhang, Lei (COMP) |
Degree: | Ph.D. |
Year: | 2023 |
Subject: | Computer vision; Image analysis; Motion perception (Vision); Pattern recognition systems; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Computing |
Pages: | xv, 146 pages : color illustrations |
Language: | English |
Abstract: | Human activity understanding has been an active research area due to its wide range of applications, e.g., sports analysis, healthcare, security monitoring, environmental protection, entertainment, self-driving vehicles, and human-computer interaction. Generally speaking, understanding human activities requires answering "who (person re-identification) is doing what (action recognition)". In this thesis, we aim to investigate efficient and reliable methodologies for person re-identification and action recognition. To reliably recognize human identity, in Chapter 2 we propose a novel Part-aware Attention Network (PAN) for person re-identification, which uses part feature maps as queries to perform second-order information propagation from middle-level features. PAN operates on all spatial positions of the feature maps so that it can discover long-distance relations. Considering that hard negative samples have a huge impact on action recognition performance, in Chapter 3 we propose a Common Daily Action Dataset (CDAD), which contains positive and negative action pairs for reliable daily action understanding. The established CDAD dataset not only serves as a benchmark for several important daily action understanding tasks, including multi-label action recognition, temporal action localization, and spatio-temporal action detection, but also provides a testbed for researchers to investigate the influence of highly similar negative samples on learning action understanding models. Efficiently and effectively modeling the 3D self-attention of video data has been a great challenge for transformer-based action recognition. In Chapter 4, we propose Temporal Patch Shift (TPS) for efficient spatiotemporal self-attention modeling, which largely increases the temporal modeling ability of 2D transformers without additional computation cost. Previous skeleton-based action recognition methods typically formulate recognition as classification over one-hot labels, without fully utilizing the semantic relations between actions. To fully explore the action prior knowledge contained in language, in Chapter 5 we propose Language Supervised Training (LST) for skeleton-based action recognition. More specifically, we take a large-scale language model as the knowledge engine to provide text descriptions of body-part actions and apply a multi-modal training scheme to supervise the skeleton encoder for action representation learning. In summary, this thesis presents three methods and one dataset for efficient and reliable human activity understanding. Among them, PAN uses part features to aggregate information from the mid-level features of a CNN for person re-identification; CDAD collects positive and negative action pairs for reliable action recognition; TPS applies a patch shift operation for efficient spatiotemporal modeling in transformers for video action recognition; and LST deploys human-part language descriptions to guide skeleton-based action recognition. Extensive experiments demonstrate their efficiency and reliability for human activity understanding. |
Rights: | All rights reserved |
Access: | open access |
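Note: The Temporal Patch Shift (TPS) idea summarized in the abstract lends itself to a brief illustration. The sketch below is not the thesis's implementation; it is a minimal, hypothetical PyTorch example that assumes video tokens of shape (batch, frames, patches, channels) and uses an illustrative shift pattern and ratio. It is meant only to show how shifting a subset of patches to neighbouring frames lets plain per-frame (2D) self-attention see temporal context without extra attention cost.

    import torch

    def temporal_patch_shift(x, shift_ratio=0.25):
        """Minimal sketch of a temporal patch-shift style operation.

        x: video patch tokens of shape (B, T, N, C) -- batch, frames,
           spatial patches per frame, channel dimension.
        For a fraction of the spatial patch positions, each frame's patch
        is replaced by the patch from a neighbouring frame, so a plain
        per-frame self-attention over the N patches mixes information
        from several time steps at no additional attention cost.
        The choice of positions and the 0.25 ratio here are illustrative
        assumptions, not the pattern used in the thesis.
        """
        B, T, N, C = x.shape
        n_shift = int(N * shift_ratio)
        half = n_shift // 2
        out = x.clone()
        # First half of the selected positions take patches from the
        # previous frame, the second half from the next frame.
        # torch.roll wraps around at the temporal boundary, which is a
        # simplification for this sketch.
        out[:, :, :half] = torch.roll(x[:, :, :half], shifts=1, dims=1)
        out[:, :, half:n_shift] = torch.roll(x[:, :, half:n_shift], shifts=-1, dims=1)
        return out

    # Usage: tokens from a 2D vision transformer block,
    # e.g. 8 frames, 196 patches per frame, 768 channels.
    tokens = torch.randn(2, 8, 196, 768)
    shifted = temporal_patch_shift(tokens)   # same shape as the input

After the shift, an ordinary spatial self-attention applied independently to each frame already attends over tokens drawn from neighbouring frames, which is the intuition behind using a 2D transformer for spatiotemporal modeling without additional computation.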
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/12263