Author: | Liu, Tianshan |
Title: | Machine learning for human activity analysis and recognition |
Advisors: | Lam, Kin-man Kenneth (EIE) |
Degree: | Ph.D. |
Year: | 2023 |
Subject: | Human activity recognition ; Machine learning ; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Electronic and Information Engineering |
Pages: | xxxii, 175 pages : color illustrations |
Language: | English |
Abstract: | The analysis and recognition of human activities in videos are fundamental topics in computer vision. With the development of machine-learning methods, especially deep-learning-based techniques, and the emergence of large-scale data sets, remarkable improvements have been achieved in the performance of human activity recognition. However, most current research is devoted to analyzing single-person activities captured from third-person views in trimmed videos. This hinders existing approaches from being deployed in more complicated real-world scenarios, such as when the scene involves interactions between multiple persons, when the activities are recorded from first-person (egocentric) views, or when only raw, long, untrimmed videos are available. This thesis therefore focuses on investigating effective machine-learning-based models to address these challenges, which arise from four specific tasks: egocentric activity recognition, group activity recognition, concurrent first- and third-person activity recognition, and anomaly event detection in untrimmed videos.

First, videos captured from first-person views usually contain frequent egomotion, cluttered backgrounds, and only partial body movements of the camera wearer, which leads to a scarcity of useful information. It is therefore vital to sequentially localize the relevant regions of human-object interactions, in order to identify the target motion patterns and active objects. This thesis proposes an enhanced attention-tracking method that coherently captures fine-grained human-object interactions in video sequences without requiring extra frame-level annotations, thereby enabling accurate recognition of egocentric activities.

Second, group activity in a scene generally involves complex interactions between multiple persons. Without knowing the specific interaction patterns, it is challenging to model the hidden relationships among subjects from video inputs. This thesis explores a visual-semantic graph neural network (VS-GNN), which simultaneously exploits abundant visual modalities and the semantic hierarchies of the label space. By discovering the diverse relations between individuals and groups, the proposed VS-GNN improves the performance of group activity recognition.

Third, this thesis investigates a novel task, concurrent first- and third-person activity recognition (CFT-AR), which is essentially a hybrid scenario that has not been studied in previous works. A new activity data set, namely PolyU CFT Daily, was first created to facilitate research on CFT-AR. This data set inherits the characteristics of egocentric videos and involves multiple persons in varied scenes, which poses unprecedented challenges. A comprehensive solution is then presented, which learns both holistic scene-level and local instance-level representations to provide sufficiently discriminative patterns for recognizing both first- and third-person activities, as illustrated by the sketch below.
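To make the dual-representation idea concrete, the following is a minimal PyTorch sketch of one generic way to fuse a holistic scene-level feature with pooled instance-level features for activity classification. All module names, feature dimensions, and the fusion strategy are assumptions for illustration only; they are not taken from the thesis's actual architecture.

```python
# Illustrative sketch only: a generic two-stream head that fuses a global
# scene descriptor with pooled per-instance descriptors. Dimensions and
# fusion choices are assumptions, not the thesis's method.
import torch
import torch.nn as nn

class SceneInstanceHead(nn.Module):
    def __init__(self, scene_dim=2048, inst_dim=1024, num_classes=10):
        super().__init__()
        self.scene_proj = nn.Linear(scene_dim, 512)   # holistic scene branch
        self.inst_proj = nn.Linear(inst_dim, 512)     # local instance branch
        self.classifier = nn.Linear(512 * 2, num_classes)

    def forward(self, scene_feat, inst_feats):
        # scene_feat: (B, scene_dim)     one global descriptor per video
        # inst_feats: (B, N, inst_dim)   N person/object instance descriptors
        scene = torch.relu(self.scene_proj(scene_feat))
        inst = torch.relu(self.inst_proj(inst_feats)).mean(dim=1)  # pool instances
        return self.classifier(torch.cat([scene, inst], dim=-1))   # fused logits

# Usage with random tensors standing in for backbone features:
head = SceneInstanceHead()
logits = head(torch.randn(4, 2048), torch.randn(4, 5, 1024))  # shape (4, 10)
```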
Fourth, anomaly event detection (AED) aims to identify the snippets involving anomalous activities or behaviors in a long untrimmed video. In particular, the weakly supervised (WS) setting is a promising pipeline for AED, as it solely utilizes cheap video-level labels while significantly improving detection performance. Current WS-AED methods tend to employ multimodal inputs to guarantee the robustness of the detector; they therefore rely heavily on the availability of multiple modalities and are computationally expensive when processing long sequences. This thesis designs a privileged knowledge-distillation (KD) framework specifically for the WS-AED task, with the goal of training a lightweight yet effective unimodal detector. |
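As a rough illustration of how privileged KD can be combined with weak supervision, here is a minimal PyTorch sketch of one common pattern: a frozen multimodal teacher scores video snippets, and a lightweight unimodal student is trained with a top-k MIL objective on video-level labels plus a distillation term that mimics the teacher's snippet scores. The scorer architecture, top-k pooling, and loss weighting are all assumptions for illustration; the thesis's actual framework may differ.

```python
# Illustrative sketch only: weakly supervised anomaly scoring with a
# distillation term from a multimodal teacher. Not the thesis's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnippetScorer(nn.Module):
    """Scores each snippet of a video as anomalous (1) or normal (0)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, x):                               # x: (B, T, feat_dim)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # (B, T) anomaly scores

def wsaed_kd_loss(student, teacher, rgb_feats, multi_feats,
                  video_labels, k=4, alpha=0.5):
    s_scores = student(rgb_feats)            # unimodal student: RGB only
    with torch.no_grad():
        t_scores = teacher(multi_feats)      # multimodal teacher, training-only
    # MIL objective: mean of the top-k snippet scores should match the
    # video-level label (anomalous videos contain some high-scoring snippets).
    video_scores = s_scores.topk(k, dim=1).values.mean(dim=1)
    mil = F.binary_cross_entropy(video_scores, video_labels)
    # Distillation: student mimics the teacher's snippet-level scores.
    kd = F.mse_loss(s_scores, t_scores)
    return mil + alpha * kd

# Usage with random features (batch of 2 videos, 32 snippets each):
student, teacher = SnippetScorer(1024), SnippetScorer(2048)
loss = wsaed_kd_loss(student, teacher,
                     torch.randn(2, 32, 1024), torch.randn(2, 32, 2048),
                     video_labels=torch.tensor([1.0, 0.0]))
loss.backward()
```

At inference time only the unimodal student runs, which is what makes the distilled detector lightweight: the multimodal teacher acts purely as privileged information during training.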
Rights: | All rights reserved |
Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/12320