Machine learning for human activity analysis and recognition

Liu, Tianshan

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	en_US
dc.contributor.advisor	Lam, Kin-man Kenneth (EIE)	en_US
dc.creator	Liu, Tianshan	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/12320	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	en_US
dc.rights	All rights reserved	en_US
dc.title	Machine learning for human activity analysis and recognition	en_US
dcterms.abstract	The analysis and recognition of human activities in videos are crucial and fundamental topics in computer vision. With the development of machine-learning methods, especially deep-learning-based techniques, and the emergence of large-scale data sets, remarkable improvements have been achieved in the performance of human activity recognition. However, most current research is devoted to analyzing single-person activities captured from third-person views in trimmed videos. This hinders the deployment of existing approaches in more complicated real-world scenarios, such as when the scene involves interactions between multiple persons, when the activities are recorded from first-person (egocentric) views, or when only raw, long, untrimmed videos are available. Thus, this thesis focuses on investigating effective machine-learning-based models to address these challenges, which arise in four specific tasks: egocentric activity recognition, group activity recognition, concurrent first and third-person activity recognition, and anomaly event detection in untrimmed videos.	en_US
dcterms.abstract	First, videos captured from first-person views usually contain frequent egomotion, cluttered backgrounds, and only partial body movements of the camera wearer, so useful information is scarce. Hence, it is vital to sequentially localize the relevant regions of human-object interaction in order to identify the target motion patterns and active objects. This thesis proposes an enhanced attention-tracking method that coherently captures fine-grained human-object interactions in video sequences without requiring extra frame-level annotations, thereby enabling accurate recognition of egocentric activities.	en_US
dcterms.abstract	Second, group activity in a scene generally involves complex interactions between multiple persons. Without knowledge of the specific interaction patterns, it is challenging to model the hidden relationships among subjects from the video inputs. This thesis explores a visual-semantic graph neural network (VS-GNN), which aims to simultaneously exploit abundant visual modalities and the semantic hierarchies of the label space. By discovering the diverse relations between individuals and groups, the proposed VS-GNN improves the performance of group activity recognition.	en_US
dcterms.abstract	Third, this thesis investigates a novel task, concurrent first and third-person activity recognition (CFT-AR), which is essentially a hybrid scenario that has not been studied in previous works. A new activity data set, PolyU CFT Daily, was first created to facilitate research on CFT-AR. This data set inherits the characteristics of egocentric videos and involves multiple persons in varied scenes, which poses unprecedented challenges. Then, a comprehensive solution is presented that learns both holistic scene-level and local instance-level representations to provide sufficiently discriminative patterns for recognizing both first and third-person activities.	en_US
dcterms.abstract	Fourth, anomaly event detection (AED) aims to identify the snippets involving anomalous activities or behaviors in a long untrimmed video. In particular, the weakly supervised (WS) setting is a promising pipeline for AED, as it uses only cheap video-level labels while significantly improving detection performance. Current WS-AED methods tend to employ multimodal inputs to guarantee the robustness of the detector; such methods rely heavily on the availability of multiple modalities and are computationally expensive when processing long sequences. This thesis designs a privileged knowledge-distillation (KD) framework specifically for the WS-AED task, with the goal of training a lightweight yet effective unimodal detector.	en_US
dcterms.extent	xxxii, 175 pages : color illustrations	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2023	en_US
dcterms.educationalLevel	Ph.D.	en_US
dcterms.educationalLevel	All Doctorate	en_US
dcterms.LCSH	Human activity recognition	en_US
dcterms.LCSH	Machine learning	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.accessRights	open access	en_US

Files in This Item:

File	Description	Size	Format
6767.pdf	For All Users	34.47 MB	Adobe PDF

Copyright Undertaking

As a bona fide Library user, I declare that:

I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/12320