Full metadata record
DC Field: Value [Language]
dc.contributor: Department of Electrical and Electronic Engineering [en_US]
dc.contributor.advisor: Mak, Man Wai (EEE) [en_US]
dc.creator: Lu, Chongkai
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/13241
dc.language: English [en_US]
dc.publisher: Hong Kong Polytechnic University [en_US]
dc.rights: All rights reserved [en_US]
dc.title: Towards end-to-end temporal action detection in videos [en_US]
dcterms.abstract: The exponential surge in video content in recent years has positioned video as a dominant medium of social interaction. The abundant video content facilitates the curation of video datasets rich in insightful content, enabling researchers to study human behaviors and enhance our understanding of the world. However, the casual and random nature of everyday video recordings often leads to a predominance of irrelevant information, necessitating efficient methods for extracting valuable content. Temporal Action Detection (TAD) addresses this challenge by distinguishing ‘foreground’ from ‘background’ segments in video sequences, based on the presence of target actions. This technology, pivotal in processing untrimmed raw videos, helps pinpoint segments of interest and extract pertinent information from these datasets. [en_US]
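As an aside on the segment-level formulation described in the paragraph above, the standard criterion for deciding whether a detected foreground segment matches a ground-truth action is temporal Intersection-over-Union (tIoU). The snippet below is an illustrative sketch, not code from the thesis; the function name and example values are our own.

# Illustrative only: temporal IoU (tIoU) between two segments, the usual
# measure for matching a predicted action segment against a ground-truth one.
def temporal_iou(pred, gt):
    """pred and gt are (start, end) pairs in seconds or frame indices."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a detection spanning 4.0-9.0 s against a ground-truth action at 5.0-10.0 s.
print(temporal_iou((4.0, 9.0), (5.0, 10.0)))  # ~0.667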
dcterms.abstract: Deep learning, a key subset of machine learning, has revolutionized various fields by enabling neural networks to learn from large datasets. Its primary advantage is the ability to learn from extensive data rather than relying on pre-existing human knowledge, leading to robust generalization and superior performance. Recently, deep learning-based methods have become central to advancing video analysis techniques, particularly in TAD, where they are now the standard approach in the academic community. [en_US]
dcterms.abstract: The main contribution of this work is the development of several deep learning-based TAD frameworks that outperform previous methods and offer unique structural benefits. A core goal of this research is to enhance the efficiency and performance of TAD methods. Traditional TAD approaches are often complex and multi-staged, requiring significant engineering effort to fine-tune the model’s hyperparameters. In contrast, our research leverages deep learning’s spirit of efficiency and streamlined processing to develop end-to-end TAD models that integrate feature extraction and action detection in a single process. [en_US]
dcterms.abstract: The first part of this dissertation addresses the input stage of TAD. In response to the challenge of processing extensive untrimmed videos, the dissertation introduces the Action Progression Network (APN). APN employs ‘action progression’ as a measurable indicator, enabling the use of a single frame or a brief video segment as the input. This innovation streamlines the TAD process, ensuring uniform computational efficiency, irrespective of video duration. Additionally, APN is distinctively trained to target specific actions independent of background activities, substantially improving its generalization capabilities and diminishing the dependency on large datasets. APN has demonstrated exceptional precision in identifying actions with notable evolutionary features. This proficiency, coupled with its top-tier performance on public datasets, establishes APN as a groundbreaking development in enhancing both the efficiency and accuracy of TAD. [en_US]
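To make the notion of ‘action progression’ concrete, the sketch below shows, under our own assumptions, what a frame-level progression head could look like: a network that maps one frame’s features to a class score and a progression value in [0, 1]. The backbone, layer sizes, and decoding rule are illustrative assumptions, not the actual APN design.

# Hedged sketch (our own assumptions, not the thesis's architecture): a head that
# predicts, from a single frame's features, which action is present and how far
# that action has progressed, squashed to [0, 1].
import torch
import torch.nn as nn

class ProgressionHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)  # action class scores
        self.prog = nn.Linear(feat_dim, 1)           # scalar progression estimate

    def forward(self, frame_feat):
        # frame_feat: (batch, feat_dim) features from any frame-level backbone
        logits = self.cls(frame_feat)
        progression = torch.sigmoid(self.prog(frame_feat))  # in [0, 1]
        return logits, progression

# Each frame is processed independently, so the per-frame cost stays constant no
# matter how long the untrimmed video is; detections would then be decoded from
# runs of frames whose progression rises from near 0 to near 1.
head = ProgressionHead()
logits, prog = head(torch.randn(4, 2048))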
dcterms.abstract: The second part of this dissertation focuses on optimizing the output stage of TAD. Traditional TAD models generate a multitude of initial results, typically requiring laborious post-processing, such as Non-Maximum Suppression (NMS), for refinement. To streamline this process, we integrated the Detection Transformer (DETR) approach into TAD, enabling the model to directly produce finalized detection results via a one-to-one matching mechanism. This integration not only simplifies the overall detection workflow but also faithfully adheres to end-to-end principles. Our work further entails the adaptation and refinement of various DETR optimization techniques for TAD applications, involving a series of experiments with diverse configurations to elevate both the performance and accuracy of the models. The result of this extensive research and development is DITA: DETR with Improved Queries for End-to-End Temporal Action Detection. DITA folds the traditionally separate detection and post-processing stages into a single TAD model, achieving competitive performance on public datasets and demonstrating its robust capability in practical TAD applications. [en_US]
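The one-to-one matching mechanism mentioned above can be illustrated with a bipartite assignment between predicted and ground-truth segments, which is what makes NMS unnecessary in DETR-style detectors. The cost used here (1 - tIoU) is a simplified stand-in for the full matching cost such models typically combine with classification terms; this is a sketch, not the DITA implementation.

# Hedged sketch of DETR-style one-to-one matching for temporal segments: each
# ground-truth action is assigned to exactly one query's prediction, so duplicate
# overlapping outputs are penalized in training and no NMS is needed at inference.
import numpy as np
from scipy.optimize import linear_sum_assignment

def tiou_matrix(preds, gts):
    """preds: (N, 2) predicted (start, end); gts: (M, 2) ground-truth segments."""
    inter = np.maximum(
        0.0,
        np.minimum(preds[:, None, 1], gts[None, :, 1])
        - np.maximum(preds[:, None, 0], gts[None, :, 0]),
    )
    union = (preds[:, 1] - preds[:, 0])[:, None] + (gts[:, 1] - gts[:, 0])[None, :] - inter
    return inter / np.maximum(union, 1e-8)

preds = np.array([[4.0, 9.0], [5.2, 9.8], [20.0, 25.0]])  # query outputs
gts = np.array([[5.0, 10.0], [21.0, 24.0]])               # annotated actions
cost = 1.0 - tiou_matrix(preds, gts)                       # simplified matching cost
rows, cols = linear_sum_assignment(cost)                   # Hungarian assignment
print(list(zip(rows, cols)))  # each ground truth is claimed by exactly one query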
dcterms.abstract: In conclusion, the contributions of this work significantly propel the development of end-to-end TAD, making action detection in videos simple and efficient. The impressive performance of these frameworks on public datasets demonstrates their efficacy and real-world applicability. Looking ahead, we plan to integrate the insights from this work and draw inspiration from other methodologies to develop a truly comprehensive end-to-end TAD model. Further, we plan to delve deeper into the mechanics of deep learning models in video action detection, seeking knowledge beyond traditional model design. This exploration is anticipated to uncover new insights, enhancing the efficiency and effectiveness of TAD models. [en_US]
dcterms.extent: 135 pages : color illustrations [en_US]
dcterms.isPartOf: PolyU Electronic Theses [en_US]
dcterms.issued: 2024 [en_US]
dcterms.educationalLevel: Ph.D. [en_US]
dcterms.educationalLevel: All Doctorate [en_US]
dcterms.LCSH: Image processing -- Digital techniques [en_US]
dcterms.LCSH: Video recordings [en_US]
dcterms.LCSH: Machine learning [en_US]
dcterms.LCSH: Deep learning (Machine learning) [en_US]
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations [en_US]
dcterms.accessRights: open access [en_US]

Files in This Item:
File | Description | Size | Format
7696.pdf | For All Users | 14.48 MB | Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13241