Full metadata record
DC Field: Value [Language]
dc.contributor: Department of Electrical and Electronic Engineering [en_US]
dc.contributor.advisor: Mak, Man Wai (EEE) [en_US]
dc.creator: Lu, Chongkai
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/13241
dc.language: English [en_US]
dc.publisher: Hong Kong Polytechnic University [en_US]
dc.rights: All rights reserved [en_US]
dc.title: Towards end-to-end temporal action detection in videos [en_US]
dcterms.abstract: The exponential surge in video content in recent years has positioned video as a dominant medium of social interaction. The abundant video content facilitates the curation of video datasets rich in insightful content, enabling researchers to study human behaviors and enhance our understanding of the world. However, the casual and random nature of everyday video recordings often leads to a predominance of irrelevant information, necessitating efficient methods for extracting valuable content. Temporal Action Detection (TAD) addresses this challenge by distinguishing ‘foreground’ from ‘background’ segments in video sequences, based on the presence of target actions. This technology, pivotal in processing untrimmed raw videos, helps pinpoint segments of interest and extract pertinent information from these datasets. [en_US]
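As an aside on the segment-level formulation described in the paragraph above, the standard criterion for deciding whether a detected foreground segment matches a ground-truth action is temporal Intersection-over-Union (tIoU). The snippet below is an illustrative sketch, not code from the thesis; the function name and example values are our own.

# Illustrative only: temporal IoU (tIoU) between two segments, the usual
# measure for matching a predicted action segment against a ground-truth one.
def temporal_iou(pred, gt):
    """pred and gt are (start, end) pairs in seconds or frame indices."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a detection spanning 4.0-9.0 s against a ground-truth action at 5.0-10.0 s.
print(temporal_iou((4.0, 9.0), (5.0, 10.0)))  # ~0.667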
dcterms.abstract: Deep learning, a key subset of machine learning, has revolutionized various fields by enabling neural networks to learn from large datasets. Its primary advantage is the ability to learn from extensive data rather than relying on pre-existing human knowledge, leading to robust generalization and superior performance. Recently, deep learning-based methods have become central to advancing video analysis techniques, particularly in TAD, where they are now the standard approach in the academic community. [en_US]
dcterms.abstract: The main contribution of this work is the development of several deep learning-based TAD frameworks that outperform previous methods and offer unique structural benefits. A core goal of this research is to enhance the efficiency and performance of TAD methods. Traditional TAD approaches are often complex and multi-staged, requiring significant engineering effort to fine-tune the model’s hyperparameters. In contrast, our research leverages deep learning’s spirit of efficiency and streamlined processing to develop end-to-end TAD models that integrate feature extraction and action detection in a single process. [en_US]
dcterms.abstract: The first part of this dissertation addresses the input stage of TAD. In response to the challenge of processing extensive untrimmed videos, the dissertation introduces the Action Progression Network (APN). APN employs ‘action progression’ as a measurable indicator, enabling the use of a single frame or a brief video segment as the input. This innovation streamlines the TAD process, ensuring uniform computational efficiency, irrespective of video duration. Additionally, APN is distinctively trained to target specific actions independent of background activities, substantially improving its generalization capabilities and diminishing the dependency on large datasets. APN has demonstrated exceptional precision in identifying actions with notable evolutionary features. This proficiency, coupled with its top-tier performance on public datasets, establishes APN as a groundbreaking development in enhancing both the efficiency and accuracy of TAD. [en_US]
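To make the notion of ‘action progression’ concrete, the sketch below shows, under our own assumptions, what a frame-level progression head could look like: a network that maps one frame’s features to a class score and a progression value in [0, 1]. The backbone, layer sizes, and decoding rule are illustrative assumptions, not the actual APN design.

# Hedged sketch (our own assumptions, not the thesis's architecture): a head that
# predicts, from a single frame's features, which action is present and how far
# that action has progressed, squashed to [0, 1].
import torch
import torch.nn as nn

class ProgressionHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)  # action class scores
        self.prog = nn.Linear(feat_dim, 1)           # scalar progression estimate

    def forward(self, frame_feat):
        # frame_feat: (batch, feat_dim) features from any frame-level backbone
        logits = self.cls(frame_feat)
        progression = torch.sigmoid(self.prog(frame_feat))  # in [0, 1]
        return logits, progression

# Each frame is processed independently, so the per-frame cost stays constant no
# matter how long the untrimmed video is; detections would then be decoded from
# runs of frames whose progression rises from near 0 to near 1.
head = ProgressionHead()
logits, prog = head(torch.randn(4, 2048))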
dcterms.abstract: The second part of this dissertation focuses on optimizing the output stage of TAD. Traditional TAD models generate a multitude of initial results, typically requiring laborious post-processing, such as Non-Maximum Suppression (NMS), for refinement. To streamline this process, we integrated the Detection Transformer (DETR) approach into TAD, enabling the model to directly produce finalized detection results via a one-to-one matching mechanism. This integration not only simplifies the overall detection workflow but also faithfully adheres to end-to-end principles. Our work further entails the adaptation and refinement of various DETR optimization techniques for TAD applications, involving a series of experiments with diverse configurations to elevate both the performance and accuracy of the models. The result of this extensive research and development is DITA: DETR with Improved Queries for End-to-End Temporal Action Detection. DITA folds the traditionally separate detection and post-processing stages into a single TAD model, achieving competitive performance on public datasets and demonstrating its robust capability in practical TAD applications. [en_US]
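The one-to-one matching mechanism mentioned above can be illustrated with a bipartite assignment between predicted and ground-truth segments, which is what makes NMS unnecessary in DETR-style detectors. The cost used here (1 - tIoU) is a simplified stand-in for the full matching cost such models typically combine with classification terms; this is a sketch, not the DITA implementation.

# Hedged sketch of DETR-style one-to-one matching for temporal segments: each
# ground-truth action is assigned to exactly one query's prediction, so duplicate
# overlapping outputs are penalized in training and no NMS is needed at inference.
import numpy as np
from scipy.optimize import linear_sum_assignment

def tiou_matrix(preds, gts):
    """preds: (N, 2) predicted (start, end); gts: (M, 2) ground-truth segments."""
    inter = np.maximum(
        0.0,
        np.minimum(preds[:, None, 1], gts[None, :, 1])
        - np.maximum(preds[:, None, 0], gts[None, :, 0]),
    )
    union = (preds[:, 1] - preds[:, 0])[:, None] + (gts[:, 1] - gts[:, 0])[None, :] - inter
    return inter / np.maximum(union, 1e-8)

preds = np.array([[4.0, 9.0], [5.2, 9.8], [20.0, 25.0]])  # query outputs
gts = np.array([[5.0, 10.0], [21.0, 24.0]])               # annotated actions
cost = 1.0 - tiou_matrix(preds, gts)                       # simplified matching cost
rows, cols = linear_sum_assignment(cost)                   # Hungarian assignment
print(list(zip(rows, cols)))  # each ground truth is claimed by exactly one query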
dcterms.abstract: In conclusion, the contributions of this work significantly propel the development of end-to-end TAD, making action detection in videos simple and efficient. The impressive performance of these frameworks on public datasets demonstrates their efficacy and real-world applicability. Looking ahead, we plan to integrate the insights from this work and draw inspiration from other methodologies to develop a truly comprehensive end-to-end TAD model. Further, we plan to delve deeper into the mechanics of deep learning models in video action detection, seeking knowledge beyond traditional model design. This exploration is anticipated to uncover new insights, enhancing the efficiency and effectiveness of TAD models. [en_US]
dcterms.extent: 135 pages : color illustrations [en_US]
dcterms.isPartOf: PolyU Electronic Theses [en_US]
dcterms.issued: 2024 [en_US]
dcterms.educationalLevel: Ph.D. [en_US]
dcterms.educationalLevel: All Doctorate [en_US]
dcterms.LCSH: Image processing -- Digital techniques [en_US]
dcterms.LCSH: Video recordings [en_US]
dcterms.LCSH: Machine learning [en_US]
dcterms.LCSH: Deep learning (Machine learning) [en_US]
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations [en_US]
dcterms.accessRights: open access [en_US]

Files in This Item:
File | Description | Size | Format
7696.pdf | For All Users | 14.48 MB | Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13241