Author: | Lu, Chongkai |
Title: | Towards end-to-end temporal action detection in videos |
Advisors: | Mak, Man Wai (EEE) |
Degree: | Ph.D. |
Year: | 2024 |
Subject: | Image processing -- Digital techniques; Video recordings; Machine learning; Deep learning (Machine learning); Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Electrical and Electronic Engineering |
Pages: | 135 pages : color illustrations |
Language: | English |
Abstract: | The exponential surge in video content in recent years has positioned video as a dominant medium of social interaction. The abundant video content facilitates the curation of video datasets rich in insightful content, enabling researchers to study human behaviors and enhancing our understanding of the world. However, the casual and random nature of everyday video recordings often leads to a predominance of irrelevant information, necessitating efficient methods for extracting valuable content. Temporal Action Detection (TAD) addresses this challenge by distinguishing between ‘foreground’ and ‘background’ segments in video sequences, based on the presence of target actions. This technology, pivotal in processing untrimmed raw videos, helps pinpoint segments of interest and extract pertinent information from these datasets.
Deep learning, a key subset of machine learning, has revolutionized various fields by enabling neural networks to learn from large datasets. Its primary advantage is the ability to learn from extensive data rather than relying on pre-existing human knowledge, leading to robust generalization and superior performance. Recently, deep learning-based methods have become central in advancing video analysis techniques, particularly in TAD, where they are now the standard approach in the academic community.
The main contribution of this work is the development of several deep learning-based TAD frameworks that outperform previous methods and offer unique structural benefits. A core goal of this research is to enhance the efficiency and performance of TAD methods. Traditional TAD approaches are often complex and multi-staged, requiring significant engineering effort to fine-tune the model’s hyperparameters. In contrast, our research leverages deep learning’s spirit of efficiency and streamlined processing to develop end-to-end TAD models that integrate feature extraction and action detection in a single process.
The first part of this dissertation addresses the input stage of TAD. In response to the challenge of processing extensive untrimmed videos, the dissertation introduces the Action Progression Network (APN). APN employs ‘action progression’ as a measurable indicator, enabling the use of a single frame or a brief video segment as the input. This innovation streamlines the TAD process, ensuring uniform computational efficiency, irrespective of video duration. Additionally, APN is distinctively trained to target specific actions independent of background activities, substantially improving its generalization capabilities and diminishing the dependency on large datasets. APN has demonstrated exceptional precision in identifying actions with notable evolutionary features. This proficiency, coupled with its top-tier performance on public datasets, establishes APN as a groundbreaking development in enhancing both the efficiency and accuracy of TAD.
The second part of this dissertation focuses on optimizing the output stage of TAD. Traditional TAD models generate a multitude of initial results, typically requiring laborious post-processing, such as Non-maximum Suppression (NMS), for refinement. To streamline this process, we integrated the DEtection TRansformer (DETR) approach into TAD, enabling the model to directly produce finalized detection results via a one-to-one matching mechanism. This integration not only simplifies the overall detection workflow but also faithfully adheres to the end-to-end principles. Our work further entails the adaptation and refinement of various DETR optimization techniques for TAD applications, involving a series of experiments with diverse configurations to elevate both the performance and accuracy of the models. The result of this extensive research and development is DITA: DETR with Improved Queries for End-to-End Temporal Action Detection. DITA successfully incorporates the traditional detection and post-processing methods into TAD, achieving competitive performance on public datasets and demonstrating its robust capability in practical TAD applications.
In conclusion, the contributions of this work significantly propel the development of end-to-end TAD, making action detection in videos simple and efficient. The impressive performance of these frameworks on public datasets demonstrates their efficacy and real-world applicability. Looking ahead, we plan to integrate the insights from this work and draw inspiration from other methodologies to develop a truly comprehensive end-to-end TAD model. Further, we plan to delve deeper into the mechanics of deep learning models in video action detection, seeking knowledge beyond traditional model design. This exploration is anticipated to uncover new insights, enhancing the efficiency and effectiveness of TAD models. |
Rights: | All rights reserved |
Access: | open access |
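For readers unfamiliar with the post-processing step that the abstract says conventional TAD models require, the following is a minimal Python sketch of greedy Non-maximum Suppression (NMS) over temporal segments. It is an illustration only and is not taken from the thesis: the (start, end, score) tuple layout, the function names `temporal_iou` and `temporal_nms`, and the 0.5 overlap threshold are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical layout for one detection: (start_time, end_time, confidence).
Segment = Tuple[float, float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals (scores are ignored)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Greedy NMS: visit segments by descending score and keep a segment
    only if it does not overlap an already kept one above the threshold."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return kept


if __name__ == "__main__":
    # Two near-duplicate detections of one action plus a separate action.
    raw = [(1.0, 5.0, 0.9), (1.2, 5.1, 0.8), (10.0, 14.0, 0.7)]
    print(temporal_nms(raw))
```

In this sketch the threshold is a hand-tuned hyperparameter applied after the model has run; a detector trained with one-to-one matching, as the abstract describes for DITA, is intended to make such a step unnecessary.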
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13241