Video-based pattern recognition by spatio-temporal modeling via multi-modality co-learning

Pao Yue-kong Library Electronic Theses Database


Author: Zheng, Haomian
Title: Video-based pattern recognition by spatio-temporal modeling via multi-modality co-learning
Degree: Ph.D.
Year: 2012
Subject: Digital video.
Image processing -- Digital techniques.
Pattern recognition systems.
Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Computing
Pages: xiv, 105 p. : ill. ; 30 cm.
Language: English
InnoPac Record:
Abstract: The rapid growth of online video content makes it challenging to analyze, understand and process video content in real time. Video pattern recognition is emerging as an important research topic in computer vision and communication. Real-time applications such as Internet video search and video surveillance are popular nowadays, so effective and fast processing approaches are in high demand. Although traditional pattern recognition techniques can solve text and image problems with satisfactory performance, they are subject to certain limitations when processing video, owing to the large amount of data and the time complexity involved. On the other hand, some statistical models have been proposed for special video processing applications, but they cannot handle the general video-based pattern recognition problem. In this thesis, we tackle these problems by addressing three key issues: feature extraction/video representation, indexing, and similarity measurement for classification. The feasibility of the proposed approaches is demonstrated through experiments on audio-visual speaker identification, video action recognition and gesture recognition.
First, we investigate video feature extraction and representation. Trajectories in a high-dimensional space are used to represent each video clip, and global statistical features are extracted from the trajectory for classification. Based on this representation, we propose two new approaches for the recognition task: Differential Luminance Field Trajectory (DLFT) and Luminance Aligned Projection Distance (LAPD). For DLFT, we extract the differential signals as features and then classify the action by supervised learning. For LAPD, we define a new similarity measurement and compute a distance metric that describes the similarity between videos for classification. Fusing the two methods yields even more promising properties. Experimental results demonstrate that both methods work effectively and efficiently.
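The trajectory-based representation described above can be illustrated with a minimal sketch. The function names, the choice of global statistics, and the nearest-point matching used here are illustrative assumptions; the precise DLFT and LAPD definitions are those given in the thesis itself.

```python
import numpy as np

def luminance_trajectory(frames):
    """Represent a video clip as a trajectory in luminance space:
    each frame (an H x W luminance array) becomes one point (a
    flattened vector), so a clip of T frames is a T x (H*W) trajectory."""
    return np.stack([f.ravel().astype(float) for f in frames])

def differential_features(traj):
    """DLFT-style features (sketch): frame-to-frame differences along
    the trajectory, summarized by global statistics (mean and std).
    The statistics chosen here are an assumption for illustration."""
    diffs = np.diff(traj, axis=0)
    return np.concatenate([diffs.mean(axis=0), diffs.std(axis=0)])

def projection_distance(traj_a, traj_b):
    """A simple point-to-trajectory distance in the spirit of LAPD:
    each point of traj_a is matched to its nearest point on traj_b
    (and vice versa), and the average nearest distance is returned."""
    d = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

With such a distance in hand, classification can proceed by nearest-neighbor matching of a query clip's trajectory against labeled gallery clips.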
Second, we extend our work by exploiting local spatio-temporal features via indexing. Local features generally carry more discriminative statistical information. We handle spatio-temporal modeling by partitioning the appearance space, so that the proposed approach captures the discriminative information among different action classes. For trajectory matching, we develop a query-driven dynamic appearance modeling method and use localized subspaces to obtain more reliable distances for discrimination. Flexibility is guaranteed by introducing a warping scheme. The processing is implemented on top of an indexing scheme and is therefore computationally fast. Simulation results demonstrate the effectiveness of the solution.
Third, we focus on improving recognition performance through novel learning methods. Considering the variety of features used for video representation, we aim to exploit multiple feature sets jointly to solve the recognition problem. We propose a multi-modality distance metric co-learning method in which two different feature sets are used jointly to generate a better description of the video clips. In this way the similarity between video clips is evaluated more accurately and the recognition accuracy is improved. The effectiveness of the proposed method is demonstrated on audio-visual speaker identification. Furthermore, to demonstrate its robustness, the method is also applied to digit recognition and text classification. Experimental results show that the proposed multi-modality approach outperforms single-modality baselines, as well as previous methods, in recognition accuracy.
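The idea of combining two modalities' distances can be sketched as follows. This is not the thesis's co-learning algorithm (which jointly learns the metrics); it is only an illustrative fusion of two pre-computed per-modality distance matrices, with the normalization scheme and the weight `alpha` as assumptions.

```python
import numpy as np

def fused_distance(d_audio, d_visual, alpha=0.5):
    """Fuse two per-modality query-by-gallery distance matrices.
    Each matrix is scaled to a comparable range before a convex
    combination; in a full co-learning scheme the weighting would be
    learned rather than fixed. alpha is an illustrative parameter."""
    na = d_audio / (d_audio.max() + 1e-12)
    nv = d_visual / (d_visual.max() + 1e-12)
    return alpha * na + (1.0 - alpha) * nv

def nearest_neighbor_labels(dist, gallery_labels):
    """1-NN classification: each query takes the label of its nearest
    gallery item under the fused distance."""
    return gallery_labels[np.argmin(dist, axis=1)]
```

The intuition matches the abstract: when one modality's distance is unreliable for a given pair, the other modality can correct the fused similarity, improving recognition accuracy over either modality alone.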

Files in this item

Files Size Format
b26158693.pdf 4.709Mb PDF

