On the proposal generation and overfitting for CNN based single object tracking

Yang, Lingxiao

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Computing	en_US
dc.contributor.advisor	Zhang, Lei (COMP)	-
dc.creator	Yang, Lingxiao	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/10410	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	-
dc.rights	All rights reserved	en_US
dc.title	On the proposal generation and overfitting for CNN based single object tracking	en_US
dcterms.abstract	Visual object tracking is a core problem in computer vision and multimedia understanding because it has numerous applications, such as human-computer interaction, video surveillance, and autonomous driving. In this thesis, we focus on the task of single object tracking, where the goal is to estimate the trajectory of an arbitrary object over time. As a standard setup, the object is annotated as a rectangle in the first frame. Recently, Convolutional Neural Networks (CNNs) have shown impressive performance in this task. However, existing CNN trackers still have two common problems: ineffective proposal generation and a tendency to overfit. We aim to address these two issues in this thesis. Proposal generation is the first step in many existing trackers; it provides a number of candidates using a motion model, e.g. a Gaussian Random Walk (GRW). Such candidates are always generated near the center of the previously estimated target position. Then, a CNN model is employed to classify these candidates, and the most likely one is taken as the target position in the current frame. However, existing proposal generation methods only use cues from previous frames, ignoring much useful information in the current frame. In addition, existing motion models tend to generate a substantial number of low-quality proposals, such as background regions and distractors, which increases the risk of drifting. In this thesis, we propose two methods to address these issues. We first develop a new tracking framework, referred to as Deep Location-Specific Tracking, which decomposes the tracking problem into a localization task and a classification task, and trains an individual network for each task. 
The localization network exploits the information in the current frame and provides a specific location to improve the probability of successful tracking, while the classification network finds the target among many examples generated around the target location in the previous frame, as well as the one estimated by the localization network in the current frame. Extensive experimental results on popular benchmark datasets demonstrate the effectiveness of the proposed tracking framework. In our second work, a fast estimator is introduced to produce an instance-aware proposal (IAP), which serves as guidance for removing many useless proposals. The estimator is updated online to adapt to changes of the target object according to the feedback from the detector. With the proposed IAP, the whole tracking algorithm achieves leading results on many tracking benchmarks in comparison with other state-of-the-art methods. Furthermore, we also present a small network for fast tracking. In particular, with our IAP component, the fast tracker obtains results very similar to those of the original tracker, but runs at over 20 FPS, which is 2 to 3 times faster than the original.	en_US
dcterms.abstract	Due to the limited training data available during online tracking, most existing CNN trackers cannot handle the overfitting problem well, which is the major difficulty for robust tracking. An intuitive idea to mitigate this problem is to use large-scale labelled data to train a powerful CNN offline. However, large-scale datasets and their annotations are expensive to acquire. In recent years, some unsupervised methods have been proposed to learn visual trackers without labelled data, but their performance lags far behind that of supervised methods. The main bottleneck of these methods is the inconsistency between the objectives of the offline training and online tracking stages. To address this problem, in our third work we propose a novel unsupervised learning pipeline based on the discriminative correlation filter network. It iteratively updates the tracker by alternating between target localization and network optimization. In particular, we propose to learn the network from a single movie without any annotation; this type of data source is much easier to obtain than thousands of video clips or millions of images. Our approach is insensitive to the choice of movie, and it achieves the best performance among all unsupervised learning approaches. Moreover, we found that our tracker obtains results similar to those of related supervised learning methods, which train the same network on large-scale labeled video datasets. Another possible way to address the problem of overfitting is to use a small (or lightweight) CNN model for online tracking. In our last work, we design a tiny (<100 KB) CNN architecture for single object tracking based on the popular regression tracking framework, whose trackers are referred to as regression trackers. Regression trackers learn to map an input sample to soft labels, which are usually generated by a Gaussian function. 
Existing regression trackers mostly train deep models for feature extraction and employ sophisticated architectures for online detection. Such systems must optimize a massive number of trainable parameters, introducing the risk of severe overfitting. Moreover, the use of very deep models compromises the speed required by many practical applications. In this work, we present a simple yet effective system, called LiteCNT, which consists of only three convolutional layers for the whole tracking process. A multi-region convolutional operator is introduced for output regression. This operator is simple but powerful, as it enables our tracker to capture more details of the target object. We further derive an efficient and effective operator to approximate multi-region aggregation. Experiments on five benchmark datasets, namely OTB-2013, OTB-2015, UAV-123, LaSOT and VOT-2017, show that the proposed method is comparable with state-of-the-art trackers in accuracy, while having a much smaller model size and running faster. In summary, in this thesis we propose two methods to improve the quality of sampled proposals for sampling-based trackers, and present two methods to address the overfitting problem in online tracking. The proposed methods are effective and efficient, and they have great potential for practical use in systems with limited computational resources.	en_US
dcterms.extent	xvi, 122 pages : color illustrations	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2020	en_US
dcterms.educationalLevel	Ph.D.	en_US
dcterms.educationalLevel	All Doctorate	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.LCSH	Neural networks (Computer science)	en_US
dcterms.LCSH	Computer vision	en_US
dcterms.LCSH	Machine learning	en_US
dcterms.accessRights	open access	en_US
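
Illustrative sketch (not part of the thesis record): the abstract names two standard building blocks, proposal generation by a Gaussian Random Walk around the previous target position, and soft labels generated by a Gaussian function for regression trackers. The Python code below is a minimal sketch of both ideas under assumed conventions. The function names, the (cx, cy, w, h) box layout, and the parameters sigma_xy, sigma_scale and sigma are hypothetical choices made for illustration and are not taken from the thesis.

```python
import math
import random


def sample_grw_proposals(prev_box, num, sigma_xy=0.1, sigma_scale=0.05, seed=None):
    """Draw candidate boxes around the previous target box (cx, cy, w, h).

    Assumption: translation noise is Gaussian with a standard deviation
    proportional to the target size, and scale noise is log-normal.
    """
    rng = random.Random(seed)
    cx, cy, w, h = prev_box
    size = math.sqrt(w * h)  # reference length used to scale the translation noise
    proposals = []
    for _ in range(num):
        dx = rng.gauss(0.0, sigma_xy * size)
        dy = rng.gauss(0.0, sigma_xy * size)
        scale = math.exp(rng.gauss(0.0, sigma_scale))  # always positive
        proposals.append((cx + dx, cy + dy, w * scale, h * scale))
    return proposals


def gaussian_soft_labels(height, width, center, sigma):
    """Build a height x width label map that equals 1.0 at the target centre.

    center is given as (row, column); values decay with squared distance.
    """
    cy, cx = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Five candidates around a 40 x 60 box centred at (100, 80).
    for box in sample_grw_proposals((100.0, 80.0, 40.0, 60.0), num=5, seed=0):
        print(tuple(round(v, 1) for v in box))
    # A 5 x 5 regression target centred on the middle cell.
    labels = gaussian_soft_labels(5, 5, center=(2, 2), sigma=1.0)
    print(labels[2][2])
```

Because every candidate is drawn around the previous estimate only, a sampler of this kind uses no evidence from the current frame, which is the limitation the abstract attributes to existing motion models.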

Files in This Item:

File	Description	Size	Format
991022378658703411.pdf	For All Users	4.27 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10410