Author: | Kwong, Ngai Wing |
Title: | Video quality assessment (VQA) based on machine learning |
Advisors: | Chan, Yui-lam (EEE) |
Degree: | Ph.D. |
Year: | 2024 |
Subject: | Image processing -- Digital techniques ; Imaging systems -- Image quality ; Digital video ; Webcasting ; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Electrical and Electronic Engineering |
Pages: | xvi, 163 pages : color illustrations |
Language: | English |
Abstract: | Nowadays, streaming platforms such as Netflix, Disney+, and YouTube have changed the way we watch TV shows and movies, leading to a significant surge in internet traffic. This expansion is further propelled by the emergence of over-the-top live gaming content video (GCV) services, including Twitch.tv and YouTube Gaming. Additionally, the COVID-19 pandemic has put a spotlight on screen content video (SCV), which has attracted markedly more attention owing to the widespread shift to remote work, online conferencing, shared-screen collaboration, and online education, elevating SCV from a niche format to a mainstream medium. As a result, video sharing on social networks and streaming platforms has grown remarkably, fostering the development of various video-related applications. Against this backdrop, there is an escalating need for video quality assessment (VQA) methods that accurately evaluate human-perceived video quality, ensuring both the Quality of Service (QoS) and Quality of Experience (QoE). This thesis therefore proposes several novel deep learning-based methodologies to enhance the effectiveness of VQA and uphold service quality.
In the realm of VQA, some existing methods suffer from a domain-gap issue, resulting in sub-optimal feature representations and reduced accuracy. To address this, this thesis introduces an advanced VQA approach that combines a multi-channel Convolutional Neural Network (CNN) with a Gated Recurrent Unit (GRU) for optimal feature learning. Initially, drawing inspiration from self-supervised learning (SSL), the multi-channel CNN is pre-trained in the image quality assessment (IQA) domain without relying on human-annotated labels. Semi-supervised learning is then applied to fine-tune the pre-trained model, transferring it from the IQA domain to VQA; this phase also integrates motion-aware information to enhance the frame-level quality feature representation. Subsequently, both human visual perception (HVP) features and the frame-level representations are fed into the GRU to derive an accurate prediction of video quality. The experimental results underscore the effectiveness and reliability of the model, confirming its alignment with human perceptual judgments.
Recognizing the significance of spatiotemporal features in GCV, which is often characterized by lower spatial complexity, highly smooth motion, and spatial and temporal features shared across frames, we then propose a specialized deep learning model tailored to GCV quality assessment, with a focus on GCV spatiotemporal feature learning. The approach begins with a multi-task SSL Spatiotemporal Pyramid CNN (STP-CNN), designed to extract multiscale spatiotemporal features across various time scales and frames arranged in a pyramid structure, dynamically capturing a wide array of spatiotemporal cues. Building upon this, we introduce a Differential Transformer (Diff-Transformer) model, with SSL pre-training and fine-tuning strategies, that processes all short-term spatiotemporal features within a GCV and extracts global spatiotemporal features to assess its overall quality. The experimental evaluations validate the superiority of this method over existing gaming content video quality assessment (GCVQA) models, accurately predicting human-perceived quality.
Considering the limitations of existing screen content video quality assessment (SCVQA) approaches that rely on handcrafted features, which may not capture all pertinent distortions and may overlook critical features, we introduce the first deep learning-based model specifically crafted for SCVQA to overcome these constraints. First, the model uses a multi-channel CNN to independently extract spatial quality features from the pictorial and textual regions within each screen content frame (SCF) and fuses them into a comprehensive spatial quality feature representation of the SCF. Subsequently, we propose a Time-distributed CNN-Transformer (TCNNT) model to further process all spatial quality feature representations, learning spatial and temporal features in tandem and thereby extracting the high-level spatiotemporal features crucial for assessing the overall quality of SCV. The experimental results confirm the robustness and effectiveness of the SCVQA model, demonstrating its capability to evaluate SCV quality accurately. In addition, we propose another novel SCVQA model specifically tailored to the spatiotemporal features of SCVs. This approach introduces a Dual-Channel Spatiotemporal CNN (DCST-CNN) module to extract both content-aware and edge-aware spatiotemporal quality features, enabling robust and effective representation learning. Building upon the DCST-CNN, we further propose a Temporal Pyramid Transformer (TPT) module that processes spatiotemporal quality features across multiple temporal scales, exploring short-term and long-term temporal dependencies within an SCV and facilitating hierarchical learning to enhance the precision of SCVQA predictions. Experimental results underscore the strength and validity of this model, demonstrating its practical applicability in real-world settings. |
Rights: | All rights reserved |
Access: | open access |
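The abstract gives only a high-level view of the CNN-plus-GRU pipeline proposed for general VQA. Purely as an illustration of that architectural idea, a minimal PyTorch sketch follows; the class names (FrameFeatureCNN, VQARegressor), layer sizes, and input shapes are hypothetical and are not taken from the thesis. It shows per-frame features from a small CNN being aggregated over time by a GRU that regresses a single quality score.

    # Minimal sketch of a frame-level CNN feeding a GRU for video quality
    # regression. Hypothetical layer sizes; not the thesis implementation.
    import torch
    import torch.nn as nn

    class FrameFeatureCNN(nn.Module):
        """Toy stand-in for a per-frame quality feature extractor."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, feat_dim)

        def forward(self, x):                  # x: (N, 3, H, W)
            h = self.backbone(x).flatten(1)    # (N, 64)
            return self.fc(h)                  # (N, feat_dim)

    class VQARegressor(nn.Module):
        """GRU aggregates per-frame features into one quality score per clip."""
        def __init__(self, feat_dim=128, hidden=64):
            super().__init__()
            self.cnn = FrameFeatureCNN(feat_dim)
            self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, video):              # video: (B, T, 3, H, W)
            b, t = video.shape[:2]
            feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
            _, last = self.gru(feats)           # last hidden state: (1, B, hidden)
            return self.head(last.squeeze(0)).squeeze(-1)   # (B,)

    if __name__ == "__main__":
        clip = torch.randn(2, 16, 3, 112, 112)  # two clips of 16 frames each
        print(VQARegressor()(clip).shape)       # torch.Size([2])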
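For the gaming and screen content models, the abstract describes per-frame spatial feature extraction followed by a Transformer that models temporal dependencies. The sketch below is again an assumption-laden toy rather than the thesis code (the class name TimeDistributedCNNTransformer and all layer sizes are invented for illustration); it only demonstrates the general time-distributed CNN plus Transformer-encoder pattern in PyTorch.

    # Minimal sketch of a time-distributed CNN followed by a Transformer
    # encoder over frames. Hypothetical sizes; not the thesis implementation.
    import torch
    import torch.nn as nn

    class TimeDistributedCNNTransformer(nn.Module):
        def __init__(self, feat_dim=128, nhead=4, num_layers=2):
            super().__init__()
            self.spatial = nn.Sequential(        # per-frame spatial extractor
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead,
                                               batch_first=True)
            self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.head = nn.Linear(feat_dim, 1)

        def forward(self, video):                # video: (B, T, 3, H, W)
            b, t = video.shape[:2]
            feats = self.spatial(video.flatten(0, 1)).view(b, t, -1)
            feats = self.temporal(feats)         # temporal dependencies across frames
            return self.head(feats.mean(dim=1)).squeeze(-1)   # (B,)

    if __name__ == "__main__":
        clip = torch.randn(2, 16, 3, 112, 112)
        print(TimeDistributedCNNTransformer()(clip).shape)    # torch.Size([2])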
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13360