Author: | Ma, Jianqi |
Title: | Exploring the effective utilization of text priors for scene text image super-resolution |
Advisors: | Zhang, Lei (COMP) |
Degree: | Ph.D. |
Year: | 2024 |
Department: | Department of Computing |
Pages: | xxii, 148 pages : color illustrations |
Language: | English |
Abstract: | The scene text image super-resolution (STISR) task aims to enhance the resolution of text images and recover visually pleasing, legible text, benefiting downstream higher-level text processing. In this thesis, we present four studies that advance STISR research in four aspects: recognition information injection, deformation-robust prior-based learning, a new benchmark dataset and evaluation protocol, and diffusion priors. The first two studies discuss model designs that activate the text recognition prior for STISR and how to enhance that prior for higher-quality scene text recovery. The third study collects a novel bilingual dataset and proposes a new evaluation setting and an edge-aware learning baseline for the STISR task, moving STISR toward a more challenging and practical inference scenario. Based on the evaluation settings of the third study, the fourth study applies the latent diffusion model (LDM) to further boost the performance of dense text image recovery. In Chapter 1, we introduce the background and history of scene text image super-resolution and text image recovery, including the dataset benchmarks and state-of-the-art approaches, and then outline the organization of the thesis. In Chapter 2, we present an innovative CNN-based STISR pipeline called TPGSR. This pipeline incorporates the guidance of the recognition text prior and introduces multiple solutions for enhancing the text prior, i.e., distilling information from the high-resolution (HR) text prior and multi-stage refinement. Building upon this, in Chapter 3, we delve into the design of TATT, a transformer-based approach that addresses the challenge of deformation-robust prior guidance activation. In addition, a text structure consistency (TSC) loss is proposed to ensure consistent text recovery between normal text and deformed text. 
Chapter 4 showcases our efforts in collecting a novel Chinese-English STISR dataset known as RealCE. This dataset allows for the exploration of more complex structured text recovery, with well-annotated transcripts and localization boxes of text instances in the global images. Additionally, we propose a practical and challenging image recovery evaluation protocol to alleviate the artifacts that text line recovery may introduce, and an edge-aware recovery pipeline that uses the edge map as prior information and is specifically tailored for global dense text images. In Chapter 5, we introduce a model architecture that applies a latent diffusion model to further enhance STISR performance on global dense text image inputs. To better activate the diffusion prior, we unfreeze the denoising UNet so that it can sufficiently learn the complex text structure, and we propose a signal-to-noise ratio (SNR) loss reweighting strategy to stabilize the diffusion model training. An inference strategy is also applied to size-variant inputs to achieve more stable text recovery results. This approach aims to push the boundaries of text image super-resolution by leveraging the power of latent diffusion modeling. In Chapter 6, we conclude the work done in the thesis and discuss future research directions based on it. In summary, the works in this thesis contribute to the development of the STISR research field by solving significant problems, raising specific challenges, proposing novel model designs that boost STISR performance on normal and deformed text, and benchmarking dense global text image recovery with a new dataset and evaluation protocol. |
Rights: | All rights reserved |
Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13248