Author: | Zheng, Changmeng |
Title: | Learning versatile multimodal representation for knowledge extraction and reasoning |
Advisors: | Li, Qing (COMP) |
Degree: | Ph.D. |
Year: | 2025 |
Subject: | Machine learning ; Graph theory -- Data processing ; Data mining ; Artificial intelligence ; Hong Kong Polytechnic University -- Dissertations
Department: | Department of Computing |
Pages: | xxi, 156 pages : color illustrations |
Language: | English |
Abstract: | Relational facts organize human knowledge of the real world in triplet form, and such structured facts are widely regarded as a basis for conscious, logical intelligence. Although the past three decades have witnessed the rise of text analysis methods for extracting meaningful information from unstructured textual data, these methods often fall short of capturing the full semantic richness and complexity of human language, particularly when it comes to understanding the relationships between entities. Moreover, textual semantics are often incomplete and ambiguous, which can yield inaccurate or misleading facts. In contrast, information from other modalities (e.g., visual content) is far more intuitive and specific. Inspired by the human capacity to perceive and communicate through a multisensory system, this thesis explores the potential of learning versatile multimodal representations for knowledge extraction and reasoning. It addresses four critical challenges in multimodal learning, proposing novel solutions through a series of rigorous investigations:

(1) A Unified Multimodal Graph Learning Framework: To overcome the prevalent issues of modality gaps and spurious alignments in multimodal knowledge extraction, we present a novel multimodal graph learning framework that maps diverse elements from disparate modalities onto a unified graph structure. By capturing fine-grained correlations through semantic and structural graph alignment, the framework improves knowledge extraction accuracy. We also introduce a benchmark dataset designed specifically for this task and empirically validate the efficacy of the proposed framework.

(2) A Hierarchical Multimodal Representation Learning Method: To address the inconsistent semantic levels of individual modality representations, we explore hierarchical multimodal learning that incorporates information at different granularities (e.g., from image-level to object-level visual features and from sentence-level to concept-level textual features). By connecting vision and language through paths in external concept graphs, we bridge the gap between modalities, mirroring the human process of association.

(3) A Robust Data Augmentation and Estimation System: To counter the detrimental impact of misalignment in text-image datasets, we investigate methods for mitigating the bias and distraction caused by such misalignments. Drawing inspiration from machine translation, this work employs back-translation and divergence estimation to identify and reduce the influence of irrelevant or partially aligned information, leading to more robust and reliable knowledge extraction.

(4) An Iteratively Refined Graph Reasoning Application: To demonstrate the generality and versatility of the extracted multimodal knowledge graph, we incorporate multi-agent debate into multimodal reasoning to enable iterative refinement of knowledge representations. The proposed Blueprint Debate on Graphs framework uses a graph-based structure to represent and refine knowledge, encouraging collaboration and competition among agents to reach a deeper understanding of the relationships and interactions within multimodal data.

By addressing the challenges of fine-grained alignment, hierarchical learning, bias mitigation, and iterative refinement, this research advances multimodal learning across several tasks and benchmarks and unlocks new possibilities for understanding and utilizing the rich information embedded in multimodal data. |
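To make the unified-graph idea in contribution (1) more concrete, the sketch below is a minimal, hypothetical illustration, not the thesis's actual implementation: textual entity mentions and detected visual objects are treated as nodes of a single graph, and fine-grained cross-modal edges are added wherever the cosine similarity between node embeddings passes a threshold. All names, dimensions, and the thresholding rule here are illustrative assumptions.

```python
# Minimal sketch (illustrative only): linking text-entity nodes and visual-object
# nodes of a unified multimodal graph by cosine similarity of their embeddings.
import numpy as np

def cosine_similarity_matrix(text_nodes: np.ndarray, visual_nodes: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between text-node and visual-node embeddings."""
    t = text_nodes / np.linalg.norm(text_nodes, axis=1, keepdims=True)
    v = visual_nodes / np.linalg.norm(visual_nodes, axis=1, keepdims=True)
    return t @ v.T

def align_nodes(text_nodes: np.ndarray, visual_nodes: np.ndarray, threshold: float = 0.3):
    """For each text node, keep its best-matching visual node if the score passes
    the threshold; the resulting (text_idx, visual_idx, score) tuples play the role
    of cross-modal edges in the unified graph."""
    sim = cosine_similarity_matrix(text_nodes, visual_nodes)
    edges = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        if sim[i, j] >= threshold:
            edges.append((i, j, float(sim[i, j])))
    return edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(4, 16))    # stand-in for entity-mention embeddings
    visual_emb = rng.normal(size=(6, 16))  # stand-in for detected-object embeddings
    # Threshold 0.0 keeps every best match, so the demo always prints some edges.
    print(align_nodes(text_emb, visual_emb, threshold=0.0))
```

In the thesis's framework such alignments are learned jointly with semantic and structural graph objectives; the fixed cosine threshold above is only a stand-in to show where cross-modal edges enter the unified graph.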
Rights: | All rights reserved |
Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/13661