Full metadata record
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.contributor.advisor | Li, Qing (COMP) | en_US
dc.creator | Zheng, Changmeng | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/13661 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | en_US
dc.rights | All rights reserved | en_US
dc.title | Learning versatile multimodal representation for knowledge extraction and reasoning | en_US
dcterms.abstract | Relational facts organize human knowledge of the real world in a triplet format. These structural facts are regarded as a basis for implementing conscious and logical intelligence. Although the past three decades have witnessed the rise of text analysis methods for extracting meaningful information from unstructured textual data, these methods often fall short of capturing the full semantic richness and complexity of human language, particularly when it comes to understanding the relationships between entities. Moreover, textual semantics are sometimes incomplete and ambiguous, which can lead to inaccurate or severely misleading facts. By contrast, information from other modalities (e.g., visual content) is far more intuitive and specific. Inspired by the human capacity to perceive and communicate through a multisensory system, this thesis explores the potential of learning versatile multimodal representations for knowledge extraction and reasoning. The thesis delves into four critical challenges within multimodal learning, proposing novel solutions through a series of rigorous investigations: | en_US
dcterms.abstract | (1) A Unified Multimodal Graph Learning Framework: To overcome the prevalent issues of modality gaps and spurious alignments in multimodal knowledge extraction, we present a novel multimodal graph learning framework. This framework enables a comprehensive mapping of diverse elements from disparate modalities onto a unified graph structure. By emphasizing the capture of fine-grained correlations through semantic and structural graph alignment, we achieve improved knowledge extraction accuracy. Additionally, we introduce a benchmark dataset specifically designed for this task, empirically validating the efficacy of our proposed framework. | en_US
dcterms.abstract | (2) A Hierarchical Multimodal Representation Learning Method: To address the limitation posed by inconsistent semantic levels across modality-specific representations, we further explore hierarchical multimodal learning by incorporating information at different granularities (e.g., from image-level to object-level visual features, and from sentence-level to concept-level textual features). By connecting vision and language through paths within external concept graphs, we bridge the gap between modalities, mirroring the human association process. | en_US
dcterms.abstract | (3) A Robust Data Augmentation and Estimation System: To address the detrimental impact of misalignment issues in text-image datasets, we investigate methods for mitigating the bias and distractions caused by such misalignments. Drawing inspiration from machine translation techniques, this work employs back-translation and divergence estimation to identify and reduce the influence of irrelevant or partially aligned information, leading to more robust and reliable knowledge extraction. | en_US
dcterms.abstract | (4) An Iteratively Refined Graph Reasoning Application: To demonstrate the generality and versatility of the extracted multimodal knowledge graph, we incorporate multi-agent debate into multimodal reasoning to facilitate iterative refinement of knowledge representations. The proposed Blueprint Debate on Graphs framework utilizes a graph-based structure for representing and refining knowledge, encouraging collaboration and competition among agents to achieve a deeper understanding of the relationships and interactions within multimodal data. | en_US
dcterms.abstract | By addressing the challenges of fine-grained alignment, hierarchical learning, bias mitigation, and iterative refinement, this research contributes to the advancement of multimodal learning across several tasks and benchmarks, and unlocks new possibilities for understanding and utilizing the rich information embedded within multimodal data. | en_US
dcterms.extent | xxi, 156 pages : color illustrations | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2025 | en_US
dcterms.educationalLevel | Ph.D. | en_US
dcterms.educationalLevel | All Doctorate | en_US
dcterms.LCSH | Machine learning | en_US
dcterms.LCSH | Graph theory -- Data processing | en_US
dcterms.LCSH | Data mining | en_US
dcterms.LCSH | Artificial intelligence | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.accessRights | open access | en_US

Files in This Item:
File | Description | Size | Format
8103.pdf | For All Users | 9.62 MB | Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13661