Author: Fan, Yunfeng
Title: Comprehensive multimodal knowledge exploitation
Advisors: Xu, Wenchao (COMP)
Guo, Song (COMP)
Xiao, Bin (COMP)
Degree: Ph.D.
Year: 2025
Department: Department of Computing
Pages: xx, 112 pages : color illustrations
Language: English
Abstract: Multimodal learning (MML) seeks to leverage the characteristics of multiple modalities simultaneously so that each modality compensates for the limitations of the others. In contrast to uni-modal learning, MML can provide a clearer and more accurate perception of the target by removing redundancy and supplementing complementary information. MML also enhances robustness by reducing reliance on any single modality: models trained on multiple modalities are less susceptible to noise or errors in one of them, yielding more reliable performance in real-world scenarios. However, the intrinsic heterogeneity between modalities makes it difficult to exploit multimodal information comprehensively. Despite great strides, MML still faces three challenges that limit the exploitation of multimodal knowledge: 1) modality competition, 2) domain shift, and 3) distributed scenarios. In this thesis, we explore effective ways to address these challenges and design novel solutions that improve the learning efficiency of MML.
First, the joint training framework commonly used in MML inevitably falls into the notorious modality competition, leaving each modality under-explored: modalities may interfere with each other, hindering the learning process, especially for weak modalities. Therefore, in Chapter 3, we introduce DI-MML, a novel detached MML framework designed to learn complementary information across modalities while avoiding modality competition. DI-MML sidesteps competition by training each modality's encoder separately with isolated learning objectives. It further encourages cross-modal interaction via a shared classifier that defines a common feature space, and via a dimension-decoupled unidirectional contrastive (DUC) loss that facilitates modality-level knowledge transfer.
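The detached-training idea can be illustrated with a minimal, hypothetical sketch: two scalar linear "encoders" are trained in isolation against a shared readout weight, so neither modality's gradient can interfere with the other's. All names, data, and the fixed shared classifier are illustrative assumptions; the thesis's actual DI-MML framework (including the DUC loss, omitted here) is more elaborate.

```python
import random

random.seed(0)

# Toy data: each sample has two modality views (x_a, x_b) and a scalar
# target y that depends only on modality A, so B carries no signal.
data = [(xa, xb, 2.0 * xa) for xa, xb in
        [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(64)]]

# Shared linear "classifier" weight defining a common output space
# (held fixed here for simplicity; DI-MML learns it as well).
c = 1.0

def train_encoder(idx, steps=200, lr=0.1):
    """Train one modality's linear encoder with an isolated objective:
    no gradients from the other modality ever touch this weight."""
    w = 0.0
    for _ in range(steps):
        for sample in data:
            x, y = sample[idx], sample[2]
            pred = c * w * x
            grad = 2 * (pred - y) * c * x   # d/dw of squared error
            w -= lr * grad
    return w

w_a = train_encoder(0)  # modality A, trained without interference from B
w_b = train_encoder(1)  # modality B, trained separately

def predict(xa, xb):
    """At inference, fuse the two modality logits in the shared space."""
    return 0.5 * (c * w_a * xa + c * w_b * xb)
```

Because each encoder sees only its own objective, the strong modality converges cleanly (here `w_a` approaches 2) and the weak modality cannot drag it away from its optimum, which is the competition-avoidance property the framework targets.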
Second, we take domain shift into consideration and study a more challenging task, multimodal domain generalization (MMDG), in which models trained on multimodal source domains must generalize to unseen target distributions with the same modality set. Diverse modalities in real-world applications introduce more complex domain shifts, as the degree of shift varies across modalities, significantly increasing the difficulty of the MMDG problem. Moreover, previous domain generalization methods are designed for the unimodal setting and do not transfer well to MMDG, since the distinct properties of different modalities lead to sub-optimal solutions. To bridge this gap, in Chapter 4, we propose to construct consistent flat loss regions and to enhance knowledge exploitation for each modality via cross-modal knowledge transfer. Innovatively, we optimize over representation-space loss landscapes instead of the traditional parameter space, which allows us to build connections between modalities directly. We then introduce a novel method that flattens the high-loss region between minima from different modalities by interpolating mixed multimodal representations.
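The representation-space flattening idea can be sketched as follows: sample points on the line segment between the two modalities' representations and penalize the average loss along that segment, so that minimizing the penalty flattens the region between the per-modality minima. This is a minimal illustration under assumed names (`mix`, `flatness_loss`, the toy `head`), not the thesis's actual objective.

```python
import random

random.seed(1)

def mix(r_a, r_b, lam):
    """Linearly interpolate two modality representations."""
    return [lam * a + (1 - lam) * b for a, b in zip(r_a, r_b)]

def flatness_loss(r_a, r_b, y, head, n_points=5):
    """Average the task loss at random points on the segment between the
    two modality representations; driving this down flattens the high-loss
    region between modality-specific minima in representation space."""
    total = 0.0
    for _ in range(n_points):
        lam = random.uniform(0.0, 1.0)
        total += head(mix(r_a, r_b, lam), y)
    return total / n_points

# Toy classification head: squared error of a fixed linear readout
# (weights are illustrative placeholders).
w = [1.0, -1.0]
def head(r, y):
    pred = sum(wi * ri for wi, ri in zip(w, r))
    return (pred - y) ** 2
```

When the two representations already agree (`r_a == r_b`), every interpolated point coincides with them and the penalty reduces to the ordinary loss; the penalty only bites where the segment crosses a high-loss region, which is exactly the case the method targets.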
Third, we consider a more complex MML scenario, multimodal federated learning (MFL), in which multiple types of data are distributed across numerous local devices. Here, the diverse distribution heterogeneity of different modalities further increases the difficulty of exploiting multimodal knowledge effectively. Existing federated learning (FL) frameworks, however, perform client selection without accounting for modality differences across clients or for modality bias. Thus, in Chapter 5, we propose a novel Balanced Modality Selection framework for MFL (BMSFED) to overcome this bias. On the one hand, we incorporate a modal enhancement loss into local training to mitigate local imbalance by leveraging aggregated global prototypes. On the other hand, we design a modality selection strategy that identifies diverse subsets of local modalities, thereby ensuring global modality balance.
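The modality-balanced selection idea can be sketched with a simple greedy heuristic: at each round, pick the client whose modality set most evens out the per-modality counts among those already selected. The client pool, the greedy spread criterion, and all names here are illustrative assumptions, not the thesis's actual BMSFED strategy (and the prototype-based enhancement loss is omitted).

```python
from collections import Counter

# Hypothetical client pool: each client holds a subset of modalities.
clients = {
    0: {"audio"}, 1: {"video"}, 2: {"audio", "video"},
    3: {"audio"}, 4: {"video", "text"}, 5: {"text"},
}

def select_balanced(clients, k):
    """Greedily select k clients, each time choosing the one that minimizes
    the spread (max - min) of per-modality counts among selected clients."""
    selected, counts = [], Counter()
    modalities = set().union(*clients.values())
    for _ in range(k):
        best, best_spread = None, None
        for cid, mods in clients.items():
            if cid in selected:
                continue
            c = counts + Counter({m: 1 for m in mods})
            spread = max(c[m] for m in modalities) - min(c[m] for m in modalities)
            if best_spread is None or spread < best_spread:
                best, best_spread = cid, spread
        selected.append(best)
        counts += Counter({m: 1 for m in clients[best]})
    return selected, counts

selected, counts = select_balanced(clients, 3)
```

With the pool above, three rounds of this heuristic cover all three modalities with near-equal counts, whereas modality-agnostic selection could easily pick three audio-only clients and starve the other modalities.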
In summary, this thesis aims to maximize the extraction of knowledge from multiple modalities, achieving both efficiency and robustness across a range of complex scenarios. Extensive analysis and experimental evaluations show that our methods outperform existing solutions.
Rights: All rights reserved
Access: open access

Files in This Item:
8605.pdf (For All Users), 1.89 MB, Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/14151