Author: Liu, Bo
Title: Advancing safe, multimodal conversational AI : out-of-distribution detection and medical visual question answering approaches
Advisors: Wu, Xiao-ming (DSAI)
Degree: Ph.D.
Year: 2025
Department: Department of Data Science and Artificial Intelligence
Pages: xx, 174 pages : color illustrations
Language: English
Abstract: Conversational AI has seen remarkable progress in recent years, driven by the integration of large language models (LLMs) and multimodal learning. However, ensuring the robustness and usability of conversational systems remains a major challenge, particularly in high-stakes domains such as medicine, where incorrect or misleading responses can have serious consequences. Two crucial capabilities for improving these systems are out-of-distribution (OOD) detection, which recognizes unfamiliar inputs, and domain-specific multimodal understanding, which integrates diverse data types. This thesis aims to advance both areas by (1) enhancing textual OOD detection with LLMs to better secure dialogue systems and (2) developing approaches that improve understanding and reasoning for medical visual question answering (Med-VQA), a key task in medical dialogue.
For textual OOD detection, we first conduct a pioneering empirical study of OOD detection with LLMs, addressing the gap left by existing methods designed for smaller models such as BERT, which may not generalize well to LLMs. We evaluate OOD detectors in both zero-shot and fine-tuning settings and propose a generative fine-tuning approach aligned with the pre-training objective of LLMs. Our results show that the cosine distance-based detector outperforms the alternatives by leveraging the isotropic embedding space of LLMs. Next, we introduce a novel framework for near-OOD detection, where in-distribution (ID) and OOD inputs share semantic similarities, which likewise exploits this isotropic embedding space. The framework derives a semantic prototype for each ID class and performs semantic matching for both OOD detection and ID classification. With high-quality textual representations from LLMs, our method achieves superior performance, especially in few-shot scenarios with limited data.
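To make the prototype-matching idea concrete, here is a minimal illustrative sketch (not the thesis implementation): class prototypes are built by averaging L2-normalized embeddings of labeled ID examples, and an input is scored by its maximum cosine similarity to the prototypes, which serves both OOD detection and ID classification. The embedding source, threshold value, and array shapes are assumptions for illustration only.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so that dot products equal cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prototypes(embeddings, labels, num_classes):
    # embeddings: (n, d) sentence embeddings of labeled ID examples,
    # e.g. produced by an LLM encoder (assumed given here).
    # Each prototype is the mean of the normalized embeddings of one ID class.
    embeddings = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    labels = np.asarray(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)  # (num_classes, d)

def score_and_classify(query_embeddings, prototypes, threshold=0.5):
    # Cosine similarity between each query and every class prototype.
    sims = l2_normalize(np.asarray(query_embeddings, dtype=np.float64)) @ prototypes.T
    max_sim = sims.max(axis=1)
    predicted_class = sims.argmax(axis=1)   # ID classification: nearest prototype
    is_ood = max_sim < threshold            # hypothetical threshold, tuned on held-out data
    return -max_sim, predicted_class, is_ood  # higher score = more likely OOD

In the few-shot setting mentioned above, each prototype would be averaged over only a handful of labeled examples per ID class, which is where high-quality LLM representations matter most.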
For Med-VQA, we first introduce SLAKE, a semantically labeled, knowledge-enhanced dataset with accurate visual and textual annotations and an extendable knowledge base, to address the scarcity of existing datasets. To further mitigate overfitting on small-scale training data, we propose CPRD, a framework that distills a lightweight visual feature extractor enriched with diverse radiological knowledge for Med-VQA. Second, because medical questions and images are more complex than their general-domain counterparts, we propose a conditional reasoning framework consisting of a question-conditioned reasoning component and a type-conditioned reasoning strategy, which adaptively learn reasoning skills for different Med-VQA tasks. Finally, we present GEMeX, a large-scale, groundable, and explainable benchmark for chest X-ray diagnosis. This benchmark addresses key limitations of existing datasets by introducing multi-modal explainability, which enhances answer comprehensibility, and four distinct question types that better reflect clinical needs. Our evaluations of 12 representative large vision-language models and a fine-tuned baseline model demonstrate both the benchmark's difficulty and its effectiveness.
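As a rough, assumption-laden illustration of how question-conditioned and type-conditioned reasoning might be wired together (a toy sketch, not the thesis model), the code below attends over visual region features under the guidance of the question embedding and then routes the fused representation to separate reasoning heads for closed-ended versus open-ended questions; all dimensions, module names, and the two-way type split are hypothetical.

import torch
import torch.nn as nn

class ConditionalReasoner(nn.Module):
    # Illustrative sketch: question-conditioned attention over visual features,
    # followed by type-conditioned routing to separate reasoning heads.
    def __init__(self, vis_dim=1024, q_dim=768, hid_dim=512, num_answers=200):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.v_proj = nn.Linear(vis_dim, hid_dim)
        # Hypothetical heads: one for closed-ended (e.g. yes/no) and one for open-ended questions.
        self.closed_head = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                         nn.Linear(hid_dim, num_answers))
        self.open_head = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                       nn.Linear(hid_dim, num_answers))

    def forward(self, vis_feats, q_feat, is_closed):
        # vis_feats: (b, n, vis_dim) region features; q_feat: (b, q_dim); is_closed: (b,) bool.
        q = self.q_proj(q_feat)                                          # (b, hid_dim)
        v = self.v_proj(vis_feats)                                       # (b, n, hid_dim)
        attn = torch.softmax((v @ q.unsqueeze(-1)).squeeze(-1), dim=1)   # question-conditioned weights
        fused = (attn.unsqueeze(-1) * v).sum(dim=1) * q                  # attended visual features gated by the question
        # Type-conditioned routing: choose a reasoning head per example based on question type.
        closed_logits = self.closed_head(fused)
        open_logits = self.open_head(fused)
        return torch.where(is_closed.unsqueeze(-1), closed_logits, open_logits)

Routing by question type in this way is only one simple realization of a type-conditioned strategy, shown here to clarify the idea rather than to reproduce the framework described in the abstract.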
Rights: All rights reserved
Access: open access

Files in This Item:
File       Description     Size       Format
8610.pdf   For All Users   12.45 MB   Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/14156