Full metadata record
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.contributor.advisor | Wu, Xiao-ming (COMP) | en_US
dc.creator | Xu, Li | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/12184 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | en_US
dc.rights | All rights reserved | en_US
dc.title | Multi-modal pre-training for medical vision-language understanding and generation | en_US
dcterms.abstract | With the availability of large-scale, comprehensive, general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain have so far been scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis of the key factors that may affect the performance of VLP based on a popular vision-language Transformer. From this empirical analysis, we derive several key insights that can guide future medical VLP research. Since current medical VL datasets are either noisy or single-modality, we propose RadioGraphy Captions (RGC), a multi-modality radiographic dataset of 18,434 image-caption pairs collected from the open-access online database MedPix. Our experimental results on RGC demonstrate that a domain-specific dataset with a limited number of high-quality samples is also effective for pre-training. RGC can further serve as a new benchmark for evaluating VL models on report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we achieve new state-of-the-art or competitive results, compared with previous work, on medical VL tasks including medical visual question answering, report generation, and medical image-text retrieval; these results can serve as solid baselines for future work. Additionally, we apply a token pruning method named Learned Token Pruning to the Transformer model to further reduce inference time on downstream tasks. | en_US
dcterms.extent | xii, 76 pages : color illustrations | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2022 | en_US
dcterms.educationalLevel | M.Phil. | en_US
dcterms.educationalLevel | All Master | en_US
dcterms.LCSH | Computer vision | en_US
dcterms.LCSH | Machine learning | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.accessRights | open access | en_US
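
The abstract above mentions applying Learned Token Pruning (LTP) to a Transformer to reduce inference time. The following is a minimal PyTorch sketch of the idea behind LTP: each token's importance is the mean attention probability it receives, and tokens below a learned threshold are dropped. The class name, default threshold, and temperature value here are illustrative assumptions, not the thesis implementation.

    import torch
    import torch.nn as nn

    class LearnedTokenPruning(nn.Module):
        # Sketch of Learned Token Pruning (Kim et al., 2022).
        # Token importance = mean attention probability the token receives,
        # averaged over heads and query positions. During training, a sigmoid
        # keeps the learned threshold differentiable; at inference, tokens
        # below the threshold are hard-masked.
        # init_threshold and temperature are illustrative assumptions.

        def __init__(self, init_threshold: float = 0.01, temperature: float = 5e-4):
            super().__init__()
            self.threshold = nn.Parameter(torch.tensor(init_threshold))
            self.temperature = temperature

        def forward(self, hidden: torch.Tensor, attn_probs: torch.Tensor) -> torch.Tensor:
            # hidden:     (batch, seq_len, dim) token representations
            # attn_probs: (batch, heads, seq_len, seq_len) attention probabilities
            score = attn_probs.mean(dim=(1, 2))  # (batch, seq_len) importance
            if self.training:
                mask = torch.sigmoid((score - self.threshold) / self.temperature)
            else:
                mask = (score > self.threshold).float()
            return hidden * mask.unsqueeze(-1)

    # Usage on random tensors, shapes chosen for illustration:
    layer = LearnedTokenPruning()
    layer.eval()
    hidden = torch.randn(2, 16, 768)
    attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
    pruned = layer(hidden, attn)

In the published LTP method, tokens below the threshold are removed from the sequence entirely at inference, which is what shortens computation in later layers; the zero-masking above only illustrates the selection rule.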

Files in This Item:
File | Description | Size | Format
6632.pdf | For All Users | 3.63 MB | Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/12184