| Author: | Meng, Shiyu |
| Title: | A study on semantic understanding for autonomous driving |
| Advisors: | Chau, Lap-pui (EEE) |
| Degree: | Ph.D. |
| Year: | 2025 |
| Department: | Department of Electrical and Electronic Engineering |
| Pages: | xix, 119 pages : color illustrations |
| Language: | English |
| Abstract: | Reliable and explainable autonomous driving systems must simultaneously achieve accurate perception, robust localization, and trustworthy decision-making in complex and dynamic environments. These core capabilities are essential not only for ensuring driving safety and efficiency, but also for enhancing user trust and system transparency. Semantic understanding serves as the bridge between perception and cognition, empowering autonomous systems to infer dynamic entities, contextual dependencies, and holistic scene semantics beyond mere sensory interpretation. This thesis presents a series of progressive contributions that advance semantic understanding in autonomous driving through innovations in BEV perception, multi-modal fusion, interpretable decision-making, and cross-modality place recognition. We begin by addressing the need for dense BEV moving-obstacle segmentation using cost-effective visual sensors. Among the various BEV perception tasks, moving-obstacle segmentation is particularly important because it provides information needed by downstream tasks such as motion planning and decision-making. Existing LiDAR-based methods often suffer from sparse measurements and high hardware cost. To this end, we propose a semantics-assisted segmentation framework that utilizes multi-camera visual inputs and temporal semantic cues to generate dense BEV maps of moving obstacles, enabling vision-based dynamic perception without relying on 3-D LiDAR information. To further improve segmentation performance in challenging scenarios, we extend this effort with a BEV multi-modal moving-obstacle segmentation framework. Recognizing the complementary strengths of LiDAR and image-based depth estimation, we introduce DPMoSeg, a novel architecture that integrates sparse 3-D point clouds to generate dense depth information through a sparse-dense attention mechanism. As a result, DPMoSeg produces more accurate and complete BEV segmentation results. Our hybrid design bridges the gap between low-cost visual sensors and high-fidelity geometric cues. Further, we explore interpretable decision-making to improve the transparency of autonomous driving behavior. While many learning-based solutions predict vehicle behavior accurately, they often lack human-oriented explanations, reducing user trust and hindering widespread adoption. To resolve this, we propose a unified framework that couples vehicle behavior prediction with natural language-based interpretation. This is achieved via a self-supervised, class-agnostic object segmentor and semantic-aware fusion, enabling decision outputs that are both effective and explainable, without requiring extra annotations. The final part of the thesis addresses the challenge of cross-modal place recognition, which is vital for localization in GPS-denied or degraded conditions. We propose a unified framework that matches real-time RGB images with pre-built LiDAR maps by transforming point clouds into range-view images. A Transformer-Mamba Mixer module is designed to model both intra-modal and inter-modal dependencies. Furthermore, a semantic-promoted descriptor enhancer is introduced to embed high-level scene context. The framework is trained under a contrastive learning paradigm to optimize cross-modal similarity learning. Experimental results on multiple benchmarks demonstrate its competitive performance against state-of-the-art methods. In summary, this thesis presents a set of novel and practical frameworks that address key challenges in perception, decision-making, and localization for autonomous driving. By leveraging visual information, semantic understanding, and multi-modal integration, our methods contribute to the development of more cost-effective and robust autonomous systems. |
| Rights: | All rights reserved |
| Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/14325

