Author: Feng, Zhen
Title: A study on semantic scene understanding with multi-modal fusion for autonomous driving
Advisors: Sun, Yuxiang (ME)
Degree: Ph.D.
Year: 2024
Subject: Image segmentation
Image analysis
Image processing -- Digital techniques
Automated vehicles
Hong Kong Polytechnic University -- Dissertations
Department: Department of Mechanical Engineering
Pages: xx, 114 pages : color illustrations
Language: English
Abstract: Traffic scene understanding is the basis for the safe driving of autonomous vehicles. Semantic segmentation assigns a class label to each pixel in an image, which makes it one of the key methods for traffic scene understanding. Due to the complexity and variability of traffic scenes, single-modal data often cannot perform reliably under all conditions. Semantic segmentation algorithms with multi-modal fusion can address the problem that single-modal data degrades under environmental noise. Traffic scene understanding based on multi-modal fusion has therefore received increasing attention, for example the fusion of Red-Green-Blue (RGB) images with thermal images and the fusion of RGB images with depth images. The aim of this study is to investigate the segmentation of negative obstacles in traffic scenes and the segmentation of all-day traffic scenes by fusing multi-modal data.
Although current multi-modal fusion networks for negative obstacle segmentation have achieved acceptable results, their encoders use only one structure to extract one kind of feature, such as local features. Due to the limitation of the receptive field, the local features extracted by a convolutional network cannot fully represent the global information in an image, while the global features extracted by a self-attention module cannot capture local details as well as convolutional features can. To address this issue, we propose the Multi-modal Attention Fusion Network (MAFNet) for the segmentation of road potholes with the fusion of RGB images and disparity images. Specifically, we combine a convolutional network and a Transformer network as the encoder to extract features from images. In addition, we design attention-based fusion modules to fuse the features of RGB images and disparity images. Experiments show that our proposed MAFNet achieves better results than existing state-of-the-art networks.
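The abstract does not spell out the fusion module's internals. As a minimal sketch of the general idea of attention-based fusion of two encoder streams (assuming PyTorch; the class name AttentionFusion, the channel-attention gating scheme, and all hyperparameters below are illustrative assumptions, not the thesis's actual MAFNet design):

    # Minimal sketch: channel-attention gate that reweights and merges two modalities,
    # assuming both encoder branches output feature maps of the same shape.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                      # global context per channel
                nn.Conv2d(2 * channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),                                 # per-channel weight in [0, 1]
            )
            self.proj = nn.Conv2d(2 * channels, channels, 1)  # merge the two streams

        def forward(self, rgb_feat: torch.Tensor, disp_feat: torch.Tensor) -> torch.Tensor:
            stacked = torch.cat([rgb_feat, disp_feat], dim=1)
            w = self.gate(stacked)                            # attention weights
            fused = w * rgb_feat + (1.0 - w) * disp_feat      # weighted blend of modalities
            return fused + self.proj(stacked)                 # residual merge

    # Example: fuse 64-channel feature maps from the two encoder branches.
    fusion = AttentionFusion(channels=64)
    out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])

A per-channel gate of this kind lets the network lean on the disparity stream where RGB features are unreliable, which is one common way such fusion modules are built.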
Large-scale datasets are necessary for training high-quality networks. To address the scarcity of datasets for negative obstacle segmentation with multi-modal fusion, we build and release a dataset for the segmentation of negative obstacles with RGB images and depth images. To reduce the workload of manual labeling, we manually labelled 745 images and generated coarse labels for the remaining 3000 images using the existing dataset and the labelled images. Currently, multi-modal fusion networks suffer from slow inference when dealing with large input data. To address this issue, we propose the Channel and Position-wise Knowledge Distillation (CPKD) framework. Specifically, we replace the heavyweight encoder of the teacher network with a lightweight network and introduce a downsampling layer at the beginning of the student network to reduce the amount of data. We design Channel and Position-wise Distillation (CPD) modules to transfer knowledge from the teacher network to the student network. The experimental results show that our proposed CPKD framework greatly improves the inference speed of the network while enabling the student network to achieve satisfactory performance.
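The name "Channel and Position-wise Distillation" suggests losses that match the teacher's feature distributions along the channel axis and along the spatial axis. A hedged sketch of such losses (assuming PyTorch and spatially aligned teacher/student feature maps; the exact formulation in the thesis may differ):

    # Sketch of channel-wise and position-wise distillation losses.
    import torch
    import torch.nn.functional as F

    def channel_distill(student, teacher, tau: float = 1.0):
        # KL divergence between the spatial distributions of each channel.
        b, c, h, w = student.shape
        s = F.log_softmax(student.view(b, c, h * w) / tau, dim=2)
        t = F.softmax(teacher.view(b, c, h * w) / tau, dim=2)
        return F.kl_div(s, t, reduction="batchmean") * tau ** 2

    def position_distill(student, teacher, tau: float = 1.0):
        # KL divergence between the channel distributions at each spatial position.
        s = F.log_softmax(student / tau, dim=1)
        t = F.softmax(teacher / tau, dim=1)
        return F.kl_div(s, t, reduction="batchmean") * tau ** 2

    # Combined distillation term, added to the usual segmentation loss.
    s_feat, t_feat = torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40)
    loss_kd = channel_distill(s_feat, t_feat) + position_distill(s_feat, t_feat)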
To address the blurred edges of thermal images and the issue that the performance of RGB-thermal fusion networks is easily affected by changes in the alignment between the two modalities, we propose the Cross-modal Edge-privileged Knowledge Distillation (CEKD) framework for segmentation. This framework transfers the edge detection capability from the multi-modal teacher network to the thermal-image student network by knowledge distillation. The main aim of the CEKD framework is to improve the segmentation accuracy of the student network. We introduce an edge detection module into the teacher network and use edge labels as privileged information to train the teacher network. We also design a Thermal Enhancement (TE) module for the student network to improve the contrast between high-temperature objects and the low-temperature background. The experimental results show that the thermal-only student network trained by our CEKD framework is able to learn edge detection capability from the teacher network, and that our student network achieves better performance than the single-modal network for the segmentation of traffic scenes with only thermal images.
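As a rough illustration of what a contrast-enhancing module for thermal input could look like (assuming PyTorch; the learnable gain/pivot parameterization below is purely hypothetical and not the thesis's TE module):

    # Sketch of a thermal-enhancement idea: a learnable contrast curve that
    # stretches the gap between hot objects and the cooler background.
    import torch
    import torch.nn as nn

    class ThermalEnhance(nn.Module):
        def __init__(self):
            super().__init__()
            self.gain = nn.Parameter(torch.tensor(8.0))    # slope of the contrast curve
            self.pivot = nn.Parameter(torch.tensor(0.5))   # temperature pivot point

        def forward(self, thermal: torch.Tensor) -> torch.Tensor:
            # Normalize each image to [0, 1], then push values away from the pivot.
            t_min = thermal.amin(dim=(2, 3), keepdim=True)
            t_max = thermal.amax(dim=(2, 3), keepdim=True)
            norm = (thermal - t_min) / (t_max - t_min + 1e-6)
            return torch.sigmoid(self.gain * (norm - self.pivot))

    enhanced = ThermalEnhance()(torch.rand(1, 1, 480, 640))  # single-channel thermal input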
Rights: All rights reserved
Access: open access

Files in This Item:
File       Description     Size      Format
7284.pdf   For All Users   9.26 MB   Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/12834