This thesis develops semantic segmentation methods for scene understanding and autonomous vehicle guidance. Although recent deep learning-based methods have achieved impressive results in semantic segmentation, several challenges remain. This thesis therefore provides a systematic analysis of these challenges and presents several deep learning approaches to address them. Firstly, a cross attention network (CANet) composed of two branches is proposed. Specifically, a shallow branch with a small stride preserves low-level spatial information in a large-sized output, and a deep branch with a large stride extracts high-level contextual features in a small-sized output. A feature cross attention module then combines the output features of the two branches. The CANet outperforms other real-time methods, with improved speed on benchmark datasets when using lightweight backbones, and achieves state-of-the-art performance with deep backbones. Secondly, a lightweight feature pyramid encoding network (FPENet) that offers a good trade-off between accuracy and speed is proposed. In particular, feature pyramid encoding (FPE) blocks, which encode multi-scale contextual features with depthwise dilated convolutions, are adopted as the basic encoder blocks. A mutual embedding upsample (MEU) module is added to the decoder to aggregate high-level semantic features and low-level spatial details efficiently. The FPENet outperforms existing real-time methods with fewer parameters and exhibits improved inference speed on the Cityscapes and CamVid datasets. Next, a one-shot neural architecture search (NAS) algorithm is introduced to design the structures of the FPE blocks automatically. This algorithm searches for optimal architectures among possible FPE block schemes at a cost comparable to that of regular training. The searched network is then combined with MEU modules to form an updated version of the FPENet. Experimental results demonstrate the effectiveness of this one-shot NAS algorithm. Finally, two methods for capturing long-range dependencies and aggregating global context within feature maps are proposed. The first method reduces the computational cost of the spatial attention mechanism by creating a sparse, rather than dense, attention map, and it achieves state-of-the-art results on a variety of datasets. The second method constructs pyramid feature graphs and a probabilistic graph to aggregate global contextual information. Extensive experiments on segmentation datasets show the effectiveness of this method.
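The abstract names depthwise dilated convolutions as the operation that FPE blocks use to encode multi-scale context. As a rough illustration of that operation only, and not of the thesis's actual FPE block, the following pure-Python sketch applies one kernel per channel and spaces the kernel taps by a dilation factor. The function names `depthwise_dilated_conv2d` and `receptive_field`, the zero padding, and the dilation values are assumptions chosen for the example.

```python
from typing import List

Matrix = List[List[float]]


def depthwise_dilated_conv2d(
    feature_map: List[Matrix],  # shape: channels x height x width
    kernels: List[Matrix],      # shape: channels x k x k, one kernel per channel
    dilation: int = 1,
) -> List[Matrix]:
    """Hypothetical helper: per-channel k x k convolution with zero padding.

    "Depthwise" means each channel is filtered by its own kernel and channels
    are never mixed. "Dilated" means neighbouring kernel taps are `dilation`
    pixels apart, which widens the context seen without adding weights.
    """
    if len(feature_map) != len(kernels):
        raise ValueError("depthwise convolution needs one kernel per channel")
    output: List[Matrix] = []
    for channel, kernel in zip(feature_map, kernels):
        height, width = len(channel), len(channel[0])
        k = len(kernel)
        half = k // 2
        result = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                total = 0.0
                for i in range(k):
                    for j in range(k):
                        # Offset of this tap from the centre, scaled by dilation.
                        yy = y + (i - half) * dilation
                        xx = x + (j - half) * dilation
                        # Taps outside the map count as zero (zero padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            total += kernel[i][j] * channel[yy][xx]
                result[y][x] = total
        output.append(result)
    return output


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Span in pixels covered by one dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


if __name__ == "__main__":
    ones = [[1.0] * 5 for _ in range(5)]
    box = [[1.0] * 3 for _ in range(3)]
    for d in (1, 2, 4):  # illustrative dilation rates, not taken from the thesis
        out = depthwise_dilated_conv2d([ones], [box], dilation=d)
        print(d, receptive_field(3, d), out[0][2][2])
```

Running the same 3x3 kernel at several dilation rates and combining the results is one common way to gather context at multiple scales: a 3x3 kernel with dilation 2 spans 5 pixels per axis while still using nine weights per channel.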
| Date of Award | 1 Aug 2022 |
|---|---|
| Original language | English |
| Awarding Institution | The University of Manchester |
| Supervisor | Hujun Yin (Supervisor) & Wuqiang Yang (Supervisor) |
- deep learning
- semantic segmentation
- convolutional neural network
Deep Learning for Semantic Segmentation
Liu, M. (Author). 1 Aug 2022
Student thesis: PhD