Deep Learning for Semantic Segmentation

  • Mengyu Liu

Student thesis: PhD


This thesis develops critical semantic segmentation methods for understanding scenes and guiding autonomous vehicles. Although recent deep learning-based methods have achieved impressive results in semantic segmentation, several challenges remain. This thesis therefore provides a systematic analysis of these challenges and presents several deep learning approaches to address them. Firstly, a cross attention network (CANet) composed of two branches is proposed. A shallow branch with a small stride preserves low-level spatial information in a large-sized output, and a deep branch with a large stride extracts high-level contextual features in a small-sized output. A feature cross attention module then combines the output features of the two branches. With lightweight backbones, CANet outperforms other real-time methods and runs faster on benchmark datasets; with deep backbones, it achieves state-of-the-art performance. Secondly, a lightweight feature pyramid encoding network (FPENet) that offers a good trade-off between accuracy and speed is proposed. Its basic encoder blocks are feature pyramid encoding (FPE) blocks, which encode multi-scale contextual features with depthwise dilated convolutions. A mutual embedding upsample (MEU) module is added to the decoder to aggregate high-level semantic features and low-level spatial details efficiently. FPENet outperforms existing real-time methods with fewer parameters and exhibits improved inference speed on the Cityscapes and CamVid datasets. Next, a one-shot neural architecture search (NAS) algorithm is introduced to design the structures of the FPE blocks automatically. This algorithm searches for optimal architectures among possible FPE block schemes at a cost comparable to that of regular training. The searched network is then combined with MEU modules to compose an updated version of FPENet.
Experimental results demonstrate the effectiveness of this one-shot NAS algorithm. Finally, two methods for capturing long-range dependencies and aggregating global context within feature maps are proposed. The first method reduces the computational cost of the spatial attention mechanism by creating a sparse, as opposed to dense, attention map, and it achieves state-of-the-art results on a variety of datasets. The second method constructs pyramid feature graphs and a probabilistic graph to aggregate global contextual information. Extensive experiments on segmentation datasets show the effectiveness of this method.
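The abstract names depthwise dilated convolutions as the building block of the FPE blocks but gives no further detail. The following is a minimal pure-Python sketch of that general operation, not the thesis's implementation: the function names, the 3x3 kernel size, the zero padding, and the dilation rates used below are all assumptions chosen for illustration. Each channel is filtered by its own kernel (depthwise), and the kernel taps are spaced `dilation` pixels apart, so the receptive field grows without adding parameters.

```python
def depthwise_dilated_conv2d(x, kernels, dilation=1):
    """Illustrative depthwise dilated convolution (cross-correlation form).

    x:       list of C channels, each an H x W nested list of numbers.
    kernels: list of C square kernels (odd size k), one per channel.
    Output has the same shape as x; out-of-range taps are treated as zero.
    """
    out = []
    for channel, kernel in zip(x, kernels):  # one kernel per channel, no mixing
        h, w = len(channel), len(channel[0])
        k = len(kernel)
        half = k // 2
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for u in range(k):
                    for v in range(k):
                        # Neighbouring taps are `dilation` pixels apart.
                        r = i + (u - half) * dilation
                        c = j + (v - half) * dilation
                        if 0 <= r < h and 0 <= c < w:  # zero padding
                            acc += kernel[u][v] * channel[r][c]
                result[i][j] = acc
        out.append(result)
    return out


def receptive_field(kernel_size, dilation):
    """Side length of the input window covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# Hypothetical example: one 5x5 channel with a single active pixel at the centre.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
ones_3x3 = [[1.0] * 3 for _ in range(3)]
response = depthwise_dilated_conv2d([impulse], [ones_3x3], dilation=2)
```

With dilation 2, a 3x3 kernel covers a 5x5 window, so the single active pixel is seen only from output positions whose row and column offsets from the centre are 0 or 2. Applying kernels with several dilation rates to the same features is one common way to collect multi-scale context at low cost.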
Date of Award: 1 Aug 2022
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Hujun Yin (Supervisor) & Wuqiang Yang (Supervisor)


  • deep learning
  • semantic segmentation
  • convolutional neural network
