Feature-Enhanced Representation with Transformers for Multi-View Stereo

Lintao Xiang, Hujun Yin

Research output: Contribution to journal › Article › peer-review


Most existing multi-view stereo (MVS) methods fail to consider global contextual information during feature extraction and cost aggregation. As transformers have shown remarkable performance on various vision tasks owing to their ability to capture global context, this paper proposes a transformer-based feature enhancement network (TF-MVSNet) to facilitate feature representation learning by combining local features (both 2D and 3D) with long-range contextual information. To reduce the memory consumption of feature matching, we leverage the cross-attention mechanism to efficiently construct 3D cost volumes under the epipolar constraint. Additionally, a colour-guided network is designed to refine depth maps at the coarse stage, thereby reducing incorrect depth predictions at the fine stage. Extensive experiments were performed on the DTU dataset and the Tanks and Temples (T&T) benchmark, and results are reported.
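The core idea of epipolar cross-attention can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, shapes, and scaled dot-product formulation are assumptions. A reference-pixel feature acts as the query, while source-view features sampled at the projections of D depth hypotheses along the epipolar line act as keys and values, so attention is restricted to the epipolar line rather than the full image, which is what keeps the cost-volume construction memory-efficient.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def epipolar_cross_attention(q_feat, src_feats):
    """Hypothetical sketch of cross-attention under the epipolar constraint.

    q_feat:    (C,)   feature of one reference-view pixel (query).
    src_feats: (D, C) source-view features sampled at the D depth
                      hypotheses' projections on the epipolar line
                      (keys and values).
    Returns the attention-aggregated feature and the per-depth weights,
    which can be interpreted as matching scores for the cost volume.
    """
    C = q_feat.shape[0]
    scores = src_feats @ q_feat / np.sqrt(C)   # (D,) similarity per depth
    weights = softmax(scores)                  # (D,) attention weights
    agg = weights @ src_feats                  # (C,) aggregated feature
    return agg, weights
```

In a full network the sampling locations would come from homography warping at each depth hypothesis, and the queries/keys/values would pass through learned projections; this sketch only shows the attention step itself.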
Original language: English
Journal: IET Image Processing
Early online date: 3 Mar 2024
Publication status: E-pub ahead of print - 3 Mar 2024


  • Multi-View Stereo
  • Deep Learning
  • Transformer


