Contrastive Learning with Enhancing Detailed Information for Pre-Training Vision Transformer

Zhuomin Liang, Liang Bai, Jinyu Fan, Xian Yang, Jiye Liang

Research output: Contribution to journal › Article › peer-review


Abstract

Contrastive Learning (CL) is an effective self-supervised learning method. It performs instance-level discrimination on image representations, which enables the model to extract abstract information from images. However, when training data is insufficient, this abstract information fails to distinguish samples from different classes. The problem is more severe in the pre-training of Vision Transformers (ViT). In general, detailed information is crucial for enhancing the discriminability of representations. Patch representations, which capture the details of images, are often overlooked by existing methods that train ViT through CL, resulting in confusion between similar samples. To address this problem, we propose a Contrastive Learning model with Enhancing Detailed Information (CL-EDI) for pre-training ViT. Our model consists of dual ViT contrastive modules. The first module, similar to MoCo v3, learns abstract information about images. The second module enhances the detailed information in data representations by aggregating the patch representations of images. Extensive experiments demonstrate the necessity of learning detailed information: across several datasets, our model surpasses existing approaches in image classification, transfer learning, and object detection.
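The abstract does not specify how the two contrastive objectives are computed or combined. The sketch below is a minimal PyTorch illustration of the general idea, under stated assumptions: an InfoNCE loss on the [CLS] token for the abstract (instance-level) term, mean-pooling as the patch-aggregation scheme, and a weighting factor `alpha` for the detail term. The function names, the pooling choice, and `alpha` are all hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.2):
    """Standard InfoNCE loss between two batches of embeddings.

    Matching rows of q and k are positives; all other rows in the
    batch act as negatives.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def dual_contrastive_loss(tokens_a, tokens_b, alpha=1.0):
    """Hypothetical dual objective: an instance-level term on the [CLS]
    token plus a detail-enhancing term on aggregated patch tokens.

    tokens_a, tokens_b: ViT outputs for two augmented views of the same
    images, shape (B, 1 + num_patches, dim), [CLS] token first.
    alpha: assumed weight for the patch-level term (not from the paper).
    """
    cls_a, cls_b = tokens_a[:, 0], tokens_b[:, 0]   # abstract information
    patch_a = tokens_a[:, 1:].mean(dim=1)           # mean-pooled patches
    patch_b = tokens_b[:, 1:].mean(dim=1)           # (detailed information)
    return info_nce(cls_a, cls_b) + alpha * info_nce(patch_a, patch_b)

# Example with random stand-ins for ViT outputs: batch 8, 196 patches, dim 384
tokens_a = torch.randn(8, 197, 384)
tokens_b = torch.randn(8, 197, 384)
loss = dual_contrastive_loss(tokens_a, tokens_b)
```

Mean-pooling is only one plausible aggregation; the paper's second module may use a learned or attention-based scheme instead.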

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
DOIs
Publication status: Published - 11 Sept 2024

Keywords

  • Contrastive learning
  • Representation learning
  • Vision Transformer
