TY - JOUR
T1 - Contrastive Learning with Enhancing Detailed Information for Pre-Training Vision Transformer
AU - Liang, Zhuomin
AU - Bai, Liang
AU - Fan, Jinyu
AU - Yang, Xian
AU - Liang, Jiye
PY - 2024/9/11
Y1 - 2024/9/11
N2 - Contrastive Learning (CL) is an effective self-supervised learning method. It performs instance-level contrast on image representations, enabling the model to extract abstract information from images. However, when training data are insufficient, abstract information fails to distinguish samples from different classes. This problem is more severe in the pre-training of Vision Transformers (ViT). In general, detailed information is crucial for enhancing the discriminability of representations. Patch representations, which capture the details of images, are often overlooked in existing methods that train ViT through CL, resulting in the confusion of similar samples. To address this problem, we propose a Contrastive Learning model with Enhancing Detailed Information (CL-EDI) for pre-training ViT. Our model consists of dual ViT contrastive modules. The first module, similar to MoCo V3, learns abstract information about images. The second ViT contrastive module enhances the detailed information in data representations by aggregating the patch representations of images. Extensive experiments demonstrate the necessity of learning detailed information. Across several datasets, our model surpasses existing approaches on image classification, transfer learning, and object detection tasks.
KW - Contrastive learning
KW - Representation learning
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85204107476&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2024.3457840
DO - 10.1109/TCSVT.2024.3457840
M3 - Article
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -