Parameterization of Cross-token Relations with Relative Positional Encoding for Vision MLP

Zhicai Wang, Yanbin Hao, Xingyu Gao, Hao Zhang, Shuo Wang, Tingting Mu, Xiangnan He

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks and have become the main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, which restrains their discriminative power. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode the cross-token relations for token mixing. It reduces the quadratic parameter complexity $O(N^2)$ of current vision MLPs to $O(N)$ and $O(1)$. We experiment with two RPE mechanisms and further propose a group-wise extension that improves their expressive power by capturing multi-granular contexts. These then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach through thorough experiments, demonstrating improved or comparable performance with reduced parameter complexity. For instance, for a model trained on ImageNet1K, we achieve a performance improvement from 72.14% to 74.02% and a learnable parameter reduction from 19.4M to 18.2M.
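
To make the parameter-complexity claim concrete, the following is a minimal NumPy sketch of how a relative positional table can stand in for a dense token-mixing matrix. It is not the authors' PoSGU implementation: it assumes a learnable relative position bias of the kind commonly used in window-attention models, omits the gating unit and the group-wise extension, and the names `relative_position_index` and `token_mixing_matrix` are hypothetical.

```python
import numpy as np


def relative_position_index(h, w):
    """Map each (query, key) token pair on an h x w grid to a slot in a
    table holding one entry per relative offset, (2h-1)*(2w-1) in total."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) grid positions
    rel = coords[:, None, :] - coords[None, :, :]         # (N, N, 2) offsets
    rel[..., 0] += h - 1                                  # shift rows to 0..2h-2
    rel[..., 1] += w - 1                                  # shift cols to 0..2w-2
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]        # (N, N) flat indices


def token_mixing_matrix(table, index):
    """Expand the shared offset table into a full N x N mixing matrix."""
    return table[index]


h = w = 14                       # assumed 14 x 14 token grid
n = h * w                        # N = 196 tokens
index = relative_position_index(h, w)

rng = np.random.default_rng(0)
# Stand-in for learned parameters: one weight per relative offset.
table = rng.normal(size=(2 * h - 1) * (2 * w - 1))

mix = token_mixing_matrix(table, index)    # (N, N), built from shared weights
tokens = rng.normal(size=(n, 8))           # N tokens with 8 channels each
mixed = mix @ tokens                       # token mixing along the N axis

dense_params = n * n             # a dense token-mixing layer: O(N^2)
rpe_params = table.size          # the shared offset table: O(N)
print(dense_params, rpe_params)
```

Under these assumptions the dense layer would need 38416 weights for 196 tokens, while the offset table holds 729, which grows linearly with $N$. Token pairs with the same relative offset share a single weight, so the mixing matrix is translation-consistent across the grid.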
Original language: English
Title of host publication: Proceedings of the 30th ACM International Conference on Multimedia
Publisher: ACM Digital Library
Pages: 6288-6299
DOIs
Publication status: E-pub ahead of print - 1 Oct 2022

Keywords

  • Computer vision
  • Positional Encoding
  • Image classification
