High-performance tensor contractions for GPUs

A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov

Research output: Contribution to journal › Conference article › peer-review


We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions and exploit application specifics, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
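The central idea described above, casting a tensor contraction as an index reordering followed by a GEMM, can be illustrated with a small sketch. The code below is not from the paper; it is a minimal NumPy illustration, with hypothetical tensor names and the size n=8 chosen to match the small-contraction regime the abstract mentions.

```python
import numpy as np

# Hypothetical small sizes: the paper targets batches of tiny contractions,
# e.g., matrices of size 8 arising in high-order FEM element kernels.
n = 8
rng = np.random.default_rng(0)
A = rng.random((n, n))        # e.g., a small 1-D basis/derivative matrix
B = rng.random((n, n, n))     # a small 3rd-order tensor (one "element")

# Direct index-notation contraction: C[i,j,k] = sum_l A[i,l] * B[l,j,k]
C_direct = np.einsum('il,ljk->ijk', A, B)

# Same contraction expressed as index reordering plus a GEMM:
# flatten the free indices (j,k) of B into one dimension, multiply, reshape back.
C_gemm = (A @ B.reshape(n, n * n)).reshape(n, n, n)

assert np.allclose(C_direct, C_gemm)
```

On a GPU, many such independent small GEMMs would be grouped into one batched call (the framework's batching), so the hardware is saturated even though each individual contraction is sub-warp in size.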

Original language: English
Pages (from-to): 108-118
Number of pages: 11
Journal: Procedia Computer Science
Early online date: 1 Jun 2016
Publication status: Published - 2016
Event: International Conference on Computational Science, ICCS 2016 - San Diego, United States
Duration: 6 Jun 2016 – 8 Jun 2016


  • Applications
  • Batched linear algebra
  • FEM
  • GPU
  • Tensor contractions
  • Tensor HPC


