Matrix Multiplication in Multiword Arithmetic: Error Analysis and Application to GPU Tensor Cores

Massimiliano Fasi, Nicholas Higham, Florent Lopez, Theo Mary, Mantas Mikaitis

Research output: Contribution to journal › Article › peer-review

Abstract

In multiword arithmetic, a matrix is represented as the unevaluated sum of two or more lower-precision matrices, and a matrix product is formed by multiplying the constituents in low precision. We investigate the use of multiword arithmetic for improving the performance–accuracy tradeoff of matrix multiplication with mixed-precision block fused multiply–add (FMA) hardware, focusing especially on the tensor cores available on NVIDIA GPUs. Building on a general block FMA framework, we develop a comprehensive error analysis of multiword matrix multiplication. After confirming the theoretical error bounds experimentally by simulating low precision in software, we use the cuBLAS and CUTLASS libraries to implement a number of matrix multiplication algorithms using double-fp16 (double-binary16) arithmetic. When running the algorithms on NVIDIA V100 and A100 GPUs, we find that double-fp16 is not as accurate as fp32 (binary32) arithmetic despite satisfying the same worst-case error bound. Using probabilistic error analysis, we explain why this issue is likely caused by the rounding mode used by the NVIDIA tensor cores, and propose a parameterized blocked summation algorithm that alleviates the problem and significantly improves the performance–accuracy tradeoff.
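
To make the double-fp16 representation concrete, here is a minimal NumPy sketch of the idea described in the abstract: each matrix is split into the unevaluated sum of a high and a low binary16 part, and the product is assembled from constituent products whose inputs are binary16 but whose accumulation is binary32, in the spirit of a mixed-precision block FMA. This is an illustration under stated assumptions, not the authors' implementation; the names split_fp16 and matmul_double_fp16 are hypothetical, and matrix entries are assumed to lie within the binary16 range so the splitting does not overflow.

import numpy as np

def split_fp16(A):
    # Represent a binary32 matrix as the unevaluated sum A_hi + A_lo of
    # two binary16 matrices: A_hi carries the leading significand bits,
    # A_lo the rounding residual (which may underflow for tiny entries).
    A_hi = A.astype(np.float16)
    A_lo = (A - A_hi.astype(np.float32)).astype(np.float16)
    return A_hi, A_lo

def matmul_double_fp16(A, B):
    # Double-fp16 product: four constituent products with binary16 inputs,
    # each formed and accumulated in binary32 (the product of two binary16
    # numbers is exactly representable in binary32), mimicking a
    # mixed-precision block FMA. Implementations often drop the cheapest
    # and least significant term, A_lo @ B_lo.
    f32 = np.float32
    A_hi, A_lo = split_fp16(A)
    B_hi, B_lo = split_fp16(B)
    C  = A_hi.astype(f32) @ B_hi.astype(f32)
    C += A_hi.astype(f32) @ B_lo.astype(f32)
    C += A_lo.astype(f32) @ B_hi.astype(f32)
    C += A_lo.astype(f32) @ B_lo.astype(f32)
    return C

# Example: compare against a binary32 reference product.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256)).astype(np.float32)
B = rng.standard_normal((256, 256)).astype(np.float32)
err = np.linalg.norm(matmul_double_fp16(A, B) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {err:.2e}")

Note that this software simulation rounds every binary32 accumulation to nearest, whereas the paper finds that the rounding mode used by the actual tensor cores degrades accuracy; the sketch therefore illustrates the representation, not the hardware rounding behavior.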
Original language: English
Journal: SIAM Journal on Scientific Computing
Publication status: Accepted/In press - 24 Aug 2022
