## Abstract

Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
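
The difference between the two output precisions can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code or the tensor core implementation: it assumes inputs rounded to fp16, a block of products summed in double precision as a stand-in for the unit's internal arithmetic, and an accumulator rounded to either fp16 (mimicking TC16) or fp32 (mimicking TC32) after each block. The names `round_to`, `block_fma_dot`, and `relative_error`, the block size of 4, and the random test data are all hypothetical choices made for illustration.

```python
import math
import random
import struct

FP16 = "<e"  # IEEE binary16, mimics the TC16 output precision
FP32 = "<f"  # IEEE binary32, mimics the TC32 output precision


def round_to(x, fmt):
    """Round a Python float (double) to the format named by a struct code."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def block_fma_dot(a16, b16, block, out_fmt):
    """Inner product of fp16 vectors accumulated by a simulated block FMA.

    Each group of `block` products is summed in double precision (an
    assumption standing in for the unit's internal arithmetic), added to
    the accumulator, and the result is rounded to `out_fmt`.
    """
    acc = 0.0
    for k in range(0, len(a16), block):
        partial = sum(x * y for x, y in zip(a16[k:k + block], b16[k:k + block]))
        acc = round_to(acc + partial, out_fmt)
    return acc


def relative_error(n=4096, block=4, out_fmt=FP32, seed=0):
    """Relative accumulation error against a correctly rounded double sum."""
    rng = random.Random(seed)
    a16 = [round_to(rng.random(), FP16) for _ in range(n)]
    b16 = [round_to(rng.random(), FP16) for _ in range(n)]
    approx = block_fma_dot(a16, b16, block, out_fmt)
    # Products of two fp16 numbers are exact in double, so fsum gives a
    # reference that isolates the error due to the accumulator precision.
    exact = math.fsum(x * y for x, y in zip(a16, b16))
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    print("fp16 output (TC16-like):", relative_error(out_fmt=FP16))
    print("fp32 output (TC32-like):", relative_error(out_fmt=FP32))
```

Under these assumptions the fp32 accumulator should show a much smaller relative error than the fp16 one, which is the qualitative behavior the abstract reports for TC32 versus TC16. The sketch says nothing about performance, and it models an inner product rather than the full matrix multiplication or LU factorization kernels.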

Original language | English |
---|---|
Journal | SIAM Journal on Scientific Computing |
Publication status | Accepted/In press - 25 Feb 2020 |