A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Azzam Haidar, Ahmad Abdelfattah, Mawussi Zounon, Stanimire Tomov, Jack Dongarra

Research output: Contribution to journal › Article › peer-review



We present high-performance results, with large speedups over vendor libraries, for very small matrix computations. In addition, we discuss the main challenges that prevent the design of efficient GPU kernels for small-matrix algorithms. We propose a relevant algorithmic analysis for harnessing the full power of a GPU, together with strategies for predicting the performance spectrum before a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology allows us to achieve performance close to the theoretical upper bound. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels that solve batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for such kernels in scientific simulations, such as astrophysics applications. Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated into our proposed design. The proposed GPU kernels achieve speedups over cuBLAS of up to 6× for the factorization, using double-precision arithmetic on a Pascal P100 GPU.
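To make the abstract's central object concrete: a batched factorization applies one small, independent factorization to each matrix in a large batch. The following is a minimal CPU sketch in NumPy, not the paper's CUDA kernels; it shows an unblocked batched Cholesky vectorized across the batch dimension, whereas a GPU kernel of the kind the paper studies would instead assign each tiny matrix to a thread block and hold it in registers or shared memory.

```python
import numpy as np

def batched_cholesky(A):
    """Unblocked Cholesky on a batch of small SPD matrices.

    A has shape (batch, n, n); returns the lower-triangular factor L
    of each matrix, computed column by column across the whole batch.
    """
    batch, n, _ = A.shape
    L = np.zeros_like(A)
    for j in range(n):
        # Diagonal entry: l_jj = sqrt(a_jj - sum_k l_jk^2)
        L[:, j, j] = np.sqrt(A[:, j, j] - np.sum(L[:, j, :j] ** 2, axis=1))
        # Entries below the diagonal in column j
        for i in range(j + 1, n):
            L[:, i, j] = (A[:, i, j]
                          - np.sum(L[:, i, :j] * L[:, j, :j], axis=1)) / L[:, j, j]
    return L

# Build a batch of random SPD matrices and factor them all at once.
rng = np.random.default_rng(0)
B = rng.standard_normal((1000, 8, 8))
A = B @ B.transpose(0, 2, 1) + 8 * np.eye(8)  # SPD by construction
L = batched_cholesky(A)
assert np.allclose(L @ L.transpose(0, 2, 1), A)
```

Note that the per-matrix work here is tiny (an 8×8 factorization), so on a GPU the performance problem the paper addresses is dominated by memory traffic and launch/scheduling overheads rather than floating-point throughput.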

Original language: English
Journal: IEEE Transactions on Parallel and Distributed Systems
Early online date: 15 Dec 2017
Publication status: Published - 2017


  • Algorithm design and analysis
  • batched computation
  • Computational modeling
  • Computer architecture
  • GPUs
  • Graphics processing units
  • Kernel
  • Libraries
  • Linear algebra
  • variable small sizes


