Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Jakub Kurzak, Hartwig Anzt, Mark Gates, Jack Dongarra

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Many problems in engineering and scientific computing require the solution of a large number of small systems of linear equations. Due to their high processing power, graphics processing units (GPUs) have become an attractive target for this class of problems, and routines based on the LU and QR factorizations are provided by NVIDIA in the cuBLAS library. This work addresses the case where the systems of equations are symmetric positive definite. The paper describes the implementation and tuning of kernels for the Cholesky factorization and for forward and backward substitution. Targeted workloads involve the solution of thousands of linear systems of the same size, with the focus on matrix dimensions from 5 by 5 to 100 by 100. Because cuBLAS provides no batched Cholesky factorization, execution rates of the cuBLAS LU and QR routines are used for comparison against the proposed Cholesky factorization. Execution rates of the forward and backward substitution routines are compared to the equivalent cuBLAS routines. Comparisons against optimized multicore implementations are also presented. Superior performance is reached in all cases.
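    The workload described in the abstract — factoring many small symmetric positive definite systems and solving them by forward and backward substitution — can be sketched as a serial NumPy reference. This is an illustrative sketch only, not the paper's CUDA kernels; all function names are invented here, and on a GPU each system in the batch would be handled by its own thread block rather than a Python loop.

    ```python
    import numpy as np

    def cholesky_factor(A):
        """Unblocked Cholesky factorization: lower-triangular L with A = L @ L.T."""
        n = A.shape[0]
        L = np.zeros_like(A)
        for j in range(n):
            # Diagonal entry: remove contributions of columns already factored.
            L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
        return L

    def cholesky_solve(L, b):
        """Solve A x = b given A = L L^T: forward then backward substitution."""
        n = L.shape[0]
        y = np.zeros_like(b)
        for i in range(n):                      # forward:  L y = b
            y[i] = (b[i] - np.dot(L[i, :i], y[:i])) / L[i, i]
        x = np.zeros_like(b)
        for i in range(n - 1, -1, -1):          # backward: L^T x = y
            x[i] = (y[i] - np.dot(L[i + 1:, i], x[i + 1:])) / L[i, i]
        return x

    def batched_cholesky_solve(As, bs):
        """Serial stand-in for the batched GPU routine: one small system per entry."""
        return np.stack([cholesky_solve(cholesky_factor(A), b)
                         for A, b in zip(As, bs)])
    ```

    A GPU implementation, as tuned in the paper, keeps each small matrix resident in registers or shared memory and assigns one system per thread block, so the loop over the batch is replaced by the kernel's grid dimension.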

    Original language: English
    Article number: 7275187
    Pages (from-to): 2036-2048
    Number of pages: 13
    Journal: IEEE Transactions on Parallel and Distributed Systems
    Volume: 27
    Issue number: 7
    Early online date: 24 Sept 2015
    DOIs
    Publication status: Published - 1 Jul 2016

    Keywords

    • batched
    • Cholesky factorization
    • CUDA
    • GPU
    • kernel
    • SIMT
