Enabling Direct-Access Global Shared Memory for Distributed Heterogeneous Computing

Student thesis: PhD

Abstract

Large applications that run across multiple host computers need access to both local and remote distributed memory. They also use compute accelerators to achieve high levels of computation. The capacity of local memory is still increasing, but the increase is slowing because physical constraints limit the number and size of memory devices that can be accessed locally. These limitations have led to the development of various new memory technologies that aim to increase local memory capacity. They have also led to the use of various Distributed Shared Memory (DSM) systems that provide pools of remote memory. In both cases, however, the application must use different memory abstractions, and its performance is affected by the characteristics of each memory interface.

Moreover, compute accelerators may have their own local memories and are tied to local computation without being part of any broader distributed memory architecture. In such configurations, all communication from any processing element to the system memory resources must be processed by host software, leading to significant performance inefficiencies. Current approaches that give accelerators access to a pool of distributed memory, meanwhile, require significant capabilities and resources from the accelerator. Providing efficient access to pools of distributed memory across a distributed heterogeneous system is therefore a key challenge for scaling applications.

Access to local and remote memory currently operates at two different levels of granularity in abstraction: the level where host software abstracts the application's access to remote memory, and the level where the host hardware and its memory transactions provide direct access to local memory. At the software-level granularity, any access to remote resources requires higher-level software abstractions that add latency. By switching to hardware-level granularity for remote memory, any processing element can directly access any memory resource through its instruction set architecture and native read/write transactions, removing software invocation overhead and reducing latency compared to current software-based approaches.

This thesis therefore presents the Generalised Memory System (GMS). More precisely, this thesis:
1) Proposes the system architecture for GMS, which logically unifies local, remote, and accelerator memory resources to create a novel, directly addressable distributed shared memory pool.
2) Extends the GMS so that any processing element within a heterogeneous distributed system can also directly access memory in the GMS without software overhead.
3) Contributes, through engineering examples, solutions that allow both FPGA-accelerated code and existing PCIe-host-based code to participate in the GMS, including the implementation of the associated firmware.
4) Demonstrates the advantages of GMS by elevating accelerators to peers of the hosts within a distributed application.

To evaluate the benefits of GMS, this work includes a GMS implementation known as F-GMS, a baseline implementation used for evaluation across multiple endpoint types. It was deployed on a 4-node distributed heterogeneous computing cluster that combines Peripheral Component Interconnect Express (PCIe)-attached Field Programmable Gate Array (FPGA) accelerators and x86-64 AMD computing nodes connected by a commodity 100 Gbit network. The results show that the proposed GMS halves the execution time of distributed applications compared to MPI by reducing the access time of load/store operations on remote shared memory, and that it provides accelerators in heterogeneous applications with direct access to remote memory.
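The idea of a single, directly addressable pool spanning host and accelerator memory can be pictured with a small toy model. The Python sketch below is purely illustrative and is not the F-GMS implementation: the names `Endpoint` and `GlobalAddressSpace`, the 40-bit offset field, and the software-side address decoding are all assumptions made for this example. In the system described above, the equivalent decoding would be carried out by hardware memory transactions rather than by a function call.

```python
from dataclasses import dataclass, field

# Hypothetical address layout (an assumption for this sketch):
# the upper bits select an endpoint, the lower OFFSET_BITS select a byte in it.
OFFSET_BITS = 40
OFFSET_MASK = (1 << OFFSET_BITS) - 1


@dataclass
class Endpoint:
    """A memory owner in the toy model: a host or an accelerator."""
    name: str
    size: int
    memory: bytearray = field(init=False)

    def __post_init__(self):
        self.memory = bytearray(self.size)


class GlobalAddressSpace:
    """Toy unified pool: one flat address reaches any endpoint's memory."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)

    def encode(self, node_id, offset):
        # Combine endpoint id and local offset into one global address.
        return (node_id << OFFSET_BITS) | (offset & OFFSET_MASK)

    def decode(self, addr):
        # Split a global address back into (endpoint id, local offset).
        return addr >> OFFSET_BITS, addr & OFFSET_MASK

    def store(self, addr, value):
        node_id, offset = self.decode(addr)
        # Out-of-range ids or offsets raise IndexError.
        self.endpoints[node_id].memory[offset] = value & 0xFF

    def load(self, addr):
        node_id, offset = self.decode(addr)
        return self.endpoints[node_id].memory[offset]


# Usage: a "host" and an "FPGA" endpoint share one address space.
gas = GlobalAddressSpace([Endpoint("host0", 1024), Endpoint("fpga0", 1024)])
addr = gas.encode(1, 16)   # byte 16 of the second endpoint
gas.store(addr, 0x2A)
value = gas.load(addr)
```

The point of the sketch is only that the caller issues the same load and store regardless of which endpoint owns the byte. It says nothing about latency, coherence, or the network transport, which the toy model does not represent.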
Date of Award: 31 Dec 2023
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Mikel Luján (Supervisor) & Anthony Goodacre (Supervisor)

Keywords

  • RDMA
  • PGAS
  • Global Address Space
  • PCIe
  • Distributed Systems
  • Computer Clusters
  • Big Data
  • MPI
  • Heterogeneous Systems
  • Accelerators
  • FPGA
  • Heterogeneous Computing
  • Shared Memory
  • DSM
  • Disaggregated Memory
  • Memory Architecture
  • Global Shared Memory
