Abstract
Memory throughput is one of the major bottlenecks for accelerator performance. Now that Zynq UltraScale+ systems are being deployed from exascale to the edge, it is important to understand their limitations and the optimizations available to developers. In this paper, we extensively evaluate memory performance and behaviour across various AXI port combinations, burst sizes, access patterns, and numbers of accelerators per AXI port. Our results on the ZCU102 and Ultra96 boards show that 1) the effective throughput of these systems is only 75% and 92.5% of the theoretical maximum, respectively, 2) burst sizes of 128 and 192 bytes are often optimal, 3) AXI ports of the same type may not always exhibit similar behaviour, 4) multiplexing accelerators in the PL can provide better throughput distribution than multiplexing in the PS, and 5) using all AXI ports does not lead to the highest performance.
Original language | English |
---|---|
Title of host publication | International Conference on Field-Programmable Technology (FPT) |
Publication status | Accepted/In press - 7 Oct 2019 |
Keywords
- Memory
- FPGA
- Quantitative analysis