Abstract
The idea of coupling reconfigurable fabrics with general-purpose processors has been studied extensively over the last couple of decades. Custom instructions targeting those reconfigurable fabrics had to be handcrafted because tools capable of high-level synthesis (HLS) were not available at the time. Nowadays, HLS tools have matured to a state that allows system designers to automatically generate hardware implementations from software applications. At the end of Moore's era, it is necessary to reinvestigate reconfigurable custom instructions by taking full advantage of the latest HLS compilers. In this paper, we introduce the concept of CPU interlays, which are FPGA-like fabrics integrated directly into the core of a hardened processor. This enables the customization of an instruction set at runtime. While CPU interlays will show the best performance with hand-optimized custom instructions, this paper suggests a semi-automated flow that does not need the expertise of an FPGA designer. Using automatic profiling together with HLS tools allows the acceleration of user programs with very little human interaction during application design. By replacing the NEON SIMD unit of an ARM Cortex-A9 with an interlay taking the same die area, we demonstrate speedups as high as 68× for individual function kernels without touching any RTL code. Furthermore, we show that while HLS compilers can enhance design productivity, it is in some cases necessary to follow an HLS-friendly coding style to maximize performance.
Original language | English |
---|---|
Title of host publication | International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2017) |
Publication status | Published - 31 Dec 2017 |