Large Utility Sorting on FPGAs

Kristiyan Manev, Dirk Koch

Research output: Contribution to conferencePaperpeer-review

493 Downloads (Pure)


Sorting is a core operation in many applications. In
particular large problem sorting has received increased attention
due to many big data problems that require sorting. Most large
problem sorters use variants of merge sorting. Traditional merge
sorters are implemented of trees having a linear cost in resources
with respect to the number of sequences merged. In this paper,
we present an FPGA implementation that scales logarithmic and
allows therefore merging several thousand sorted sequences in
one run through the chip. For achieving high throughput, we
use speculative execution techniques, pipelining of critical paths,
simplified inter-level tree communications and parameterizing
stage sizes for allowing trade-offs between utilization/throughput
and stalling rates. Our implementation can merge sort up to
4096 sequences of 64-bit keys at 253MHz (16K sequences at
188MHz). In contrast with previous works, we research and
present an implementation that sorts end-to-end on real hardware
showing that the resource utilization of the needed buffering
infrastructure is much higher than the sorter itself. A case study
utilizing a Xilinx VC709 board merges 1024 sequences of the
Graysort benchmark from and to DRAM at 80% peak DDR-3
memory throughput.
Original languageEnglish
Publication statusAccepted/In press - 16 Sept 2018


Dive into the research topics of 'Large Utility Sorting on FPGAs'. Together they form a unique fingerprint.

Cite this