High-Performance Parallel Radix Sort on FPGA


    • #1196
      Ken Eguro
      Keymaster

      High-Performance Parallel Radix Sort on FPGA (Link for PDF)
      Bashar Romanous (University of California, Riverside), Mohammadreza Rezvani (University of California, Riverside), Junjie Huang (University of California, Riverside), Daniel Wong (University of California, Riverside), Evangelos E. Papalexakis (University of California, Riverside), Vassilis J. Tsotras (University of California, Riverside), and Walid Najjar (University of California, Riverside)

    • #1594
      davidsidler
      Participant

      Hey there,

      Interesting work. How many radix bits can you support? Do you have to do multiple passes over the data for large data sets or high cardinality?
      How does this compare to the CPU?
      On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?

      At a higher level, how does your approach compare to sorting networks in existing work?

      David

    • #1607
      Dustin Richmond
      Keymaster

      Thank you for posting your question!

    • #1647
      bashar.romanous
      Participant

      Hello David,

      Thank you for your interest in our work!

      How many radix bits can you support?

      We support a radix of 16 bits on the Xilinx UltraScale+ XCVU7P chip, but a chip with larger BRAM/URAM, such as the Xilinx UltraScale+ XCVU13P, would make a radix of 20 bits possible.
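      To make the on-chip memory requirement concrete, here is a minimal back-of-the-envelope sketch (the 32-bit counter width and the assumption of one counter per bucket are illustrative only, not details of our design): each radix pass needs 2^radix buckets, so the count storage grows exponentially with the radix.

      ```c
      #include <stdio.h>
      #include <stdint.h>

      /* Illustrative only: one radix-sort pass needs 2^radix_bits buckets,
       * so count/offset storage grows exponentially with the radix.
       * The 32-bit counter width is an assumption for this sketch. */
      int main(void) {
          for (int radix_bits = 12; radix_bits <= 20; radix_bits += 4) {
              uint64_t buckets = 1ULL << radix_bits;
              uint64_t bytes   = buckets * sizeof(uint32_t); /* one counter per bucket */
              printf("radix %2d bits -> %7llu buckets, %8llu bytes of counters\n",
                     radix_bits, (unsigned long long)buckets,
                     (unsigned long long)bytes);
          }
          return 0;
      }
      ```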

      Do you have to do multiple passes over the data for large data sets or high cardinality?

      The number of iterations is determined not by the dataset size but by the radix. For example, a radix of 16 bits with 80-bit keys means that 80/16 = 5 iterations are needed. The cardinality of the dataset doesn’t affect the sorting time.

      How does this compare to the CPU?

      CPU performance peaks and then degrades slowly as the dataset size grows, since cache misses increase and the dataset competes with the OS for memory resources. The FPGA throughput is constant for large datasets. However, the very high frequency of CPU cores and their large memory bandwidth make a comparison in terms of raw throughput difficult. Hence, the figures were normalized by bandwidth, and throughput was reported in terms of sorted records per cycle.

      On the CPU you benefit from write-combining through the use of SIMD stores; are you implementing a similar mechanism on the FPGA? If so, at what threshold are you writing to DRAM?

      We didn’t implement a similar mechanism, as there is no spatial locality in the writes. The writes are not contiguous and take place in a scattered fashion.
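      For reference, a minimal software sketch of one radix pass (histogram, prefix sum, scatter) shows why the writes are scattered: each record’s destination depends on its digit value, not on its input position, so consecutive inputs rarely write to consecutive addresses. This is only an illustration in C, not our FPGA pipeline.

      ```c
      #include <stdint.h>
      #include <stddef.h>
      #include <string.h>

      #define RADIX_BITS 16
      #define BUCKETS    (1u << RADIX_BITS)

      /* One pass of LSD radix sort over the digit at 'shift'.
       * Software sketch for illustration only, not the FPGA design. */
      static void radix_pass(const uint64_t *in, uint64_t *out, size_t n, int shift) {
          static uint64_t offset[BUCKETS];
          memset(offset, 0, sizeof(offset));

          /* 1. Histogram: count the keys in each bucket. */
          for (size_t i = 0; i < n; i++)
              offset[(in[i] >> shift) & (BUCKETS - 1)]++;

          /* 2. Exclusive prefix sum: turn counts into starting offsets. */
          uint64_t sum = 0;
          for (size_t b = 0; b < BUCKETS; b++) {
              uint64_t c = offset[b];
              offset[b] = sum;
              sum += c;
          }

          /* 3. Scatter: each write lands wherever its bucket pointer currently
           *    points, so the output addresses jump around memory. */
          for (size_t i = 0; i < n; i++)
              out[offset[(in[i] >> shift) & (BUCKETS - 1)]++] = in[i];
      }
      ```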

      At a higher level, how does your approach compare to sorting networks in existing work?

      The time complexity of a distribution-based sort such as Radix Sort is O(N*B), where N is the dataset size and B is the number of iterations, i.e., the ratio of the key size to the radix. For comparison-based sorts, as used in sorting networks, the time complexity is O(N*log(N)).
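      As a rough numeric illustration (the dataset sizes below are example values only, and constant factors and hardware details are ignored), here is the pass/level count for both approaches with the 80-bit key and 16-bit radix mentioned above.

      ```c
      #include <stdio.h>
      #include <math.h>

      /* Illustrative pass/level counts only: radix sort needs
       * B = ceil(key_bits / radix_bits) passes regardless of N, while a
       * comparison-based sort needs on the order of log2(N) levels. */
      int main(void) {
          const double sizes[] = { 1e6, 1e8, 1e9 };  /* example dataset sizes */
          const int key_bits = 80, radix_bits = 16;
          const int passes = (key_bits + radix_bits - 1) / radix_bits;  /* = 5 */
          for (int i = 0; i < 3; i++)
              printf("N = %.0e: radix sort ~ %d passes, comparison sort ~ %.0f levels\n",
                     sizes[i], passes, ceil(log2(sizes[i])));
          return 0;
      }
      ```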

      Sorting networks are limited in their scalability. For a larger dataset you have to run multiple passes through the sorting network and then merge the sorted sub-datasets at the end.

      Kindest regards,
      Bashar
