FCCM Main Page › Forums › Poster Session 1 – Arithmetic and Security › Proposing a Fast and Scalable Systolic Array for Matrix Multiplication
 This topic has 2 replies, 3 voices, and was last updated 6 days ago by Bahar.



April 8, 2020 at 6:12 am #1176 | Ken Eguro (Keymaster)
Proposing a Fast and Scalable Systolic Array for Matrix Multiplication – Link for PDF
Bahar Asgari (Georgia Institute of Technology), Ramyad Hadidi (Georgia Institute of Technology), and Hyesoon Kim (Georgia Institute of Technology)

May 21, 2020 at 5:47 pm #1717 | fjhormigo (Participant)
Your work seems quite interesting, but I don't have enough information to evaluate your proposal properly.
You account for latency based on the number of cycles only, but how does your proposal affect the clock frequency compared to the other methods, and hence the latency in seconds?
Another key issue is how your proposal affects throughput compared to the others: does it also reduce throughput? Said another way, what is the initiation interval (the number of cycles between two consecutive calculations) for the three different approaches?
Is there any significant change in resource utilization (number of LUTs, DSPs)?
Which sizes of matrices are you using in your benchmarks? Could you provide more information on how you compute the speedup and energy consumption?
Thank you very much. 
May 25, 2020 at 5:38 pm #1719 | Bahar (Participant)
Hello!
Many thanks for showing interest in our work. In the following, we answer your questions in order:
– Our proposed structure does not affect the clock frequency. In other words, increasing or decreasing the clock frequency impacts all three designs (ours and the two previous systolic arrays) equally, by either removing positive slack or increasing the number of cycles.
– The maximum attainable throughput (GByte/sec x Ops/Byte) is defined by the depth of the systolic array (Ops/Byte, or reuse rate) and the memory bandwidth. Therefore, the throughputs of same-size systolic arrays (ours and the two previous designs) connected to the same memory system (hence the same memory bandwidth) are identical, and higher than those of CPUs and GPUs. Regarding your question about the cycles between consecutive calculations, please note that our design and the TPU-style systolic array benefit from overlapping the load and process phases and reusing the preloaded matrix, while the non-stationary systolic array does not.
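The bandwidth-times-reuse argument above can be sketched as a small roofline-style calculation. This is our illustrative example with made-up numbers, not figures from the poster: attainable throughput is capped either by memory bandwidth multiplied by the array's reuse rate (Ops/Byte) or by the array's peak compute rate.

```python
def attainable_throughput(mem_bw_gbps, ops_per_byte, peak_gops):
    """Roofline-style bound: the design is limited either by memory
    bandwidth times data reuse (Ops/Byte) or by its peak compute rate."""
    return min(peak_gops, mem_bw_gbps * ops_per_byte)

# A deeper systolic array reuses each loaded byte more often (higher
# Ops/Byte), so memory-bound throughput rises until the design
# becomes compute-bound; hypothetical numbers for illustration only.
shallow = attainable_throughput(mem_bw_gbps=16, ops_per_byte=8, peak_gops=512)
deep = attainable_throughput(mem_bw_gbps=16, ops_per_byte=64, peak_gops=512)
print(shallow, deep)  # 128 512
```

Under this model, two same-size arrays on the same memory system share the same Ops/Byte and bandwidth, hence the same bound.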
– Regardless of their interconnections, all implemented systolic arrays (ours and the two previous systolic arrays) are similar in the total number of multipliers, adders, and registers (as they all store a value in their PEs, either an operand or a partial output). As a result, even though our design uses slightly fewer FFs and LUTs because of its multiplier-plus-adder-tree architecture, we do not see significant differences in resource utilization.
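To illustrate the multiplier-plus-adder-tree idea in software (our own sketch, not the authors' RTL): each output element is a dot product whose partial products are reduced pairwise in a balanced tree of log2(n) levels, rather than accumulated one at a time along a chain.

```python
def adder_tree_dot(a, b):
    """Dot product via a balanced adder tree: multiply elementwise,
    then reduce pairwise, level by level, mirroring a
    multiplier-plus-adder-tree datapath."""
    sums = [x * y for x, y in zip(a, b)]
    while len(sums) > 1:
        if len(sums) % 2:          # pad an odd level with the additive identity
            sums.append(0)
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
    return sums[0]

print(adder_tree_dot([1, 2, 3], [4, 5, 6]))  # 32
```

In hardware, the tree reduction shortens the accumulation path; the total count of multipliers and adders stays the same, consistent with the similar resource utilization reported above.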
– Our benchmark includes VGG-S, VGG16, AlexNet, CifarNet, and ResNet50, consisting of various-size matrices with dimensions between 16 and 50176. The reported speedup and energy consumption are for performing only the matrix multiplications for inference using the mentioned set of DNNs.
Please let us know if further clarifications are required.
Thanks!

