FCCM 2023 Panel: Implementing and Scaling Large Language Models (LLMs) Efficiently — FPGAs vs. GPUs vs. ASICs
Wednesday May 10th, 10:15 – 11:15am + 15 minutes slack
Stephen Neuendorffer (AMD),
Ilya Ganusov (Intel),
Rick Grandy (Nvidia),
Andrew Ling (Groq),
Eriko Nurvitadhi (Mangoboost)
Mohamed Abdelfattah (Cornell Tech)
Organizer: Nachiket Kapre
Large Language Models (LLMs) are all the rage. They appear capable at natural language tasks such as next-token prediction and can hold seemingly coherent conversations with humans. They are expected to transform how many industries and services operate. Microsoft, Google, Meta, Amazon, and other technology companies are scrambling to deploy and integrate this technology into their products. While training these LLMs can occupy hundreds of high-end GPUs for weeks or months of computing time, inference can be much faster and leaner.
As of now, it appears that GPUs have won both the training and inference markets for LLMs. A key question that service providers must address is the cost of offering this expensive feature to customers (e.g., cost per query).
This panel seeks to discuss architecture choice through the lens of the cost metric:
0. What key metrics do we need to define (e.g., cost per query) to compare the various options for accelerating LLM inference? Examples include silicon cost, software toolchain cost, datacenter operating costs, one-time/periodic training cost, and scaling efficiency.
1. Are FPGAs competitive for LLM inference? How important are custom blocks like AMD AI engines or Intel Tensor Blocks?
2. What is the role of ASICs/accelerators? Amazon, Groq, Cerebras, and a shrinking pool of hardware startups have exotic architectures that look competitive on paper. How will they offer a competitive solution in practice?
3. What architecture is a better fit for multi-device inference? At present, multi-device inference is necessary to accommodate massive LLM weight matrices.
4. Will LLMs ever scale down to run on a single device without sacrificing quality (e.g., FlexGen)? How would that change our choice of architecture for inference?
5. GPUs have the unique advantage of a superior software ecosystem for AI (PyTorch, etc.) in addition to continuously evolving state-of-the-art hardware. How much will this advantage hurt competitors?
6. Do we anticipate fully customized architectures for LLMs that tailor the compute density, memory capacity, and network capabilities to match LLM requirements?
The panel will be organized as follows. Each panelist will have ~5-10 minutes to present their corporate technology pitch addressing these questions. We will then open the floor to questions from the audience. You may pre-submit your questions to firstname.lastname@example.org, and we will relay them to the panelists before the conference.
We will post panelist slides here.
Rick Grandy is a Principal Solutions Architect on the Professional Visualization (ProViz) team at NVIDIA, where he focuses on graphics and machine learning for the Media and Entertainment industry. Rick has nearly three decades of experience in visual effects and animation with Industrial Light and Magic, Sony Pictures Imageworks, Digital Domain, Bad Robot, and others. His work initially centered on the development of digital characters and animation tools but grew to cover the post-production pipeline end-to-end, including real-time previsualization, asset management, editorial, ingestion/delivery systems, and workflow automation. At NVIDIA, Rick combines his experience with the latest in production technology and advanced research to help customers build the solutions necessary to perform magic for audiences everywhere.
Stephen Neuendorffer is a Fellow in the AMD Research and Development Group working on early development of compilation for compute acceleration, focused on leveraging LLVM and MLIR. Previously, he was product architect of Xilinx Vivado HLS, co-authored a widely used textbook on HLS design for FPGAs, and worked with customers on a wide variety of applications, including video encoders, computer vision, wireless systems, and networking systems. He received B.S. degrees in Electrical Engineering and Computer Science from the University of Maryland, College Park in 1998. He graduated with University Honors, Departmental Honors in Electrical Engineering, and was named the Outstanding Graduate in the Department of Computer Science. He received the Ph.D. degree from the University of California, Berkeley in 2005, after being one of the key architects of Ptolemy II.
Andrew Ling is a Sr. Director of ML Compilers and Software at Groq, a well-funded ML acceleration startup headquartered in Mountain View. He received his PhD from the University of Toronto and has spent most of his career building compilers for various accelerators such as FPGAs. While leading Groq's ML compiler effort, Andrew developed a novel kernel-less approach to ML compilation for Groq's novel TSP, supporting a vast array of models, including large language and computer vision models.
Eriko Nurvitadhi is a Co-Founder and the Chief Product Officer of MangoBoost, Inc., a well-funded (and actively hiring) startup that develops novel data processing units (DPUs) to dramatically improve server systems' performance, scalability, and cost. His technical interests are in hardware accelerator architectures and their ecosystems (systems, software) for key application domains. Previously, he was a Principal Engineer at Intel, where he worked on FPGA and AI technologies, co-founding Intel's Xeon+FPGA and 3D FPGA academic programs and contributing to Intel's first AI-optimized FPGA. He has 70+ peer-reviewed papers and 100+ patents pending/granted, with an H-index of 32. He has served on committees of IEEE/ACM conferences (e.g., FPGA, DAC, FCCM), including as the Technical Program Chair for FCCM 2022. He received a PhD in ECE from Carnegie Mellon University and an MBA from Oregon State University.
Ilya Ganusov is a Senior Principal Engineer and Director of the FPGA Core Architecture group at Intel. His research and development efforts focus on advancing FPGA core architecture, developing architectural tools and methodologies, and technology pathfinding. Since joining Intel in 2018, he has played a critical role in developing the foundational Agilex architecture and introducing application-specific optimizations for AI, HPC, and wireless communication applications. Prior to Intel, Ilya designed FPGAs at Xilinx and Achronix. Ilya has co-authored over 30 granted patents and 12 peer-reviewed papers. He received his Ph.D. in Electrical and Computer Engineering from Cornell University.
Mohamed Abdelfattah is an Assistant Professor at Cornell Tech and in the Electrical and Computer Engineering Department at Cornell University. His research interests include deep learning systems, automated machine learning, hardware-software codesign, reconfigurable computing, and FPGA architecture. Mohamed’s goal is to design the next generation of machine-learning-centric computer systems for both datacenters and mobile devices. Mohamed received his BSc from the German University in Cairo, his MSc from the University of Stuttgart, and his PhD from the University of Toronto. His PhD was supported by the Vanier Canada Graduate Scholarship and he received three best paper awards for his work on embedded networks-on-chip for FPGAs. His PhD work garnered much industrial interest and has since been adopted by multiple semiconductor companies in their latest FPGAs. After his PhD, Mohamed spent time at Intel’s programmable solutions group, and most recently at Samsung where he led a research team focused on hardware-aware automated machine learning.