

#### OCP FUTURE TECHNOLOGIES SYMPOSIUM

# **OCP Global Summit**

October 18, 2023 | San Jose, CA



## Architectural Challenges and Innovation for Compute Infrastructure Co-Design

Peipei Zhou

Assistant Professor, University of Pittsburgh

Scaling Innovation Through Collaboration



OCTOBER 17-19, 2023 SAN JOSE, CA



#### **Generative AI Models: ChatGPT**



Figure: Path to 1 million users (number of days from launch)

Sources: Google, Subredditstats, Media Reports




#### **Generative AI Models: Stable Diffusion, Dall-E**






### **Transformer Models**



Multi-Head Attention (MHA) Module




## Kernel Breakdown


- Profiling a Transformer-based model, DeiT-T, on an Nvidia T4 GPU (TSMC 12 nm)

Figure legend (kernel categories): DDR, Patch Embed, MM, BMM, Reformat, Transpose, Softmax, Layernorm, GELU



1. Low Tensor Core utilization for INT8 MM kernels.
2. TensorRT adopts an implicit quantization policy, so BMM is computed in FP32 even though it could run in INT8.
3. Quantization and dequantization between FP32 and INT8 consume non-negligible GPU cycles.
4. Data layout changes also consume non-negligible GPU cycles.
5. The nonlinear kernels (e.g., Softmax, GELU, Layernorm) take significant GPU cycles.
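Observations 2 and 3 can be illustrated with a small pure-Python sketch. It is not TensorRT's implementation: the symmetric per-tensor scaling, the helper names (`quantize`, `dequantize`, `int8_dot`) and the sample values are all assumptions chosen for illustration. It contrasts a dot product that stays in integer arithmetic and rescales once with one that first dequantizes both operands back to floating point.

```python
def quantize(values, scale):
    """Map floats to INT8 codes (assumed symmetric per-tensor scaling)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to floats; one multiply per element."""
    return [c * scale for c in codes]


def int8_dot(a_codes, b_codes):
    """Integer multiply-accumulate, as an INT8 matmul unit would perform."""
    return sum(x * y for x, y in zip(a_codes, b_codes))


# Hypothetical operands, e.g. one row of Q and one column of K in a BMM.
a = [0.5, -1.0, 0.25, 2.0]
b = [1.0, 0.5, -2.0, 0.75]
scale_a = max(abs(v) for v in a) / 127
scale_b = max(abs(v) for v in b) / 127
qa, qb = quantize(a, scale_a), quantize(b, scale_b)

# Path 1: stay in integers and apply a single rescale to the accumulator.
int8_result = int8_dot(qa, qb) * (scale_a * scale_b)

# Path 2: dequantize every element first, then multiply in floating point.
fa, fb = dequantize(qa, scale_a), dequantize(qb, scale_b)
float_result = sum(x * y for x, y in zip(fa, fb))

# Unquantized reference for comparison.
reference = sum(x * y for x, y in zip(a, b))

# Extra element-wise conversions that path 2 pays and path 1 avoids.
extra_conversions = len(fa) + len(fb)
```

Both paths produce the same value up to floating-point rounding, but the second spends `extra_conversions` additional element-wise operations on conversion, which is the kind of overhead the profile attributes to quantization and dequantization.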

#### FPGA vs. GPU?

- FPGA U250, HeatViT
- GPU T4, TensorRT



#### **GPU+FPGA?**

- FPGA U250, HeatViT
- GPU T4, TensorRT
- ACAP VCK190, EQ-ViT (ours)

Figure: latency (ms) comparison across FP32 and INT8; bar labels recovered from the chart: 50.3, 6.69, 7.3, 4.09.

| Hardware Specification | FP32 (TFLOPS) | INT8 (TOPS) | Off-chip BW |
|------------------------|---------------|-------------|-------------|
| AMD FPGA U250 [35]     | 1.2           | 6.95        | 77 GB/s     |
| Nvidia GPU T4 [36]     | 8.1           | 130         | 320 GB/s    |
| AMD ACAP VCK190 [37]   | 6.4           | 102.4       | 25.6 GB/s   |
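One way to read the hardware specification table is to divide peak INT8 throughput by off-chip bandwidth, which gives the number of operations each device must perform per byte fetched before its peak compute becomes reachable (the roofline "ridge point"). The sketch below assumes the "T" figures are tera-operations per second; the function name `ridge_point` and the dictionary layout are illustrative choices, not part of the talk.

```python
# Peak INT8 throughput and off-chip bandwidth, copied from the table.
specs = {
    "AMD FPGA U250": {"int8_tops": 6.95, "bw_gbs": 77.0},
    "Nvidia GPU T4": {"int8_tops": 130.0, "bw_gbs": 320.0},
    "AMD ACAP VCK190": {"int8_tops": 102.4, "bw_gbs": 25.6},
}


def ridge_point(int8_tops, bw_gbs):
    """INT8 operations per off-chip byte needed to reach peak compute."""
    return (int8_tops * 1e12) / (bw_gbs * 1e9)


ridge = {name: ridge_point(**s) for name, s in specs.items()}

for name, ops_per_byte in ridge.items():
    print(f"{name}: {ops_per_byte:.1f} INT8 ops per byte")
```

Under these assumptions the VCK190 needs roughly ten times more data reuse per off-chip byte than the T4, which suggests why keeping intermediate results on chip matters on that platform.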

### **Versal ACAP Architecture**






#### **Heterogeneous Accelerator Architecture**






## **Fine-Grained Pipeline**






### **INT Non-linear Functions (Softmax, GELU)**






#### **Reduces Latency by 10x over Nvidia GPU T4**



## Scale-Out?


- From Heterogeneous Models to Heterogeneous System
- Computation-Communication Aware
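The idea of computation-and-communication-aware mapping can be sketched with a toy example. This is not the H2H algorithm: the layer names, the per-device latencies, the fixed `transfer_ms` penalty and the exhaustive search are all hypothetical, chosen only to show how a mapping that minimizes compute time alone can lose to one that also accounts for transfers between devices.

```python
from itertools import product

# Hypothetical per-layer compute latency (ms) on each device.
compute_ms = {
    "conv": {"gpu": 2.0, "fpga": 3.0},
    "attention": {"gpu": 4.0, "fpga": 1.5},
    "mlp": {"gpu": 1.0, "fpga": 2.5},
}
# Hypothetical cost (ms) when consecutive layers run on different devices.
transfer_ms = 1.2


def latency(layers, assignment):
    """End-to-end latency of a layer chain: compute plus device-to-device transfers."""
    total = 0.0
    for i, (layer, device) in enumerate(zip(layers, assignment)):
        total += compute_ms[layer][device]
        if i > 0 and assignment[i - 1] != device:
            total += transfer_ms
    return total


def greedy_compute_only(layers, devices):
    """Pick the fastest device per layer, ignoring communication."""
    return tuple(min(devices, key=lambda d: compute_ms[layer][d]) for layer in layers)


def exhaustive_comm_aware(layers, devices):
    """Try every assignment and keep the one with the lowest total latency."""
    return min(product(devices, repeat=len(layers)),
               key=lambda a: latency(layers, a))


layers = ["conv", "attention", "mlp"]
devices = ["gpu", "fpga"]

naive = greedy_compute_only(layers, devices)
aware = exhaustive_comm_aware(layers, devices)
naive_ms = latency(layers, naive)
aware_ms = latency(layers, aware)
```

With these made-up numbers the compute-only choice switches devices twice and pays two transfers, while the communication-aware choice accepts a slower first layer to avoid one of them. Exhaustive search is only practical for tiny chains; it stands in here for whatever scheduler a real system would use.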






H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness, DAC 2022

## Lower Latency, Lower Energy



Figure 4: The latency and energy performance comparison.

H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness, DAC 2022


## **Open Source?**

- https://github.com/arc-research-lab/CHARM
- https://dl.acm.org/doi/10.1145/3543622.3573210

Authors: Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays • February 2023 • Pages 153–164 • https://doi.org/10.1145/3543622.3573210

Published: 12 February 2023


CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture (Full Paper accepted to FPGA2023!)





- MIT license

## **Chiplet?**

- H2H -> H2H2H
- Heterogeneous Models to Heterogeneous Chiplet Systems with Heterogeneous Components
- Computation & Communication Aware
- Hierarchical Scheduling & Mapping
- Latency vs Throughput




## Sustainability?



NSF CCF#2324864: Collaborative Research: DESC: Type II: REFRESH: Revisiting Expanding FPGA Real-estate for Environmentally Sustainability Heterogeneous-Systems




## Sustainability?



Fig. 2: REFRESH interposer for integration of homogeneous and heterogeneous monolithic and/or chiplet-based FPGAs.

NSF CCF#2324864: Collaborative Research: DESC: Type II: REFRESH: Revisiting Expanding FPGA Real-estate for Environmentally Sustainability Heterogeneous-Systems


Peipei Zhou is an assistant professor in the Department of Electrical and Computer Engineering at the University of Pittsburgh. Her research interests include design automation, hardware/software co-design, and AI chip design. She has participated in projects with more than \$11M in federal funding (more than \$2M as lead PI).

Her work on FPGA acceleration for deep learning won the 2019 Donald O. Pederson Best Paper Award from the IEEE Council on Electronic Design Automation (CEDA). Her work was also nominated for Best Paper at ISPASS 2018 and ICCAD 2018.

https://peipeizhou-eecs.github.io/ peipei.zhou@pitt.edu




