


# CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture



Peipei Zhou's Group https://peipeizhou-eecs.github.io/

Customized Architecture and Programming Heterogeneous Computing with FPGA, GPU, ASIC, NPU



# Peipei Zhou, Assistant Professor

2023/04 @ 2023 CMU Crossroads

- Experience:
  - 2012-2019: CS Ph.D. @ UCLA with Prof. Jason Cong
  - 2019-2021: Staff Software Engr. @ Enflame, AI ASIC Startup
  - 2021/09-now: Assistant Professor @ Pitt-ECE
- Research:
  - Architecture: FPGA, AI ASIC, GPU, etc.
  - Abstraction: Compiler Support
  - Application: Artificial Intelligence, Genome Pipeline
- Teaching:
  - Reconfigurable Computing in Deep Learning, 2022 Spring
  - Computer Organization and Architecture, 2022 Fall
  - Embedded System, Reconfigurable Computing in Deep Learning, 2023 Spring






Xilinx ACAP: Heterogeneous SoC




Application and Architecture research across three levels: Data Center, Node, and Chip. Logos on the slide: Microsoft Azure, Amazon, Google Cloud, Spark.

- IMSB'15: CS-BWAMEM: A fast and scalable read aligner at the cloud scale for whole genome sequencing
- FCCM'14: A Fully Pipelined and Dynamically Composable Architecture of CGRA
- FPGA'16: Accelerator-Rich Architecture
- FCCM'16: Energy Efficiency of **Full Pipelining**
- DAC'16: Bandwidth **Optimization Through On-Chip Memory Restructuring** for HLS
- ICCAD'16: Caffeine, Towards **Uniformed Representation** and Acceleration for Deep **Convolutional Neural** Networks [Best Paper]
- FCCM'18: Latte: Locality Aware Transformation for **High-Level Synthesis**
- FCCM'18: ST-Accel: A High-Level **Programming Platform for** Streaming Applications on FPGA
- ISPASS'18: Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework (Best **Paper Nominee**)
- ICCAD'18: SODA: Stencil with Optimized Dataflow Architecture (Best Paper Nominee)
- FCCM'20: Algorithm-Hardware Co-design for BQSR Acceleration in Genome Analysis ToolKit
- FPGA'21: MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators
- TECS'21: Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices
- TODAES'21: EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization


Application and architecture work from data center to node to chip (logos on the slide: Microsoft Azure, Amazon, Google Cloud, Spark), with two new papers:

- FPGA'23: CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture, Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3543622.3573210 (New Paper)
- DAC'23: High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design **Challenges and DSE Perspectives**. To appear in Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC '23), July 9–13, 2023, San Francisco, CA, USA. (New Paper)
- IMSB'15: CS-BWAMEM: A fast and scalable read aligner at the cloud scale for whole genome sequencing
- FCCM'18: ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA
- ISPASS'18: Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework (Best **Paper Nominee**)
- ICCAD'18: SODA: Stencil with Optimized Dataflow Architecture (Best Paper Nominee)
- FCCM'20: Algorithm-Hardware Co-design for BQSR Acceleration in Genome Analysis ToolKit
- FPGA'21: MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators
- TECS'21: Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices
- TODAES'21: EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization

## "FPGA is Dead" vs "Long Live the FPGA"

• Single GEMM Energy Efficiency Comparisons




### **INT8 Energy Efficiency**



• End-to-end Application

### FP32 End-to-end Application



MLP

### "X FPGA": AMD/Xilinx Versal ACAP

#### Health: Medical Image Segmentation



#### Recommendation Systems





#### Image Recognition



#### Natural Language processing




### **Background Story**

### Versal ACAP Overview





### Versal ACAP Architecture



### Challenge 1

The computation scaling vs. off-chip communication scaling
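One way to make this challenge concrete is a roofline-style bound. The sketch below is a minimal illustration under stated assumptions: `arithmetic_intensity` and `attainable_gflops` are hypothetical helper names, each matrix is assumed to cross the off-chip boundary exactly once (ideal on-chip reuse), and the peak and bandwidth numbers are placeholders, not Versal measurements.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=4):
    """FLOPs per off-chip byte for C = A @ B, assuming each matrix crosses
    the off-chip boundary exactly once (ideal on-chip reuse)."""
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def attainable_gflops(m, k, n, peak_gflops, bandwidth_gbs):
    """Roofline-style bound: min(compute peak, intensity * bandwidth)."""
    return min(peak_gflops, arithmetic_intensity(m, k, n) * bandwidth_gbs)


# Hypothetical platform numbers, chosen only for illustration.
PEAK_GFLOPS = 6000.0
BANDWIDTH_GBS = 25.0

for size in (64, 512, 4096):
    bound = attainable_gflops(size, size, size, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"{size}^3 MM: at most {bound:.0f} GFLOPS")
```

In this toy model small multiplies are bound by off-chip bandwidth, so adding compute units does not help them unless on-chip reuse grows as well.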



### Versal Programming Model






## Challenge 2

• Huge computation resources in one monolithic accelerator vs. MM layers of small sizes

Matrix Multiply on 384 AIEs






Shape mismatch, waste on both computation and communication
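The waste from shape mismatch can be estimated with a simple padding model. This is a sketch under assumptions: `padded_utilization` is a hypothetical helper, the native tile `TILE` is invented for illustration, and the model assumes a layer is zero-padded up to a multiple of the accelerator's native tile in every dimension.

```python
import math


def padded_utilization(m, k, n, tm, tk, tn):
    """Fraction of useful MACs when an M x K x N multiply is padded up to
    multiples of the accelerator's native tile TM x TK x TN."""
    pm = math.ceil(m / tm) * tm
    pk = math.ceil(k / tk) * tk
    pn = math.ceil(n / tn) * tn
    return (m * k * n) / (pm * pk * pn)


# Hypothetical native tile of a large monolithic accelerator.
TILE = (1024, 128, 1024)

for shape in ((3072, 1024, 4096), (512, 64, 512), (32, 64, 32)):
    util = padded_utilization(*shape, *TILE)
    print(f"{shape}: {util:.1%} of the computed MACs are useful")
```

Under the same assumption the padded operands also have to be moved, so the communication waste follows the computation waste.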

## **Motivation Example**






## Challenge 3

High level complexity

![](_page_17_Figure_2.jpeg)

### Huge Programming Efforts

![](_page_17_Figure_4.jpeg)

## "CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture"

### Challenges

- Heterogeneity makes it non-trivial to design a well-performing system on Versal
- Off-chip bandwidth does not scale as fast as computation
- High parallelism makes a one-size-fits-all solution inefficient

### **Solutions**

- White-box, open-source framework: https://github.com/arc-research-lab/CHARM
- Extensive exploitation of on-chip data reuse
- Two-step accelerator composing algorithm

![](_page_18_Picture_9.jpeg)

# **CHARM Compilation Flow**

CHARM Input and Output (IOP)

- Input
  - 1) MM-based AI model
  - 2) Profiling of off-chip bandwidth
  - 3) Hardware resource constraints
- Output
  - 1) Bitstream running on the AIE and PL (AIE/V++ compiler)
  - 2) Executable running on the ARM CPU (GCC cross compilation)

![](_page_19_Figure_8.jpeg)

### **CHARM Framework Overview**

**CHARM Components** 

- **(1)** Single Accelerator Design Space Exploration
- **(2)** Diverse Accelerator Composer
- **(3)** Runtime Configuration
- **(4)** Automatic Code Generator

![](_page_20_Figure_6.jpeg)

### **CHARM Framework Overview**

**CHARM Components** 

- **(1)** Single Accelerator Design Space Exploration
  - 1) Fully-pipelined AIE Processor Design
  - 2) AIE Array Design with Heavy I/O Reuse
  - 3) PL Design with On-chip Buffer Reuse

![](_page_21_Figure_6.jpeg)

### (1) Single Accelerator Design Space Exploration

### 1) Fully-pipelined AIE Processor Design

```
// Pseudocode from the slide: MatMul_* and [acc0; acc1] are shorthand, not AIE intrinsics.
// Slide callouts on this listing: "3 Lines of Code" and "~100 Lines of Code".
#define TI 32
#define TK 32
#define TJ 32
// PI, PK, PJ: per-iteration block sizes, defined elsewhere (not shown on the slide).
void mm_kernel(
    input_window_float * restrict L,    // LHS
    input_window_float * restrict R,    // RHS
    output_window_float * restrict O) { // Output
    preload(L, R);
    for (int i = 0; i < TI/PI; i++)
    chess_prepare_for_pipelining
    chess_loop_range(TI/PI, ) {
        for (int j = 0; j < TJ/PJ; j++)
        chess_prepare_for_pipelining
        chess_loop_range(TJ/PJ, ) {
            v8float acc0 = null_v8float();
            v8float acc1 = null_v8float();
            // Accumulate over K in registers, without storing partial sums.
            for (int k = 0; k < TK/PK - 1; k++)
            chess_prepare_for_pipelining
            chess_loop_range(TK/PK - 1, ) {
                [acc0; acc1] = MatMul_without_store(
                    L(i, k), R(k, j), [acc0; acc1]);
            }
            // Last K step: fuse the final multiply-accumulate with the store.
            MatMul_with_store(L(i, TK/PK - 1), R(TK/PK - 1, j),
                              [acc0; acc1], O(i, j));
        }
    }
}
```
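The loop structure of the kernel can be mirrored in a plain Python reference model, which may help when checking tile indexing. This is a sketch under assumptions: `tiled_matmul` and its parameters are illustrative names, the block sizes in the example are made up, and the model captures only the accumulate-then-store order, not AIE vector instructions or software pipelining.

```python
def tiled_matmul(lhs, rhs, ti, tk, tj, pi, pk, pj):
    """Reference model of the kernel's loop nest on plain Python lists.

    lhs is ti x tk and rhs is tk x tj. Each innermost step multiplies a
    pi x pk block by a pk x pj block and adds it to a local accumulator.
    """
    out = [[0.0] * tj for _ in range(ti)]
    for i in range(ti // pi):
        for j in range(tj // pj):
            # The accumulator stays local for the whole K loop.
            acc = [[0.0] * pj for _ in range(pi)]
            for k in range(tk // pk):
                for a in range(pi):
                    for b in range(pj):
                        for c in range(pk):
                            acc[a][b] += (lhs[i * pi + a][k * pk + c]
                                          * rhs[k * pk + c][j * pj + b])
            # One store per output block, after the last K step.
            for a in range(pi):
                out[i * pi + a][j * pj:(j + 1) * pj] = acc[a]
    return out


lhs = [[1.0, 2.0], [3.0, 4.0]]
rhs = [[5.0, 6.0], [7.0, 8.0]]
print(tiled_matmul(lhs, rhs, ti=2, tk=2, tj=2, pi=1, pk=1, pj=1))
# prints [[19.0, 22.0], [43.0, 50.0]]
```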

![](_page_22_Figure_4.jpeg)

#### Table 2: Single AIE MM comparison under fp32 data type.

| Size (M × K × N)         | H-GCN [48] MACs/Cyc | H-GCN [48] Eff | CHARM (this work) MACs/Cyc | CHARM (this work) Eff | Eff gain |
|--------------------------|---------------------|----------------|----------------------------|-----------------------|----------|
| $16 \times 16 \times 16$ | 2.34                | 29.30%         | 6.18                       | 77.22%                | 2.64x    |
| $32 \times 32 \times 32$ | 3.64                | 45.50%         | 7.57                       | 94.70%                | 2.08x    |
| $64 \times 64 \times 8$  | 3.64                | 45.50%         | 7.54                       | 94.29%                | 2.07x    |

[48] Zhang, Chengming, et al. "H-GCN: A graph convolutional network accelerator on Versal ACAP architecture." FPL'22.
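The efficiency and gain columns of Table 2 can be approximately reproduced from the MACs/Cyc columns. The sketch below assumes a peak of 8 fp32 MACs per cycle per AIE; that value is inferred from the table itself (MACs/Cyc divided by Eff is about 8 in every row) and is not stated on the slide. `efficiency` is a hypothetical helper name, and small differences from the published percentages are expected because the MACs/Cyc values are rounded.

```python
PEAK_MACS_PER_CYCLE = 8  # assumption inferred from the table, not stated on the slide


def efficiency(macs_per_cycle, peak=PEAK_MACS_PER_CYCLE):
    """Fraction of the assumed peak MAC rate that is achieved."""
    return macs_per_cycle / peak


# (H-GCN MACs/Cyc, CHARM MACs/Cyc) per problem size, copied from Table 2.
rows = {
    "16x16x16": (2.34, 6.18),
    "32x32x32": (3.64, 7.57),
    "64x64x8": (3.64, 7.54),
}

for size, (hgcn, charm) in rows.items():
    gain = efficiency(charm) / efficiency(hgcn)
    print(f"{size}: H-GCN {efficiency(hgcn):.1%}, "
          f"CHARM {efficiency(charm):.1%}, gain {gain:.2f}x")
```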

**(1)** Single Accelerator Design Space Exploration

![](_page_23_Figure_3.jpeg)

![](_page_24_Figure_3.jpeg)

![](_page_25_Figure_3.jpeg)

![](_page_26_Figure_3.jpeg)

![](_page_27_Figure_3.jpeg)

![](_page_28_Figure_3.jpeg)

![](_page_29_Figure_2.jpeg)

Estimated throughput: 2.8 TFLOPS

### **CHARM Framework Overview**

#### **CHARM Components**

- (2) Diverse Accelerator Composer
  - 1) Workload Assignment
  - 2) Hardware Resource Partitioning

![](_page_30_Figure_5.jpeg)

## **CHARM: Diverse Accelerator Composing**

**Two-step Search** Algorithm

How to assign **M layers** in one AI model to **N accelerators**?

[Figure: example with 5 layers and 2 accelerators (ACC0, ACC1); resources shown include AIE, **PLIO**, URAM, and BRAM.]

- 1st Step: Workload Assignment
- 2nd Step: Hardware Resource Partitioning
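The split into workload assignment and resource partitioning can be illustrated with a toy brute-force search. This is not CHARM's algorithm or cost model: `acc_time`, `partitions`, and `two_step_search` are hypothetical names, the layer numbers are invented, and the cost model simply assumes that a layer cannot use more than a fixed number of resource units and that accelerators run concurrently, so throughput is set by the slowest one.

```python
from itertools import product


def acc_time(layers, units):
    """Time for one accelerator with `units` resources to run its layers in
    sequence. Each layer is (work, max_parallelism)."""
    return sum(work / min(units, par) for work, par in layers)


def partitions(total, parts):
    """Yield every way to split `total` units into `parts` positive shares."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in partitions(total - first, parts - 1):
            yield (first,) + rest


def two_step_search(layers, num_accs, total_units):
    """Return (period, assignment, split) with the smallest period found."""
    best = (float("inf"), None, None)
    # Step 1: workload assignment, layer i goes to accelerator assign[i].
    for assign in product(range(num_accs), repeat=len(layers)):
        groups = [[layer for layer, a in zip(layers, assign) if a == acc]
                  for acc in range(num_accs)]
        if any(not group for group in groups):
            continue  # every accelerator must receive at least one layer
        # Step 2: hardware resource partitioning for this assignment.
        for split in partitions(total_units, num_accs):
            period = max(acc_time(g, u) for g, u in zip(groups, split))
            if period < best[0]:
                best = (period, assign, split)
    return best


# Invented workload: two large layers and three small ones.
LAYERS = [(4096, 16), (4096, 16), (256, 2), (256, 2), (256, 2)]

print("one accelerator: ", two_step_search(LAYERS, 1, 16))
print("two accelerators:", two_step_search(LAYERS, 2, 16))
```

In this toy setting the two-accelerator design groups the large layers on a large accelerator and the small layers on a small one, which shortens the period compared with a single monolithic accelerator.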

![](_page_32_Figure_1.jpeg)

![](_page_33_Figure_1.jpeg)

Single Accelerator Design

### 5.54x and 1.93x gains in throughput and energy efficiency, respectively

![](_page_34_Figure_3.jpeg)

DSE within 5% error rate

Square Matrix Size

AutoSA: Wang, Jie, Licheng Guo, and Jason Cong. "AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA." *The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. 2021.

• Applications

### BERT-Large (Natural Language Processing)

#### Example: French to English translation

![](_page_35_Figure_4.jpeg)

### Neural Collaborative Filtering (Recommendation)

![](_page_35_Figure_7.jpeg)

### Multilayer Perceptrons (Regression, Classification)

### Vision Transformer (Classification)

![](_page_35_Figure_10.jpeg)

![](_page_35_Picture_11.jpeg)

• Applications Characteristics

**Even Larger Shape Variance** 

![](_page_36_Figure_3.jpeg)

### Large layers Dominate the Application

![](_page_36_Picture_5.jpeg)

![](_page_36_Picture_6.jpeg)

NCF

MLP

### Composing Diverse Accelerators

Throughput Comparison Among Different CHARM Strategies

- BERT: 1.46 TFLOPS, 5.3x gain over one monolithic design
- ViT: 1.61 TFLOPS, 32.5x gain over one monolithic design
- NCF: 1.74 TFLOPS for one accelerator design
- MLP: 2.94 TFLOPS for one accelerator design
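The reported gains imply how low the monolithic throughput is on BERT and ViT. A short calculation, assuming each gain is the ratio of composed throughput to monolithic throughput on the same model (`implied_baseline` is a hypothetical helper name):

```python
def implied_baseline(tflops, gain):
    """Monolithic throughput implied by a composed throughput and its gain."""
    return tflops / gain


# (composed TFLOPS, gain over one monolithic design), copied from the slide.
composed = {"BERT": (1.46, 5.3), "ViT": (1.61, 32.5)}

for model, (tflops, gain) in composed.items():
    baseline_gflops = implied_baseline(tflops, gain) * 1000
    print(f"{model}: monolithic design at about {baseline_gflops:.0f} GFLOPS")
```

Under that assumption the monolithic design reaches roughly 275 GFLOPS on BERT and roughly 50 GFLOPS on ViT.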

![](_page_37_Figure_5.jpeg)

Implementation Layout

Two Diverse Accelerators Design of BERT

![](_page_38_Figure_3.jpeg)

![](_page_38_Picture_4.jpeg)

### **Thank You & Questions**

1. Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex K. Jones, Jingtong Hu, Deming Chen, Jason Cong, **Peipei Zhou** (2023). CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3543622.3573210

2. Jinming Zhuang, Zhuoping Yang, **Peipei Zhou** (2023). High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives. Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC '23), July 9–13, 2023, San Francisco, CA, USA. Full paper accepted (acceptance ratio: 23%).

https://peipeizhou-eecs.github.io/

https://github.com/arc-research-lab/CHARM

https://dl.acm.org/doi/10.1145/3543622.3573210

![](_page_39_Picture_6.jpeg)

![](_page_39_Picture_7.jpeg)

![](_page_39_Picture_8.jpeg)

![](_page_39_Picture_9.jpeg)

![](_page_39_Picture_10.jpeg)

![](_page_39_Picture_11.jpeg)

![](_page_39_Picture_12.jpeg)

### **Backup Slides: Experiment Results**

### Search Time Comparison for BERT

Two-step search algorithm: 170 s vs. exhaustive search: 33 min

![](_page_40_Figure_3.jpeg)