# EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) 2024

P. Dong\*, J. Zhuang\*, Z. Yang, S. Ji, Y. Li, D. Xu, H. Huang, J. Hu, A.K. Jones, Y. Shi, Y. Wang, P. Zhou

\*Co-first authors

Massachusetts Institute of Technology; Brown University; Northeastern University; North Carolina State University; Syracuse University; University of Maryland, College Park; University of Pittsburgh; University of Notre Dame

peggy281@mit.edu jinming\_zhuang@brown.edu yanz.wang@northeastern.edu peipei\_zhou@brown.edu





Autonomous Driving












Radio Access Network









### FPGA vs. GPU?



#### Hardware Specification

| Platform               | FP32  | INT8   | Off-Chip<br>BW |
|------------------------|-------|--------|----------------|
| NVIDIA A10G<br>8nm GPU | 35 T  | 140 T  | 600 GB/s       |
| AMD U250<br>16nm FPGA  | 1.2 T | 6.95 T | 77 GB/s        |



**DeiT-T Latency** (figure): latency in ms at FP32 and INT8 for FPGA U250 with HeatViT versus A10G GPU with TensorRT. Bar labels recovered from the chart: 50.3, 7.3, 2.2, 1.78.

### FPGA + Vector Processor?

**DeiT-T Latency** 



EMBEDDED SYSTEMS WEEK















• Time breakdown of EQ-ViT on Versal and TensorRT on A10G GPU for DeiT-T




1. Non-MM kernels account for <1% of operations but take about 41% of the time.
2. Tensor core utilization for MM is low: ~23 TOPS, 16% of the peak INT8 throughput (140 TOPS).
3. TensorRT adopts an implicit quantization policy: BMM uses the FP32 data type.
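The utilization figure in the second observation follows directly from the quoted numbers. A minimal sketch of the arithmetic is given below. The final speedup bound is an illustration added here: it assumes the non-MM share of runtime could be removed entirely, which is an idealization and not a claim from the talk.

```python
# Figures quoted in the observations above.
achieved_int8_tops = 23.0   # approximate MM throughput under TensorRT
peak_int8_tops = 140.0      # A10G peak INT8 throughput
non_mm_time_share = 0.41    # share of runtime spent in non-MM kernels

# Tensor core utilization for the MM kernels.
utilization = achieved_int8_tops / peak_int8_tops

# Idealized upper bound on end-to-end speedup if non-MM time vanished
# (assumption for illustration only).
ideal_speedup = 1.0 / (1.0 - non_mm_time_share)

print(f"utilization: {utilization:.1%}")
print(f"ideal speedup without non-MM time: {ideal_speedup:.2f}x")
```

The bound is small compared with the gap to peak throughput, which suggests that both the non-MM kernels and the MM utilization need attention.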












- Inputs
  - 1) Transformer models
  - 2) Accuracy constraint
  - 3) Latency constraint
  - 4) Hardware constraints

- Outputs
  - 1) Quantization strategy
  - 2) Hardware optimization strategy
  - 3) Automatically generated hardware design











• Specialized MM kernel design

3D parallelism at two levels of the hierarchy: across the AIE array (A, B, C) and within SIMD instructions (PI, PK, PJ)
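To make the two-level 3D parallelism concrete, the sketch below tiles an integer matrix multiply along all three loop dimensions twice: an outer tiling standing in for the AIE array factors (A, B, C) and an inner tiling standing in for the SIMD factors (PI, PK, PJ). The mapping of A, B, C to the i, k, j dimensions, the factor values, and the names `tiled_mm` and `simd_block` are assumptions for illustration. Plain Python runs the tiles one after another, whereas on hardware each outer tile would be placed on a separate engine.

```python
import random
from itertools import product

def simd_block(out, lhs, rhs, ib, kb, jb, pi, pk, pj):
    """Accumulate one PI x PK x PJ block, the unit one SIMD instruction would cover."""
    for i in range(ib, ib + pi):
        for j in range(jb, jb + pj):
            out[i][j] += sum(lhs[i][k] * rhs[k][j] for k in range(kb, kb + pk))

def tiled_mm(lhs, rhs, array_factors=(2, 2, 2), simd_factors=(4, 8, 4)):
    """Matrix multiply tiled at two levels along i, k, and j.

    array_factors = (A, B, C): hypothetical tile counts along i, k, j.
    simd_factors = (PI, PK, PJ): hypothetical block size inside one tile.
    """
    m, k_dim, n = len(lhs), len(rhs), len(rhs[0])
    assert all(len(row) == k_dim for row in lhs)
    a, b, c = array_factors
    pi, pk, pj = simd_factors
    assert m % a == 0 and k_dim % b == 0 and n % c == 0
    ti, tk, tj = m // a, k_dim // b, n // c  # sub-problem owned by one tile
    assert ti % pi == 0 and tk % pk == 0 and tj % pj == 0

    out = [[0] * n for _ in range(m)]
    # Level 1: iterations that the array would execute in parallel.
    for ci, ck, cj in product(range(a), range(b), range(c)):
        i0, k0, j0 = ci * ti, ck * tk, cj * tj
        # Level 2: blocks processed inside one tile.
        for ib, kb, jb in product(range(i0, i0 + ti, pi),
                                  range(k0, k0 + tk, pk),
                                  range(j0, j0 + tj, pj)):
            simd_block(out, lhs, rhs, ib, kb, jb, pi, pk, pj)
    return out

rng = random.Random(0)
x = [[rng.randint(-8, 7) for _ in range(32)] for _ in range(16)]
w = [[rng.randint(-8, 7) for _ in range(16)] for _ in range(32)]
result = tiled_mm(x, w)
reference = [[sum(x[i][k] * w[k][j] for k in range(32)) for j in range(16)]
             for i in range(16)]
print(result == reference)
```

Tiles that share the same (i, j) range but differ along k write to the same output block, so a parallel implementation has to reduce their partial sums.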








• Fine-grain pipelined non-MM kernel design



• Non-linear Kernels








#### Can We Quantize ViTs to Low Bit-Widths (e.g., 8-bit) with Enhanced Accuracy?




#### **Quantization Algorithm:**

- ViT Quantization
  - No prior work quantizes ViTs to 8 bits with accuracy higher than full precision

| Method         | #Bits    | DeiT-T [43] | DeiT-S [43] | DeiT-B [43] | Swin-T [33] | Swin-S [33] |
|----------------|----------|-------------|-------------|-------------|-------------|-------------|
| Full Precision | 32/32/32 | 72.21       | 79.85       | 81.85       | 81.35       | 83.2        |
| PTQ            |          |             |             |             |             |             |
| MinMax         | 8/8/8    | 70.94       | 75.05       | 78.02       | 64.38       | 74.37       |
| EMA            | 8/8/8    | 71.17       | 75.71       | 78.82       | 70.81       | 75.05       |
| Percentile     | 8/8/8    | 71.47       | 76.57       | 78.37       | 78.78       | 78.12       |
| OMSE           | 8/8/8    | 71.3        | 75.03       | 79.57       | 79.3        | 78.96       |
| Bit-Split      | 8/8/8    | -           | 77.06       | 79.42       | -           | -           |
| PTQ for ViT    | 8/8/8    | -           | 77.47       | 80.48       | -           | -           |
| FQ-ViT         | 8/8/8    | 71.61       | 79.17       | 81.2        | 80.51       | 82.71       |
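For reference, the MinMax row is the simplest of the post-training baselines: the quantization range is taken from the observed minimum and maximum of the tensor. The sketch below shows asymmetric 8-bit MinMax quantization in plain Python. The function names and sample values are illustrative assumptions, and this is not the implementation evaluated in the table. It also shows why one large outlier hurts: it stretches the range and coarsens the step for every other value.

```python
def minmax_quantize(values, num_bits=8):
    """Asymmetric uniform quantization with a MinMax range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # Fall back to a unit step when the input is constant.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Map integer codes back to real values."""
    return [(q - zero_point) * scale for q in codes]

activations = [-0.50, -0.10, 0.00, 0.20, 0.75]
codes, scale, zp = minmax_quantize(activations)
restored = dequantize(codes, scale, zp)
max_error = max(abs(a - r) for a, r in zip(activations, restored))

# Same values plus one synthetic outlier (assumption for illustration).
with_outlier = activations + [40.0]
codes_o, scale_o, zp_o = minmax_quantize(with_outlier)
restored_o = dequantize(codes_o, scale_o, zp_o)
max_error_o = max(abs(a - r) for a, r in zip(activations, restored_o[:-1]))

print(f"step without outlier: {scale:.4f}, max error: {max_error:.4f}")
print(f"step with outlier:    {scale_o:.4f}, max error: {max_error_o:.4f}")
```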





# EQ-ViT Data Analysis

- Data distributions inside ViTs
  - Long-tail distribution
  - Substantial outliers

Long-tail distribution: attention matrix and activations after GELU

Channel-wise outliers: fixed layer, fixed channel, and fixed data range
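One way to look for channel-wise outliers of this kind is to compare the value range of each channel with the typical range across channels. The sketch below does this on synthetic data. The threshold rule, the `ratio` parameter, the function name, and the injected outlier channel are all assumptions for illustration and do not reproduce the analysis behind the slides.

```python
import random
from statistics import median

def outlier_channels(activations, ratio=8.0):
    """Return channels whose max |value| exceeds ratio times the median channel range."""
    num_channels = len(activations[0])
    ranges = [max(abs(row[c]) for row in activations) for c in range(num_channels)]
    typical = median(ranges)
    return [c for c, r in enumerate(ranges) if r > ratio * typical]

rng = random.Random(0)
tokens, channels = 32, 16
# Synthetic activations laid out as [token][channel].
acts = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(tokens)]
for row in acts:
    row[5] *= 50.0  # synthetic outlier channel (assumption for illustration)

flagged = outlier_channels(acts)
print(flagged)
```

If the same channels are flagged in the same layers across inputs, their range can be handled separately instead of widening the quantization range of the whole layer.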



# **EQ-ViT Software Solution**

- Two specific data distributions inside ViTs
  - Long-tail distribution
  - Substantial outliers
- Sub-8-bit: activation-aware full quantization
  - Log2 quantization
  - Outlier-aware training with $2^x$ adaptation
  - Token pruning regularization

Layer-wise Uniform Quantization with $2^x$: scaling by $2^x$ can be supported efficiently by bit shifts on the FPGA.

Figure 4: Activation Quantization With Token Pruning.
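The two quantizers named in this section can be illustrated in a few lines. The sketch assumes a common formulation of log2 quantization for values in (0, 1], such as post-softmax attention probabilities, where the code is the rounded negative base-2 logarithm. It also assumes a uniform quantizer whose scale is constrained to $2^x$, so that rescaling an integer reduces to a bit shift. Function names, bit-widths, and sample values are assumptions, and this is not the authors' implementation.

```python
import math

def log2_quantize(p, num_bits=4):
    """Map p in (0, 1] to an integer code q so that p is approximated by 2**-q."""
    qmax = 2 ** num_bits - 1
    if p <= 0.0:
        return qmax  # smallest representable magnitude
    return min(qmax, max(0, round(-math.log2(p))))

def log2_dequantize(q):
    """Recover the power-of-two approximation of the original value."""
    return 2.0 ** -q

def pow2_exponent(max_abs, num_bits=8):
    """Smallest x such that max_abs / 2**x fits a signed num_bits integer (max_abs > 0)."""
    qmax = 2 ** (num_bits - 1) - 1
    return math.ceil(math.log2(max_abs / qmax))

def rescale_by_shift(acc, shift):
    """Multiply an integer by 2**-shift with a shift; right shifts round toward -inf."""
    return acc >> shift if shift >= 0 else acc << -shift

# Long-tail example: most attention mass sits in a few entries.
attention = [0.5, 0.26, 0.12, 0.06, 0.03, 0.03]
codes = [log2_quantize(p) for p in attention]
approx = [log2_dequantize(q) for q in codes]

x = pow2_exponent(max_abs=6.0)       # scale = 2**x
quantized = round(3.0 / 2 ** x)      # uniform quantization with a power-of-two scale
shifted = rescale_by_shift(1000, 3)  # 1000 * 2**-3 without a multiplier

print(codes, approx)
print(x, quantized, shifted)
```

Log2 codes spend their resolution near zero, where a long-tail distribution has most of its values, while the power-of-two scale keeps rescaling free of multipliers.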



#### Application accuracy performance

**On ImageNet**: EQ-ViT improves task accuracy by up to 2.4% over the baseline and by up to 6.2% over other SOTA methods.

**On CIFAR-100**: EQ-ViT improves task accuracy by up to 1.4% over the baseline and by up to 1.8% over other SOTA methods.











- Hardware performance comparisons across different solutions
  - EQ-ViT on VCK190 achieves 13.1x and 3.4x average latency reductions compared with U250 and A10G, respectively
  - An estimate of EQ-ViT on VEK280 shows a further 1.7x average latency reduction over VCK190

Figure: INT8 Latency Comparison, Batch = 6



### **Open-Source Tool**

GitHub Link: https://github.com/arc-research-lab/CHARM

CHARM: Composing Heterogeneous Accelerators on Versal ACAP Architecture (MIT license)



# Thank You & Welcome to Questions


National Science Foundation

