

## **Motivation and Contributions**

### Motivation

- Customized pipeline design is one of the most important optimizations and is widely used to improve the performance of FPGA accelerators.
- The impact of the pipeline initiation interval (II) on the energy efficiency of accelerator designs remains unclear.

### Contributions

- Provide a set of high-level yet accurate analytical models to investigate the impact of the pipeline II on the energy consumption of FPGA accelerators designed in high-level synthesis (HLS).
- Provide insight into why II > 1 is optimal for matrix multiply.
- Identify sources of inefficient mapping in the commercial HLS flow.

- II determines performance, resource usage, and energy.
- Some code rewriting is needed to achieve an efficient architecture.

```c
void matrix_multiply(float a[N][N], float b[N][N], float c[N][N]) {
    int i, j, k, p;
    k_loop: for (k = 0; k < N; k++)
        i_loop: for (i = 0; i < N; i++)  // i loop PIPELINE II = II_i
            p_loop: for (p = 0; p < N; p += N/II_i) {
#pragma HLS PIPELINE II=1
                j_loop: for (j = 0; j < N/II_i; j++) {
#pragma HLS UNROLL
                    c[i][p+j] += a[i][k] * b[k][p+j];
                }
            }
}
```

- There are N/II multipliers and N/II adders.
- There are N/II independent memory banks for the b matrix, each with II columns.
- Number of cycles: $N^2 \times II$

**Interconnect**

- Wire transferring broadcast data.
- $N \times II <$ BRAM18K: DSP area dominated. As II increases, PE area scales as $N/II$:

$$E_{wire.share.A} \propto \frac{N}{II} \times N^2 = \frac{N^3}{II}$$

- $N \times II >$ BRAM18K: BRAM area dominated. As II increases further, PE area no longer scales as $N/II$:

$$E_{wire.share.A} \propto N^2 \times N^2 = N^4$$
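The resource and cycle counts quoted for the restructured kernel can be checked with a small software model. The sketch below is a plain C illustration, not the synthesized design: the helper names (`pe_count`, `cycle_count`, `matmul_ii`, `matmul_ref`, `kernels_agree`) are hypothetical, `N` is reduced to 8 for quick testing, II is passed as a runtime argument rather than fixed at compile time, and II is assumed to divide N.

```c
#include <assert.h>

#define N 8 /* illustrative size only; the results on this poster use N = 64 */

/* Number of multiplier/adder pairs instantiated for a given II. */
int pe_count(int n, int ii) {
    assert(ii > 0 && n % ii == 0); /* sketch assumes II divides N */
    return n / ii;
}

/* Cycle count quoted on the poster: N^2 x II. */
long cycle_count(int n, int ii) {
    return (long)n * n * ii;
}

/* Software model of the restructured kernel. ii is a runtime argument here
 * (an assumption for testing); in HLS it would be a compile-time constant. */
void matmul_ii(int ii, float a[N][N], float b[N][N], float c[N][N]) {
    int width = pe_count(N, ii); /* columns updated in parallel per step */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int p = 0; p < N; p += width)      /* pipelined in HLS */
                for (int j = 0; j < width; j++)     /* fully unrolled in HLS */
                    c[i][p + j] += a[i][k] * b[k][p + j];
}

/* Plain triple-loop reference implementation. */
void matmul_ref(float a[N][N], float b[N][N], float c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Returns 1 if both kernels produce identical results for this II. */
int kernels_agree(int ii) {
    float a[N][N], b[N][N], c_ii[N][N], c_ref[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            /* small integer values keep float accumulation exact */
            a[i][j] = (float)((3 * i + j) % 5);
            b[i][j] = (float)((i + 2 * j) % 7);
            c_ii[i][j] = 0.0f;
            c_ref[i][j] = 0.0f;
        }
    matmul_ii(ii, a, b, c_ii);
    matmul_ref(a, b, c_ref);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c_ii[i][j] != c_ref[i][j])
                return 0;
    return 1;
}
```

For example, `pe_count(64, 8)` gives 8 multiplier/adder pairs and `cycle_count(64, 8)` gives 32768 cycles, while `kernels_agree(4)` confirms that splitting each row into N/II-wide chunks leaves the computed product unchanged.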



**Total**

$$E_{total} = E_{compute} + E_{memory} + E_{wire} + E_{leak} = \begin{cases} N^3 \left( c_1 + \frac{c_2}{II} \right), & \text{if } N \times II \leq \text{BRAM18K} \\ N^3 \left( c_3 + c_4 \times N + c_5 \times II^{0.5} \right), & \text{if } N \times II > \text{BRAM18K} \end{cases}$$

The wiring between the private b and c memory banks and the PE logic also grows as the square root of the memory capacity, hence the $II^{0.5}$ term.

## Results

### Experimental Setup

- Virtex-7 XC7VX485T chip using Vivado 2015.1.5
- Simulated each mapped design in Vivado with random a and b matrices
- Estimated energy from a Switching Activity Interchange Format (SAIF) file generated by post-implementation simulation

◆ N = 64, II = 8 is optimal

◆ N = 64, II = 1 is within 5% of II = 8

- Energy decreases while $N \times II <$ BRAM18K, due to interconnect savings
- Energy increases once $N \times II >$ BRAM18K
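The piecewise total-energy model and the trend on either side of the BRAM18K boundary can be explored with a short sketch. Everything beyond the model's functional form is an assumption made for illustration: the names `energy_coeffs`, `total_energy`, `best_ii`, and `bram_words` are hypothetical, the boundary is expressed as a word capacity passed in by the caller, only power-of-two II values that divide N are swept, and `EXAMPLE_COEFFS` holds made-up constants rather than fitted values from the experiments.

```c
#include <assert.h>

/* Model constants c1..c5; real values would come from fitting measurements. */
typedef struct {
    double c1, c2, c3, c4, c5;
} energy_coeffs;

/* Made-up constants for illustration only (NOT fitted to measured data). */
const energy_coeffs EXAMPLE_COEFFS = {1.0, 1.0, 1.0, 0.01, 0.1};

/* Newton iteration for sqrt(x), x >= 1; avoids a libm link dependency. */
double sqrt_newton(double x) {
    double r = x;
    for (int i = 0; i < 40; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Piecewise model: bram_words is the assumed capacity threshold that
 * N x II is compared against ("BRAM18K" in the equation). */
double total_energy(int n, int ii, int bram_words, const energy_coeffs *k) {
    assert(n > 0 && ii > 0);
    double n3 = (double)n * n * n;
    if ((long)n * ii <= bram_words)
        return n3 * (k->c1 + k->c2 / ii);
    return n3 * (k->c3 + k->c4 * n + k->c5 * sqrt_newton((double)ii));
}

/* Sweep power-of-two II values that divide N and return the minimizer. */
int best_ii(int n, int bram_words, const energy_coeffs *k) {
    int best = 1;
    double best_e = total_energy(n, 1, bram_words, k);
    for (int ii = 2; ii <= n; ii *= 2) {
        if (n % ii != 0)
            continue;
        double e = total_energy(n, ii, bram_words, k);
        if (e < best_e) {
            best_e = e;
            best = ii;
        }
    }
    return best;
}
```

With the illustrative constants and an assumed threshold of 512 words, `best_ii(64, 512, &EXAMPLE_COEFFS)` returns 8: energy falls as II grows up to the boundary and rises beyond it. The location of the minimum depends entirely on the chosen constants and threshold, so this only illustrates the shape of the model, not the measured result.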

## **Conclusion & Future Work**

### Conclusion

- Interconnect energy within our matrix-multiply kernel is minimized at an II > 1.
- With efficient power gating or alternate use of chip resources, the minimum total energy occurs at a point other than the fully pipelined II = 1 point.

### Future Work

- The energy modeling framework illustrated here can be adapted to other kernels.
- Characterize how these energy components scale for other tasks and develop a suitably parameterized energy model.



