# ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems

Great Lakes Symposium on VLSI (GLSVLSI) 2025, New Orleans, LA, USA

Shixin Ji\*, Xingzhen Chen\*, Wei Zhang\*, Zhuoping Yang\*, Jinming Zhuang\*, Sarah Schultz\*, Yukai Song‡, Jingtong Hu‡, Alex K. Jones§, Zheng Dong†, Peipei Zhou\*

Brown University\*; Wayne State University†; University of Pittsburgh‡; Syracuse University§

shixin\_ji@brown.edu, peipei\_zhou@brown.edu

https://peipeizhou-eecs.github.io/

#### Proliferation of Real-time Safety-critical Systems

- Autonomous driving
- Robotics<sup>[2]</sup>
- **UAVs**

#### If the accelerator does not finish on time:

Real-time safety-critical system: each frame/job matters.

- A task set: multiple tasks
- Latency bounded --> system predictable

#### Motivation Example: 2 Tasks, 1 Accelerator

- Task 1: 10 ms execution time, 20 ms period
- Task 2: 190 ms execution time, 400 ms period
- (assuming deadline = period)

Non-preemptive scheduling:

- T=0: execute task 1 job 1 first, postpone task 2 job 1
- Once task 2 job 1 starts, it holds the accelerator for 190 ms, so the task 1 jobs released in the meantime miss their 20 ms deadlines

Optimize the latency --> 5x faster: still miss deadline!

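
The arithmetic behind the missed deadline can be checked with a small sketch. It uses a simplified bound: a short job released right as a long job starts must wait for the whole long job and then run itself. The function names are hypothetical. Reading "5x faster" as dividing both execution times by 5 is an assumption, not something the slides state.

```python
def blocked_response_ms(short_exec_ms: float, long_exec_ms: float) -> float:
    """Response time of a short job released right as a long job starts.

    Non-preemptive model: the short job waits for the entire long job,
    then runs to completion itself.
    """
    return long_exec_ms + short_exec_ms


def misses_deadline(short_exec_ms: float, long_exec_ms: float,
                    short_deadline_ms: float) -> bool:
    """True if the blocked short job finishes after its deadline."""
    return blocked_response_ms(short_exec_ms, long_exec_ms) > short_deadline_ms


# Numbers from the example: task 1 = 10 ms (20 ms deadline), task 2 = 190 ms.
baseline = blocked_response_ms(10, 190)

# Assumption: "5x faster" divides both execution times by 5.
accelerated = blocked_response_ms(10 / 5, 190 / 5)

print(baseline, misses_deadline(10, 190, 20))
print(accelerated, misses_deadline(10 / 5, 190 / 5, 20))
```

Under this reading, the accelerated task 2 still occupies the accelerator for 38 ms, which is longer than task 1's 20 ms period, so speeding up alone does not remove the blocking.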
- Endless latency acceleration is impractical
- A more feasible solution is needed

Assume we can swap out the ongoing task and resume its execution later:

- T=20: swap out task 2 job 1
- T=30: **swap in** task 2 job 1
- The tasks can continue running, and the system won't fail

Preemption (swap out + swap in) is a viable solution!

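
The swap-out/swap-in timeline can be reproduced with a minimal simulation sketch. It assumes an ideal, zero-overhead preemptive earliest-deadline-first (EDF) policy, the policy this deck names for its scheduler, stepped in 1 ms increments. The `Task` class and `simulate_preemptive_edf` are hypothetical names for illustration. The two-task set has utilization 10/20 + 190/400 = 0.975, so it fits on one accelerator once preemption is allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    exec_ms: int
    period_ms: int  # deadline is assumed equal to the period


def simulate_preemptive_edf(tasks, horizon_ms):
    """Simulate ideal (zero-overhead) preemptive EDF in 1 ms steps.

    Returns (trace, misses): trace[t] is the task name running during
    [t, t+1), or None when idle; misses counts jobs unfinished at deadline.
    """
    jobs = []    # each pending job is [deadline, remaining_ms, task_name]
    trace = []
    misses = 0
    for t in range(horizon_ms):
        for task in tasks:
            if t % task.period_ms == 0:  # periodic job release
                jobs.append([t + task.period_ms, task.exec_ms, task.name])
        misses += sum(1 for job in jobs if job[0] <= t)
        jobs = [job for job in jobs if job[0] > t]  # drop jobs that missed
        if not jobs:
            trace.append(None)
            continue
        job = min(jobs)  # earliest deadline first
        job[1] -= 1
        trace.append(job[2])
        if job[1] == 0:
            jobs.remove(job)
    misses += sum(1 for job in jobs if job[0] <= horizon_ms)
    return trace, misses


tasks = [Task("task1", 10, 20), Task("task2", 190, 400)]
trace, misses = simulate_preemptive_edf(tasks, 400)
print(misses, trace[19], trace[20], trace[30])
```

In this model task 2 runs until T=20, is swapped out while task 1 job 2 executes, and is swapped back in at T=30, matching the timeline above. Real preemption adds overhead that this sketch deliberately ignores.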
#### **Existing FPGA Accelerators**

- Low latency, high throughput
- Configs + fixed Acc shape
- Cannot be applied to real-time, safety-critical systems

How can we enable them to support real-time safety-critical systems? Preemption --> meet deadlines.

## How to Implement Preemption in Accelerator Hardware

- Existing accs cannot meet deadlines
  - Accelerator customization to enable preemption
  - **ART hardware framework**: lightweight invasive HW changes; control logic design enabling preemption
- Dynamic scheduling algorithm
  - Fast & stable earliest deadline first (EDF) scheduler on the FPGA platform
  - On-chip low-latency EDF scheduler
- Preemption incurs overhead
  - SW optimization to select the best preemption point placement
  - **ART software framework**: optimal preemption point selection to reduce preemption overhead, with schedulability guaranteed

ART enables existing FPGA HLS accelerators to be used in real-time safety-critical scenarios.

#### ART customizes a baseline FPGA HLS accelerator to support real-time features

Inputs: the baseline Acc and the target FPGA.

1. Start from the baseline accelerator, then wrap it in a `kernel_mgmt` function that adds preemption support and Acc control:

```
void baseline_acc(/*addr*/,/*specs*/){...}
void kernel_mgmt(...){
    if (swap_in) {...}
    if (swap_out) {...} // preemption support
    get_addr(...);
    get_specs(...);
    baseline_acc(...); // Acc control
}
```

2. Run c-synth and system integration to combine the kernel with the scheduler and user control:

```
module top (...);
    scheduler u_sche (...);
    kernel_mgmt u_acc (...);
    user_control u_control (...);
endmodule
```

#### ART uses an overlay to handle different tasks within a task set

- Different DNN layers --> same accelerator, different configuration files (metadata)
- ART segments the DNN models in units of DNN layers
- Assume intermediate data is on DDR
- Configuration files are dedicated to the input task set and hold:
  - Iteration boundaries
  - Address info

#### Example: EDF scheduling at segment boundaries

- Task 1: 2 segments; Task 2: 1 segment
- T=0: Task 1 releases job 1, deadline = 20
- T=1: Task 2 releases job 1, deadline = 10
- T=2: Task 1 job 1 finishes its first segment; Task 2 job 1 has the earlier deadline (10 < 20), so it runs next
- T=6: Task 2 finishes

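
The timeline above can be replayed with a short sketch of EDF applied only at segment boundaries: a running segment is never interrupted, and the scheduler picks the ready job with the earliest deadline each time a segment ends. This is a software model for illustration, not the on-chip scheduler itself. The function name is hypothetical, and the length of task 1's second segment (3 time units) is an assumed value, since the slides do not give it.

```python
import heapq


def edf_at_segment_boundaries(jobs):
    """Run jobs with EDF decisions taken only when a segment finishes.

    jobs: iterable of (name, release, deadline, segment_lengths).
    Returns a list of (name, start, end) for every executed segment.
    """
    pending = sorted(jobs, key=lambda job: job[1])  # ordered by release time
    ready = []  # min-heap of (deadline, name, remaining_segments)
    log = []
    t = 0
    i = 0
    while i < len(pending) or ready:
        # Admit every job released up to the current time.
        while i < len(pending) and pending[i][1] <= t:
            name, _release, deadline, segments = pending[i]
            heapq.heappush(ready, (deadline, name, list(segments)))
            i += 1
        if not ready:
            t = pending[i][1]  # idle until the next release
            continue
        deadline, name, segments = heapq.heappop(ready)  # earliest deadline
        start = t
        t += segments.pop(0)  # the segment runs to completion
        log.append((name, start, t))
        if segments:
            heapq.heappush(ready, (deadline, name, segments))
    return log


schedule = edf_at_segment_boundaries([
    ("task1", 0, 20, [2, 3]),  # second segment length (3) is an assumption
    ("task2", 1, 10, [4]),
])
print(schedule)
```

Task 2 arrives at T=1 but cannot start until task 1's first segment ends at T=2, which is why segment length bounds how long a higher-urgency job can be blocked.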
#### **Preemption overhead depends on Acc execution model:**

**Layer intermediate data is on DDR.** The Acc sequentially launches DNN layers (1, 2, 3, ..., n) --> no on-chip intermediate data between layers.

Preemption overhead:

- Scheduling operations
- Configuring the controller and accelerator

**Layer intermediate data is on chip.** The Acc sequentially launches DNN layers --> data is forwarded on chip between layers.

Preemption overhead:

- Scheduling operations
- Configuring the controller and accelerator
- Store/load the intermediate data

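
The two overhead lists can be written as one additive cost model, sketched below. The function name, parameters, and all numeric values are hypothetical placeholders, not measurements from ART. The sketch also assumes that the on-chip data is stored once on swap-out and loaded once on swap-in at the same DDR bandwidth.

```python
def preemption_overhead_us(sched_us: float, config_us: float,
                           onchip_bytes: float = 0,
                           ddr_bytes_per_us: float = 1.0) -> float:
    """Additive preemption cost model (all inputs are illustrative).

    sched_us:   scheduling operations
    config_us:  configuring the controller and accelerator
    onchip_bytes: intermediate data held on chip at the preemption point
                  (0 when layer intermediate data already lives on DDR)
    """
    # Assumption: one store on swap-out plus one load on swap-in.
    transfer_us = 2 * onchip_bytes / ddr_bytes_per_us
    return sched_us + config_us + transfer_us


# Intermediate data on DDR: no on-chip state to move.
ddr_model = preemption_overhead_us(sched_us=1.0, config_us=2.0)

# Intermediate data on chip: placeholder size and bandwidth.
onchip_model = preemption_overhead_us(1.0, 2.0, onchip_bytes=1000,
                                      ddr_bytes_per_us=100.0)
print(ddr_model, onchip_model)
```

Under such a model the store/load term varies with how much data is live at a given layer boundary, which suggests why the choice of preemption points matters for the on-chip execution model.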
#### **ART SW framework optimizes preemption overheads:**

- Preemption overhead depends on the Acc execution model (layer intermediate data on DDR vs. on chip)
- Input: task configs
- Result: **reduced worst-case execution time**

## **ART Evaluation**

- We implement ART on the AMD VCK190 Evaluation board (PE array: AI Engines)
- CHARM as the baseline accelerator
- ART is lightweight: <0.5% resource utilization

| Design   | AIE          | LUT            | REG            | BRAM         | URAM         | DSP          |
|----------|--------------|----------------|----------------|--------------|--------------|--------------|
| Baseline | 384 (96.00%) | 92549 (10.29%) | 155526 (8.64%) | 625 (64.63%) | 384 (82.94%) | 262 (13.31%) |
| ART      | 0 (0%)       | 2873 (0.32%)   | 2370 (0.13%)   | 4 (0.41%)    | 0 (0%)       | 0 (0%)       |
| Total    | 384 (96.00%) | 95422 (10.60%) | 157896 (8.77%) | 629 (65.04%) | 384 (82.94%) | 262 (13.31%) |

ART can satisfy schedulability whereas the baseline accelerator cannot:

- Before: QoS 17%; after: QoS 100%
- Overhead: <0.1% degradation in throughput

Worst-case behavior:

- The preemption mechanism improves worst-case response time
- The ART SW framework optimizes both the worst-case response time and execution time

#### **Thank You**

2025/6/30