Peipei Zhou

Assistant Professor of Engineering

Brown University

Biography

Peipei Zhou is a Tenure-Track Assistant Professor at Brown University, School of Engineering. She leads the Customized Computer Architecture Research Lab at Brown University. Dr. Zhou received her B.S. in Electrical and Computer Engineering from Southeast University, Chien-Shiung Wu Honor College in 2012, her M.S. in Electrical and Computer Engineering in 2014, and her Ph.D. in Computer Science in 2019, both from the University of California, Los Angeles.

Zhou’s research focuses on creating Customized Computer Architecture and Programming Abstraction for Applications including Healthcare, e.g., Precision Medicine, and Artificial Intelligence. She is the recipient of the 2026 NSF CAREER Award 🏆. She has been selected for the 2026 National Academy of Engineering’s Grainger Foundation Frontiers of Engineering Symposium 🏆, recognizing her exceptional research and technical leadership among early-career engineers. She also received the “Outstanding Recognition in Research” Award from the UCLA Samueli School of Engineering in 2019 🏆. Her research has received the 2025 IEEE/ACM ICCAD 10-Year Retrospective Most Influential Paper Award 🏆, the 2026 ACM International Green and Sustainable Computing (IGSC) Best Paper Award 🏆, and the 2019 IEEE TCAD Donald O. Pederson Best Paper Award 🏆. Additional recognitions include 🏆 2025 ACM/SIGDA FPGA Best Paper Nominee, 2024 IEEE IGSC Best Viewpoint Paper, 2023 ACM/IEEE IGSC Best Viewpoint Paper Finalist, the 2018 IEEE ISPASS Best Paper Nominee, and the 2018 IEEE/ACM ICCAD Best Paper Nominee 🏆.

I’m actively recruiting PhD students and research interns! Self-motivated students with relevant research and project experience (compiler, GPU and FPGA programming, artificial intelligence algorithm and application development, etc.) are highly encouraged to contact me via email.
Download my CV.
Website at Brown
Researchers@Brown
Former Website at UCLA

Interests

Application & Algorithm: Artificial Intelligence, Healthcare
Abstraction: Programming, Modeling and Optimization
Architecture: Heterogeneous Computing with FPGA, GPU, ASIC, NPU

Education

PhD in Computer Science, 2019

University of California, Los Angeles
MSc in Electrical Engineering, 2014

University of California, Los Angeles
BSc in Electrical Engineering, 2012

Southeast University, Chien-Shiung Wu Honor College

Research Focus

Application

Health & Artificial Intelligence

Abstraction

Software

Accelerator

Hardware

Experience

Tenure-Track Assistant Professor

Brown University

Sep 2024 – Present Providence, RI

Responsibilities include:

Tenure-Track Assistant Professor
Computer Engineering Concentration Advisor

Tenure-Track Assistant Professor

University of Pittsburgh

Sep 2021 – Aug 2024 Pittsburgh, PA

Responsibilities include:

Tenure-Track Assistant Professor

Staff Software Engineer

Enflame

Aug 2019 – Aug 2021 Shanghai

Responsibilities include:

High-Performance Convolutional Neural Network Library for Deep Learning ASIC Accelerator
Pre-silicon Architecture Exploration and Performance Modeling
Post-silicon ASIC Bring-Up and System Software Optimization

Software Engineering Intern

Falcon Computing Solutions

Jun 2018 – Mar 2019 Los Angeles, CA

Responsibilities include:

Resource allocation and scheduling optimization for Genome Analysis Toolkit (GATK4) in the cloud.
Cost-optimal heterogeneous computing by orchestrating seas of datacenter-scale resources, including compute (CPU, GPU, and FPGA accelerators) and storage (HDD, SSD, local disk).

Software Engineering Intern

Microsoft

Jun 2017 – Sep 2017 Redmond, WA

Responsibilities include:

Developed a scalable end-to-end tool to generate >2M grammar-fixed (preposition, article, etc.) sentences from Wikipedia.
Implemented LSTM-based neural network model for translation tasks

Research Intern

Microsoft Research

Jun 2014 – Sep 2014 Redmond, WA

Responsibilities include:

Implemented an image compression algorithm in C++ (software reference code) and a hardware accelerator on FPGA.

Honors

Caffeine won IEEE Transactions on Computer-Aided Design Donald O. Pederson Best Paper Award

IEEE Jun 2019

The Donald O. Pederson Best Paper Award recognizes the best paper published in IEEE TCAD in the two most recent calendar years. Current Associate Editors of IEEE TCAD first nominate the best paper candidates. Among the papers published in the past two years, the most referenced or downloaded papers are also nominated automatically for review and voting by the entire editorial board. The editorial board nominated five papers this year, and another nine papers were automatically nominated for receiving the highest downloads in the past two years. After the voting, a confidential review committee reviewed the top five papers before deciding the final winners. The selection committee unanimously agreed to declare two of the candidates co-winners. The award was presented at the Design Automation Conference (DAC) in Las Vegas on Jun. 4th, 2019.

See certificate

UCLA Outstanding Ph.D. Researcher Award

UCLA Samueli School of Engineering Jun 2019

2019 UCLA Computer Science Department Outstanding Ph.D. Researcher Award

See certificate

Phi Tau Phi Scholarship

Phi Tau Phi Scholastic Honor Society West America Chapter Jun 2018

One of five recipients of 2018 Phi Tau Phi Scholarship in recognition of academic achievements and scholarly contributions.

See certificate

Projects

Recent & Upcoming Talks

Efficient Programming on Heterogeneous Accelerators

How to map deep learning workloads onto heterogeneous SoCs with FPGAs and tensor cores? How to explore latency-throughput tradeoffs? …

May 9, 2025 12:00 PM — 4:00 PM Sharp Laboratory, University of Delaware, Newark, DE 19716, USA

Peipei Zhou

Architectural Challenges and Innovation for Compute Infrastructure Co-Design

How to use heterogeneous computing to accelerate real-time transformer inference? How to scale out computing in data centers, and how …

Oct 18, 2023 12:00 PM — 4:00 PM Open Compute Project (OCP) Summit 2023, San Jose, CA

Peipei Zhou

Architectural Challenges and Innovation for Compute Infrastructure Co-Design

CHARM Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

Which platform beats 7nm GPU A100 in energy efficiency? AMD Versal ACAP (FPGA+AI Chip). How to program AMD Versal ACAP, i.e., FPGA + AI …

Apr 7, 2023 12:13 PM — 4:00 PM CMU, Pittsburgh, PA

Peipei Zhou

Customizable Domain Specific Computing

Dec 10, 2018 12:13 PM — 4:00 PM Washington, D.C.

Jason Cong, Peipei Zhou

Featured Publications

Shixin Ji, Zhuoping Yang, Xingzhen Chen, Alex Jones, Peipei Zhou

June 2026 Proceedings of the ACM International Green and Sustainable Computing Conference 2026, IGSC ’26, June 22 - June 24, 2026, Canandaigua, NY, USA. Full Paper Accepted!

Advancing Environmental Sustainability in Data Centers via Carbon Depreciation Models (🔥📣🏆IGSC 2026 Best Paper Award🔥📣🏆! )

Recent improvements in energy efficiency and renewable energy integration have increased the relative importance of embodied carbon in data centers, motivating improved provisioning strategies. Conventional approaches primarily minimize operational energy, but this perspective is increasingly insufficient for sustainability. In this paper, we propose carbon depreciation models to encourage longer hardware lifetimes. Carbon depreciation assigns a larger portion of embodied carbon to newly provisioned servers, discouraging unnecessary deployment of new hardware. As a result, new servers are provisioned mainly for jobs with strict quality-of-service (QoS) constraints, while older servers, whose embodied carbon has largely been recovered, are used for other workloads. We further argue that both embodied carbon and operational carbon from server idle time should be recovered during active jobs, encouraging provisioning strategies that maintain high utilization. We show that prior carbon accounting strategies can be counterproductive: under a greedy scheduler minimizing carbon under QoS constraints, jobs are priced as 25% cheaper on new hardware than on older hardware. In contrast, our approach uses a greedy scheduler that prioritizes older hardware through non-linear carbon depreciation, promoting sustainable provisioning. Experimental results show carbon reductions of 28–57%, depending on server lifetime assumptions.
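As a rough illustration of what "assigning a larger portion of embodied carbon to newly provisioned servers" can look like, the sketch below splits a server's embodied carbon across its lifetime with geometrically decaying weights and compares that with an even (linear) split. The geometric schedule, the function names, the decay value, and the 1000 kg figure are all assumptions chosen for illustration; they are not taken from the paper, whose actual depreciation model may differ.

```python
def front_loaded_shares(embodied_kg, lifetime_years, decay=0.6):
    """Split embodied carbon across years so early years carry more.

    Hypothetical schedule: year k gets weight decay**k, and the weights
    are normalized so the shares add up to the full embodied carbon.
    """
    weights = [decay ** k for k in range(lifetime_years)]
    total_weight = sum(weights)
    return [embodied_kg * w / total_weight for w in weights]


def linear_shares(embodied_kg, lifetime_years):
    """Baseline: every year carries the same share."""
    return [embodied_kg / lifetime_years] * lifetime_years


if __name__ == "__main__":
    non_linear = front_loaded_shares(1000.0, 5)
    linear = linear_shares(1000.0, 5)
    for year, (nl, lin) in enumerate(zip(non_linear, linear), start=1):
        print(f"year {year}: non-linear {nl:6.1f} kg, linear {lin:6.1f} kg")
```

Under such a schedule a job placed on a first-year server is charged more embodied carbon than the same job on a fifth-year server, which is the incentive the abstract describes for steering non-urgent work to older hardware.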

Xingzhen Chen, Zhuoping Yang, Jinming Zhuang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, Peipei Zhou

June 2026 Proceedings of the ACM Great Lakes Symposium on VLSI 2026, GLSVLSI ’26, June 22 - June 24, 2026, Canandaigua, NY, USA. Full Paper Accepted!

DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration (🔥📣New Paper & Project🔥📣! )

As deep neural networks develop significantly more diverse and complex, achieving high performance and efficiency on complicated DNN models faces pressing challenges. Modern DNN workloads are increasingly diverse in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing diverse workloads.

To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow via a proposed ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To support flexibility while achieving high performance, DORA adopts a novel on-chip memory management and computation parallelism management mechanism. DORA proposes a compilation framework that can generate instructions for given DNN workloads after a two-stage design space exploration. DORA framework also incorporates a MILP-based and a heuristic-based search engine to generate the schedule solution for different needs and constraints.

We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5% variation on a single vector processor across workloads exhibiting up to 6× variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5× throughput improvement. The heuristic-based scheduler further achieves up to 90% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.

Xingzhen Chen, Shixin Ji, Zheng Dong, Peipei Zhou

June 2026 Proceedings of the ACM International Green and Sustainable Computing Conference 2026, IGSC ’26, June 22 - June 24, 2026, Canandaigua, NY, USA. Full Paper Accepted!

To Overlay or to Customize? Revisiting Architectural Choices in Heterogeneous Systems (🔥📣New Paper & Project🔥📣! )

Autonomous Driving Systems (ADS) increasingly rely on diverse deep neural networks to support perception, prediction, planning, and control under strict real-time constraints. FPGA-based heterogeneous computing provides an attractive platform for DNN workloads, but it raises a fundamental deployment question: should the system rely on a flexible overlay architecture, or repeatedly load customized bitstreams optimized for dedicated models? This question should be treated as a first-class systems problem rather than a purely architectural one. Overlay-based execution offers fast model switching and better adaptability by relying on lightweight instruction or parameter updates, while customized architectures can provide higher per-model efficiency at the cost of reconfiguration latency and reduced flexibility. However, the boundary between these two design choices remains unclear in realistic ADS scenarios. In this work, we present a systematic study of this trade-off from a deployment-centric perspective, focusing on an autonomous driving scenario. Instead of treating overlay and customized acceleration as isolated design points, we analyze when each approach is preferable under practical conditions, including workload variation, architectural design, reconfiguration latency, and switching frequency. Our analysis shows that, with state-of-the-art architectures, the overlay-based approach is more suitable for highly frequent model switching. However, as bitstream reload overhead continues to decrease, customized architectures may become increasingly attractive, especially for workloads with efficiency requirements. Conversely, if overlay architectures become more capable and flexible, they may further expand their advantage over customized architectures. These observations provide insights for future architectural design and suggest that the optimal deployment strategy may flip as the underlying techniques develop.

Shixin Ji, Jinming Zhuang, Zhuoping Yang, Xingzhen Chen, Wei Zhang, Peipei Zhou

June 2026 Proceedings of the ACM Great Lakes Symposium on VLSI 2026, GLSVLSI ’26, June 22 - June 24, 2026, Canandaigua, NY, USA. Full Paper Accepted!

μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP (🔥📣New Paper & Project🔥📣! )

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-µs latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose µ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. µ-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory tiles or FPGA fabric. Moreover, a 512-bit/cycle cascade connection is applied instead of a 32-bit/cycle DMA connection. µ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. µ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation on AIE. We evaluate µ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that µ-ORCA achieves average latency reduction of >1.70× and >1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 µs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. We open source µ-ORCA at https://github.com/arc-research-lab/u-ORCA.

Shixin Ji, Jinming Zhuang, Sarah Schultz, Zhuoping Yang, Xingzhen Chen, Zheng Dong, Alex K. Jones, Yihui Ren, Peipei Zhou

March 2026 Proceedings of the ACM/IEEE Design Automation Conference, DAC ’26, July 26 - July 29, 2026, Long Beach, CA, USA. Full Paper Accepted!

PHAROS: Pipelined Heterogeneous Accelerators for Real-time Safety-critical Systems With Deadline Compliance (🔥📣New Paper & Project🔥📣! )

Spatially partitioned heterogeneous accelerators (HAs) are increasingly adopted in embedded systems for their performance and flexibility. Yet most existing HA design frameworks optimize primarily for throughput or quality-of-service (QoS) metrics. They often overlook safety-critical real-time requirements, including hardware support for predictable execution, real-time-aware design space exploration (DSE), and rigorous schedulability analysis. These requirements are essential in safety-critical applications such as smart transportation, where schedulability guarantees directly affect system safety. To address this gap, we present PHAROS, a real-time-centric HA design framework. PHAROS introduces preemption mechanisms and scheduler designs for spatially partitioned HAs under first-in-first-out (FIFO) and earliest-deadline-first (EDF) policies. Leveraging modern real-time theory, we further develop a soft real-time (SRT) schedulability-oriented DSE with objectives and constraints tailored to SRT schedulability. Through comprehensive modeling, analysis, and evaluation across diverse applications, we show that PHAROS’s schedulability-oriented DSE discovers more feasible configurations for a broader range of task sets than throughput-oriented DSE baselines, while delivering improved real-time performance. We also provide response-time analyses for the supported scheduling algorithms.

Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, Peipei Zhou

March 2026 Proceedings of the ACM/IEEE Design Automation Conference, DAC ’26, July 26 - July 29, 2026, Long Beach, CA, USA. Full Paper Accepted!

FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration (🔥📣New Paper & Project🔥📣! )

With the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms. Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists when the workloads are diverse. Other works discuss overlay architectures that can dynamically switch dataflow for different workloads. However, these works are still limited by the granularity of their flexibility and incur considerable resource inefficiency.

To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real-time and flexibly composed into a unified or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage DSE that can achieve the optimal design point. We evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design achieves 1.3× to 5× the throughput and hardware efficiency across diverse workloads.

Weisong Shi, Zheng Dong, Peipei Zhou

February 2026 Special Issue on Celebrating the 40th Anniversary of Journal of Computer Science and Technology JCST. Full Paper Accepted! https://rdcu.be/e7m2b

Physical Intelligence on the Edge: A Vision for the Decade Ahead (🔥📣New Paper & Project🔥📣! )

This article examines key challenges in computing systems research under the emerging paradigm of Physical Intelligence on the Edge (PIE), in which raw sensor streams are transformed into real-time, safety-critical intelligence that can act in the physical world. It traces the evolution of computing architectures from centralized systems to distributed systems and edge computing, and argues that PIE constitutes a qualitative shift: the edge becomes the primary platform for tightly integrating sensing, reasoning, and actuation under stringent real-time constraints. The article identifies five emerging research thrusts—embodied spatial reasoning, embodied temporal reasoning, edge-native customization, symbiosis, and sustainability. Using a hypothetical PIE scenario, it exposes a fundamental gap between the capabilities of current systems and the requirements of future PIE-enabled autonomy: while today’s edge platforms can execute individual components of perception and inference, they remain unable to autonomously close the sense-think-act loop with certifiable guarantees on timing and safety. This vision is further substantiated by recent industrial progress, including several compelling demonstrations showcased at CES 2026 by leading companies such as NVIDIA and AMD. The article concludes by calling for a paradigm shift in systems thinking—from efficiently transporting and processing data (bits) to predictably and safely influencing the physical world (atoms)—thereby positioning edge-native system design as a foundational enabler of next-generation autonomous and robotic systems.

Shixin Ji, Zhuoping Yang, Xingzhen Chen, Wei Zhang, Jinming Zhuang, Alex Jones, Zheng Dong, Peipei Zhou

December 2025 Proceedings of the 46th IEEE Real-Time Systems Symposium, RTSS 2025, December 2–5, 2025, Boston, MA, USA. Full Paper Accepted! https://ieeexplore.ieee.org/document/11315054

DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems (🔥📣New Paper & Project🔥📣! )

Deep neural network (DNN) models are increasingly deployed in real-time, safety-critical systems such as autonomous vehicles, driving the need for specialized AI accelerators. However, most existing accelerators support only non-preemptive execution or limited preemptive scheduling at the coarse granularity of DNN layers. This restriction leads to frequent priority inversion due to the scarcity of preemption points, resulting in unpredictable execution behavior and, ultimately, system failure.

To address these limitations and improve the real-time performance of AI accelerators, we propose DERCA, a novel accelerator architecture that supports fine-grained, intra-layer flexible preemptive scheduling with cycle-level determinism. DERCA incorporates an on-chip Earliest Deadline First (EDF) scheduler to reduce both scheduling latency and variance, along with a customized dataflow design that enables intra-layer preemption points (PPs) while minimizing the overhead associated with preemption. Leveraging the limited preemptive task model, we perform a comprehensive predictability analysis of DERCA, enabling formal schedulability analysis and optimized placement of preemption points within the constraints of limited preemptive scheduling. We implement DERCA on the AMD ACAP VCK190 reconfigurable platform. Experimental results show that DERCA outperforms state-of-the-art designs using non-preemptive and layer-wise preemptive dataflows, with less than 5% overhead in worst-case execution time (WCET) and only 6% additional resource utilization. DERCA is open-sourced on GitHub: https://github.com/arc-research-lab/DERCA.
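The effect of adding intra-layer preemption points can be seen in a small software model. The sketch below is only a toy, not DERCA's on-chip scheduler: it assumes a single accelerator, treats each job as a list of non-preemptible segments, and re-runs an Earliest Deadline First choice at every segment boundary. The `Job` class, the function name, and all timing numbers are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    deadline: int                          # absolute deadline; the only EDF sort key
    name: str = field(compare=False)
    release: int = field(compare=False)
    segments: list = field(compare=False)  # lengths of non-preemptible segments


def edf_limited_preemption(jobs):
    """Return {job name: finish time}, rescheduling only at preemption points."""
    # Copy the segment lists so the caller's jobs are left untouched.
    pending = sorted(
        (Job(j.deadline, j.name, j.release, list(j.segments)) for j in jobs),
        key=lambda j: j.release,
    )
    ready, finish, now, i = [], {}, 0, 0
    while i < len(pending) or ready:
        # Admit every job that has been released by the current time.
        while i < len(pending) and pending[i].release <= now:
            heapq.heappush(ready, pending[i])
            i += 1
        if not ready:
            now = pending[i].release       # idle until the next release
            continue
        job = heapq.heappop(ready)         # earliest deadline first
        now += job.segments.pop(0)         # run to the next preemption point
        if job.segments:
            heapq.heappush(ready, job)     # job may be preempted here
        else:
            finish[job.name] = now
    return finish


if __name__ == "__main__":
    urgent = Job(deadline=10, name="urgent", release=2, segments=[3])
    fine = [Job(100, "long", 0, [4, 4, 4]), urgent]   # intra-layer preemption points
    coarse = [Job(100, "long", 0, [12]), urgent]      # one non-preemptible block
    print("fine-grained:", edf_limited_preemption(fine))
    print("coarse:      ", edf_limited_preemption(coarse))
```

In this toy setup the urgent job finishes at time 7 when the long job exposes preemption points every 4 time units, but at time 15 (past its deadline of 10) when the long job runs as one 12-unit block, which is the kind of priority inversion the abstract attributes to coarse preemption granularity.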

Zhuoping Yang, Jinming Zhuang, Xingzhen Chen, Alex Jones, Peipei Zhou

November 2025 Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025, Nov. 16 - Nov. 21, 2025, St. Louis, MO, US. Full Paper Accepted! https://dl.acm.org/doi/10.1145/3712285.3759778

AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration (🔥📣New Paper & Project🔥📣! )

Graphics Processing Units (GPUs) have become essential for computationally intensive applications. However, emerging workloads such as recommender systems, graph analytics, and data analytics often involve processing data exceeding GPU device memory capacity. To mitigate this issue, existing solutions enable GPUs to use CPU DRAM or SSDs as external memory. Among them, the GPU-centric approach lets GPU threads directly initiate NVMe requests, eliminating CPU intervention overhead over traditional methods. However, the SOTA GPU-centric approach adopts a synchronous IO model, and threads must tolerate the long latency in communication before starting any tasks.

In this work, we propose AGILE, a lightweight and efficient asynchronous library allowing GPU threads to access SSDs asynchronously while eliminating deadlock risks. AGILE also integrates a flexible software cache using GPU High-Bandwidth Memory (HBM). We demonstrate that the asynchronous GPU-centric IO achieves up to 1.88× improvement in workloads with different computation-to-communication (CTC) ratios. We also compare AGILE with the SOTA work BaM on Deep Learning Recommendation Models (DLRM) with various settings, and the results show that AGILE achieves 1.75× performance improvement due to its efficient design and the overlapping strategy enabled by an asynchronous IO model. We further evaluate AGILE’s API overhead on graph applications, and the results demonstrate AGILE reduces software cache overhead by up to 3.12× and overhead in NVMe IO requests by up to 2.85×. Compared with BaM, AGILE consumes fewer registers and exhibits up to 1.32× reduction in the usage of registers.

Shixin Ji, Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Sarah Schultz, Yukai Song, Jingtong Hu, Alex Jones, Zheng Dong, Peipei Zhou

June 2025 Proceedings of the Great Lakes Symposium on VLSI 2025, GLSVLSI 2025, June 30 - July 2, New Orleans, LA, US. Full Paper Accepted! https://dl.acm.org/doi/pdf/10.1145/3716368.3735215

ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems (🔥📣New Paper & Project🔥📣! )

Real-time systems are widely applied in different areas like autonomous vehicles, where safety is the key metric. However, on the FPGA platform, most of the prior accelerator frameworks omit discussing the schedulability in such real-time safety-critical systems, leaving deadlines unmet, which can lead to catastrophic system failures. To address this, we propose the ART framework, a hardware-software co-design approach that transforms baseline accelerators into real-time guaranteed accelerators. On the software side, ART performs schedulability analysis and preemption point placement, optimizing task scheduling to meet deadlines and enhance throughput. On the hardware side, ART integrates the Global Earliest Deadline First (GEDF) scheduling algorithm, implements preemption, and conducts source code transformation to transform baseline HLS-based accelerators into designs targeted for real-time systems capable of saving and resuming tasks. ART also includes integration, debugging, and testing tools for full-system implementation. We demonstrate the methodology of ART on two kinds of popular accelerator models and evaluate on AMD Versal VCK190 platform, where ART meets schedulability requirements that baseline accelerators fail. ART is lightweight, utilizing <0.5% resources. With about 100 lines of user input, ART generates about 2.5k lines of accelerator code, making it a push-button solution.

Jinming Zhuang, Shaojie Xiang, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou

January 2025 Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2025, Feb. 28 - March 3, Monterey, CA, US. Full Paper Accepted! https://dl.acm.org/doi/10.1145/3706628.3708870

ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines (🔥📣 FPGA 2025 Best Paper Candidate🔥📣! )

As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD’s AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very long instruction word (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE’s specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to figure out the parallelism within a kernel, manually do the partition, and assign sub-tasks to different AIE cores to exploit parallelism. These significantly lower the programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance.

To address these limitations, we introduce ARIES, an agile multilevel intermediate representation (MLIR) based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, both with and without FPGA, facilitating holistic optimization and better portability across AIE device families. For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput under FP32, INT16, and INT8 data types on Versal VCK190, respectively. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x correspondingly. For the ResNet residual layer, ARIES achieves up to 22.58x speedup compared with the optimized SOTA work Riallto on Ryzen-AI NPU. ARIES is open-sourced on GitHub: https://github.com/arc-research-lab/Aries.

Shixin Ji, Jinming Zhuang, Zhuoping Yang, Alex Jones, Peipei Zhou

November 2024 Proceedings of the IEEE 15th International Green and Sustainable Computing Conference, IGSC 2024

Amortizing Embodied Carbon Across Generations (🔥📣Best Viewpoint Paper in IGSC 2024🔥📣! )

Peiyan Dong, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Yanyu Li, Dongkuan Xu, Heng Huang, Jingtong Hu, Alex Jones, Yiyu Shi, Yanzhi Wang, Peipei Zhou

October 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), in conjunction with ESWEEK, Raleigh, NC, USA, Sept. 29-Oct. 4, 2024. Also appears as part of the ESWEEK-TCAD Special Issue, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)

EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Shixin Ji, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex K. Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

September 2024 ACM Transactions on Reconfigurable Technology and Systems

CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture (🔥📣New Paper & Project🔥📣! )

Yue Tang, Yukai Song, Naveena Elango, Sheena Ratnam Priya, Alex Jones, Jinjun Xiong, Peipei Zhou, Jingtong Hu

September 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), in conjunction with ESWEEK, Raleigh, NC, USA, Sept. 29-Oct. 4, 2024. Also appears as part of the ESWEEK-TCAD Special Issue, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)

CHEF: A Framework for Deploying Heterogeneous Models on Clusters with Heterogeneous FPGAs (🔥📣New Paper & Project🔥📣! )

Zhuoping Yang, Wei Zhang, Shixin Ji, Peipei Zhou, Alex Jones

September 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), in conjunction with ESWEEK, Raleigh, NC, USA, Sept. 29-Oct. 4, 2024

Reducing Smart Phone Environmental Footprints with In-Memory Processing (🔥📣New Paper & Project🔥📣! )

Shixin Ji, Zhuoping Yang, Xingzhen Chen, Stephen Cahoon, Jingtong Hu, Yiyu Shi, Alex Jones, Peipei Zhou

July 2024 Proceedings of the IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2024

SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators (🔥📣New Paper & Project🔥📣! )

Ruiyang Qin, Jun Xia, Zhenge Jia, Meng Jiang, Ahmed Abbasi, Peipei Zhou, Jingtong Hu, Yiyu Shi

June 2024 Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, California, USA, (DAC ’24)

Enabling On-Device Self-Supervised LLM Personalization with Selective Synthetic Data (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Zhuoping Yang, Shixin Ji, Heng Huang, Alex K. Jones, Jingtong Hu, Yiyu Shi, Peipei Zhou

February 2024 Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2024, March 3 - March 5, Monterey, CA, US. Full Paper Accepted! https://doi.org/10.1145/3626202.3637569

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration (🔥📣New Paper & Project🔥📣! )

Zhuoping Yang, Shixin Ji, Xingzhen Chen, Jinming Zhuang, Weifeng Zhang, Dharmesh Jani, Peipei Zhou

January 2024 Proceedings of the 29th Asia and South Pacific Design Automation Conference, ASPDAC 2024, Incheon Songdo Convensia, South Korea. https://doi.org/10.1109/ASP-DAC58780.2024.10473961

Challenges and Opportunities to Enable Large-Scale Computing via Heterogeneous Chiplets (🔥📣New Paper & Project🔥📣! )

Zhuoping Yang, Jinming Zhuang, Jiaqi Yin, Cunxi Yu, Alex K. Jones, Peipei Zhou

July 2023 Proceedings of the 42nd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2023, October 29, 2023 - November 2, 2023, San Francisco, CA, USA. Full Paper Accepted (acceptance ratio is 21 percent)

AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Zhuoping Yang, Peipei Zhou

February 2023 Proceedings of the 60th ACM/IEEE Design Automation Conference, San Francisco, California, USA, (DAC ’23), July 9–13, 2023, San Francisco, CA, USA. Full Paper Accepted (acceptance ratio is 23 percent)

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex K. Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

January 2023 Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3543622.3573210

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture (🔥📣New Paper & Project🔥📣! )

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. This raises a key question: how can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck as the mismatch between the massive computation resources of one monolithic accelerator and the many small MM layers in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification.
We deploy the CHARM framework for four different deep learning applications, BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, and MLP, respectively, corresponding to 5.40x, 32.51x, 1.00x, and 1.00x throughput gains over one monolithic accelerator.
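The peak and achieved throughput figures quoted above can be related with a little arithmetic. The following is a minimal sketch that uses only the numbers in the abstract. The per-AIE operations-per-cycle value is derived from those numbers rather than stated in the text, and the variable and function names are illustrative.

```python
# Figures quoted in the abstract above.
NUM_AIE = 400          # AI Engine processors in the array
CLOCK_HZ = 1e9         # 1 GHz clock
PEAK_FLOPS = 6.4e12    # theoretical fp32 peak (6.4 TFLOPs)

# Derived, not stated in the abstract: fp32 operations per cycle per AIE
# that the quoted peak implies.
flops_per_cycle_per_aie = PEAK_FLOPS / (NUM_AIE * CLOCK_HZ)

# Achieved inference throughput reported for each application, in TFLOPs.
achieved_tflops = {"BERT": 1.46, "ViT": 1.61, "NCF": 1.74, "MLP": 2.94}


def fraction_of_peak(tflops, peak_flops=PEAK_FLOPS):
    """Return achieved throughput as a fraction of the theoretical peak."""
    return tflops * 1e12 / peak_flops


utilization = {app: fraction_of_peak(t) for app, t in achieved_tflops.items()}

print(f"Implied fp32 ops per cycle per AIE: {flops_per_cycle_per_aie:.0f}")
for app, frac in utilization.items():
    print(f"{app}: {frac:.1%} of theoretical peak")
```

Under these figures, the reported results sit between roughly a quarter and a half of the theoretical peak, compared with the under-5% figure cited for small BERT layers on a monolithic accelerator.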

Xinyi Zhang, Cong Hao, Peipei Zhou, Alex Jones, Jingtong Hu

March 2022 2022 Design Automation Conference (DAC 2022)

H2H: Heterogeneous Model to Heterogeneous System Mapping with Computation and Communication Awareness

The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. For the algorithm, the heterogeneity in ML models comes from multi-sensor perception and multi-task learning, i.e., multi-modality multi-task (MMMT) models, resulting in diverse deep neural network (DNN) layers and computation patterns. For the system, it is becoming common to integrate dedicated acceleration components into one system. This introduces a new problem, heterogeneous model to heterogeneous system mapping (H2H), in which both computation and communication efficiency need to be considered. While previous mapping algorithms focus only on computation patterns, in this work we propose a novel mapping algorithm with both computation and communication awareness. By slightly sacrificing computation efficiency, the communication latency is largely reduced. As a result, overall system performance improves and energy consumption falls. We evaluate our work on MAESTRO, achieving 15%-74% latency improvement and 23%-64% energy reduction compared with the existing computation-prioritized mapping algorithm.
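The trade-off described above, giving up a little computation efficiency to avoid communication, can be illustrated with a toy example. This is not the H2H algorithm. It is a minimal sketch that assumes a simple chain of three layers, two hypothetical accelerators "A" and "B", made-up latencies, and an exhaustive search in place of the paper's mapping procedure.

```python
from itertools import product

# Hypothetical compute latency (ms) of each layer on each accelerator.
COMPUTE_MS = [
    {"A": 4.0, "B": 5.0},
    {"A": 6.0, "B": 5.0},
    {"A": 3.0, "B": 4.0},
]
# Hypothetical transfer cost (ms) when consecutive layers sit on different accelerators.
TRANSFER_MS = 3.0


def total_latency(mapping):
    """Compute latency plus transfer latency for a layer-to-accelerator mapping."""
    compute = sum(COMPUTE_MS[i][acc] for i, acc in enumerate(mapping))
    transfers = sum(1 for a, b in zip(mapping, mapping[1:]) if a != b)
    return compute + transfers * TRANSFER_MS


def computation_prioritized():
    """Place each layer on whichever accelerator computes it fastest."""
    return tuple(min(layer, key=layer.get) for layer in COMPUTE_MS)


def communication_aware():
    """Search every mapping and keep the one with the lowest total latency."""
    accelerators = sorted(COMPUTE_MS[0])
    return min(product(accelerators, repeat=len(COMPUTE_MS)), key=total_latency)


for name, mapping in [("computation-prioritized", computation_prioritized()),
                      ("communication-aware", communication_aware())]:
    print(f"{name}: {mapping} -> {total_latency(mapping):.1f} ms")
```

With these made-up numbers, the computation-prioritized mapping alternates accelerators and pays for two transfers, while the communication-aware mapping keeps all layers on one accelerator, accepting 1 ms more compute to save 6 ms of transfer.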

Peipei Zhou, Jiayi Sheng, Cody Hao Yu, Peng Wei, Jie Wang, Di Wu, Jason Cong

May 2021 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA 21)

MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators

FPGAs have been widely deployed in public clouds, e.g., Amazon Web Services (AWS) and Huawei Cloud. However, simply offloading accelerated kernels from CPU hosts to PCIe-based FPGAs does not guarantee out-of-pocket cost savings in a pay-as-you-go public cloud. Taking Genome Analysis Toolkit (GATK) applications as case studies, although the adoption of FPGAs reduces the overall execution time, it introduces 2.56x extra cost, because Amdahl's law limits the application-level speedup. To optimize the out-of-pocket cost while keeping high speedup and throughput, we propose the Mocha framework, a distributed runtime system that fully utilizes accelerator resources through accelerator sharing and CPU-FPGA partial task offloading. Evaluation results on HaplotypeCaller (HTC) and Mutect2 in GATK show that, compared with a straightforward CPU-FPGA integration, Mocha reduces application cost on AWS by 2.82x for HTC and 1.06x for Mutect2, and on Huawei Cloud by 1.22x and 1.52x respectively, with less than 5.1% performance overhead.
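The reason a faster run can still cost more follows from Amdahl's law combined with pay-as-you-go pricing. The sketch below is a generic illustration, not the Mocha cost model. The accelerated fraction, kernel speedup, and instance price ratio are hypothetical values chosen only to show the effect.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Application-level speedup when only part of the runtime is accelerated."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


def relative_cost(price_ratio, app_speedup):
    """Cost of the accelerated run relative to the CPU-only run.

    price_ratio is the hourly price of the FPGA instance divided by the hourly
    price of the CPU instance. A result above 1.0 means the accelerated run
    is more expensive despite finishing sooner.
    """
    return price_ratio / app_speedup


# Hypothetical inputs: half the runtime is offloaded, the kernel runs 10x
# faster on the FPGA, and the FPGA instance costs 4x as much per hour.
speedup = amdahl_speedup(accelerated_fraction=0.5, kernel_speedup=10.0)
cost = relative_cost(price_ratio=4.0, app_speedup=speedup)

print(f"Application speedup: {speedup:.2f}x")
print(f"Relative cost: {cost:.2f}x")
```

In this hypothetical case the application finishes about 1.8x sooner but costs about 2.2x as much, which is the kind of gap that accelerator sharing and partial offloading aim to close.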