Peipei Zhou

Assistant Professor of Engineering

Brown University

Biography

I am currently a Tenure-Track Assistant Professor at Brown University, School of Engineering. I obtained my Ph.D. in Computer Science from University of California, Los Angeles in 2019 supervised by Prof. Jason Cong, who leads UCLA VAST (VLSI Architecture, Synthesis and Technology) Group and CDSC (The Center for Domain-Specific Computing). My major interest is in Customized Computer Architecture and Programming Abstraction for Applications including Healthcare, e.g., Precision Medicine and Artificial Intelligence. I’m honored to receive “Outstanding Recognition in Research” from UCLA Samueli School of Engineering in 2019. I have also received 🏆 2019 TCAD Donald O. Pederson Best Paper Award 🏆 in recognition of best paper published in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) in the two calendar years preceding the award. My papers have also received 🏆 2024 IEEE IGSC Best Viewpoint Paper 🏆, 2025 ACM/SIGDA FPGA Best Paper Nominee 🏆, 2018 IEEE/ACM ICCAD Best Paper Nominee 🏆, 2018 IEEE ISPASS Best Paper Nominee 🏆.

I’m actively recruiting PhD students and research interns! Self-motivated students with relevant research and project experience (compiler, GPU and FPGA programming, artificial intelligence algorithm and application development, etc.) are highly encouraged to contact me via email.
Download my CV.
Website at Brown
Researchers@Brown
Former Website at UCLA

Interests

Application & Algorithm: Artificial Intelligence, Healthcare
Abstraction: Programming, Modeling and Optimization
Architecture: Heterogeneous Computing with FPGA, GPU, ASIC, NPU

Education

PhD in Computer Science, 2019

University of California, Los Angeles
MSc in Electrical Engineering, 2014

University of California, Los Angeles
BSc in Electrical Engineering, 2012

Southeast University, Chien-Shiung Wu Honor College

Research Focus

Application

Health & Artificial Intelligence

Abstraction

Software

Accelerator

Hardware

Experience

Tenure-Track Assistant Professor

Brown University

Sep 2024 – Present Providence, RI

Responsibilities include:

Tenure-Track Assistant Professor
Computer Engineering Concentration Advisor

Tenure-Track Assistant Professor

University of Pittsburgh

Sep 2021 – Aug 2024 Pittsburgh, PA

Responsibilities include:

Tenure-Track Assistant Professor

Staff Software Engineering

Enflame

Aug 2019 – Aug 2021 Shanghai

Responsibilities include:

High-Performance Convolution Neural Network Library for Deep Learning ASIC Acclerator
Pre-silicon Architecture Exploration and Performance Modeling
Post-silicon ASIC Bring-Up and System Software Optimization

Software Engineering Intern

Falcon Computing Solutions

Jun 2018 – Mar 2019 Los Angeles, CA

Responsibilities include:

Resource allocation and scheduling optimization for Genome Analysis Toolkit(GATK4) in the cloud.
Cost-Optimal heterogeneous computing by orchestrating seas of datacenter scale resources including computing (CPUs+GPU+FPGA accelerators), storage (HDD, SSD, local disk)

Software Engineering Intern

Microsoft

Jun 2017 – Sep 2017 Redmond, WA

Responsibilities include:

Developed a scalable end-to-end tool to generate >2M grammar fixed (preposition, article and etc.) sentences from Wikipedia.
Implemented LSTM-based neural network model for translation tasks

Research Intern

Microsoft Research

Jun 2014 – Sep 2014 Redmond, WA

Responsibilities include:

Implemented image compression algorithm in C++(software reference code) and also implemented hardware accelerator on FPGA

Honors

Caffeine won IEEE Transactions on Computer-Aided Design Donald O. Pederson Best Paper Award

IEEE Jun 2019

Donald O. Pederson Best Paper Award is dedicated to award the best paper published in IEEE TCAD in the recent two calendar years. Current Associate Editors of the IEEE TCAD nominates the best paper candidates first. Among the papers published in the past two years, the most referenced or downloaded papers are nominated automatically by the entire editorial board for review and voting. The editorial board nominated five papers this year, and another nine papers are automatically nominated for receiving highest downloads in the past two years. After the voting, a confidential review committee reviews the top five papers before deciding the final winners. The selection committee unanimously agreed to declare two of the candidates to be co-winners. The award is recognized at the Design Automation Conference (DAC) in Las Vegas on Jun. 4th, 2019.

See certificate

UCLA Outstanding Ph.D. Researcher Award

UCLA Samueli School of Engineering Jun 2019

2019 UCLA Computer Science Department Outstanding Ph.D. Researcher Award

See certificate

Phi Tau Phi Scholarship

Phi Tau Phi Scholastic Honor Society West America Chapter Jun 2018

One of five recipients of 2018 Phi Tau Phi Scholarship in recognition of academic achievements and scholarly contributions.

See certificate

Projects

Recent & Upcoming Talks

Efficient Programming on Heterogeneous Accelerators

How to map deep learning workloads onto heterogeneous SoCs with FPGAs and tensor cores? How to explore latency-throughput tradeoffs? …

May 9, 2025 12:00 PM — 4:00 PM Sharp Laboratory, University of Delaware, Newark, DE 19716, USA

Peipei Zhou

Architectural Challenges and Innovation for Compute Infrastructure Co-Design

How to use heterogeneous computing to accelerate real-time transformer inference? How to scale out computing in data centers, and how …

Oct 18, 2023 12:00 PM — 4:00 PM Open Compute Project (OCP) Summit 2023, San Jose, CA

Peipei Zhou

Architectural Challenges and Innovation for Compute Infrastructure Co-Design

CHARM Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

Which platform beats 7nm GPU A100 in energy efficiency? AMD Versal ACAP (FPGA+AI Chip) How to program AMD Versal ACAP, i.e., FPGA + AI …

Apr 7, 2023 12:13 PM — 4:00 PM CMU, Pittsburgh, PA

Peipei Zhou

Customizable Domain Specific Computing

Dec 10, 2018 12:13 PM — 4:00 PM Washington, D.C

Jason Cong, Peipei Zhou

Featured Publications

Shixin Ji, Zhuoping Yang, Xingzhen Chen, Wei Zhang, Jinming Zhuang, Alex Jones, Zheng Dong, Peipei Zhou

December 2025 To appear in the 46th IEEE Real-Time Systems Symposium, RTSS 2025, December 2–5, 2025, Boston, MA, USA. Full Paper Accepted! https://ieeexplore.ieee.org/document/11315054

DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems (🔥📣New Paper & Project🔥📣! )

Deep neural network (DNN) models are increasingly deployed in real-time, safety-critical systems such as autonomous vehicles, driving the need for specialized AI accelerators. However, most existing accelerators support only non-preemptive execution or limited preemptive scheduling at the coarse granularity of DNN layers. This restriction leads to frequent priority inversion due to the scarcity of preemption points, resulting in unpredictable execution behavior and, ultimately, system failure.

To address these limitations and improve the real-time performance of AI accelerators, we propose DERCA, a novel accelerator architecture that supports fine-grained, intra-layer flexible preemptive scheduling with cycle-level determinism. DERCA incorporates an on-chip Earliest Deadline First (EDF) scheduler to reduce both scheduling latency and variance, along with a customized dataflow design that enables intra-layer preemption points (PPs) while minimizing the overhead associated with preemption. Leveraging the limited preemptive task model, we perform a comprehensive predictability analysis of DERCA, enabling formal schedulability analysis and optimized placement of preemption points within the constraints of limited preemptive scheduling. We implement DERCA on the AMD ACAP VCK190 reconfigurable platform. Experimental results show that DERCA outperforms state-of-the-art designs using non-preemptive and layer-wise preemptive dataflows, with less than 5% overhead in worst-case execution time (WCET) and only 6% additional resource utilization. DERCA is open-sourced on GitHub: https://github.com/arc-research-lab/DERCA.

Zhuoping Yang, Jinming Zhuang, Xingzhen Chen, Alex Jones, Peipei Zhou

November 2025 Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025, Nov. 16 - Nov. 21, 2025, St. Louis, MO, US. Full Paper Accepted! https://dl.acm.org/doi/10.1145/3712285.3759778

AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration(🔥📣New Paper & Project🔥📣! )

Graphics Processing Units (GPUs) have become essential for com- putationally intensive applications. However, emerging workloads such as recommender systems, graph analytics, and data analytics often involve processing data exceeding GPU device memory capacity. To mitigate this issue, existing solutions enable GPUs to use CPU DRAM or SSDs as external memory. Among them, the GPU-centric approach lets GPU threads directly initiate NVMe requests, eliminating CPU intervention overhead over traditional methods. However, the SOTA GPU-centric approach adopts a synchronous IO model, and threads must tolerate the long latency in communication before starting any tasks.

In this work, we propose AGILE, a lightweight and efficient asynchronous library allowing GPU threads to access SSDs asynchronously while eliminating deadlock risks. AGILE also integrates a flexible software cache using GPU High-Bandwidth Memory (HBM). We demonstrate that the asynchronous GPU-centric IO achieves up to 1.88× improvement in workloads with different computation-to-communication (CTC) ratios. We also compare AGILE with the SOTA work BaM on Deep Learning Recommendation Models (DLRM) with various settings, and the results show that AGILE achieves 1.75× performance improvement due to its efficient design and the overlapping strategy enabled by an asynchronous IO model. We further evaluate AGILE’s API overhead on graph applications, and the results demonstrate AGILE reduces software cache overhead by up to 3.12× and overhead in NVMe IO requests by up to 2.85×. Compared with BaM, AGILE consumes fewer registers and exhibits up to 1.32× reduction in the usage of registers.

Shixin Ji, Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Sarah Schultz, Yukai Song, Jingtong Hu, Alex Jones, Zheng Dong, Peipei Zhou

June 2025 Proceedings of the Great Lakes Symposium on VLSI 2025, GLSVLSI 2025, June 30 - July 2, New Orleans, LA, US. Full Paper Accepted! https://dl.acm.org/doi/pdf/10.1145/3716368.3735215

ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems(🔥📣New Paper & Project🔥📣! )

Real-time systems are widely applied in different areas like autonomous vehicles, where safety is the key metric. However, on the FPGA platform, most of the prior accelerator frameworks omit discussing the schedulability in such real-time safety-critical systems, leaving deadlines unmet, which can lead to catastrophic system failures. To address this, we propose the ART framework, a hardware-software co-design approach that transforms baseline accelerators into real-time guaranteed accelerators. On the software side, ART performs schedulability analysis and preemption point placement, optimizing task scheduling to meet deadlines and enhance throughput. On the hardware side, ART integrates the Global Earliest Deadline First (GEDF) scheduling algorithm, implements preemption, and conducts source code transformation to transform baseline HLS-based accelerators into designs targeted for real-time systems capable of saving and resuming tasks. ART also includes integration, debugging, and testing tools for full-system implementation. We demonstrate the methodology of ART on two kinds of popular accelerator models and evaluate on AMD Versal VCK190 platform, where ART meets schedulability requirements that baseline accelerators fail. ART is lightweight, utilizing <0.5% resources. With about 100 lines of user input, ART generates about 2.5k lines of accelerator code, making it a push-button solution.

Jinming Zhuang, Shaojie Xiang, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, Peipei Zhou

January 2025 Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2025, Feb. 28 - March 3, Monterey, CA, US. Full Paper Accepted! https://dl.acm.org/doi/10.1145/3706628.3708870

ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines(🔥📣 FPGA 2025 Best Paper Candidate🔥📣! )

As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD’s AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very-long instruction words (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE’s specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to figure out the parallelism within a kernel, manually do the partition, and assign sub-tasks to different AIE cores to exploit parallelism. These significantly lower the programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance.

To address these limitations, we introduce ARIES, an agile multilevel intermediate representation (MLIR) based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, both with or without FPGA, facilitating holistic optimization and better portability across AIE device families. For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput under FP32, INT16, and, INT8 data types on Versal VCK190 respectively. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x correspondingly. For ResNet residual layer, ARIES achieves up to 22.58x speedup compared with optimized SOTA work Riallto on Ryzen-AI NPU. ARIES is opensourced on GitHub: https://github.com/arc-research-lab/Aries.

Shixin Ji, Jinming Zhuang, Zhuoping Yang, Alex Jones, Peipei Zhou

November 2024 Proceedings of the IEEE 15th International Green and Sustainable Computing Conference, IGSC 2024

Amortizing Embodied Carbon Across Generations (🔥📣Best Viewpoint Paper in IGSC 2024🔥📣! )

Peiyan Dong, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Yanyu Li, Dongkuan Xu, Heng Huang, Jingtong Hu, Alex Jones, Yiyu Shi, Yanzhi Wang, Peipei Zhou

October 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) in conjunction with (ESWEEK), RALEIGH, NC, USA, Sept. 29-Oct. 4, 2024. Also appears as part of the ESWEEK-TCAD Special Issue, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)

EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Shixin Ji, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex K. Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

September 2024 ACM Transactions on Reconfigurable Technology and Systems

CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture (🔥📣New Paper & Project🔥📣! )

Yue Tang, Yukai Song, Naveena Elango, Sheena Ratnam Priya, Alex Jones, Jinjun Xiong, Peipei Zhou, Jingtong Hu

September 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) in conjunction with (ESWEEK), RALEIGH, NC, USA, Sept. 29-Oct. 4, 2024. Also appears as part of the ESWEEK-TCAD Special Issue, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)

CHEF: A Framework for Deploying Heterogeneous Models on Clusters with Heterogeneous FPGAs (🔥📣New Paper & Project🔥📣! )

Zhuoping Yang, Wei Zhang, Shixin Ji, Peipei Zhou, Alex Jones

September 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) in conjunction with (ESWEEK), RALEIGH, NC, USA, Sept. 29-Oct. 4, 2024

Reducing Smart Phone Environmental Footprints with In-Memory Processing (🔥📣New Paper & Project🔥📣! )

Shixin Ji, Zhuoping Yang, Xingzhen Chen, Stephen Cahoon, Jingtong Hu, Yiyu Shi, Alex Jones, Peipei Zhou

July 2024 Proceedings of the IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2024

SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators (🔥📣New Paper & Project🔥📣! )

Ruiyang Qin, Jun Xia, Zhenge Jia, Meng Jiang, Ahmed Abbasi, Peipei Zhou, Jingtong Hu, Yiyu Shi

June 2024 Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, California, USA, (DAC ’24)

Enabling On-Device Self-Supervised LLM Personalization with Selective Synthetic Data (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Zhuoping Yang, Shixin Ji, Heng Huang, Alex K. Jones, Jingtong Hu, Yiyu Shi, Peipei Zhou

February 2024 Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2024, March 3 - March 5, Monterey, CA, US. Full Paper Accepted! https://doi.org/10.1145/3626202.3637569

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration (🔥📣New Paper & Project🔥📣! )

Zhuoping Yang, Shixin Ji, Xingzhen Chen, Jinming Zhuang, Weifeng Zhang, Dharmesh Jani, Peipei Zhou

January 2024 Proceedings of the 28th Asia and South Pacific Design Automation Conference, ASPDAC 2024, Incheon Songdo Convensia, South Korea! https://doi.org/10.1109/ASP-DAC58780.2024.10473961

Challenges and Opportunities to Enable Large-Scale Computing via Heterogeneous Chiplets (🔥📣New Paper & Project🔥📣! )

Zhuoping Yang, Jinming Zhuang, Jiaqi Yin, Cunxi Yu, Alex K. Jones, Peipei Zhou

July 2023 Proceedings of the 42nd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2023, October 29, 2023 - November 2, 2023, San Francisco, CA, USA. Full Paper Accepted (acceptance ratio is 21 percent)

AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Zhuoping Yang, Peipei Zhou

February 2023 Proceedings of the 60th ACM/IEEE Design Automation Conference, San Francisco, California, USA, (DAC ’23), July 9–13, 2023, San Francisco, CA, USA. Full Paper Accepted (acceptance ratio is 23 percent)

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives (🔥📣New Paper & Project🔥📣! )

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex K. Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou

January 2023 Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3543622.3573210

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture (🔥📣New Paper & Project🔥📣! )

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises, how can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator.

Xinyi Zhang, Cong Hao, Peipei Zhou, Alex Jones, Jingtong Hu

March 2022 2022 Design Automation Conference (DAC 2022)

H2H: Heterogeneous Model to Heterogeneous System Mapping with Computation and Communication Awareness

The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. For the algorithm, the heterogeneity in ML models comes from the multi-sensor perceiving and multi-task learning, i.e., multi-modality multi-task (MMMT) models, resulting in diverse deep neural net- work (DNN) layers and computation patterns. For the system, it becomes prevailing to integrate dedicated acceleration components into one system. It thus introduces a new problem, heterogeneous model to heterogeneous system mapping (H2H), in which both computation and communication efficiency need to be considered. While previous mapping algorithms only focus on computation patterns, in this work, we propose a novel mapping algorithm with both computation and communication awareness. By slightly sacrificing computation efficiency, the communication latency is largely reduced. Therefore, the system overall performance is improved and energy is also reduced. The superior performance of our work is evaluated on MAESTRO, achieving 15%-74% latency improvement and 23%-64% energy reduction when compared with the existing computation-prioritized mapping algorithm.

Peipei Zhou, Jiayi Sheng, Cody Hao Yu, Peng Wei, Jie Wang, Di Wu, Jason Cong

May 2021 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA 21)

MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators

FPGAs have been widely deployed in public clouds, e.g., Amazon Web Services (AWS) and Huawei Cloud. However, simply offloading accelerated kernels from CPU hosts to PCIe-based FPGAs does not guarantee out-of-pocket cost savings in a pay-as-you-go public cloud. Taking Genome Analysis Toolkit (GATK) applications as case studies, although the adoption of FPGAs reduces the overall execution time, it introduces 2.56× extra cost, due to insufficient application-level speedup by Amdahl’s law. To optimize the out-of-pocket cost while keeping high speedup and throughput, we propose Mocha framework as a distributed runtime system to fully utilize the accelerator resource by accelerator sharing and CPU-FPGA partial task offloading. Evaluation results on HaplotypeCaller (HTC) and Mutect2 in GATK show that on AWS, Mocha saves on the application cost by 2.82x for HTC, 1.06x for Mutect2 and on Huawei Cloud by 1.22x, 1.52x respectively than straightforward CPU-FPGA integration solution with less than 5.1% performance overhead.