I joined the ECE department at the University of Pittsburgh as a Tenure-Track Assistant Professor in September 2021. I obtained my Ph.D. in Computer Science from the University of California, Los Angeles in 2019, supervised by Prof. Jason Cong, who leads the UCLA VAST (VLSI Architecture, Synthesis and Technology) Group and CDSC (The Center for Domain-Specific Computing). My major interest is in Customized Computer Architecture and Programming Abstraction for applications including Healthcare (e.g., Precision Medicine) and Artificial Intelligence. I was honored to receive the “Outstanding Recognition in Research” award from the UCLA Samueli School of Engineering in 2019. I have also received the 🏆 2019 TCAD Donald O. Pederson Best Paper Award 🏆, which recognizes the best paper published in the IEEE Transactions on CAD in the two calendar years preceding the award. My papers were also Best Paper Nominees at ICCAD 2018 🏆 and ISPASS 2018 🏆.
I’m actively recruiting PhD students and research interns! Self-motivated students with relevant research and project experience (compiler, GPU and FPGA programming, artificial intelligence algorithm and application development, etc.) are highly encouraged to contact me via email.
Download my CV.
Former Website at UCLA
PhD in Computer Science, 2019
University of California, Los Angeles
MSc in Electrical Engineering, 2014
University of California, Los Angeles
BSc in Electrical Engineering, 2012
Southeast University, Chien-Shiung Wu Honor College
Health & Artificial Intelligence
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs of performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieves less than 5% of the theoretical peak performance. One key question therefore arises: how can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck as the mismatch between the massive computation resources of one monolithic accelerator and the many small MM layers in the application. To resolve this problem, we propose the CHARM framework, which composes multiple diverse MM accelerator architectures that work concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system design, CHARM automatically generates code, enabling thorough onboard design verification.
We deploy the CHARM framework for four different deep learning applications (BERT, ViT, NCF, and MLP) on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, and MLP, respectively, corresponding to 5.40x, 32.51x, 1.00x, and 1.00x throughput gains over one monolithic accelerator.
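As a rough sanity check on the figures quoted above, the following sketch relates the 6.4 TFLOPs peak to the per-core rate it implies and computes what fraction of that peak each reported throughput represents. The per-core FLOPs-per-cycle value is derived from the quoted numbers rather than taken from vendor documentation, the comparison assumes the reported throughputs use the same fp32 data type as the peak figure, and all variable and function names are illustrative.

```python
# Figures quoted in the abstract above.
NUM_AIE = 400          # AI Engine processors in the array
CLOCK_HZ = 1e9         # 1 GHz
PEAK_FLOPS = 6.4e12    # theoretical fp32 peak, 6.4 TFLOPs

# Implied per-core rate (derived here, not a vendor-documented constant).
flops_per_cycle_per_core = PEAK_FLOPS / (NUM_AIE * CLOCK_HZ)

# Reported end-to-end inference throughput in TFLOPs.
reported_tflops = {"BERT": 1.46, "ViT": 1.61, "NCF": 1.74, "MLP": 2.94}


def fraction_of_peak(tflops, peak_flops=PEAK_FLOPS):
    """Return achieved throughput as a fraction of the theoretical peak."""
    return tflops * 1e12 / peak_flops


# Assumption: reported throughputs are fp32, like the peak figure.
utilization = {name: fraction_of_peak(t) for name, t in reported_tflops.items()}

if __name__ == "__main__":
    print(f"implied FLOPs/cycle/core: {flops_per_cycle_per_core:.1f}")
    for name, frac in utilization.items():
        print(f"{name}: {frac:.1%} of peak")
```

Under these assumptions the reported results sit between roughly 23% and 46% of the theoretical peak, compared with the under-5% figure observed for small BERT layers on a monolithic accelerator.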
The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. On the algorithm side, the heterogeneity in ML models comes from multi-sensor perception and multi-task learning, i.e., multi-modality multi-task (MMMT) models, resulting in diverse deep neural network (DNN) layers and computation patterns. On the system side, it is becoming prevalent to integrate dedicated acceleration components into one system. This introduces a new problem, heterogeneous model to heterogeneous system mapping (H2H), in which both computation and communication efficiency need to be considered. While previous mapping algorithms focus only on computation patterns, in this work we propose a novel mapping algorithm with both computation and communication awareness. By slightly sacrificing computation efficiency, the communication latency is largely reduced. As a result, overall system performance improves and energy consumption is reduced. We evaluate our work on MAESTRO, achieving 15%-74% latency improvement and 23%-64% energy reduction compared with the existing computation-prioritized mapping algorithm.
FPGAs have been widely deployed in public clouds, e.g., Amazon Web Services (AWS) and Huawei Cloud. However, simply offloading accelerated kernels from CPU hosts to PCIe-based FPGAs does not guarantee out-of-pocket cost savings in a pay-as-you-go public cloud. Taking Genome Analysis Toolkit (GATK) applications as case studies, although the adoption of FPGAs reduces the overall execution time, it introduces 2.56x extra cost, because Amdahl’s law limits the application-level speedup. To optimize the out-of-pocket cost while keeping high speedup and throughput, we propose the Mocha framework, a distributed runtime system that fully utilizes the accelerator resource through accelerator sharing and CPU-FPGA partial task offloading. Evaluation results on HaplotypeCaller (HTC) and Mutect2 in GATK show that, compared with a straightforward CPU-FPGA integration solution, Mocha reduces the application cost by 2.82x for HTC and 1.06x for Mutect2 on AWS, and by 1.22x and 1.52x, respectively, on Huawei Cloud, with less than 5.1% performance overhead.
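The cost argument above can be illustrated with a small model. This is a sketch under simple assumptions, not Mocha's actual cost model: the cost of a job is taken to be the hourly instance price times its runtime, and the application-level speedup follows Amdahl's law. The function names and all numeric inputs below are hypothetical.

```python
def amdahl_speedup(accel_fraction, kernel_speedup):
    """Application-level speedup when only `accel_fraction` of the
    original runtime is accelerated by a factor of `kernel_speedup`."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup)


def relative_cost(price_ratio, app_speedup):
    """Cost of the FPGA run relative to the CPU-only run, assuming
    cost = hourly price * runtime.
    `price_ratio` is the FPGA instance price divided by the CPU instance price."""
    return price_ratio / app_speedup


# Hypothetical inputs, chosen only for illustration.
accel_fraction = 0.5    # half of the runtime is offloadable
kernel_speedup = 10.0   # the offloaded kernel runs 10x faster on the FPGA
price_ratio = 4.0       # the FPGA instance costs 4x more per hour

speedup = amdahl_speedup(accel_fraction, kernel_speedup)
cost = relative_cost(price_ratio, speedup)

if __name__ == "__main__":
    print(f"application speedup: {speedup:.2f}x")
    print(f"relative cost:       {cost:.2f}x")
```

With these made-up inputs the job finishes about 1.8x faster yet costs about 2.2x more. In this model the FPGA run is cheaper only when the application-level speedup exceeds the price ratio, which is why raising accelerator utilization matters for out-of-pocket cost.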