Architecture-Tailored Parallelization for Accessible Large Model Era
Xupeng Miao (Carnegie Mellon University)
Colloquium
Tuesday, March 5, 2024, 3:30 pm
Abstract
In this talk, I will introduce my work on machine learning (ML) parallelization, a critical endeavor to bridge the significant gap between diverse ML programs and multitiered computing architectures. Specifically, I will explore ML parallelization at three distinct yet interconnected levels. First, I will show that exploring the previously unexplored space of model partitioning strategies improves communication efficiency and makes distributed ML training 3-20x faster than existing systems. I will highlight two distributed ML systems: HET for sparse embedding models and Galvatron for dense Transformer models. Second, I will discuss how to improve GPU utilization through ML parallelization. I will present SpecInfer, a system that reduces large language model (LLM) serving latency by 1.5-3.5x compared to existing systems by leveraging a novel tree-based speculative inference and verification mechanism. Third, I will demonstrate how ML parallelization makes LLMs more widely accessible by extending its reach to inter-cloud environments. I will describe SpotServe, the first LLM serving system on spot instances, which handles preemptions with dynamic reparallelization, maintains relatively low tail latency, and reduces monetary cost by 54%. Finally, I will conclude with a discussion of how to push my research forward toward a holistic and unified infrastructure for democratizing ML.
Bio
Xupeng Miao is currently a postdoctoral researcher at Carnegie Mellon University working with Prof. Zhihao Jia and Prof. Tianqi Chen. Before that, he received his Ph.D. from Peking University, where he was advised by Prof. Bin Cui. He is broadly interested in machine learning systems, data management, and distributed computing. His research has resulted in 30+ publications (including 13 first-authored papers) in top-tier conferences such as OSDI, ASPLOS, SIGMOD, VLDB, NSDI, and NeurIPS. Recently, he has focused on building efficient, scalable, and affordable software systems (e.g., FlexFlow Serve) for large language models. His work has been recognized with the 2022 ACM China Doctoral Dissertation Award and the Best Scalable Data Science Paper Award of VLDB 2022.