Scalable and Efficient Systems for Large Language Models
Lianmin Zheng (UC Berkeley)
Colloquium
Monday, April 8, 2024, 3:30 pm
Abstract
Large Language Models (LLMs) have been driving recent breakthroughs in AI. These advancements would not have been possible without the support of scalable and efficient infrastructure systems. In this talk, I will introduce several systems I have designed and built to support the entire model lifecycle, from training to deployment to evaluation. First, I will present Alpa, a system for large-scale model-parallel training that automatically generates execution plans unifying inter- and intra-operator parallelism. Next, I will discuss efficient deployment systems, covering both the frontend programming interface and backend runtime optimizations for high-performance inference. Finally, I will complete the model lifecycle by presenting our model evaluation efforts, including the crowdsourced live benchmark platform, Chatbot Arena, and the automatic evaluation pipeline, LLM-as-a-Judge. These projects have collectively laid a solid foundation for large language model systems, being widely adopted by leading LLM developers and companies. I will conclude by outlining some future directions, such as a programmatic and composable software stack for using LLMs and further improvements with synthetic data.
Bio
Lianmin Zheng is a Ph.D. student in the EECS department at UC Berkeley, advised by Ion Stoica and Joseph E. Gonzalez. His research interests include machine learning systems, large language models, compilers, and distributed systems. He builds full-stack, scalable, and efficient systems to advance the development of AI. He co-founded LMSYS.org, where he leads impactful open-source large language model projects such as Vicuna and Chatbot Arena, which have received millions of downloads and served millions of users. He has received a Meta Ph.D. Fellowship, an IEEE Micro Best Paper Award, and an a16z open-source AI grant.