Exoshuffle: Large-Scale Shuffle at the Application Level


Shuffle is a key primitive in large-scale data processing applications. The difficulty of large-scale shuffle has inspired the development of many specialized shuffle systems. While these systems greatly improve shuffle performance and reliability, they come at a cost: flexibility. First, each shuffle system is essentially built from scratch, which is a significant developer effort. Second, the monolithic design of these shuffle systems makes them too rigid to support fine-grained pipelining, as desired by applications like distributed ML training. We argue that the inflexibility stems from the tight coupling of shuffle algorithms and system-level optimizations, and propose to use the distributed futures abstraction to decouple shuffle algorithms from the system. We present Exoshuffle, an application-level shuffle design built on top of Ray, a task-based distributed futures system. We show that it is possible to (1) express shuffle algorithms from previous shuffle systems in a few hundred lines of application-level Python code, (2) achieve competitive performance and scalability with specialized data systems like Spark, and (3) achieve interoperability with other data applications via fine-grained pipelining.

Frank Sifei Luan
Frank Sifei Luan
栾思飞 | PhD Student

PhD at UC Berkeley focused on AI systems and cloud computing.