Fast Parallel Sorting: Bridging Theory and Practice on the CM-5 David E. Culler CS Division UC Berkeley As large-scale parallel machines evolve from exotic machine structures to glorified networks of workstations there is the hope that generic performance models will serve as an adequate guide for developing high-quality parallel programs. The LogP model characterizes the performance, not structure, of a parallel machine with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). Split-C is a modest parallel language which, like C, exposes the performance characteristics of the underlying machine. In this talk, I will describe a study where we take four interesting parallel sorting algorithms (bitonic, column, radix, and sample sort), analyze them under LogP, implement them in Split-C, and compare the measured performance on a CM-5 of 32 to 512 processors with that obtained from the model after substituting in the CM-5 parameters. Our experience was that the model served as a valuable guide throughout the development of the fast parallel sorts and revealed subtle defects in the implementations. The final observed performance matches closely with the prediction across a broad range of problem and machine sizes.