Fast Parallel Sorting: Bridging Theory and Practice on the CM-5
				   
			   David E. Culler
			     CS Division
			     UC Berkeley
				   

As large-scale parallel machines evolve from exotic machine structures
to glorified networks of workstations there is the hope that generic
performance models will serve as an adequate guide for developing
high-quality parallel programs.  The LogP model characterizes the
performance, not structure, of a parallel machine with a small set of
parameters: the communication latency (L), overhead (o), bandwidth
(g), and the number of processors (P).  Split-C is a modest parallel
language which, like C, exposes the performance characteristics of the
underlying machine.  In this talk, I will describe a study where we
take four interesting parallel sorting algorithms (bitonic, column,
radix, and sample sort), analyze them under LogP, implement them in
Split-C, and compare the measured performance on a CM-5 of 32 to 512
processors with that obtained from the model after substituting in the
CM-5 parameters.  Our experience was that the model served as a
valuable guide throughout the development of the fast parallel sorts
and revealed subtle defects in the implementations. The final observed
performance matches closely with the prediction across a broad range
of problem and machine sizes.