Performance Portable Language and Algorithm Design meets the T3D

		      Computer Science Division
		 University of California, Berkeley

In 1992 a wave of new MPP systems were arose that followed the
``shell'' approach, including the Thinking Machines CM-5, Intel
Paragon, Meiko CS-2, and Cray T3D.  In this approach the core of each
node is realized by a state-of-the-art commercial microprocessor and
its memory system, surrounded by a shell of additional logic to
support communication and synchronization.  Based on the announced
designs, we developed a simple parallel extension to the C language,
called Split-C, with the goal of extracting the full performance
capability out of this wave of machines. This provides a full C on
each node operating out of the local memory, augmented with a rich set
of assignment operations on the collective global address space.  As
the announcements were followed by delivery of the machines, we have
conducted the experiment of implementing the language on the machine
and assessing its performance.  The T3D provides a very interesting
case study because the shell is so elaborate, including support
global-memory access, prefetch, atomic operations, barriers, and block
transfers. The semantics of hardware primitives for global operation
are at essentially the same level as the language primitives.  Many
distinct mechanisms exist to perform the same function, and the
performance characteristics of the various mechanisms are not obvious.

This talk has two parts.  The first part presents our language
implementation approach, which begins by establishing the performance
characteristics of the machine empirically via microbenchmarks and
then tries to minimize the additional cost in mapping the language to
the hardware.  The second part is application case study using a
"hybrid algorithm" motivated by the local/global cost model of
Split-C, which yields the current worlds fastest solution connected
components to the problem arising in cluster methods.