Empirical Evaluation of the CRAY-T3D: A Compiler Perspective

David E. Culler
Computer Science Division
University of California, Berkeley

In 1992 a wave of new MPP systems arose that followed the ``shell'' approach, including the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, and Cray T3D. In this approach the core of each node is a state-of-the-art commercial microprocessor and its memory system, surrounded by a shell of additional logic to support communication and synchronization. Based on the announced designs, we developed a simple parallel extension to the C language, called Split-C, with the goal of extracting the full performance capability of this wave of machines. The language provides full C on each node, operating out of local memory, augmented with a rich set of assignment operations on the collective global address space. As the announcements were followed by delivery of the machines, we have conducted the experiment of implementing the language on each machine and assessing its performance.

The T3D provides a particularly interesting case study because its shell is so elaborate, including support for global-memory access, prefetch, atomic operations, barriers, and block transfers. The semantics of the hardware primitives for global operations are at essentially the same level as the language primitives. Many distinct mechanisms exist to perform the same function, and the performance characteristics of the various mechanisms are not obvious.

This talk reflects our language implementation approach, which begins by establishing the actual performance of the machine and then tries to minimize the additional cost of mapping the language onto the hardware. To do this, we follow a ``gray-box'' methodology: design documents are used to establish the functional characteristics of the hardware, and a set of micro-benchmarks is used to characterize its performance empirically. Together these dictate the code-generation strategy. The talk will provide a detailed empirical performance characterization of the hardware primitives, evaluate their utility in code generation for a parallel language, and discuss trade-offs and pitfalls in the machine architecture.

This is joint work with Remzi Arpaci, Arvind Krishnamurthy, Steve Steinberg, and Katherine Yelick.
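To make the language model concrete, the following is a minimal illustrative sketch (not taken from the talk) of the kind of global address space operations Split-C exposes, assuming the published Split-C syntax: the *global pointer qualifier, the split-phase := assignment, and sync() to await outstanding global operations.

    /* Illustrative Split-C fragment. Every processor runs this code on its
     * own local data; "src" may point into another processor's memory. */
    void read_remote(int *global src, int *dst, int n)
    {
        int i;

        /* Blocking read through a global pointer: the remote access
         * completes before the next statement executes. */
        int first = *src;

        /* Split-phase reads: issue the remaining gets, let them overlap,
         * then wait for all outstanding global operations to complete. */
        for (i = 1; i < n; i++)
            dst[i] := src[i];
        sync();

        dst[0] = first;
    }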
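As a rough illustration of the micro-benchmark style (the names and the timer here are hypothetical, not the actual benchmarks of this work), one mechanism at a time is exercised in a long loop and the elapsed time divided by the iteration count, for example to measure the cost of a blocking global read:

    /* Hypothetical micro-benchmark skeleton: average cost of one blocking
     * global read. A machine cycle counter would be used in practice
     * rather than the standard clock(). */
    #include <stdio.h>
    #include <time.h>

    #define N 100000

    void time_blocking_read(int *global src)
    {
        int i, sink = 0;
        clock_t start, stop;

        start = clock();
        for (i = 0; i < N; i++)
            sink += *src;          /* blocking global read each iteration */
        stop = clock();

        printf("avg per read: %g us (sink=%d)\n",
               1e6 * (double)(stop - start) / CLOCKS_PER_SEC / N, sink);
    }

Repeating the same skeleton for prefetched, split-phase, and bulk-transfer variants is what exposes the differences among the T3D's overlapping mechanisms.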