Empirical Evaluation of the CRAY-T3D: A Compiler Perspective

David E. Culler
Computer Science Division
University of California, Berkeley

In 1992 a wave of new MPP systems arose that followed the ``shell'' approach, including the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, and Cray T3D. In this approach the core of each node is a state-of-the-art commercial microprocessor and its memory system, surrounded by a shell of additional logic to support communication and synchronization. Based on the announced designs, we developed a simple parallel extension to the C language, called Split-C, with the goal of extracting the full performance capability of this wave of machines. The language provides full C on each node, operating out of local memory, augmented with a rich set of assignment operations on the collective global address space. As the announcements were followed by delivery of the machines, we have conducted the experiment of implementing the language on each machine and assessing its performance.

The T3D provides a particularly interesting case study because its shell is so elaborate, including support for global-memory access, prefetch, atomic operations, barriers, and block transfers. The semantics of the hardware primitives for global operations are at essentially the same level as the language primitives. Many distinct mechanisms exist to perform the same function, and the performance characteristics of the various mechanisms are not obvious.

This talk reflects our language implementation approach, which begins by establishing the actual performance of the machine and then tries to minimize the additional cost of mapping the language onto the hardware. To do this, we follow a ``gray-box'' methodology: design documents are used to establish the functional characteristics of the hardware, and a set of micro-benchmarks is used to characterize its performance empirically. Together these dictate the code-generation strategy. The talk will provide a detailed empirical performance characterization of the hardware primitives, evaluate their utility in code generation for a parallel language, and discuss trade-offs and pitfalls in the machine architecture.

This is joint work with Remzi Arpaci, Arvind Krishnamurthy, Steve Steinberg, and Katherine Yelick.
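To make the language model concrete, the following is a minimal illustrative sketch (not taken from the talk) of the kind of global address space operations Split-C exposes, assuming the published Split-C syntax: the *global pointer qualifier, the split-phase := assignment, and sync() to await outstanding global operations.

    /* Illustrative Split-C fragment. Every processor runs this code on its
     * own local data; "src" may point into another processor's memory. */
    void read_remote(int *global src, int *dst, int n)
    {
        int i;

        /* Blocking read through a global pointer: the remote access
         * completes before the next statement executes. */
        int first = *src;

        /* Split-phase reads: issue the remaining gets, let them overlap,
         * then wait for all outstanding global operations to complete. */
        for (i = 1; i < n; i++)
            dst[i] := src[i];
        sync();

        dst[0] = first;
    }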
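As a rough illustration of the micro-benchmark style (the names and the timer here are hypothetical, not the actual benchmarks of this work), one mechanism at a time is exercised in a long loop and the elapsed time divided by the iteration count, for example to measure the cost of a blocking global read:

    /* Hypothetical micro-benchmark skeleton: average cost of one blocking
     * global read. A machine cycle counter would be used in practice
     * rather than the standard clock(). */
    #include <stdio.h>
    #include <time.h>

    #define N 100000

    void time_blocking_read(int *global src)
    {
        int i, sink = 0;
        clock_t start, stop;

        start = clock();
        for (i = 0; i < N; i++)
            sink += *src;          /* blocking global read each iteration */
        stop = clock();

        printf("avg per read: %g us (sink=%d)\n",
               1e6 * (double)(stop - start) / CLOCKS_PER_SEC / N, sink);
    }

Repeating the same skeleton for prefetched, split-phase, and bulk-transfer variants is what exposes the differences among the T3D's overlapping mechanisms.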