John D. Kubiatowicz. Integrated Shared-Memory and Message-Passing
Communication in the Alewife Multiprocessor. PhD thesis,
Massachusetts Institute of Technology, Department of Electrical
Engineering and Computer Science, February 1998.
To date, MIMD multiprocessors have been divided into two classes based on hardware communication models: those supporting shared memory and those supporting message passing. Breaking with tradition, this thesis argues that multiprocessors should integrate both communication mechanisms in a single hardware framework. Such integrated multiprocessors must address several architectural challenges that arise from integration. These challenges include the User-Level Access problem, the Service-Interleaving problem, and the Protocol Deadlock problem. The first concerns which communication models are used and how these models are accessed; the second involves avoiding livelocks and deadlocks introduced by multiple simultaneous streams of communication; and the third involves removing multi-node cycles in communication graphs. This thesis introduces these challenges and develops solutions in the context of Alewife, a large-scale multiprocessor. Solutions involve careful definition of communication semantics and interfaces to permit tradeoffs across the hardware/software boundary. Among other things, we introduce the User-Direct Messaging model for message passing, the transaction buffer framework for preventing cache-line thrashing, and two-case delivery for avoiding protocol deadlock.
The Alewife prototype implements cache-coherent shared memory and user-level message passing in a single-chip Communications and Memory Management Unit (CMMU). The hardware mechanisms of the CMMU are coupled with a thin veneer of runtime software to support a uniform high-level communications interface. The CMMU employs a scalable cache-coherence scheme, functions with a single-channel, bidirectional network, and directly supports up to 512 nodes. This thesis describes the design and implementation of the CMMU, associated processor-level interfaces, and runtime software. Included in our discussion is an implementation framework called service coupling, which permits efficient scheduling of highly contended resources (such as DRAM). This framework is well suited to integrated architectures.
To evaluate the efficacy of the Alewife design, this thesis presents results from an operating 32-node Alewife machine. These results include microbenchmarks, which focus on individual mechanisms, and macrobenchmarks, in the form of applications and kernels from the SPLASH and NAS benchmark suites. The large suite of working programs and the resulting performance numbers lead us to one of our primary conclusions, namely that the integration of shared-memory and message-passing communication models is possible at a reasonable cost, and can be done with a level of efficiency that does not compromise either model. We conclude by discussing the extent to which the lessons of Alewife can be applied to future multiprocessors.