In this lecture we will describe how machines perform these operations in a little more detail, concentrating on the aspects that affect the programmer, either by being visible through the programming model, or by affecting performance. In particular, we will present a more detailed performance model for communication on distributed memory machines.
Parallel architectures have been evolving quickly, and while it can be interesting to delve into the details of particular machines, we will limit ourselves to saying which current machines fall into the different categories we present. For more details on particular machines, see
We will enlarge on this picture to illustrate some differences among existing machines. Recall from Lecture 3 that we coarsely categorized machines in two ways. First, we distinguished between
The block diagrams below are quite schematic, and do not reflect the actual physical structure of any particular machine. Nonetheless, they are quite useful for understanding how these machines work.
A simple block diagram of an SIMD machine is shown below. The central control processor sends instructions to all the processors along the thin lines, and these instructions are executed in lock step by all the processors. In other words, the program resides in the central control processor and is sent to the individual processors one instruction at a time. This block diagram describes the Maspar and CM-2.
A simple block diagram of a distributed memory MIMD machine is shown below. Each processor node (shown enclosed by dotted lines) consists of a processor, some local memory, and a network interface processor (NI), all sitting on a bus. Loads and stores executed by the processor are serviced by the local memory in the usual way. In addition, the processor can send instructions to the NI telling it to communicate with an NI on another processor node. The NI can be a simple processor which must be controlled in detail by the main processor (the Paragon can be viewed this way). This class also includes the IBM SP-2 and NOW (networks of workstations).
A simple block diagram of a shared memory MIMD machine is shown below. This particular diagram corresponds to a machine like the Cray C90 or T90, without caches, and where it takes equally long for any processor to reach any memory location. There is a single address space, with each memory owning 1/m-th of it. (There need not be as many memories as processors.)
A simple block diagram of a cache based shared memory MIMD machine like the SGI Power Challenge is shown below. Here, each processor has a cache memory associated with it. The tricky part of designing such a machine is to deal with the following situation: Suppose processors i and j both read word k from the memory. Then word k will be fetched and stored in the caches of both processors. Now suppose Proc i writes word k. How will Proc j find out that this happened, and get the updated value? Making sure all the caches are updated this way is called maintaining cache coherency; without it, different processors could have inconsistent views of the values of supposedly identical variables. There are many mechanisms for doing this. One of the simplest is called snoopy caching, where the interconnection network is simply a bus to which all caches can listen simultaneously. Whenever an address is written, the address and the new value are put on the bus, and each cache updates its own value in case it also has a copy.
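To make the write-update idea concrete, here is a minimal C sketch of a toy snoopy-cache simulation (entirely ours, not from the original notes; NUM_CACHES, CACHE_LINES and the struct layout are invented for illustration). Every write is broadcast on the bus, and each cache holding a copy of that address snoops the bus and updates its copy.

    #define NUM_CACHES  4      /* hypothetical number of processors/caches */
    #define CACHE_LINES 8      /* hypothetical (tiny) cache size, in words */

    /* Each cache line remembers which address it holds and that word's value. */
    typedef struct {
        int valid[CACHE_LINES];
        int addr[CACHE_LINES];
        int value[CACHE_LINES];
    } Cache;

    static Cache cache[NUM_CACHES];

    /* A write by processor p: the address and new value go out on the bus,
       and every cache holding a copy of addr (including p's own) snoops the
       bus and updates its copy, so all copies stay coherent (write-update). */
    void bus_write(int p, int addr, int value)
    {
        (void)p;   /* in this toy model the writer is treated like any other snooper */
        for (int c = 0; c < NUM_CACHES; c++)
            for (int line = 0; line < CACHE_LINES; line++)
                if (cache[c].valid[line] && cache[c].addr[line] == addr)
                    cache[c].value[line] = value;
        /* A real protocol would also update main memory and handle misses. */
    }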
The Cray T3D also falls into this category, since it has a global shared memory with caches, but it operates differently. There are m = p memories, and they are closely attached to their respective processors, rather than all being across the interconnection network. Referring to nonlocal memory requires assembling a global address out of two parts, in a segmented fashion. Only local memory is cacheable, so cache coherence is not an issue.
For more detailed block diagrams of some of these architectures
In the ideal, but unconstructable, communication network, the cost of communicating one word of data from any processor to any other processor would be the same, no matter how many other processors were communicating among themselves. This constant communication time would be small, comparable to the cost of a basic arithmetic operation. Furthermore, the processors would operate in lock step, so synchronization would be free. Such a simple parallel computer would be relatively easy to program, and was a favorite model for theoretical computer scientists, who dubbed it the PRAM, or parallel random access machine. To define this model completely, we have to say what happens when several processors attempt to read or write the same location simultaneously. The most restrictive model is the EREW PRAM, or Exclusive Read Exclusive Write PRAM, where only one processor at a time is allowed to read or write a particular memory location. A more liberal model is the CRCW PRAM, or Concurrent Read Concurrent Write PRAM, where simultaneous reads are permitted without time penalty, and simultaneous writes store the sum of all the data being written to that location. Think of this as every write being like a fetch-and-add instruction, as we discussed in Lecture 7.
The attraction of the PRAM model is that it is relatively easy to design and analyze algorithms for it. The drawback is that it ignores the real cost of communication between processors, which is much higher than the cost of arithmetic. To illustrate, here is the sort of very fast but unrealistic algorithm one can write on this perfect parallel computer. It uses n^2 processors to sort n distinct numbers x(1),...,x(n) in constant time, storing the sorted data in y:
    for all ( i=1,n ) count(i) = 0
    for all ( i=1,n, j=1,n ) if ( x(i) > x(j) ) then count(i) = 1
    for all ( i=1,n ) y(count(i)+1) = x(i)

The algorithm works because count(i) equals the number of data items less than x(i): every processor (i,j) with x(i) > x(j) concurrently writes 1 to count(i), and the CRCW rule stores the sum of these writes. This algorithm is too good to be true, because the computer is too good to be built: it assumes n^2 processors are available, and that all of them can access a single location at a time. Even the less powerful EREW PRAM (Exclusive Read Exclusive Write PRAM) model, where at most one processor can read or write a location at a time (the others must take turns), does not capture the characteristics of real systems. This is because the EREW PRAM model ignores the capacity of the network, or the total number of messages it can send at a time. Sending more messages than the capacity permits leads to contention, or traffic jams, for example when all processors try to reach different words in the same memory unit.
At this point we have two alternatives. We can use a detailed, architecture-specific model of communication cost. This has the advantage of accuracy, and may lead to very efficient algorithms highly tuned to a particular architecture. However, such a model is likely to be complicated to analyze, and it must change for each new architecture. For example, later we will show how to tune matrix multiplication to go as fast as possible on the CM-2. This algorithm required clever combinatorics to figure out, and the authors are justifiably proud of it, but it is unreasonable to expect every programmer to solve a clever combinatorics problem for every algorithm on every machine.
So instead of detailed architectural models, we will mostly use an architecture-independent model which nonetheless captures the features of most modern communication networks. In particular, it attempts to hide enough details to make parallel programming a tractable problem (unlike many detailed architectural models), but not so many details that the programmer is encouraged to write inefficient programs (like the PRAM).
This "compromise" model is called LogP. LogP has four parameters, all measured as multiples of the time it takes to perform one arithmetic operation:
We will often simplify LogP by assuming that the capacity is not exceeded, and that large messages of n packets are sent at once. The time line below is for sending n packets. This mimics the "store" in Split-C, because there is no idle time or acknowledgement sent back, as there would be in blocking send-receive, for instance. oS refers to the overhead on the sending processor, gS the gap on the sending processor, and oR the overhead on the receiving processor.
    -----------> Time

    | oS | L | oR |               packet 1 arrives
    | gS | oS | L | oR |          packet 2 arrives
    | gS | oS | L | oR |          packet 3 arrives
    | gS | oS | L | oR |          packet 4 arrives
        ...
    | gS | oS | L | oR |          packet n arrives

By examining this time line, one can see that the total time to send n message packets is
    2*o + L + (n-1)*g  =  (2*o + L - g) + n*g  =  alpha + n*beta

Some authors refer to alpha = 2*o + L - g as the latency instead of L, since they both measure the startup time to send a message.
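To make the model concrete, here is a small C sketch (ours, not from the original notes; the parameter values are hypothetical, chosen to be of the same order as the entries in the table below). It computes alpha + n*beta and compares one message of n packets against n messages of one packet each.

    #include <stdio.h>

    /* Time, in units of one flop, to send a single message of n packets:
       a startup cost alpha plus n packets at beta apiece.                */
    double message_time(double alpha, double beta, int n)
    {
        return alpha + n * beta;
    }

    int main(void)
    {
        double alpha = 5000.0, beta = 50.0;   /* hypothetical machine parameters */
        int n = 1000;                         /* packets to transfer             */

        double one_large  = message_time(alpha, beta, n);      /* 1 message, n packets */
        double many_small = n * message_time(alpha, beta, 1);  /* n messages, 1 packet */

        printf("one large message : %g flop-times\n", one_large);
        printf("n small messages  : %g flop-times\n", many_small);
        return 0;
    }

With these hypothetical numbers the single large message costs 55,000 flop-times, while the n small messages cost over 5,000,000, which is the quantitative content of the observation below that alpha >> beta favors a few large messages.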
The following table gives alpha and beta for various machines, where the message packet is taken to be 8 bytes. Alpha and beta are normalized so that the time for one flop done during double precision (8 byte) matrix multiplication is equal to 1. The speed of matrix multiplication in Mflops is also given.
    Machine        alpha    beta    Matmul Mflops    Software
    Alpha+Ether    38000     960        150          assumes PVM
    Alpha+FDDI     38000     213        150          assumes PVM
    Alpha+ATM1     38000      62        150          assumes PVM
    Alpha+ATM2     38000      15        150          assumes PVM
    HPAM+FDDI        300      13         20
    CM5              450       4          3          CMMD
    CM5               96       4          3          Active Messages
    CM5+VU         14000     103         90          CMMD
    iPSC/860        5486      74         26
    Delta           4650      87         31
    Paragon         7800       9         39
    SP1            28000      50         40
    T3D            27000       9        150          Large messages (BLT)
    T3D              100       9        150          read/write
After discussing this table, we give some caveats about measuring and interpreting performance data like this.
The most important fact about this table is that alpha>>1 and beta>>1. In other words, the time to communicate is hundreds to thousands of times as long as the time to do a floating point operation. This means that an algorithm that does more local computation and less communication is strongly favored over one that communicates more. Indeed, an algorithm that does fewer than a hundred floating point operations per word sent is almost certain to be very inefficient.
The second most important fact about this table is that alpha >> beta, so a few large messages are much better than many small messages.
Because of these two facts, alpha and beta tell us much about how to program a particular parallel machine, by telling us how fast our programs will run. If two machines have the same alpha and beta, you will probably want to consider using similar algorithms on them (this also depends, of course, on whether they are both MIMD or both SIMD, and both shared memory or both distributed memory).
The fact that different machines in the table above have such different alpha and beta makes it hard to write a single, best parallel program for a particular problem. For example, suppose we want to write a library routine to automatically load balance a collection of independent tasks among processors. To do this well, one should parameterize the algorithm in terms of alpha and beta, so it adjusts itself to the machine. We will discuss such algorithms later.
Let us discuss the above table in more detail. The first 4 lines give times for a network of DEC Alphas on a variety of local area networks, ranging from 1.25 MByte/sec Ethernet to 80 MByte/sec ATM. Alpha remains constant in the first 4 lines because we assume the same slow message passing software, PVM, which spends a large amount of time interacting with the OS kernel. Part of the NOW project is to improve the hardware and software supporting message passing to make alpha much smaller. A preliminary result is shown for HPAM+FDDI (HP with Active Messages).
There are three lines for the CM-5, because the CM-5 comes both with and without vector units (floating point accelerators), and because it can run different messaging software (CMMD and AM, or Active Messages).
The next three lines (iPSC/860, Delta, and Paragon) are a sequence of machines built by Intel. The IBM SP-1 consists of RS6000/370 processors; the IBM SP-2 uses RS6000/590 processors. The T3D is built by Cray, and consists of DEC Alphas with network and memory management done by Cray.
The messaging software can often introduce delays much larger than those of the underlying hardware. So be aware that when you or anyone else measures performance on a machine, you need to know exactly which message passing software was used, in order to interpret the results and compare them meaningfully with others. For example, PVM is slow both because it does all its message sending using UNIX sockets, which are slow, and also because it does a lot of error checking, can change data formats on the fly if processors with different formats are communicating, and so on.
Now we present some caveats and warnings about performance data like this. Most of the data in the table above is measured, but some is estimated. Most measurements are for blocking send/receive unless otherwise noted. The measured performance can be sensitive to the number of messages in the network at one time; the best performance will be for two processors communicating with the rest silent, although this is seldom the situation in a real algorithm. For more performance data, see "Message Passing Performance of Various Computers", by Jack Dongarra and Tom Dunigan, U. Tennessee CS report 95-299, in particular their graph of latencies and bandwidths. "LogP Quantified: The Case for Low-Overhead Local Area Networks" by Kim Keeton, Tom Anderson and David Patterson, Hot Interconnects III: A Symposium on High Performance Interconnects, 1995, goes into more detail on measuring communication performance.
The basic tradeoff in designing the interprocessor communication network is between speed of communication and cost: high speed requires lots of connectivity or "wires", and low cost requires few wires. In this section we examine several designs for the network, and evaluate their costs and speeds. We will also see to what extent the LogP model can depart from reality.
The network typically consists not just of wires but also of routing processors, which may be small, simple processors or as powerful as the main processors (e.g., the Intel Paragon uses i860 processors for both). Early machines did not have routing processors, so a single processor had to handle both computation and communication. Newer machines use separate routing processors so communication and computation can proceed in parallel.
Networks always permit each processor to send a message to any other processor (one-to-one communication). Sometimes they support other more complicated collective communications like broadcast (one-to-all), reduction and scan. We will first consider one-to-one communication.
We will classify networks using the following criteria.
For each network, we will give its
LogP and the alpha + n*beta communication cost model ignore diameter: by using average values for alpha and beta, they ignore the fact that messages to nearest neighbors may take less time than messages to distant processors. This reflects the fact that on modern architectures most of the delay is in software overhead at the source and destination, and relatively little is latency in between. In particular, this means that we can often ignore topology in algorithm design.
LogP incorporates bisection width indirectly by limiting the capacity of the network.
    Initial (source) location:       s1s2s3
    Location after first shuffle:    s2s3s1
    Location after first switch:     s2s3d1   (straight across if s1 = d1, adjacent wire if s1 != d1)
    Location after second shuffle:   s3d1s2
    Location after second switch:    s3d1d2   (straight across if s2 = d2, adjacent wire if s2 != d2)
    Location after third shuffle:    d1d2s3
    Location after third switch:     d1d2d3   (straight across if s3 = d3, adjacent wire if s3 != d3)
    Final (destination) location:    d1d2d3
The diameter is m=log_2 p, since all messages must traverse m stages. The bisection width is p. This network was used in the IBM RP3, BBN Butterfly, and NYU Ultracomputer.
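For concreteness, the routing rule traced above can be written as a short C sketch (ours, not from the original notes; the function and variable names are invented). At each of the m stages it left-rotates the current m-bit position by one bit (the shuffle) and then overwrites the low-order bit with the next destination bit (the switch).

    /* Route an m-bit address src to dst through m shuffle/switch stages.
       Each stage: left-rotate the current position by one bit (the shuffle),
       then replace the low-order bit with the next destination bit (the switch). */
    unsigned omega_route(unsigned src, unsigned dst, int m)
    {
        unsigned pos = src;
        for (int stage = 0; stage < m; stage++) {
            /* perfect shuffle: left rotation of the m-bit address */
            pos = ((pos << 1) | (pos >> (m - 1))) & ((1u << m) - 1);
            /* switch: set the low-order bit to destination bit d(stage+1),
               where bits are numbered d1 d2 ... dm from most to least significant */
            unsigned d = (dst >> (m - 1 - stage)) & 1u;
            pos = (pos & ~1u) | d;
        }
        return pos;   /* equals dst after all m stages */
    }

After the m-th stage the position equals the destination address, which is why every message traverses exactly m stages.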
A message is routed from processor i to processor j as follows. All the processors are at the leaves (the bottom), and the other nodes are routing processors. Starting from processor i, a message moves up the tree toward the first possible common ancestor of i and j. There are many possible upward paths (a choice of two at each routing processor), so the routing processor chooses one at random in order to balance the load in the network. Upon reaching this first common ancestor, the message is then routed back down along the unique path connecting it to j. The diameter of this network is 2 log_2 p, and the bisection width is p/2.
    Among 4 nearest processors:     20 MBytes/sec
    Among 16 nearest processors:    10 MBytes/sec
    Among remaining processors:      5 MBytes/sec

Some of this bandwidth is used up in message headers, as mentioned below.
Here is a little more detail on the CM-5 data network. The vector units perform fast floating point operations, up to 32 Mflops per unit, for a total of 128 Mflops per node. CMF will automatically generate code using the vector units. To use the vector units from Split-C, one can call the CM-5 numerical library CMSSL (click here for details). The local memory hierarchy on a CM-5 node with vector units is rather complicated, and it turns out that data to be operated on by the vector units must be copied from the memory segment where Split-C data resides to the segment accessible by the vector units. This copying is done at the speed of the much slower Sparc 2 chip. Unless the CMSSL operation does enough work per datum, the copy time may overwhelm the time to do the operation itself. For example, matrix-matrix multiplication takes O(n^3) flops, so copying O(n^2) data is probably well worth it. But matrix-vector multiplication takes only O(n^2) flops, so it is probably not worth using the vector units for it. The difficulty of using the vector units was a major problem in using the CM-5 effectively, and future architectures are unlikely to mimic it. In particular, the fastest microprocessors have floating point integrated on the same chip as the rest of the CPU and (level 1) cache.
To send a message a processor does the following 3 steps:
Note that there is no place in the message for a process ID, to distinguish messages from different user processes. This means the machine does not support multiprogramming. It is, however, time-shared, which means the entire data network must be flushed and restored on a context switch ("all fall down" mode).
To prevent deadlock, processors must receive as many messages as they send, to keep the network from backing up. When using CMMD, interrupts are used, so if one has posted an asynchronous receive in anticipation of a message arriving, the processor will be interrupted on arrival, the receive performed, and a user-specified handler called (if any). With Split-C, the default is to have interrupts turned off, so it is necessary to poll, or "touch" the network interface periodically to let it have some processor cycles in order to handle pending messages. This is done automatically whenever one does communication. For example, at every "global := local" the network will be checked for incoming messages. If one does no communication for a long time, the possibility exists that messages will back up and the machine will hang. This is unlikely in a program which does communication on a periodic basis, but it is nonetheless a good idea to call splitc_poll() occasionally to "check the mailbox", just in case. (Interrupts can be turned off or on in CMMD with CMMD_{disable,enable}_interrupts, and polling done with CMMD_poll().)
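As an illustration (the loop structure and the names need_more_work and do_local_work are hypothetical placeholders; splitc_poll() is the Split-C call mentioned above), a long stretch of purely local computation might poll every so many iterations so that pending messages get serviced:

    /* Sketch only: need_more_work() and do_local_work() are hypothetical
       placeholders for the local computation; splitc_poll() is the Split-C
       polling routine described above, provided by the Split-C runtime.   */
    extern void splitc_poll(void);

    static int  need_more_work(void) { return 0; }   /* placeholder termination test  */
    static void do_local_work(void)  { }             /* placeholder local computation */

    void long_local_phase(void)
    {
        int iter = 0;
        while (need_more_work()) {
            do_local_work();            /* purely local work, no communication     */
            if (++iter % 1000 == 0)     /* every thousand iterations or so ...     */
                splitc_poll();          /* ... give the network interface a chance
                                           to handle pending incoming messages     */
        }
    }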
Routing messages may be done by following the edges of the hypercube along paths defined by a binary reflected Gray code. A d-bit Gray code is a permutation of the integers from 0 to 2^d-1 such that adjacent integers in the list (as well as the first and last) differ by only one bit in their binary expansions. This makes them nearest neighbors in the hypercube. One way to construct such a list is recursively. Let
    G(1) = { 0, 1 }

be the 1-bit Gray code. Given the d-bit Gray code

    G(d) = { g(0), g(1), ... , g(2^d-1) } ,

the (d+1)-bit code G(d+1) is defined as

    G(d+1) = { 0g(0), 0g(1), ... , 0g(2^d-1), 1g(2^d-1), ... , 1g(1), 1g(0) } .

Clearly, if g(i) and g(i+1) differ by one bit, so do the bit patterns in G(d+1). For example,

    G(3) = { 000, 001, 011, 010, 110, 111, 101, 100 } .

We will return to Gray codes below when we discuss collective communication.
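A compact, non-recursive way to compute the same binary reflected Gray code (a standard identity, not from the original notes) is g(i) = i XOR (i shifted right by one bit); the short C sketch below prints G(3) and reproduces the list above.

    #include <stdio.h>

    /* The i-th element of the binary reflected Gray code: g(i) = i ^ (i >> 1).
       Adjacent values, as well as the first and last, differ in exactly one bit. */
    unsigned gray(unsigned i)
    {
        return i ^ (i >> 1);
    }

    int main(void)
    {
        int d = 3;                                 /* print G(3)              */
        for (unsigned i = 0; i < (1u << d); i++) {
            for (int b = d - 1; b >= 0; b--)       /* print d bits, MSB first */
                putchar(((gray(i) >> b) & 1) ? '1' : '0');
            putchar(i + 1 < (1u << d) ? ' ' : '\n');
        }
        /* prints: 000 001 011 010 110 111 101 100 */
        return 0;
    }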
Since there are many more wires in a hypercube than in either a tree or a mesh, it is possible to embed any of these simpler networks in a hypercube, and do anything these simpler networks do. For example, a tree is easy to embed as follows. Suppose there are 8 = p = 2^d = 2^3 processors, and that processor 000 is the root. The children of the root are gotten by toggling the first address bit, and so are 000 and 100 (so 000 doubles as root and left child). The children of the children are gotten by toggling the next address bit, and so are 000, 010, 100 and 110. Note that each parent also plays the role of its own left child. Finally, the leaves are gotten by toggling the third bit. Having one child identified with the parent causes no problems as long as algorithms use just one row of the tree at a time. Here is a picture.
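In code form, the bit-toggling rule just described is a one-liner; here is a small C sketch (the function and argument names are ours, not from the original notes) returning the two children of a tree node at a given level of the embedded tree.

    /* Children of tree node `node` at depth `level` in the tree embedded in a
       d-dimensional hypercube: the left child reuses the parent's address, and
       the right child toggles the next address bit, counting from the most
       significant bit.  For d = 3, the root 000 at level 0 has children 000
       and 100, matching the example above.                                    */
    void tree_children(unsigned node, int level, int d,
                       unsigned *left, unsigned *right)
    {
        *left  = node;                            /* parent doubles as left child */
        *right = node ^ (1u << (d - 1 - level));  /* toggle the next address bit  */
    }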
Rings are even easier, since the Gray Code describes exactly the order in which the adjacent processors in the ring are embedded.
In addition to rings, meshes of any dimension may be embedded in a hypercube, so that nearest neighbors in the mesh are nearest neighbors in a hypercube. The restriction is that mesh dimensions must be powers of 2, and that you need a hypercube of dimension M = m1 + m2 + ... + mk to embed a 2^m1 x 2^m2 x ... x 2^mk mesh. We illustrate below by embedding a 2x4 mesh in a 3-D hypercube:
This connection was used in the early series of Intel Hypercubes, and in the CM-2.
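As a sketch (ours, not from the original notes) of how such an embedding can be computed, the standard approach combines the Gray code above with the dimension splitting just described: mesh point (i, j) of a 2^m1 x 2^m2 mesh goes to the hypercube node whose address is the m1-bit Gray code of i concatenated with the m2-bit Gray code of j, so a step in either mesh dimension changes exactly one address bit.

    #include <assert.h>

    /* Hypercube node for mesh point (i, j) of a (2^m1) x (2^m2) mesh, using the
       binary reflected Gray code g(x) = x ^ (x >> 1).  Moving one step in either
       mesh dimension changes exactly one bit of the returned address, so mesh
       neighbors are mapped to hypercube neighbors.                              */
    unsigned mesh_to_cube(unsigned i, unsigned j, int m1, int m2)
    {
        assert(i < (1u << m1) && j < (1u << m2));   /* coordinates fit their fields */
        unsigned gi = i ^ (i >> 1);                 /* m1-bit Gray code of i        */
        unsigned gj = j ^ (j >> 1);                 /* m2-bit Gray code of j        */
        return (gi << m2) | gj;                     /* concatenate: M = m1+m2 bits  */
    }

For the 2x4 mesh above (m1 = 1, m2 = 2), row i = 0 maps to nodes 000, 001, 011, 010 and row i = 1 to 100, 101, 111, 110, so neighbors in either direction differ in exactly one bit.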