CS267: Lecture 9 (part 1), Feb 13, 1996

A closer look at parallel architectures

Introduction

Block diagrams of some basic architectures

Modeling Communication on Distributed Memory Machines

A look at some real communication networks on distributed memory machines

Bus

Crossbar

Perfect Shuffle and Omega Network

Rings, Meshes and Toruses

Trees and Fat Trees

Some details on the CM-5

Hypercubes

Introduction

In Lecture 3, we discussed the three standard functions as parallel architecture must provide:

Parallelism, or getting processor to work simultaneously.
Interprocessor communication, or having processors exchange information.
Synchronization, such as getting processors to ``agree'' on the value of a variable, or a time at which to start or stop.

In this lecture we will describe how machines perform these operations in a little more detail, concentrating on the aspects that affect the programmer, either by being visible through the programming model, or by affecting performance. In particular, we will present a more detailed performance model for communication on distributed memory machines.

Parallel architectures have been evolving quickly, and while it can be interesting to delve into details of particular machines, we will limit ourselves to pointing saying which current machines fall into the different categories we present. For more details on particular machines, see

On-line machine documentation on the main home-page

UCB's Parallel Architecture Course CS258, or

The Top 500 Supercomputer Sites

Block diagrams of some basic architectures

Recall the basic architecture of a parallel machine:

We will enlarge on this picture to illustrate some differences among existing machines. Recall from Lecture 3 that we coarsely categorized machines in two ways. First, we distinguished between

SIMD machines, where each processor executes the same instruction at the same time (or perhaps abstains from doing anything), and

MIMD machines, where each processor can compute independently of the others.

Second, we distinguished between

Shared address space machines, where the instructions "load memory location k into register r" and "store register r into memory location k" refer to the same memory location in the entire machine, no matter which processor executes them, and

Distributed address space machines, where the two load and store instructions refer to locations unique to the processor executing them. In other words, each processor has a separate independent memory, which cannot be directly accessed by load and store operations executed on other processors.

The block diagrams below are quite schematic, and do not reflect the actual physical structure an any particular machine. Nonetheless, they are quite useful for understanding how these machines work.

A simple block diagram of an SIMD machine is shown below. The central control processor sends instructions to all the processors along the thin lines, which are executed in lock step by all the processors. In other words, the program resides in the central control processor, and sent to the individual processors one instruction at a time. This block diagram describes the Maspar and CM-2.

A simple block diagram of a distributed memory MIMD machine is shown below. Each processor node (shown enclosed by dotted lines) consists of a processor, some local memory, and a network interface processor (NI), all sitting on a bus. Loads and stores executed by the processor are serviced by the local memory in the usual way. In addition, the processor can send instruction to the NI telling it to communicate with an NI on another processor node. The NI can be a simple processor which must be controlled in detail by the main processor (as on the Paragon can be viewed this way). This class also includes the IBM SP-2 and NOW (network of workstations).

A simple block diagram of a shared memory MIMD machine is shown below. This particular diagram corresponds to a machine like the Cray C90 or T90, without caches, and where all it takes equally long for any processor to reach any memory location. There is a single address space, with each memory owning 1/m-th of it. (There need not be as many memories as processors.)

A simple block diagram of a cache based shared memory MIMD machine like the SGI Power Challenge is shown below. Here, each processor has a cache memory associated with it. The tricky part of designing such a machine is to deal with the following situation: Suppose processors i and j both read word k from the memory. Then work k will be fetched and stored in both Mem i and Mem j. Now suppose Proc i writes work k. How will Proc j find out that this happened, and get the updated value? Making sure all the caches are updated this way is called maintaining cache coherency; without it, all processors could have inconsistent views of the values of supposedly identical variables. There are many mechanisms for doing this. One of the simplest is called snoopy caching, where the interconnetion network is simply a bus to which all caches can listen simultaneously. Whenever an address is written, the address and the new value are put on the bus, and each cache updates its own value in case it also has a copy.

The Cray T3D, also falls into this category, since it has a global shared memory with caches, but it operates differently. There are m=p memories, and they are closely attached to their respective processors, rather than all being across the interconnection network. Referring to nonlocal memory requires assembling a global address out of two parts, in a segmented fashion. Only local memory is cachable, so cache coherence is not an issue.

For more detailed blocks diagrams of some of these architectures

Modeling Communication on Distributed Memory Machines

We begin by describing a model for an ideal but unconstructable communication network, called the PRAM, and then discuss more realistic networks and their models.

In the ideal, but unconstructable, communication network, the cost of communicating one word of data from any processor to any other processor would be the same, no matter how many other processors were communicating among themselves. This constant communication time would be small, comparable to the cost of a basic arithmetic operation. Furthermore, the processors would operate in lock step, so synchronization would be free. Such a simple parallel computer would be relatively easy to program, and was a favorite model for theoretical computer scientists, who dubbed it the PRAM, or parallel random access memory. To define this model completely, we have to say what happens when several processors attempt to read or write the same location simultaneously. The most restrictive model is the EREW PRAM, or Exclusive Read Exclusive Write PRAM, where only one processor at a time is allowed to read or write a particular memory location. A more liberal model is the CRCW PRAM, or Concurrent Read Concurrent Write PRAM, where simultaneous reads are permitted without time penalty, and simultaneous writes store the sum of the all the data being written to that location. Think of this as every write being like a fetch-and-add instruction, as we discussed in Lecture 7.

The attraction of the PRAM model is that it is relatively easy to design and analyze algorithms for it. The drawback is that it ignores the real cost of communication and processors, which is much higher than the cost of arithmetic. To illustrate, here is the sort of very fast but unrealistic algorithm one can write on this perfect parallel computer. It uses n^2 processors to sort n distinct numbers x(1),...,x(n) in constant time, storing the sorted data in y:

    for all ( i=1,n, j=1,n )  if ( x(i) > x(j) ) then count(i) = 1
    for all ( i=1,n ) y(count(i)+1) = x(i)

The algorithm works because count(i) equals the number of data items less than x(i). This algorithm is too good to be true, because the the computer is too good to be built: it assume n^2 processors are available, and that all can access a single location at a time. Even the less powerful EREW PRAM (Exclusive Read Exclusive Write PRAM) model, where at most one processor can read or write a location at a time (the others must time in turn), does not capture the characteristics of real systems. This is because the EREW PRAM model ignores the capacity of the network, or the total number of messages it can send at a time. Sending more messages than the capacity permits leads to contention, or traffic jams, for example when all processors try to reach different words in the same memory unit.

At this point we have two alternatives. We can use a detailed, architecture specific model of communication cost. This has the advantage of modeling accuracy, and may lead to very efficient algorithms highly tuned for a particular architecture. However, it will likely be complicated to analyze and need to change for each architecture. For example, later we will show how to tune matrix multiplication to go as fast as possible on the CM-2. This algorithm required clever combinatorics to figure out, and the authors are justifiably proud of it, but it is unreasonable to expect every programmer to solve a clever combinatorics problem for every algorithm on every machine.

So instead of detailed architectural models, we will mostly use an architecture-independent model which nonetheless captures the features of most modern communication networks. In particular, it attempts to hide enough details to make parallel programming a tractable problem (unlike many detailed architectural models) but not so many details that the programmer is encourage to write inefficient programs (like the PRAM).

This "compromise" model is called LogP. LogP has four parameters, all measured as multiples of the time it takes to perform one arithmetic operation:

L, or latency, is the time it takes a message packet of some fixed length to traverse the network from source processors to destination processor.

o, or overhead, is the time a processor must spend either to send a message packet, or receive a message packet. This is typically time spent in packing or copying the message, time in the operating system, etc., and cannot be used for other useful work. In other words, from the time one processor decides to send a message to the time another processor has received it, a total of 2*o+L time steps have passed (o on the sending side, L in the network, o on the receiving side).

g, or gap, is the minimum time interval between consecutive sends or receives on a given processor. The reciprocal of g is the the per-processor communication bandwidth.

P is the number of processors.

Furthermore, it assumed that the network has a finite capacity, so that at most L/g message can be in transit from any processor to any processor at any given time. If a processor attempts to send a message that would exceed this limit, it stalls until the message can be sent without exceeding the capacity limit.

We will often simplify LogP by assuming that the capacity is not exceeded, and that large messages of n packets are sent at once. The time line below is for sending n packets. This mimics the "store" in Split-C, because there is no idle time or acknowledgement sent back, as there would be in blocking send-receive, for instance. oS refers to the overhead on the sending processor, gS the gap on the sending processor, and oR the overhead on the receiving processor.

-----------> Time

| oS |   L   | oR |          packet 1 arrives
|  gS  | oS |   L   | oR |          packet 2 arrives
       |  gS  | oS |   L   | oR |          packet 3 arrives
              |  gS  | oS |   L   | oR |          packet 4 arrives
                         ... 
                     |  gS  | oS |   L   | oR |          packet n arrives

By examining this time line, one can see that the total time to send n message packets is

     2*o+L + (n-1)*g = (2*o+L-g) + n*g = alpha + n * beta

Some authors refer to alpha = 2*o+L-g as the latency instead of L, since they both measure the startup time to send a message.

The following table gives alpha and beta for various machines, where the message packet is taken to be 8 bytes. Alpha and beta are normalized so that the time for one flop done during double precision (8 byte) matrix multiplication is equal to 1. The speed of matrix multiplication in Mflops is also given.

      Machine      alpha  beta Matmul     Software
                               Mflops

     Alpha+Ether   38000   960    150     assumes PVM
     Alpha+FDDI    38000   213    150     assumes PVM
     Alpha+ATM1    38000    62    150     assumes PVM
     Alpha+ATM2    38000    15    150     assumes PVM
     HPAM+FDDI       300    13     20
     CM5             450     4      3     CMMD
     CM5              96     4      3     Active Messages
     CM5+VU        14000   103     90     CMMD
     iPSC/860       5486    74     26
     Delta          4650    87     31
     Paragon        7800     9     39
     SP1           28000    50     40
     T3D           27000     9    150     Large messages(BLT)
     T3D             100     9    150     read/write

After discussing this table, we give some caveats about measuring and interpreting performance data like this.

The most important fact about this table is that alpha>>1 and beta>>1. In other words, the time to communicate is hundreds to thousands of times as long as the time to do a floating point operation. This means that an algorithm that does more local computation and less communication is strongly favored over one that communicates more. Indeed, an algorithm that does fewer than a hundred floating point operations per word sent is almost certain to be very inefficient.

The second most important fact about this table is that alpha >> beta, so a few large messages are much better than many small messages.

Because of these two facts, alpha and beta tell us much about how to program a particular parallel machine, by telling us how fast our programs will run. If two machines have the same alpha and beta, you will probably want to consider using similar algorithms on them (this also depends on whether they are both MIMD or both SIMD, both shared memory or both distributed memory too, of course).

The fact that different machines in the table above have such different alpha and beta makes it hard to write a single, best parallel program for a particular problem. For example, suppose we want to write a library routine to automatically load balance a collection of independent tasks among processors. To do this well, one should parameterize the algorithm in terms of alpha and beta, so it adjusts itself to the machine. We will discuss such algorithms later.

Let us discuss the above table in more detail. The first 4 lines give time for a network of DEC Alphas on a variety of local area networks, ranging from 1.25Mbyte/sec ethernet to 80MByte/sec ATM. Alpha remains constant in the first 4 lines because we assume the same slow message passing software, PVM, which requires a large amount of time interacting the OS kernel. Part of the NOW project is to improve the hardware and software supporting message passing to make alpha much smaller. A preliminary result is shown for the HPAM+FDDI (HP with Active Messages).

There are three lines for the CM-5, because the CM-5 comes both with and without vectors units (floating point accelerators), and because it can run different messaging software (CMMD and AM, or Active Messages).

The next three lines are a sequence of machine built by Intel. The IBM SP-1 consists of RS6000/370 processors; the IBM SP-2 uses RS6000/590 processors. The T3D is built by Cray, and consists of DEC Alphas with network and memory management done by Cray.

The messaging software can often introduce delays much larger than that of the underlying hardware. So be aware that when you or anyone else measures performance on a machine, you need to know exactly which message software was used, in order to interpret the results and compare them meaningfully with others. For example, PVM is slow both because it does all its message sending using UNIX sockets, which are slow, and also because it has a lot of error checking, can change data formats on the fly if processors with different formats are communicating, etc.

Now we present some caveats and warning about performance data like this. Most of the data in the table above is measured, but some is estimated. Most measurements are for blocking send/receive unless otherwise noted. The measured performance can be sensitive to the number of messages in the network at one time; the best performance will be for two processors communicating the the rest silent, although this is seldom the situation in a real algorithm. For more performance data, see "Message Passing Performance of Various Computers", by Jack Dongarra and Tom Dunigan, U. Tennessee CS report 95-299, in particular, their graph of latencies and bandwidths. "LogP Quantified: The case for low-overhead local area networks" by Kim Keeton, Tom Anderson and David Patterson, Hot Intereconnects III: A Symposium on High Performance Interconnects, 1995, goes into more detail on measuring communication performance.

A look at some real communication networks on distributed memory machines

The basic tradeoff in designing the interprocessor communication network is between speed of communication and cost: high speed requires lots of connectivity or "wires", and low cost requires few wires. In this section we examine several designs for the network, and evaluate their costs and speeds. We will also see to what extent the LogP model can depart from reality.

The network typically consist not just of wires but routing processors, which may be small, simple processors or as powerful as the main processors (eg. the Intel Paragon uses i860 processors for both). Early machines did not have routing processors, so that a single processor had to handle both computation and communication. Newer machines use separate routing processors so communication and computation can proceed in parallel.

Networks always permit each processor to send a message to any other processor (one-to-one communication). Sometimes they support other more complicated collective communications like broadcast (one-to-all), reduction and scan. We will first consider one-to-one communication.

We will classify networks using the following criteria.

Topology, i.e. which (routing) processors are "directly" connected to which other (routing) processors.

Dynamic vs. static, i.e. whether the topology can change dynamically.

Routing algorithm, i.e. the way a route is chosen for a message from one (routing) processor to another.

This classification is coarse, and many networks are hybrids with features from several categories.

For each network, we will give its

Diameter, i.e. the maximum number of (routing) processors through which a message must pass on its way from source to destination. The diameter measures the maximum delay for transmitting a message from one processor to another.

Bisection width , i.e. the largest number of messages which can be sent simultaneously (without needing to use the same wire or routing processor at the same time and so delaying one another), no matter which p/2 processors are sending to which other p/2 processors. It is also the smallest number of wires you would have to cut to disconnect the network into two equal halves.

LogP and the alpha+n*beta communication cost model ignore diameter because they ignore the fact that messages to nearest neighbors may take less time than messages to distant processors, by using average values for alpha and beta. This reflects the fact that on modern architectures most of the delays are in software overhead at the source and destination, and relatively little in latency in between. In particular, this means that we often can ignore topology in algorithm design.

LogP incorporates bisection width indirectly by limiting the capacity of the network.

   Initial (source) location:     s1s2s3
   Location after first shuffle:  s2s3s1
   Location after first switch:   s2s3d1  (straight across if s1 = d1, 
                                           adjacent wire if s1 != d1)
   Location after second shuffle: s3d1s2
   Location after second switch:  s3d1d2  (straight across if s2 = d2, 
                                           adjacent wire if s2 != d2)
   Location after third shuffle:  d1d2s3
   Location after third switch:   d1d2d3  (straight across if s3 = d3, 
                                           adjacent wire if s3 != d3)
   Final (destination) location:  d1d2d3

The diameter is m=log_2 p, since all message must traverse m stages. The bisection width is p. This network was used in the IBM RP3, BBN Butterfly, and NYU Ultracomputer.

Rings, Meshes and Toruses.

A ring is a one dimensional network, with processors connected to their neighbors on the left and right, with the leftmost and rightmost processors also connected. A 2D mesh is a rectangular array of processor, with each connected to its North, East, West and South neighbors. If the leftmost and rightmost processors in each row are connected, and the top and bottom processor in each column are connected, the network is called a torus. Extensions to higher dimensions are obvious. The diameter of a d-dimensional mesh is d*p^(1/d), and its bisection width is p^((d-1)/d). The torus is somewhat better connected. The Intel Paragon used a 2D torus, and the Cray T3D is a 3D torus (whence its name).

Trees and Fat Trees.

A (binary) tree consists of processors either at all the nodes of a tree, connected to parents and children, or it consists of processors at the leaves of a tree, with routing processors at the internal nodes. We will consider the latter case, since it corresponds to the CM-5 data network (the CM-5 has two other networks as well, which we discuss later). The diameter of a tree is 2 log_2 p, which corresponds to sending data from the leftmost leaf via the root to the rightmost leaf. The bisection width is merely 1, since if the p/2 leftmost processors try to send to the p/2 rightmost processors, they all have to pass through the root one at a time. Since the root is a bottleneck, the CM-5 uses what is called a fat-tree, which can be thought of as a tree with fat edges, where the edges at level k have two (or more) times the capacity as the edges as level k+1 (the root is level 0). In reality, this means that a child has two (or more) parents, as shown in the figure below.

A message is routed from processor i to processor j as follows. All the processors are at the leaves (the bottom), and the other nodes are routing processors. Starting from processor i, a message moves up the tree along the unique path taking it to the first possible common ancestor of i and j in the tree. There are many possible paths (a choice of two at each routing processor), so the routing processor chooses one at random in order to balance the load in the network. Upon reaching this first common ancestor, the message is then routed back down along the unique path connecting it to j. The diameter of this network is 2 log_2 p, and the bisection width is p/2

Some details on the CM-5

It is possible to have non-binary fat tree; the CM-5 uses a 4-way tree. It is also unnecessary to double the number of wires at each level; the CM-5 save some wires (and so cost) by having fewer wires at the top of the tree. The resulting maximum bandwidths are:

   Among  4 nearest processors:  20 MBytes/sec
   Among 16 nearest processors:  10 MBytes/sec
   Among remaining  processors:   5 MBytes/sec

Some of this bandwidth is used up in message headers, as mentioned below.

Here is a little more detail on the CM-5 data network. The vector units perform fast floating point operations, up to 32 Mflops per unit, for a total of 128 Mflops per node. CMF will automatically generate code using the vector units. To use the vector units from Split-C, one can call the CM-5 numerical library CMSSL (click here for details). The local memory hierarchy on a CM-5 node with vector units is rather complicated, and it turns out that data to be operated on by the vector units must be copied from the memory Split-C data resides in to the segment accessible by the vector units. This copying is done at the speed of the much slower Sparc 2 chip. Unless the CMSSL operation is complicated enough, the copy time may overwhelm the time to do the operation. For example, matrix-matrix multiplication takes O(n^3) flops so copying O(n^2) data is probably well worth it. But matrix-vector multiplication takes only O(n^2) flops and so is not worth using the vector units. The difficulty of using the vector units was a major problem in using the CM-5 effectively, and future architectures are unlikely to mimic it. In particular, the fastest microprocessors have floating point integrated on the same chip as the rest of the CPU and (level 1) cache.

To send a message a processor does the following 3 steps:

It stores the first word of the message in the send_first_register. This is actually a register on a special network interface chip, but is memory mapped into the processor address space so the processor can just store to it. This indicates the start of a new message, plus a message tag (from 0 to 5).

The processor stores the remainder of the message (up to 5 words) into the send_register (which is memory mapped again). The first word is the number of the destination processor.

Since the message may fail to be sent if the network is full, the program must check some status bits to see if it has actually been sent (once sent, delivery is guaranteed provided the destination processor continues to receive messages). If the send_OK flag is not sent, the message must be retransmitted.

To receive a message, a processor uses the following three steps

It checks the receive_OK flag, or lets itself be interrupted.

It checks the input message tag, and the length.

It reads consecutive words from the receive_register.

Note that there is no place in the message for a process ID, to distinguish message from different user processes. This means the machine does not support multiprogramming. It is, however, time-shared, which means the entire data network must be flushed and restored on a context switch ("all fall down" mode).

To prevent deadlock, processors must receive as many messages as they send, to keep the network from backing up. When using CMMD, interrupts are used, so if one has posted an asynchronous receive in anticipation of a message arriving, the processor will be interrupted on arrival, the receive performed, and user-specified handler called (if any). With Split-C, the default is to have interrupts turned off, so it is necessary to poll, or "touch" the network interface periodically to let it have some processor cycles in order to handle pending messages. This is done automatically whenever one does communication. For example at every "global := local" the network will be checked for incoming messages. If one does no communication for a long time, the possibility exists that message will back up and the machine will hang. This is unlikely in a program which does communication on a periodic basis, but nonetheless calling splitc_poll() will "check the mailbox", just in case. (Interrupts can be turned off or on in CMMD with CMMD_{disable,enable}_interrupts, and polling done with CMMD_poll().)

Aside on Active Messages

Here is what goes inside a message. The most basic sequence of events that can happen when a message arrives is this: the first word of the data is used as a pointer to some procedure on the receiving processor, which then is executed, using the remaining words in the message as arguments. This is called an active message. The active message can in principle do anything, but typically does one of the following things:

It may "put" one of the data words in a location specified by another data word, and send an acknowledgement back. This is what happens with "global = local" or "global := local".

It may "store" one of the data words in a location specified by another data word, not send an acknowledgement back, but keep a counter of the number of bytes of data received. This is what happens with "global :- local".

It may "get" data located at a memory location in one of the data words, and send it back to the source of the active message. This is what happens with "local = global" or "local := global".

If may receive data returned by an earlier active message request (get or read), store it in memory, and update a counter of the number of bytes received.

It may do any user-specified action. This should be short and run to completion with doing any blocking communication.

More complicated communication protocols, such as blocking send/receive, can be built from active messages as well, but on the CM-5 and many other machines, active messages are the most basic objects that one sends in the network. For more details, click here.

Hypercubes

A d-dimensional hypercube consists of p=2^d processors, each one with a d-bit address. Processor i and j are connected if there addresses differ in exactly one bit. This is usually visualized by thinking of the unit (hyper) cube embedded in d-dimensional Euclidean space, with one corner at 0 and lying in the positive orthant. The processors can be thought of as lying at the corners of the cube, with their (x1,x2,...,xd) coordinates identical to their processor numbers, and connected to their nearest neighbors on the cube.

Routing messages may be done by following the edges of the hypercube in paths defined by binary reflected Gray Code. A d-bit Gray Code is a permutation of the integers from 0 to 2^d-1 such that adjacent integers in the list (as well as the first and last) differ by only one bit in their binary expansions. This makes them nearest neighbors in the hypercube. On way to construct such a list is recursively. Let

    G(1) = { 0, 1 }

be the 1-bit Gray code. Given the d-bit Gray code

    G(d) = { g(0), g(1), ... , g(2^d-1) } ,

the d+1-bit code G(d+1) is defined as

    G(d+1) = { 0g(0), 0g(1), ... , 0g(2^d-1), 1g(2^d-1), ... , 1g(1), 1g(0) }

Clearly, if g(i) and g(i+1) differ by one bit, so do the bit patterns in G(d+1). For example,

    G(3) = { 000, 001, 011, 010, 110, 111, 101, 100 }

We will return to Gray codes below when we discuss collective communication.

Since there are many more wires in a hypercube than either a tree or mesh, it is possible to embed any of these simpler networks in a hypercube, and do anything these simpler networks do. For example, a tree is easy to embed as follows. Suppose there are 8=p=2^d=2^3 processors, and that processor 000 is the root. The children of the root are gotten by toggling the first address bit, and so are 000 and 100 (so 000 doubles as root and left child). The children of the children are gotten by toggling the next address bit, and so are 000, 010, 100 and 110. Note that each node also plays the role of the left child. Finally, the leaves are gotten by toggling the third bit. Having one child identified with the parent causes no problems as long as algorithms use just one row of the tree at a time. Here is a picture.

Rings are even easier, since the Gray Code describes exactly the order in which the adjacent processors in the ring are embedded.

In addition to rings, meshes of any dimension may be embedded in a hypercube, so that nearest neighbors in the mesh are nearest neighbors in a hypercube. The restriction is that mesh dimensions must be powers of 2, and that you need a hypercube of dimension M = m1 + m2 + ... + mk to embed a 2^m1 x 2^m2 x ... x 2^ mk mesh. We illustrate below by embedding a 2x4 mesh in a 3-D hypercube:

This connection was used in the early series of Intel Hypercubes, and in the CM-2.

CS267: Lecture 9 (part 1), Feb 13, 1996

A closer look at parallel architectures

Table of Contents

Aside on Active Messages