Message Passing Programming with CMMD

(CS 267, Feb 2 1995)

Distributed address space SPMD, or message passing, is a programming style suitable for distributed address space MIMD machines. We begin by examining the CMMD solution of the first Sharks and Fish problem, fish swimming in a current. The solutions are written in C with calls to the CM-5 message passing library CMMD. They could also have been written in Fortran with no change in the use of CMMD. (Documentation for CMMD can be found in Volume 1 of the class reference material, or by typing cmview while logged into rodin or moore.) There are actually two source files to examine, which you should pop up in separate windows: the control processor code, called fish.cp.c, and the processor node code, called fish.pn.c. The control processor code runs on the "front end" of the CM-5, the Sparc you log in to (rodin or moore), and the processor node code runs in parallel on all the processors.

The host and the nodes all begin by initializing the programming environment: the nodes call CMMD_enable_host() and the host calls CMMD_enable(). All CMMD procedures are prefixed by CMMD_. These calls are synchronous, or blocking, meaning that none of them returns until every node has called CMMD_enable_host() and the host has called CMMD_enable(); only then do they all return and execution continues.

The host proceeds to read input arguments, and computes

      myfish = fish/CMMD_partition_size(), 
the number of fish each node will own, which is equal to the total number of fish divided by the number of processors.

Then the host broadcasts the input data to all the processors by calling

      CMMD_bc_from_host(&host_data, sizeof(host_data)). 
The nodes receive this broadcast data by calling
      CMMD_receive_bc_from_host(&node_data, sizeof(node_data)). 
These calls are also synchronous: only after all nodes and the host have called their respective routines and the broadcast has completed do the calls return. If one node failed to call CMMD_receive_bc_from_host, the other processors would hang. Also, all the processors should agree on the number of bytes being sent.
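
As a concrete sketch, one might pack the inputs into a single struct that both sides declare identically; the struct and its fields below are illustrative assumptions, not the actual declarations in fish.cp.c:

      /* a packet of input parameters, declared identically on host and nodes
         (the fields shown here are only an illustration) */
      struct params { int fish; int steps_per_display; float dt; };
      struct params host_data;   /* filled in on the host from the input arguments */
      struct params node_data;   /* filled in on each node by the broadcast        */

      /* host side */
      CMMD_bc_from_host(&host_data, sizeof(host_data));

      /* node side */
      CMMD_receive_bc_from_host(&node_data, sizeof(node_data));

Because the struct is declared identically everywhere, the agreement on the number of bytes is automatic.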

At this point the nodes compute myfish, the number of fish each will simulate (by the same formula as above). Then they malloc the space needed to store the fish and initialize them to some reasonable values. Each processor initializes its fish somewhat differently by making the initial values depend on MYPROC = CMMD_self_address(), the processor's own number, which runs from 0 to CMMD_partition_size()-1.
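
A minimal sketch of this node-side setup follows; the AFish layout, the node_data packet from the sketch above, and the placeholder initialization formula are assumptions for illustration, not the actual code in fish.pn.c (malloc needs <stdlib.h>):

      typedef struct { double x, y, vx, vy; } AFish;           /* illustrative layout */

      int    MYPROC = CMMD_self_address();                     /* 0 .. p-1            */
      int    myfish = node_data.fish / CMMD_partition_size();  /* fish per node       */
      AFish *FISH   = (AFish *) malloc(myfish * sizeof(AFish));
      int    i;

      for (i = 0; i < myfish; i++) {
          /* placeholder: make the starting positions depend on MYPROC so that
             each processor's fish begin in a different region */
          FISH[i].x  = MYPROC + 0.01 * i;
          FISH[i].y  = 0.0;
          FISH[i].vx = FISH[i].vy = 0.0;
      }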

Now the nodes start looping over the time steps of the simulation, updating their local fish positions and velocities. This is straightforward sequential C code. Periodically, every steps_per_display time steps, the nodes send their local fish positions back to the host for graphical display. To do this, each node calls

      CMMD_send(CMMD_host_node(), tag=0, 
           FISH = pointer to local fish array, 
           size of local fish array)
to send a message to the host containing the local fish data. The second argument, the tag, is used to label messages; it acts like a subject line that the receiving processor can use to sort its incoming mail. The host in turn must receive a message from each processor by calling
      CMMD_receive (j, 0, &allfish[j*myfish], 
             size of incoming message)
in a loop from j=0 to CMMD_partition_size()-1. For a message to be received, the destination node specified in CMMD_send() must be the node doing the receive, the source node specified in CMMD_receive() must be the node doing the send, and the two tags must match. One can also specify the source node as CMMD_ANY_NODE to match any sending node, and specify the tag as CMMD_ANY_TAG to match any incoming tag. The receive buffer must be at least as big as the message sent. (One can query how many bytes actually arrived.)

These sends and receives are also synchronous, or blocking. (Later releases of CMMD use CMMD_send_block instead of CMMD_send, and CMMD_receive_block instead of CMMD_receive, to distinguish from other kinds of send and receive we will discuss shortly.) This means that the sending node does not continue executing until the receive has been executed and the transmission completed. Similarly, the receive does not return until the transmission is completed. In other words, blocking sends and receives must come in matching pairs, or else the program will hang.
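
To make the pairing concrete, here is a sketch of this exchange using the names from the snippets above (FISH, allfish, myfish, and the loop index j); the AFish type is the illustrative one from the earlier sketch:

      /* node side: ship the local fish array to the host */
      CMMD_send(CMMD_host_node(), 0, FISH, myfish * sizeof(AFish));

      /* host side: collect one message from each node, placing node j's
         fish into its slice of the global array */
      for (j = 0; j < CMMD_partition_size(); j++)
          CMMD_receive(j, 0, &allfish[j*myfish], myfish * sizeof(AFish));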

Now we discuss the computation of the root-mean-square fish velocity, which requires computing a global sum of local sum-of-squared-velocities. The local sums computed on each processor are called vsum. The global sum is computed by having each node call

      vsum = CMMD_reduce_to_host_float(vsum, CMMD_combiner_fadd);
and the host call
      vsum = CMMD_reduce_from_nodes_float (0, CMMD_combiner_fadd);
These synchronous calls compute the global floating point sum (as indicated by the second argument CMMD_combiner_fadd) of the first argument (vsum), returning the same result everywhere. The global maximum of the velocities (maxv) and accelerations (maxa) are computed similarly. The final operation is a global maximum of the local CPU times to get the overall running time.
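
Putting the two sides together, the velocity computation might look like the following sketch; only vsum and the two reduction calls come from the code above, while the local loop, the rms_velocity variable, and the final square root (which needs <math.h>) are illustrative:

      /* node side: local sum of squared speeds, then the global sum */
      float vsum = 0.0;
      for (i = 0; i < myfish; i++)
          vsum += FISH[i].vx*FISH[i].vx + FISH[i].vy*FISH[i].vy;
      vsum = CMMD_reduce_to_host_float(vsum, CMMD_combiner_fadd);

      /* host side: contributes 0, as in the call above, and gets the global sum */
      vsum = CMMD_reduce_from_nodes_float(0, CMMD_combiner_fadd);
      rms_velocity = sqrt(vsum / fish);   /* root-mean-square speed over all fish */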

We wish to illustrate one other CMMD routine, which is used in the solution of the second Sharks and Fish problem: fish with gravity. As before, there are two source files to examine, the control processor code fish.cp.c and the processor node code fish.pn.c. Recall that to compute the gravitational force, each fish position must be combined with every other fish position. The approach taken in fish.pn.c is to have each node call

      CMMD_concat_with_nodes(Fish,AllFish,sizeof(AFish)*myfish);
in order to make a copy of all fish data on all processors. Here Fish is the local fish array, AllFish is another array large enough to contain all fish data, and the last argument is the size of the local fish data. This synchronous subroutine returns with a copy of each local Fish occupying part of AllFish. This is not intended to be a particularly efficient solution, especially in memory: there needs to be a completely redundant copy of all the data on each processor.

Let us think about improving this solution. The approach analogous to the CM Fortran and Matlab solutions is to make a second copy of the data, call it FISHp, and rotate it among the processors. We now have a choice. As in the CM Fortran solution, we could compute the contribution to each fish's force from one other fish, and then rotate the second copy FISHp by one fish; most fish in FISHp would remain on the same processor after such a rotation, with only those at the "boundary" requiring communication. Alternatively, we could loop over all the local fish, computing all myfish^2 local force contributions of FISHp to FISH, and then rotate by myfish fish, getting an entirely new subset of FISHp. The first solution does number_of_fish rotations (communications) of one fish each. The second solution does p-1 rotations of myfish = number_of_fish / p fish each. Which is faster?

To predict which is faster requires a performance model for communication. The simplest model we will use for communication is the following one: The time it takes to send n bytes from one processor to one other processor is

        alpha + beta*n
where alpha is the latency (in units of seconds) and beta is the inverse of the bandwidth, which is in units of bytes/second. We will supply numbers for these later, but on the CM-5 you can expect a bandwidth of about 5 Megabytes/second, or beta = .2 microseconds/byte. Alpha is much larger, tens or hundreds of microseconds. For simplicity we measure message sizes in fish rather than bytes below, so beta stands for the time to transmit one fish's worth of data. Thus, the first solution, which sends number_of_fish-1 messages of one fish each, costs
  (number_of_fish-1)*alpha + (number_of_fish-1)*beta    seconds.
The second solution, which sends p-1 messages each of size myfish = number_of_fish/p, costs
    (p-1)*alpha + (p-1)*myfish*beta = 
       (p-1)*alpha + number_of_fish*(p-1)/p*beta  seconds.
Comparing the two timings, the coefficients of beta are nearly the same: all the fish still have to visit all the processors. But the coefficients of alpha are quite different: the second solution is faster, because the number of fish should be much greater than p (otherwise we would not bother with a parallel computer!), and alpha is much larger than beta.
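
To see the difference concretely, plug in some illustrative (not measured) numbers: alpha = 100 microseconds, 32 bytes per fish so that one fish costs about 32 * .2 = 6.4 microseconds to transmit, number_of_fish = 1024, and p = 32, so myfish = 32. Then, roughly,

      first:   1023*100 + 1023*6.4 microseconds  ~  102,300 + 6,500 microseconds  ~  0.11  seconds
      second:    31*100 +  992*6.4 microseconds  ~    3,100 + 6,350 microseconds  ~  0.009 seconds

so the second solution is about ten times faster, and essentially all of the savings comes from paying alpha only p-1 times instead of number_of_fish-1 times.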

Another way to think about this result is in terms of the memory hierarchy of the machine. Accessing remote memory requires explicit communication and is much more expensive than accessing local memory. The second solution above reduces the number of times we have to access this remote memory (though the total number of words accessed remains about the same). This is analogous to the first assignment, where we saw how important it was to use an entire cache line, since it was nearly as expensive to fetch a single word as to fetch the whole line.

Now let us think about implementing the second solution, using the blocking send and receive we have just discussed. The most obvious approach is to have each processor execute

    CMMD_send( (MYPROC+1) % p, ... );       /* send to neighbor on right     */
    CMMD_receive( (MYPROC+p-1) % p, ... );  /* receive from neighbor on left */
But this solution deadlocks: no CMMD_send is allowed to return until the matching receive has been executed, and since every processor is stuck in its send, no processor ever reaches its receive, so no progress is made.

A simple fix is to take two communication steps instead of one: let the even processors send while the odd processors receive, and then swap roles:

    if ( MYPROC % 2 == 0 ) {   /* MYPROC is even */
       CMMD_send( (MYPROC+1) % p, ... );       /* send to odd neighbor on right      */
       CMMD_receive( (MYPROC+p-1) % p, ... );  /* receive from odd neighbor on left  */
    } else {                   /* MYPROC is odd  */
       CMMD_receive( (MYPROC+p-1) % p, ... );  /* receive from even neighbor on left */
       CMMD_send( (MYPROC+1) % p, ... );       /* send to even neighbor on right     */
    }
However, no such simple solution suffices if instead of a simple rotation, we wanted to do an arbitrary permutation of our data.

To support these operations, there is another synchronous CMMD routine: CMMD_send_and_receive, which lets the user specify both a message to be sent (to MYPROC+1, say) and a message to be received (from MYPROC-1, say). However, it does not fix all our problems. In particular, it still requires all processors to synchronize, so that if one processor is "running late", all other processors must wait for it, lowering efficiency. This is analogous to not having any conversations at a party until all latecomers have arrived, rather than letting those already present talk independently.
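
Despite this drawback, the routine handles the rotation neatly. For example, one rotation step of the second solution could be written as the following sketch; the argument order follows the CMMD_send_and_receive snippet shown later in these notes, and FISHp and FISHp_new are illustrative buffer names:

      /* rotate the traveling copy of the fish one processor to the right:
         receive a block from the left neighbor while sending ours to the right */
      CMMD_send_and_receive(
          (MYPROC+p-1) % p, 0, FISHp_new, myfish*sizeof(AFish),   /* receive from the left */
          (MYPROC+1)   % p, 0, FISHp,     myfish*sizeof(AFish));  /* send to the right     */

After the call, FISHp_new holds the newly arrived fish, and the two pointers can be swapped before the next step.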

The solution offered by CMMD (and other similar libraries like PVM and MPI) is to have versions of send and receive which expose more machine details and have less strict semantics than the synchronous versions just described.

Nonblocking send, or

   CMMD_send_noblock( destination_node, tag, &data, sizeof(data) )
sends a message "immediately", either putting it into the communication network or saving it in a buffer if the network is busy, and returns right away. The return gives no guarantee that the message has been, or ever will be, received by any other processor. Asynchronous send, or
   CMMD_send_asynch( destination_node, tag, &data, sizeof(data),
                     handler, handler_argument) 
has even weaker guarantees. CMMD_send_asynch returns a pointer to a message control block, or MCB, a structure in which the system records the status of the message and which the user can query. The routine returns immediately, even if the network is not ready to receive the data. But in the interest of efficiency no copy of the data is made in this case (copying takes time and space), and it is the user's responsibility not to overwrite the data before it has been sent. The user can discover whether the message has been sent in two ways. First, the user can examine the MCB, by calling either CMMD_msg_done(MCB) or CMMD_all_sends_done(). Second, the system will automatically call the user-supplied handler procedure (with arguments MCB and handler_argument) when the message is finally sent. This is done asynchronously, typically with an interrupt. The handler procedure can set a flag to indicate the send is done, increment a message counter, or do almost anything else not involving blocking communication. It should also make sure to free the MCB, which was malloced by the system at the time of the send; otherwise memory fills up with dead MCBs.
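
A minimal sketch of such a handler follows; the CMMD_mcb type name and the CMMD_free_mcb call are assumed here for illustration only (check the CMMD reference for the exact declarations):

      volatile int send_done = 0;   /* polled by the main loop */

      /* called by the system, typically at interrupt time, once the message has
         actually been sent; mcb and arg are the MCB and the handler_argument
         that were passed to CMMD_send_asynch */
      void send_handler(CMMD_mcb mcb, void *arg)
      {
          send_done = 1;        /* tell the main program the buffer may be reused      */
          CMMD_free_mcb(mcb);   /* assumed name: release the MCB the system malloced   */
      }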

Either send may be paired with either CMMD_receive_block or an asynchronous receive

   CMMD_receive_asynch( source_node, tag, &data, sizeof(data), handler )
This routine returns immediately after allocating an MCB to keep track of the requested incoming message. As above, the user may query the MCB (via CMMD_msg_done(MCB) or CMMD_all_messages_done()) to see if the message has arrived, or rely on the handler, which as before is called when an interrupt announces the arrival of the requested message. Again, the handler must make sure to free the MCB created by the CMMD_receive_asynch call.

The use of nonblocking or asynchronous send and receive permits more flexibility in the design of parallel algorithms, allowing communication in the network to proceed in parallel with computation, without requiring the sending and receiving processors to be at the same place at the same time. A simple analogy is the difference between the phone system, where both parties need to be present, and (e-)mail, which runs asynchronously. Indeed, using synchronous send is like using a telephone where, having dialed, you can't do anything other than wait for someone to pick up. Using synchronous receive is like having a telephone that does not ring: having picked up the phone in anticipation of a call, you can't do anything other than wait for someone to call. (One can do slightly better by calling CMMD_msg_pending, which says whether an incoming message is pending, before calling CMMD_receive_block.)
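
For instance, a processor can poll for pending mail and do useful local work in the meantime; in the sketch below the argument list of CMMD_msg_pending (a source node and a tag) is an assumption, and do_local_work() is a hypothetical placeholder:

      /* overlap computation with waiting for the incoming message */
      while (!CMMD_msg_pending(CMMD_ANY_NODE, 0))
          do_local_work();                        /* hypothetical useful work */
      CMMD_receive_block(CMMD_ANY_NODE, 0, &local_data, sizeof(local_data));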

Let us think about how to write a general permutation routine in the three styles discussed so far. Suppose processor i has an integer perm(i), where perm(0) through perm(p-1) form a permutation of 0 through p-1, and processor i must send its local data to processor perm(i). First we will try synchronous send and receive, for which the relevant routine is

    CMMD_send_and_receive( 
        source_node, source_tag, &source_array, sizeof(source_array), 
        dest_node, dest_tag, &dest_array, sizeof(dest_array) )
The first solution one might try is
    CMMD_send_and_receive( 
        CMMD_ANY_NODE, 0, &local_data, sizeof(local_data), 
        perm(MYPROC), 0, &local_data, sizeof(local_data) )
The problem with this is that local_data may be overwritten by new data before it is sent out. So one needs to have a second copy:
    CMMD_send_and_receive( 
        CMMD_ANY_NODE, 0, &new_local_data, sizeof(local_data), 
        perm(MYPROC), 0, &local_data, sizeof(local_data) )
This will work, but as we said, it requires all processors to synchronize, possibly wasting time if they are at different places in the computation.

Now let us try a solution using asynchronous or nonblocking send. Our first attempt is

   CMMD_send_asynch( perm(MYPROC), 0, &local_data, sizeof(local_data),
                     handler, handler_argument) 
   CMMD_receive_block( CMMD_ANY_NODE, 0, &local_data, sizeof(local_data) )
This permits every processor to send as soon as it can, and to proceed as soon as the data destined for it arrives (alternatively, we could use CMMD_receive_asynch if we did not need to use local_data immediately). But the trouble with this is that local_data could be overwritten before it is sent out, since CMMD_send_asynch assumes local_data will remain available until it is needed. There are two ways to fix this. The first is to keep an explicit second copy of the local_data:
   CMMD_send_asynch( perm(MYPROC), 0, &local_data, sizeof(local_data),
                     handler, handler_argument) 
   CMMD_receive_block( CMMD_ANY_NODE, 0, &new_local_data, sizeof(local_data) )
and the second is to have an implicit copy, which is needed only if local_data cannot immediately be pushed into the communication network:
   CMMD_send_noblock( perm(MYPROC), 0, &local_data, sizeof(local_data) )
   CMMD_receive_noblock( CMMD_ANY_NODE, 0, &new_local_data, sizeof(local_data) )

There are some other important synchronous communication primitives in CMMD. One node may call

    CMMD_bc_to_nodes(&data,sizeof(data)), 
and all the other nodes may call
    CMMD_receive_bc_from_node(&data,sizeof(data)) 
in order for the first node to broadcast data to all the others. All nodes may call
    CMMD_reduce_add (or CMMD_reduce_max or ...) 
in order to compute the global sum (or max or ...) of the arguments, and send this value to all the processors. All the nodes may call
    CMMD_scan_add (or CMMD_scan_max or ...) 
in order to compute the scan, or parallel prefix, of the indicated operation (running sums, maxima, etc. accumulated across the processors in order of their addresses). These routines are the CMMD analogues of CMF_SCAN_ADD, etc., in CM Fortran. As we described in the solution of Sharks & Fish 2,
    CMMD_concat_with_nodes(&local_data, &all_data, sizeof(local_data)) 
makes a copy of every processor's local_data in all_data on each processor.
    CMMD_reset_partition_size(q) 
may be used to set the number of processors to any q no larger than the actual number of physical processors; this is useful for measuring speedups over varying numbers of processors.

By this time, you may have the (accurate) impression that there are many different kinds of message passing routines, and may be wondering whether there is a common denominator to which they can all be reduced. There are two approaches to this problem. The first is a cooperative effort by the parallel programming community to write a standard for message passing libraries, called the Message Passing Interface or MPI (see Volume 5 of the class reference material). This long document standardizes many variations on message passing, and is being implemented by various companies and research groups. A second approach is being taken at Berkeley: Active Messages form a still lower communication layer in which all the others may be implemented. Being lower level, Active Messages permit more efficient communication, as well as more general communication patterns (which we will see in our discussion of Split-C), but still provide a machine independent layer to help guarantee portability.