Parallel Programming with Split-C

(CS 267, Feb 7 1995)

Split-C was designed at Berkeley, and is intended for distributed memory multiprocessors. It is a small SPMD extension to C, meant to support programming in data parallel, message passing, and shared memory styles. Like C, it is "close" to the machine, so understanding performance is relatively easy. The best document from which to learn Split-C is the tutorial Introduction to Split-C. There is a debugger available as well: Mantis.

We begin with a general discussion of Split-C features, and then discuss the solution to Sharks & Fish problem 1 in detail. The most important features of Split-C are

  • An SPMD programming style. There is one program text executed by all processors.
  • A 2-dimensional address space for the entire machine's memory. Every processor can access every memory location via addresses of the form (processor number, local address). Thus, we may view the machine memory as a 2D array with one row per processor, and one column per local memory location.
  • Global pointers. These pointers are global addresses of the form just described, and can be used much as regular C pointers are used. For example, the assignment
    *local_pointer = *global_pointer
    
    gets the data pointed to by the global_pointer, wherever it resides, and stores the value at the location indicated by local_pointer.
  • Spread Arrays. These are 2- (or higher-) dimensional arrays which are stored across processor memories. For example, A[i][j] may refer to word j on processor i. These last two features support a kind of shared memory programming style.
  • Split phase assignment. In the above example, "*local_pointer = *global_pointer", execution of this statement must complete before the program continues. If this requires interprocessor communication, the processor remains idle. It is possible to overlap computation and communication by beginning this operation, doing other useful work, and waiting for completion later. This is done as follows:
          *local_pointer := *global_pointer
          ... other work not requiring *local_pointer ...
          synch()
    
    The "split-phase" assignment operator := initiated the communication, and synch() waits until it is complete. f structure by a processor, and prevents certain bugs called race conditions which we will discuss below.
  • An extensive library, including reduction operations, bulk memory moves, etc.
    Pointers to Global Data

    There are actually three kinds of pointers in Split-C: local pointers, global pointers, and spread pointers. Local pointers are standard C pointers, and refer to data only on the local processor. The other pointers can point to any word in any memory, and consist of a pair (processor number, local pointer). Spread pointers are associated with spread arrays, and will be discussed below. Here are some simple examples to show how global pointers work. First, pointers are declared as follows:
        int *Pl, *Pl1, *Pl2;                          /*  local pointers   */
        int *global Pg, *global Pg1, *global Pg2;     /*  global pointers  */
        int *spread Ps, *spread Ps1, *spread Ps2;     /*  spread pointers  */
    
    The following assignment sends messages to fetch the data pointed to by Pg1 and Pg2, brings them back, and stores their sum locally:
        *Pl = *Pg1 + *Pg2
    
    Execution does not continue until the entire operation is complete. Note that the program on the processors owning the data pointed to by Pg1 and Pg2 does not have to cooperate in this communication in any explicit way; it happens "automatically", in effect interrupting the processors owning the remote data, and letting them continue. In particular, there is no notion of needing matched sends and receives as in a message passing programming style. Rather than calling this send or receive, the operation performed is called a get, to emphasize that the processor owning the data need not anticipate the request for data.

    The following assignment stores data from local memory into a remote location:

        *Pg = *Pl
    
    As before, the processor owning the remote data need not anticipate the arrival of the message containing the new value. This operation is called a put.

    Global pointers permit us to construct distributed data structures which span the whole machine. For example, the following declares a binary tree whose nodes can reside on any processor; traversing the tree in the usual fashion, following pointers to child nodes, works without change.

         typedef struct global_tree *global gt_ptr;
         typedef struct global_tree {
             int value;
             gt_ptr left_child;
             gt_ptr right_child;
         } g_tree;
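
    As a concrete illustration, here is a minimal sketch of traversing such a tree. The helper tree_sum is purely illustrative (not part of Split-C), and it assumes that struct fields may be read through a global pointer, as described above.

         int tree_sum(gt_ptr t)
         {
             if (t == NULL) return 0;   /* a null global pointer: empty subtree */
             /* Reading t->value and the child pointers through the global
                pointer t performs gets when the node lives on a remote
                processor, and ordinary loads when it is local.              */
             return t->value + tree_sum(t->left_child)
                             + tree_sum(t->right_child);
         }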
    
    We will discuss how to design good distributed data structures later when we discuss the Multipol library.

    Global pointers offer us the ability to write more complicated and flexible programs, but also introduce new kinds of bugs. The following code illustrates a race condition, where the answer depends on which processor executes "faster". Initially, processor 3 owns the data pointed to by global pointer i, and its value is 0:

            Processor 1             Processor 2
            *i = *i + 1             *i = *i + 2
            barrier()               barrier()
            print 'i=', *i
    
    It is possible to print out i=1, i=2 or i=3, depending on the order in which the 4 global accesses to *i occur. For example, if
      processor 1 gets *i (=0)
      processor 2 gets *i (=0)
      processor 1 puts *i (=0+1=1)
      processor 2 puts *i (=0+2=2)
    
    then processor 1 will print "i=2". We will discuss programming styles and techniques that attempt to avoid this kind of bug.

    A more interesting example of a potential race condition is a job queue, a data structure for distributing chunks of work of unpredictable sizes to different processors. We will discuss this example below after we present more features of Split-C.

    Global pointers may be incremented like local pointers: if Pg = (processor,offset), then Pg+1 = (processor,offset+1). This lets one index through a remote part of a data structure. Spread pointers differ from global pointers only in this respect: if Ps = (processor,offset), then

       Ps+1 = (processor+1 ,offset)  if processor < PROCS-1, or
            = (0 ,offset+1)          if processor = PROCS-1
    
    where PROCS is the number of processors. In other words, viewing the memory as a 2D array, with one row per processor and one column per local memory location, incrementing Pg moves the pointer across a row, and incrementing Ps moves the pointer down a column. Incrementing Ps past the end of a column moves Ps to the top of the next column.
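
    For concreteness, here is a small worked example of the two increment rules; the processor numbers and offsets are purely illustrative.

       /* With PROCS = 4:
            Pg = (2, 100):   Pg + 1 = (2, 101)   same processor, next location
            Ps = (2, 100):   Ps + 1 = (3, 100)   next processor, same offset
            Ps = (3, 100):   Ps + 1 = (0, 101)   wraps to processor 0, next offset */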

    The local part of a global or spread pointer may be extracted using the function to_local.

    Only local pointers may be used to point to procedures; neither global nor spread pointers may be used this way. There are also some mild restrictions on the use of dereferenced global and spread pointers; see the last section of the Split-C tutorial.

    Spread Arrays and Spread Pointers

    A spread array is declared to exist across all processor memories, and is referenced the same way by all processors. For example,
        static int A[PROCS]::[10]
    
    declares an array of 10 integers in each processor memory. The double colon is called the spreader, and indicates that subscripts to its left index across processors, and subscripts to its right index within processors. So, for example, A[i][j] is stored in location to_local(A)+j on processor i. In other words, the 10 words on each processor reside at the same local memory locations.
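
    As a minimal sketch of how such an array is used (fill_and_sum is an illustrative helper, not part of Split-C; MYPROC denotes the local processor number and barrier() is the usual global barrier):

        static int A[PROCS]::[10];        /* the declaration from above */

        /* Each processor fills its own row; processor 0 then reads the whole
           spread array with ordinary indexing (remote reads are gets).      */
        int fill_and_sum()
        {
            int i, j, total = 0;

            for (j = 0; j < 10; j++)
                A[MYPROC][j] = MYPROC;    /* purely local writes            */
            barrier();                    /* wait until every row is filled */

            if (MYPROC == 0)
                for (i = 0; i < PROCS; i++)
                    for (j = 0; j < 10; j++)
                        total += A[i][j]; /* a get whenever i != MYPROC     */
            return total;                 /* meaningful only on processor 0 */
        }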

    The declaration

        static double A[n][m]::[b][b]
    
    declares a total of n*m*b^2 doubles. You may think of this as n*m groups of b^2 doubles being allocated to the processors in round robin fashion. The memory per processor is 8*b^2*ceiling(n*m/PROCS) bytes, so if PROCS does not evenly divide n*m, some memory will be wasted. A[i][j][k][l] is stored on processor (i*m+j) mod PROCS, at local address to_local(A) + 8*( b^2*floor((i*m+j)/PROCS) + k*b + l ). In the figure below, we illustrate the layout of A[2][5]::[7][7]. Each wide light gray rectangle represents 49 doubles. The two wide dark gray rectangles represent wasted space. The two thin medium gray rectangles are the very first word, A[0][0][0][0], and the very last word, A[1][4][6][6], respectively.
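
    The layout formula above can be written as two small helpers; owner and element_offset are hypothetical names used only for illustration, and the offset is counted in doubles (multiply by 8 for a byte offset).

        /* For double A[n][m]::[b][b]: which processor owns block (i,j), and
           where A[i][j][k][l] sits within that processor's portion of A.   */
        int owner(int i, int j, int m)
        {
            return (i*m + j) % PROCS;
        }

        int element_offset(int i, int j, int k, int l, int m, int b)
        {
            return b*b*((i*m + j) / PROCS) + k*b + l;
        }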

    In addition to declaring static spread arrays, one can malloc them:

       int *spread w = all_spread_malloc(10, sizeof(int))
    
    This is a synchronous, or blocking, subroutine call (like the first kind of send and receive we discussed in Lecture 6), so all processors must participate, and should do so at about the same time to avoid idle waiting, since no processor returns until all processors have called it. The value returned in w is a pointer to the first word of the array on processor 0:
         w = (0, local address of first word on processor 0).
    

    (A nonblocking version, int *spread w = spread_malloc(10, sizeof(int)), executes on just one processor, but allocates the same space as before, on all processors. Some internal locking is needed to prevent allocating the same memory twice, or even deadlock. However, this only works on the CM-5 implementation and its use is discouraged.)
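
    Here is a minimal usage sketch of the blocking version; it assumes that at least one element is allocated per processor and that a spread pointer may be dereferenced directly (subject to the mild restrictions mentioned earlier):

       int *spread w = all_spread_malloc(PROCS, sizeof(int));

       /* w + MYPROC is (MYPROC, local address of the first word), so each
          processor writes its own rank into the element it owns locally.  */
       *(w + MYPROC) = MYPROC;
       barrier();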

    Split Phase Assignment

    The split phases referred to are the initiation of a remote read (or write), and blocking until its completion. This is indicated by the assignment operator ":=". The statement
         a := *global_pointer
         ... other work not involving a ...
         synch()
         b = b + a
    
    where a and b are local variables, initiates a get of the data pointed to by global_pointer, does other useful work, and only waits for a's arrival, by calling synch(), when a is really needed. This is also called prefetching. The statement
         *global_pointer := b
    
    similarly launches a put of the local data b into the remote location global_pointer, and immediately continues computing. One can also wait until an acknowledgement is received from the processor receiving b, by calling synch().

    Being able to initiate a remote read (or get) or remote write (or put), go on to do other useful work while the network is busy delivering the message and returning any response, and only wait for completion when necessary, offers several speedup opportunities.

  • It allows one to compute and communicate in parallel, as illustrated by the above example, hiding the latency of the communication network by prefetching.
  • Split-phase assignment lets one do many communications in parallel, if this is supported by the network (it often is). For example,
       /* the lxn and sum are local variables; the Gxn are global pointers */
       lx1 := *Gx1             
       lx2 := *Gx2
       lx3 := *Gx3
       lx4 := *Gx4
       synch()
       sum = lx1 + lx2 + lx3 + lx4
    
    can have up to 4 gets running in parallel in the network, and hides the latency of all but the last one.
  • Because there is no need for processors to synchronize on a matched send and receive, idle time spent waiting for another processor to send or receive data can be avoided by simply getting the data when it is needed.
  • The total number of messages in the system is decreased compared to using send and receive. A synchronous send and receive actually requires 4 messages to be sent (see the figure). In contrast, a put requires one message with the data and one acknowledgement, and a get similarly requires just two messages instead of 4. For small messages, this is half as much traffic. This is illustrated in the figure, where time is the vertical axis in each picture, and the types of arrows indicate what the processor is doing during that time.

    Instead of synching on all outstanding puts and gets, it is possible to synch on just a selected subset of them, by associating a counter with the puts and gets of interest. The counter is automatically incremented whenever a designated put or get is initiated, and automatically decremented when an acknowledgement is received, so one can test whether all have been acknowledged by comparing the counter to zero. See section 10.5 of Introduction to Split-C for details.

    The freedom afforded by split-phase assignment also opens the door to new kinds of bugs. The following example illustrates a loss of sequential memory consistency. Sequential consistency means that the outcome of the parallel program is consistent with some interleaved sequential execution of the PROCS different sequential programs. For example, if there are two processors, where processor 1 executes instructions instr1.1, instr1.2, instr1.3, ... in that order, and processor 2 similarly executes instr2.1, instr2.2, instr2.3, ... in order, then the parallel program must be equivalent to executing both sets of instructions in some interleaved order such that instri.j is executed before instri.(j+1). The following are examples of consistent and inconsistent orderings:

        Consistent      Inconsistent
         instr1.1         instr1.1
         instr2.1         instr2.2   *out of order
         instr1.2         instr1.2
         instr2.2         instr2.1   *out of order
         instr1.3         instr1.3
         instr2.3         instr2.3
         ...              ...
    
    Sequential consistency, or having the machine execute your instructions in the order you intended, is obviously an important tool if you want to predict what your program will do by looking at it. Sequential consistency can be lost, and bugs introduced, when the program mistakenly assumes that the network delivers messages in the order in which they were sent, when in fact the network (like the post office) does not guarantee this.

    For example, consider the following program, where data and data_ready_flag are global pointers to data owned by processor 2, and both pointed-to values are initially zero:

            Processor 1              Processor 2
            *data := 1               while (*data_ready_flag != 1) {/* wait for data*/}
            *data_ready_flag := 1    print 'data=',*data
    
    From Processor 1's point of view, first *data is set to 1, then *data_ready_flag is set. But Processor 2 may print either data=0 or data=1, depending on which message from Processor 1 is delivered first. If data=0 is printed, this is not sequentially consistent with the order in which Processor 1 executed its instructions, and will probably result in a bug. Note that this bug is nondeterministic, i.e., it may or may not occur on any particular run, because it is timing dependent. These are among the hardest bugs to find!
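
    One way to restore the intended order, using only the features described above, is for Processor 1 to wait for the data put to be acknowledged before setting the flag. The following is a minimal sketch of Processor 1's side, using the identifiers of the example above:

            *data := 1               /* split-phase put of the data              */
            synch()                  /* wait until the put has been acknowledged */
            *data_ready_flag := 1    /* the flag can no longer overtake the data */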