Performance Model

To assess the performance of this implementation, the code at the heart of the PCG algorithm (the iteration loop) was instrumented to time the various types of activity in the algorithm. Timings were acquired for the dot product communication, the matrix-vector multiply communication, and the matrix-vector multiply computation, as well as the total time (see Fig. 1). The overall performance was sensitive to the location of synchronization points, and the code actually performed better with blocking sends and receives. This was probably due to the very regular nature of the algorithm: congestion in the communication network was avoided by having all processors stay loosely in step with each other. As there is not much opportunity for overlapping communication and computation in the PCG algorithm, little performance was lost to this synchronization.
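The kind of instrumentation described above can be illustrated with a minimal serial sketch. It is not the original SP1 code: the function name pcg_timed, the timer keys, and the test problem (a 1D Laplacian with a Jacobi preconditioner) are assumptions chosen for illustration, and because the sketch runs on one processor it cannot separate communication time from computation time. It only shows where timers would be placed around the dot products and the matrix-vector multiply inside the iteration loop.

```python
import time


def pcg_timed(matvec, apply_precond, b, tol=1e-10, max_iter=200):
    """Preconditioned CG with wall-clock timers around each activity.

    Serial illustration only: in a parallel code the dot product timer
    would mostly measure the global reduction, and the matrix-vector
    timer would be split into communication and computation parts.
    """
    timers = {"dot": 0.0, "matvec": 0.0, "total": 0.0}

    def dot(u, v):
        t0 = time.perf_counter()
        s = sum(ui * vi for ui, vi in zip(u, v))
        timers["dot"] += time.perf_counter() - t0
        return s

    def timed_matvec(v):
        t0 = time.perf_counter()
        w = matvec(v)
        timers["matvec"] += time.perf_counter() - t0
        return w

    t_start = time.perf_counter()
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial guess x = 0
    z = apply_precond(r)
    p = list(z)
    rz = dot(r, z)
    b_norm2 = dot(b, b)
    iterations = 0
    for _ in range(max_iter):
        # Convergence test on the relative residual norm (one more dot product).
        if dot(r, r) <= tol * tol * b_norm2:
            break
        q = timed_matvec(p)
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = apply_precond(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
        iterations += 1
    timers["total"] = time.perf_counter() - t_start
    return x, iterations, timers


def laplacian_matvec(v):
    """Hypothetical test operator: 1D Laplacian stencil (-1, 2, -1)."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def jacobi(r):
    """Jacobi preconditioner for the stencil above (diagonal is 2)."""
    return [ri / 2.0 for ri in r]


x, iterations, timers = pcg_timed(laplacian_matvec, jacobi, [1.0] * 50)
print(iterations, timers)
```

The timers accumulate over all iterations, so dividing each entry by the iteration count gives a per-iteration cost for that activity.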

The timings for the dot product were used to verify the published specification for the latency on the SP1. The mp_combine call requires two-way communication of a small message (eight bytes) between all nodes, so its time should give an upper bound on the latency time for two messages. The measured latency time (70-80 μsec with two nodes) conformed well with the published latency of 60 μsec, and it grew slowly as more processors were used.
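The idea of bounding latency by timing a two-way exchange of an eight-byte message can be sketched as a ping-pong test. The example below does not use mp_combine or any SP1 facility: it times a round trip over a local socket pair between two threads of one process, so its numbers say nothing about the SP1. The names measure_round_trip and echo_peer are hypothetical. Each round trip consists of two messages, so the averaged time is an upper bound on the latency of two messages on whatever transport is being measured.

```python
import socket
import threading
import time

MESSAGE_SIZE = 8  # bytes, matching the eight-byte message in the text


def recv_exact(sock, n):
    """Receive exactly n bytes, since recv may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def echo_peer(sock, rounds):
    """Second endpoint: send every received message straight back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, MESSAGE_SIZE))


def measure_round_trip(rounds=1000):
    """Return the mean time in seconds for one send plus one reply."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(b, rounds))
    peer.start()
    payload = b"\x00" * MESSAGE_SIZE
    t0 = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)            # first message
        recv_exact(a, MESSAGE_SIZE)   # second message (the reply)
    elapsed = time.perf_counter() - t0
    peer.join()
    a.close()
    b.close()
    return elapsed / rounds


print(f"mean round trip: {measure_round_trip() * 1e6:.1f} microseconds")
```

Averaging over many rounds keeps timer resolution from dominating the result, which matters when the quantity being measured is only tens of microseconds.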

ADAMS
Thu May 18 11:22:16 PDT 1995