Performance

Performance was tested on a structure with 21,600 degrees of freedom, about 0.20% sparsity (an average bandwidth of 434), and 1835 iterations. The speedup is reasonably good (Fig. 1 and Fig. 2), but the recorded vector communication times are higher than the published specifications would suggest: about twice the expected time for this problem, and much worse in other tests with larger vectors.

The reason for this discrepancy between the published and measured performance is not clear. The line topology used by this algorithm may be mapped onto the hardware with shared or multi-step communication paths, or the communication buffers may not be large enough. However, the communication paths are very simple and the vectors are only about 4,000 bytes, so it is not clear where the problem lies.
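One rough way to frame the comparison is the common linear cost model, in which a transfer takes a fixed startup latency plus the message size divided by the link bandwidth. The sketch below is illustrative only: the latency and bandwidth values are placeholders, not the published specifications of the machine used here, and the function names are hypothetical.

```python
def expected_transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: startup latency plus size over bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def slowdown_ratio(measured_s, n_bytes, latency_s, bandwidth_bytes_per_s):
    """How many times slower a measured transfer is than the model predicts."""
    predicted_s = expected_transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s)
    return measured_s / predicted_s


# Placeholder figures (assumptions, NOT the real hardware specifications).
LATENCY_S = 100e-6      # hypothetical startup cost per message, in seconds
BANDWIDTH = 10e6        # hypothetical link bandwidth, in bytes per second
VECTOR_BYTES = 4000     # approximate vector size quoted in the text

predicted = expected_transfer_time(VECTOR_BYTES, LATENCY_S, BANDWIDTH)
```

With the real latency and bandwidth substituted, a ratio near 2 from `slowdown_ratio` would correspond to the discrepancy described above. A ratio that grows with vector size would point toward a bandwidth or buffering effect rather than a fixed per-message cost.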

Another performance problem, which may be related to the previous one, is that the time spent outside the measured activities does not decrease as it should (see 'other time' in Fig. 1). The unmeasured time comprises the local dot product computation, the loop test, the local saxpys, and some synchronization time (i.e. waiting for other processors). Although the performance is not fully modeled, it is reasonably good. This is because the PCG algorithm is rather easy to parallelize, and because this project has only addressed the main loop of the PCG algorithm.
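To show where these activities sit in the main loop, here is a minimal serial sketch of preconditioned conjugate gradients. It is not the project's code: the Jacobi (diagonal) preconditioner, the dense list-of-lists matrix, and all function names are assumptions made for illustration. Comments mark the steps that would involve vector communication or waiting on other processors in a parallel version.

```python
def dot(x, y):
    """Local dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def saxpy(alpha, x, y):
    """Return alpha * x + y, element by element."""
    return [alpha * a + b for a, b in zip(x, y)]


def matvec(A, x):
    """Dense matrix-vector product (stand-in for the sparse product)."""
    return [dot(row, x) for row in A]


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Uses a Jacobi preconditioner, which is an assumption of this sketch.
    Returns the solution and the number of iterations performed.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                  # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:               # loop test
            return x, iteration
        Ap = matvec(A, p)          # parallel: needs vector communication
        alpha = rz / dot(p, Ap)    # parallel: local dot product + global sum
        x = saxpy(alpha, p, x)     # local saxpy
        r = saxpy(-alpha, Ap, r)   # local saxpy
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)         # parallel: local dot product + global sum
        p = saxpy(rz_new / rz, p, z)
        rz = rz_new
    return x, max_iter


# Small usage example on a 2x2 symmetric positive definite system.
solution, iterations = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In a parallel version, each global sum forces every processor to wait for the slowest one, which is one way synchronization time can end up in the 'other time' category.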

ADAMS
Thu May 18 11:22:16 PDT 1995