Daniel M. Pressel
Computer Scientist
U.S. Army Research Laboratory
===============================================================================

Combining RISC-based SMPs and Loop-Level Parallelism Sets the Standard

Daniel M. Pressel
U.S. Army Research Laboratory, APG, MD

The two principal metrics for success in parallel processing are:

1) The solution avoids a loss of performance.
2) The solution produces a significant increase in performance.

The first category frequently involves the argument that the researcher could not afford a computer with a very large memory and therefore had only two choices: write an out-of-core solver or write a parallel program. Given those options, it is not surprising that the parallel program is considered a success. However, the second metric is the one more commonly discussed, and it is the focus of this work.

When parallelizing many scientific programs, it is commonly observed that the best serial/vector algorithms do not lend themselves to parallel programming. As a result, the researcher traditionally has been left with two choices:

1) Use a more easily parallelized algorithm, regardless of its efficiency.
2) Modify the existing algorithm, even if its performance is degraded (e.g., Domain Decomposition can severely degrade the convergence properties).

The net effect is that even when a program appears to demonstrate good performance in terms of MFLOPS, its actual performance in terms of time to completion might be 1/2 to 1/10th (or less) of what is expected.

In an effort to avoid this loss of performance, we have used two concepts:

1) Traditional approaches to parallel programming were based on using large numbers of processors. With today's RISC processors, fewer than 100 processors will frequently suffice. Therefore, it is now possible to revisit the highly efficient serial/vector algorithms that had previously been discarded.
2) Loop-level parallelism on well-designed RISC-based SMPs can allow one to efficiently parallelize vectorizable code (after significant implementation-level tuning). This allows one to use 10-100 processors without any degradation of the algorithmic efficiency; a minimal sketch of this style of parallelization is given below.

Traditionally, there have been two main objections to these concepts. The first is that using fewer than 100 processors does not provide enough performance. Given the performance of today's processors, this is frequently no longer the case. The second objection is that the parallel speedups from loop-level parallelism are too low. We feel that this second conclusion is based on a number of incorrect assumptions and does not agree with our experimental results.

For an implicit CFD code (F3D), our results on an Origin 2000 have produced speedups, relative to a vector-optimized version of the code running on one processor of a Cray C90, of as much as a factor of 27 (for a 59 million grid point data set). Relative to the performance of the code on one processor of an Origin 2000, there was a speedup of a factor of 70 when using 118 processors.
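The sketch below illustrates the style of loop-level parallelism described above. It is not code from F3D; it is a minimal C/OpenMP example with invented routine, array names, and sizes, intended only to show how the outer loop of a vectorizable kernel can be spread across the processors of an SMP without altering the underlying algorithm.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define NI 512
    #define NJ 512

    /* Illustrative only: a vectorizable Jacobi-style smoothing pass.
     * The outer loop is distributed across the processors of an SMP,
     * while the inner loop remains a long, unit-stride (vectorizable)
     * loop.  The numerical algorithm itself is unchanged, so there is
     * no loss of algorithmic (e.g., convergence) efficiency.          */
    void smooth(const double *in, double *out)
    {
        #pragma omp parallel for
        for (int i = 1; i < NI - 1; i++) {
            for (int j = 1; j < NJ - 1; j++) {
                out[i * NJ + j] = 0.25 * (in[(i - 1) * NJ + j] +
                                          in[(i + 1) * NJ + j] +
                                          in[i * NJ + j - 1]   +
                                          in[i * NJ + j + 1]);
            }
        }
    }

    int main(void)
    {
        double *a = calloc(NI * NJ, sizeof *a);
        double *b = calloc(NI * NJ, sizeof *b);
        if (!a || !b) return 1;
        a[(NI / 2) * NJ + NJ / 2] = 1.0;   /* arbitrary test data */
        smooth(a, b);
        printf("ran with up to %d threads\n", omp_get_max_threads());
        free(a);
        free(b);
        return 0;
    }

Built with an OpenMP-capable compiler (for example, cc -fopenmp), the same source runs unchanged on anywhere from 1 to 100+ processors, with the thread count chosen at run time (for example, via OMP_NUM_THREADS).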
Using Domain Decomposition with this algorithm would either have required significant modifications to the algorithm or would have degraded the convergence properties roughly in proportion to the cube root of the number of processors being used. This implies that obtaining a comparable level of performance with Domain Decomposition would require over 500 processors!
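As a rough back-of-the-envelope check of that figure (reading the cube-root claim at face value and assuming otherwise ideal parallel scaling): if the number of iterations grows like P^(1/3) on P processors, the time-to-solution speedup scales like P / P^(1/3) = P^(2/3). Matching the measured speedup of 70 then requires P^(2/3) >= 70, i.e., P >= 70^(3/2), which is approximately 586 processors.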