Measuring Performance of Parallel Programs

Sharks and Fish - A Collection of Parallel Programming Problems

(CS 267, Jan 26 1995)

Review of the last 2 lectures

We studied in some detail how to write a fast matrix-multiplication algorithm for the RS6000/590. We did this in order to illustrate several more general points:
  • To get full speed out of the architecture, one must exploit parallelism, pipelining, and locality. These are ubiquitous issues at all levels of parallel computing, not just the very low level of the RS6000/590 CPU and cache. We will see them throughout the semester.
  • It is a challenge to juggle all three issues simultaneously and get top speed.
  • Since it is a challenge, one should try to use a higher level building block (such as the ESSL library) which has already dealt with these issues and hidden them from the user. We will discuss many other such building blocks, at different levels of abstraction, during the semester.
  • Using such a higher level building block effectively often requires reorganizing existing algorithms or designing new ones. We will illustrate this later with the example of Gaussian elimination, which is very natural to express in terms of saxpys or matrix-vector products, but takes some effort to reorganize to use matrix-matrix multiplication.
  • Just as it is difficult to juggle parallelism, pipelining and locality, it is also a challenge to reorganize algorithms in the way just suggested. Therefore, we cannot yet expect compilers to do this automatically for us in all cases. This is an active research area, which we will discuss later in the semester. This also means we cannot yet hope to avoid knowing the grungy architectural details we discussed for the RS6000/590, if we hope to get absolutely top speed. Designing libraries, compilers, and other tools to make knowing such grungy details unnecessary is a major motivation for people building software and hardware systems for parallel computing.

How to measure the performance of a parallel program

    The following is a list of ways we will measure the performance of a parallel algorithm (see also Section 4.1 of the text, or click here for a gentler introduction).
  • T(p,n) = time to solve a problem of size n on p processors. Sometimes we omit n and write T(p).
  • T(1) = serial or sequential time, using the best serial algorithm, which is not necessarily the parallel algorithm with p set to 1.
  • Speedup(p) = T(1)/T(p) = how much faster you go on p processors than on 1 processor. A "speedup plot" is Speedup(p) plotted versus p. If you use a poor serial algorithm, T(1), and hence Speedup(p), will be artificially large, and your parallel algorithm will look artificially good. (See the paper "Misleading Performance Reporting in the Supercomputing Field" by Bailey in Volume 6 of the class reference material.)
  • Efficiency(p) = Speedup(p)/p. An "efficiency plot" is Efficiency(p) plotted versus p.
  • Speedup and efficiency are the most common ways to report the performance of a parallel algorithm; a small sketch of computing them from measured times appears below.
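
    As a small illustration, here is how one might compute and plot speedup and efficiency in plain Matlab (the timings below are hypothetical, chosen only to show the arithmetic):

        % Hypothetical measured times T(p); T(1) should come from the best serial code.
        p = [1 2 4 8 16];                    % numbers of processors
        T = [100 52 27 15 9];                % measured times in seconds (illustrative)
        speedup    = T(1) ./ T;              % Speedup(p) = T(1)/T(p)
        efficiency = speedup ./ p;           % Efficiency(p) = Speedup(p)/p
        plot(p, speedup, 'o-', p, p, '--')   % speedup plot, with the line Speedup = p for reference
        xlabel('processors p'), ylabel('Speedup(p)')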

    It is a ``pseudo-theorem'' that Speedup(p) <= p, or Efficiency(p) <= 1, because in principle one processor can simulate the actions of p processors in at most p times as long, by taking one step of each of the p serial programs making up the parallel program in round-robin fashion. This idea is due to Brent. Actually, we can sometimes get "superlinear speedup", i.e. Speedup(p)>p, if either

  • the algorithm is nondeterministic, and a different path is taken in the parallel algorithm than in the serial algorithm (this happens in search problems and symbolic computing), or
  • the problem is too large to fit on one processor without hitting a slower level of the memory hierarchy (e.g. paging or ``thrashing''), whereas it fits in faster memory when divided among p processors.

    Our pseudo-theorem implies that if we plot the straight line at 45 degrees through the origin on a speedup plot (i.e. the line Speedup(p) = p), it will bound the true speedup from above, and the quality of the parallel implementation can be measured by how close our algorithm gets to this ``perfect speedup curve''.

    Perfect speedup is seldom attainable. Instead, we often try to design "scalable algorithms", where Efficiency(p) is bounded away from 0 as p grows. This means that we are guaranteed to get some benefit proportional to the size of the machine, as we use more and more processors.

    A common variation on these performance measures is as follows: since people often buy a larger machine to run a larger problem, not just to run the same problem faster, we let the problem size n(p) grow with p, and measure the ``scaled speedup'' T(1,n(p))/T(p,n(p)) and ``scaled efficiency'' T(1,n(p))/(T(p,n(p))*p). For example, n(p) is often chosen so that the data or work per processor stays roughly constant as p grows.
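
    As a small illustration (all numbers below are hypothetical, and the choice of n(p) is just one common convention: n is doubled each time p is quadrupled, as one might do for an O(n^2) computation so that the work per processor stays fixed):

        % Hypothetical timings for a scaled-speedup experiment.
        p  = [1 4 16 64];                    % numbers of processors
        n  = [1000 2000 4000 8000];          % problem sizes n(p), grown with p
        T1 = [50 200 800 3200];              % T(1,n(p)): serial time for each size (illustrative)
        Tp = [50 55 60 66];                  % T(p,n(p)): parallel time (illustrative)
        scaled_speedup    = T1 ./ Tp;        % T(1,n(p)) / T(p,n(p))
        scaled_efficiency = scaled_speedup ./ p;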

    Amdahl's law gives a simple bound on how much speedup we can expect: suppose a problem spends fraction f<1 of its time doing work that can be parallelized, and fraction s=1-f doing serial work, which cannot be parallelized. Then T(p) = T(1)*f/p + T(1)*s, and Speedup(p) = T(1)/T(p) = 1/(f/p+s) <= 1/s, no matter how big p is, while Efficiency(p) = 1/(f+s*p) goes to 0 as p increases. In other words speedup is bounded, and in fact increasing p past f/s can't increase speed by more than a factor of two, since at p = f/s the parallel and serial parts already take equal time and the speedup is 1/(2s), half its limiting value of 1/s.

    Amdahl's law teaches us this lesson: we need to make sure there are no serial bottlenecks (the s part) in our code if we hope to have a scalable algorithm. For example, even if only s=1% of a program is serial, the speedup is limited to 1/s = 100, and so it is not worth using a machine with more than 100 processors. As we will see, this means we need to measure performance carefully (profiling) to identify serial bottlenecks. There will be a handout on profiling, since there are a number of ways to do measurements inaccurately.
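
    For instance, here is a small sketch (plain Matlab again) of what Amdahl's law predicts when the serial fraction is s = 1%:

        s = 0.01;  f = 1 - s;                % 1% serial work, 99% parallelizable
        p = [1 10 100 1000 10000];
        speedup = 1 ./ (f./p + s);           % Speedup(p) = 1/(f/p + s)
        % gives roughly 1, 9.2, 50.3, 91.0, 99.0: the speedup creeps toward,
        % but can never exceed, the limit 1/s = 100, and at p = f/s = 99 it is
        % already about half of that limit.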

Sharks and Fish - a collection of parallel programming problems

    This is a collection of 6 successively more difficult parallel programming problems, designed to illustrate many parallel programming issues and techniques. It is basically a simulation of "particles" moving around and interacting subject to certain rules, which are not entirely physical but are instructive and amusing. Some of the problems are discrete (the sharks and fish can only occupy a discrete set of positions), and some are continuous (the sharks and fish can be anywhere). We have working implementations of many of these problems in 5 programming languages, 4 of which are parallel, to illustrate how the same algorithm is expressed in different parallel programming models. You will have programming assignments involving modifying some of these implementations. The 5 languages are
  • Matlab. This is a serial language, but one which lends itself to programming in terms of large operations (like matrix multiply) on rectangular arrays of numbers. In addition to Matlab's own on-line help facility, a mini-manual is located in Volume 4 of the class reference material, or here.
  • Connection Machine Fortran, also called CM Fortran or CMF. One can think of this CM-5-specific language as a parallel Matlab, and a close relative of languages now or soon to be available on other machines, such as Fortran 90 and High Performance Fortran, or HPF. A CMF manual is available in Volume 2 of the class reference material, or by typing cmview while logged in to rodin. An HPF manual is located in Volume 5 of the class reference material, or here.
  • Split-C is a locally designed and implemented language which augments C with just enough parallel constructs to expose the basic parallelism provided by the machine. Originally implemented for the CM-5, it has been ported to many machines, including all the platforms we will use this semester. A manual is located in Volume 4 of the class reference material, or here.
  • CMMD (not really an acronym) is a "message passing library" for the CM-5, i.e. a package of C or Fortran callable subroutines which allow one to explicitly send and receive messages from one processor to another. This is the lowest level at which one does parallel programming, and so the least pleasant, but the most common. Documentation is available in Volume 1 of the class reference material, or by typing cmview while logged in to rodin.
  • pSather is a parallel object-oriented language designed and implemented locally at ICSI. A manual is located in Volume 4 of the class reference material, or here.
  • Now we discuss the rules followed by the sharks and fish in more detail.
  • Sharks and fish live in a 2D ocean, moving, breeding, eating and dying.
  • The ocean is square and periodic, so fish swimming out to the left reenter at the right, and so on.
  • The ocean may either be discrete, where sharks and fish are constrained to move from one grid point to a neighboring grid point, or the ocean may be continuous.
  • In all cases, the sharks and fish move according to a "force law" which may be written
       force on a shark (or fish) =   force_External 
                                         (a current, felt independently by each 
                                          shark or fish, including a random component)
                                    + force_Nearest_Neighbors 
                                         (sharks are strongly attracted by nearby fish)
                                    + force_"Gravity"  
                                         (sharks are attracted to fish, and 
                                          fish repelled by sharks)
    
  • These three kinds of forces are parallelized in different ways: the external force can be computed independently for each fish, and will be the easiest force to parallelize. Forces which depend only on the nearest neighbors, or very close neighbors, require relatively little cooperation between processors and are next easiest to parallelize. Forces which depend on all other fish, like gravity, require the cleverest algorithms to compute efficiently (in serial or parallel); a small sketch of such an all-to-all force appears after the table at the end of these notes.
  • Fish and sharks breed if old enough.
  • A shark eats any fish it "collides" with, and dies if it does not eat for too long.
  • Here are the rules in brief, with more details available here.
  • Sharks and Fish 1. Fish alone move continuously subject to an external current and Newton's laws.
  • Sharks and Fish 2. Fish alone move continuously subject to gravitational attraction and Newton's laws.
  • Sharks and Fish 3. Fish alone play the "Game of Life" on a square grid.
  • Sharks and Fish 4. Fish alone move randomly on a square grid, with at most one fish per grid point.
  • Sharks and Fish 5. Sharks and Fish both move randomly on a square grid, with at most one fish or shark per grid point, including rules for fish attracting sharks, eating, breeding and dying.
  • Sharks and Fish 6. Like Sharks and Fish 5, but continuous, subject to Newton's laws.
  • The following software for these problems exists on rodin under /usr/castle/share/proj/shortcourse/wator. It can also be seen by clicking on the x's below. In each case you will see a directory of source code for each problem. In the case of the Matlab code, the routine fishXinit.m is an initialization program, to be run first to set up the problem (where X varies from 1 to 5), and fishX.m is the main Matlab solution routine.

    	Language	Problem number
    			1	2	3	4	5	6
    	Matlab		x	x	x	x	x
    	CMMD		x	x	x
    	CMF		x	x	x	x
    	Split-C		x	x	x	x
    	pSather						x
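
    To make the distinction among the three kinds of forces concrete, here is a minimal Matlab sketch of the all-to-all ``gravity'' force (this is not the class code; the variable names and the inverse-square form of the attraction are illustrative assumptions). Because every fish contributes to the force on every other fish, the cost grows as the square of the number of fish, which is why this force needs the cleverest serial and parallel algorithms:

        n     = 100;
        pos   = rand(n,1) + 1i*rand(n,1);    % fish positions stored as complex numbers x + i*y
        force = zeros(n,1);
        for k = 1:n
          d    = pos - pos(k);               % vectors from fish k to every other fish
          r3   = abs(d).^3;                  % cube of the distances
          d(k) = 0;  r3(k) = 1;              % zero out the self-interaction term
          force(k) = sum(d ./ r3);           % inverse-square attraction toward all other fish
        end

    By contrast, the external current could be applied to all fish in a single vectorized statement with no communication between processors, and a nearest-neighbor force would only need to examine the few fish close to each fish.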