Measuring Performance of Parallel Programs
Sharks and Fish - A Collection of Parallel Programming Problems
(CS 267, Jan 26 1995)
Review of the last 2 lectures
We studied how to write a fast matrix-multiplication algorithm in some detail
on the RS6000/590.
We did this in order to illustrate several more general points:
How to measure the performance of a parallel program
The following is a list of ways we will measure the performance of a parallel
algorithm (see also section 4.1 of the text, or click
here for a gentler introduction).
Speedup and efficiency are the most common ways to report the performance
of a parallel algorithm: Speedup(p) = T(1)/T(p), where T(p) is the running
time on p processors, and Efficiency(p) = Speedup(p)/p.
It is a ``pseudo theorem'' that Speedup(p) <= p, or
Efficiency(p) <= 1, because in principle one processor can simulate
the actions of p processors in at most p times as long, by taking
one step of each of the p serial programs making up the parallel
program, in a round robin fashion. This idea is due to Brent.
Actually, we can sometimes get "superlinear speedup", i.e.
Speedup(p)>p, if either the p processors together have more aggregate
cache or memory than one processor, so the data fits in fast memory only
in the parallel case, or the parallel algorithm happens to do less total
work than the serial one (as can occur in search problems).
Our pseudo-theorem implies that if we plot the straight line at 45 degrees
through the origin on a speedup plot (i.e. Speedup = p), it will bound the
true speedup from above, and the quality of the parallel implementation
can be measured by how close our algorithm comes to this
``perfect speedup curve''.
Perfect speedup is seldom attainable. Instead, we often
try to design "scalable algorithms", where Efficiency(p) is bounded away
from 0 as p grows. This means that we are guaranteed to get some benefit
proportional to the size of the machine, as we use more and more processors.
A common variation on these performance measures is as follows:
since people often buy
a larger machine to run a larger problem, not just the same problem
faster, we let the problem size n(p) grow with p, and measure the
``scaled speedup'' T(1,n(p))/T(p,n(p)) and the
``scaled efficiency'' T(1,n(p))/(T(p,n(p))*p).
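The difference between fixed-size and scaled measurements can be seen with a toy timing model; the model T(p,n) = a*n/p + b (perfectly parallelizable work plus a fixed serial cost) and its constants below are made up purely to illustrate the definitions:

```python
# Hypothetical timing model: parallel work a*n/p plus a fixed serial cost b.
a, b = 1e-6, 0.5

def T(p, n):
    return a * n / p + b

n0 = 10**6

# Fixed problem size: Efficiency(p) = T(1,n0)/(p*T(p,n0)) decays toward 0.
for p in (1, 4, 16, 64):
    print(p, T(1, n0) / (p * T(p, n0)))

# Scaled problem size n(p) = n0*p: scaled efficiency stays bounded away
# from 0 (in this model it tends to a*n0/(a*n0 + b) = 2/3 as p grows).
for p in (1, 4, 16, 64):
    n = n0 * p
    print(p, T(1, n) / (p * T(p, n)))
```

In this model the algorithm is "scalable" in the sense above: by growing the problem with the machine, each added processor keeps doing a fixed fraction of useful work.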
Amdahl's law gives a simple bound on how much speedup we can expect:
suppose a problem spends fraction f<1 of its time doing
work that can be parallelized, and fraction s=1-f doing serial work,
which cannot be parallelized. Then T(p) = T(1)*f/p + T(1)*s, and
Speedup(p) = T(1)/T(p) = 1/(f/p+s) <= 1/s, no matter how big p is.
Efficiency(p) = 1/(f+s*p) goes to 0 as p increases.
In other words speedup is bounded, and in fact increasing p past f/s can't
increase speedup by more than a factor of two.
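These bounds are easy to check numerically; the sketch below simply evaluates the formula above for a 1% serial fraction (the processor counts are arbitrary):

```python
# Amdahl's law: Speedup(p) = 1/(f/p + s), with serial fraction s = 1 - f.
def amdahl_speedup(p, s):
    f = 1.0 - s
    return 1.0 / (f / p + s)

s = 0.01                          # 1% serial work
print(amdahl_speedup(100, s))     # about 50, half of the limit 1/s = 100
print(amdahl_speedup(10**6, s))   # just under 100, near the limit
# Increasing p beyond f/s (= 99 here) gains at most another factor of 2:
print(amdahl_speedup(10**9, s) / amdahl_speedup(99, s))  # just under 2
```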
Amdahl's law teaches us this lesson:
We need to make sure there are no serial bottlenecks (the s part)
in our codes if we hope to have a scalable algorithm.
For example, even if only s=1% of a program is serial, the speedup is
limited to 100, and so it is not worth using a machine with more than
100 processors. As we will see,
this means we need to measure performance carefully (profiling) to
identify serial bottlenecks. There will be a handout on profiling,
since there are a number of ways to do measurements inaccurately.
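Pending that handout, one common precaution when timing by hand is to repeat the measured section several times and report the minimum, since a single wall-clock measurement is easily inflated by unrelated system activity. The workload below is made up purely for illustration:

```python
import time

def work():
    # Stand-in for the code section being measured.
    return sum(i * i for i in range(100000))

# Repeat the measurement; the minimum is the least-perturbed estimate.
times = []
for _ in range(5):
    t0 = time.perf_counter()
    work()
    times.append(time.perf_counter() - t0)
print("best of 5: %.6f seconds" % min(times))
```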
Sharks and Fish - a collection of parallel programming problems
This is a collection of 6 successively more difficult parallel programming
problems, designed to illustrate many parallel programming issues
and techniques. It is basically a simulation of "particles" moving around
and interacting subject to certain rules, which are not entirely
physical but are instructive and amusing.
Some of the problems are discrete
(the sharks and fish can only occupy a discrete set of positions), and
some are continuous (the sharks and fish can be anywhere).
We have working implementations of many of these problems in 5
programming languages, 4 of which are parallel, to illustrate
how the same algorithm is expressed in different parallel
programming models. You will have programming assignments involving modifying
some of these implementations. The 5 languages are
Now we discuss the rules followed by the sharks and fish in more detail.
Here are the rules in brief, with more details available
The following software for these problems exists on rodin under
/usr/castle/share/proj/shortcourse/wator. It can also be seen by
clicking on the x's below. In each case you will see a directory
of source code for each problem. In the case of the Matlab code,
the routine fishXinit.m is an initialization program, to be run
first to set up the problem (where X varies from 1 to 5),
and fishX.m is the main Matlab solution routine.
Language   Problem number
           1  2  3  4  5  6
Matlab     x  x  x  x  x
CMMD       x  x  x
CMF        x  x  x  x
Split-C    x  x  x  x