The parallel and/or high performance machines available to us this semester include:

The languages we will use include:

We will discuss algorithms for linear algebra problems, such as solving linear systems of equations and finding eigenvalues. These arise in applications from many fields (including structural mechanics, circuit simulation, computational chemistry, etc.). We will cover methods for both dense and sparse matrices, and direct and iterative methods (for example, Gaussian elimination and conjugate gradients, respectively). Many of these algorithms have been implemented in the libraries discussed above.

These linear algebra problems often arise from solving differential equations, which themselves appear in many areas. We will discuss the heat equation, Poisson's equation, and equations involved in climate modeling in particular.

We will also discuss a variety of more complicated applications, which require complicated distributed data structures or a combination of the techniques discussed earlier. The list of applications includes circuit simulation, solving systems of polynomial equations algebraically, cell simulation, computational genetics, game search, and electromagnetic field simulation. The choice will depend on the interest of the class.

Combinatorial problems such as sorting, computing connected components of a graph, and the Traveling Salesman Problem will also be discussed.

Scientific and engineering problems requiring the most computing power to simulate are commonly called "Grand Challenges." The largest of these problems, like predicting the climate 50 years hence, are estimated to require computers computing at the rate of 1 Tflop = 1 Teraflop = 10^12 floating point operations per second, and with a memory size of 1 TB = 1 Terabyte = 10^12 bytes. Here is some commonly used notation we will use to describe problem sizes:

1 Mflop = 1 Megaflop = 10^6 floating point operations per second
1 Gflop = 1 Gigaflop = 10^9 floating point operations per second
1 Tflop = 1 Teraflop = 10^12 floating point operations per second
1 MB = 1 Megabyte = 10^6 bytes
1 GB = 1 Gigabyte = 10^9 bytes
1 TB = 1 Terabyte = 10^12 bytes
1 PB = 1 Petabyte = 10^15 bytes

Let us illustrate how these problems could require such large computing resources. We will consider climate modeling in particular, using a vastly oversimplified model of a system being developed under NASA support at UCLA and Berkeley. What is climate? Most simply, it is a function of 4 arguments

Climate(longitude, latitude, elevation, time)

which returns a vector of 6 values:

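temperature, pressure, humidity, and wind velocity (3 words).

(More generally, we could ask for the concentration of each chemical species, such as ozone, in addition to the 6 values above.) To represent this continuous function in the computer we need to discretize it, sampling it on a grid of points indexed by (i,j,k) at discrete time steps n, which we write as Climate(i,j,k,n).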

An algorithm to predict the weather (short term) or climate (long term, and only statistically significant) is a function which maps the climate at time t, Climate(i,j,k,n), for all i,j,k, to the climate at the next time step t+dt, Climate(i,j,k,n+1), for all i,j,k. The algorithm involves solving a system of equations including, in particular, the Navier-Stokes equations for fluid flow of the gases in the atmosphere.
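To make the shape of such an algorithm concrete, here is a minimal Python sketch of a time-stepping loop. It is only an illustration under loud assumptions: the grid holds a single value per cell rather than six, the update rule is a placeholder (each cell relaxes toward the average of its six neighbors, with wraparound in every direction), and the names `step` and `simulate` are hypothetical. The real update would come from the Navier-Stokes equations and is far more involved.

```python
def step(grid):
    """Map the grid at time step n to the grid at time step n+1.

    Placeholder physics (an assumption, not the real model): each cell
    moves halfway toward the mean of its six neighbors, with periodic
    wraparound in all three directions.
    """
    ni, nj, nk = len(grid), len(grid[0]), len(grid[0][0])
    new = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                neighbors = (
                    grid[(i - 1) % ni][j][k] + grid[(i + 1) % ni][j][k]
                    + grid[i][(j - 1) % nj][k] + grid[i][(j + 1) % nj][k]
                    + grid[i][j][(k - 1) % nk] + grid[i][j][(k + 1) % nk]
                )
                new[i][j][k] = 0.5 * grid[i][j][k] + 0.5 * neighbors / 6.0
    return new


def simulate(grid, num_steps):
    """Apply step() repeatedly; the input grid is left unchanged."""
    for _ in range(num_steps):
        grid = step(grid)
    return grid


# Tiny 4 x 4 x 2 grid with a single "warm" cell.
grid = [[[0.0] * 2 for _ in range(4)] for _ in range(4)]
grid[0][0][0] = 1.0
later = simulate(grid, 3)
total = sum(v for plane in later for row in plane for v in row)
print(f"total after 3 steps = {total:.6f}")
```

Note that every cell is read and written at every step, so both the work and the memory traffic per step are proportional to the number of cells.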

Suppose we discretize the earth's surface into 1 kilometer-by-1 kilometer cells in the latitude-longitude direction, and with 10 cells in the vertical direction. From the surface area of the earth, we can compute that there are about 5e9 ( = 5 x 10^9) cells in the atmosphere. With six 4-byte words per cell, that comes to about 10^11 bytes = 0.1 TB (to the nearest order of magnitude).
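The cell count and memory estimate can be checked with a few lines of Python. This is a rough sketch: the mean earth radius of 6371 km is an assumed value not given in the text, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius, not taken from the text


def atmosphere_cells(cell_km=1.0, vertical_layers=10):
    """Number of grid cells for square surface cells of side cell_km."""
    surface_area_km2 = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    return surface_area_km2 / cell_km ** 2 * vertical_layers


def memory_bytes(cells, values_per_cell=6, bytes_per_value=4):
    """Storage for one snapshot of the whole grid."""
    return cells * values_per_cell * bytes_per_value


cells = atmosphere_cells()
print(f"cells  ~ {cells:.1e}")
print(f"memory ~ {memory_bytes(cells):.1e} bytes")
```

The result is about 5e9 cells and a little over 1e11 bytes, consistent with the order-of-magnitude figures above.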

Suppose we take 100 flops (floating point operations) to update each cell by one minute. In other words, dt = 1 minute, and computing Climate(i,j,k,n+1) for all i,j,k from Climate(i,j,k,n) takes about 100*5e9 = 5e11 floating point operations. It clearly makes no sense to take longer than one minute to predict the weather one minute from now; otherwise it is cheaper to look out the window. Thus, we must compute at least at the rate of

5e11 flops / 60 secs ~ 8 Gflops.

Weather prediction (computing 24 hours to compute the weather 7 days hence) requires computing 7 times faster, or 56 Gflops. Climate prediction (computing 30 days, a long run, to compute the climate 50 years hence), requires computing 50*12=600 times faster, or 4.8 Tflops.
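The same rates can be reproduced in a short Python sketch (the function name `required_rate` is illustrative). Without the intermediate rounding to 8 Gflops, the weather and climate figures come out slightly higher, at roughly 58 Gflops and 5 Tflops.

```python
def required_rate(flops_per_step, step_seconds, speedup=1.0):
    """Flops per second needed to simulate `speedup` times faster than real time."""
    return flops_per_step / step_seconds * speedup


flops_per_step = 100 * 5e9  # 100 flops per cell, 5e9 cells, dt = 1 minute

real_time = required_rate(flops_per_step, 60.0)
weather = required_rate(flops_per_step, 60.0, speedup=7)        # 7 days in 24 hours
climate = required_rate(flops_per_step, 60.0, speedup=50 * 12)  # 50 years in 30 days

print(f"real time: {real_time / 1e9:.1f} Gflops")
print(f"weather:   {weather / 1e9:.1f} Gflops")
print(f"climate:   {climate / 1e12:.1f} Tflops")
```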

The actual grid resolution used in climate codes today is 4 degrees of latitude by 5 degrees of longitude, or about 450 km by 560 km, a rather coarse resolution. A near term goal is to improve this resolution to 2 degrees by 2.5 degrees, which is four times as much data.
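As a quick check of the factor of four, the sketch below counts the surface grid columns at each resolution. It assumes a uniform latitude-longitude grid covering 180 by 360 degrees, and the function name is illustrative.

```python
def surface_columns(dlat_deg, dlon_deg):
    """Number of grid columns on a uniform latitude-longitude grid."""
    return (180.0 / dlat_deg) * (360.0 / dlon_deg)


coarse = surface_columns(4.0, 5.0)   # current resolution
fine = surface_columns(2.0, 2.5)     # near-term goal
ratio = fine / coarse
print(f"{coarse:.0f} columns -> {fine:.0f} columns, ratio {ratio:.0f}")
```

Halving the spacing in both horizontal directions quadruples the number of columns, and hence the data, if the number of vertical levels is unchanged.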

The size of the overall database is enormous. NASA is putting up weather satellites expected to collect 1TB/day for a period of years, totaling as much as 6 PB of data, which no existing system is large enough to store. The Sequoia 2000 Global Change Research Project is concerned with building this database.

Now consider the 1 TB memory. Memory is conventionally built as a planar grid of bits, in our case say a 10^6 by 10^6 grid of words. If this grid is 0.3 mm by 0.3 mm, then one word occupies about 3 Angstroms by 3 Angstroms, or the size of a small atom. It is hard to imagine where the wires would go!
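The size of one word follows directly from dividing the side of the grid by the number of words along it. A minimal Python check, with illustrative variable names:

```python
side_m = 0.3e-3          # side of the memory grid: 0.3 mm
words_per_side = 10**6   # 10^6 by 10^6 words = 10^12 words in total
ANGSTROM_M = 1e-10       # 1 Angstrom in meters

pitch_m = side_m / words_per_side
pitch_angstrom = pitch_m / ANGSTROM_M
print(f"each word is about {pitch_angstrom:.1f} Angstroms on a side")
```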

But, if we try to solve a much tinier 100-by-100 linear system, the fastest the system will go is 10 Mflops rather than 281 Gflops, and that is attained using just one of the 6768 processors, an expensive waste of silicon. Staying with a single processor, but going up to a 1000x1000 matrix, the speed goes up to 36 Mflops.

Where do the flops go? Why does the speed depend so much on the problem size?
The answer lies in understanding the *memory hierarchy*.
All computers, even cheap ones, look something like this:

Useful work, such as floating point operations, can only be done on data at the top of the hierarchy. So to work on data stored lower in the hierarchy, it must first be transferred to the registers, perhaps displacing other data already there. Transferring data among levels is slow, much slower than the rate at which we can do useful work on data in the registers. In fact, this data transfer is the bottleneck in most computations: more time is spent moving data in the hierarchy than doing useful work.

Good algorithm design requires keeping active data near the top of the hierarchy as long as possible, and minimizing movement between levels. For many problems, like Gaussian elimination, only if the problem is large enough is there enough work to do at the top of the hierarchy to mask the time spent transferring data among lower levels. The more processors one has, the larger the problem has to be to mask this transfer time. We will study this example in detail later.
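A rough way to see why problem size matters: Gaussian elimination on an n-by-n matrix takes about (2/3)n^3 floating point operations but involves only n^2 matrix entries, so the work available per word of data grows in proportion to n. The Python sketch below computes this ratio. It is a best-case model that assumes each entry is moved up the hierarchy only once, which real machines do not achieve, and the function names are illustrative.

```python
def lu_flops(n):
    """Approximate flop count of Gaussian elimination on an n-by-n matrix."""
    return 2.0 * n ** 3 / 3.0


def flops_per_word(n):
    """Best-case flops available per matrix entry (each entry moved once)."""
    return lu_flops(n) / (n * n)


for n in (100, 1000, 10000):
    print(f"n = {n:5d}: about {flops_per_word(n):.0f} flops per word")
```

The larger the ratio, the more useful work there is to overlap with, or amortize over, each slow data transfer.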

It is often remarked that speeds of basic microprocessors grow by a factor of 2
every 18 months; this empirical observation, true over many years, is called
*Moore's Law*. In case you think that parallel programming is too hard,
and that you would rather wait until Moore's Law makes the personal computer
on your desk fast enough for your problem, think again: the reason Moore's Law
holds is that microprocessor manufacturers are adopting many of the tricks of
parallel computing and accounting for memory hierarchies that we will discuss.
So getting the peak speed from your microprocessor is becoming more difficult.
There is no way around the issues we will discuss.