### U.C. Berkeley Department of Electrical Engineering and Computer Science CS267 Sp96 Assignment 1: Due Thursday, February 1, 5pm

For this first assignment, we will have a little programming contest. Although the assignment isn't about "parallelism" per se, it involves some of the more crucial issues in high-performance algorithm design, namely dealing with the memory hierarchy.

We've all coded matrix multiply at least once in our lives (and soon at least twice). Here is the simplest algorithm to compute C = C + A*B, where all matrices are n-by-n:

```
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end for
  end for
end for
```
Unfortunately, this algorithm is oblivious to the computer's memory hierarchy, and optimizing compilers provide limited solace. Consider the IBM RS6000/590 which has a peak speed of 266 Mflops. Running the above code in Fortran, compiled using the f77 compiler without optimization, yields the rather lower speed of 4.5 Mflops when multiplying 64-by-64 matrices. Using f77 -O2 (optimization level 2) raises the speed to 65 Mflops, and f77 -O3 -qhot further raises the speed to 92 Mflops, still a far cry from IBM's own proprietary ESSL library routine, which runs at about 240 Mflops (ESSL is written entirely in Fortran).
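In C, with zero-based indexing and the row-major layout described later in this assignment, the same oblivious triple loop might look like the following sketch (the function name `matmul_naive` is our own, not part of the assignment interface):

```
/* Naive O(n^3) triple loop: C = C + A*B for n-by-n matrices
   stored in row-major order (element (i,j) lives at index i*n + j). */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

Note that the innermost loop strides through B by whole rows (n doubles at a time), which is exactly the kind of access pattern that defeats the cache.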

If you didn't want to pay for ESSL, you might look around the Netlib repository of public-domain software on the WWW and discover a public-domain matrix multiply optimized for the RS6000 (dmr). Using f77 -O2 on this routine yields 152 Mflops, and f77 -O3 -qhot yields 186 Mflops, still some distance from ESSL's 240 Mflops. Worse, since the careful programmer always tests whether the answer is correct, he or she will soon discover that the answer using dmr with f77 -O3 -qhot is completely wrong for matrices with n > 64 (well, the compiler did issue a warning message about possibly changing the semantics of the program).

For this assignment, we will be coding an optimized memory-hierarchy-cognizant matrix-matrix multiply routine in C. For simplicity, we will only require square matrices (worrying about the non-square case, essential for a good library code, can be a bit time-consuming). The goal of the assignment is to:

1. write a matrix routine C = C + A*B for square matrices,
2. get as close to peak performance as possible, while still getting the right answer, and
3. get to know each other.

For extra credit, you may optimize your routine for an architecture other than the IBM RS6000, or implement an additional BLAS3 routine.

We will be using an IBM RS6000/590, accounts to be made available soon. Note that the IBM compiler supports a wide variety of compiler optimizations which you should read about (see http://www.austin.ibm.com/tech/options.html, or type 'xlc' once you are logged on to the RS6000).

Here is some basic RS6000 architectural information:

```System       CPU       ClkMHz Cache        SPEC  SPEC  Info  Source
Name         Type      ext/in Ext+I/D      92Int 92 FP Date  Obtained
------------ --------- ------ ------------ ----- ----- ----- ------------------
IBM 590      POWER2    66.6   32/256       121.6 259.7 Jul94 comp.benchmarks
```
More architectural information can be found at http://www.austin.ibm.com.

To judge the relative qualities of your routines, we will have a contest with a non-negligible 1st prize and a ..., well, let's just say a 2nd prize. We will work in mixed teams of three people. Each team will create a single routine and submit it to the matrix appropriations committee (MAC), consisting of the Professor and TA. We'll all gather on the Tuesday following the due date, during the 6:00PM time-slot, with each team making a short 2-minute presentation about their routine and why it should run fast. Then we will time the routine on a dedicated RS6000, as well as make sure you get the right answer. The test program and method for measuring performance will be posted later. The winning team will then be applauded gratuitously and presented with the non-negligible prize.
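Until the official test program appears, you can estimate Mflops yourself: an n-by-n matrix multiply performs 2n^3 floating-point operations (a multiply and an add per inner-loop iteration), so the rate is 2n^3 divided by the elapsed time. Here is a minimal sketch using the standard clock() timer; the function names are our own and this is not the official harness:

```
#include <time.h>

/* Routine under test: naive n-by-n multiply, C = C + A*B, row-major. */
static void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* Times one call and returns the measured rate in Mflops,
   using the 2*n^3 flop count of dense matrix multiply. */
double time_mflops(int n, const double *A, const double *B, double *C)
{
    clock_t start = clock();
    matmul(n, A, B, C);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;   /* call was too fast for clock()'s resolution */
    return (2.0 * n * n * n) / seconds / 1.0e6;
}
```

For serious measurements you would repeat the call several times and take the best rate, since a single timing can be perturbed by other activity on the machine.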

You are welcome to discuss your optimization strategies with other teams. Keep in mind, however, that this is a competition -- giving away too many hints might distort the contest results.

Here is the routine interface we will use.

```
/*
**  mul_mfmf_mf:
**    Optimized matrix multiply routine.
**
**    Parameters:
**      int matdim:  the matrix dimension
**      double * A:  source1 matrix of size (matdim)x(matdim)
**      double * B:  source2 matrix of size (matdim)x(matdim)
**      double * C:  destination matrix of size (matdim)x(matdim)
**
**    Operation:
**      C = C + A*B;
*/
void mul_mfmf_mf(
    int matdim,
    const double *A,
    const double *B,
    double *C);
```
Assume an ANSI-compliant C compiler, and that the arrays are stored in row-major order. This means that the following loop accesses memory locations in increasing consecutive order:
```
for (i = 0; i < matdim; i++)
  for (j = 0; j < matdim; j++)
    access(A[i*matdim + j]);
```

This document is available at http://www.cs.berkeley.edu/~demmel/cs267/assignment1.html. For reference, the BLAS directory on Netlib is located at http://www.netlib.org/blas/index.html. A paper describing the interface in that directory is located at http://www.netlib.org/blas/blasqr.ps, and a BLAS matrix multiply in Fortran can be found at http://www.netlib.org/blas/dgemm.f.

A good set of papers on the general issues of blocking algorithms for memory hierarchies is here. Look at the README file for a list of titles of each paper. tile.ps is also interesting.