CS267 Assignment 1: Optimize Matrix Multiplication
Due Date: Tuesday February 14, 2012 at 11:59PM
Your task is to optimize matrix multiplication (matmul) code to run fast on a single processor core of
NERSC's Franklin cluster.
We consider a special case of matmul:
C := C + A*B
where A, B, and C are n x n matrices.
This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    end
  end
end
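The pseudocode above translates directly into C. A minimal sketch is below, assuming the column-major, n-by-n `square_dgemm` interface that the provided benchmark calls (where element A(i,k) is stored at `A[i + k*n]`); the loop order is chosen so the innermost loop walks A and C with stride 1:

```c
#include <stddef.h>

/* C := C + A*B for n-by-n matrices stored in column-major order.
 * A(i,k) lives at A[i + k*n]. This is the unblocked three-loop version. */
void square_dgemm(int n, const double* A, const double* B, double* C)
{
    for (int j = 0; j < n; ++j)           /* for each column of C */
        for (int k = 0; k < n; ++k) {
            double bkj = B[k + j*n];      /* B(k,j) is invariant in the inner loop */
            for (int i = 0; i < n; ++i)   /* stride-1 access to A and C */
                C[i + j*n] += A[i + k*n] * bkj;
        }
}
```

Any of the six loop orderings computes the same result; they differ only in memory-access pattern, which is exactly what this assignment asks you to optimize.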
 You will be paired up in teams of two. Let your instructors know if you still don't have a teammate after Feb 2.
 Your submission should be a gzipped tar archive, named (for Team 4) team4_hw1.tgz. It should contain:
This link tells you how to use tar to make a .tgz file. Email your .tgz file to the GSIs.
 Your writeup should contain:
 the names of the people in your group,
 the optimizations used or attempted,
 the results of those optimizations,
 the reason for any odd behavior (e.g., dips) in performance, and
 how the performance changed when running your optimized code on a different machine.
For the last requirement, you may run your implementation on another NERSC machine, on your laptop/cellphone, on the cloud, etc.
 Please carefully read the notes for implementation details. Stay tuned to the
CS267 Piazza page
(sign up first) for updates and clarifications, as well as discussion.
 If you are new to optimizing numerical codes, we recommend reading the papers in the references section.
These parts are not graded. You should be satisfied with your square_dgemm results and writeup before beginning an optional part.
 Implement Strassen matmul. Consider switching over to the three-nested-loops algorithm when the recursive subproblems are small enough.
 Support the dgemm interface (i.e., rectangular matrices, transposing, scalar multiples).
 Try float (single-precision). This means you can use 4-way SIMD parallelism on Franklin.
 Try complex numbers (single- and double-precision); note that complex numbers are part of C99 and supported in gcc. This forum thread gives advice on vectorizing complex multiplication with the conventional approach, but note that there are other algorithms for this operation.
 Optimize your matmul for the case when the inputs are symmetric. Consider conventional and packed symmetric storage.
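For the packed-symmetric option, the key piece is the index mapping. A small sketch, assuming column-major packed lower-triangular storage (the function name `packed_index` is our own, not from the handout code):

```c
#include <stddef.h>

/* Packed lower-triangular storage of a symmetric n x n matrix, column-major:
 * only entries (i,j) with i >= j are stored, at index i + j*n - j*(j+1)/2.
 * This halves the storage; an (i,j) with i < j is read from (j,i) by symmetry. */
size_t packed_index(size_t n, size_t i, size_t j)
{
    if (i < j) { size_t t = i; i = j; j = t; }  /* exploit symmetry */
    return i + j*n - j*(j+1)/2;
}
```

For n = 3 this stores the lower triangle in the order (0,0), (1,0), (2,0), (1,1), (2,1), (2,2). Note that packed storage trades memory for less regular access patterns, so measure before assuming it is faster.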
We provide two simple implementations for you to start with:
a naive threeloop implementation similar to the pseudocode above,
and a more cache-efficient blocked implementation.
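The blocked idea can be sketched as follows. This is a simplified version written for illustration, not the exact code shipped in the archive; it partitions the matrices into BLOCK_SIZE-sided tiles so each sub-problem's working set fits in cache:

```c
#include <stddef.h>

#define BLOCK_SIZE 41  /* tile side; tune for the target cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* One M x N += (M x K)*(K x N) sub-problem on column-major data with
 * leading dimension n (the full matrix size). */
static void do_block(int n, int M, int N, int K,
                     const double* A, const double* B, double* C)
{
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k) {
            double bkj = B[k + j*n];
            for (int i = 0; i < M; ++i)
                C[i + j*n] += A[i + k*n] * bkj;
        }
}

/* Blocked C := C + A*B; the min_int calls trim the ragged edge tiles
 * when n is not a multiple of BLOCK_SIZE. */
void square_dgemm_blocked(int n, const double* A, const double* B, double* C)
{
    for (int j = 0; j < n; j += BLOCK_SIZE)
        for (int k = 0; k < n; k += BLOCK_SIZE)
            for (int i = 0; i < n; i += BLOCK_SIZE)
                do_block(n,
                         min_int(BLOCK_SIZE, n - i),
                         min_int(BLOCK_SIZE, n - j),
                         min_int(BLOCK_SIZE, n - k),
                         A + i + k*n, B + k + j*n, C + i + j*n);
}
```

The reference by Lam, Rothberg, and Wolf below explains why the block size matters and how cache conflicts can cause the performance dips you are asked to explain in your writeup.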
The necessary files are in cs267_hw1.tgz. Included are the following:

 dgemm-naive.c
 A naive implementation of matrix multiply using three nested loops,
 dgemm-blocked.c
 A simple blocked implementation of matrix multiply,
 dgemm-blas.c
 A wrapper for the vendor's optimized BLAS implementation of matrix multiply (default: Cray LibSci),
 benchmark.c
 The driver program that measures the runtime and verifies the correctness by comparing with the vendor's implementation,
 Makefile
 A simple makefile to build the executables,
 job-blas, job-blocked, job-naive
 Scripts to run the executables on Franklin compute nodes. For example, type "qsub job-blas" to benchmark the BLAS version.

The documentation for Franklin's programming environment can be found below.
 Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.
(Note: explains the design decisions for the GotoBLAS dgemm implementation, which also apply to your code.)
 Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196–259.
(Note: how to write C code for modern compilers and memory hierarchies, so that it runs fast. Recommended reading, especially for newcomers to code optimization.)
 Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
(Note: PHiPAC is a code-generating autotuner for matmul that started as a submission for this HW in a previous semester of CS267. Also see ATLAS; both are good examples if you are considering code generation strategies.)
 Lam, M. S., Rothberg, E. E, and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 63–74.
(Note: clearly explains cache blocking, supported by performance models.)
 Notes on vectorizing with SSE intrinsics, from lecture 2/9/12, here
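To make the SSE-intrinsics notes concrete, here is a minimal sketch of the innermost dgemm update vectorized two doubles at a time with SSE2 (the function name `saxpy2` is our own; a real kernel would also unroll, align loads, and keep C values in registers across k):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* c[0..1] += a[0..1] * bkj, i.e. C(i:i+1, j) += A(i:i+1, k) * B(k,j),
 * two double-precision lanes per instruction. */
void saxpy2(const double* a, double bkj, double* c)
{
    __m128d va = _mm_loadu_pd(a);            /* load A(i,k), A(i+1,k) */
    __m128d vb = _mm_set1_pd(bkj);           /* broadcast B(k,j) to both lanes */
    __m128d vc = _mm_loadu_pd(c);            /* load C(i,j), C(i+1,j) */
    vc = _mm_add_pd(vc, _mm_mul_pd(va, vb)); /* multiply-accumulate */
    _mm_storeu_pd(c, vc);                    /* store updated C */
}
```

With float instead of double, the same idea gives the 4-way parallelism mentioned in the optional parts.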
You are also welcome to learn from the source code of state-of-the-art BLAS implementations
such as GotoBLAS
and ATLAS.
However, you should not reuse those codes in your submission.
Below are results recorded on Franklin using the provided benchmark. Performance was reproducible to within 5%, so if you feel your performance is misrepresented, please rerun your submitted code to make sure, and then contact the GSIs (cs267.sp12@gmail.com) with this data.
Note that
 Team 1 = naive blocked code (b=41) compiled with GNU, "-O1" optimization
 Team 0 = Cray LibSci DGEMM
 Team 17 = naive blocked code (b=41) compiled with GNU, "-O3 -ffast-math -funroll-loops -march=amdfam10" optimization
[ Back to CS267 Resource Page ]