CS267 Assignment 1: Optimize Matrix Multiplication
Due Date: Tuesday February 14, 2012 at 11:59PM
Your task is to optimize matrix multiplication (matmul) code to run fast on a single processor core of
NERSC's Franklin cluster.
We consider a special case of matmul:
C := C + A*B
where A, B, and C are n x n matrices.
This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
- You will be paired up in teams of two. Let your instructors know if you still don't have a teammate after Feb 2.
- Your submission should be a gzipped tar archive, named (for Team 4, for example) team4_hw1.tgz. It should contain your code and write-up. Use tar to make the .tgz file, and email it to the GSIs.
- Your write-up should contain:
  - the names of the people in your group,
  - the optimizations used or attempted,
  - the results of those optimizations,
  - the reason for any odd behavior (e.g., dips) in performance, and
  - how the performance changed when running your optimized code on a different machine.
  For the last requirement, you may run your implementation on another NERSC machine, on your laptop/cellphone, on the cloud, etc.
- Please carefully read the notes for implementation details. Stay tuned to the CS267 Piazza page (sign up first) for updates and clarifications, as well as discussion.
- If you are new to optimizing numerical codes, we recommend reading the papers in the references section.
These parts are not graded. You should be satisfied with your square_dgemm results and write-up before beginning an optional part.
- Implement Strassen matmul. Consider switching over to the three-nested-loops algorithm when the recursive subproblems are small enough.
- Support the dgemm interface (i.e., rectangular matrices, transposing, scalar multiples).
- Try float (single-precision). This means you can use 4-way SIMD parallelism on Franklin.
- Try complex numbers (single- and double-precision) - note that complex numbers are part of C99 and supported in gcc. This forum thread gives advice on vectorizing complex multiplication with the conventional approach - but note that there are other algorithms for this operation.
- Optimize your matmul for the case when the inputs are symmetric. Consider conventional and packed symmetric storage.
We provide two simple implementations for you to start with:
a naive three-loop implementation similar to the pseudocode above,
and a more cache-efficient blocked implementation.
The necessary files are in cs267_hw1.tgz. The documentation for Franklin's programming environment can be found below. The archive includes the following:
- A naive implementation of matrix multiply using three nested loops,
- A simple blocked implementation of matrix multiply,
- A wrapper for the vendor's optimized BLAS implementation of matrix multiply (default: Cray LibSci),
- The driver program that measures the runtime and verifies the correctness by comparing with the vendor's implementation,
- A simple makefile to build the executables,
- job-blas, job-blocked, job-naive: scripts to run the executables on Franklin compute nodes. For example, type "qsub job-blas" to benchmark the BLAS version.
You are also welcome to learn from the source code of state-of-the-art BLAS implementations such as GotoBLAS. However, you should not reuse that code in your submission.
- Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.
(Note: explains the design decisions for the GotoBLAS dgemm implementation, which also apply to your code.)
- Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196–259.
(Note: how to write C code for modern compilers and memory hierarchies, so that it runs fast. Recommended reading, especially for newcomers to code optimization.)
- Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
(Note: PHiPAC is a code-generating autotuner for matmul that started as a submission for this HW in a previous semester of CS267. Also see ATLAS; both are good examples if you are considering code generation strategies.)
- Lam, M. S., Rothberg, E. E, and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 63–74.
(Note: clearly explains cache blocking, supported by performance models.)
- Notes on vectorizing with SSE intrinsics, from the 2/9/12 lecture.
Below are results recorded on Franklin using the provided benchmark. Performance was reproducible to within 5%, so if you feel your performance is misrepresented, please re-run your submitted code to make sure, and then contact the GSIs (firstname.lastname@example.org) with this data.
- Team -1 = naive blocked code (b=41) compiled with GNU, "-O1" optimization
- Team 0 = Cray LibSci DGEMM
- Team 17 = naive blocked code (b=41) compiled with GNU, "-O3 -ffast-math -funroll-loops -march=amdfam10" optimization
[ Back to CS267 Resource Page ]