This assignment explores the cache behavior of the NOW UltraSPARC 170 processors, both by running a simple memory microbenchmark like the one in Lecture 2 and by trying to make matrix multiplication run as fast as possible.
You will be using the NOW cluster for this assignment, which runs the GLUnix operating system. I have constructed a NOW / GLUnix tutorial. The NOW machines reject normal telnet/rlogin sessions, so you will have to use either Kerberos or ssh to access the NOW. There is also a Kerberos / ssh tutorial.
Grab these files:
The memory benchmark program measures the time to read and write the elements of an array while it varies the length of the array and the stride through the array. This will reveal surprisingly many details of the memory hierarchy of the machine you run the benchmark on. Note that the reported time is for a read *and* write. Consequently, though the first-level cache access on an UltraSPARC is 6ns (one cycle at 167 MHz), the graph should report 12ns for the L1 cache latency.
To create the graph (membench.eps), simply type 'gmake' in a directory containing the three files. Make sure that the machine is idle when you do this so that the graph is free of noise. Examine the graph. Quantify and label the features that reveal the following:
This is not a required part of the assignment, but you may be curious to quantify the following as well:
Grab these files:
Part I of this assignment should familiarize you with the memory hierarchy of the NOW. The goal is to optimize matrix multiplication on these machines. As you can see from the graph from Part I of this assignment, the latencies to different levels of the memory hierarchy can differ enormously. This suggests that paying careful attention to where loads and stores land in the cache hierarchy can go a long way to increasing performance. The operation is C = C + A * B. For simplicity, A, B, and C are square matrices.
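One standard way to exploit this observation is cache blocking (tiling): operate on sub-blocks of A, B, and C small enough that all three tiles fit in the L1 cache, so each element is reused many times per miss. A minimal sketch, assuming row-major storage (check driver.c for the layout it actually uses) and a hypothetical BLOCK tuning parameter; the driver supplies main, so only the routine is shown:

```c
/* Cache-blocked C = C + A*B for n x n row-major matrices (illustrative
   sketch, not a tuned solution). BLOCK is a knob you would size so that
   three BLOCKxBLOCK tiles of doubles fit in L1. */
#define BLOCK 32

static void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                /* One tile: C[ii..][jj..] += A[ii..][kk..] * B[kk..][jj..] */
                for (int i = ii; i < n && i < ii + BLOCK; i++)
                    for (int k = kk; k < n && k < kk + BLOCK; k++) {
                        double a = A[i*n + k];   /* reused across the j loop */
                        for (int j = jj; j < n && j < jj + BLOCK; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

The ikj inner ordering keeps the innermost loop streaming through contiguous rows of B and C, which matters as much as the blocking itself.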
Typing 'gmake' should create an executable 'matmul'. This program evaluates the matrix multiply routine provided in 'matmul.c'. The two column output gives the MFLOPS rating for various matrix sizes. Here is the output for the naive code:
  n    MFLOPS
 16    41.912558
 32    39.622733
 64    24.764380
128    24.231798
256     6.197091
 23    38.170980
 43    41.723790
 61    32.241619
 79    31.349301
 99    31.068687
119    30.207716
151    28.967136
30-40 MFLOPS on the UltraSPARCs is not too impressive. The processors have 167 MHz clocks and are capable of one FP add, one FP multiply, and one FP load/store per cycle; thus, the peak performance of the machines is 334 MFLOPS. This is not achievable, of course, but matrix multiply code exists that gets around 300 MFLOPS out of the Ultras! Being aware of caches goes a long way. If you want to do better, there are many other considerations. If you write C code, you have to write it so that the compiler can discover opportunities for optimization. You may also want to think about various sources of pipeline stall cycles. Documentation on the Sun compiler and the UltraSPARC-I may prove useful. The CPU Info Center is a generally useful source. You may also want to look at a similar assignment that was given last year and the year before; the page from two years ago has some additional links that may prove helpful. The following commands tell you cool and useful stuff about the machine you are on (paths given for the NOW):
What you need to do is replace the naive matrix multiply routine in matmul.c with your own routine. The interface is:
void matmul (int i_matdim, const double* pd_A, const double* pd_B, double* pd_C);
You can write your routine in C or assembly (or both), in any number of source files. Simply set the variable 'SRCS' in the Makefile equal to the list of source files. Do not include the driver program in this list.
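As a starting point, here is a straightforward routine matching the interface above. The row-major indexing (`pd_A[i*n + k]`) is an assumption; check driver.c for the storage order it actually uses before relying on it.

```c
/* Unoptimized C = C + A*B for i_matdim x i_matdim matrices, matching the
   assignment's interface. Assumes row-major storage (verify against
   driver.c). A baseline to improve on, not a contest entry. */
void matmul(int i_matdim, const double *pd_A, const double *pd_B,
            double *pd_C)
{
    int n = i_matdim;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = pd_A[i*n + k];        /* hoisted out of the j loop */
            for (int j = 0; j < n; j++)
                pd_C[i*n + j] += a * pd_B[k*n + j];
        }
}
```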
Though you must use the Sun C compiler, you may choose your compiler flags. A starting point is provided in the Makefile:
CFLAGS := -xO2 -dalign -xlibmil -native -xarch=v8plusa -xchip=ultra -xCC
CFLAGS := $(CFLAGS) -fsimple=0

See the Sun compiler documentation for a description of these flags. Running 'cc -flags' will give you a concise description of options. You are not permitted to use any compiler flags that relax IEEE floating-point compliance (e.g. -fns, -fsimple=2).
We will run your code on a dedicated UltraSPARC on matrices ranging in size from 16x16 to 256x256. Any part of 'driver.c', including the specific matrix sizes, is subject to change. The Makefile compiles the driver program separately, so you cannot perform any cross-source-file optimizations involving the driver. We will present the contest results in class. There will be first and second prizes...