The IBM RS6000/590 - architecture and algorithms (CS 267, Jan 24 1995)

Introduction

We will consider the details of the RS6000/590 (also called POWER2) architecture, and show how to exploit them in order to write an efficient matrix-vector multiplication algorithm. This will illustrate the issues you need to consider for Assignment 1, for which you will write a matrix-matrix multiplication algorithm.

More important than the particular details of the RS6000/590 is that it illustrates many of the issues we will encounter in other aspects of parallel computing. Indeed, the following issues come up over and over again. In order to write fast code on the RS6000/590, or any machine, one should exploit:

  • Parallelism - Do as many operations simultaneously as possible: The RS6000/590 is a superscalar processor, which means it can execute up to six instructions simultaneously, provided they do not conflict: 2 floating point instructions, 2 fixed point instructions, a condition register instruction (such as a comparison), and a branch.
  • Pipelining - Do long sequences of identical operations by processing them in a pipeline: On the RS6000/590, a single floating point instruction can do a multiply and an add in a pipeline of length 2, yielding a factor of 2 speedup if a long sequence of such operations is available.
  • Locality - Reuse data which is "close to" other data you are using: On the RS6000, when data is fetched from slow memory to fast memory (cache), it comes in 256 byte chunks called cache lines, and a good algorithm will use all this data at the same time, to minimize the number of times the cache line has to be fetched from slow memory.
  • Much of the detail of the following discussion is taken from "Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms", R. Agarwal, F. Gustavson, M. Zubair, IBM J. of R&D, v. 38, n. 5, Sept 94 (see reader). The whole issue of this journal discusses details of the RS6000 architecture. More detail is available in IBM's documentation for its numerical subroutine library ESSL and in on-line versions of the journal articles; there is also a great deal of material available on tuning code.

    The following table shows the steps one might go through in attempting to write a fast matrix multiplication routine for the RS6000. The first line is the clock rate, and the second is the peak machine speed. Since each floating point operation takes 1 cycle, the fact that 266 = 4 x 66.5 indicates that 4-fold floating-point parallelism is available in the machine. The other lines give the computation rate for 64-by-64 matrix multiply, computed using several algorithms and compiler options. Unblocked matmul refers to Algorithm 1 in Lecture 2 (a C sketch of it appears after the table), DMR is a public-domain routine optimized for the RS6000, and ESSL dgemm is IBM's proprietary optimized matrix multiply. f77 is the name of the Fortran compiler, and the strings following f77 are various compiler optimization flags. When one uses "-O3 -qhot" without the "-qstrict" option, the compiler prints a warning that it may change the semantics of the code in its pursuit of a fast version of the code; in this case the resulting code gave erroneous answers for n>64. There is another compiler option, not illustrated here, which would "pattern match" to discover that the code was really doing matrix multiplication, and replace everything by a call to ESSL dgemm. This table illustrates that one cannot yet expect compilers to take naive code and produce object code that fully exploits the speed of the machine, except for a very few special cases where the compiler can recognize an entire algorithm (such as matrix multiply) and replace the user's program with a call to a preexisting library subroutine.

              Clock rate:                       66.5 MHz
              Peak speed:                      266   Mflops
              Unblocked matmul, f77:             4.5 Mflops
              Unblocked matmul, f77 -O2:        65   Mflops
              Unblocked matmul, f77 -O3 -qhot:  92   Mflops
              DMR             , f77 -O2:       152   Mflops
              DMR             , f77 -O3 -qhot: 186   Mflops (buggy for n>64)
              ESSL dgemm      , f77 -O3 -qhot: 240   Mflops 
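
    For concreteness, the unblocked matmul measured above is presumably just the textbook triple loop; here is a C sketch of it (the function name, row-major storage, and loop order are illustrative assumptions, not taken from the lecture):

        #include <stddef.h>

        /* Presumed "unblocked matmul": C = C + A*B on n-by-n matrices stored
           row-major in one-dimensional arrays.  About 2*n^3 flops in total.  */
        void unblocked_matmul(size_t n, const double *A, const double *B, double *C)
        {
            for (size_t i = 0; i < n; i++)
                for (size_t j = 0; j < n; j++) {
                    double cij = C[i*n + j];
                    for (size_t k = 0; k < n; k++)
                        cij += A[i*n + k] * B[k*n + j];   /* one multiply-add */
                    C[i*n + j] = cij;
                }
        }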
    

    Now we begin discussing the RS6000/590 in more detail, and explain how one would write a matrix-vector multiply. More details may be found in the IBM document Algorithm and Architecture Aspects of Producing ESSL BLAS on POWER2, especially the section on BLAS-2 Implementation.

    The CPU consists of 8 chips: an FPU (Floating Point Unit), FXU (Fixed Point Unit), ICU (Instruction Cache Unit), SCU (Storage Control Unit), and 4 DCU (Data Cache Units).

    Floating Point Unit. The FPU has 32 64-bit floating point registers, and two independent MAF (Fused-multiply-add) units, which may operate independently and in parallel. The hardware automatically tries to schedule independent instructions to use the next MAF unit available. A fused-multiply-add instruction performs a=b*c+d in two cycles, doing b*c on the first cycle and (b*c)+d on the second. There are separate adders and multipliers, both of which may work simultaneously on different instructions. This is called pipelining with a 2-stage pipe (multiply and add). To illustrate, consider a sequence of operations ai=bi*ci+di, for i=1 to 4. This would be performed in 5 cycles as follows:

    Multiplication     b1*c1       b2*c2       b3*c3       b4*c4
    Addition                     (b1*c1)+d1  (b2*c2)+d2  (b3*c3)+d3  (b4*c4)+d4
    Cycle                1           2           3           4           5
    
    Thus, a 2-stage pipeline can perform n operations in n+1 steps. Since 1 "operation" in this case consists of 2 floating point operations, this means 2*n flops are performed in n+1 cycles. Without pipelining, it would take 2*n cycles to perform the 2*n flops, meaning that the speedup is 2*n/(n+1) or nearly 2 for large n.
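
    To see the kind of code that keeps this pipeline busy, here is a C sketch (illustrative, not from the lecture) of a dot product written with several independent accumulators, so that consecutive multiply-adds do not depend on one another and a new one can enter the pipe every cycle; the compiler is relied on to turn each update into a fused multiply-add:

        /* Illustrative: four independent chains of multiply-adds, so a new
           fused multiply-add can start every cycle instead of every other cycle. */
        double dot4(int n, const double *x, const double *y)
        {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            int i;
            for (i = 0; i + 3 < n; i += 4) {
                s0 += x[i]   * y[i];      /* the four updates below are       */
                s1 += x[i+1] * y[i+1];    /* independent of one another, so   */
                s2 += x[i+2] * y[i+2];    /* they can overlap in the pipeline */
                s3 += x[i+3] * y[i+3];
            }
            for (; i < n; i++)            /* leftover elements */
                s0 += x[i] * y[i];
            return (s0 + s1) + (s2 + s3);
        }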

    More generally, an s-stage pipeline can work on as many as s independent operations at a time, taking s+n-1 cycles to finish n operations:

    Stage 1   op1     op2     op3     ...                             opn
    Stage 2           op1     op2     ...                                     opn
      ...
    Stage s                                   op1     op2     ...                   opn
    Cycle      1       2       3      ...      s      s+1     ...      n      ...    s+n-1
    
    Without pipelining, the same sequence of operations would take s*n cycles, meaning that the speedup is s*n/(s+n-1) or nearly s when n is large. Pipelining is a very common way to get speed, because it does not require multiple copies of the hardware (in this case, one adder and one multiplier are all that is needed), and so is ubiquitous in the RS6000/590, as well as other machines ranging from PCs to supercomputers.

    We just said that when the number of operations n is large compared to the pipeline length s, the speedup s*n/(s+n-1) is nearly s. But for smaller n, the speedup drops. A common way to express this fact is with the quantity n_{1/2}, read "n-one-half", which is the minimum value of n for which the speedup is half its peak value s:

        s * n_{1/2} / (s + n_{1/2} - 1) = s/2      or    n_{1/2} = s-1
    
    The concept of "minimum problem size to reach half-peak performance" has become popular in other contexts not immediately associated with pipelines, for example the LINPACK Benchmark, where n_{1/2} is the smallest matrix size for which Gaussian elimination runs at half its peak performance.
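
    As a quick check of the arithmetic, the following C snippet (the names and the pipeline depth s=8 are made up for illustration; the 590's multiply-add pipe has s=2) evaluates the pipelined rate s*n/(s+n-1) and confirms that n = s-1 gives exactly half the peak speedup:

        #include <stdio.h>

        /* Speedup of an s-stage pipeline over unpipelined execution of n operations:
           n operations take s+n-1 cycles instead of s*n cycles.                     */
        static double speedup(int s, int n) { return (double)s * n / (s + n - 1); }

        int main(void)
        {
            int s = 8;                                      /* hypothetical pipeline depth */
            printf("n = 1:     %.3f\n", speedup(s, 1));     /* 1.000                       */
            printf("n = s-1:   %.3f\n", speedup(s, s - 1)); /* 4.000 = s/2, so n_1/2 = s-1 */
            printf("n = 1000:  %.3f\n", speedup(s, 1000));  /* 7.944, approaching s        */
            return 0;
        }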

    The fused-multiply-add instruction has one other interesting property: it is performed with one round-off error. In other words, in a=b*c+d, b*c is first computed exactly to quadruple (128 bit) precision, if b and c are double (64 bit) precision, then d is added, and the sum is rounded to give a. This extra precision is used by the RS6000 to implement division, which still takes about 19 times longer than either a multiply or an add. Square root takes 27 times longer. The fused multiply-add may also be used to simulate higher precision cheaply. All arithmetic conforms to the IEEE Floating Point Arithmetic standard, which we will discuss later (see the reader).
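
    To illustrate how the single rounding can be exploited to simulate higher precision, here is a standard trick (a C sketch, not code from the lecture or from ESSL): one fma() call recovers the exact rounding error of a product, so a pair of doubles can represent b*c to roughly twice the precision (barring overflow):

        #include <math.h>

        /* With a fused multiply-add, fma(b, c, -p) is computed with a single rounding,
           so if p = b*c rounded, then e = b*c - p exactly; the pair (p, e) holds the
           product b*c to roughly twice double precision.                              */
        void two_prod(double b, double c, double *p, double *e)
        {
            *p = b * c;            /* rounded product              */
            *e = fma(b, c, -*p);   /* exact error of that rounding */
        }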

    To summarize: to get speed out of the FPU, one's algorithm needs to execute a sequence of independent operations each of which is a multiply-add operation a=b*c+d.

    Data Cache Unit. Here are the vital statistics of the RS6000/590 cache: It is 256 KB in size, with 256 bytes (32 8-byte double-words) per cache line. It is 4-way set associative. Its write policy is write-back (also called store-back), and its replacement policy is LRU. It is dual-ported and nonblocking. A cache miss has a latency of 14-20 cycles, and it takes 8 cycles to fill a cache line. There is a quad load/store instruction.

    Let us explain what all this means. The cache, or fast memory, is 256 KB long. It is read/written to slow memory in 256 byte chunks called cache lines. A read to a location stored in cache retrieves the data in one cycle; otherwise there is a cache miss, and a delay of 14-20 cycles before the entire cache line containing the desired data starts arriving, at 32 bytes/cycle, from slow memory (another pipeline!). Therefore it is much faster if an algorithm processes all 256 consecutive bytes (32 consecutive double-words), since it pays the cost of moving them anyway, rather than, say, only using every 32nd double-word and ignoring the rest. This is called exploiting spatial locality, or using data located close together in memory.

    For example, consider the following two algorithms to add two 256-by-256 double precision matrices, both stored C-style in row-major order:

            Algorithm 1                                Algorithm 2
    
         for i = 1 to 256                           for j = 1 to 256
            for j = 1 to 256                           for i = 1 to 256
                A(i,j) = A(i,j) + B(i,j)                   A(i,j) = A(i,j) + B(i,j)
            end for                                    end for
         end for                                    end for
    
    A and B each occupy 8*256^2 = 2^19 bytes, whereas the cache is only 2^18 bytes long. Since the matrices are stored by rows, Algorithm 1 accesses consecutive memory elements in the innermost loop, whereas Algorithm 2 accesses memory elements 8*256 = 2^11 bytes apart, and so in different cache lines. Thus, there will probably be 2 cache misses on every iteration of the inner loop of Algorithm 2 (one for A and one for B), whereas there will only be 2 cache misses for every 32 values of j in the innermost loop of Algorithm 1. In other words, Algorithm 2 will probably have 32 times as many cache misses as Algorithm 1, and so be much slower.
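
    Here is a C rendering of the two loop orders (the 256-by-256 size and row-major storage follow the text; the function names are illustrative), which one could time to observe the effect of spatial locality:

        #define N 256

        /* Algorithm 1: the inner loop walks along a row, i.e. through consecutive
           memory locations, so each 256-byte cache line of A and B is used 32 times. */
        void add_by_rows(double A[N][N], double B[N][N])
        {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    A[i][j] += B[i][j];
        }

        /* Algorithm 2: the inner loop walks down a column, jumping 8*256 = 2048 bytes
           between consecutive accesses, so each access touches a different cache line. */
        void add_by_cols(double A[N][N], double B[N][N])
        {
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    A[i][j] += B[i][j];
        }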

    As long as there is space in the cache when there is a cache-miss, cache-lines are fetched and stored in cache. The cache is organized in 256 sets with 4 cache-lines per set. 8 consecutive bits of the memory address being fetched are used to pick which of the 256 sets a cache line is to occupy. Since there are 4 lines in a set (4-way set associativity) as many as 4 lines with the same 8 bits in their addresses can simultaneously be in cache.
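
    As a concrete illustration, and assuming the set index is taken from the 8 address bits just above the within-line offset (the exact bit positions are an assumption, not stated in the lecture), the mapping from an address to one of the 256 sets looks like this in C:

        #include <stdint.h>

        #define LINE_BYTES 256   /* 2^8 bytes per cache line    */
        #define NUM_SETS   256   /* 2^8 sets, each with 4 lines */

        /* Which of the 256 sets a byte address falls in, assuming the set index
           comes from the 8 address bits just above the within-line offset.     */
        static unsigned cache_set(uintptr_t addr)
        {
            return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
        }
        /* Addresses that differ by a multiple of LINE_BYTES*NUM_SETS = 64 KB map to
           the same set, so at most 4 such lines can be resident at once.            */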

    When a set fills up and a fifth cache line is fetched, one of the lines currently in the set must be selected to be "evicted", or removed from the cache. The one selected is the one Least Recently Used (LRU) - a counter is associated with each cache line to keep track of this - and it is then written back into slow memory. This is an illustration of why an algorithm should exhibit temporal locality, or reuse of data recently used: as long as a datum is used frequently enough, it is guaranteed to remain in cache via the LRU mechanism.

    When a word is written, only the cache is updated, not the slow memory. The modified cache line is marked as "dirty" by setting a bit. When a dirty cache line is evicted, it must be moved back to slow memory ("clean" cache lines can just be overwritten, which is faster). This is called a write-back policy, since slow memory is updated only when a dirty cache line is evicted, in contrast to a write-through policy, where slow memory is updated on every write (this is slower, since it accesses slow memory more often).
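
    The per-line bookkeeping implied by this description can be pictured with the following C declarations (a sketch of the logical state, not the hardware's actual layout):

        /* Logical per-line state implied by the description above (a sketch,
           not the hardware's actual layout):                                 */
        struct cache_line {
            unsigned long tag;      /* which block of slow memory this line holds    */
            unsigned      lru;      /* counter used to pick the Least Recently Used  */
            int           valid;    /* line currently holds data                     */
            int           dirty;    /* written since fetched; must be stored back    */
        };
        struct cache_set  { struct cache_line way[4]; };   /* 4-way set associative */
        struct data_cache { struct cache_set  set[256]; }; /* 256 sets = 256 KB     */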

    The cache is dual-ported and nonblocking. This means that if one cache read misses, and a 14-20 cycle delay ensues while a cache line is fetched from slow memory, then there is a second port on the cache which can continue functioning independently and keep the FPU supplied with data.

    Here is (slightly simplified) code for matrix-vector multiplication on the RS6000/590:

    T0 = Y(I) 	! load 24  y elements
    T1 = Y(I+1) 	! in 24 FPRs
            ...
    T23 = Y(I+23)
    DO J = J1, J1+JBLK-1
    	XJ = X(J)	! load an element of x
    	F0 = A(I, J)	! one quad load instruction
    	F1 = A(I+1,J)	! loads both F0 and F1
    	T0 = T0 + XJ*F0
    	T1 = T1 + XJ*F1
    	   ...
    	F0 = A(I+22,J)
    	F1 = A(I+23,J)
    	T22 = T22 + XJ*F0
    	T23 = T23 + XJ*F1
    ENDDO
    Y(I) = T0 	! store y elements
    Y(I+1) = T1 	! after the loop
         ...
    Y(I+23) = T23
    
    Let us explain what is going on:
  • A is stored in column major format (Fortran).
  • The operation is y = y + A*x, A m-by-n, broken up as operations on submatrices:
          y(i:i+23) = y(i:i+23) + A(i:i+23,j1:j1+jblk-1)*x(j1:j1+jblk-1)
    
  • There are also outer loops (not shown) over i = 1 to m step 24 and over j1 = 1 to n step JBLK; a sketch of this blocking appears after this list. JBLK is chosen so that a block of A fits in cache:
         24*JBLK + 25 <= 2^15, or JBLK <= 1364
    
  • T0=y(i) through T23=y(i+23) loads 24 FPRs with y; this reuses the cache line occupied by y(i). These remain in registers throughout the inner loop.
  • In the loop, XJ, F0 and F1 also occupy registers.
  • The J loop performs a saxpy:
          y(i:i+23) = y(i:i+23) + A(i:i+23,j)*x(j)
    
  • Two consecutive entries of A, say A(i,j) and A(i+1,j), are loaded at the same time with a quad load instruction. Only one of the two FXUs is needed to issue this load.
  • Two Tj's are updated at the same time with two fused-multiply-add instructions.
  • The other FXU can be used to prefetch a later column of A, by inserting
               d = A(i+23,j+2)
    
    in the inner loop to fetch column j+2 ahead of time, completely hiding the cache miss time.
  • One should choose i carefully so all subblocks begin on cache-line boundaries, to minimize cache misses.
  • Here is a slide of the performance of matrix-vector multiplication (also called DGEMV). ESSL refers to IBM's optimized routine, and LAPACK [sic] refers to the naive algorithm with 2 nested loops. Note the drop in performance when the dimension hits around 180, which is the largest matrix size that fits in the cache. The drop occurs because, once the matrix no longer fits in cache, each of its entries is fetched from slow memory and used only once, so blocking techniques cannot hide the slower speed of the slow memory, as discussed in Lecture 1.
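
    For reference, here is a C sketch of the blocking structure described in the list above: outer loops over 24-row blocks of y and JBLK-column blocks of x, with A stored column-major (Fortran order) in a one-dimensional array. The inner 24-fold register blocking is left to the compiler here rather than unrolled by hand, and the names and edge-case handling are illustrative, not the actual ESSL code:

        #include <stddef.h>

        /* Sketch of the blocked y = y + A*x described above.  A is m-by-n, stored
           column-major (Fortran order): entry A(i,j) is A[j*lda + i], 0-based.    */
        void dgemv_blocked(int m, int n, const double *A, int lda,
                           const double *x, double *y)
        {
            const int IB   = 24;    /* rows of y kept in registers in the real code */
            const int JBLK = 1364;  /* columns per block, so a 24-by-JBLK block of A
                                       (plus y and x) fits in the 256 KB cache      */

            for (int i = 0; i < m; i += IB) {
                int ib = (i + IB <= m) ? IB : m - i;            /* ragged edge block */
                for (int j1 = 0; j1 < n; j1 += JBLK) {
                    int jb = (j1 + JBLK <= n) ? JBLK : n - j1;
                    for (int j = j1; j < j1 + jb; j++) {        /* saxpy on ib rows  */
                        double xj = x[j];
                        for (int ii = 0; ii < ib; ii++)
                            y[i + ii] += xj * A[(size_t)j * lda + (size_t)(i + ii)];
                    }
                }
            }
        }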

    In contrast, consider the performance of matrix multiplication in the following graph. The performance stays pegged at close to the machine peak, even for matrices too large to fit in cache, because the blocking technique in Algorithm 3 of Lecture 1 can hide the slower speed of the slow memory.