*
cs267
Jan 23, 1996
*

### 2x2 Matrix Multiply

The objective is to write a superfast 2x2 matrix multiply for POWER2
that may be used in the inner loop of a general mm. (ESSL uses 2x2
blocking for POWER architecture, however larger blocks are used for
POWER2).
Suppose you have 4 local variables contained in FP registers and named
c11, c12, c21, c22. You also have two 'double*'s A and B, and you want
to do a matrix-matrix accumulate into the matrix defined by cij. Also,
Asep and Bsep are integer (in registers) that define the distance in
doubles between two rows (i.e., the col dim of A and B resp). Goal:
minimize the number of memory references, the number of additional FP
registers and try to express the computation using multiply-accumulates.

###
Solution:

#define mul_mdmd_md2x2(c11,c12,c21,c22,A,Asep,B,Bsep) \
{ \
const double *bp,*ap; \
double b1,b2; \
double a; \
\
bp = B; \
b1 = bp[0]; b2 = bp[1]; /*quad-word load*/ \
bp += Bsep; \
\
ap = A; \
a = ap[0]; ap += Asep; \
\
c11 += a*b1; /* c11 += a11*b11 */ /* fma */ \
c12 += a*b2; /* c12 += a11*b12 */ /* fma */ \
\
a = ap[0]; ap = &A[1]; \
\
c21 += a*b1; /* c21 += a21*b11 */ /* fma */ \
c22 += a*b2; /* c22 += a21*b12 */ /* fma */ \
\
b1 = bp[0]; b2 = bp[1]; /* quad-word load */ \
a = ap[0]; ap += Asep; \
\
c11 += a*b1; /* c11 += a12*b21 */ /* fma */ \
c12 += a*b2; /* c12 += a12*b22 */ /* fma */ \
\
a = ap[0]; \
\
c21 += a*b1; /* c21 += a22*b21 */ /* fma */ \
c22 += a*b2; /* c22 += a22*b22 */ /* fma */ \
}

###
Evaluation:

###
Further improvements: