cs267 Jan 23, 1996

### 2x2 Matrix Multiply

The objective is to write a very fast 2x2 matrix multiply for the POWER2 that can be used in the inner loop of a general matrix multiply. (ESSL uses 2x2 blocking for the POWER architecture; larger blocks are used for POWER2.)

Suppose you have 4 local variables held in FP registers, named c11, c12, c21, c22. You also have two 'double*'s, A and B, and you want to do a matrix-matrix multiply-accumulate into the 2x2 matrix defined by the cij. Asep and Bsep are integers (in registers) that give the distance in doubles between two consecutive rows (i.e., the column dimensions of A and B, respectively). Goal: minimize the number of memory references and the number of additional FP registers, and express the computation using multiply-accumulates.
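To make the target concrete, here is a minimal reference sketch of the same 2x2 accumulate written the obvious way. The function name `acc2x2_ref` and the array `c` standing in for the four register variables are assumptions made for illustration; they are not part of the notes. As written, each of the 8 multiply-adds references one element of A and one of B, i.e. 16 array references for only 8 distinct values (unless the compiler removes the repeats), which is the redundancy a hand-scheduled kernel tries to avoid.

```c
/* Reference 2x2 accumulate: C += A*B, with A and B stored row by row.
 * Asep and Bsep are the distances in doubles between consecutive rows.
 * c[0..3] stands in for c11, c12, c21, c22 (an assumption of this sketch). */
void acc2x2_ref(double c[4], const double *A, int Asep,
                const double *B, int Bsep)
{
    int i, j, k;
    for (i = 0; i < 2; i++)
        for (j = 0; j < 2; j++)
            for (k = 0; k < 2; k++)
                /* one element of A and one of B referenced per multiply-add */
                c[2*i + j] += A[i*Asep + k] * B[k*Bsep + j];
}
```

A version like this is mainly useful as a correctness check for a tuned kernel.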

### Solution:

```c
#define mul_mdmd_md2x2(c11,c12,c21,c22,A,Asep,B,Bsep)            \
{                                                                 \
const double *bp,*ap;                                             \
double b1,b2;                                                     \
double a;                                                         \
                                                                  \
bp = B;                                                           \
b1 = bp[0]; b2 = bp[1];  /* load b11, b12 */                      \
bp += Bsep;              /* bp -> second row of B */              \
                                                                  \
ap = A;                                                           \
a = ap[0]; ap += Asep;   /* load a11; ap -> a21 */                \
                                                                  \
c11 += a*b1; /* c11 += a11*b11 */  /* fma */                      \
c12 += a*b2; /* c12 += a11*b12 */  /* fma */                      \
                                                                  \
a = ap[0]; ap = &A[1];   /* load a21; ap -> a12 */                \
                                                                  \
c21 += a*b1; /* c21 += a21*b11 */  /* fma */                      \
c22 += a*b2; /* c22 += a21*b12 */  /* fma */                      \
                                                                  \
b1 = bp[0]; b2 = bp[1];  /* load b21, b22 */                      \
a = ap[0]; ap += Asep;   /* load a12; ap -> a22 */                \
                                                                  \
c11 += a*b1; /* c11 += a12*b21 */  /* fma */                      \
c12 += a*b2; /* c12 += a12*b22 */  /* fma */                      \
                                                                  \
a = ap[0];               /* load a22 */                           \
                                                                  \
c21 += a*b1; /* c21 += a22*b21 */  /* fma */                      \
c22 += a*b2; /* c22 += a22*b22 */  /* fma */                      \
}
```
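As a rough sketch of how a kernel of this kind might sit in the inner loop of a general matrix multiply, the code below writes the 2x2 accumulate inline, in the style of the macro above, inside a loop over 2x2 blocks. The function name `matmul_blocked2`, the row-major n x n layout, and the restriction to even n are assumptions made here for illustration; none of them come from the notes.

```c
/* C += A*B for n x n row-major matrices; n is assumed even (sketch only). */
void matmul_blocked2(int n, const double *A, const double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < n; i += 2) {
        for (j = 0; j < n; j += 2) {
            /* the 2x2 block of C is kept in locals across the whole k loop */
            double c11 = C[i*n + j],     c12 = C[i*n + j + 1];
            double c21 = C[(i+1)*n + j], c22 = C[(i+1)*n + j + 1];
            for (k = 0; k < n; k += 2) {
                const double *ap = &A[i*n + k];  /* 2x2 block of A */
                const double *bp = &B[k*n + j];  /* 2x2 block of B */
                double b1, b2, a;

                b1 = bp[0]; b2 = bp[1];          /* b11, b12 */
                a = ap[0];                       /* a11 */
                c11 += a*b1; c12 += a*b2;
                a = ap[n];                       /* a21 */
                c21 += a*b1; c22 += a*b2;

                b1 = bp[n]; b2 = bp[n + 1];      /* b21, b22 */
                a = ap[1];                       /* a12 */
                c11 += a*b1; c12 += a*b2;
                a = ap[n + 1];                   /* a22 */
                c21 += a*b1; c22 += a*b2;
            }
            /* write the finished block back once */
            C[i*n + j] = c11;     C[i*n + j + 1] = c12;
            C[(i+1)*n + j] = c21; C[(i+1)*n + j + 1] = c22;
        }
    }
}
```

Each pass through the k loop performs 8 loads and 8 multiply-adds, and the four cij values are read from and written to memory only once per block of C.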

### Evaluation:

• need only 3 additional FP registers