cs267 Jan 23, 1996

2x2 Matrix Multiply

The objective is to write a superfast 2x2 matrix multiply for POWER2 that may be used in the inner loop of a general mm. (ESSL uses 2x2 blocking for POWER architecture, however larger blocks are used for POWER2).

Suppose you have 4 local variables contained in FP registers and named c11, c12, c21, c22. You also have two 'double*'s A and B, and you want to do a matrix-matrix accumulate into the matrix defined by cij. Also, Asep and Bsep are integer (in registers) that define the distance in doubles between two rows (i.e., the col dim of A and B resp). Goal: minimize the number of memory references, the number of additional FP registers and try to express the computation using multiply-accumulates.

Solution:


#define mul_mdmd_md2x2(c11,c12,c21,c22,A,Asep,B,Bsep)         	  \
{								  \
   const double *bp,*ap; 					  \
   double b1,b2;						  \
   double a;							  \
								  \
   bp = B;							  \
   b1 = bp[0]; b2 = bp[1]; 		/*quad-word load*/        \
   bp += Bsep; 							  \
								  \
   ap = A;							  \
   a = ap[0]; ap += Asep;					  \
								  \
   c11 += a*b1; /* c11 += a11*b11 */	/* fma */		  \
   c12 += a*b2; /* c12 += a11*b12 */	/* fma */	       	  \
								  \
   a = ap[0]; ap = &A[1];				          \
								  \
   c21 += a*b1; /* c21 += a21*b11 */	/* fma */		  \
   c22 += a*b2; /* c22 += a21*b12 */	/* fma */		  \
								  \
   b1 = bp[0]; b2 = bp[1];		/* quad-word load */	  \
   a = ap[0]; ap += Asep;					  \
								  \
   c11 += a*b1; /* c11 += a12*b21 */	/* fma */		  \
   c12 += a*b2; /* c12 += a12*b22 */	/* fma */		  \
								  \
   a = ap[0];							  \
								  \
   c21 += a*b1; /* c21 += a22*b21 */	/* fma */	      	  \
   c22 += a*b2; /* c22 += a22*b22 */	/* fma */		  \
}								   



Evaluation:

  • need only 3 additional FP registers
  • only 6 memory references (saved two because of quad-word loads)
  • 8 multiply-accumulates - can be executed in 9 cycles.
  • Further improvements:

  • fairly easy to do prefetching for the following multiply-accumulate at the expense of extra FP registers
  • quad-word loads may not work properly if either or both A and B are odd double-word aligned. This should be fixed outside the inner loop.