Suppose you have 4 local variables contained in FP registers and named c11, c12, c21, c22. You also have two 'double*'s A and B, and you want to do a matrix-matrix accumulate into the matrix defined by cij. Also, Asep and Bsep are integer (in registers) that define the distance in doubles between two rows (i.e., the col dim of A and B resp). Goal: minimize the number of memory references, the number of additional FP registers and try to express the computation using multiply-accumulates.
#define mul_mdmd_md2x2(c11,c12,c21,c22,A,Asep,B,Bsep) \ { \ const double *bp,*ap; \ double b1,b2; \ double a; \ \ bp = B; \ b1 = bp[0]; b2 = bp[1]; /*quad-word load*/ \ bp += Bsep; \ \ ap = A; \ a = ap[0]; ap += Asep; \ \ c11 += a*b1; /* c11 += a11*b11 */ /* fma */ \ c12 += a*b2; /* c12 += a11*b12 */ /* fma */ \ \ a = ap[0]; ap = &A[1]; \ \ c21 += a*b1; /* c21 += a21*b11 */ /* fma */ \ c22 += a*b2; /* c22 += a21*b12 */ /* fma */ \ \ b1 = bp[0]; b2 = bp[1]; /* quad-word load */ \ a = ap[0]; ap += Asep; \ \ c11 += a*b1; /* c11 += a12*b21 */ /* fma */ \ c12 += a*b2; /* c12 += a12*b22 */ /* fma */ \ \ a = ap[0]; \ \ c21 += a*b1; /* c21 += a22*b21 */ /* fma */ \ c22 += a*b2; /* c22 += a22*b22 */ /* fma */ \ }