Here is the challenge and our proposed solution. Consider just the dot product routine dot, which computes
    r <- alpha * sum(i=1 to n) x(i)*y(i) + beta * r

There are 32 versions of this subroutine, depending on the mathematical type (real or complex) and precision (single or double) of the input and output arguments. For example, in c_sdot all the arguments are real and single, whereas in c_zdot_d_d, r, alpha, and beta are complex double, but x and y are real double.
Furthermore, each subroutine may contain 1, 2 or 3 implementations of the dot product, depending on the internal precision PREC requested by the user (a run-time case statement). For example, c_sdot only uses internal precision = single, but in c_sdot_x the user can specify single, double, or extended internal precision. (On an Intel machine, 80-bit extended would be a fourth possibility.)
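For concreteness, here is a rough C++ sketch of what a few of these interfaces might look like. The enum name and exact argument lists are my guesses for illustration, not settled bindings (complex scalars are passed as void*, following the usual C interface convention):

    /* hypothetical interface sketch; names and argument order are guesses */
    enum prec_t { PREC_SINGLE, PREC_DOUBLE, PREC_EXTRA };  /* internal PREC */

    /* c_sdot: everything real single, internal precision fixed at single */
    void c_sdot(int n, float alpha, const float *x, int incx,
                const float *y, int incy, float beta, float *r);

    /* c_zdot_d_d: r, alpha, beta complex double; x, y real double */
    void c_zdot_d_d(int n, const void *alpha, const double *x, int incx,
                    const double *y, int incy, const void *beta, void *r);

    /* c_sdot_x: as c_sdot, plus a run-time choice of internal precision */
    void c_sdot_x(int n, float alpha, const float *x, int incx,
                  const float *y, int incy, float beta, float *r,
                  enum prec_t prec);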
Our goal is to automate the construction and testing of this enormous range of possible algorithms.
To this end, we are using the M4 macro package to automatically generate all the subroutines and implementations. There are 3 levels of macros.
We also have macro code for symv, symmetric matrix-vector multiplication.

Here is the general scheme for testing the dot product. All flavors of matrix-vector multiplication and matrix-matrix multiplication will be tested the same way, based on our ability to test dot products.
The idea is to choose r(in), alpha, x(1:n), y(1:n), beta, r(out,true) and a scale factor s so that r(out,true) matches the true value r(true) = alpha*sum(i=1 to n) x(i)*y(i) + beta*r(in) to within roughly eps_doubledouble*s, while |r(true)|/s is as small as possible.
Here is the algorithm for choosing r(in), alpha, x(1:n), y(1:n), beta, r(out,true) and s:

1) Choose r(in), alpha, x(1:n), beta and y(1:n-k) randomly, or from the user
   (for some small k specified below)
2) for j = n-k+1 to n
   2.1) Compute tmp = alpha*sum(i=1 to j-1) x(i)*y(i) + beta*r(in) as
        accurately as possible (presumably in double-double)
   2.2) y(j) = -round(tmp/(alpha*x(j))), all in working precision
   endfor
3) tmp = alpha*sum(i=1 to n) x(i)*y(i) + beta*r(in) as accurately as possible
4) r(out,true) = round(tmp)
5) s = |alpha|*sum(i=1 to n) |x(i)*y(i)| + |beta*r(in)|
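Here is a minimal C++ sketch of this construction for the all-single case; long double merely stands in for the trusted double-double accumulation, all names are mine, and it assumes alpha and the trailing x(j) are nonzero:

    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // Steps 1-5 above, 0-indexed.  long double plays the role of
    // "as accurately as possible (presumably in double-double)".
    struct DotCase { float r_in, alpha, beta, r_out_true; long double s; };

    static float frand() { return 2.0f*std::rand()/(float)RAND_MAX - 1.0f; }

    DotCase make_dot_case(std::vector<float>& x, std::vector<float>& y, int k)
    {
        const int n = (int)x.size();
        DotCase c;
        c.r_in = frand(); c.alpha = frand(); c.beta = frand();
        for (int i = 0; i < n; ++i) x[i] = frand();            // step 1
        for (int i = 0; i < n - k; ++i) y[i] = frand();
        for (int j = n - k; j < n; ++j) {                      // step 2
            long double tmp = (long double)c.beta * c.r_in;    // step 2.1
            for (int i = 0; i < j; ++i)
                tmp += (long double)c.alpha * x[i] * y[i];
            y[j] = (float)(-tmp / ((long double)c.alpha * x[j]));  // step 2.2
        }
        long double tmp = (long double)c.beta * c.r_in;        // step 3
        long double s   = std::fabs((long double)c.beta * c.r_in);
        for (int i = 0; i < n; ++i) {
            tmp += (long double)c.alpha * x[i] * y[i];
            s   += std::fabs((long double)c.alpha * x[i] * y[i]); // step 5
        }
        c.r_out_true = (float)tmp;                             // step 4
        c.s = s;
        return c;
    }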
Note that the loop tries to choose y(n-k+1) through y(n) in order to create as much cancellation as possible, and make r(out,true) as close to zero as possible. Here is why. The usual error analysis of dot products says that a correctly computed fl(r(out)) should satisfy
    | fl(r(out)) - r(true) | <= O(eps_output)*|r(true)| + O(eps_internal)*s

where r(true) is the mathematically correct answer. Since we only know r(out,true), where
    | r(out,true) - r(true) | <= O(eps_output)*|r(true)| + O(eps_doubledouble)*s

we get
    | fl(r(out)) - r(out,true) | <= O(eps_output)*|r(true)| +
                                    O(eps_internal+eps_doubledouble)*s

or

    | fl(r(out)) - r(out,true) |/s <= O(eps_output)*|r(true)|/s +
                                      O(eps_internal+eps_doubledouble)
For this last quantity to be a good estimate of eps_internal, we need |r(true)|/s to be as small as possible (at most eps_doubledouble/eps_output). This is what the loop above tries to achieve.
How big should k be, i.e. how many cancellations do we need? It depends on the relative sizes of eps_output and eps_doubledouble. Basically,

    k = ceiling( log(eps_doubledouble)/log(eps_output) )

which is as large as ceiling(106/24) = 5 for eps_output = single, since double-double carries 106 significand bits and single carries 24. This means that we cannot so simply test dot products with single output and extra internal precision when n < 6 (more likely 7 or 8).
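As a quick check of the formula (the significand widths 24, 53, and 106 are the standard single, double, and double-double values; the helper function is mine):

    #include <cmath>
    #include <cstdio>

    // eps = 2^(-t) for a t-bit significand, so
    // log(eps_doubledouble)/log(eps_output) = t_dd / t_out.
    int cancellations(int t_out, int t_dd) {
        return (int)std::ceil((double)t_dd / (double)t_out);
    }

    int main() {
        std::printf("single output: k = %d\n", cancellations(24, 106)); // 5
        std::printf("double output: k = %d\n", cancellations(53, 106)); // 2
        return 0;
    }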
Notice that we essentially need a "trusted" highly accurate dot product in order to build our other test code. This trusted implementation will need separate unit testing outside this scheme.
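One standard way to build such a trusted routine is compensated arithmetic: Knuth's two-sum plus an FMA-based exact product. A sketch of the idea (my code, not the reference implementation):

    #include <cmath>

    // two_sum: s + e == a + b exactly (Knuth).
    static void two_sum(double a, double b, double &s, double &e) {
        s = a + b;
        double bb = s - a;
        e = (a - (s - bb)) + (b - bb);
    }

    // two_prod: p + e == a*b exactly, via fused multiply-add.
    static void two_prod(double a, double b, double &p, double &e) {
        p = a * b;
        e = std::fma(a, b, -p);
    }

    // Compensated dot product: the result behaves roughly as if the
    // sum had been carried in twice the working precision.
    double trusted_ddot(int n, const double *x, const double *y) {
        double hi = 0.0, lo = 0.0;
        for (int i = 0; i < n; ++i) {
            double p, pe, s, se;
            two_prod(x[i], y[i], p, pe);
            two_sum(hi, p, s, se);
            hi = s;
            lo += se + pe;   // fold the rounding leftovers into a tail
        }
        return hi + lo;
    }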
As long as the F95 code eventually calls the F77 code, that would help here too.
Let me note that conventional operator overloading, where the implementation of each binary operation depends only on the argument types and not the result type, is not enough for our purposes. It cannot implement extra = double*double correctly, for example: the double*double product is rounded to working precision before the assignment to the extra-precision result ever happens. The alternative would be to promote one argument to the output type, but this is unnecessarily inefficient (extra = extra*double is significantly more expensive than extra = double*double). This may limit the utility of this feature in F95 and C++.
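In C++ terms, with a hypothetical double-double type of mine:

    struct dd { double hi, lo; };  // hypothetical double-double type

    dd operator*(const dd &a, double b);  // extra = extra*double: legal,
                                          // but costs several flops

    // No overload is dispatched on the *result* type, so in
    //     dd r = x * y;   // x and y plain double
    // x*y is the built-in double product: it is rounded to 53 bits
    // before r is even constructed, and the low-order bits are gone.
    // A named function, e.g.
    //     dd exact_mul(double a, double b);
    // is needed instead, defeating the notational point of overloading.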
I can't tell from my brief reading of his web page how he deals with many flavors of mixed precision, possible extended precision, or testing it all. But I have not had time to read all the material.
From the point of view of the BLAST Forum, where we voted not to have a C++ interface, I wonder whether we could then go ahead and use C++ for our reference implementation. Presumably we would need "wrappers", but I'm not sure how we would construct them, or whether they would take much less effort than what we have been trying to do at Berkeley.
I look forward to a discussion of these issues at the BLAST meeting in March 99.
I hope that we can come to a consensus approach to producing a reference implementation at the next BLAST meeting. If there turns out to be consensus around the approach presented here, then it would be natural to divide the work by subroutine, using our DOT implementation as a model. There is clearly a lot to do.
There is the possibility of an F77 version, as described above.
It might be possible to use this approach, with the same code templates, for the interval arithmetic versions of the code in Chapter 5, but I have not pursued this.
The sparse routines in Chapter 3 seem sufficiently different in scope that I do not see any big advantage of this approach there. If they go for extended precision versions some day, then maybe.
Everything we have discussed so far applies to all the BLAS that "look like dot products", i.e. matrix-vector multiply, matrix-matrix multiply, etc. But it does not apply so simply to solving triangular systems with extended internal precision.
First, solving triangular systems with extended internal precision requires mallocing an array of extra-precise temporary variables, since we need to keep the solution in extended precision throughout the substitution, rounding only at the end. (Blocked implementations of matrix-matrix multiplication will require this too, but not the reference implementation.) This is not hard, but it is different, and mallocing can be slow. Perhaps we should not care about this in a reference implementation. (Note that using all stride-1 access in GEMV would also require mallocing a temporary array of extra-precision variables, which we have chosen not to do in our current efforts.)
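Concretely, a forward substitution with extra-precise temporaries might look like the following sketch, with long double again standing in for double-double; the malloc'ed workspace t is the point:

    #include <cstdlib>

    // Solve L*x = b, L lower triangular, column-major: element (i,j)
    // lives at L[i + j*n].  The running solution stays extra precise
    // until the single rounding at the end.
    void trsv_extra(int n, const double *L, const double *b, double *x) {
        long double *t = (long double *)std::malloc(n * sizeof(long double));
        for (int i = 0; i < n; ++i) {
            long double s = b[i];
            for (int j = 0; j < i; ++j)
                s -= (long double)L[i + j*n] * t[j];  // t[j] still extra precise
            t[i] = s / (long double)L[i + i*n];
        }
        for (int i = 0; i < n; ++i)
            x[i] = (double)t[i];  // round to working precision once
        std::free(t);
    }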
Second, we have not figured out how to test accuracy of triangular solves as straightforwardly as for dot products, where we can easily construct problems with answers known ahead of time to extended precision high accuracy. We are still thinking about this.