Date: Thu, 31 May 2001 21:23:07 -0700 (PDT)
From: "W. Kahan" <wkahan@EECS.Berkeley.EDU>
To: gls@labean.East.Sun.COM
Subject: Your 72 slides
Cc: wkahan@EECS.Berkeley.EDU, joe.darcy@eng.sun.com

Guy:

Thanks for sending your slides. Do you have enough time allotted to your presentation to cover them all? Even if you rushed through them as fast as they could be read, not allowing any time for passage through the viscosity of the human mind, they would take at least about an hour.

I agree with some of what you propose and disagree with other parts, as might be expected. I agree with pages 1 and 2, and with the intent behind page 3 even if it is slightly naive, and with the first bullet on page 5. Then we diverge at the word "portable", to which you assign a meaning that goes well beyond what the word meant for at least 30 years, from roughly 1960 to 1990.

We agree that Java should not support traps, as they are construed by hardware implementors, though we have different reasons. Certainly, applications programmers should not be expected to write their own trap handlers to cope with Over/Underflows, Divisions-by-Zero or Invalid Operations. But a cure you advocate turns out to be worse than the disease; the OV and UN that Fraley and Walther proposed turned out to burden, not ease, the lives of applications programmers. Wm. J. Cody was the point man on that issue twenty years ago, and came down strongly against their inclusion in the proposed IEEE standard. Only because we needed F & W's votes did we "Compromise" by introducing the Signalling NaNs to accommodate them; you can see how little good that did.

Your example starting on p. 31 is a Red Herring. It can be handled easily on machines that tell for each floating-point instruction which status bits to update in the event of a floating-point exception; this is done on Itanium to facilitate speculative execution down both paths of branches.
But that capability is irrelevant to your example for this reason: Knowing WHICH of x and y overflowed is almost never nearly as important as knowing that at least one of them did, after which the cost of recomputing both, if need be, will usually be inconsequential compared with the cost of coping with the overflow. Flag tests are sparse in most applications programs, hardly ever to be found in tight loops.

Exceptional treatment of exceptions is appropriate for the Math library of elementary transcendental functions ONLY if these run very fast, which is another interesting question if Java continues to try to legislate exact reproducibility of all floating-point computations.

Please understand that I do not scoff at the need for SOME kinds of floating-point computations to be reproducible exactly. We do differ on whether that much reproducibility can be enforced by the language alone, and therefore whether it should be invoked as an excuse for canonizing a floating-point architecture (SPARC's) disadvantageous for the overwhelming majority of programmers.

In the absence of a Pope of Computation, there is no way to ensure exact reproducibility of approximated results no matter how hard mere language designers and implementors try to enforce it. Instead I believe that exact reproducibility should be the just reward for a programmer who demands it explicitly and pays the price in both a disciplined restriction to (sometimes cumbersome) programming locutions that guarantee reproducibility, and a consequent loss of execution-time speed. And that demand makes sense for Binary far less than for Decimal floating-point, which is what Java should have specified at the outset if exact reproducibility were really intended by its designers.

You ask on p. 58 why "IEEE 754 supports wrapped exponents _only_ through traps". The "_only_" is gratuitous, as becomes clear on p. 59 where you mention scalb and logb.
The traps specified by IEEE 754 serve _only_ as clues for hardware designers that their traps need not be precise (provided they are restartable), and need supply only certain minimum information (wrapped result in a known destination for Over/Underflow, operand values and intended destination for Invalid Operation, etc.) to what would (we vainly hoped) turn out to be a limited menu of useful options along the lines of my notes you have cited.

Incidentally, your comment on page 62 that my "original example does not address the question of overflow in the sums" is mistaken. The counting mode as described in my notes and actually implemented on the IBM 7090/7094, and used very successfully to compute long products/quotients for physicists and chemists in the mid 1960s, counts up or down correctly regardless of whether over/underflow occurs in a sum or a subsequent product or even quotient. The counting mode was used also to speed up the comparison of complex absolute values; sqrt(x**2 + y**2) vs. sqrt(a**2 + b**2) was turned into (x-a)*(x+a) vs. (b-y)*(b+y) after sorting to make x = |x| >= y = |y| and a = |a| >= b = |b|, and also counting any over/underflows. I know that the counting mode has only very few uses, but can you think of a better way to do what it does?

I think the suggestion that the significand's last bit be used as an Inexact flag betrays a profound misunderstanding of its uses: "An inexact value represents a result that fell somewhere in the interval between adjacent exact values" is mathematically true and at the same time false for human purposes. Likewise, you have not thought through the reasons for identifying NaNs with their origins; the _place_ in the program where the NaN was generated is what we need much more than the op-code.

I agree with your suggestions about max and min on p. 52, and with the usefulness of fast tests of flags and of operands' categories. For the latter, perhaps reprogrammable PLAs would be helpful.
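The counting mode described above can be sketched in present-day Java (an illustration only, using java.lang.Math.scalb and a software counter in place of the 7094's hardware one): the running product is renormalized whenever it nears the overflow or underflow thresholds, and the count records how much scaling was done.

```java
class KountedProduct {
    // Software stand-in for the 7090/7094 counting mode: compute a long
    // product whose naive value would over/underflow, keeping the running
    // product in range and counting the powers of 2^512 scaled out of it.
    static int count;

    static double product(double[] factors) {
        count = 0;
        double p = 1.0;
        for (double f : factors) {
            p *= f;
            // Renormalize: the true product so far is p * 2^(512*count).
            while (!Double.isInfinite(p) && Math.abs(p) >= 0x1p512) {
                p = Math.scalb(p, -512);
                count++;
            }
            while (p != 0.0 && Math.abs(p) < 0x1p-512) {
                p = Math.scalb(p, 512);
                count--;
            }
        }
        return p; // true product = p * 2^(512*count)
    }
}
```

For 800 factors of 2, for example, this returns 2^288 with count = 1, representing 2^800 without ever overflowing.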
Rounding modes are NOT just special trap handlers. DEC's Alpha almost got them right when some were allowed into the OP-code, but put the wrong ones there. (p. 65)

Trap handlers should certainly be subroutines supplied along with the run-time library of Math functions and Decimal-Binary conversions. There are so few things that can be done usefully after floating-point exceptions, they might as well be put into a standardized menu. (p. 68)

The _Modes_ requested by a program to determine how exceptions shall be handled, and how rounding will be directed, are values of variables whose scopes are the proper responsibility of language designers. We agree that they are not best regarded as "dedicated registers", and that "Side effects on global state are a bad idea ...". (pp. 69 - 70)

"Borneo may be on the right track for Java" warms my heart.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

In any event, I hope you will take pity on your audience by showing them only a selection from the pages you have sent me. A memorable treatment of a few issues is more humane than an attempt to cover all of them in one hour.

With warmest regards,

W. Kahan <wkahan@cs.berkeley.edu> (510) 642-5638

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Date: Wed, 6 Jun 2001 18:06:21 -0400 (EDT)
From: Guy Steele - Sun Microsystems Labs <gls@labean.East.Sun.COM>
Reply-To: Guy Steele - Sun Microsystems Labs <gls@labean.East.Sun.COM>
Subject: Thanks for the comments
To: wkahan@EECS.Berkeley.EDU
Cc: Guy.Steele@east.sun.com, joe.darcy@eng.sun.com

Thank you so much for your prompt and extensive feedback on my slides---much more than I had expected. I was able to read them late Saturday evening and thereby improve my talk to some extent.

> Guy: Thanks for sending your slides.
> Do you have enough time allotted to your presentation to cover them all? Even if you rushed through them as fast as they could be read, not allowing any time for passage through the viscosity of the human mind, they would take at least about an hour.

Yes, I had an hour, and I realized that I was pushing it on the material. But I made printed handouts---with thumbnails at four per page, 72 slides fit on nine sheets of paper, double-sided, which is not too bad. That allowed me to comment on the slides rather than read them, knowing that listeners could return to puzzling points later if necessary. On a slide such as #39 or #42, I merely noted that there were lots of special cases and read one or two as examples, then went on. I ended up skipping slides #59 through #63, stopping only to note that they were slightly esoteric tools that allow one to manipulate exponent and significand separately, and to request that everyone scribble out the footnote on slide #62 because it was based on my misreading of your notes (as you pointed out).

> I agree with some of what you propose and disagree with other parts, as might be expected. I agree with pages 1 and 2, and with the intent behind page 3 even if it is slightly naive,

Naive in the sense that it slavishly mirrors the existing Java "import" mechanism? It is thereby understood that one can import a single explicitly named static member, such as by

    import static java.lang.Math.sqrt;

just as one can import a single explicitly named class; but there is no facility for renaming an entity as it is imported, for example. Or do you view it as naive in some additional sense?

> and with the first bullet on page 5. Then we diverge at the word "portable", to which you assign a meaning that goes well beyond what the word meant for at least 30 years, from roughly 1960 to 1990.

Yes, so I have in the past; but for this talk, at this slide, I discussed this precise variance in opinion and meaning concerning this term.
Briefly put, I noted that portability down to the last bit was explicitly envisioned by Coonen, at least (slide #6), as one alternative; noted that Java was the first widely-used language to bring that goal to fruition by design (and by fiat); then pointed out that a capable programmer could easily write code that would behave in an acceptably portable and desirably speedy fashion whether the precision was double or extended; then asked the question of what fraction of the original target audience for Java would likely be capable in this area, and noted that the target audience has shifted in the last five years.

I still defend the original decision to require bit-for-bit portability in an attempt to protect the originally envisioned target audience from falling into certain traps, particularly in the case of code debugged on an Intel machine and later downloaded to less capable machines. It makes no more sense to criticize the original design of Java for choosing not to support numerical programming than to criticize Fortran 2000 for failing to support network sockets and HTML. (Which reminds me: they both do a poor job of supporting linked lists and functional arguments---why can't they both be more like Lisp??)

Nevertheless, those interested in numerical applications seem to like what they see in Java. You need to understand that many in the original Java community regard the "Fortran programmers" as party crashers. I am not one; I think there is value in extending Java to support the needs of many more communities. But it must be done in a thoughtful way, balancing many interests. Moreover, it is much easier to marshal support and resources for a language facility if it has broad applicability. Thus lightweight classes and operator overloading, considered as general facilities, are much easier to sell the Java community than something as specific as complex and imaginary numbers.
> We agree that Java should not support traps, as they are construed by hardware implementors, though we have different reasons. Certainly, applications programmers should not be expected to write their own trap handlers to cope with Over/Underflows, Divisions-by-Zero or Invalid Operations. But a cure you advocate turns out to be worse than the disease; the OV and UN that Fraley and Walther proposed turned out to burden, not ease, the lives of applications programmers. Wm. J. Cody was the point man on that issue twenty years ago, and came down strongly against their inclusion in the proposed IEEE standard. Only because we needed F & W's votes did we "Compromise" by introducing the Signalling NaNs to accommodate them; you can see how little good that did.

Yes. I did not have space on the overheads or time in the talk to do a detailed comparison to contrast my proposal with that of Fraley and Walther. I had to acknowledge their priority in suggesting OV and UN symbols, but there are notable differences. F & W seemed to want a mode that eliminates all "risk", especially by requiring that adding UN to an ordinary value must produce ERR rather than letting the UN quietly disappear. Such a definition could indeed cause ERR symbols to proliferate unnecessarily---though F & W are indeed correct that letting UN disappear when added to a normal is risky if the programmer has not thought out the consequences.

I am not proposing such a stringent treatment, however. In my proposal, OV behaves very much as infinity does now, and OV would tend to be propagated in exactly those situations where infinity is propagated now. Its only purpose is to distinguish, so to speak, between an "exact" infinity arising from division by zero and an "inexact" infinity arising from overflow or the reciprocation of an underflow. Similarly, UN distinguishes between an "exact" zero, produced by subtraction of equals or multiplication by an exact zero, and a very tiny but nonzero quantity.
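The proposed behavior of OV and UN can be made concrete with a toy model (my construction, not part of the slides or any standard; signs and rounding are ignored): values are tagged ORDINARY, OV, or UN; OV propagates the way infinity does; UN times any finite value is UN; and UN added to an ordinary value quietly disappears.

```java
enum Tag { ORDINARY, OV, UN }

// Toy model of the proposed OV/UN symbols (illustration only).
class Tagged {
    final Tag tag;
    final double value; // meaningful only when tag == ORDINARY

    private Tagged(Tag t, double v) { tag = t; value = v; }

    static final Tagged OV = new Tagged(Tag.OV, Double.POSITIVE_INFINITY);
    static final Tagged UN = new Tagged(Tag.UN, 0.0);
    static Tagged of(double v) { return new Tagged(Tag.ORDINARY, v); }

    Tagged add(Tagged that) {
        if (tag == Tag.OV || that.tag == Tag.OV) return OV; // OV propagates like infinity
        if (tag == Tag.UN) return that;  // UN + x: the UN quietly disappears
        if (that.tag == Tag.UN) return this;
        return of(value + that.value);
    }

    Tagged mul(Tagged that) {
        // (Simplified: a fuller model would make OV * UN an invalid result.)
        if (tag == Tag.OV || that.tag == Tag.OV) return OV;
        if (tag == Tag.UN || that.tag == Tag.UN) return UN; // UN times any finite is UN
        return of(value * that.value);
    }
}
```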
I address the value of this distinction below in response to your example about the comparison of the norms of complex numbers.

I considered an even more elaborate alternative, in which one has two UN values: one ("UN!") that guarantees that the result is tiny and another ("UN?") that merely indicates that some contributing computation was tiny but the result may not be. Thus the single precision product (2^-65)(2^-65) would produce "UN!", but multiplying that result by (2^60) would produce "UN?". On the other hand, multiplying "UN!" by a value not greater than 1 can correctly produce "UN!". Then adding "UN!" to an ordinary number quietly disappears (when rounding to nearest) but adding "UN?" produces "UN?", thus behaving rather like the ERR symbol of F & W. However, I am not certain that this additional complication is useful in practice; it may be subject to the same objections that Cody had years ago.

In the actual proposal I put forward, UN times any finite value produces UN. This is exactly analogous to the behavior under IEEE 754 where drastic underflow produces a zero, and multiplying that zero by any finite value produces zero. The only difference is that the fact of underflow has been encoded in the value. And adding this value to any nonzero number causes the underflow indication to quietly disappear; so in this respect it is a little different from the IEEE 754 underflow flag.

> Your example starting on p. 31 is a Red Herring. It can be handled easily on machines that tell for each floating-point instruction which status bits to update in the event of a floating-point exception; this is done on Itanium to facilitate speculative execution down both paths of branches.

I allude to this in the rotated comment about the IA-64 at the right-hand edge of slide #35.
But, far from being a Red Herring, it precisely supports my point: there is value in not limiting the recording of flag status information to a single global status register, but instead providing a way to associate flag status information with results of computations. The IA-64/Itanium approach is a middle ground that requires a compiler (or assembly language programmer) to dynamically associate multiple flag status registers with computations and their results, rather than (a) having a single status register, or (b) having a separate status register associated with each and every floating-point register. This falls under the category of "additional explicit destination register" for state in the list on slide #30. It is a legitimate and very clever approach to getting some of the advantages of associating status flag information with computations rather than with intervals of time; it is completely compatible with IEEE 754, but it makes things rather more complicated for the compiler. The approach I want to explore is somewhat incompatible with IEEE 754, but makes things much easier on the compiler, among other advantages.

> But that capability is irrelevant to your example for this reason: Knowing WHICH of x and y overflowed is almost never nearly as important as knowing that at least one of them did,

Which is why I labeled the example "artificial" on slide #31.

> after which the cost of recomputing both, if need be, will usually be inconsequential compared with the cost of coping with the overflow. Flag tests are sparse in most applications programs, hardly ever to be found in tight loops.

I conjecture that one reason for this, at least, is that flag testing is too expensive on today's hardware. I believe it would be more useful, and more used, if flag testing were cheap and easy to express.
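Java exposes no IEEE status flags at all, so as a stand-in for a cheap flag test, the sketch below (my construction) tests the accumulating value itself with Double.isInfinite inside a tight loop: if overflow is caught, the computation is rescaled and redone rather than trapped.

```java
class SumOfSquares {
    // Sum of squares with an explicit per-iteration overflow test, in the
    // spirit of testing a flag in a tight loop: on overflow, shrink the
    // scale factor and restart the loop.
    static double sumsq(double[] x) {
        double scale = 1.0;
        retry:
        while (true) {
            double s = 0.0;
            for (double v : x) {
                double t = v * scale;
                s += t * t;
                if (Double.isInfinite(s)) {          // the "flag test"
                    scale = Math.scalb(scale, -600); // rescale and redo
                    continue retry;
                }
            }
            return s; // the true sum of squares is s / scale^2
        }
    }
}
```

With inputs near 1e300 the first pass overflows immediately and the rescaled pass returns a finite surrogate for the (unrepresentable) true sum.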
> Exceptional treatment of exceptions is appropriate for the Math library of elementary transcendental functions ONLY if these run very fast, which is another interesting question if Java continues to try to legislate exact reproducibility of all floating-point computations.

One of my goals is to make the use of flags much faster by encouraging hardware designers to implement fast and convenient test instructions. Another goal is to make coping with overflow and underflow significantly faster by providing better facilities for scaling computations. (See below.)

As for this example, I suggested that the tight loop shown on slide #37 is more plausible. This does indeed have flag testing in a tight loop. If the statistics of the program are that N is 10,000 and overflow occurs about once in a million iterations, then this approach is wrong and it would be better to test flags after the entire loop is done, or to use a trap. But if N is 1,000,000 and overflow occurs about once in a million iterations, then it is more likely than not that the exception will be encountered, and it may be better to catch it early than to have to redo the entire loop; therefore either a trap or an explicit test on each iteration is called for. In some architectures the trap is better; I am out to show that under circumstances that are plausible, even likely, on today's architectures, the use of an explicit test may be more efficient. (Recall that reacting to a trap may require tens, hundreds, or even thousands of clock cycles.)

> Please understand that I do not scoff at the need for SOME kinds of floating-point computations to be reproducible exactly. We do differ on whether that much reproducibility can be enforced by the language alone, and therefore whether it should be invoked as an excuse for canonizing a floating-point architecture (SPARC's) disadvantageous for the overwhelming majority of programmers.
The overwhelming majority of programmers are busy programming point-and-click user interfaces, e-commerce, and video games. Most Java programmers don't need floating-point computations; they're trying to capture credit-card numbers and email addresses.

Of course such reproducibility can be enforced by a language. Java (before the Great "strictfp" Compromise) did a pretty good job of that, except to the extent that implementors violated the implementation specification---and I suspect that they did this not because they were expert in floating-point arithmetic and had the same principled reasons that you do for disobeying or arguing against the specification, but because they were actually quite naive about the issues and simply assumed that using the native instructions of their chosen target machine, as they have in the past for other, less rigorously specified languages, would "do the right thing" (or close enough). Then again, reproducibility can't be enforced by a language, because people can always choose to use another language.

The question is, should Java be extended in such a way to be more useful to other user communities, in particular those interested in numerical programming? I have consistently argued in favor of this general principle, though I have often argued against specific proposals for doing so because part of my job is to balance the needs of multiple user communities. I agree, for example, that Java might be somewhat easier to use and less error-prone for numerical purposes if it were to use the "widest-precision-anywhere-in-an-expression" rule. But that rule violates what I regard as a much more fundamental principle of language design, which is that you can always break off a subcomputation and give it a name, then refer to it by that name, without in any way disturbing the behavior of the program.
The ability to name things is one of our most important tools in taming complexity; I must balk at any attempt to undermine naming mechanisms and their expected properties. Now, these two goals are not necessarily mutually incompatible under all circumstances, but preserving them both requires great care and finesse, and may place a burden on the application programmer---and, what is more important, the maintenance programmer---in some languages.

Which brings me to an important meta-principle. One of the goals of a good language design is not only to make it easy to express a program in the first place, but to make it easy to express that program in such a way that perturbations to the program can be made easily and reliably. The careful programmer often must think about not just a single point in the space of programs, but a cluster of design points that might be of interest in the future, and express his program in such a way that these alternate design points may be reached through small changes in the program text. Naming mechanisms are one tool for achieving this goal (subroutines being a special case of such naming mechanisms).

> In the absence of a Pope of Computation, there is no way to ensure exact reproducibility of approximated results no matter how hard mere language designers and implementors try to enforce it. Instead I believe that exact reproducibility should be the just reward for a programmer who demands it explicitly and pays the price in both a disciplined restriction to (sometimes cumbersome) programming locutions that guarantee reproducibility, and a consequent loss of execution-time speed. And that demand makes sense for Binary far less than for Decimal floating-point, which is what Java should have specified at the outset if exact reproducibility were really intended by its designers.

You seem to have paraphrased Tom Lehrer: "Speed is the important thing, rather than getting the right answer."
I believe that exact reproducibility (perhaps at the price of speed) should be the reward for a programmer who has said nothing special and may be unaware of the persnickety ways in which computers may vary from model to model. You yourself have exhibited many examples of computations that work as desired at one precision and fail at another precision---sometimes smaller and sometimes larger. It is very important to me that once a numerically naive programmer has verified correct operation of his program on one implementation---whether the program manages to exhibit that desired behavior through use, abuse, or nonuse of floating-point arithmetic---he has an extremely good chance of that program operating as desired and tested on millions of other machines over which he has no control other than the expectation that it conforms to the Java Language Specification.

Speed should be the just reward for a programmer who demands it explicitly (thereby indicating that he claims the competence and takes the responsibility for deciding whether such variations are tolerable for the purpose at hand) and pays the price in a disciplined restriction to programming locutions that produce behavior that always lies within an acceptable range of behaviors despite the variations among the supporting implementations, or in a disciplined restriction of the use of the program to specific implementations that are adequate to support it.

Therefore the introduction of a keyword ("strictfp") to allow the programmer to indicate whether certain variations in floating-point behavior are tolerable for his purposes is a perfectly good idea, and additional such indications may be desirable in the future. But making the default behavior, that which occurs if the programmer says nothing, to allow variation from machine to machine, was an absolutely terrible decision from a language designer's point of view.
This decision was dictated by even larger interests, which were political and economic rather than technical. Would it be fair to say that you are trying to promote the interests of numerically competent programmers, for whom "speed over reproducibility" is a reasonable default because of their assumed competence, whereas I am trying to promote the interests of a much larger and more diverse pool of programmers, whose fields of competence vary widely?

> You ask on p. 58 why "IEEE 754 supports wrapped exponents _only_ through traps". The "_only_" is gratuitous, as becomes clear on p. 59 where you mention scalb and logb. The traps specified by IEEE 754 serve _only_ as clues for hardware designers that their traps need not be precise (provided they are restartable), and need supply only certain minimum information (wrapped result in a known destination for Over/Underflow, operand values and intended destination for Invalid Operation, etc.) to what would (we vainly hoped) turn out to be a limited menu of useful options along the lines of my notes you have cited.

I have news for you: an awful lot of hardware designers take IEEE 754 quite literally and do not regard its specifications as clues, but as requirements (the notable exception to this rule being the willingness to trap and let software handle denorms). The only part of 754 proper, as opposed to the appendix (which is largely ignored by hardware designers precisely because it is an appendix), that addresses the computation of wrapped exponents is the discussion in sections 7.3 and 7.4.
Some hardware actually delivers a result with wrapped exponent to the hardware-level trap handler, and some uses software after the hardware-level trap has been sprung, but my point was that a third, and potentially useful, approach would be for the trap handler to receive the original operands and for the hardware to provide an instruction that would produce results with wrapped exponents; such an instruction could be used on entry to the hardware trap handler but could also be a generally useful facility in ordinary code as well. There is no need to tie the concept of producing a wrapped exponent to the concept of taking a trap. While IEEE 754 does not require the tying of these two concepts, the way it is written certainly encourages implementations to tie them together. It is unfortunate that IEEE 754 was not accompanied by a rationale document explaining the envisioned range of implementation options (as advice to the hardware guys) and the envisioned uses for the prescribed facilities (as advice to the language guys).

> Incidentally, your comment on page 62 that my "original example does not address the question of overflow in the sums" is mistaken.

That is entirely correct; it was my error in misreading the intent of your notes, and I told my audience so. Thanks for noting it.

> The counting mode as described in my notes and actually implemented on the IBM 7090/7094, and used very successfully to compute long products/quotients for physicists and chemists in the mid 1960s, counts up or down correctly regardless of whether over/underflow occurs in a sum or a subsequent product or even quotient.

I now see how such code would operate.

> The counting mode was used also to speed up the comparison of complex absolute values; sqrt(x**2 + y**2) vs. sqrt(a**2 + b**2) was turned into (x-a)*(x+a) vs. (b-y)*(b+y) after sorting to make x = |x| >= y = |y| and a = |a| >= b = |b|, and also counting any over/underflows.
> I know that the counting mode has only very few uses, but can you think of a better way to do what it does?

Sure. Suppose that we have the OV and UN symbols, and suppose also that compound tests such as

    if (isNaN(x) | isInfinity(x) | isOV(x) | isUN(x)) { ... }

are cheap (a test instruction followed by a conditional branch, for example). Then we can use code such as the following rather than KOUNT mode:

    p = (x-a)*(x+a)
    q = (b-y)*(b+y)
    if (isOV(p) | isUN(p) | isSubnormal(p)) {
      if (isOV(p) ? isOV(q) : (isUN(q) | isSubnormal(q))) {
        k = 61 - SCALEOF(max(x, a))
        x = scalb(x, k)
        y = scalb(y, k)
        a = scalb(a, k)
        b = scalb(b, k)
        p = (x-a)*(x+a)
        q = (b-y)*(b+y)
      }
    }
    return p < q;

where SCALEOF is much like LOGB except that it returns an integer rather than a floating-point number and it returns zero if the argument is NIOUZ (NaN, Infinity, OV, UN, or Zero).

In this code as well as yours, we rely on the fact that any NaN in the input will cause the comparison to fail, and a NaN resulting from a difference of infinities will also cause the comparison to fail, thus causing all points at infinity to be regarded as equal. One of the reasons this works, and would not work under IEEE 754, is that if x and a are equal, y and b are not equal, and the computation of q underflows, the underflow does not produce a zero; where IEEE 754 would produce a zero (and only the multiplication could do this, not the addition or the subtraction), this proposal produces UN of the appropriate sign, and the comparison of this to the zero resulting from (x-a)*(x+a) produces the correct result.

If one of p and q is OV and the other is not, then the simple comparison produces the correct result. Likewise, if one of p and q is UN or subnormal and the other is not, the simple comparison produces the correct result. (One must worry about subnormal values as well as UN because of the loss of precision.)
Only if p and q have both overflowed or both become small is more care required; in this code I simply beat it with a sledgehammer, scaling all operands. (It is assumed that scaling a nonzero number to be very tiny never produces a zero, but rather UN.) If the compound test (isOV(p) | isUN(p) | isSubnormal(p)) can be performed in one or two instructions, this technique compares favorably with the cost of establishing KOUNT mode in the first place (not to mention saving and restoring the previous mode), even if we assume a single machine instruction that saves the old mode in register R1 while loading the new mode from register R2.

But we can do even better. I believe it would be useful to have a few very general tools for manipulating exponents and significands separately. Consider these proposed primitives, which could easily be single hardware instructions (as I describe further below):

    SCALEOF(x) = if isNIOUZ(x) then 0 else (int)logb(x)

This has the property that if x is not NIOUZ, then scalb(x, -SCALEOF(x)) is less than 2 but not less than 1.

    COSCALE(x, s) = scalb(x, SCALEOF(s))

    COUNTERSCALE(x, s) = scalb(x, -SCALEOF(s))

This has the property that if x is not NIOUZ, then COUNTERSCALE(x, x) is less than 2 but not less than 1.

    LOWCOUNTERSCALE(x, s) = scalb(x, -SCALEOF(s)-1)

This has the property that if x is not NIOUZ, then LOWCOUNTERSCALE(x, x) is less than 1 but not less than 1/2.

All four of these can, I imagine, be implemented in a single clock cycle. The advantage of COSCALE and COUNTERSCALE is that they are useful idioms that can be implemented entirely on the floating-point side of the processor, without using integer registers, and twice or three times as fast as implementing them in terms of SCALEOF and scalb.
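These four primitives can be emulated in current Java (my sketch; Math.getExponent and Math.scalb are real library calls, but this is of course far slower than the single-cycle instructions imagined above, and it glosses over subnormals and the nonexistent OV and UN):

```java
class ScalePrimitives {
    // "NIOUZ" shrinks to NaN, Infinity, or Zero, since Java has no OV or UN.
    static boolean isNIOUZ(double x) {
        return Double.isNaN(x) || Double.isInfinite(x) || x == 0.0;
    }

    // Like logb, but an int, and 0 for NIOUZ arguments.
    // (Subnormals would need extra care: Math.getExponent reports
    // Double.MIN_EXPONENT - 1 for them rather than the true logb.)
    static int SCALEOF(double x) {
        return isNIOUZ(x) ? 0 : Math.getExponent(x);
    }

    static double COSCALE(double x, double s)         { return Math.scalb(x, SCALEOF(s)); }
    static double COUNTERSCALE(double x, double s)    { return Math.scalb(x, -SCALEOF(s)); }
    static double LOWCOUNTERSCALE(double x, double s) { return Math.scalb(x, -SCALEOF(s) - 1); }
}
```

For instance, COUNTERSCALE(12.0, 12.0) is 1.5, which lies in [1, 2), and LOWCOUNTERSCALE(12.0, 12.0) is 0.75, which lies in [1/2, 1).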
    MULSIG(x, y) = x * COUNTERSCALE(y, y)

This amounts to ignoring the exponent field of y if y is not NIOUZ, so I expect it could easily be supported by existing floating-point multiply circuits as a single instruction by adding a little bit of control gating to the exponent computation. Thus it would have exactly the same cost as an ordinary multiplication.

    MULSIGS(x, y) = COUNTERSCALE(x, x) * COUNTERSCALE(y, y)

Similar. If either operand is NIOUZ, the result will be NIOUZ. If neither operand is NIOUZ, the result will be less than 4 but not less than 1.

    LOWMULSIG(x, y) = x * LOWCOUNTERSCALE(y, y)

Similar. Note that it multiplies x by a value that is less than 1, so it cannot overflow.

    AVERAGE(x, y) = (x + y)/2

computed by adjusting the exponent during the addition rather than in two separate steps. The advantage of this operation over addition is that it never overflows. The downside is that if the result is subnormal, you might have lost a bit of information (the lsb). This should require only a small amount of extra control circuitry in the exponent calculations of existing adders.

These primitives are adequate to implement, as a user library, an encapsulated data type consisting of a double-precision significand plus a 32-bit (or 64-bit) exponent. Here is a sketch:

    class wideExponentDouble {
      int expt;
      double signif;  // always less than 2 but not less than 1, if not NIOUZ

      wideExponentDouble(double value) {
        this.expt = SCALEOF(value);
        this.signif = COUNTERSCALE(value, value);
      }

      wideExponentDouble(int scale, double value) {
        this.expt = scale + SCALEOF(value);
        this.signif = COUNTERSCALE(value, value);
      }

      wideExponentDouble add(wideExponentDouble that) {
        if (this.expt < that.expt) return that.add(this);
        double thatsig = isNIOUZ(that.signif) ? that.signif
                                              : scalb(that.signif, that.expt - this.expt);
        double newsig = this.signif + thatsig;
        int newexpt = isNIOUZ(newsig) ? Integer.MIN_VALUE
                                      : this.expt + SCALEOF(newsig);
        return new wideExponentDouble(newexpt, COUNTERSCALE(newsig, newsig));
      }

      wideExponentDouble multiply(wideExponentDouble that) {
        double newsig = MULSIGS(this.signif, that.signif);
        return new wideExponentDouble(this.expt + that.expt + SCALEOF(newsig),
                                      COUNTERSCALE(newsig, newsig));
      }

      wideExponentDouble sqrt() {
        double adjustedSignif = ((this.expt & 1) == 0) ? this.signif
                                                       : scalb(this.signif, 1);
        return new wideExponentDouble(this.expt >> 1,  // NOT "this.expt / 2" !!!!
                                      sqrt(adjustedSignif));
      }
      ...
    }

But for particular algorithms we can do much better with hand-coding. Returning to the question of comparing complex magnitudes: instead of comparing (x-a)*(x+a) and (b-y)*(b+y), let us compare (x-a)*(x+a)/2 and (b-y)*(b+y)/2. The point is that we can compute (x+a)/2 and (b+y)/2 using the AVERAGE primitive, and therefore the multiplications are the only possible source of overflow.

    c = x - a;
    d = AVERAGE(x, a);
    f = b - y;
    g = AVERAGE(b, y);
    p = LOWMULSIG(c, d);
    q = f * LOWCOUNTERSCALE(g, d);
    if (isUN(q) | isSubnormal(q)) {  // testing q, not p
      p = c * scalb(x + a, 192);
      q = f * scalb(b + y, 192);
    }
    return p < q;

We perform one multiplication using LOWMULSIG, so it can't overflow. And b+y is not greater than x+a, so the result of LOWCOUNTERSCALE(g, d) is less than 1, so the other multiplication never overflows, either. So overflow simply is not an issue in this code.

On the other hand, if we are very finicky, we must worry about underflow and about the possibility that the AVERAGE operation may have lost a least significant bit. The AVERAGE operation can have lost a bit only if its result is subnormal. So if computing d lost a bit, then p must be subnormal; if computing g lost a bit, then q must be subnormal. But x >= y and a >= b, so if d is subnormal then g is subnormal. It follows that if either AVERAGE operation lost a bit, then q is subnormal.
So the test will catch those cases, as well as cases where both multiplications produce subnormal or UN results. In such cases we perform additions rather than averaging operations, scale the results up to avoid underflow, multiply, and compare. These cases occur only when at least one of the four numbers is very tiny. In the "normal" case, which covers nearly ALL numbers (not just half of them), the execution cost is four adds or subtracts (counting AVERAGE as an add), two multiplies (counting LOWMULSIG as a multiply), one extra scaling operation, and a test.

Tricks such as AVERAGE are well known to the DSP community for use with fixed-point arithmetic, and are built in as hardware instructions in many vector processors. Tools such as I propose would allow the careful floating-point programmer to mix the benefits of floating-point arithmetic with those of explicit scaling.

Slides #59 through #63 illustrate how long products of sums may be handled in a similar manner, by controlling scaling ahead of time so as to prevent overflow and underflow from occurring in the first place. Use of the primitive AVERAGE in slide #62 would make the code even more concise, assuming that loss of the lsb when averaging very tiny numbers is not a problem for this application. (One would have to initialize "expt" to N rather than 0 to compensate for the extra divisions by 2.)

I think the suggestion that the significand's last bit be used as an Inexact flag betrays a profound misunderstanding of its uses: "An inexact value represents a result that fell somewhere in the interval between adjacent exact values" is mathematically true and at the same time false for human purposes. It's true enough when the inexact value comes from exact operands. If inexact operands go in, the interval interpretation is misleading.
However, it seems to aid people's intuition when they first see the idea, and it is interesting (and surprising) that if you "steal" the low bit of the word, as opposed to any other bit, to serve as an inexact flag, then the resulting behavior looks like a rounding mode.

However, this raises an interesting question: In practice, when the inexact flag is used, is the computed value usually discarded when it is found to be inexact after all, or is it not infrequently put to use despite the fact that it is inexact? If the value is usually discarded, then one might simply propose a mode of computation in which any inexact result is replaced by a NaN that is guaranteed to propagate the information about inexactness; then one need not lose a bit of precision.

Likewise, you have not thought through the reasons for identifying NaNs with their origins; the _place_ in the program where the NaN was generated is what we need much more than the op-code.

I understand that. In 1980, addresses on a typical computer were well under 23 bits wide, and you could hope to encode a program location in the significand of a float. It's too bad that IEEE 754 didn't specify, or at least suggest, that a generated NaN should include the program counter of the instruction that generated the NaN! But that possibility has been blown away by Moore's Law. There may well be, in a fairly large code, more than 2^23 distinct program locations that could generate NaNs. So we must be more clever and retain partial information. Remembering the kind of operation seems helpful; there may be other information we can retain. As for doubles, we have a second word to play with, so maybe we can retain a 32-bit program counter, which may hold us for five years at most (my home desktop computer has 1 GB of memory on it). Experience with serious debugging systems over the last 20 years has shown us that even a program counter is really not enough information.
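A 32-bit tag does fit comfortably in a double NaN's payload today; a Java sketch (the payload layout and names here are my own illustration; neither IEEE 754 nor Java assigns payload bits any such meaning, and the JVM specification permits some NaN bit patterns to be collapsed in transit):

```java
// Illustrative only: stash a 32-bit "program location" tag in the low
// bits of a quiet double NaN and recover it later via the raw-bits view.
public class NanTag {
    static final long QNAN = 0x7FF8_0000_0000_0000L;  // quiet NaN, empty payload

    static double tagNaN(int location) {
        return Double.longBitsToDouble(QNAN | (location & 0xFFFF_FFFFL));
    }
    static int tagOf(double x) {
        // doubleToRawLongBits preserves the payload; the cast keeps
        // the low 32 payload bits.
        return (int) Double.doubleToRawLongBits(x);
    }
}
```

So the raw capacity is there; what is missing, as argued above, is any standard meaning for it, and any guarantee that the payload survives a chain of operations.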
It's not helpful, for example, to know that "hypot" caused an overflow without knowing who called "hypot". A thrown Java exception object is rather heavyweight for this reason; it contains a summary of many stack frames to aid understanding of where and how the exceptional situation occurred. So perhaps a double could contain an integer that identifies a region of memory into which debugging information, including a program counter or many program counters, has been dumped. But allocating such a memory region is perhaps best done by software, and therefore is perhaps best handled by a low-level trap handler. Therefore the hardware specification need not concern itself with such things, as long as sufficiently many NaN values have been reserved for such use by software.

I agree with your suggestions about max and min on p. 52, and with the usefulness of fast tests of flags and of operands' categories. For the latter, perhaps reprogrammable PLAs would be helpful.

An interesting idea for some contexts, thanks.

Rounding modes are NOT just special trap handlers.

Sure they are. If we define the inability to represent the exact mathematical value in the destination format, by reason of insufficient precision, to be an exceptional situation (which we do: we call it the inexact trap), then that situation must be handled. It may be handled by hardware or by software, but it must be handled. One way to handle it is to choose a bit pattern that represents some value other than the exact mathematical value. When the chosen value is close to the exact mathematical value, we call this rounding.

IEEE 754 specifies a trap enable flag for overflow. This allows us to choose easily between two trap handlers. One is a trap handler in the conventional sense: a user-supplied software subroutine. The other is a trap handler that, for efficiency, is normally implemented in hardware: it computes an infinity of the appropriate sign and returns that as the result of the operation that signaled the exception.
IEEE 754 likewise specifies a trap enable flag for underflow. This allows us to choose easily between two trap handlers. One is a trap handler in the conventional sense: a user-supplied software subroutine. The other is a trap handler that, for efficiency, might be implemented in hardware: it computes an appropriate subnormal number (possibly zero) and returns that as the result of the operation that signaled the exception. But, for some processors, that trap handler is often also implemented in software.

IEEE 754 likewise specifies a trap enable flag for inexact. It also provides for a two-bit rounding-mode specifier. Together, at a sufficiently abstract level of interpretation, these allow us to choose easily among five trap handlers. One is a trap handler in the conventional sense: a user-supplied software subroutine. The other four are trap handlers that, for efficiency, are normally implemented in hardware: we call these trap handlers round-to-nearest, round-toward-zero, round-toward-plus-infinity, and round-toward-minus-infinity. If only the specification of the interface to the user-supplied trap handler had required that the trap handler have easy access to the raw result plus the guard and sticky bits, it would be easy to implement other rounding modes in software.

DEC's Alpha almost got them right when some were allowed into the OP-code, but put the wrong ones there. (p. 65)

Yep, alas.

Trap handlers should certainly be subroutines supplied along with the run-time library of Math functions and Decimal-Binary conversions. There are so few things that can be done usefully after floating-point exceptions, they might as well be put into a standardized menu. (p. 68)

The _Modes_ requested by a program to determine how exceptions shall be handled, and how rounding will be directed, are values of variables whose scopes are the proper responsibility of language designers.
We agree that they are not best regarded as "dedicated registers", and that "Side effects on global state are a bad idea ...". (pp. 69 - 70)

I think here we are in agreement.

"Borneo may be on the right track for Java" warms my heart.

Glad to hear it, and I meant it sincerely. My two reservations are, first, whether trap handlers should be subroutines (which they are not in Borneo, clearly for reasons of fitting in with the rest of Java), and, second, whether it has sufficient abstraction to permit some wiggle room for implementors under the hood. What looks like a trap handler in Borneo might be reached at the low level either by a hardware trap mechanism or an explicit branch instruction; I want to make sure that hardware designers and compiler writers have some choices.

::::::  ::::::  ::::::  ::::::  ::::::

I am very grateful that you took the time to look over my slides and give me such detailed feedback. I'd love to continue the conversation, too, and work toward a more detailed and sensible proposal, and perhaps an appropriate refinement of Borneo with Joe Darcy.

Yours,
Guy Steele
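P.S. To make "other rounding modes in software" concrete: even without access to the guard and sticky bits, round-to-nearest arithmetic alone suffices, because Knuth's two-sum trick recovers the exact rounding error of an addition. A Java sketch of round-toward-plus-infinity addition built that way (the name addRoundUp is merely illustrative):

```java
// Illustrative round-toward-plus-infinity addition built from
// round-to-nearest: the branch-free two-sum recovers the exact error
// of a + b, and we bump the sum up one ulp whenever that error is
// positive (i.e., the true sum lies above the rounded sum).
public class DirectedAdd {
    static double addRoundUp(double a, double b) {
        double s  = a + b;                 // round-to-nearest sum
        double bv = s - a;
        double av = s - bv;
        double err = (a - av) + (b - bv);  // exact: a + b == s + err
        return (err > 0.0) ? Math.nextUp(s) : s;
    }
}
```

For example, addRoundUp(1.0, 1e-20) yields the next double above 1.0, where round-to-nearest would return 1.0 exactly. Had the trap-handler interface exposed the guard and sticky bits, the same effect would cost one test instead of five extra floating-point operations.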