SRTEST                                                      15 March 1995

                        A Test for SRT Division
                        =========================

  **************************************************************
  *  CAUTION:  This is work in progress.  It may be WRONG.    *
  *  It is being distributed now only so that my colleagues   *
  *  in academe and in industry, who need it immediately,     *
  *  can try it out and help me get it right.                 *
  **************************************************************

                        Prof. W. Kahan
                        Elect. Eng. & Computer Science
                        University of California
                        Berkeley CA 94720-1776
                        (510) 642-5638

Abstract:
~~~~~~~~~
SRTEST is an astonishingly simple computer program devised to test the correctness of quotient-prediction hardware for binary floating-point division that uses any Non-Restoring (SRT) algorithm.  Nowadays that hardware often takes the form of a Programmed Logic Array (PLA).  How long SRTEST should run to "prove" a PLA correct remains an open question; however, SRTEST is intended to uncover a flaw in a PLA far sooner than random testing can.  For example, when it ran on an Intel Pentium with the notorious binary floating-point divide bug of 1994, SRTEST uncovered the flaw in a tiny fraction of a second after performing its 356 th trial division and a few thousand arithmetic operations in all, although the flaw had earlier been found to afflict one division in about 8,800,000,000 with independent random operands.  However, to use either that kind of finding or SRTEST's performance to predict failure rates among diverse users of a defective divider is an abuse of statistics.  A study of SRTEST's workings sheds light on that abuse and helps us avoid repeating it.

Table of Contents:
~~~~~~~~~~~~~~~~~~
    Introduction
    SRTEST's Needs and Limitations
    Outline of SRT Division
    Carry-Save Remainders
    What SRTEST Doesn't Need
    Why SRT Division is Hard to Test
    Proofs vs. Tests
    Where to Test
    Strips of Layers of Tiles
    Out of Desperation
    Scattering Samples
    Checking a Quotient
    The Proof of a Pudding is in the Eating
    A Short History
    Bruised Integers
    Illusory Randomness
    Unanswered Questions
    Temptations to be Resisted
    Annotated Bibliography
    Acknowledgements
    Appendix:  Fragility of Low Probabilities
    Appendix:  Vagaries of Carry-Save
    Appendix:  Program Listing

Introduction:
~~~~~~~~~~~~~
A millennium ago, the Moorish craftsmen who built some of Spain's architectural jewels manifested humility by incorporating small flaws into their handiwork lest the deity punish any mortal's presumption of artistic perfection.  To-day's computer architects accomplish the same effect, but not out of humility.  Now complexity punishes the naive presumption that several million transistors can be commanded each to do the right deed at the right moment, and now knowledgeable people expect no modern microprocessor to be perfectly correct.

How much imperfection warrants replacement?  Nobody disputes that a task that can be performed correctly, and whose correctness is provable mathematically, should be performed correctly lest our ability to cope with unavoidable imperfections be overwhelmed.  If there is a dispute, it concerns the extent to which avoidable but humanly inevitable errors should occasion expensive remedies.

Intel faced that question again in the summer of 1994 after a flaw had come to light in their Pentium (TM) processor's floating-point division.  According to Jim Jarrett (1995), Intel's Head of Investor Relations, ...  " A task force was formed immediately to characterize the problem and determine its impact.
The group looked at everything from spreadsheets and computer-aided design programs to fluid dynamics and financial applications.  After a lot of work, a relieved task force reported last summer that the floating-point flaw was truly insignificant.  Specifically, on a statistical basis, a user would encounter reduced precision in floating-point divide operations once every 9 billion random number pairs.  For the average spreadsheet user performing 1,000 floating-point divide operations daily, this meant reduced precision once every 27,000 years.  The user was much more likely to encounter failures in other PC components; for example, a soft error in a DRAM chip occurs statistically once every seven years in an 8-MB array.  ...  It was insignificant enough that it wasn't even listed on an errata sheet.  The masks were changed to fix the flaw."

Many a statistician and almost all engineers could find no fault with the task force's analysis nor with the conclusions drawn above nor with Intel's consequent decision to replace defective Pentiums only for customers who could demonstrate that its flaw put their work at risk.  That seemed to be a sound decision considering that Pentiums had been running for over two years before its divider's flaw had been noticed.

I disagree.  I shall try to explain how flaws like Pentium's can stay hidden for astoundingly long times almost regardless of failure rates.  I shall try to explain why, regardless of the correctness of failure rate statistics for random divisions, subsequent probabilistic conclusions were illogical and could not justify the decision to replace defective chips only grudgingly.  That decision was soon rescinded, but for reasons having more to do with lawsuits, public relations and public confidence than with doubts about its probabilistic justification.

What are the right probabilistic conclusions?  Too little is known now to support scientifically any better assessment of risk than has been tendered by The MathWorks' Dr. Cleve Moler (1995):  " With the Pentium, there is a very small chance of making a very large error."  To his assessment this paper will add a strong suggestion, but short of proof, that practically all users of defective Pentiums must fall into two classes.  By far the larger class consists of users who are so unlikely to be harmed by the defect that no underwriter could do better than guess at the premium they should pay for insurance against harm.  The smaller class consists of users almost certain to be harmed by the defect almost regardless of the frequency of divisions in their work.  Unfortunately, only for users who perform no floating-point divisions ( and hence face no hazard ) do we know a better way than trial and error to determine the class into which any particular user falls.

Our conclusions leave unanswered the question Intel faced as Jarrett (1995) rephrased it:  " What happens in the future if we discover a flaw that produces a problem once every 90 billion random number pairs; do we still need to replace processors?  What about once every 900 billion?  No processor is ever perfect, and if the answer to every bug -- no matter how minor -- is to replace the processor, what does this do to the economics of computing?  What happens to the pace of innovation in an industry paralyzed by caution?"  Questions like these deserve answers, but these do not because they beg the question.
They take for granted a kind of randomness that makes sense for transient disruptions to DRAMs from alpha particles but makes no sense for a reproducible and systematic arithmetic bug like Pentium's. The lower the failure rate for random number pairs, the less relevant it is to a realistic appraisal of the bug's impact, since hardly anyone computes quotients of uncorrelated random numbers. The impact of a systematic bug is best revealed through preponderantly non-probabilistic considerations. Complexity is one of those considerations. If the bug can be awakened only by extremely complicated ( and therefore presumably unlikely ) combinations of circumstances, that bug may be eligible for promotion to a "feature" in the documentation provided also that a relatively simple software patch will let the sleeping bug lie. Conversely, if someone unaware of the bug's existence may write a very simple program that awakens that bug to destructive action, that bug is dangerous. A simple program that will expose a bug without prior knowledge of that bug's location makes a good test program. Conversely, if this test program is general enough and simple enough, that bug is dangerous. SRTEST is such a test program. Its design predates Pentium's by a decade. It hits Pentium's bug first on the 356 th division tested and again 2,294 times in the first million tested. And SRTEST is very simple; about half a dozen arithmetic instructions in its inner loop generate test divisors and dividends, and another half dozen test whether quotients are correctly rounded. That no such test was known to Pentium's designers is regrettable; it is also hard to believe that so simple a test has remained entirely unknown until now. Two integer parameters sum up what SRTEST has to know about the SRT divider under test: one estimates how many leading bits of divisor go into the quotient digit selection circuitry; the other estimates each digit's width. ( These are 5 and 2 for Pentium.) These estimates need not be perfect. SRTEST need not know how the divider manipulates remainders nor whether they are held in a carry-save form. Though the strategy behind SRTEST is simple, its derivation is not, nor is it yet logically complete; in short, SRTEST works better than I can explain. None the less, its existence, its simplicity, and its high yield of division failures cast doubt upon appraisals of risk inferred from randomly generated failures. How SRTEST was derived and what it does will be explained. I cannot say how far to trust SRTEST, nor how far to trust alleged proofs of correctness since they can lie too. Still, Intel's experience should teach us what not to do again; that will be summarized. Technical details about the fragility of extremely low probabilities and the vagaries of carry-save adders have been banished to appendices along with a listing of the program SRTEST. SRTEST's Needs and Limitations: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SRTEST needs values for two of the integer parameters that distinguish different variants of SRT division; they will be specified later. SRTEST presumes that floating-point arithmetic operations other than division, namely addition, subtraction and multiplication, are all performed correctly as prescribed by IEEE Standard 754 for Binary Floating-Point Arithmetic. 
Some versions of SRTEST perform also integer operations upon the exponent and logical manipulations upon the significand of a floating-point number but, if portability is assigned precedence over speed, these can all be replaced by slightly slower standard floating-point arithmetic operations including IEEE 754's LogB and ScalB functions. Every version of SRTEST is intended to run on floating-point hardware whose division, whether built-in or simulated in software, is the sole rational arithmetic operation in doubt. Except in that operation, roundoff never figures in versions of SRTEST that perform logical manipulations upon significands. SRTEST is not intended to test efficiently whether division is rounded correctly; that task is performed by my program DIVCK (1986), which widespread use has shown to be effective. By the way, Dr. D. Hough of Sun Microsystems has reported that DIVCK, though not designed to do what SRTEST does, found an error in Pentium's single-precision division after two minutes' running. That is dumb luck, and bad luck for Intel where DIVCK had been run on a simulation of Pentium too slow to run long enough. Thus do ever more ambitious designs overwhelm our simulators and other design aids; to redress the imbalance we must invest more design effort into tests better focused upon our designs' vulnerabilities, if we can figure out what they are. SRTEST is focused; it is intended to test efficiently only the logic that selects quotient digits in SRT division algorithms. Any other error SRTEST may uncover will have been uncovered accidentally. Outline of SRT Division: ~~~~~~~~~~~~~~~~~~~~~~~~~~ Almost nothing need be known about Non-Restoring (SRT) division to appreciate what SRTEST does. That division process will be outlined briefly here in terms chosen to resemble as much as possible the way Long Division has long been taught in schools. During the process, each quotient digit ( it can consist of more than one bit ) is selected from a table of eligible digits; that table is often implemented as a PLA. It is best viewed as if organized in rows and columns. A few leading bits of the divisor determine which column of the PLA will supply all the quotient digits for any given instance of division. A few leading bits of the current remainder ( initially it is the dividend ) determine the row. The quotient digit extracted from the selected row and column is then multiplied by the divisor and their product subtracted from the current remainder; the difference is the new remainder that will determine the next quotient digit. After sufficiently many quotient digits have been extracted by the foregoing process, they are assimilated into a conventional quotient and rounded to produce the final result of the division operation. Every digit-by-digit divider has a Radix, which indicates roughly the width of each quotient digit. For long division as taught in schools, where Radix = 10 (decimal arithmetic), the quotient digits come from the set { 0, 1, ..., Radix-1 }. Most computers use a power of 2 as radix; for quotient digits L bits wide, Radix = 2^L . SRT division derives its speed advantage by selecting each quotient digit from a redundant set of digits. For instance, when each digit is one bit wide it will be selected from the set { -1, 0, 1 } ; when two bits wide it may well be selected from { -2, -1, 0, 1, 2 }, as happens on Pentium. 
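To make the outline above concrete, here is a minimal sketch in C ( not a description of Pentium's circuitry, and no part of SRTEST proper ) of a radix-4 SRT loop using that digit set.  A real divider selects each digit from a PLA indexed by a few leading bits of the divisor and remainder; the sketch, purely for clarity, selects digits from the full remainder, which is exactly why toy code like this cannot exhibit a PLA bug.

    /* srt_sketch.c -- an illustrative radix-4 SRT division loop whose
       quotient digits come from the redundant set { -2, -1, 0, 1, 2 }.   */
    #include <stdio.h>
    #include <math.h>

    /* Develop about 2*ndigits quotient bits of x/y for x, y in [1, 2). */
    static double srt_divide(double x, double y, int ndigits)
    {
        double r = 0.25 * x;   /* pre-scaled remainder, so |r| <= (2/3)*y */
        double q = 0.0;        /* assimilated quotient                    */
        double w = 1.0;        /* weight of the next quotient digit       */
        for (int k = 0; k < ndigits; k++) {
            /* Stand-in for the PLA lookup: the digit nearest 4r/y, clamped
               to the digit set.  Redundancy ( |4r - d*y| <= (2/3)*y ) is
               what leaves room for the sloppier estimates hardware uses. */
            int d = (int)floor(4.0 * r / y + 0.5);
            if (d >  2) d =  2;
            if (d < -2) d = -2;
            r = 4.0 * r - d * y;   /* new remainder; d is never revised   */
            q += w * d;            /* negative digits undo earlier excess */
            w *= 0.25;
        }
        return q;   /* the 1/4 pre-scaling and the digit weights cancel   */
    }

    int main(void)
    {
        double x = 1.75, y = 1.25;
        printf("SRT sketch : %.17g\n", srt_divide(x, y, 30));
        printf("built-in  /: %.17g\n", x / y);
        return 0;
    }

With 30 radix-4 digits ( 60 bits ) the assimilated quotient ought to agree with the built-in double-precision quotient to within a unit or so in its last place.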
Redundancy renders unnecessary the revisions of ill-guessed quotient digits that may be remembered as part of long division as it is taught in schools. Instead, subsequent digits get chosen in a way that compensates for any slightly ill-guessed quotient digit. The speed advantage comes about because, once chosen, no quotient digit need be revised. The disadvantage is that a quotient can have many redundant representations from any one of which its conventional binary representation must ultimately be derived by an assimilation process akin to subtraction though potentially faster. Carry-Save Remainders: ~~~~~~~~~~~~~~~~~~~~~~ There is another complication: SRT remainders may be represented in a " Carry-Save " way. That representation of a binary number consists of two numbers whose sum is the number represented, but which are not added explicitly to produce that sum lest doing so impose a delay while carries propagate a long way from right to left. Circuitry to add or subtract a third number to or from the two can be designed to produce a result in carry-save form ( as a sum of two new numbers ) extremely quickly and compactly too because it need not provide for propagations of carries or borrows. The wider the numbers, the greater is carry- save's advantage in speed and compactness. For high-precision SRT division like Pentium's (64 sig. bits), carry-save's advantage is too great to forego. The disadvantage is that each numerical value has vastly many different carry-save representations, which complicates the designer's task a little and the tester's task a lot. Those complications are explained in detail in the Appendix: Vagaries of Carry-Save. What SRTEST Doesn't Need: ~~~~~~~~~~~~~~~~~~~~~~~~~~~ SRTEST is oblivious to all the details of the division process except the divider's Radix and the number N of leading bits of divisor used to select a column from the PLA. The PLA must have 2^(N-1) columns. ( Why not 2^N ? Because a normalized divisor's first leading bit is always 1 .) The time taken by SRTEST is proportional to the number of columns, other things being the same. Therefore, if N is not known, supplying an over-estimate instead to SRTEST will merely slow its work. ( N = 5 for Intel's Pentium, whose PLA has 16 columns.) Similarly, an over-estimate of the Radix is expected only to retard SRTEST. ( Pentium's Radix = 4 .) However under-estimating Radix or N may invalidate the test by preventing or excessively delaying SRTEST's inspection of parts of the PLA. SRTEST is oblivious to the number of rows in the PLA. That number is determined by the number of leading bits of the current remainder used to select a row. ( Intel's Pentium uses 7 bits, so its PLA has 128 rows.) Increasing the number of rows is expected to slow SRTEST proportionately, but that expectation has not yet been tested. SRTEST is oblivious to the entries in the PLA whose correctness is being tested. That is not so paradoxical as it may seem at first; we care only that the PLA as a whole be correct regardless of how it was constructed. Besides, we must be grateful that SRTEST does not ask us for a copy of the PLA; Pentium's table has 2048 entries. A characteristic of SRT division is that many of the entries in its PLA could be changed without invalidating the division process. There are two reasons for this. One is that over a quarter of the PLA's entries, particularly the topmost and bottommost few in each column, are " Don't Cares ": they never get selected by the division process. 
Redundancy in the set of quotient digits is the second reason; this implies that many an error in an early quotient digit can be put right by changing all subsequent digits in ways that will cancel the error at the end when all digits are assimilated into a correct final quotient. SRT division's limited capacity for self-correction is advantageous in some respects; it lets circuit designers vary quotient selection logic within limits in order to optimize some aspect ( cost or speed ) of a design. Neither we nor SRTEST wish to be distracted by variations, in quotient digit sequences, that yield the same quotient at the end. Why SRT Division is Hard to Test: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SRT division's capacity for self-correction is limited in a way that is vulnerable to errors of a kind that we may expect to uncover only by extravagantly prolonged testing. Some entries in the PLA, especially those adjacent to entries that can never be selected, are selected in so relatively few instances of division that errors in those entries big enough to defeat self-correction relatively often could still stay hidden until one of a few critical examples were tested. Other entries in the PLA, though selected relatively frequently, could be altered in a way that would fail to be self-corrected only in relatively rare instances; such errors too could remain unexposed until one of those rare critical instances were tested. And representing remainders in a Carry-Save fashion further rarifies critical instances, leaving them excruciatingly difficult to uncover; see the Carry-Save Appendix. The statistics of those critical instances are appallingly worse than anyone had suspected before Pentium's FDIV bug came to light. Its PLA had errors, 0 instead of 2 , in five out of 2048 entries. These gross errors were not uncovered until too late despite billions of divisions tested. Alas, these five entries lay along the boundary between entries that do and entries that never get selected, so only a very tiny fraction like 1/8,000,000,000 of all possible divisions ever selected them and therefore failed to be self-corrected. This tiny fraction was estimated, after the errors had come to light, in an Intel "white paper" by Dr. M.L. Barton and H.P. Sharangpani (1994). They obtained two estimates, one from a model by Markoff chains, the other from a count of the number wrong in more than 10^12 double-precision divisions tested at random on Pentium. The estimates agreed. Barton and Sharangpani also observed that defects appeared uniformly distributed in erroneous quotients from their 14th to 52nd sig. bits; and in so far as divisions were infrequent in applications software they inferred that all but a tiny fraction of Pentium users were extremely unlikely ever to be affected adversely by these defects. If random testing were the only way to test SRT division, we could not afford it. Neither can we afford the damage done by errors that, as we shall see, can occur far more often than the tiny fraction cited above suggests. The tiny fraction merely tells us that random testing of SRT dividers is too wasteful unless it fails early, and that the first million or so customers who use such dividers can easily overlook something wrong for at least several months, rather than that errors unexposed by billions of random tests must occur too rarely to matter much to most of us. We shall revisit these issues. Proofs vs. 
Tests:
~~~~~~~~~~~~~~~~~~~
If an SRT divider's PLA is correct, it can be proved correct mathematically, so why not do just that and skip bothersome testing?  The designers of Pentium's PLA did prove it correct mathematically, or so they thought, but their "proof" of correctness was incorrect.  They had " simplified " it by presuming their PLA sign-symmetric.  A natural presumption, it can be found also in J. Fandrianto (1987), though he had to modify his PLA a little to make it work correctly.  That presumption is slightly incompatible with carry-save remainders, but by itself it did no harm to Pentium.  However, a characterization of Pentium's PLA, submitted to the CAD tool that laid it out, was inadvertently incomplete, rendering five entries in the PLA invalid.  Unluckily, those invalid entries were overlooked because they fell in that half of the PLA that was presumed identical but for sign to the half that was checked for correctness.  As if in collusion, the two mistakes concealed each other; still, they need not have been fatal.  The designers' fatal mistake, induced by faith in their "proof" of correctness, was not to contemplate whether their divider differed enough from Intel's previous practice to require new tests.  Instead, the corporate department responsible for testing ran its old tests -- random tests and DIVCK -- on a simulation of the new design.  We know now why those tests turned up nothing wrong.  Reviewing the sequence of events, many a practitioner must be thinking " There go I but for the grace of ... ."

Proofs are not infallible.  In the mid 1980s the implementation of floating-point on the Inmos Transputer T-800 was "proved" correct mechanically.  Its specifications had been spelled out in a language called " Z " and an implementation microcoded in " Occam," both languages so contrived that the implementation's compliance with its specifications could be confirmed by a computerized theorem prover.  However the specifications contained two mistakes.  One was unintended: the floating-point number  -0  had been confused with the most negative binary integer, whose bit-pattern is the same.  The other mistake was misguidedly intentional: the flags for Invalid Operation, Overflow and Divide-by-Zero were merged into one flag.  Consequently the T800 occasionally mishandles  -0 , and spuriously signals INVALID when a computation creates and then consumes infinity in a way that would be innocuous if it conformed to IEEE Standard 754.  Early advertisements that perfect conformity had been "proved" are still believed widely.

Come to think of it, why should specifications and proofs be any less vulnerable to error than designs and programs?  Specifications are mere translations of good intentions into formal language; something can get lost in translation, as poets and contract lawyers know.  Proofs run to considerably greater lengths than the designs and programs they are intended to validate, so proofs have a larger capture cross-section for error.  We yearn for the sense of security that proofs provide, but certainty eludes us and security is an illusion.  That is not to say that proofs are worthless.  Quite the contrary:

   +---------------------------------------------------------------+
    \   The best testimony to a design's correctness is a proof.  /
     \   But it is testimony of uncertain credibility without    /
      \   corroboration.  The best corroboration is another     /
       \   independent proof provided it is not too long.      /
        +-------------------------------------------------------+
Tests provide corroboration too, some more so than others.  Testimony from purely random tests is undermined by a suspicion that too many of them may have fallen into terrain already tested instead of into Terra Incognita.  That is how random tests of Pentium's SRT division were wasteful; they almost all exercised only the PLA's non-extreme rows, leaving extreme rows inadequately tested until after billions of tests.  Regardless of their inefficiencies, tests are indispensable; tests of devices "proved" correct actually test our own thinking.  And since test programs can deceive us so easily, they too must be assayed for efficiency, proved correct and tested.  Of doubts there can be no end but instead a Law of Diminishing Returns.  To enhance the returns from tests we must concentrate them where they will do more good.

Where to Test:
~~~~~~~~~~~~~~
" Cherchez la femme ! "  ...  attributed by Alexandre Dumas to his fictionalized detective Joseph Fouché.

Our motto will be " Seek singularities ! "  There lie all errors.  In the theory of complex variables, the singularities of an analytic function are the boundary points of its domain of analyticity.  Inside its domain an analytic function is smooth, infinitely differentiable.  An aphorism attributed to Riemann goes roughly like this:  You tell me how an analytic function behaves at and near every one of its singularities; then I shall tell you what it does everywhere else.  When that aphorism was first uttered, analytic functions being studied all had countably many singularities, so characterizing a function by its singularities was the ultimate kind of data compression.

Logical functions of numerical arguments are piecewise constant; their singularities are the arguments where the functions change value.  To know a logical function at and near its every singularity is to know it everywhere.  To prove the logical function of the PLA correct SRTEST need check it only at and near every ( whew! ) numerical input where either the state of the PLA or the desired result changes.  How near is "near" ?  Roundoff blurs that question, so SRTEST is designed to avoid interference from roundoff, as we shall see later.

Let us think of the singularities of the SRT PLA as points plotted in the ( Divisor, Dividend ) - plane.  We need consider only the square  1 <= Divisor < 2  and  1 <= Dividend < 2  since floating-point exponents do not interact with the PLA.  Singular divisors are those at which selection switches from one column of the PLA to the next, plus endpoints; there are  2^(N-1) + 1  at

    Divisor  =  1,  1 + h,  1 + 2h,  1 + 3h,  ...,  2 - 2h,  2 - h,  2     where  h = 1/2^(N-1) .

They cut the square into  2^(N-1)  vertical strips, one for each column of the PLA.  Each of those strips breaks into tiles across whose boundaries selection switches from one row of the PLA to another.  In the ( Divisor, Dividend ) - plane, all the tiles' boundaries constitute the singularities of the SRT PLA.  There are very many of them, far too many for anyone to test them all.  At best, we might hope to test some singularities in a reasonable order.  Some of those boundaries lie on lines that are easy to describe.  The easiest, as we have seen, are the vertical boundaries of the strips.  Next easiest are the horizontal lines across which selection for the first quotient digit switches.  The boundaries across which switching occurs for subsequent quotient digits are not so easy to describe.
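For concreteness, here is a tiny sketch in C ( assuming a Pentium-like N = 5 ; it is not part of SRTEST ) that enumerates the  2^(N-1) + 1 = 17  singular divisors just described.  Scaled up by  2^(N-1) = 16  they are simply the integers 16 through 32, five of which ( 18, 21, 24, 27 and 30 ) reappear later as the column boundaries adjoining Pentium's flawed PLA entries.

    /* singular_divisors.c -- enumerate the singular divisors for a divider
       whose PLA columns are selected by the leading N bits of the divisor. */
    #include <stdio.h>

    int main(void)
    {
        const int N = 5;                 /* leading divisor bits to the PLA */
        const int cols = 1 << (N - 1);   /* number of PLA columns: 16       */
        const double h = 1.0 / cols;     /* width of each vertical strip    */
        for (int j = 0; j <= cols; j++) {
            double d = 1.0 + j * h;      /* singular divisor in [1, 2]      */
            /* Scaled up by 2^(N-1) these are just the integers 16 ... 32. */
            printf("boundary %2d :  Divisor = %.6f  =  %2.0f/%d\n",
                   j, d, d * cols, cols);
        }
        return 0;
    }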
Strips of Layers of Tiles:
~~~~~~~~~~~~~~~~~~~~~~~~~~
In each vertical strip, each quotient digit position generates a layer of overlapping tiles across whose non-vertical boundaries that digit's selection changes from one row of the PLA to the next.  For each succeeding quotient digit position, the tiles in its layer become more numerous and more complicated to describe.  The tiles' shapes depend upon an arcane question about the division operation's implementation:  Are remainders represented in a Carry-Save fashion ?  If remainders were NOT represented in carry-save fashion, every tile in the K-th quotient digit position's layer would have a simple shape consisting of a parallelogram with two opposite sides embedded in the boundaries of a strip and two whose slope is a quotient with zeros in the K-th digit and beyond.  Carry-save complicates the tiles; they can have complicated interpenetrating shapes depending upon how the remainder's two carry-save components combine their leading several bits before they go to the PLA to select a row.  This is the kind of detail we wish not to have to explain to SRTEST, which does not know even how many rows the PLA has.  Were we to try to take such details into account we might have to write a test program so complicated as would blatantly invite blunders.  Let us put off thinking about details unless and until we are compelled to abandon hope for a simple test.

SRTEST generates and tests sample points ( Divisor, Dividend ) each of which touches one tile from each of as many layers as there are digits in a computed quotient.  Each tile is the image, in a strip of the ( Divisor, Dividend ) square, of an entry in the PLA selected for the quotient digit position corresponding to the tile's layer.  If that entry is erroneous then, amongst its images, there must be some tiles which, if touched in a suitable place by a test, will expose the error.  Other tiles, though images of that erroneous entry, will when tested conceal the error because SRT division is partially self-correcting, so some of its quotients may be correct despite the error.  None the less, in order to expose whatever error may exist, testing must ultimately touch every tile; moreover, testing must touch one of the places in an erroneous tile where self-correction fails though such places may occupy a relatively tiny fraction of the tile's area.  How can SRTEST touch those places if it shall not know where they are?

Out of Desperation:
~~~~~~~~~~~~~~~~~~~
SRTEST knows locations for no other singularities than the boundaries of the  2^(N-1)  vertical strips, so it distributes its tests with ever increasing density along and next to those boundaries.  If only to turn this necessity into virtue, we shall contemplate two hypotheses:

Hypothesis V :
~~~~~~~~~~~~~~
  Every tile has at least one edge that runs along the vertical boundary of a strip and, whenever self-correction fails, it fails at or next to such an edge ( and perhaps elsewhere too ).

Hypothesis W :
~~~~~~~~~~~~~~
  Self-correction fails in regions shaped roughly like triangles, sometimes very small, of which the bigger ones press against tiles' corners.

Hypothesis V is provably true without carry-save representation; my proof exploits the monotonicity of column selection vs. remainder.  And casual exploration of a few examples turned up no counter-example for the kind of carry-save representation I tried.
Hypothesis W comes from empirical observation supported by the following thoughts:  when an erroneous quotient digit, too big or too small, cannot be self-corrected then it is often followed in the correct quotient by at least one digit that is already extreme, minimal or maximal resp.  Quotients that contain a string of extremal digits occur preponderantly at tiles' corners.

If hypothesis V holds then running SRTEST long enough will, by planting a test sample on or next to a vertical edge of every tile, ensure ultimately that the PLA has been tested exhaustively.  But if hypothesis V fails, tiles that do not touch a strip's vertical edge will escape testing by SRTEST, which may then fail to test the PLA exhaustively no matter how long it runs.  Hypothesis W , if true, accounts for SRTEST's efficiency:  random tests farther away from the vertical strip's edges would have to hit smaller triangles of uncorrectable errors and would miss more often.  True or not, the two hypotheses motivated SRTEST.

Scattering Samples:
~~~~~~~~~~~~~~~~~~~
Random number generation comes naturally to mind as a way to scatter sample dividends, but random samples yield fewer hits than SRTEST's simple formula, so it contains no explicit random number generator.

  +-------------------------------------------------------------------+
  | Footnote:  This is not to disparage random testing altogether.    |
  | In general, additional random testing continues to be necessary   |
  | lest testing focused at known singularities miss others that were |
  | overlooked.                                                       |
  +-------------------------------------------------------------------+

Instead, inspired by Hypotheses V and W , SRTEST chooses divisors at and near the singular divisors that mark the vertical boundaries of strips, and chooses dividends that will ultimately, if SRTEST runs long enough, exhaust all integers of some preassigned width.  These choices are determined by three parameters:

    N  =  the number of leading bits of divisors that go to the PLA.
    L  =  Log2( Radix ) ;  i.e.,  Radix = 2^L .
    M  =  maximum bit-width preassigned for divisors.

SRTEST uses  (2*L + 1)*2^(N-1)  divisors, each an M-bit integer of the form

    2.0^M - j*2.0^(M-N) + S     for  j = 0, 1, 2, ..., 2^(N-1)

where S runs through the set  { -2^(L-1), ..., -4, -2, -1, 0, 1, 2, 4, ..., 2^(L-1) } .  These perturbations S provide divisors on and near the boundaries of vertical strips scaled up by  2^(M-1) .  Early versions of SRTEST ran S through the set  { -1, 0, 1 } , but patterns of period L in carry-save remainders were noticed by Tim Coe, who observed that these patterns could severely delay revelation of some kinds of flaws in a PLA.  He proposed that the bigger set could hasten the revelation of flaws even if it did not increase SRTEST's yield ( number of failures found per division tested ).  His proposal, augmented slightly as described below, has worked well so far.

SRTEST's dividends start with consecutive odd integers 1, 3, 5, ... .  These ensure that every row of the PLA will be touched as soon as possible during the earlier quotient digits' selection, and also tend to generate sequences of extremal ( hence uncorrectable ) quotient digits when divided by divisors that are adjacent to the edges of the vertical strips.
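Before going on to how the dividends are perturbed, the divisor recipe above is simple enough to restate in code.  A rough sketch in C, with Pentium-like  N = 5  and  L = 2 ;  the width  M = 27  is merely illustrative, and SRTEST's own enumeration order may differ:

    /* gen_divisors.c -- enumerate trial divisors of the form just given:
       M-bit integers  2^M - j*2^(M-N) + S  with S drawn from
       { -2^(L-1), ..., -2, -1, 0, 1, 2, ..., 2^(L-1) } .                 */
    #include <stdio.h>

    int main(void)
    {
        const int N = 5, L = 2, M = 27;
        const long long base = 1LL << M;        /* 2^M     */
        const long long step = 1LL << (M - N);  /* 2^(M-N) */
        for (int j = 0; j <= (1 << (N - 1)); j++) {
            for (int i = -L; i <= L; i++) {
                /* perturbation S: zero and the signed powers of 2 up to 2^(L-1) */
                long long S = (i == 0) ? 0
                            : (i > 0 ?  (1LL << (i - 1))
                                     : -(1LL << (-i - 1)));
                printf("j = %2d  S = %+3lld  divisor = %lld\n",
                       j, S, base - j * step + S);
            }
        }
        return 0;
    }

Before use each such integer divisor is scaled by a power of 2 , which alters no significant bits.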
That tendency is valuable enough to merit enhancement by the following measures, suggested by Tim Coe's recollections of conversations with Dan Zuras held at Hewlett-Packard Laboratories.  Each odd integer generates three neighboring dividends.  First that integer is normalized into the range  2^(M-1) <= Dividend < 2^M  by scaling by a power of 2 .  Then two more dividends are obtained from that one by perturbing it slightly:  choose a small integer  I >= 0  and add  +2^I  and  -2^I  to the normalized Dividend.  The choice  I = 0  has worked well enough;  I = L-1  has worked better.  That perturbation has enhanced the yield of SRTEST over fiftyfold on Pentium.

Since SRTEST may be run for a long time, some provision for stopping and restarting it is necessary.  To that end, the current odd integer from which three dividends were derived is displayed continually and is available for initialization when SRTEST is restarted.  Integer formulas prevent the troubles that could be caused by faults in a compiler's decimal-binary conversion of nonintegers, especially if that conversion uses the floating-point divide instruction that SRTEST is testing.  Analogous hazards inherent in binary-decimal conversion of displayed output have been addressed by displaying also in hexadecimal.

Checking a Quotient:
~~~~~~~~~~~~~~~~~~~~
After what ideally would be  Q = Dividend/Divisor  has been computed and rounded off, how can its correctness be ascertained without using possibly incorrect division operations?  SRTEST computes the remainder  R = Dividend - Q*Divisor  exactly and checks that it is no bigger than is compatible with a correctly rounded Q .  Two details deserve a few moments' attention:  How is R computed exactly?  How big is " compatible " ?

Provided floating-point addition, subtraction and multiplication work correctly, R can be computed exactly if each factor of Q*Divisor is first broken into pieces that can be multiplied without roundoff.  For instance, suppose Ql is obtained from Q by rounding it off to half as many significant bits as would normally be carried; how to do that will be discussed later.  Then  Qt = Q - Ql  is computed exactly with a floating-point subtraction, and consists of the trailing half of the significant bits of Q .  Similarly split up  Divisor = Dl + Dt .  Then no product  Ql*Dl, Ql*Dt, Qt*Dl, Qt*Dt  engenders any roundoff, so

    R  =  ((( Dividend - Ql*Dl ) - Ql*Dt ) - Qt*Dl ) - Qt*Dt

exactly provided the subtractions occur without roundoff.  For the sake of simplicity and speed, SRTEST chooses divisors with  Dt = 0 , so

    R  =  ( Dividend - Ql*Divisor ) - Qt*Divisor

suffers no rounding error and is therefore computed exactly.  ( Should such half-width Divisors be found inadequate for some reason, SRTEST could easily be modified to accommodate wider Divisors.)

The simplest way to break Q into a sum  Ql + Qt  is to copy Q to Ql and then write zeros over as many trailing significant bits of Ql as suffice to ensure that Ql*Divisor will suffer no roundoff.  This bit-twiddling can be replaced by a portable but slower alternative:

    T = H*Q     and     Ql = ( Q + T ) - T

using IEEE Standard rounded floating-point arithmetic and a constant  H = 2^M - 1  in which M is the number of trailing bits to be cleared from Q .  This alternative must be used when floating-point arithmetic carries an odd number  2M-1  of significant bits, in which case M must barely exceed half that number and every Divisor must be at most M bits wide; then  Qt = Q - Ql  will turn out at most M-1 bits wide but possibly negative.  In any event, IEEE Standard 754's rounded floating-point arithmetic is more than good enough to deliver Qt with no rounding error, and likewise R .
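Putting the pieces together, here is a minimal sketch in C of the whole check for one division, assuming IEEE 754 double precision ( 53 significant bits ), a divisor whose significand is at most 26 bits wide, and the error bound derived in the next paragraph.  The splitting below keeps the leading 26 bits of Q by scaling and truncation, one portable way of writing zeros over its trailing bits; the function and variable names are mine, not SRTEST's.

    /* check_quotient.c -- check one division  Q = X/D  without trusting
       any other division.                                                */
    #include <stdio.h>
    #include <math.h>

    /* Nonzero if |X - Q*D| is smaller than half an ulp of Q times D ,
       i.e. if Q could be the correctly rounded quotient.                 */
    static int quotient_ok(double X, double D, double Q)
    {
        int qe;
        frexp(Q, &qe);                  /* Q = f * 2^qe with 0.5 <= f < 1 */

        /* Split Q = Qh + Qt so that Qh*D and Qt*D incur no roundoff:
           Qh keeps the leading 26 significant bits of Q.                 */
        double Qh = ldexp(floor(ldexp(Q, 26 - qe)), qe - 26);
        double Qt = Q - Qh;             /* exact: the trailing bits of Q  */

        /* The remainder, with no roundoff in these subtractions.         */
        double R = (X - Qh * D) - Qt * D;

        /* Half an ulp of Q scaled by the divisor; this is the
           ScalB( Divisor, LogB(Q) - 53 )  of the next paragraph.         */
        double bound = ldexp(D, (qe - 1) - 53);

        return fabs(R) < bound;
    }

    int main(void)
    {
        double D = ldexp(41943039.0, -25);  /* 26-bit significand, in [1,2) */
        double X = 1.5;                     /* some dividend in [1,2)       */
        double Q = X / D;                   /* the operation under test     */
        printf("Q = %.17g : %s\n", Q, quotient_ok(X, D, Q) ? "passes" : "FAILS");
        return 0;
    }

SRTEST proper works from the integer divisors and dividends generated as in the previous section; the operands in main above are merely placeholders.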
How big can the remainder R not get if the quotient Q is correctly rounded?  Then the error in Q must be smaller than a floating-point quantity Qe obtained by clearing to zero all the significant bits of Q except its ( possibly implicit ) leading 1 , and decrementing its exponent by the number of significant bits carried.  Just as Qe should exceed the magnitude of  Q - Dividend/Divisor , so Qe*Divisor should exceed the magnitude of the remainder R ; otherwise floating-point division is flawed and fails the test.  Once again, bit-twiddling can be replaced by a portable alternative:

    Qe*Divisor  =  ScalB( Divisor, LogB(Q) - NumberofSigBits )

wherein ScalB and LogB are functions recommended by IEEE Standard 754 but " more honour'd in the breach than the observance."  The C functions frexp and ldexp can be used instead but less conveniently.  Similar steps assist normalization and perturbation of dividends.

The Proof of a Pudding is in the Eating:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The earliest version of SRTEST was programmed in Fortran and run on a Pentium in December of 1994; it found a flaw in the 13,800 th quotient tested.  When the current version was programmed and run in February of 1995, it took a tiny fraction of a second to uncover the first failure in the 17 th decimal of the 356 th quotient tested:

        3221225470     BFFFFFFEh
        ----------  =  ---------  =  0.9999999996895591417 ,
        3221225471     BFFFFFFFh

                       not Pentium's  0.9999999996895591047 .

If SRTEST were less sensitive to errors in quotients it might have had to wait for the 686 th quotient tested to uncover a failure in its 8 th decimal:

        2684354558     9FFFFFFEh
        ----------  =  ---------  =  0.6666666663 ,
        4026531839     EFFFFFFFh

                       not Pentium's  0.6666666347 .

The worst relative error found, about  1/2^14 = 6.1 E-5 , occurred in the 50,096 th quotient tested:

        2148268000     800BF7E0h
        ----------  =  ---------  =  0.666910 ,  not Pentium's  0.666869 .
        3221225471     BFFFFFFFh

Almost the same error had first occurred in the 18,932 nd quotient tested.  That Pentium's floating-point division cannot do much worse has been confirmed by A. Edelman as an extension of results obtained by Coe and Tang (1995) to be mentioned below.

Like most gleaners, SRTEST ran into the Law of Diminishing Returns after running for a while; its initial yield of one faulty quotient per 200 or 300 tested dwindled to one per 600,000 .  Here is a tabulation that shows how the yield dwindled:

        Total Number of       Total Number of
        Failures Found        Divisions Tested
        ~~~~~~~~~~~~~~~       ~~~~~~~~~~~~~~~~
                  1                      356 -             684
                  4                      836 -           1,119
                 10                    2,275
                 48                    9,476 -          10,194
                100                   20,756 -          20,950
                450                   99,956 -         100,284
              1,000                  285,715
              2,295                  999,625 -       1,000,285
              7,884                9,998,470 -      10,003,509
             10,000               19,424,950 -      19,425,865
             17,668               99,898,960 -     100,002,685
             30,506              994,345,751 -   1,001,442,070
             53,935            9,999,879,331 -  10,000,743,330
            100,000           47,689,282,440
            112,670           56,902,042,535 -  ...

Though this yield far surpasses the one per 8,000,000,000 that random testing would yield, its dwindling is significant for reasons to be discussed later.

A Short History:
~~~~~~~~~~~~~~~~
Early in the summer of 1994, when comparing another processor's test results with Pentium's, Intel discovered that the latter's floating-point division was defective, but kept silent in adherence to a long-standing policy ( reversed nine months later ).  In October Prof. Thomas M. Nicely, of Lynchburg College Math. Dept. in Virginia, traced a discrepancy between his computations on other computers and on Pentium to that defect and brought it to public attention.  He deserves a medal for checking his work.
Then Intel compounded its error by attempting to minimize the bug, thereby ensuring that it would hatch into a major embarrassment.  Soon afterward Tim Coe, at Vitesse Semiconductor Corp., Camarillo CA., reverse-engineered Pentium's divider well enough to predict, contrary to opinions expressed at that time ( cf. Anonymous (1994)), that Pentium's single-precision division could suffer relative errors as bad as  6.1 E-5 :

        4195835     4005FBh
        -------  =  -------  =  1.33382 ,  not Pentium's  1.33374 .
        3145727     2FFFFFh

Coe had deduced that five entries in Pentium's PLA that should be 2 were 0 , and that therefore divisors very slightly less than the five integers 18, 21, 24, 27 and 30, or numbers obtained by scaling them by powers of 2 , were the divisors most likely to cause errors.  These are five of the seventeen singular divisors near which SRTEST focuses its testing.  In December 1994, Coe's deductions were confirmed in the white paper by Barton and Sharangpani, which had been released by Intel to support its estimates of failure rates like one division in 8.8 billion, one failure in 27,000 years for most spreadsheet users, and so on.  Prof. Nicely had estimated the failure rate for random divisions at one in 31 billion.  Estimates like these were cited to justify Intel's public stance:  Pentium is safe for almost everybody.

Meanwhile all numbers very slightly less than a small positive integer had come to be called " Bruised Integers " on the InterNet, where a brisk correspondence flourished among participants ranging from idle curiosity seekers to technology managers fearful of worse revelations to come.  Since all failures found initially involved bruised integers, an analyst within IBM decided to investigate how often they arise out of decimal-binary conversion errors; for example, " 0.3/0.1 " incurs two rounding errors before the division is performed, and that is why double-precision calculation of that quotient produces the bruised integer 2.999...999 , in error by  1/2^51 .  Although its author did acknowledge that his was a " simplistic analysis," it inspired IBM to suspend sales of Pentium-based PCs for fear that a spreadsheet user might encounter a mishap every 24 days on average.  IBM's public stance was that Pentium is unsafe for almost everybody.

Nobody will ever know which estimate, Intel's or IBM's, is closer to the truth unless someone is employed for several months to compare results for identical wide-ranging real-life tasks run on Pentiums with and without the division bug, or else run on a flawed Pentium using a slow counting option in the software patch described below.  Since almost everybody can see how self-interest aligns with IBM's and Intel's public stances, the prevailing opinion ( equally unjustified ) is that the truth must lie somewhere between.  This numb skepticism is a predictable consequence, well known to litigators and plotters of political campaigns, of conflicting pronouncements, some reassuring and others frightening.  The public interest would be served better by the promulgation of do-it-yourself triage kits designed to help individuals assess each his own exposure to hazard.
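As an aside, the bruised integer blamed above on " 0.3/0.1 " is easy to reproduce on any machine with IEEE 754 double precision, flawed divider or not, since it comes from decimal-binary conversion rather than from division; a two-line demonstration in C:

    /* bruise.c -- neither 0.3 nor 0.1 is representable in binary, so each
       is rounded on conversion; their double-precision quotient is the
       bruised integer mentioned above, short of 3 by 1/2^51 even on a
       correct divider.                                                   */
    #include <stdio.h>

    int main(void)
    {
        double q = 0.3 / 0.1;
        printf("0.3 / 0.1 = %.17g\n", q);        /* 2.9999999999999996    */
        printf("3 - that  = %.17g\n", 3.0 - q);  /* 4.4408920985006262e-16,
                                                    i.e. exactly 1/2^51   */
        return 0;
    }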
Bruised Integers:
~~~~~~~~~~~~~~~~~
To help assess the hazards posed by Pentium's FDIV bug, a summary of what is known about it will be presented here.  Five entries in its PLA are wrong, 0 instead of 2 , in columns belonging to divisors barely less than the five integers  18 = 12h,  21 = 15h,  24 = 18h,  27 = 1Bh  and  30 = 1Eh  and multiples of them by powers of 2 ; these five, of the seventeen singular divisors near which all of SRTEST's tests occur, are all multiples of 3 .  Divisions tend roughly -- not infallibly -- to fail less often as divisors fall farther short of these singular divisors; that is why most failures have divisors that are bruised integers.  Division never fails at divisors too far from singular;  Tim Coe and Dr. P-T. Peter Tang (1995) have demonstrated that every divisor for which division fails on Pentium must, when scaled by a suitable power of 2 , fall into one of these five narrow intervals:

    11.FCh  =  18 - 1/64  <  Divisor  <  18  =  12.00h ,   or
    14.FCh  =  21 - 1/64  <  Divisor  <  21  =  15.00h ,   or
    17.FCh  =  24 - 1/64  <  Divisor  <  24  =  18.00h ,   or
    1A.FCh  =  27 - 1/64  <  Divisor  <  27  =  1B.00h ,   or
    1D.FCh  =  30 - 1/64  <  Divisor  <  30  =  1E.00h .

These intervals must be practically best-possible because Tim Coe has found failures as close to their lower boundaries as these divisors:  17.FC01FF02ECh ,  1A.FC01FF03Eh ,  1D.FC01FF006Ch .  ( He has also predicted that failures with random operands should occur in quotients' 14 th bit-position half as often as in the 16 th and subsequent even bit-positions, where he expects failures to occur uniformly distributed.)

Knowing those intervals, Dr. Cleve Moler, Terje Mathisen of Norsk Hydro in Oslo, and Tim Coe have been able to devise an efficient software patch to work around defective Pentiums.  Briefly, the patch tests every divisor and decides very quickly whether it lies in one of those intervals, in which case the potentially incorrect division is preceded by multiplications of divisor and dividend by 15/16 .  That moves the divisor out of danger and, thanks to Pentium's Double-Extended registers, produces exactly the same quotient as would have been obtained by an unflawed machine from Single- and Double-Precision floating-point operands.  As implemented in The MathWorks' MATLAB and MetaWare's C and C++ compilers, the patch is augmented by an option that, when turned on, counts the instances when a flawed Pentium's division fails; this is the slow counting option mentioned above.

In accord with hypothesis W above, the majority of failures have been found at divisors and dividends that are bruised small integers; however, the extent of that preponderance may be due in part to the way searches for failures have been conducted.  A substantial fraction of observed failures do not have bruised integer dividends and, more ominously, many failures have modest integer divisors.
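The divisor screening performed by that software patch is simple enough to sketch.  The C fragment below shows only the screening step against the five Coe-Tang intervals ( the function name and the use of double precision are mine ); the patch itself then multiplies both operands by 15/16 , exactly, in Pentium's double-extended registers before dividing.

    /* atrisk.c -- screen a divisor against the five narrow intervals
       outside of which Pentium's division cannot fail.                   */
    #include <stdio.h>
    #include <math.h>

    /* Nonzero if |d|'s significand, scaled into [16, 32), lies within
       1/64 below one of 18, 21, 24, 27 or 30.                            */
    static int divisor_at_risk(double d)
    {
        static const double top[5] = { 18.0, 21.0, 24.0, 27.0, 30.0 };
        int e;
        double s = ldexp(frexp(fabs(d), &e), 5);   /* s now in [16, 32)   */
        for (int i = 0; i < 5; i++)
            if (s > top[i] - 1.0/64.0 && s < top[i])
                return 1;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", divisor_at_risk(3221225471.0)); /* BFFFFFFFh : at risk */
        printf("%d\n", divisor_at_risk(3.0));          /* safe                */
        return 0;
    }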
Each row of the tabulation below shows how many failures were counted as dividends ran through the first 2^31 odd integers ( exhausting all 31-bit wide dividends ) for every divisor, each a small integer, in that row.  A disturbing aspect of these failures is that none hit beyond the 26 th sig. bit of the quotient and three quarters of them hit in the 16 th to 24 th sig. bits, so they would degrade single-precision results.

 -----------------------------------------------------------------------
  Five divisors of equal binary widths               Count of division
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~               failures
                                                     ~~~~~~~~~~~~~~~~~
  in Decimal:       4607,   5375,   6143,   6911,   7679          38
  in Hexadecimal:  11FFh,  14FFh,  17FFh,  1AFFh,  1DFFh

  in Decimal:       9215,  10751,  12287,  13823,  15359          36
  in Hexadecimal:  23FFh,  29FFh,  2FFFh,  35FFh,  3BFFh

  in Decimal:      36863,  43007,  49151,  55295,  61439        6355
  in Hexadecimal:  8FFFh,  A7FFh,  BFFFh,  D7FFh,  EFFFh

  in Decimal:      73727,  86015,  98303, 110591, 122879        6642
  in Hexadecimal: 11FFFh, 14FFFh, 17FFFh, 1AFFFh, 1DFFFh

  in Decimal:     147455, 172031, 196607, 221183, 245759        2070
  in Hexadecimal: 23FFFh, 29FFFh, 2FFFFh, 35FFFh, 3BFFFh

  in Decimal:     294911, 344063, 393215, 442367, 491519       28242
  in Hexadecimal: 47FFFh, 53FFFh, 5FFFFh, 6BFFFh, 77FFFh

  in Decimal:     589823, 688127, 786431, 884735, 983039       17648
  in Hexadecimal: 8FFFFh, A7FFFh, BFFFFh, D7FFFh, EFFFFh
 -----------------------------------------------------------------------

Plots of all known failures have revealed little more than is obvious.  ( Some plots seen on the InterNet are misleading because they take no account of decimal-binary conversion errors incurred before plotting.)  Plotted in the ( Divisor, Dividend ) - plane, failures fell into near horizontal grass-like streaks corresponding to vulnerable quotients; in other words, all failures have divisors in the narrow intervals allowed by Coe and Tang (1995), and have peculiar quotients in so far as they contain strings of extremal quotient digits.  Perusal of hundreds of thousands of failures indicates that quotients always exhibit digit patterns peculiar to the interval into which the divisor falls.  For instance, for divisors strictly between 11.FCh and 12.00h , and for dividends between 10.00h and 20.00h , so that quotients must lie between 0.E38E38...h and 1.C781AB...h , all the quotients that Pentium got wrong looked like this in binary:

        right quotient  =  ...WwXx1y 101010Zz...
        wrong quotient  =  ...WwXx1y 011111.....

where  W, w, X, x, y, Z, z  are bits and  W+w = X+x = Z+z = 1 .  Blanks have been inserted just before the first bit that changed in the final quotient although the error in Pentium's PLA actually changed the second 1 after the blank into a 0 , after which a negative quotient digit got selected; its propagated borrows decremented the first 1 .  The numbers of bits after the binary point and ahead of the blank were always even and at least 12 .  Those same even numbers but slightly different bit patterns arose with divisors in the other four intervals.  A better understanding of these patterns awaits refinements of methods like those Coe and Tang used for their demonstration cited above.  The patterns are obscured by a coincidence; all the singular divisors adjacent to Pentium's flaws are multiples of 3 , and generally the divisors that produce long sequences of extremal quotient digits from small integer dividends lie near small integer multiples of a divider's ( Radix - 1 ) , here 3 also.

Illusory Randomness:
~~~~~~~~~~~~~~~~~~~~
Other than the foregoing, the failures appear random.  Appearances are deceptive; the failures are not random.  They look random just because we have yet to find a few simple formulas that predict them well.  Such formulas must exist because the failures are "algebraic"; they arise out of the omission of a term from an expression sent to a CAD tool that constructed Pentium's PLA.
Therefore the failures all lie in a smooth curve or surface that first gets chopped up and reflected, by the SRT division process, in a way that resembles the effect of a kaleidoscope with cracked mirrors. Second, carry-save representation of remainders further riddles the failures' locus with holes in a way, still ill understood, that contributes most to the failures' apparent randomness. But to presume randomness is to compound our ignorance. There are other reasons to suspect a pattern among the failures. It is foreshadowed by those failures with small integer divisors. A better harbinger is the initially high yield of failures exposed by SRTEST : among its 80 divisors only ten were capable of generating failures, and those ten did so at the rate of about one failure per 36 divisions among the first 35,715 in which they participated. So high a failure rate from so simple a program as SRTEST warns us that other software may generate high failure rates too, but inadvertently. What kind of applications software is most at risk? The risk has less to do with an application's data or with its frequency of divisions (unless that be zero) than with the formulas the software uses. Far from being independent random variables, dividends and divisors are generally correlated algebraically by the formulas that generate them, and therefore lie in curves and surfaces peculiar to the application. If these figures fall tangent to the curves or surfaces comprising all failures, then failures will swarm around that application. If, as is far more likely, intersections are instead transversal (not nearly tangent) then that application will experience almost no failures. It follows that almost all applications software can be divided into two classes. The larger by far consists of software whose failure rate due to Pentium's defective division is very tiny, much too tiny to estimate reliably within orders of magnitude. Software almost certain to fail constitutes the smaller class; containing SRTEST, it cannot be empty. Unfortunately, to classify any particular program that does perform floating-point divisions, we do not yet know a better way than to run it on a defective Pentium for a while. Why did SRTEST's yield, the number of failures found per division tested, slowly dwindle as testing continued? Apparently, the more productive M-bit dividends end in longer bit-strings like ...111111 or ...000000 or ...000001 . The dwindling yield is consistent with the geometrical picture portrayed above, and hence consistent with the existence of some simple but not yet recognized formula that would generate all failures. That formula and SRTEST's formula generate dividends and divisors falling along curves tangent to each other, but the neighborhood of tangency contains only earlier dividends SRTEST obtained from small integers. Later dividends, obtained from bigger integers, fall far enough from this neighborhood that the failure rate dwindles to something approaching the rate for random dividends. Unanswered Questions: ~~~~~~~~~~~~~~~~~~~~~ " Seek Singularities ! " makes a fine maxim but not a firm foundation for a test program like SRTEST until the singularities' locations are understood better. SRTEST lacks an adequate theory for its behavior with carry-save remainders; further investigation is needed badly. Among questions that come to mind are these: How best can knowledge of a divider's Radix be exploited? 
Without knowing the redundant quotient-digit set nor whether carry-save remainders are used, can we characterize all the most vulnerable quotient-digit sequences and then multiply them by near-singular divisors to get most vulnerable trial dividends?  What can be done to render the program SRTEST invulnerable to the vagaries of compilers, and hence more widely portable?  If run long enough can SRTEST prove a divider correct?  How long?

In our present state of (mis)understanding, we cannot substitute a test like SRTEST for a mathematical proof of correctness.  The best that might be said for SRTEST is that it beats the Test of Time:  millions of customers used Pentiums for months without noticing any error in division.  That proves customers rarely check their work, and that very few of them get an opportunity to do what Intel and Nicely did to discover Pentium's flaw:  compare results obtained for the same computation from different computers.  Besides, what could a customer who noticed a discrepancy do about it?  If caught during debugging, a numerical discrepancy would probably be attributed to roundoff, which is the paramount cause of discrepancies.  Since rounding errors are so difficult to understand, a programmer can easily misdiagnose the discrepancy and alter her program ineffectually and, having changed a divisor by another rounding error, no longer see the discrepancy.  ( This volatility is explained in the Appendix on the Vagaries of Carry-Save.)  Consequently, despite many observed discrepancies, flawed division would remain unreported.

Ultimate users of numerical software routinely misinterpret incorrect results.  If noticed at all, usually long after they were committed, errors are often blamed upon roundoff; to evade them, users will change to different software or change data slightly rather than waste time performing someone else's debugging duties.  Many a computation is hypothetical, performed to predict consequences of a proposed action; disagreeable consequences preclude that action.  When numerical errors predict disagreeable consequences mistakenly, no action follows; the mistake goes unnoticed.  Flawed division would remain unnoticed too.

Thus, reports from the field must badly under-report the incidence of error due to faulty arithmetic so long as a fault remains unsuspected; an absence of reports during that time cannot be construed afterwards as evidence that nobody but the first to report it had been affected by the fault.  As a fault becomes public knowledge it becomes a target for the idly curious and a scapegoat for ills that otherwise resist easy explanation, and the incidence of error tends to become over-reported.  Worse, conscientious professionals must then start chasing wild geese, checking old results for contamination and burdening new software with extra checks against errors that continue, as before, to be caused almost always by some mistake other than faulty arithmetic.  The computing industry is served badly by the Test of Time; if passed for a while it conveys a false sense of security; when failed it fails too late.  Tests like SRTEST are far preferable however imperfect.

Like any test program, SRTEST has to be tested too.  It should be run on numerous PLAs each with a different intentional defect to see how long SRTEST takes to uncover defects.  Such a test is made difficult by the self-correcting propensity of SRT division; many an intended "defect" could turn out to be not a defect after all, or to harm only wide quotients with more digits than the test examines.
Whatever the difficulties, we have to attempt such a test before we can say with confidence that SRTEST's prompt exposure of Pentium's floating-point divide bug was no accident.

Temptations to be Resisted:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Neither SRTEST's nor any similar test's results can support estimates of the "probability" that an unspecified user of defective arithmetic will suffer from the defect.  The probability that a random test would uncover the defect is logically irrelevant.  To see why, suppose that a divider produced a grossly wrong quotient when and only when division by 3.0 occurred in double or double-extended precision.  Since 3.0 is one of at least 2^52 = 4.5 E 15 double-precision floating-point numbers, a random test could be expected to uncover this bug with probability less than about 2.3 E-16 , which is far smaller than any of the probabilities that have been bandied about concerning Pentium's bug.  However, division by 3.0 is so ubiquitous that its defect could not reasonably be expected to stay hidden for so long as one day.  Similarly, SRTEST's yield of about one defective division per 250 of them during its first few seconds' running is no probability;  if a substantial fraction of Pentium owners faced that kind of hazard they would have drowned Intel with complaints a few weeks after purchase.

We must not promulgate dubious statistics, especially if they tend to make light of an obvious bug's impact upon the public, lest opponents respond with their own self-serving statistics tending the opposite way.  Both sides will claim their predictions borne out by events.  The controversy will draw attention to ever more embarrassing mishaps that turn up inevitably, most of them misattributed to the bug;  and then prophets of doom, cartoonists and lawyers will close in to exact the public's revenge for a betrayal of their trust.  Here is a better policy:

 1:  Never try to hide a bug nor postpone acknowledging its existence; the attempt merely destroys the public's confidence in whatever you are obliged to say at last.  The rarer the bug, the more careful you must be to say only what can be said with certainty, since the statistics of extremely rare events, far out on the tails of probability distributions, are notoriously treacherous;  see the Appendix on the Fragility of Low Probabilities for an explanation.  Attempts to estimate extremely low rates of incidence are too vulnerable to unanticipated correlations to be worth publishing.

 2:  Undertake to replace defective parts as quickly as corrected parts can be made available.  Past experience and statistical methods can serve validly to estimate the rate at which demand for replacements may ramp up, but alluding to uncertifiable statistics in order to dampen public demand may merely exacerbate legal liabilities.

 3:  Provide software patches to work around the bug, if that is possible, as soon as possible and without waiting for mitigations of performance degradation.  Enhancements to performance can come later;  prevention of further errors must come first.  Besides, if the bug will depart soon then effort spent understanding it well enough to circumvent it cleverly is effort misplaced, and risky too;  excessive cleverness invites more bugs.

 4:  Supply utility software that scans executable files to detect any vulnerabilities ( or to certify that none exist ) and then to patch vulnerable programs as nearly automatically as possible.
Such a scanner would allow most customers confidently to delay replacing the defective part until a convenient occasion, perhaps forever.  To produce such a scanner is not trivial;  it must avoid false negatives arising from self-modifying code, cope with false positives arising from a variable-length instruction set, and deal with files that get decompressed only during execution.  Aided by electronic bulletin-boards, the scanner can exploit catalogs of programs known to be invulnerable, known to be unpatchable, or known to accept patches supplied with the help of providers of widely used software;  other software experts will help too if not asked to do too much.

The foregoing " Counsel of Perfection " -- more easily said than done -- will appear altogether too high-minded to many a practical man of affairs called in to cope with bad news after it has broken.  Perhaps a policy better than mine exists, yet it must have the same goal, which is to protect a corporation's most valuable asset -- the confidence of its customers.  When that confidence is lost, the consequences are cynicism and distrust that have always, in my experience going back over four decades, incurred costs and lost revenues substantially greater than whatever short-term gain tempted somebody to adopt a mere expedient.  The marketplace soon forgets the technical details of a bug but long remembers the sentiments aroused by the way it was handled.

Therein lies the lesson to be learned from Intel's tragedy:  Bruised by Intel's initial handling of Pentium's flaw, and confused by conflicting pronouncements and superficial journalism, the public has now lost some of its confidence in the reliability of Pentium's floating-point architecture.  That is exasperating because, with its bug corrected, Pentium gives consumers what I deem to be much the most reliable among all currently commercially significant microprocessor arithmetics.  ( Having helped design it that way in 1977, I may be biased. )

Annotated Bibliography:
~~~~~~~~~~~~~~~~~~~~~~~
Anonymous (1994) on p. 5 of Microprocessor Report vol. 8 no. 16 for Dec. 5, 1994:  " ... Intel has offered to replace the processor of any user affected by the problem.  This offer should not have many takers, as the few divisors that generate erroneous results cause errors of a few parts per billion.  Single-precision calculations are not affected at all.  Most PC users don't even use their FPU, and engineers typically use single-precision."  An early false report.

T. Coe and P-T.P. Tang (1995) "It Takes Six Ones to Reach a Flaw", submitted for the 12th IEEE Symp. on Computer Arithmetic.

W.J. Cody et al. (1984) "A Proposed Radix- and Word-length-independent Standard for Floating-point Arithmetic", p. 86-100 of IEEE Micro, Aug. 1984.  This explains IEEE Standards 754 and 854 for Floating-Point Arithmetic better than the texts of the standards.

G. Barrett (1987) "A Formal Approach to Rounding", p. 247-254 of Proc. 8th IEEE Symp. on Computer Arithmetic, ed. by Mary Irwin and R. Stefanelli.  This mentions David Shepherd's 1986 "Proof" that the Inmos T800 conforms to IEEE 754.  I do not know where Shepherd's work has been published, other than as a report from Oxford University's Programming Research Group, nor whether the errors in the T800 have been acknowledged in print or corrected.

M.L. Barton and H. Sharangpani (1994) "Statistical Analysis of Floating Point Flaw in the Pentium(TM) Processor (1994)", an Intel white paper dated November 30th 1994.
J. Fandrianto (1987) "Algorithm for High Speed Shared Radix 4 Division and Radix 4 Square Root", p. 73-79 of Proc. 8th IEEE Symposium on Computer Arithmetic, ed. by Mary Irwin and R. Stefanelli.  This converts 8 leading bits of a carry-save remainder to sign-magnitude and feeds 7 bits of that magnitude to a smaller PLA to select the magnitude of a quotient digit from { 0, 1, 2 }.

J. Jarrett (1995) "Postmortem on the Pentium (TM) Processor Crisis", p. 8-11 of the Jan. 1995 issue of Intel Leads, "published by and for Intel Corporation."  Explains Intel's position.

W. Kahan (1986) "DIVCK", a suite of BASIC programs and documentation, distributed electronically via NETLIB in 1987, written at the behest of Dr. Cleve Moler, then at Intel in Oregon, to test whether binary floating-point division is correctly rounded.  These programs test efficiently only the quotient's last digit.  Updated versions of these and several more programs to test floating-point arithmetic and its library of elementary mathematical functions will appear in a suite soon to be released to the public domain by my former students at Sun Microsystems led by Dr. David G. Hough.

I. Koren (1993) "Computer Arithmetic Algorithms", Prentice-Hall, N.J.  Ch. 7 of this text explains details of SRT division well.  A new edition is imminent.  However, neither its author nor any other I know discusses the challenge posed by black-box testing of SRT dividers.  If I have missed someone I shall welcome being put right;  SRTEST is too simple for me to believe that I am the first to think of it.

C. Moler (1995) "A Tale of Two Numbers", p. 10-12 of MATLAB News & Notes for Winter 1995;  also p. 1 & 16 of SIAM News vol. 28 no. 1 (Jan. 1995).  Recounts the Pentium FDIV bug's history and proposed patch.

G.S. Taylor (1981) "Compatible Hardware for Division and Square Root", p. 127-134 of ...

G.S. Taylor and D.A. Patterson (1981) "VAX Hardware for the Proposed IEEE Floating-Point Standard", p. 190-196 of Proc. 5th IEEE Symp. on Computer Arithmetic.

Acknowledgements:
~~~~~~~~~~~~~~~~~
SRTEST is based upon conversations I held with Dr. George S. Taylor in 1980 when, as a student, he designed an SRT divider very much like Pentium's;  see his 1981 papers.  His design was built as part of a larger project that was discontinued after another student, David Cary, died in a tragic swimming accident.  The divider never got to be tested by a program like SRTEST, which was therefore not distributed.  I am indebted to colleague Prof. L. Rowe for access to his Pentium to run my tests initially, to Intel's H.P. Sharangpani for an early copy of his white paper containing the probabilistic estimates that I disparage herein, to Intel's Jon Marshall for a Pentium to test more thoroughly, and to Tim Coe of Vitesse Semiconductors and Prof. A. Edelman of MIT for indispensable advice and assistance.

=======================================================================

Appendix: Fragility of Low Probabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whenever an engineer's mistake is discovered too late, engineering managers will require assessments of the mistake's impact before they can choose a course of remedial action.  Decisions are often based upon probabilistic estimates of failure rates;  lower estimates incur less desperate remedial activity.  Unfortunately, the lowest estimates tend also to be the most unreliable, and not merely because they may be self-serving.
Circumstances do arise wherein low probabilities lack predictive power, and then decisions based upon them will fare worse than decisions reached after a frank acknowledgement of ignorance.  The kind of probabilistic reasoning customarily employed to predict failure rates for telephone equipment and semiconductor fabrication cannot be employed with confidence to predict failure rates due to an error in the design of an arithmetic unit.  Confidence is undermined by several considerations, foremost among them being the utterly non-random nature of that error after it has been found.

To proceed with probabilistic reasoning we must assume that operands are random, that operations occur with frequencies revealed by random sampling, that inaccuracies in intermediate results propagate into final results at rates determinable by statistical methods, and so on.  Even were every such assumption a reasonably good approximation in its own right, their concatenation into a chain of inference would still be unreasonable, the more so if it predicted an extremely low failure rate.  Predictions like that are often falsified by rare events that have been overlooked completely, and by correlations that have been neglected.

  + Footnote: --------------------------------------------------------+
  |  Among recent completely unanticipated rare events have been ...  |
  |  -- Keeping a Patriot anti-missile battery in continuous action   |
  |     for more than a few hours during the Gulf War of 1991,        |
  |     fatally degrading the battery's interception capability.      |
  |  -- Unseasonably cold weather that embrittled a sealing ring on   |
  |     a solid-fuel booster of the ill-fated Challenger space craft. |
  +--------------------------------------------------------------------+

Probabilistic inferences about rare events are easily vitiated by small correlations that affect commonplace events negligibly.  For instance, suppose a calamity occurs only when two rare events A and B both occur, and let their probabilities be alpha and beta respectively, both very tiny numbers like 1.0 E-6 .  If A and B are independent, the calamity's probability alpha*beta is tinier by far.  But if A and B are slightly correlated, with small correlation coefficient rho , then the probability of calamity can turn out far bigger than alpha*beta , depending much more strongly upon rho than do the probabilities of other compound events like ( A and not B ).  Here are the relevant formulas, simplified by the substitutions a = 1-alpha and b = 1-beta :

    Compound Events        Probabilities
    ~~~~~~~~~~~~~~~~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    A and B (calamity)     ( sqrt(alpha*beta) + rho*sqrt(a*b) )*sqrt(alpha*beta)
    A and not B            ( sqrt(alpha*b) - rho*sqrt(a*beta) )*sqrt(alpha*b)
    B and not A            ( sqrt(a*beta) - rho*sqrt(alpha*b) )*sqrt(a*beta)
    neither A nor B        ( sqrt(a*b) + rho*sqrt(alpha*beta) )*sqrt(a*b)

For example take alpha = 1.0 E-6 , beta = 4.0 E-6 , and try three values rho = 1.0 E-2 ( A slightly correlated with B ), or rho = 0 ( A independent of B ), or rho = -2.0 E-6 ( A very slightly anti-correlated with B ).

                            Probabilities with ...
    Compound Events        rho = 1.0 E-2    rho = 0      rho = -2.0 E-6
    ~~~~~~~~~~~~~~~~~~~    ~~~~~~~~~~~~~    ~~~~~~~~~    ~~~~~~~~~~~~~~
    A and B (calamity)     2.00 E-8         4.00 E-12    1.00 E-17
    A and not B            9.80 E-7         1.00 E-6     1.00 E-6
    B and not A            3.98 E-6         4.00 E-6     4.00 E-6
    neither A nor B        0.999995         0.999995     0.999995

Thus can probabilistic assessments of rare events be destroyed utterly by almost imperceptible correlations that would customarily be ignored quite properly during assessments of populations' central tendencies.  To ignore such correlations just because we cannot ascertain them is methodologically unsound.
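The formulas above are easy to check numerically.  Here is a minimal sketch in C ( my own illustration, not part of SRTEST or its listing;  every identifier in it is hypothetical ) that evaluates the four factored expressions at the three trial values of rho and reproduces the table:

    /* Evaluate the compound-event probabilities for two correlated
       rare events A and B with P(A) = alpha, P(B) = beta and
       correlation coefficient rho, using the factored formulas above
       with a = 1-alpha and b = 1-beta. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double alpha = 1.0e-6, beta = 4.0e-6;
        const double a = 1.0 - alpha, b = 1.0 - beta;
        const double rhos[3] = { 1.0e-2, 0.0, -2.0e-6 };

        for (int i = 0; i < 3; i++) {
            double rho = rhos[i];
            double pAB   = ( sqrt(alpha*beta) + rho*sqrt(a*b) )*sqrt(alpha*beta);
            double pAnB  = ( sqrt(alpha*b) - rho*sqrt(a*beta) )*sqrt(alpha*b);
            double pBnA  = ( sqrt(a*beta) - rho*sqrt(alpha*b) )*sqrt(a*beta);
            double pNone = ( sqrt(a*b) + rho*sqrt(alpha*beta) )*sqrt(a*b);
            printf("rho = %9.2e :  A&B %9.2e   A&notB %9.2e   B&notA %9.2e"
                   "   neither %.6f\n", rho, pAB, pAnB, pBnA, pNone);
        }
        return 0;
    }

Changing rho from 1.0 E-2 to 0 moves the calamity's probability by four orders of magnitude while the other three entries barely move;  that is the point of the table.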
Appendix: Vagaries of Carry-Save
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The carry-save representations of a number X = C + S consist of all pairs C and S that sum to X ;  they are as numerous as are all the binary numbers of the wordsize chosen for C and S .  However, they occur with rather different frequencies, and that is why carry-save so severely complicates the task of testing SRT dividers.  The typical provenance of carry-save representations is illustrated by an example that shows what happens when an addend A is added to X to produce a new pair C' + S' = A + X = A + C + S :

              addend  A    =  000...11110000
            +         C    =  000...11001100
            +         S    =  000...10101010
                              ---------------
        new sum   word S'   = 000...10010110
        new carry word C'/2 = 000...11101000

The bits of C' have been displayed shifted right one bit-position to expose a phenomenon:  aside from common leading zeros, corresponding bits of S' and C'/2 tend to be opposites three times as often as they match.  The significance of this phenomenon will become apparent later.

Quotient digit selection is based upon only a few leading bits of the current remainder X ;  let Z(X) denote their value.  When this is obtained from the leading few bits of C and S in the carry-save representation of X = C+S , the numerical value of X need not determine Z(X) uniquely.  For example, suppose two five-bit carry-save words, with four bits after the point, are chopped to produce an integer Z(X) ;  here are all possibilities ( disregarding the words' order ) for X = 9/8 , each column showing one pair C , S that sums to 9/8 :

    Z(9/8) = 1 for 3 instances:   0.0000   0.0001   0.0010
                                  1.0010   1.0001   1.0000
        ( Carry-save adders cannot yield instances in which both
          words end with ...1 . )

    Z(9/8) = 0 for 7 instances:   0.0011  0.0100  0.0101  0.0110  0.0111  0.1000  0.1001
                                  0.1111  0.1110  0.1101  0.1100  0.1011  0.1010  0.1001

Thus, carry-save renders the " function " Z(X) ambiguous in so far as it may take either of two different adjacent values at different instances of the same numerical value X .  That ambiguity exacerbates the volatility of quotient digit selection, as we shall see now.

Quotient digit selection is intrinsically volatile whenever a final quotient ends in a long string of zeros because then changes in the least-sig. bit of divisor and/or dividend that alter at most the least-sig. bit of the final quotient can change so many earlier selections of quotient digits.  The changes in those selections get undone by carry-propagation after the final rounding.

Carry-save representations of remainders are volatile in a similar way, especially when divisors end in long strings of zeros or of ones, as do SRTEST's.  The ambiguity of Z(X) conducts this volatility to the selection of quotient digits.  Thus, slightly different dividend/divisor pairs that generate almost the same final quotient frequently generate very different sequences of (redundant) quotient digits.  In consequence, the leading digits of divisors and dividends, though they do determine leading digits of final quotients, frequently do not determine as many leading selections of (redundant) quotient digits.

We shall find two practical consequences of this volatility.  First, it will attenuate the capability of tests to uncover certain kinds of errors in whatever PLA or equivalent logic is used for quotient digit selection.  Second, that kind of error will almost certainly get misdiagnosed as a common flaw, namely hypersensitivity to roundoff, in an applications program because the error's effect will flicker out of sight after a change in end-figures of input data.  ( Programmers tend to cope with roundoff very poorly, if at all. )
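Both phenomena above are easy to reproduce in a few lines of code.  The following sketch in C ( mine, purely illustrative, not taken from SRTEST's listing ) performs one carry-save addition step -- sum word by bitwise exclusive-or, carry word by bitwise majority shifted left one place -- then counts, over random operands, how often corresponding bits of the new sum word and the new carry word ( shifted back right, as in the display above ) differ rather than match, and finally re-derives the 3-versus-7 tally for Z(9/8) by enumerating all unordered five-bit pairs summing to 9/8 :

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned long long u64;

    /* One carry-save addition step: three addends in, two words out. */
    static void csa(u64 a, u64 c, u64 s, u64 *c_out, u64 *s_out)
    {
        *s_out = a ^ c ^ s;                           /* bitwise sum      */
        *c_out = ((a & c) | (a & s) | (c & s)) << 1;  /* carries, shifted */
    }

    static u64 rnd64(void)        /* crude random word; its quality is   */
    {                             /* unimportant for this demonstration  */
        u64 x = 0;
        for (int i = 0; i < 4; i++) x = (x << 16) ^ (u64)(rand() & 0xFFFF);
        return x;
    }

    int main(void)
    {
        /* Part 1:  corresponding bits of S' and C'/2 differ about
           three times as often as they match. */
        u64 differ = 0, match = 0;
        for (int t = 0; t < 100000; t++) {
            u64 c2, s2;
            csa(rnd64(), rnd64(), rnd64(), &c2, &s2);
            u64 x = s2 ^ (c2 >> 1);          /* compare S' with C'/2 */
            for (int k = 0; k < 63; k++)     /* skip the top bit     */
                if ((x >> k) & 1) differ++; else match++;
        }
        printf("differ : match = %llu : %llu\n", differ, match);

        /* Part 2:  Z(9/8) from chopped five-bit carry-save pairs;
           the words carry four bits after the point, so 9/8 = 18/16. */
        int z1 = 0, z0 = 0;
        for (int c = 0; c + c <= 18; c++) {  /* unordered pairs: c <= s */
            int s = 18 - c;
            int z = (c >> 4) + (s >> 4);     /* chop both words, add    */
            if (z == 1) z1++; else z0++;
        }
        printf("Z(9/8) = 1 for %d pairs,  Z(9/8) = 0 for %d pairs\n", z1, z0);
        return 0;
    }

The first count comes out near 3 : 1 because, for independent random bits a, c, s , the exclusive-or and the majority of the three agree in only two of the eight equally likely cases.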
What kind of error is so closely linked to carry-save's volatility?  Think of the PLA as a table of quotient digit selections whose every entry we wish to exercise by tests that generate dividends and divisors for that purpose.  The table's most extreme entries -- those on the boundary beyond which selection should never go -- are most likely to be reachable only by deftly orchestrated sequences of division steps for which divisors and dividends must be calculated quite precisely lest, because of volatility, roundoff defeat their purpose.  Lacking knowledge of those sequences, the architect of a test program like SRTEST must exercise those extreme entries some other way, and that turns out to be extremely difficult.

The difficulty arises out of a peculiarity of carry-save adders already mentioned.  For definiteness continue to suppose that Z(X) is an integer obtained by chopping fractional bits.  Without carry-save representation of a non-integer remainder X , its integer part Z(X) is determined uniquely:  Z(X) = N  where  N <= X = N+f < N+1 .  With carry-save there are two possibilities:  Z(X) = N or Z(X) = N-1 .  If one of the two carry-save words representing X = C+S were a uniformly distributed random variable, the former possibility would occur with probability very near the fractional part f of X .  ( A corroborating probabilistic computation has been omitted here. )  However, carry-save adders produce pairs of words of which each tends to have, after their common leading zeros, roughly as many zero bits as the other has ones.  This happens because those adders produce pairs of "sum" and "carry" bits that are three times more likely to be different than to be the same, given randomly chosen addends.  In consequence the error committed when both words' fractional bits are discarded tends to avoid the extremes, 0 and 2 , of its range;  when the fraction f is small the larger possibility Z = N occurs with probability far less than f , and when f is nearly 1 the lesser possibility Z = N-1 has probability far less than 1-f .  That is why some extremal values of Z(X) may occur very infrequently.

In experiments with random numbers, the probability that Z(N+f) = N was observed to be approximated surprisingly well by a smooth function

        arcsin( sin( pi*f )^4 )/pi              if  0 <= f <= 1/2 ,
    1 - arcsin( sin( pi*(1-f) )^4 )/pi          if  1/2 <= f < 1 .

It is a mystery.  If this approximation can be trusted, Z(9/8) = 1 can be expected to occur not 1/8 but somewhat less often than 1% of the time.  Perhaps bias of this kind could be alleviated by retaining one more bit of each carry-save word than will be delivered to Z(X) from their sum, as Fandrianto (1987) did, though doing so might accomplish no more for an SRT divider than render its defect, if there be one, easier to uncover by testing.

???????????????????????????????????????????????????????????????????????
 Awaiting Exposition:
    Tests of SRTEST on artificial flaws in simulated SRT dividers.
    Convert Fortran SRTEST to C .
    Give example of financial miscalculation due to Pentium's bug?
       ( Hearsay on p. 32 of PC Magazine, 7 Feb. 1995 ? )
    Are corrected Pentiums distinguishable visually from older faulty
       chips?  Vaughan Pratt claims that the older ones have a mark
       " SX 857 " on their bottom, newer have " SX 957 ".  But my old
       one is marked " SX 879 " .
???????????????????????????????????????????????????????????????????????
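For the curious, here is a tiny sketch ( my own, purely illustrative and no part of SRTEST ) that tabulates the approximation quoted in the appendix above against the naive prediction f that a uniformly distributed word would give;  at f = 1/8 it yields about 0.007 , consistent with the remark that Z(9/8) = 1 should turn up somewhat less often than 1% of the time:

    /* Tabulate the empirical approximation for Prob( Z(N+f) = N )
       against the naive prediction f. */
    #include <math.h>
    #include <stdio.h>

    static double approx(double f)
    {
        const double pi = 3.14159265358979323846;
        if (f <= 0.5) return asin(pow(sin(pi*f), 4.0))/pi;
        return 1.0 - asin(pow(sin(pi*(1.0 - f)), 4.0))/pi;
    }

    int main(void)
    {
        const double fs[5] = { 0.125, 0.25, 0.5, 0.75, 0.875 };
        for (int i = 0; i < 5; i++)
            printf("f = %5.3f   naive %5.3f   approximation %7.5f\n",
                   fs[i], fs[i], approx(fs[i]));
        return 0;
    }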