SRTEST 15 March 1995
A Test for SRT Division
=========================
**************************************************************
* CAUTION: This is work in progress. It may be WRONG. *
* It is being distributed now only so that my colleagues *
* in academe and in industry, who need it immediately, *
* can try it out and help me get it right. *
**************************************************************
Prof. W. Kahan
Elect. Eng. & Computer Science
University of California
Berkeley CA 94720-1776
(510) 642-5638
Abstract:
~~~~~~~~~
SRTEST is an astonishingly simple computer program devised to test the
correctness of quotient-prediction hardware for binary floating-point
division that uses any Non-Restoring (SRT) algorithm. Nowadays that
hardware often takes the form of a Programmed Logic Array (PLA). How
long SRTEST should run to "prove" a PLA correct remains an open
question; however, SRTEST is intended to uncover a flaw in a PLA
far sooner than random testing can. For example, when it ran on an
Intel Pentium with the notorious binary floating-point divide bug of
1994, SRTEST uncovered the flaw in a tiny fraction of a second after
performing its 356th trial division and a few thousand arithmetic
operations in all, although the flaw had earlier been found to afflict
one division in about 8,800,000,000 with independent random operands.
However, to use either that kind of finding or SRTEST's performance
to predict failure rates among diverse users of a defective divider is
an abuse of statistics. A study of SRTEST's workings sheds light on
that abuse and helps us avoid repeating it.
Table of Contents:
~~~~~~~~~~~~~~~~~~
Introduction
SRTEST's Needs and Limitations
Outline of SRT Division
Carry-Save Remainders
What SRTEST Doesn't Need
Why SRT Division is Hard to Test
Proofs vs. Tests
Where to Test
Strips of Layers of Tiles
Out of Desperation
Scattering Samples
Checking a Quotient
The Proof of a Pudding is in the Eating
A Short History
Bruised Integers
Illusory Randomness
Unanswered Questions
Temptations to be Resisted
Annotated Bibliography
Acknowledgements
Appendix: Fragility of Low Probabilities
Appendix: Vagaries of Carry-Save
Appendix: Program Listing
Introduction:
~~~~~~~~~~~~~
A millennium ago, the Moorish craftsmen who built some of Spain's
architectural jewels manifested humility by incorporating small flaws
into their handiwork lest the deity punish any mortal's presumption of
artistic perfection. To-day's computer architects accomplish the same
effect, but not out of humility. Now complexity punishes the naive
presumption that several million transistors can be commanded each to
do the right deed at the right moment, and now knowledgeable people
expect no modern microprocessor to be perfectly correct.
How much imperfection warrants replacement?
Nobody disputes that a task that can be performed correctly, and whose
correctness is provable mathematically, should be performed correctly
lest our ability to cope with unavoidable imperfections be overwhelmed.
If there is a dispute, it concerns the extent to which avoidable but
humanly inevitable errors should occasion expensive remedies.
Intel faced that question again in the summer of 1994 after a flaw
had come to light in their Pentium (TM) processor's floating-point
division. According to Jim Jarrett (1995), Intel's Head of Investor
Relations, ...
" A task force was formed immediately to characterize the problem and
determine its impact. The group looked at everything from spread-
sheets and computer-aided design programs to fluid dynamics and
financial applications. After a lot of work, a relieved task
force reported last summer that the floating-point flaw was truly
insignificant. Specifically, on a statistical basis, a user
would encounter reduced precision in floating-point divide
operations once every 9 billion random number pairs. For the
average spreadsheet user performing 1,000 floating-point divide
operations daily, this meant reduced precision once every 27,000
years. The user was much more likely to encounter failures in
other PC components; for example, a soft error in a DRAM chip
occurs statistically once every seven years in an 8-MB array.
...
It was insignificant enough that it wasn't even listed on an errata
sheet. The masks were changed to fix the flaw."
Many a statistician and almost all engineers could find no fault with
the task force's analysis nor with the conclusions drawn above nor with
Intel's consequent decision to replace defective Pentiums only for
customers who could demonstrate that its flaw put their work at risk.
That seemed to be a sound decision considering that Pentiums had been
running for over two years before their divider's flaw was noticed.
I disagree.
I shall try to explain how flaws like Pentium's can stay hidden for
astoundingly long times almost regardless of failure rates. I shall
try to explain why, regardless of the correctness of failure rate
statistics for random divisions, subsequent probabilistic conclusions
were illogical and could not justify the decision to replace defective
chips only grudgingly. That decision was soon rescinded, but for
reasons having more to do with lawsuits, public relations and public
confidence than with doubts about its probabilistic justification.
What are the right probabilistic conclusions? Too little is known now
to support scientifically any better assessment of risk than has been
tendered by The MathWorks' Dr. Cleve Moler (1995):
" With the Pentium, there is a very small chance of making a
very large error."
To his assessment this paper will add a strong suggestion, but short
of proof, that practically all users of defective Pentiums must fall
into two classes. By far the larger class consists of users who are so
unlikely to be harmed by the defect that no underwriter could do better
than guess at the premium they should pay for insurance against harm.
The smaller class consists of users almost certain to be harmed by the
defect almost regardless of the frequency of divisions in their work.
Unfortunately, only for users who perform no floating-point divisions
( and hence face no hazard ) do we know a better way than trial and
error to determine the class into which any particular user falls.
Our conclusions leave unanswered the question Intel faced as Jarrett
(1995) rephrased it:
" What happens in the future if we discover a flaw that produces a
problem once every 90 billion random number pairs; do we still
need to replace processors? What about once every 900 billion?
No processor is ever perfect, and if the answer to every bug --
no matter how minor -- is to replace the processor, what does
this do to the economics of computing? What happens to the pace of
innovation in an industry paralyzed by caution?"
Questions like these deserve answers, but these as posed do not, for
they beg the question. They take for granted a kind of randomness that
makes sense for transient disruptions to DRAMs from alpha particles
but makes no sense for a reproducible and systematic arithmetic bug
like Pentium's. The lower the failure rate for random number pairs,
the less relevant it is to a realistic appraisal of the bug's impact,
since hardly anyone computes quotients of uncorrelated random numbers.
The impact of a systematic bug is best revealed through preponderantly
non-probabilistic considerations.
Complexity is one of those considerations. If the bug can be awakened
only by extremely complicated ( and therefore presumably unlikely )
combinations of circumstances, that bug may be eligible for promotion
to a "feature" in the documentation provided also that a relatively
simple software patch will let the sleeping bug lie. Conversely, if
someone unaware of the bug's existence may write a very simple program
that awakens that bug to destructive action, that bug is dangerous.
A simple program that will expose a bug without prior knowledge of that
bug's location makes a good test program. And if such a test program is
general enough and simple enough, any bug it exposes is dangerous.
SRTEST is such a test program. Its design predates Pentium's by a
decade. It hits Pentium's bug first on the 356th division tested
and again 2,294 times in the first million tested. And SRTEST is
very simple; about half a dozen arithmetic instructions in its inner
loop generate test divisors and dividends, and another half dozen test
whether quotients are correctly rounded. That no such test was known
to Pentium's designers is regrettable; it is also hard to believe
that so simple a test has remained entirely unknown until now.
Two integer parameters sum up what SRTEST has to know about the SRT
divider under test: one estimates how many leading bits of divisor go
into the quotient digit selection circuitry; the other estimates each
digit's width. ( These are 5 and 2 for Pentium.) These estimates
need not be perfect. SRTEST need not know how the divider manipulates
remainders nor whether they are held in a carry-save form. Though the
strategy behind SRTEST is simple, its derivation is not, nor is it
yet logically complete; in short, SRTEST works better than I can
explain. None the less, its existence, its simplicity, and its high
yield of division failures cast doubt upon appraisals of risk inferred
from randomly generated failures.
How SRTEST was derived and what it does will be explained. I cannot
say how far to trust SRTEST, nor how far to trust alleged proofs of
correctness since they can lie too. Still, Intel's experience should
teach us what not to do again; that will be summarized. Technical
details about the fragility of extremely low probabilities and the
vagaries of carry-save adders have been banished to appendices along
with a listing of the program SRTEST.
SRTEST's Needs and Limitations:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SRTEST needs values for two of the integer parameters that distinguish
different variants of SRT division; they will be specified later.
SRTEST presumes that floating-point arithmetic operations other than
division, namely addition, subtraction and multiplication, are all
performed correctly as prescribed by IEEE Standard 754 for Binary
Floating-Point Arithmetic. Some versions of SRTEST perform also
integer operations upon the exponent and logical manipulations upon the
significand of a floating-point number but, if portability is assigned
precedence over speed, these can all be replaced by slightly slower
standard floating-point arithmetic operations including IEEE 754's
LogB and ScalB functions. Every version of SRTEST is intended to
run on floating-point hardware whose division, whether built-in or
simulated in software, is the sole rational arithmetic operation in
doubt. Except in that operation, roundoff never figures in versions
of SRTEST that perform logical manipulations upon significands.
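The roundoff-free exponent manipulations just mentioned can be portrayed with Python's `math.frexp` and `math.ldexp`, which play the roles of IEEE 754's LogB and ScalB. This is only an illustrative sketch; SRTEST itself is not written in Python.

```python
import math

def logb(x):
    """Exponent of x, like IEEE 754's LogB: x = m * 2**e, 1 <= |m| < 2."""
    m, e = math.frexp(x)        # frexp yields 0.5 <= |m| < 1, so adjust
    return e - 1

def scalb(x, n):
    """x * 2**n computed exactly (no roundoff), like IEEE 754's ScalB."""
    return math.ldexp(x, n)

x = 1.640625                    # an arbitrary sample significand
assert logb(x) == 0             # 1 <= x < 2
y = scalb(x, 7)                 # scaling by a power of 2 is exact
assert scalb(y, -7) == x        # the round trip loses nothing
```

Because scaling by powers of two is exact in binary floating-point, such operations never contribute roundoff of their own.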
SRTEST is not intended to test efficiently whether division is rounded
correctly; that task is performed by my program DIVCK (1986), which
widespread use has shown to be effective. By the way, Dr. D. Hough
of Sun Microsystems has reported that DIVCK, though not designed to
do what SRTEST does, found an error in Pentium's single-precision
division after two minutes' running. That is dumb luck, and bad luck
for Intel where DIVCK had been run on a simulation of Pentium too
slow to run long enough. Thus do ever more ambitious designs overwhelm
our simulators and other design aids; to redress the imbalance we must
invest more design effort into tests better focused upon our designs'
vulnerabilities, if we can figure out what they are.
SRTEST is focused; it is intended to test efficiently only the logic
that selects quotient digits in SRT division algorithms. Any other
error SRTEST may uncover will have been uncovered accidentally.
Outline of SRT Division:
~~~~~~~~~~~~~~~~~~~~~~~~~~
Almost nothing need be known about Non-Restoring (SRT) division to
appreciate what SRTEST does. That division process will be outlined
briefly here in terms chosen to resemble as much as possible the way
Long Division has long been taught in schools.
During the process, each quotient digit ( it can consist of more than
one bit ) is selected from a table of eligible digits; that table is
often implemented as a PLA. It is best viewed as if organized in rows
and columns. A few leading bits of the divisor determine which column
of the PLA will supply all the quotient digits for any given instance
of division. A few leading bits of the current remainder ( initially
it is the dividend ) determine the row. The quotient digit extracted
from the selected row and column is then multiplied by the divisor and
their product subtracted from the current remainder; the difference is
the new remainder that will determine the next quotient digit.
After sufficiently many quotient digits have been extracted by the
foregoing process, they are assimilated into a conventional quotient
and rounded to produce the final result of the division operation.
Every digit-by-digit divider has a Radix, which indicates roughly the
width of each quotient digit. For long division as taught in schools,
where Radix = 10 (decimal arithmetic), the quotient digits come from
the set { 0, 1, ..., Radix-1 }. Most computers use a power of 2 as
radix; for quotient digits L bits wide, Radix = 2^L .
SRT division derives its speed advantage by selecting each quotient
digit from a redundant set of digits. For instance, when each digit
is one bit wide it will be selected from the set { -1, 0, 1 } ; when
two bits wide it may well be selected from { -2, -1, 0, 1, 2 }, as
happens on Pentium. Redundancy renders unnecessary the revisions of
ill-guessed quotient digits that may be remembered as part of long
division as it is taught in schools. Instead, subsequent digits get
chosen in a way that compensates for any slightly ill-guessed quotient
digit. The speed advantage comes about because, once chosen, no
quotient digit need be revised. The disadvantage is that a quotient
can have many redundant representations from any one of which its
conventional binary representation must ultimately be derived by an
assimilation process akin to subtraction though potentially faster.
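The digit-recurrence outlined above can be sketched in a few lines. The following toy radix-4 divider is a hypothetical model, not any actual hardware's selection logic: it selects each digit from the redundant set { -2, -1, 0, 1, 2 }, updates the remainder, and finally assimilates the signed digits into a conventional quotient.

```python
def srt_divide(x, d, steps=30):
    """Toy radix-4 SRT division of x by d, with 1 <= x, d < 2.
    Returns the signed-digit string and its assimilated value."""
    r = x / 4.0                     # scale so |r| <= (2/3)*d at the start
    digits = []
    for _ in range(steps):
        q = max(-2, min(2, round(4.0 * r / d)))  # digit from {-2,...,2}
        r = 4.0 * r - q * d         # new remainder; stays within (2/3)*d
        digits.append(q)
    # Assimilation: collapse the signed digits into one number.
    value = 0
    for q in digits:
        value = 4 * value + q       # Horner's rule in radix 4
    return digits, value * 4.0 ** (1 - len(digits))

digits, quotient = srt_divide(1.5, 1.25)
assert all(-2 <= q <= 2 for q in digits)   # redundant digit set
assert abs(quotient - 1.5 / 1.25) < 1e-12
```

Note that no digit, once chosen, is ever revised; a slightly ill-guessed digit is compensated by those that follow.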
Carry-Save Remainders:
~~~~~~~~~~~~~~~~~~~~~~
There is another complication: SRT remainders may be represented in a
" Carry-Save " way. That representation of a binary number consists
of two numbers whose sum is the number represented, but which are not
added explicitly to produce that sum lest doing so impose a delay while
carries propagate a long way from right to left. Circuitry to add or
subtract a third number to or from the two can be designed to produce a
result in carry-save form ( as a sum of two new numbers ) extremely
quickly and compactly too because it need not provide for propagations
of carries or borrows. The wider the numbers, the greater is carry-
save's advantage in speed and compactness. For high-precision SRT
division like Pentium's (64 sig. bits), carry-save's advantage is
too great to forego.
The disadvantage is that each numerical value has vastly many different
carry-save representations, which complicates the designer's task a
little and the tester's task a lot. Those complications are explained
in detail in the Appendix: Vagaries of Carry-Save.
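The carry-save idea, and the multiplicity of representations it brings, can be illustrated by a textbook 3-to-2 compressor acting on nonnegative integers; this is a sketch of the principle, not Pentium's circuit.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three addends to a carry-save pair (s, t)
    with s + t == a + b + c, and no carry propagated anywhere."""
    s = a ^ b ^ c                            # per-bit sums, carries ignored
    t = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carries, shifted left
    return s, t

s, t = carry_save_add(0b1011, 0b0110, 0b0011)
assert s + t == 0b1011 + 0b0110 + 0b0011     # value is preserved

# Vastly many pairs represent one value; (s, t) and (20, 0) both do:
assert (s, t) != (20, 0) and s + t == 20
```

Each output bit depends on only three input bits, so the delay is constant no matter how wide the operands; that is the speed advantage, and the many-representations property is the price.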
What SRTEST Doesn't Need:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
SRTEST is oblivious to all the details of the division process except
the divider's Radix and the number N of leading bits of divisor
used to select a column from the PLA. The PLA must have 2^(N-1)
columns. ( Why not 2^N ? Because a normalized divisor's first leading
bit is always 1 .) The time taken by SRTEST is proportional to the
number of columns, other things being the same. Therefore, if N is
not known, supplying an over-estimate instead to SRTEST will merely
slow its work. ( N = 5 for Intel's Pentium, whose PLA has 16
columns.) Similarly, an over-estimate of the Radix is expected only
to retard SRTEST. ( Pentium's Radix = 4 .) However under-estimating
Radix or N may invalidate the test by preventing or excessively
delaying SRTEST's inspection of parts of the PLA.
SRTEST is oblivious to the number of rows in the PLA. That number is
determined by the number of leading bits of the current remainder used
to select a row. ( Intel's Pentium uses 7 bits, so its PLA has
128 rows.) Increasing the number of rows is expected to slow SRTEST
proportionately, but that expectation has not yet been tested.
SRTEST is oblivious to the entries in the PLA whose correctness is
being tested. That is not so paradoxical as it may seem at first; we
care only that the PLA as a whole be correct regardless of how it was
constructed. Besides, we must be grateful that SRTEST does not ask
us for a copy of the PLA; Pentium's table has 2048 entries.
A characteristic of SRT division is that many of the entries in its
PLA could be changed without invalidating the division process. There
are two reasons for this. One is that over a quarter of the PLA's
entries, particularly the topmost and bottommost few in each column,
are " Don't Cares ": they never get selected by the division process.
Redundancy in the set of quotient digits is the second reason; this
implies that many an error in an early quotient digit can be put right
by changing all subsequent digits in ways that will cancel the error at
the end when all digits are assimilated into a correct final quotient.
SRT division's limited capacity for self-correction is advantageous in
some respects; it lets circuit designers vary quotient selection logic
within limits in order to optimize some aspect ( cost or speed ) of a
design. Neither we nor SRTEST wish to be distracted by variations,
in quotient digit sequences, that yield the same quotient at the end.
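The self-correction described above rests on the redundancy of the digit set: distinct signed-digit strings assimilate to the same quotient. A tiny radix-4 instance (the digit strings are hypothetical, chosen only for illustration):

```python
RADIX = 4

def assimilate(digits):
    """Sum a signed-digit string into a conventional radix-4 fraction."""
    return sum(q * RADIX ** (-k) for k, q in enumerate(digits, start=1))

# A first digit guessed one too big is cancelled by an extreme later
# digit: both strings below are redundant representations of 1/8.
assert assimilate([1, -2]) == assimilate([0, 2]) == 0.125
```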
Why SRT Division is Hard to Test:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SRT division's capacity for self-correction is limited in a way that
is vulnerable to errors of a kind that we may expect to uncover only by
extravagantly prolonged testing. Some entries in the PLA, especially
those adjacent to entries that can never be selected, are selected in
so relatively few instances of division that errors in those entries
big enough to defeat self-correction relatively often could still stay
hidden until one of a few critical examples were tested. Other entries
in the PLA, though selected relatively frequently, could be altered
in a way that would fail to be self-corrected only in relatively rare
instances; such errors too could remain unexposed until one of those
rare critical instances were tested. And representing remainders in a
Carry-Save fashion further rarefies critical instances, leaving them
excruciatingly difficult to uncover; see the Carry-Save Appendix.
The statistics of those critical instances are appallingly worse than
anyone had suspected before Pentium's FDIV bug came to light. Its
PLA had errors, 0 instead of 2 , in five out of 2048 entries.
These gross errors were not uncovered until too late despite billions
of divisions tested. Alas, these five entries lay along the boundary
between entries that do and entries that never get selected, so only a
very tiny fraction like 1/8,000,000,000 of all possible divisions
ever selected them and therefore failed to be self-corrected.
This tiny fraction was estimated, after the errors had come to light,
in an Intel "white paper" by Dr. M.L. Barton and H.P. Sharangpani
(1994). They obtained two estimates, one from a Markoff-chain model,
the other from a count of the wrong quotients among more than 10^12
double-precision divisions tested at random on Pentium. The estimates
agreed. Barton and Sharangpani also observed that defects appeared
uniformly distributed in erroneous quotients from their 14th to 52nd
sig. bits; and in so far as divisions were infrequent in applications
software they inferred that all but a tiny fraction of Pentium users
were extremely unlikely ever to be affected adversely by these defects.
If random testing were the only way to test SRT division, we could
not afford it. Neither can we afford the damage done by errors that,
as we shall see, can occur far more often than the tiny fraction cited
above suggests. The tiny fraction merely tells us that random testing
of SRT dividers is too wasteful unless it fails early, and that the
first million or so customers who use such dividers can easily overlook
something wrong for at least several months, rather than that errors
unexposed by billions of random tests must occur too rarely to matter
much to most of us. We shall revisit these issues.
Proofs vs. Tests:
~~~~~~~~~~~~~~~~~~~
If an SRT divider's PLA is correct, it can be proved correct
mathematically, so why not do just that and skip bothersome testing?
The designers of Pentium's PLA did prove it correct mathematically,
or so they thought, but their "proof" of correctness was incorrect.
They had " simplified " it by presuming their PLA sign-symmetric.
A natural presumption, it can be found also in J. Fandrianto (1987),
though he had to modify his PLA a little to make it work correctly.
That presumption is slightly incompatible with carry-save remainders,
but by itself it did no harm to Pentium. However, a characterization
of Pentium's PLA, submitted to the CAD tool that laid it out, was
inadvertently incomplete, rendering five entries in the PLA invalid.
Unluckily, those invalid entries were overlooked because they fell in
that half of the PLA that was presumed identical but for sign to the
half that was checked for correctness. As if in collusion, the two
mistakes concealed each other; still, they need not have been fatal.
The designers' fatal mistake, induced by faith in their "proof" of
correctness, was not to contemplate whether their divider differed
enough from Intel's previous practice to require new tests. Instead,
the corporate department responsible for testing ran its old tests --
random tests and DIVCK -- on a simulation of the new design. We know
now why those tests turned up nothing wrong.
Reviewing the sequence of events, many a practitioner must be thinking
" There go I but for the grace of ... ."
Proofs are not infallible. In the mid 1980s the implementation of
floating-point on the Inmos Transputer T-800 was "proved" correct
mechanically. Its specifications had been spelled out in a language
called " Z " and an implementation microcoded in " Occam," both
languages so contrived that the implementation's compliance with its
specifications could be confirmed by a computerized theorem prover.
However the specifications contained two mistakes. One was unintended:
the floating-point number -0 had been confused with the most negative
binary integer, whose bit-pattern is the same. The other mistake was
misguidedly intentional: the flags for Invalid Operation, Overflow
and Divide-by-Zero were merged into one flag. Consequently the T800
occasionally mishandles -0 , and spuriously signals INVALID when a
computation creates and then consumes infinity in a way that would be
innocuous if it conformed to IEEE Standard 754. Early advertisements
that perfect conformity had been "proved" are still believed widely.
Come to think of it, why should specifications and proofs be any less
vulnerable to error than designs and programs? Specifications are mere
translations of good intentions into formal language; something can
get lost in translation, as poets and contract lawyers know. Proofs
run to considerably greater lengths than the designs and programs they
are intended to validate, so proofs have a larger capture cross-
section for error. We yearn for the sense of security that proofs
provide, but certainty eludes us and security is an illusion.
That is not to say that proofs are worthless. Quite the contrary:
+---------------------------------------------------------------+
\ The best testimony to a design's correctness is a proof. /
\ But it is testimony of uncertain credibility without /
\ corroboration. The best corroboration is another /
\ independent proof provided it is not too long. /
+-----------------------------------------------------+
Tests provide corroboration too, some more so than others. Testimony
from purely random tests is undermined by a suspicion that too many of
them may have fallen into terrain already tested instead of into Terra
Incognita. That is how random tests of Pentium's SRT division were
wasteful; they almost all exercised only the PLA's non-extreme rows,
leaving extreme rows inadequately tested until after billions of tests.
Regardless of their inefficiencies, tests are indispensable; tests of
devices "proved" correct actually test our own thinking. And since
test programs can deceive us so easily, they too must be assayed for
efficiency, proved correct and tested. Of doubts there can be no end
but instead a Law of Diminishing Returns. To enhance the returns from
tests we must concentrate them where they will do more good.
Where to Test:
~~~~~~~~~~~~~~
" Cherchez la femme ! " ... attributed by Alexandre Dumas to his
fictionalized detective Joseph Fouché.
Our motto will be " Seek singularities ! " There lie all errors.
In the theory of complex variables, the singularities of an analytic
function are the boundary points of its domain of analyticity. Inside
its domain an analytic function is smooth, infinitely differentiable.
An aphorism attributed to Riemann goes roughly like this:
You tell me how an analytic function behaves at
and near every one of its singularities; then
I shall tell you what it does everywhere else.
When that aphorism was first uttered, analytic functions being studied
all had countably many singularities, so characterizing a function by
its singularities was the ultimate kind of data compression.
Logical functions of numerical arguments are piecewise constant; their
singularities are the arguments where the functions change value. To
know a logical function at and near its every singularity is to know it
everywhere. To prove the logical function of the PLA correct SRTEST
need check it only at and near every ( whew! ) numerical input where
either the state of the PLA or the desired result changes.
How near is "near" ? Roundoff blurs that question, so SRTEST is
designed to avoid interference from roundoff, as we shall see later.
Let us think of the singularities of the SRT PLA as points plotted in
the ( Divisor, Dividend ) - plane. We need consider only the square
1 <= Divisor < 2 and 1 <= Dividend < 2
since floating-point exponents do not interact with the PLA.
Singular divisors are those at which selection switches from one column
of the PLA to the next, plus endpoints; there are 2^(N-1) + 1 at
Divisor = 1, 1 + h, 1 + 2h, 1 + 3h, ..., 2 - 2h, 2 - h, 2
where h = 1/2^(N-1) . They cut the square into 2^(N-1) vertical
strips, one for each column of the PLA. Each of those strips breaks
into tiles across whose boundaries selection switches from one row of
the PLA to another. In the ( Divisor, Dividend ) - plane, all the
tiles' boundaries constitute the singularities of the SRT PLA. There
are very many of them, far too many for anyone to test them all. At
best, we might hope to test some singularities in a reasonable order.
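The singular divisors just listed are trivial to enumerate; here is a sketch using Pentium's N = 5:

```python
N = 5                                   # leading divisor bits fed to the PLA
h = 1.0 / 2 ** (N - 1)                  # strip width; exact in binary
singular_divisors = [1.0 + j * h for j in range(2 ** (N - 1) + 1)]

assert len(singular_divisors) == 17     # 2^(N-1) + 1 strip boundaries
assert singular_divisors[0] == 1.0 and singular_divisors[-1] == 2.0
# The 2^(N-1) = 16 intervals between them are the strips, one per column.
```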
Some of those boundaries lie on lines that are easy to describe. The
easiest, as we have seen, are the vertical boundaries of the strips.
Next easiest are the horizontal lines across which selection for the
first quotient digit switches. The boundaries across which switching
occurs for subsequent quotient digits are not so easy to describe.
Strips of Layers of Tiles:
~~~~~~~~~~~~~~~~~~~~~~~~~~
In each vertical strip, each quotient digit position generates a layer
of overlapping tiles across whose non-vertical boundaries that digit's
selection changes from one row of the PLA to the next. For each
succeeding quotient digit position, the tiles in its layer become more
numerous and more complicated to describe. The tiles' shapes depend
upon an arcane question about the division operation's implementation:
Are remainders represented in a Carry-Save fashion ?
If remainders were NOT represented in carry-save fashion, every tile
in the K-th quotient digit position's layer would have a simple shape
consisting of a parallelogram with two opposite sides embedded in the
boundaries of a strip and two whose slope is a quotient with zeros in
the K-th digit and beyond. Carry-save complicates the tiles; they
can have complicated interpenetrating shapes depending upon how the
remainder's two carry-save components combine their leading several
bits before they go to the PLA to select a row. This is the kind of
detail we wish not to have to explain to SRTEST, which does not know
even how many rows the PLA has. Were we to try to take such details
into account we might have to write a test program so complicated as
would blatantly invite blunders. Let us put off thinking about details
unless and until we are compelled to abandon hope for a simple test.
SRTEST generates and tests sample points ( Divisor, Dividend ) each
of which touches one tile from each of as many layers as there are
digits in a computed quotient. Each tile is the image, in a strip of
the ( Divisor, Dividend ) square, of an entry in the PLA selected
for the quotient digit position corresponding to the tile's layer. If
that entry is erroneous then, amongst its images, there must be some
tiles which, if touched in a suitable place by a test, will expose
the error. Other tiles, though images of that erroneous entry, will
when tested conceal the error because SRT division is partially self-
correcting, so some of its quotients may be correct despite the error.
None the less, in order to expose whatever error may exist, testing
must ultimately touch every tile; moreover, testing must touch one of
the places in an erroneous tile where self-correction fails though such
places may occupy a relatively tiny fraction of the tile's area. How
can SRTEST touch those places if it shall not know where they are?
Out of Desperation:
~~~~~~~~~~~~~~~~~~~
SRTEST knows locations for no other singularities than the boundaries
of the 2^(N-1) vertical strips, so it distributes its tests with
ever increasing density along and next to those boundaries. If only to
turn this necessity into virtue, we shall contemplate two hypotheses:
Hypothesis V : Every tile has at least one edge that runs along
~~~~~~~~~~~~~~ the vertical boundary of a strip and, whenever
self-correction fails, it fails at or next to
such an edge ( and perhaps elsewhere too ).
Hypothesis W : Self-correction fails in regions shaped roughly
~~~~~~~~~~~~~~ like triangles, sometimes very small, of which
the bigger ones press against tiles' corners.
Hypothesis V is provably true without carry-save representation; my
proof exploits the monotonicity of column selection vs. remainder. And
casual exploration of a few examples turned up no counter-example for
the kind of carry-save representation I tried. Hypothesis W comes
from empirical observation supported by the following thoughts: when
an erroneous quotient digit, too big or too small, cannot be self-
corrected then it is often followed in the correct quotient by at least
one digit that is already extreme, minimal or maximal resp. Quotients
that contain a string of extremal digits occur preponderantly at tiles'
corners.
If hypothesis V holds then running SRTEST long enough will, by
planting a test sample on or next to a vertical edge of every tile,
ensure ultimately that the PLA has been tested exhaustively. But if
hypothesis V fails, tiles that do not touch a strip's vertical edge
will escape testing by SRTEST, which may then fail to test the PLA
exhaustively no matter how long it runs.
Hypothesis W , if true, accounts for SRTEST's efficiency: random
tests farther away from the vertical strip's edges would have to hit
smaller triangles of uncorrectable errors and would miss more often.
True or not, the two hypotheses motivated SRTEST.
Scattering Samples:
~~~~~~~~~~~~~~~~~~~
Random number generation comes naturally to mind as a way to scatter
sample dividends, but random samples yield fewer hits than SRTEST's
simple formula, so it contains no explicit random number generator.
| Footnote: +-----------------------------------------------------+
| This is not to disparage random testing altogether. In general, |
| additional random testing continues to be necessary lest testing |
| focused at known singularities miss others that were overlooked. |
+-------------------------------------------------------------------+
Instead, inspired by Hypotheses V and W , SRTEST chooses divisors
at and near the singular divisors that mark the vertical boundaries of
strips, and chooses dividends that will ultimately, if SRTEST runs
long enough, exhaust all integers of some preassigned width. These
choices are determined by three parameters:
N = the number of leading bits of divisors that go to the PLA.
L = Log2( Radix ) ; i.e., Radix = 2^L .
M = maximum bit-width preassigned for divisors.
SRTEST uses (2*L + 1)*2^(N-1) divisors, each an M-bit integer of
the form
2.0^M - j*2.0^(M-N) + S for j = 0, 1, 2, ..., 2^(N-1)
where S runs through the set
{ -2^(L-1), ..., -4, -2, -1, 0, 1, 2, 4, ..., 2^(L-1) } .
These perturbations S provide divisors on and near the boundaries of
vertical strips scaled up by 2^(M-1) .
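For illustration only, the divisor set just described can be generated by
a short routine. The function name and the parameter values N = 5 ,
L = 2 , M = 31 below are merely illustrative, not SRTEST's own; the
routine follows the displayed formula literally, including the endpoint
j = 2^(N-1) , so its count exceeds (2*L + 1)*2^(N-1) by one boundary's
worth of divisors.

```python
def srtest_divisors(N, L, M):
    """Generate divisors 2^M - j*2^(M-N) + S on and near the scaled
       strip boundaries; N, L, M are as defined in the text."""
    # Perturbations S = { -2^(L-1), ..., -2, -1, 0, 1, 2, ..., 2^(L-1) }
    S = [0] + [sign * 2**k for k in range(L) for sign in (1, -1)]
    divisors = set()
    for j in range(2**(N - 1) + 1):       # j = 0, 1, 2, ..., 2^(N-1)
        for s in S:
            divisors.add(2**M - j * 2**(M - N) + s)
    return sorted(divisors)

# Parameter values chosen only for illustration:
divs = srtest_divisors(5, 2, 31)
```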
Early versions of SRTEST ran S through the set { -1, 0, 1 } , but
patterns of period L in carry-save remainders were noticed by Tim
Coe, who observed that these patterns could severely delay revelation
of some kinds of flaws in a PLA. He proposed that the bigger set could
hasten the revelation of flaws even if it did not increase SRTEST's
yield ( number of failures found per division tested ). His proposal,
augmented slightly as described below, has worked well so far.
SRTEST's dividends start with consecutive odd integers 1, 3, 5, ... .
These ensure that every row of the PLA will be touched as soon as
possible during the earlier quotient digits' selection, and also tend
to generate sequences of extremal ( hence uncorrectable ) quotient
digits when divided by divisors that are adjacent to the edges of the
vertical strips. That tendency is valuable enough to merit enhancement
by the following measures, suggested by Tim Coe's recollections of
conversations with Dan Zuras held at Hewlett-Packard Laboratories.
Each odd integer generates three neighboring dividends. First that
integer is normalized into the range 2^(M-1) <= Dividend < 2^M by
scaling by a power of 2 . Then two more dividends are obtained from
that one by perturbing it slightly: choose a small integer I >= 0 and
add +2^I and -2^I to the normalized Dividend. The choice I = 0
has worked well enough; I = L-1 has worked better. That perturbation
has enhanced the yield of SRTEST over fiftyfold on Pentium.
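The derivation of each dividend triple might be sketched as follows; this
is a paraphrase of the description above, not SRTEST's actual code, and
the function name is invented:

```python
def dividend_triple(odd_integer, M, I):
    """From one odd integer ( assumed at most M bits wide ) derive
       three trial dividends: normalize it into [2^(M-1), 2^M) by a
       power of 2, then perturb the result by +2^I and -2^I ."""
    d = odd_integer
    while d < 2**(M - 1):
        d *= 2                  # exact scaling by a power of 2
    return (d, d + 2**I, d - 2**I)

# The text reports I = L-1 working better than I = 0 ; with L = 2 :
triple = dividend_triple(3, 31, 1)
```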
Since SRTEST may be run for a long time, some provision for stopping
and restarting it is necessary. To that end, the current odd integer
from which three dividends were derived is displayed continually and is
available for initialization when SRTEST is restarted.
Integer formulas prevent the troubles that could be caused by faults in
a compiler's decimal-binary conversion of nonintegers, especially if
that conversion uses the floating-point divide instruction that SRTEST
is testing. Analogous hazards inherent in binary-decimal conversion of
displayed output have been addressed by displaying also in hexadecimal.
Checking a Quotient:
~~~~~~~~~~~~~~~~~~~~
After what ideally would be Q = Dividend/Divisor has been computed
and rounded off, how can its correctness be ascertained without using
possibly incorrect division operations? SRTEST computes the remainder
R = Dividend - Q*Divisor
exactly and checks that it is no bigger than is compatible with a
correctly rounded Q . Two details deserve a few moments' attention:
How is R computed exactly? How big is " compatible " ?
Provided floating-point addition, subtraction and multiplication work
correctly, R can be computed exactly if each factor of Q*Divisor is
first broken into pieces that can be multiplied without roundoff. For
instance, suppose Ql is obtained from Q by rounding it off to half
as many significant bits as would normally be carried; how to do that
will be discussed later. Then Qt = Q - Ql is computed exactly with a
floating-point subtraction, and consists of the trailing half of the
significant bits of Q . Similarly split up Divisor = Dl + Dt . Then
no product Ql*Dl, Ql*Dt, Qt*Dl, Qt*Dt engenders any roundoff, so
R = ((( Dividend - Ql*Dl ) - Ql*Dt ) - Qt*Dl ) - Qt*Dt
exactly provided the subtractions occur without roundoff. For the sake
of simplicity and speed, SRTEST chooses divisors with Dt = 0 , so
R = ( Dividend - Ql*Divisor ) - Qt*Divisor
suffers no rounding error and is therefore computed exactly. ( Should
such half-width Divisors be found inadequate for some reason, SRTEST
could easily be modified to accommodate wider Divisors.)
The simplest way to break Q into a sum Ql + Qt is to copy Q to
Ql and then write zeros over as many trailing significant bits of Ql
as suffice to ensure that Ql*Divisor will suffer no roundoff. This
bit-twiddling can be replaced by a portable but slower alternative:
T = H*Q and Ql = ( Q + T ) - T
using IEEE Standard rounded floating-point arithmetic and a constant
H = 2^M - 1 in which M is the number of trailing bits to be cleared
from Q . This alternative must be used when floating-point arithmetic
carries an odd number 2M-1 of significant bits, in which case M
must barely exceed half that number and every Divisor must be at most
M bits wide; then Qt = Q - Ql will turn out at most M-1 bits wide
but possibly negative. In any event, IEEE Standard 754's rounded
floating-point arithmetic is more than good enough to deliver Qt with
no rounding error, and likewise R .
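Both the splitting trick and the exact-remainder formula can be exercised
in IEEE double precision, where 2M-1 = 53 and so M = 27 . The sketch
below paraphrases the scheme; function names and operands are invented
for illustration and are not SRTEST's.

```python
from fractions import Fraction

def split(q, m=27):
    """Break q into ql + qt, where ql carries at most 53 - m = 26
       significant bits, via the portable trick T = H*q, ql = (q+T)-T ."""
    H = float(2**m - 1)                 # the text's constant H = 2^M - 1
    t = H * q
    ql = (q + t) - t
    return ql, q - ql                   # qt = q - ql is computed exactly

def exact_remainder(dividend, divisor, q):
    """R = ( Dividend - Ql*Divisor ) - Qt*Divisor ; exact provided the
       Divisor is at most 27 bits wide ( the text's Dt = 0 case )."""
    ql, qt = split(q)
    return (dividend - ql * divisor) - qt * divisor

# Illustrative operands: the divisor fits in 27 bits.
dividend, divisor = float(10**15 + 3), 99999989.0
q = dividend / divisor                  # correctly rounded by a sound FPU
r = exact_remainder(dividend, divisor, q)
```

The Fraction class can confirm, in exact rational arithmetic, that the
computed R suffered no rounding at all.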
How big can the remainder R get if the quotient Q is correctly
rounded? Then the error in Q must be smaller than a floating-point
quantity Qe obtained by clearing to zero all the significant bits of
Q except its ( possibly implicit ) leading 1 , and decrementing
its exponent by the number of significant bits carried. Just as Qe
should exceed the magnitude of Q - Dividend/Divisor , so Qe*Divisor
should exceed the magnitude of the remainder R ; otherwise floating-
point division is flawed and fails the test.
Once again, bit-twiddling can be replaced by a portable alternative:
Qe*Divisor = ScalB( Divisor, LogB(Q) - NumberofSigBits )
wherein ScalB and LogB are functions recommended by IEEE Standard
754 but " more honour'd in the breach than the observance." The C
functions frexp and ldexp can be used instead but less conveniently.
Similar steps assist normalization and perturbation of dividends.
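Rendered with Python's counterparts of frexp and ldexp , the
acceptance test looks roughly like this; it is a sketch assuming 53-bit
arithmetic, with names and operands invented for illustration:

```python
import math
from fractions import Fraction

def remainder_small_enough(divisor, q, r, sig_bits=53):
    """Accept quotient q only if |r| < Qe*Divisor , where Qe is q's
       leading bit with its exponent decremented by sig_bits ."""
    _, e = math.frexp(q)                # q = m * 2^e with 0.5 <= m < 1
    logb_q = e - 1                      # exponent of q's leading bit
    return abs(r) < math.ldexp(divisor, logb_q - sig_bits)

# Illustrative operands ( invented for this sketch ):
dividend, divisor = 10**15 + 3, 99999989
q = dividend / divisor                  # correctly rounded by a sound FPU
r = float(Fraction(dividend) - Fraction(q) * divisor)   # exact remainder
```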
The Proof of a Pudding is in the Eating:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The earliest version of SRTEST was programmed in Fortran and run on
a Pentium in December of 1994; it found a flaw in the 13,800 th
quotient tested. When the current version was programmed and run in
February of 1995, it took a tiny fraction of a second to uncover the
first failure in the 17 th decimal of the 356 th quotient tested:
3221225470 BFFFFFFEh
---------- = --------- = 0.9999999996895591417 ,
3221225471 BFFFFFFFh ~~~
not Pentium's 0.9999999996895591047 .
~~~
If SRTEST were less sensitive to errors in quotients it might have
had to wait for the 686 th quotient tested to uncover a failure in
its 8 th decimal:
2684354558 9FFFFFFEh
---------- = --------- = 0.6666666663 ,
4026531839 EFFFFFFFh ~~~
not Pentium's 0.6666666347 .
~~~
The worst relative error found, about 1/2^14 = 6.1 E-5 , occurred in
the 50,096 th quotient tested:
2148268000 800BF7E0h
---------- = --------- = 0.666910 , not Pentium's 0.666869 .
3221225471 BFFFFFFFh ~~ ~~
Almost the same error had first occurred in the 18,932 nd quotient
tested. That Pentium's floating-point division cannot do much worse
has been confirmed by A. Edelman as an extension of results obtained
by Coe and Tang (1995) to be mentioned below.
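On a sound machine the quoted quotients are easily reconfirmed; the
following check compares against just the decimal digits displayed above.

```python
# Recompute the quoted quotients on correct hardware.
q1 = 3221225470 / 3221225471        # = 0xBFFFFFFE / 0xBFFFFFFF
q2 = 2684354558 / 4026531839        # = 0x9FFFFFFE / 0xEFFFFFFF
q3 = 2148268000 / 3221225471        # = 0x800BF7E0 / 0xBFFFFFFF

# Relative error of Pentium's wrong value for q3, about 1/2^14 :
rel_err = abs(0.666869 - q3) / q3
```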
Like most gleaners, SRTEST ran into the Law of Diminishing Returns
after running for a while; its initial yield of one faulty quotient
per 200 or 300 tested dwindled to one per 600,000 . Here is a
tabulation that shows how the yield dwindled:
Total Number of Total Number of
Failures Found Divisions Tested
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~
1 356 - 684
4 836 - 1,119
10 2,275
48 9,476 - 10,194
100 20,756 - 20,950
450 99,956 - 100,284
1,000 285,715
2,295 999,625 - 1,000,285
7,884 9,998,470 - 10,003,509
10,000 19,424,950 - 19,425,865
17,668 99,898,960 - 100,002,685
30,506 994,345,751 - 1,001,442,070
53,935 9,999,879,331 - 10,000,743,330
100,000 47,689,282,440
112,670 56,902,042,535 - ...
Though this yield far surpasses the one failure per 8,000,000,000
divisions that random testing would achieve, its dwindling is
significant for reasons to be
discussed later.
A Short History:
~~~~~~~~~~~~~~~~
Early in the summer of 1994, when comparing another processor's test
results with Pentium's, Intel discovered that the latter's floating-
point division was defective, but kept silent in adherence to a long-
standing policy ( reversed nine months later ). In October Prof.
Thomas M. Nicely, of Lynchburg College Math. Dept. in Virginia,
traced a discrepancy between his computations on other computers and on
Pentium to that defect and brought it to public attention. He
deserves a medal for checking his work. Then Intel compounded its
error by attempting to minimize the bug, thereby ensuring that it
would hatch into a major embarrassment.
Soon afterward Tim Coe, at Vitesse Semiconductor Corp., Camarillo
CA., reverse-engineered Pentium's divider well enough to predict,
contrary to opinions expressed at that time ( cf. Anonymous (1994)),
that Pentium's single-precision division could suffer relative errors
as bad as 6.1 E-5 :
4195835 4005FBh
------- = ------- = 1.33382 , not Pentium's 1.33374 .
3145727 2FFFFFh ~~ ~~
Coe had deduced that five entries in Pentium's PLA that should be
2 were 0 , and that therefore divisors very slightly less than the
five integers 18, 21, 24, 27 and 30, or numbers obtained by scaling
them by powers of 2 , were the divisors most likely to cause errors.
These are five of the seventeen singular divisors near which SRTEST
focuses its testing.
In December 1994, Coe's deductions were confirmed in the white paper
by Barton and Sharangpani, which had been released by Intel to
support its estimates of failure rates like one division in 8.8
billion, one failure in 27,000 years for most spreadsheet users,
and so on. Prof. Nicely had estimated the failure rate for random
divisions at one in 31 billion. Estimates like these were cited to
justify Intel's public stance: Pentium is safe for almost
everybody.
Meanwhile all numbers very slightly less than a small positive integer
had come to be called " Bruised Integers " on the InterNet, where a
brisk correspondence flourished among participants ranging from idle
curiosity seekers to technology managers fearful of worse revelations
to come. Since all failures found initially involved bruised integers,
an analyst within IBM decided to investigate how often they arise out
of decimal-binary conversion errors; for example, " 0.3/0.1 " incurs
two rounding errors before the division is performed, and that is why
double-precision calculation of that quotient produces the bruised
integer 2.999...999 , in error by 1/2^51 . Although its author did
acknowledge that his was a " simplistic analysis," it inspired IBM
to suspend sales of Pentium-based PCs for fear that a spreadsheet
user might encounter a mishap every 24 days on average. IBM's
public stance was that Pentium is unsafe for almost everybody.
Nobody will ever know which estimate, Intel's or IBM's, is closer
to the truth unless someone is employed for several months to compare
results for identical wide-ranging real-life tasks run on Pentiums
with and without the division bug, or else run on a flawed Pentium
using a slow counting option in the software patch described below.
Since almost everybody can see how self-interest aligns with IBM's
and Intel's public stances, the prevailing opinion ( equally
unjustified ) is that the truth must lie somewhere between. This numb
skepticism is a predictable consequence, well known to litigators and
plotters of political campaigns, of conflicting pronouncements, some
reassuring and others frightening. The public interest would be served
better by the promulgation of do-it-yourself triage kits designed to
help individuals assess each his own exposure to hazard.
Bruised Integers:
~~~~~~~~~~~~~~~~~
To help assess the hazards posed by Pentium's FDIV bug, a summary
of what is known about it will be presented here. Five entries in its
PLA are wrong, 0 instead of 2 , in columns belonging to divisors
barely less than the five integers 18 = 12h, 21 = 15h, 24 = 18h,
27 = 1Bh and 30 = 1Eh and multiples of them by powers of 2 ; these
five, of the seventeen singular divisors near which all of SRTEST's
tests occur, are all multiples of 3 . Divisions tend roughly -- not
infallibly -- to fail less often as divisors fall farther short of
these singular divisors; that is why most failures have divisors that
are bruised integers. Division never fails at divisors too far from
singular; Tim Coe and Dr. P-T. Peter Tang (1995) have demonstrated
that every divisor for which division fails on Pentium must, when
scaled by a suitable power of 2 , fall into one of these five narrow
intervals:
11.FCh = 18 - 1/64 < Divisor < 18 = 12.00h , or
14.FCh = 21 - 1/64 < Divisor < 21 = 15.00h , or
17.FCh = 24 - 1/64 < Divisor < 24 = 18.00h , or
1A.FCh = 27 - 1/64 < Divisor < 27 = 1B.00h , or
1D.FCh = 30 - 1/64 < Divisor < 30 = 1E.00h .
These intervals must be practically best-possible because Tim Coe has
found failures as close to their lower boundaries as these divisors:
17.FC01FF02ECh , 1A.FC01FF03Eh , 1D.FC01FF006Ch .
( He has also predicted that failures with random operands should occur
in quotients' 14 th bit-position half as often as in the 16 th and
subsequent even bit-positions, where he expects failures to occur
uniformly distributed.)
Knowing those intervals, Dr. Cleve Moler, Terje Mathisen of Norsk
Hydro in Oslo, and Tim Coe have been able to devise an efficient
software patch to work around defective Pentiums. Briefly, the patch
tests every divisor and decides very quickly whether it lies in one of
those intervals, in which case the potentially incorrect division is
preceded by multiplications of divisor and dividend by 15/16 . That
moves the divisor out of danger and, thanks to Pentium's Double-
Extended registers, produces exactly the same quotient as would have
been obtained by an unflawed machine from Single- and Double-Precision
floating-point operands. As implemented in The MathWorks' MATLAB and
MetaWare's C and C++ compilers, the patch is augmented by an option
that, when turned on, counts the instances when a flawed Pentium's
division fails; this is the slow counting option mentioned above.
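The patch's screening step can be caricatured thus; this is only an
illustration of the interval test described above, not the actual
Moler-Mathisen-Coe code, and the 15/16 scalings are exact only when,
as in Pentium's Double-Extended registers, the products fit without
rounding:

```python
import math

RISKY = (18.0, 21.0, 24.0, 27.0, 30.0)   # the five singular divisors

def divisor_at_risk(divisor):
    """True iff |divisor|, scaled by a power of 2 into [16, 32), falls
       in one of the five intervals ( n - 1/64, n ) of Coe and Tang."""
    m, _ = math.frexp(abs(divisor))      # 0.5 <= m < 1
    d = math.ldexp(m, 5)                 # now 16 <= d < 32
    return any(n - 1.0 / 64.0 < d < n for n in RISKY)

def patched_divide(dividend, divisor):
    if divisor_at_risk(divisor):
        # 15/16 moves the scaled divisor far below its risky interval.
        dividend *= 15.0 / 16.0
        divisor *= 15.0 / 16.0
    return dividend / divisor
```

When the operands are narrow enough that both products are exact, the
patched quotient agrees bit-for-bit with an unflawed machine's.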
In accord with hypothesis W above, the majority of failures have
been found at divisors and dividends that are bruised small integers;
however, the extent of that preponderance may be due in part to the
way searches for failures have been conducted. A substantial fraction
of observed failures do not have bruised integer dividends and, more
ominously, many failures have modest integer divisors. Each row of
the tabulation below shows how many failures were counted as dividends
ran through the first 2^31 odd integers ( exhausting all 31-bit wide
dividends ) for every divisor, each a small integer, in that row. A
disturbing aspect of these failures is that none hit beyond the 26 th
sig. bit of the quotient and three quarters of them hit in the 16 th
to 24 th sig. bits, so they would degrade single-precision results.
-----------------------------------------------------------------------
Five divisors of equal binary widths Count of division failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
in Decimal: 4607, 5375, 6143, 6911, 7679 38
in Hexadecimal: 11FFh, 14FFh, 17FFh, 1AFFh, 1DFFh
in Decimal: 9215, 10751, 12287, 13823, 15359 36
in Hexadecimal: 23FFh, 29FFh, 2FFFh, 35FFh, 3BFFh
in Decimal: 36863, 43007, 49151, 55295, 61439 6355
in Hexadecimal: 8FFFh, A7FFh, BFFFh, D7FFh, EFFFh
in Decimal: 73727, 86015, 98303, 110591, 122879 6642
in Hexadecimal: 11FFFh, 14FFFh, 17FFFh, 1AFFFh, 1DFFFh
in Decimal: 147455, 172031, 196607, 221183, 245759 2070
in Hexadecimal: 23FFFh, 29FFFh, 2FFFFh, 35FFFh, 3BFFFh
in Decimal: 294911, 344063, 393215, 442367, 491519 28242
in Hexadecimal: 47FFFh, 53FFFh, 5FFFFh, 6BFFFh, 77FFFh
in Decimal: 589823, 688127, 786431, 884735, 983039 17648
in Hexadecimal: 8FFFFh, A7FFFh, BFFFFh, D7FFFh, EFFFFh
-----------------------------------------------------------------------
Plots of all known failures have revealed little more than is obvious.
( Some plots seen on the InterNet are misleading because they take no
account of decimal-binary conversion errors incurred before plotting.)
Plotted in the ( Divisor, Dividend ) - plane, failures fell into near
horizontal grass-like streaks corresponding to vulnerable quotients;
in other words, all failures have divisors in the narrow intervals
allowed by Coe and Tang (1995), and have peculiar quotients in so
far as they contain strings of extremal quotient digits.
Perusal of hundreds of thousands of failures indicates that quotients
always exhibit digit patterns peculiar to the interval into which the
divisor falls. For instance, for divisors strictly between 11.FCh
and 12.00h , and for dividends between 10.00h and 20.00h , so
that quotients must lie between 0.E38E38...h and 1.C781AB...h , all
the quotients that Pentium got wrong looked like this in binary:
right quotient = ...WwXx1y 101010Zz...
wrong quotient = ...WwXx1y 011111.....
where W, w, X, x, y, Z, z are bits and W+w = X+x = Z+z = 1 . Blanks
have been inserted just before the first bit that changed in the final
quotient although the error in Pentium's PLA actually changed the
second 1 after the blank into a 0 , after which a negative quotient
digit got selected; its propagated borrows decremented the first 1 .
The numbers of bits after the binary point and ahead of the blank were
always even and at least 12 . Those same even numbers but slightly
different bit patterns arose with divisors in the other four intervals.
A better understanding of these patterns awaits refinements of methods
like those Coe and Tang used for their demonstration cited above.
The patterns are obscured by a coincidence; all the singular divisors
adjacent to Pentium's flaws are multiples of 3 , and generally the
divisors that produce long sequences of extremal quotient digits from
small integer dividends lie near small integer multiples of a divider's
( Radix - 1 ) , here 3 also.
Illusory Randomness:
~~~~~~~~~~~~~~~~~~~~
Other than the foregoing, the failures appear random. Appearances are
deceptive; the failures are not random. They look random just because
we have yet to find a few simple formulas that predict them well. Such
formulas must exist because the failures are "algebraic"; they arise
out of the omission of a term from an expression sent to a CAD tool
that constructed Pentium's PLA. Therefore the failures all lie in a
smooth curve or surface that first gets chopped up and reflected, by
the SRT division process, in a way that resembles the effect of a
kaleidoscope with cracked mirrors. Second, carry-save representation
of remainders further riddles the failures' locus with holes in a way,
still ill understood, that contributes most to the failures' apparent
randomness. But to presume randomness is to compound our ignorance.
There are other reasons to suspect a pattern among the failures. It is
foreshadowed by those failures with small integer divisors. A better
harbinger is the initially high yield of failures exposed by SRTEST :
among its 80 divisors only ten were capable of generating failures,
and those ten did so at the rate of about one failure per 36
divisions among the first 35,715 in which they participated. So high
a failure rate from so simple a program as SRTEST warns us that other
software may generate high failure rates too, but inadvertently.
What kind of applications software is most at risk? The risk has less
to do with an application's data or with its frequency of divisions
(unless that be zero) than with the formulas the software uses. Far
from being independent random variables, dividends and divisors are
generally correlated algebraically by the formulas that generate them,
and therefore lie in curves and surfaces peculiar to the application.
If these figures fall tangent to the curves or surfaces comprising all
failures, then failures will swarm around that application. If, as
is far more likely, intersections are instead transversal (not nearly
tangent) then that application will experience almost no failures.
It follows that almost all applications software can be divided into
two classes. The larger by far consists of software whose failure rate
due to Pentium's defective division is very tiny, much too tiny to
estimate reliably within orders of magnitude. Software almost certain
to fail constitutes the smaller class; containing SRTEST, it cannot
be empty. Unfortunately, to classify any particular program that does
perform floating-point divisions, we do not yet know a better way than
to run it on a defective Pentium for a while.
Why did SRTEST's yield, the number of failures found per division
tested, slowly dwindle as testing continued? Apparently, the more
productive M-bit dividends end in longer bit-strings like ...111111
or ...000000 or ...000001 . The dwindling yield is consistent with
the geometrical picture portrayed above, and hence consistent with the
existence of some simple but not yet recognized formula that would
generate all failures. That formula and SRTEST's formula generate
dividends and divisors falling along curves tangent to each other, but
the neighborhood of tangency contains only earlier dividends SRTEST
obtained from small integers. Later dividends, obtained from bigger
integers, fall far enough from this neighborhood that the failure rate
dwindles to something approaching the rate for random dividends.
Unanswered Questions:
~~~~~~~~~~~~~~~~~~~~~
" Seek Singularities ! " makes a fine maxim but not a firm foundation
for a test program like SRTEST until the singularities' locations are
understood better. SRTEST lacks an adequate theory for its behavior
with carry-save remainders; further investigation is needed badly.
Among questions that come to mind are these:
How best can knowledge of a divider's Radix be exploited?
Without knowing the redundant quotient-digit set nor whether carry-
save remainders are used, can we characterize all the most
vulnerable quotient-digit sequences and then multiply them by
near-singular divisors to get most vulnerable trial dividends?
What can be done to render the program SRTEST invulnerable to the
vagaries of compilers, and hence more widely portable?
If run long enough can SRTEST prove a divider correct? How long?
In our present state of (mis)understanding, we cannot substitute a
test like SRTEST for a mathematical proof of correctness. The best
that might be said for SRTEST is that it beats the Test of Time:
millions of customers used Pentiums for months without noticing any
error in division. That proves customers rarely check their work, and
that very few of them get an opportunity to do what Intel and Nicely
did to discover Pentium's flaw: compare results obtained for the
same computation from different computers.
Besides, what could a customer who noticed a discrepancy do about it?
If caught during debugging, a numerical discrepancy would probably be
attributed to roundoff, which is the paramount cause of discrepancies.
Since rounding errors are so difficult to understand, a programmer can
easily misdiagnose the discrepancy and alter her program ineffectually
and, having changed a divisor by another rounding error, no longer
see the discrepancy. ( This volatility is explained in the Appendix
on the Vagaries of Carry-Save.) Consequently, despite many observed
discrepancies, flawed division would remain unreported.
Ultimate users of numerical software routinely misinterpret incorrect
results. If noticed at all, usually long after they were committed,
errors are often blamed upon roundoff; to evade them, users will
change to different software or change data slightly rather than waste
time performing someone else's debugging duties. Many a computation is
hypothetical, performed to predict consequences of a proposed action;
disagreeable consequences preclude that action. When numerical errors
predict disagreeable consequences mistakenly, no action follows; the
mistake goes unnoticed. Flawed division would remain unnoticed too.
Thus, reports from the field must badly under-report the incidence of
error due to faulty arithmetic so long as a fault remains unsuspected;
an absence of reports during that time cannot be construed afterwards
as evidence that nobody but the first to report it had been affected by
the fault. As a fault becomes public knowledge it becomes a target for
the idly curious and a scapegoat for ills that otherwise resist easy
explanation, and the incidence of error tends to become over-reported.
Worse, conscientious professionals must then start chasing wild geese,
checking old results for contamination and burdening new software with
extra checks against errors that continue, as before, to be caused
almost always by some mistake other than faulty arithmetic.
The computing industry is served badly by the Test of Time; if passed
for a while it conveys a false sense of security; when failed it fails
too late. Tests like SRTEST are far preferable, however imperfect.
Like any test program, SRTEST has to be tested too. It should be run
on numerous PLAs each with a different intentional defect to see how
long SRTEST takes to uncover defects. Such a test is made difficult
by the self-correcting propensity of SRT division; many an intended
"defect" could turn out to be not a defect after all, or to harm only
wide quotients with more digits than the test examines. Whatever the
difficulties, we have to attempt such a test before we can say with
confidence that SRTEST's prompt exposure of Pentium's floating-
point divide bug was no accident.
Temptations to be Resisted:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Neither SRTEST's nor any similar test's results can support estimates
of the "probability" that an unspecified user of defective arithmetic
will suffer from the defect. The probability that a random test would
uncover the defect is logically irrelevant. To see why, suppose that
a divider produced a grossly wrong quotient when and only when division
by 3.0 occurred in double or double-extended precision. Since 3.0
is one of at least 2^52 = 4.5 E 15 double-precision floating-point
numbers, a random test could be expected to uncover this bug with
probability less than about 2.3 E-16 , which is far smaller than any
of the probabilities that have been bandied about concerning Pentium's
bug. However, division by 3.0 is so ubiquitous that its defect
could not reasonably be expected to stay hidden for so long as one day.
Similarly, SRTEST's yield of about one defective division per 250
of them during its first few seconds' running is no probability; if a
substantial fraction of Pentium owners faced that kind of hazard they
would have drowned Intel with complaints a few weeks after purchase.
We must not promulgate dubious statistics, especially if they tend to
make light of an obvious bug's impact upon the public, lest opponents
respond with their own self-serving statistics tending the opposite
way. Both sides will claim their predictions borne out by events. The
controversy will draw attention to ever more embarrassing mishaps that
turn up inevitably, most of them misattributed to the bug; and then
prophets of doom, cartoonists and lawyers will close in to exact the
public's revenge for a betrayal of their trust.
Here is a better policy:
1: Never try to hide a bug nor postpone acknowledging its existence;
the attempt merely destroys the public's confidence in whatever you
are obliged to say at last. The rarer the bug, the more careful
you must be to say only what can be said with certainty, since the
statistics of extremely rare events, far out on the tails of
probability distributions, are notoriously treacherous; see the
Appendix on the Fragility of Low Probabilities for an explanation.
Attempts to estimate extremely low rates of incidence are too
vulnerable to unanticipated correlations to be worth publishing.
2: Undertake to replace defective parts as quickly as corrected parts
can be made available. Past experience and statistical methods can
serve validly to estimate the rate at which demand for replacements
may ramp up, but alluding to uncertifiable statistics in order to
dampen public demand may merely exacerbate legal liabilities.
3: Provide software patches to work around the bug, if that is
possible, as soon as possible and without waiting for mitigations
of performance degradation. Enhancements to performance can come
later; prevention of further errors must come first. Besides, if
the bug will depart soon then effort spent understanding it well
enough to circumvent it cleverly is effort misplaced, and risky
too; excessive cleverness invites more bugs.
4: Supply utility software that scans executable files to detect any
vulnerabilities ( or to certify that none exist ) and then to
patch vulnerable programs as nearly automatically as possible.
Such a scanner would allow most customers confidently to delay
replacing the defective part until a convenient occasion, perhaps
forever. To produce such a scanner is not trivial; it must avoid
false negatives arising from self-modifying code, cope with false
positives arising from a variable-length instruction set, and deal
with files that get decompressed only during execution. Aided by
electronic bulletin-boards, the scanner can exploit catalogs of
programs, some known to be invulnerable, some to be unpatchable,
and some to accept patches supplied with the help of providers of
widely used software; and other software experts will help too if
not asked to do too much.
The foregoing " Counsel of Perfection " -- more easily said than done
-- will appear altogether too high-minded to many a practical man of
affairs called in to cope with bad news after it has broken. Perhaps a
policy better than mine exists, yet it must have the same goal, which
is to protect a corporation's most valuable asset -- the confidence of
its customers. When that confidence is lost, the consequences are
cynicism and distrust that have always, in my experience going back
over four decades, incurred costs and lost revenues substantially
greater than whatever short-term gain tempted somebody to adopt a mere
expedient. The marketplace soon forgets the technical details of a bug
but long remembers the sentiments aroused by the way it was handled.
Therein lies the lesson to be learned from Intel's tragedy:
Bruised by Intel's initial handling of Pentium's flaw, and
confused by conflicting pronouncements and superficial journalism,
the public has now lost some of its confidence in the reliability
of Pentium's floating-point architecture. That is exasperating
because, with its bug corrected, Pentium gives consumers what I
deem to be much the most reliable among all currently commercially
significant microprocessor arithmetics. ( Having helped design it
that way in 1977, I may be biased.)
Annotated Bibliography:
~~~~~~~~~~~~~~~~~~~~~~~
Anonymous (1994) on p. 5 of Microprocessor Report vol. 8 no. 16 for
    Dec. 5, 1994: " ... Intel has offered to replace the processor of
    any user affected by the problem. This offer should not have many
    takers, as the few divisors that generate erroneous results cause
    errors of a few parts per billion. Single-precision calculations
    are not affected at all. Most PC users don't even use their FPU,
    and engineers typically use single-precision." Early false report.
G. Barrett (1987) "A Formal Approach to Rounding" p. 247-254 of Proc.
    8th IEEE Symp. on Computer Arithmetic, ed. by Mary Irwin and R.
    Stefanelli. This mentions David Shepherd's 1986 "Proof" that
    the Inmos T800 conforms to IEEE 754. I do not know where
    Shepherd's work has been published, other than as a report from
    Oxford University's Programming Research Group, nor whether the
    errors in the T800 have been acknowledged in print or corrected.
M.L. Barton and H. Sharangpani (1994) "Statistical Analysis of
    Floating Point Flaw in the Pentium(TM) Processor (1994)", an
    Intel white paper dated November 30th, 1994.
W.J. Cody et al. (1984) "A Proposed Radix- and Word-length-independent
    Standard for Floating-point Arithmetic" p. 86-100 of IEEE Micro,
    Aug. 1984. This explains IEEE Standards 754 and 854 for Floating-
    Point Arithmetic better than the texts of the standards.
T. Coe and P-T.P. Tang (1995) "It Takes Six Ones to Reach a Flaw"
    submitted for the 12th IEEE Symp. on Computer Arithmetic.
J. Fandrianto (1987) "Algorithm for High Speed Shared Radix 4 Division
    and Radix 4 Square Root" p. 73-79 of Proc. 8th IEEE Symposium on
    Computer Arithmetic, ed. by Mary Irwin and R. Stefanelli. This
    converts 8 leading bits of a carry-save remainder to sign-
    magnitude and feeds 7 bits of that magnitude to a smaller PLA
    to select the magnitude of a quotient digit from { 0, 1, 2 }.
J. Jarrett (1995) "Postmortem on the Pentium (TM) Processor Crisis" p.
    8-11 of the Jan. 1995 issue of Intel Leads, "published by and
    for Intel Corporation." Explains Intel's position.
W. Kahan (1986) "DIVCK", a suite of BASIC programs and documentation,
    distributed electronically via NETLIB in 1987, written at the
    behest of Dr. Cleve Moler, then at Intel in Oregon, to test
    whether binary floating-point division is correctly rounded. These
    programs test efficiently only the quotient's last digit. Updated
    versions of these and several more programs to test floating-point
    arithmetic and its library of elementary mathematical functions
    will appear in a suite soon to be released to the public domain by
    my former students at Sun Microsystems led by Dr. David G. Hough.
I. Koren (1993) "Computer Arithmetic Algorithms" Prentice-Hall, N.J.
    Ch. 7 of this text explains details of SRT division well. A new
    edition is imminent. However, neither its author nor any other I
    know discusses the challenge posed by black-box testing of SRT
    dividers. If I have missed someone I shall welcome being put
    right; SRTEST is too simple for me to believe that I am the first
    to think of it.
C. Moler (1995) "A Tale of Two Numbers" p. 10-12 of MATLAB News & Notes
    for Winter 1995; also p. 1 & 16 of SIAM News vol. 28 no. 1 (Jan.
    1995). Recounts the Pentium FDIV bug's history and proposed patch.
G.S. Taylor (1981) "Compatible Hardware for Division and Square Root"
    p. 127-134 of ...
G.S. Taylor and D.A. Patterson (1981) "VAX Hardware for the Proposed
    IEEE Floating-Point Standard" p. 190-196 of Proc. 5th IEEE Symp.
    on Computer Arithmetic.
Acknowledgements:
~~~~~~~~~~~~~~~~~
SRTEST is based upon conversations I held with Dr. George S. Taylor
in 1980 when, as a student, he designed an SRT divider very much
like Pentium's; see his 1981 papers. His design was built as part
of a larger project that was discontinued after another student, David
Cary, died in a tragic swimming accident. The divider never got to be
tested by a program like SRTEST, which was therefore not distributed.
I am indebted to colleague Prof. L. Rowe for access to his Pentium
to run my tests initially, to Intel's H.P. Sharangpani for an early
copy of his white paper containing the probabilistic estimates that I
disparage herein, to Intel's Jon Marshall for a Pentium to test
more thoroughly, and to Tim Coe of Vitesse Semiconductors and
Prof. A. Edelman of MIT for indispensable advice and assistance.
=======================================================================
Appendix: Fragility of Low Probabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Whenever an engineer's mistake is discovered too late, engineering
managers will require assessments of the mistake's impact before they
can choose a course of remedial action. Decisions are often based upon
probabilistic estimates of failure rates; lower estimates incur less
desperate remedial activity. Unfortunately, the lowest estimates tend
also to be the most unreliable, and not merely because they may be
self-serving. Circumstances do arise wherein low probabilities lack
predictive power, and then decisions based upon them will fare worse
than decisions reached after a frank acknowledgement of ignorance.
The kind of probabilistic reasoning customarily employed to predict
failure rates for telephone equipment and semiconductor fabrication
cannot be employed with confidence to predict failure rates due to an
error in the design of an arithmetic unit. Confidence is undermined by
several considerations, foremost among them being the utterly non-
random nature of that error after it has been found. To proceed with
probabilistic reasoning we must assume that operands are random, that
operations occur with frequencies revealed by random sampling, that
inaccuracies in intermediate results propagate into final results at
rates determinable by statistical methods, and so on. Were every such
assumption a reasonably good approximation in its own right, yet their
concatenation into a chain of inference would still be unreasonable,
the more so if it predicted an extremely low failure rate. Predictions
like that are often falsified by rare events that have been overlooked
completely, and by correlations that have been neglected.
+ Footnote: ---------------------------------------------------------+
| Among recent completely unanticipated rare events have been ... |
| -- Keeping a Patriot anti-missile battery in continuous action |
| for more than a few hours during the Gulf War of 1991, |
| fatally degrading the battery's interception capability. |
| -- Unseasonably cold weather that embrittled a sealing ring on |
| a booster rocket of the ill-fated Challenger spacecraft. |
+---------------------------------------------------------------------+
Probabilistic inferences about rare events are easily vitiated by small
correlations that affect commonplace events negligibly. For instance,
suppose a calamity occurs only when two rare events A and B both
occur, and let their probabilities be α and β respectively, both
very tiny numbers like 1.0 E-6 . If A and B are independent, the
calamity's probability αβ is tinier by far. But if A and B are
slightly correlated, with small correlation coefficient ρ , then the
probability of calamity can turn out far bigger than αβ , depending
much more strongly upon ρ than do the probabilities of other compound
events like ( A and not B ). Here are relevant formulas simplified by
the substitutions a = 1-α and b = 1-β :
Compound Events           Probabilities
~~~~~~~~~~~~~~~~~~~       ~~~~~~~~~~~~~~~~~~~~~~~~
A and B (calamity)        ( √(αβ) + ρ√(ab) ) √(αβ)
A and not B               ( √(αb) - ρ√(aβ) ) √(αb)
B and not A               ( √(aβ) - ρ√(αb) ) √(aβ)
neither A nor B           ( √(ab) + ρ√(αβ) ) √(ab)
For example take α = 1.0 E-6 , β = 4.0 E-6 , and try three values
    ρ = 1.0 E-2   ( A slightly correlated with B ), or
      = 0         ( A independent of B ), or
      = -2.0 E-6  ( A very slightly anti-correlated with B ).
                          Probabilities with ...
Compound Events        ρ = 1.0 E-2   ρ = 0       ρ = -2.0 E-6
~~~~~~~~~~~~~~~~~~~    ~~~~~~~~~~~   ~~~~~~~~~   ~~~~~~~~~~~~
A and B (calamity)     2.00 E-8      4.00 E-12   1.00 E-17
A and not B            9.80 E-7      1.00 E-6    1.00 E-6
B and not A            3.98 E-6      4.00 E-6    4.00 E-6
neither A nor B        0.999995      0.999995    0.999995
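The table above can be reproduced numerically. Here is a minimal
sketch in Python ( the function name is mine, not part of SRTEST );
it evaluates the four compound-event formulas directly:

```python
from math import sqrt

def compound_probabilities(alpha, beta, rho):
    """Probabilities of the four compound events for two rare events
    A, B with P(A) = alpha, P(B) = beta and correlation rho."""
    a, b = 1.0 - alpha, 1.0 - beta
    return {
        "A and B":         (sqrt(alpha*beta) + rho*sqrt(a*b)) * sqrt(alpha*beta),
        "A and not B":     (sqrt(alpha*b) - rho*sqrt(a*beta)) * sqrt(alpha*b),
        "B and not A":     (sqrt(a*beta) - rho*sqrt(alpha*b)) * sqrt(a*beta),
        "neither A nor B": (sqrt(a*b) + rho*sqrt(alpha*beta)) * sqrt(a*b),
    }

# The example from the text: alpha = 1.0 E-6, beta = 4.0 E-6.
for rho in (1.0e-2, 0.0, -2.0e-6):
    p = compound_probabilities(1.0e-6, 4.0e-6, rho)
    print("%g -> %.2e" % (rho, p["A and B"]))  # calamity's probability
```

Running the loop reproduces the calamity row of the table: about
2.00 E-8 , 4.00 E-12 and 1.00 E-17 for the three correlations,
showing how a correlation of a mere 0.01 inflates the naive
independent estimate by a factor of five thousand.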
Thus can probabilistic assessments of rare events be destroyed utterly
by almost imperceptible correlations that would customarily be ignored
quite properly during assessments of populations' central tendencies.
To ignore such correlations just because we cannot ascertain them is
methodologically unsound.
Appendix: Vagaries of Carry-Save
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The carry-save representations of a number X = C + S consist of all
pairs C and S that sum to X ; they are as numerous as are all the
binary numbers of the wordsize chosen for C and S . However, they
occur with rather different frequencies, and that is why carry-save so
severely complicates the task of testing SRT dividers.
The typical provenance of carry-save representations is illustrated by
an example that shows what happens when an addend A is added to X
to produce a new pair C'+ S' = A + X = A + C + S :
addend A = 000...11110000
+ C = 000...11001100
+ S = 000...10101010
------ -----------------
new sum word S' = 000...10010110
new carry word C'/2 = 000...11101000
The bits of C' have been displayed shifted right one bit-position to
expose a phenomenon: Aside from common leading zeros, corresponding
bits of S' and C'/2 tend to be opposites three times as often as they
match. The significance of this phenomenon will become apparent later.
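One carry-save addition step can be mimicked in a few lines. Here is
a sketch in Python ( the function name is mine ) treating words as
unsigned integers; it verifies the worked example above:

```python
def csa(a, c, s):
    """One carry-save addition step: absorb addend a into the pair
    (c, s) representing x = c + s, yielding a new pair (c2, s2) with
    c2 + s2 == a + c + s.  No carry propagates across bit positions."""
    s2 = a ^ c ^ s                              # bitwise sum, no carries
    c2 = ((a & c) | (a & s) | (c & s)) << 1     # majority bits, shifted
    return c2, s2

# The example from the text:
a = 0b11110000   # addend A
c = 0b11001100   # old carry word C
s = 0b10101010   # old sum word S
c2, s2 = csa(a, c, s)
assert s2 == 0b10010110         # new sum word S'
assert c2 >> 1 == 0b11101000    # C'/2 as displayed above
assert c2 + s2 == a + c + s     # the invariant carry-save preserves
```

Note that each bit of S' is the three-way exclusive-or of the input
bits while each bit of C' is their majority; of the eight input-bit
combinations, six make those two outputs differ and only two make
them match, which is the three-to-one tendency remarked upon above.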
Quotient digit selection is based upon only a few leading bits of the
current remainder X ; let Z(X) denote their value. When this is
obtained from the leading few bits of C and S in the carry-save
representation of X = C+S , the numerical value of X need not
determine Z(X) uniquely.
For example, suppose two five-bit carry-save words, with four bits
after the point, are chopped to produce an integer Z(X) ; here are
all possibilities ( disregarding words' order ) for X = 9/8 :
Z(9/8) = 1 for 3 instances:
0.0000 0.0001 0.0010 ( Carry-save adders cannot
1.0010 1.0001 1.0000 yield instances in which both
words end with .....1 .)
Z(9/8) = 0 for 7 instances:
0.0011 0.0100 0.0101 0.0110 0.0111 0.1000 0.1001
0.1111 0.1110 0.1101 0.1100 0.1011 0.1010 0.1001
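Those tallies can be checked by brute force. A short Python sketch
( the variable names are mine ) counts each five-bit word in units of
sixteenths, so X = 9/8 corresponds to word values summing to 18 :

```python
# Each five-bit word with four bits after the point is an integer
# multiple of 1/16, i.e. an integer 0..31 in units of sixteenths.
# Chopping a word to an integer keeps only its leading bit: w // 16.
counts = {0: 0, 1: 0}
for c in range(32):
    for s in range(c, 32):         # disregard the words' order
        if c + s == 18:            # the pair represents X = 18/16 = 9/8
            z = c // 16 + s // 16  # Z(X): chop each word, then add
            counts[z] += 1
print(counts)   # {0: 7, 1: 3}, matching the instances listed above
```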
Thus, carry-save renders the " function " Z(X) ambiguous in so far
as it may take either of two different adjacent values at different
instances of the same numerical value X . That ambiguity exacerbates
the volatility of quotient digit selection, as we shall see now.
Quotient digit selection is intrinsically volatile whenever a final
quotient ends in a long string of zeros because then changes in the
least-sig. bit of divisor and/or dividend that alter at most the least-
sig. bit of the final quotient can change so many earlier selections of
quotient digits. The changes in those selections get undone by carry-
propagation after the final rounding.
Carry-save representations of remainders are volatile in a similar way,
especially when divisors end in long strings of zeros or of ones, as
do SRTEST's. The ambiguity of Z(X) conducts this volatility to the
selection of quotient digits. Thus, slightly different dividend/-
divisor pairs that generate almost the same final quotient frequently
generate very different sequences of (redundant) quotient digits. In
consequence, the leading digits of divisors and dividends, though
they do determine leading digits of final quotients, frequently do not
determine as many leading selections of (redundant) quotient digits.
We shall find two practical consequences of this volatility. First,
it will attenuate the capability of tests to uncover certain kinds of
errors in whatever PLA or equivalent logic is used for quotient digit
selection. Second, that kind of error will almost certainly get
misdiagnosed as a common flaw, namely hypersensitivity to roundoff,
in an applications program because the error's effect will flicker out
of sight after a change in end-figures of input data. ( Programmers
tend to cope with roundoff very poorly if at all.)
What kind of error is so closely linked to carry-save's volatility?
Think of the PLA as a table of quotient digit selections whose every
entry we wish to exercise by tests that generate dividends and divisors
for that purpose. The table's most extreme entries ÄÄ those on the
boundary beyond which selection should never go ÄÄ are most likely to
be reachable only by deftly orchestrated sequences of division steps
for which divisors and dividends must be calculated quite precisely
lest, because of volatility, roundoff defeat their purpose. Lacking
knowledge of those sequences, the architect of a test program like
SRTEST must exercise those extreme entries some other way, and that
turns out to be extremely difficult.
The difficulty arises out of a peculiarity of carry-save adders already
mentioned. For definiteness continue to suppose that Z(X) is an
integer obtained by chopping fractional bits. Without carry-save
representation of a non-integer remainder X , its integer part Z(X)
is determined uniquely:
Z(X) = N ≤ X = N+f < N+1 .
With carry-save there are two possibilities: Z(X) = N or Z(X) = N-1.
If one of the two carry-save words representing X = C+S were a
uniformly distributed random variable, the former possibility would
occur with probability very near the fractional part f of X . ( A
corroborating probabilistic computation has been omitted here.)
However, carry-save adders produce pairs of words of which each tends
to have, after their common leading zeros, roughly as many zero bits
as the other has ones. This happens because those adders produce pairs
of "sum" and "carry" bits that are three times more likely to be
different than to be the same, given randomly chosen addends. In
consequence the error committed when both words' fractional bits are
discarded tends to avoid the extremes, 0 and 2 , of its range;
when the fraction f is small the larger possibility Z = N occurs
with probability far less than f , and when f is nearly 1 the
lesser possibility Z = N-1 has probability far less than 1-f . That
is why some extremal values of Z(X) may occur very infrequently.
In experiments with random numbers, the probability that Z(N+f) = N
was observed to be approximated surprisingly well by a smooth function
arcsin( sin( πf )^4 )/π if 0 ≤ f ≤ 1/2 ,
1 - arcsin( sin( π(1-f) )^4 )/π if 1/2 ≤ f < 1 .
It is a mystery. If this approximation can be trusted, Z(9/8) = 1
can be expected to occur not 1/8 but somewhat less often than 1% of
the time. Perhaps bias of this kind could be alleviated by retaining
one more bit of each carry-save word than will be delivered to Z(X)
from their sum, as Fandrianto (1987) did, though doing so might
accomplish no more for an SRT divider than render its defect, if
there be one, easier to uncover by testing.
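The empirical fit's prediction for the example Z(9/8) , where
f = 1/8 , is easily evaluated. A Python sketch ( the function name is
mine; the formula is only the empirical fit quoted above, not an
exact result ):

```python
from math import asin, sin, pi

def prob_Z_is_N(f):
    """Empirical fit for the probability that chopping a carry-save
    pair representing X = N + f yields Z(X) = N rather than N-1."""
    if 0 <= f <= 0.5:
        return asin(sin(pi*f)**4) / pi
    return 1 - asin(sin(pi*(1-f))**4) / pi

# For X = 9/8, f = 1/8: the fit predicts roughly 0.68 %, far rarer
# than the naive estimate f = 12.5 % for a uniformly random word.
p = prob_Z_is_N(0.125)
print("%.2f %%" % (100*p))
```

The result, about 0.68 % , bears out the claim that Z(9/8) = 1 can
be expected somewhat less often than 1% of the time; and at f = 1/2
the fit yields 1/2 exactly, as symmetry demands.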
???????????????????????????????????????????????????????????????????????
Awaiting Exposition:
Tests of SRTEST on artificial flaws in simulated SRT dividers.
Convert Fortran SRTEST to C .
Give example of financial miscalculation due to Pentium's bug?
( Hearsay on p. 32 of PC Magazine, 7 Feb. 1995 ?)
Are corrected Pentiums distinguishable visually from older faulty
chips? Vaughan Pratt claims that the older ones have a mark
" SX 857 " on their bottom, newer have " SX 957 ". But my
old one is marked "SX 879" .
???????????????????????????????????????????????????????????????????????