FCLASS:  Proposed Classification of Standard Floating-Point Operands
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                               W. Kahan
                            23 March 2002

 (This document is written in the ASCII format common to all IBM PCs
  and readable in MS DOS using its  EDIT ...  or  MORE < ...  commands.
  Almost all the document is readable in other operating systems too.)

Abstract
~~~~~~~~
This document names attributes of floating-point numbers and standard
formats that some programs will occasionally have to discover and
manipulate.  They need support from programming languages.  Normally,
attributes of floating-point variables' _formats_ such as byte-widths
are needed at compile-time;  attributes of floating-point variables'
_values_ such as their signs are needed at run-time.

Compile-time Attributes
~~~~~~~~~~~~~~~~~~~~~~~
Programmers who use EQUIVALENCE declarations in FORTRAN, or STRUCT
and UNION declarations in C, to assemble data-structures with fields
that vary at run-time must first discover the byte-widths of
floating-point formats.  Most formats specified by IEEE Standard 754
for Binary Floating-Point Arithmetic have standard widths known in
advance regardless of platform;  but the EXTENDED formats' widths
differ on different hardware and sometimes for different compilers on
the same hardware.  This is why compilers must provide a compile-time
function  WIDTH(X)  or  SIZEOF(X)  that returns the width in bytes of
memory occupied by floating-point variables of the same declared type
as  X .  (Register-widths are not relevant here.)  The WIDTH function
returns integer values ...

     Name of IEEE 754 Format                 Width in Bytes
     ~~~~~~~~~~~~~~~~~~~~~~~                 ~~~~~~~~~~~~~~
     Single-Precision  (float)                     4
     Single-Extended                               6  (or more)
     Double-Precision  (double)                    8
     Double-Extended  (long double)               10, 12 or 16
     Quadruple-Precision  (long double)           16
   ( Doubled-Double is NOT standardized           16 )

This list will lengthen as other floating-point formats come into
use, if they ever do.

Languages must provide locutions that permit  WIDTH(...)  to figure
in structure declarations that are portable in the sense that a
program recompiled elsewhere will still run correctly even if the
data-structures it creates may not be read correctly when copied
verbatim from one computer's memory to another.  Of course, such
programs are somewhat like suicide:-  something to be discouraged but
not prevented.  Still, the necessary locutions, like conditional
compilation, have many important applications.

Another attribute needed at compile-time is the "Little-Endian" or
"Big-Endian" affiliation of a floating-point format.  That is one of
a few attributes, like whether the significand's leading bit is
explicit (as it is for all current implementations of Double-Extended
formats) or not (Single, Double and Quadruple precisions), needed for
bit-twiddling of floating-point numbers.  This practice too is
something best discouraged but not prevented, especially not if a
language fails to supply built-in run-time functions of the kind to
be discussed next.  Bit-twiddling is forced upon a programmer who
must implement her own versions of these functions either because
they are lacking in her chosen language or because its versions run
too slowly.  Bit-twiddling is also necessary in software that tests
floating-point hardware upon specially patterned operands created
using only integer hardware.  The specifications of fields for sign,
exponent and significand are the province of another standard,
IEEE 1596.5 on Data Communication, so they need not be explored here.
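For what it is worth, C's  sizeof  already plays the role of WIDTH(X)
at compile-time, and the Endian affiliation can be probed by just the
sort of bit-twiddling mentioned above.  The following sketch is my
illustration, not part of the proposal;  the probe exploits the fact
that  -0.0  has only its sign bit set:

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {   /* sizeof is C's compile-time analogue of WIDTH(X);  only
          long double's answer varies from platform to platform. */
       printf("float       : %zu bytes\n", sizeof(float));
       printf("double      : %zu bytes\n", sizeof(double));
       printf("long double : %zu bytes\n", sizeof(long double));

       /* Endian test by bit-twiddling:  -0.0 has only its sign
          bit (the top bit) set, so byte 0 reveals which end of
          memory holds the top of the word.                      */
       double d = -0.0;
       unsigned char b[sizeof(double)];
       memcpy(b, &d, sizeof b);
       printf("%s-Endian\n", (b[0] & 0x80) ? "Big" : "Little");
       return 0;
   }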
Run-Time Attributes
~~~~~~~~~~~~~~~~~~~
Programs frequently filter operands before performing operations that
may malfunction on peculiar values;  examples include subnormal
numbers and zeros that may be troublesome divisors, infinities that
spoil matrix multiplications, and signed zeros and infinities that
mark the ends of open or closed intervals for Interval Arithmetic.
At issue now are not the comparisons against numerical thresholds
accomplished by the usual order predicates like  " x < y "  but the
filtering out of peculiar operands like NaNs that might complicate
such comparisons.  When filtering is simple and fast enough it may
well be preferred to the detection of malfunctions after they occur
in lengthy formulas.

Speed is crucial here.  It is achieved, when filtering requires more
than one test per floating-point operand, by moving the tests from
floating-point to integer and logical hardware, and by consolidating
several tests into a few with the aid of precalculated masks whenever
possible.  This is possible when the tests concern attributes like ...

     Name    Attribute of Floating-Point Operand
     ~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     sign    Sign bit, either  0 (+)  or  1 (-) .
     qNaN    A "quiet" NaN ;  it does not trap.
     sNaN    A "signaling" NaN ;  it traps when used.
     infy    An infinity, either  +∞  or  -∞ .
     finn    A finite nonzero number, not subnormal.
     subn    A subnormal nonzero number.
     zero    Either  +0.0  or  -0.0 .
     oddf    The last significant bit stored is  1 .
     notf    Not a (standard) floating-point number nor NaN.

(The last, "notf", should not occur for operands loaded from memory
nor for operands encountered by applications programs;  it is an
attribute to be encountered perhaps exclusively by operating systems
software.)

The nine attributes named above can be associated in our minds with
nine bits, each of which is  0  or  1  according to whether the
operand in question falls into the category described.  These bits
may well be copied from tag bits already generated when an operand is
loaded into a tagged floating-point register.  So long as they are
generated fast, their provenance doesn't matter.  Note that an
operand causes more than one bit to be set to  1  only if its sign
bit or its last bit is  1 .

The Function  Fclass(x)
~~~~~~~~~~~~~~~~~~~~~~~
A function  Fclass(x)  is intended to return a bit-string that
conveys the attributes of its floating-point operand  x .  In this
document,  Fclass(x)  is treated as a generic function determined by
the format of  x  as well as its value;  some languages will require
functions named  Fclass_(x)  in which the underscore  "_"  is
replaced by a suffix ...

        s    for Single-Precision operands  x ,
        sx       Single-Extended,
        d        Double-Precision,
        dx       Double-Extended,
        q        Quadruple-Precision,
      ( dd       Doubled-Double ),
    and so on.

All versions of  Fclass(x) , generic or not, return the same
bit-string for the same argument-value  x  regardless of format,
except for  finn, subn  and  oddf , which depend upon  x's  format.

The bit-string returned by  Fclass(x)  may be typed linguistically as
an integer by some languages, or as a type of its own kind by better-
protected languages.  What matters most is that  Fclass(x)  return a
quick classification of the value of its floating-point argument  x
that lends itself to logical masking in order to form predicates that
test for infinities, NaNs, signed zeros, subnormal numbers, ..., and
combinations thereof selected by masks of Fclass' type built up at
compile-time to speed the test demanded by the program.  Of course,
mask words deserve names like  signm, sNaNm, qNaNm, ...  that
programmers can combine by ANDing and ORing without having to
memorize hexadecimal strings.
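To make the foregoing concrete, here is one way Fclass might be
rendered in C for Double-Precision.  The bit-assignments below are
mine, not part of the proposal (which fixes no encoding);  they put
the sign into the leading bit of a two-byte word so that the SignBit
trick described later works by a single shift, and they assume the
common convention that a set leading fraction bit marks a quiet NaN:

   #include <stdint.h>
   #include <string.h>

   /* Hypothetical bit-assignments for the classification word.  */
   enum {
       sNaNm = 0x0001,  qNaNm = 0x0002,  infym = 0x0004,
       finnm = 0x0008,  subnm = 0x0010,  zerom = 0x0020,
       oddfm = 0x0040,  notfm = 0x0080,  /* notf cannot arise for
                                            a value held in a double */
       signm = 0x8000                    /* sign as leading bit of
                                            a two-byte word          */
   };

   /* Fclass for Double-Precision, done wholly with integer and
      logical operations, as required on architectures whose
      floating-point registers forget where their contents came
      from.                                                      */
   unsigned Fclassd(double x)
   {
       uint64_t u;
       memcpy(&u, &x, sizeof u);                    /* bit image  */
       uint64_t expo = (u >> 52) & 0x7FF;           /* exponent   */
       uint64_t frac =  u & 0x000FFFFFFFFFFFFFull;  /* 52 bits    */
       unsigned c = 0;
       if (u >> 63)  c |= signm;                    /* sign bit   */
       if (u & 1)    c |= oddfm;              /* last stored bit  */
       if (expo == 0x7FF)                     /* NaN or infinity  */
           c |= frac ? ((frac >> 51) ? qNaNm : sNaNm) : infym;
       else if (expo == 0)                    /* zero or subnormal*/
           c |= frac ? subnm : zerom;
       else  c |= finnm;                      /* finite, normal   */
       return c;
   }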
Here are examples in Matlab's syntax, treating any nonzero integer as
Boolean TRUE, zero as FALSE, and using  &  for AND ,  |  for OR ,
~  for NOT  and  ==  for EQUALS :

   isNaN(x)             (sNaNm | qNaNm) & Fclass(x)
   isFinite(x)          ~( (infym | sNaNm | qNaNm) & Fclass(x) )
                 or     (zerom | finnm | subnm) & Fclass(x)
   isInfinite(x)        infym & Fclass(x)
   isPlusInfinity(x)    infym == Fclass(x)
   isSubnormal(x)       subnm & Fclass(x)
   isNormal(x)          (zerom | finnm) & Fclass(x)
   isMinusZero(x)       Fclass(x) == (signm | zerom)
   isPlusZero(x)        Fclass(x) == zerom
   x == 0.0             zerom & Fclass(x)
   notZeroNorOdd(x)     ~( (zerom | oddfm) & Fclass(x) )

Note that these run fast as logical/integer operations because the
compound masks are composed at compile-time.

SignBit
~~~~~~~
If  Fclass(x)  is a two-byte integer with sign as its leading bit,
then one shift suffices to bring out the sign bit as an integer:

    SignBit(x) := LogicalShift(-15, Fclass(x))   yields  0 or 1 ;
   -SignBit(x) := IntegerShift(-15, Fclass(x))   yields  0 or -1 .

If the value of Fclass is not a word like that, the compiler writer
should supply a fast inlined implementation of  SignBit(x)  because
it is useful only if it is fast, and then it is useful in Sturm
Sequence calculations in tight loops that count real roots of
polynomial equations and isolate real eigenvalues of symmetric
matrices.

Implementation Details
~~~~~~~~~~~~~~~~~~~~~~
Implementors of floating-point hardware or firmware will find eight
of  Fclass(x)'s  bits worth keeping, along with perhaps more bits, in
a Tag field associated with each floating-point register.  The Tag
field's function is to speed up operations upon operands already
classified as a by-product of their insertion into the registers.  In
conjunction with two extra exponent bits, the Tag field can also
serve to speed the handling of over/underflowed intermediate results
and subnormal operands without recourse to traps;  that is a story
for another day.

An implementation of  Fclass(x)  in software can resort exclusively
to integer and logical operations.  It must be done this way on
Pentium-like architectures whose floating-point registers forget
where their contents came from, obscuring  finn, subn  and  oddf .

At present, the contemplated uses of Fclass invoke it just once per
argument, embedding it in a Boolean expression like the ones
illustrated above in Matlab's syntax.  So long as this pattern of use
persists, little is lost by implementing Fclass the way the Itanium
architecture does:  it combines the sensing of  Fclass(x)  and the
masking in one operation performed in its floating-point registers.
Thus, the expressions illustrated above could be converted at
compile-time into single instructions containing the appropriate
mask.  However, future uses may well test  Fclass(x)  with different
masks at different times for the same argument  x .  This possibility
deserves exploration;  would Interval and Complex Arithmetic codes
benefit if values of Fclass for two arguments could be stored and
combined later when needed?

Whether Fclass is implemented as a separate operation or is mixed
with a mask at every invocation will not matter to the programmer
with an efficiently optimizing compiler, except that repeated
references to  Fclass(x)  with the same  x  may be rather slower on
some architectures than on others.
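For illustration only, the Matlab-style predicates above might be
rendered in C as macros over the hypothetical  Fclassd  and masks of
the earlier sketch;  the compound masks are constants folded at
compile-time, so each predicate costs one classification plus one
logical operation:

   /* Assumes the enum of masks and Fclassd sketched earlier.   */
   #define isNaN(x)           ((sNaNm | qNaNm) & Fclassd(x))
   #define isFinite(x)        ((zerom | finnm | subnm) & Fclassd(x))
   #define isInfinite(x)      (infym & Fclassd(x))
   #define isPlusInfinity(x)  (Fclassd(x) == infym)
   #define isMinusZero(x)     (Fclassd(x) == (signm | zerom))
   #define notZeroNorOdd(x)   (!((zerom | oddfm) & Fclassd(x)))

   /* SignBit by one logical shift:  the sign sits in bit 15 of
      the two-byte classification word, so shifting right by 15
      leaves  0 or 1 , as the SignBit section above describes.   */
   #define SignBit(x)         (Fclassd(x) >> 15)

A Sturm Sequence loop can then accumulate sign agreements by integer
additions of  SignBit  values, with no branches in the loop body.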
Appendix:  Silent Order Predicates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fclass(x)  can help compilers support silent order predicates on
those machines whose order predicates all trap or signal INVALID
OPERATION if an operand is NaN.  Seven conventional comparison
predicates are

     Math:       =     ≠     ?     <     ≤     ≥     >
     C:          ==    !=    ?     <     <=    >=    >
     Fortran:   .EQ.  .NE.  .UN.  .LT.  .LE.  .GE.  .GT.

The last four of these order predicates were specified in IEEE 754 to
signal an INVALID OPERATION exception (trap or raise its flag) and
deliver a  .FALSE.  value if NaN was compared with itself or anything
else.  The signal was deemed necessary to protect legacy software
recompiled for hardware with NaNs though designed for old arithmetics
without them.  A user who noticed the signal could perhaps trace it
back to an event that had previously gone unnoticed or did not occur
on older hardware.  Protection was intentionally imperfect because
the two predicates  "NaN .EQ. NaN"  (specified .FALSE.)  and
"NaN .NE. NaN"  (specified .TRUE.)  did not signal;  they served to
detect NaNs in a language that lacked "NaN".  Unfortunately they are
"optimized" away too often by compilers that replace  " x .NE. x "
by  " .FALSE. "  at compile-time;  this is why  " isNaN(x) "  should
be used instead.

(The predicate  " x ? y "  or  " x .UN. y "  above is silently
.TRUE.  just when at least one of  x  and  y  is NaN .  Strictly
speaking, this predicate is mathematically superfluous since there is
no need for NaNs in mathematical proofs, which take trichotomy for
granted.  Only computers, unable to stop and revise their own
programs in the light of unforeseen circumstances, need NaNs and an
UNORDERED predicate.)

Nowadays programmers aware of NaNs' peculiarities need silent order
predicates that never signal.  In the syntax I prefer for C , these
should augment the predicates listed above thus:

   Math:   =    ≠    ?   !?   <>    <     ≤     ≥     >
   C:                          <>    <     <=    >=    >    & signal
           ==   !=   ?   !?   !=?   !>=   !>    !<    !<=     silently.

Note that silent  " x !> y "  differs from signaling  " !(x > y) "
when either  x  or  y  can be NaN .  However,  " x != y "  matches
" !(x == y) " , and similarly for  !?  since none of these ever
signal.  (I have yet to need more than the fourteen predicates
exhibited above.)

Whatever their syntax, all the silent predicates should run fast too.
This is why a fast Fclass can be used to filter operands before the
hardware's signaling predicate is applied to get the silent result.

Appendix:  What is oddf good for?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It may speed up some implementations of multi-double arithmetic.  Or
it may be superfluous.  The subject deserves further investigation.
Like the INEXACT exception, it is an idea that most current
practitioners of floating-point arithmetic have never considered
though, in a past now so distant that almost nobody alive remembers
it, such things used to be part of the trickery in which skilled
practitioners exulted.  For instance, if  ~(oddfm & Fclass(x))  then
3.0*x  suffers no roundoff, which fact figures in a program that
solves cubic equations.
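That closing claim can be spot-checked without any Fclass at all:
C99's  fma  recovers the rounding error of a product exactly, so a
zero residual certifies that  3.0*x  suffered no roundoff.  The brief
demonstration below is my illustration, not part of the proposal
(3*x = 2*x + x  needs at most one extra bit beyond  x's  precision,
and an even last significand bit supplies it):

   #include <math.h>
   #include <stdio.h>

   int main(void)
   {
       double even = 1.25;           /* last stored bit 0       */
       double odd  = 1.0 + 0x1p-52;  /* last stored bit 1       */
       /* fma computes  3*x - round(3*x)  with a single rounding,
          and that residual is exactly representable, so zero
          means  3.0*x  was exact and nonzero means it rounded. */
       printf("%g\n", fma(3.0, even, -(3.0*even)));  /* 0       */
       printf("%g\n", fma(3.0, odd,  -(3.0*odd )));  /* nonzero */
       return 0;
   }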