The generic interface Float
provides access to the floating-point
operations required or recommended by the IEEE floating-point
standard. Consult the standard to resolve any fine points in the
specification of the procedures. Non-IEEE implementations that
have values similar to NaNs and infinities should explain how those
values behave in an implementation guide. (NaN is an IEEE term
whose informal meaning is ``not a number''.)
GENERIC INTERFACEFloat (R); IMPORT FloatMode; TYPE T = R.T; PROCEDURE Scalb(x: T; n: INTEGER): T RAISES {FloatMode.Trap};
Return $\hboxx
\cdot 2^{\hboxn
}$.
PROCEDURE Logb(x: T): T RAISES {FloatMode.Trap};
Return the exponent ofx
. More precisely, return the unique integer $n$ such that the ratio $\hboxABS(x) / Base
^{n}$ is in the half-open interval[1..Base)
, unlessx
is denormalized, in which case return the minimum exponent value forT
.
PROCEDURE ILogb(x: T): INTEGER;
LikeLogb
, but returns an integer, never raises an exception, and always returns the $n$ such that $\hboxABS(x) / Base
^{n}$ is in the half-open interval[1..Base)
, even for denormalized numbers. Special cases: it returnsFIRST(INTEGER)
whenx
= 0.0,LAST(INTEGER)
whenx
is plus or minus infinity, and zero whenx
is NaN.
PROCEDURE NextAfter(x, y: T): T RAISES {FloatMode.Trap};
Return the next representable neighbor ofx
in the direction towardsy
. Ifx = y
, returnx
.
PROCEDURE CopySign(x, y: T): T;
Returnx
with the sign ofy
.
PROCEDURE Finite(x: T): BOOLEAN;
ReturnTRUE
ifx
is strictly between minus infinity and plus infinity. This always returnsTRUE
on non-IEEE implementations.
PROCEDURE IsNaN(x: T): BOOLEAN;
ReturnFALSE
ifx
represents a numerical (possibly infinite) value, andTRUE
ifx
does not represent a numerical value. For example, on IEEE implementations, returnsTRUE
if x is a NaN,FALSE
otherwise.
\index{NaN (not a number)}
PROCEDURE Sign(x: T): [0..1];
Return the sign bitx
. For non-IEEE implementations, this is the same asORD(x >= 0)
; for IEEE implementations,Sign(-0) = 1
andSign(+0) = 0
.
PROCEDURE Differs(x, y: T): BOOLEAN;
Return(x < y OR y < x)
. Thus, for IEEE implementations,Differs(NaN,x)
is alwaysFALSE
; for non-IEEE implementations,Differs(x,y)
is the same asx # y
.
PROCEDURE Unordered(x, y: T): BOOLEAN;
ReturnNOT (x <= y OR y <= x)
. Thus, for IEEE implementations,Unordered(NaN, x)
is alwaysTRUE
; for non-IEEE implementations,Unordered(x, y)
is alwaysFALSE
.
PROCEDURE Sqrt(x: T): T RAISES {FloatMode.Trap};
Return the square root ofT
. This must be correctly rounded ifFloatMode.IEEE
isTRUE
.
TYPE IEEEClass = {SignalingNaN, QuietNaN, Infinity, Normal, Denormal, Zero}; PROCEDURE Class(x: T): IEEEClass;
Return the IEEE number class containingx
. On non-IEEE systems, the result will beNormal
orZero
.
PROCEDURE FromDecimal( sign: [0..1]; READONLY digits: ARRAY OF [0..9]; exp: INTEGER): T RAISES {FloatMode.Trap};
Convert from floating-decimal to type T
.
\index{floating-point!conversion from decimal} \index{decimal conversion!to floating-point}
Let F
denote the nonnegative, floating-decimal number
digits[0] . digits[1] ... digits[LAST(digits)] * 10^exp = sum(i, digits[i] * 10^(exp - i))The result of
FromDecimal
is the number (-1)^sign * F
, rounded
to a value of type T
.
The procedure FromDecimal
is a floating-point operation, just
like +
and *
, in the sense that it rounds its ideal result
correctly, observing the current rounding mode, and it sets flags
and raises traps by the usual rules. On IEEE implementations, it
returns minus zero when F
is sufficiently small and sign=1
.
TYPE DecimalApprox = RECORD class: IEEEClass; sign: [0..1]; len: [1..R.MaxSignifDigits]; digits: ARRAY[0..R.MaxSignifDigits-1] OF [0..9]; exp: INTEGER; errorSign: [-1..1] END; PROCEDURE ToDecimal(x: T): DecimalApprox;
Convert from type T
to floating-decimal.
\index{floating-point!conversion to decimal} \index{decimal conversion!from floating-point}
Let D
denote ToDecimal(x)
. Then, D.class = Class(x)
and
D.sign = Sign(x)
. The other fields are defined only when
D.class
is either Normal
or Denormal
. In those cases, the
values D.len
, D.digits[0]
through D.digits[D.len-1]
, and
D.exp
encode a floating-decimal number F
with the property that
(-1)^D.sign * F
approximates x
in a sense discussed below. The
encoding is such that
F = digits[0] . digits[1] ... digits[len - 1] * 10^exp = sum(i, digits[i] * 10^(exp - i))and
ABS(x) = F * (1 + errorSign * epsilon)where
epsilon
is small and positive. In particular, D.errorSign
is +1
, 0
, or -1
according as ABS(x)
is larger than, equal
to, or smaller than F
.
The current rounding mode determines the sense in which the
floating-decimal number (-1)^sign * F
approximates x
, but in a
slightly subtle way. Define the opposite of a directed rounding
mode by reversing the direction, as follows:
Opp(TowardPlusInfinity) := TowardMinusInfinity Opp(TowardMinusInfinity) := TowardPlusInfinity Opp(TowardZero) := AwayFromZeroNote that
AwayFromZero
isn't actually a rounding mode, but it is
clear what it would mean if it were. For all other rounding modes
M
, we define Opp(M) = M
. If the current rounding mode is M
,
the call ToDecimal(x)
returns a floating-decimal number that
FromDecimal
would convert, under rounding mode Opp(M)
, back to
x
. Among all such numbers, the returned value has as few digits
as possible. This implies that both D.digits[0]
and
D.digits[D.len-1]
are nonzero. If there is a tie for having the
fewest digits, the tying number closest to x
wins. If there is
also a tie for being closest to x
, it must be a two-way tie and
the number whose last digit is even wins.
Unlike FromDecimal
, ToDecimal
never sets a FloatMode.Flag
and
never raises FloatMode.Trap
.
The idea of converting to decimal by retaining just as many digits
as are necessary to convert back to binary exactly was popularized
by Guy L.~Steele Jr.\ and Jon L White~\cite{Steele}. David M.~Gay
pointed out the importance, in this context, of demanding that the
conversion to binary handle mid-point cases by a known
rule~\cite{Gay}. For example, in IEEE double precision, the
floating-decimal number 1e23
is precisely halfway between two
adjacent floating-binary numbers. If conversion to binary were
allowed to go either way in such a mid-point case, conversion to
decimal would have to avoid producing the simple number 1e23
,
producing instead either 1.0000000000000001e23
or
9.999999999999999e22
. We believe the idea of combining the
Steele/White style of automatic precision control with directed
rounding by using opposite rounding modes, as above, is new with
Lyle Ramshaw.
END Float.