Math 128a - Spring 2002 - Lecture 2 - Jan 24 (Thursday)
Review of 64-bit floating point numbers from last time:
store s = sign (1 bit),
e = exponent (11 bits), (0 <= e <= 2047)
f = fraction (52 bits) (0 <= f < 1)
usual case: 0 < e < 2047
(-1)^s * 2^(e-1023) * (1+fraction)
(range from about 10^(-308) to 10^(308))
e = 2047, f = 0 means x = +- infinity
e = 2047, f != 0 means x = NaN = Not a Number
0 = e and f = 0 means x = +-0
0 = e and f != 0, means x is "subnormal"
x = (-1)^s 2^(-1022) * (0 + f)
(illustrate with Matlab "hex format")
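    A quick illustration (a sketch; the hex strings in the comments are what the
    format described above implies, assuming IEEE double):
        format hex
        1             % 3ff0000000000000 : s=0, e=1023 (hex 3ff), f=0, so 2^0*(1+0) = 1
        2^(-1022)     % 0010000000000000 : smallest normalized number (e=1, f=0)
        2^(-1022)/2   % 0008000000000000 : subnormal (e=0, f~=0)
        1/0           % 7ff0000000000000 : +Inf (e=2047, f=0)
        0/0           % NaN (e=2047, f~=0; the exact fraction bits may vary)
        format short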
What operations are available?
Basic four: +, -, *, /
Also sqrt(x), remainder(x,y), compare(x,y)
binary-decimal conversion
Rounding: what do you get from 1/3, which is not
an exact floating point number (why not?)
Def: if op is one of +, -, *, / then
the floating point result of a op b is written
fl(a op b), and is equal to the floating point
number nearest the true value of a op b.
(in case of a tie, choose the floating point
number whose bottom bit is 0)
Known as "round to nearest", as accurate as possible
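    A small Matlab experiment illustrating round to nearest and the tie-breaking
    rule (a sketch, assuming IEEE double arithmetic):
        format long
        1/3                   % 1/3 is not a finite binary fraction, so it is rounded
        % 1 + 2^(-53) lies exactly halfway between 1 and the next double, 1 + 2^(-52);
        % the tie goes to the neighbor whose bottom bit is 0, namely 1:
        (1 + 2^(-53)) == 1    % returns 1 (true)
        (1 + 2^(-52)) == 1    % returns 0 (false)
        format short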
Exceptions, when "round to nearest" trickier to interpret:
Overflow - 1e200^2 = 1e400, too big, get +Inf
Underflow - 1e-308^2, get zero
1e-308 / 100, get subnormal number
Divide-by-zero - 1/+-0, get +-Inf
Invalid - inf-inf, 0/0, sqrt(-1), inf/inf, get NaN
For each exception, you always get a "sticky bit", a flag that says
an exception has occurred since the flag was last reset
May continue computing (default) or trap
(not available in matlab, just isnan(), etc)
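    These exceptions are easy to provoke in Matlab (a sketch; displayed values
    may differ slightly by version):
        1e200^2          % overflow: +Inf
        1e-308^2         % underflow: 0
        1e-308/100       % underflow to a subnormal number, about 1e-310
        1/0, -1/0        % divide by zero: +Inf and -Inf
        Inf - Inf, 0/0   % invalid: NaN
        isnan(0/0)       % 1 (true); isnan() and isinf() are the tests Matlab provides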
Go back and understand infinite loop from last time:
Example 2: run bisect to find root of cos(x)=0 in [1,2]
[xout,fn,iters]=bisect('cos',[1,2],1e-15,0)
Example 3: run bisect to find root of cos(x)=0 in [1,2]
[xout,fn,iters]=bisect('cos',[1,2],1e-16,0)
[xout,fn,iters]=bisect('cos',[1,2],1e-16,1)
Explanation: if xlower and xupper are adjacent floating point
numbers, then xmid = (xlower+xupper)/2 rounds to
xlower or xupper, and no progress is made
Note: consider 3 digit decimal:
fl((.999+.997)/2) = 1.00; so (xlower+xupper)/2
does not even have to be in the interval [a,b] !
This can't happen with binary numbers (homework)
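    A Matlab sketch of the phenomenon, using the adjacent doubles 1 and 1+eps:
        xlower = 1;  xupper = 1 + eps;   % adjacent floating point numbers
        xmid = (xlower + xupper)/2;      % true midpoint 1 + eps/2 is not representable
        xmid == xlower                   % returns 1: the midpoint rounds back to xlower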
Possible fixes for bisection algorithm:
stop if xmid = xlower or xupper
stop if number of iterations exceeds some upper bound
Both reasonable, but neither of these does what most people
want, which is: "find me the answer to 5 digits (or 10 digits)
as fast as you can" - How do we do this?
What does this mean exactly?
pi is roughly 3.14159265358979323...
pi is 3.1416 to 5 digits
pi is 3.141592654 to 10 digits
Def: Let x be an approximation to X. Then
the "absolute error" in x is |x-X|
the "relative error" in x is |x-X|/|X|
Ex: |3.1416 - pi|/|pi| ~ 10^(-5) (a little less)
|3.141592654 - pi|/|pi| ~ 10^(-10) (a little more)
Fact: small relative error corresponds to idea of
knowing leading digits of answer (details on homework)
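    For instance, in Matlab (a sketch):
        format short e
        abs(3.1416 - pi)            % absolute error, about 7.3e-06
        abs(3.1416 - pi)/abs(pi)    % relative error, about 2.3e-06 (a little less than 1e-5)
        format short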
Returning to bisection, if the user wants, say, 5 digits, then
we want to stop as soon as the relative error is < 10^(-5).
How do we change the stopping test in the code to do this?
How do we change
while (width > tol)
do a bisection step
end
to
while (relative error > reltol)
do a bisection step
end
Why is this a problem? We can't compute the exact relative error (we don't know the exact answer)
Instead, we will use the approximation
relative error ~ width of interval / largest number in interval
(on homework you will show that this is a good approximation)
New stopping test for bisection:
while ( width/ largest number in interval > reltol)
do bisection step
end
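    A tiny runnable Matlab sketch of this stopping test, on f(x) = cos(x) in [1,2]
    (variable names and the inline bisection step are assumptions, not the course's bisect_rel):
        xlower = 1;  xupper = 2;  reltol = 1e-10;
        while (xupper - xlower) > reltol * max(abs(xlower), abs(xupper))
            xmid = (xlower + xupper)/2;              % one bisection step
            if cos(xmid)*cos(xlower) <= 0
                xupper = xmid;
            else
                xlower = xmid;
            end
        end
        xout = (xlower + xupper)/2                   % approximates pi/2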
Example 4: [xout,fx,iters]=bisect ('cos',[100000,100004],1e-10,0)
[xout,fx,iters]=bisect_rel('cos',[100000,100004],1e-10,0)
second one converges faster!
Example 5: [xout,fx,iters]=bisect ('cos',[1000000,1000004],1e-10,0)
[xout,fx,iters]=bisect_rel('cos',[1000000,1000004],1e-10,0)
second one converges!
Example 6: [xout,fx,iters]=bisection_rel('cube',[-1,2],1e-5,0)
yields interval not containing 0, and negative values at both
endpoints; why?
Proposed fix for bisection algorithm (why does this help?):
replace test fmid*flower <= 0 by sign(fmid)*sign(flower) <= 0
yields interval with two negative endpoints - why?
Example 7: [xout,fx,iters]=bisection_rel('identity',[-1,2],1e-5,0)
Now it goes into an infinite loop - why?
fix by having absolute lower bound on width:
while ( width > max( reltol*largest number in interval, abstol))
do a bisection step
end
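    Putting these fixes together, a routine in the spirit of bisect_rel might look
    like the following sketch (this is not the course's bisect_rel.m; the function
    name, argument list, and details are assumptions):
        function [xout, fout, iters] = bisect_rel_sketch(fname, ab, reltol, abstol)
        % Bisection with a relative stopping test, an absolute fallback tolerance,
        % and a sign-based test that avoids underflow in fmid*flower.
        xlower = ab(1);  xupper = ab(2);
        flower = feval(fname, xlower);
        iters = 0;
        while (xupper - xlower) > max(reltol*max(abs(xlower), abs(xupper)), abstol)
            xmid = (xlower + xupper)/2;
            if xmid == xlower | xmid == xupper   % adjacent floats: no further progress
                break
            end
            fmid = feval(fname, xmid);
            if sign(fmid)*sign(flower) <= 0      % sign change in [xlower, xmid]
                xupper = xmid;
            else                                 % sign change in [xmid, xupper]
                xlower = xmid;  flower = fmid;
            end
            iters = iters + 1;
        end
        xout = (xlower + xupper)/2;  fout = feval(fname, xout);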
Example 8: f(x) = (x-2)^13, in [1.1,3.2], tol = 1e-15
[xout,fn,iters]=bisect('func',[1.1, 3.2], 1e-15, 0)
Example 9: f(x) = (x-2)^13 evaluated using expanded form
(polyval in Matlab)
tol =1e-10
[xout,fn,iters]=bisect('func',[1+rand(1), 3.2], 1e-10, 0)
observe that answer changes a lot depending on rand(1)
Example 10: f(x) = (x-2)^13 evaluated using polyval
[xout,fn,iters]=bisect_anim('func',[rand(1), 3.2], 1.e-15)
Example 11: x=(1:.001:3), y=func(x), plot(x,y), axis([1 3 -1e-8 1e-8])
round off makes answer "junk" but with structure
x=2+(0:2000)*2*eps;y=func(x);plot(y,'.')
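    The expanded form of (x-2)^13 used by 'func' can be reproduced with a sketch
    like this (the body of the course's func.m is an assumption; poly() builds
    the coefficients from the roots):
        coef = poly(2*ones(13,1));           % coefficients of (x-2)^13 in expanded form
        x = 1:.001:3;
        y = polyval(coef, x);                % heavy cancellation near the root x = 2
        plot(x, y), axis([1 3 -1e-8 1e-8])   % roundoff makes the values "junk", but with structure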
Goals:
Want to understand how to bound error in evaluating polynomial
Want to modify bisection (or any zero finder) to use error bound to
give bound to user on location of zero
Relative error is the natural way to bound error in floating point
UN = smallest normalized floating point number
= "Underflow threshold"
= 2^(-1022)
~ 10^(-308)
OV = largest normalized floating point number
= "Overflow threshold"
~ 2^(1024)
~ 10^(308)
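    In Matlab these are the built-in constants realmin and realmax (a quick check):
        realmin                 % 2.2251e-308 = 2^(-1022), the underflow threshold UN
        realmax                 % 1.7977e+308, just below 2^1024, the overflow threshold OV
        realmin == 2^(-1022)    % returns 1 (true)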
Prop: Suppose x is a real number, and x* is its
closest possible floating point approximation.
(eg x = a+b and x* = fl(a+b))
Then provided UN <= |x| <= OV we have that
|x-x*|/|x| <= 2^(-53) ~ 1e-16
Notation: 2^(-53) is called "epsilon" or "unit roundoff" or
"machine epsilon" or "macheps" for short
(note, in matlab eps = 2^(-52) is a built-in constant
defined as twice our epsilon!)
Proof: Assume w.l.o.g. that x is positive.
Write x* = 2^e *(1+f) where 0 <= f < 1
First suppose 0 < f < 1. The two floating point numbers
closest to x* are x* - 2^e * 2^(-52) and x* + 2^e * 2^(-52).
Then the fact that x* is closest to x means that
x* - 2^e * 2^(-53) <= x <= x* + 2^e * 2^(-53)
or
-2^e * 2^(-53) <= x - x* <= 2^e * 2^(-53)
or
-2^e * 2^(-53) / x <= (x - x*) / x <= 2^e * 2^(-53) / x
or
| x - x* | / |x| <= 2^e * 2^(-53) / |x|
<= 2^e * 2^(-53) / 2^e
since |x| >= 2^e * (1+2^(-53))
= 2^(-53)
The case f=0 is left as an exercise to the student.
Corollary: Let op be one of +,-,*,/.
Suppose UN <= |fl(a op b)| <= OV.
Then
|fl(a op b) - (a op b)| / | a op b | <= 2^(-53)
or
(*) fl(a op b) = (a op b)*(1+ delta)
where |delta| <= epsilon
Proof: Define delta = [fl(a op b) - (a op b)] / (a op b)
which must satisfy |delta| <= epsilon. Then solve for fl(a op b).
We note that IEEE also guarantees that
(**) fl(sqrt(a)) = sqrt(a)*(1+delta), |delta| <= eps
(*) and (**) are the simple models from which we can derive error bounds.
We will do some simple examples, and then polynomial evaluation
Example 12: The program
p = 1, for i=1:n, p=p*x(i), end
computes the product x(1)*x(2)* ... *x(n)
What is the difference between the true p and the computed p*?
Approach: write down a "program" that includes all the
rounding error explicitly, and evaluate it exactly:
p* = 1, for i=1:n, p* = (p* * x(i))*(1 + delta(i))
where delta(i) is the roundoff error in the
i-th multiplication. Evaluating, we get
p* = x(1)*...*x(n) * (1+delta(1))*...*(1+delta(n))
= p * (1+delta(1))*...*(1+delta(n))
= p * (1+r) where 1+r = (1+delta(1))*...*(1+delta(n))
Ideally, 1+r is as close to 1 as possible. How close is it?
Simple bounds:
(1-epsilon)^n <= 1+r = (1+delta(1))*...*(1+delta(n)) <= (1+epsilon)^n
so
(1-epsilon)^n - 1 <= r <= (1+epsilon)^n - 1
How to approximate (1+-epsilon)^n?
Simplest way: Binomial theorem or Taylor expansion:
(1+-epsilon)^n = 1 +- n*epsilon + O((n*epsilon)^2)
so if |n*epsilon|<<1, get that
-n*epsilon + O((n*epsilon)^2 ) <= r <= n*epsilon + O((n*epsilon)^2 )
For example, n<=100, epsilon ~ 1e-16, means p* approximates p to at least
14 decimal places
Approximation ok if n*epsilon << 1 (see below)
To be rigorous, and not ignore O(epsilon^2) term, one can prove
Lemma: |r| <= n*epsilon / (1 - n*epsilon ) if n*epsilon < 1
Proof: homework
This is "worst case analysis", only happens if all delta(i) ~ epsilon,
or all delta(i) ~ -epsilon. This is unlikely, but possible, so often
answer is more accurate than worst case prediction (examples later)
From now on, we will write |r| <= n*epsilon, ignoring second order terms
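    One way to see this bound in action is to do the multiplications in single
    precision and treat the double precision product as the "true" answer (a
    sketch; it assumes a Matlab version with single precision arithmetic, where
    the unit roundoff is 2^(-24) instead of 2^(-53)):
        n = 1000;
        x = 1 + (rand(1,n) - 0.5)/1000;     % random factors near 1, so the product stays near 1
        p_true = prod(x);                   % reference product, computed in double precision
        p_single = single(1);
        for i = 1:n
            p_single = p_single * single(x(i));   % same product, rounded in single precision
        end
        relerr = abs(double(p_single) - p_true) / abs(p_true)
        bound  = n * 2^(-24)                % worst case n*epsilon for single precision
    Typically relerr comes out far below bound, illustrating the remark above
    that the worst case is rarely attained.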
Example 13: The program
s = 0; for i=1:n, s=s+x(i), end
computes the sum x(1)+x(2)+ ... +x(n)
What is the difference between the true s and the computed s*?
Approach: as before, write down a program that includes all
the rounding error explicitly, and evaluate it exactly:
s* = 0, for i=1:n, s* = (s* + x(i))*(1 + delta(i))
where delta(i) is the roundoff error in the i-th addition
This is trickier algebra, so the idea is to modify the program to
have a unique identifier for each intermediate value:
s*(0) = 0, for i=1:n, s*(i) = (s*(i-1) + x(i))*(1 + delta(i))
Multiply out a few terms to see pattern (skip induction proof)
s*(1) = x(1) ... no roundoff when adding 0
s*(2) = (x(1)+x(2))*(1+delta(2))
s*(3) = ((x(1)+x(2))*(1+delta(2)) + x(3))*(1+delta(3))
= x(1)*(1+delta(2))*(1+delta(3))
+ x(2)*(1+delta(2))*(1+delta(3))
+ x(3)*(1+delta(3))
s*(4) = x(1)*(1+delta(2))*(1+delta(3))*(1+delta(4))
+ x(2)*(1+delta(2))*(1+delta(3))*(1+delta(4))
+ x(3)*(1+delta(3))*(1+delta(4))
+ x(4)*(1+delta(4))
...
s*(k) = x(1)*(1+delta(2))*...*(1+delta(k))
+ x(2)*(1+delta(2))*...*(1+delta(k))
+ x(3)*(1+delta(3))*...*(1+delta(k))
+ x(4)*(1+delta(4))*...*(1+delta(k))
+ ...
+ x(k)*(1+delta(k))
so s*(n) = sum from i=1 to n of x*(i) where
x*(i) = x(i)*(1+r(i)) where |r(i)| <= (n-1)*epsilon
In words, s*(n) is the exact sum of a perturbed set of
summands x*(i), each of which is a small relative
perturbation of x(i)
What does this mean?
First, suppose all the x(i) are positive; then claim
that s*(n) = s(n)*(1+delta), where |delta| <= (n-1)*epsilon
Proof:
s*(n) - s(n) = sum x(i)*(1+r(i)) - sum x(i)
= sum x(i)*r(i)
<= sum x(i)*(n-1)*epsilon
= (n-1)*epsilon * s(n)
so s*(n) <= s(n)*(1+(n-1)*epsilon), and similarly s*(n) >= s(n)*(1-(n-1)*epsilon)
What if not all x(i) > 0? then can only show that
| s*(n) - s(n) | = | sum_i x(i)*r(i) | <= (n-1)*epsilon* sum_i |x(i)|
so error bound same as if all x(i)>0, and so the relative error is
| s*(n) - s(n) | / | s(n) |
<= (n-1)*epsilon * [ sum |x(i)| / | sum x(i) | ]
Thus, if |s(n)| = | sum x(i) | << sum |x(i)|, i.e. if there
is a lot of "cancellation" in the sum, then there may be a large
relative error:
Ex: fl((x+1e16) - 1e16) = 0, not x, for any 0 < |x| <= 1 (about epsilon*1e16 ~ 1.1),
so the relative error is 100%.
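    A Matlab illustration of this catastrophic cancellation (a sketch):
        x = 0.4;
        (x + 1e16) - 1e16              % returns 0, not 0.4: x is lost when added to 1e16
        ((x + 1e16) - 1e16 - x) / x    % relative error of -1, i.e. 100%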
What about polynomial root finding, which involves summing and multiplying?
As we approach the root, we are in precisely the situation where the
value of the polynomial (a sum) is supposed to be near zero, i.e. we
expect large relative errors. The analysis is similar to summing:
Example 14: Horner's rule for polynomial evaluation (used in polyval)
... compute p = sum from i=0 to n of c(i)*x^i
p = c(n); for i=n-1:-1:0, p = x*p + c(i)
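    In Matlab, with 1-based indexing and the coefficients stored highest degree
    first as in polyval (coef(1) = c(n), ..., coef(n+1) = c(0)), Horner's rule is
    the following sketch (the example polynomial is just an illustration):
        coef = [1 -3 2];  x = 1.5;      % e.g. p(x) = x^2 - 3x + 2 at x = 1.5
        p = coef(1);
        for i = 2:length(coef)
            p = x*p + coef(i);          % p = c(n)*x^n + ... + c(1)*x + c(0)
        end
        p                               % -0.25, same as polyval(coef, x)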
Using same approach as for rounding error analysis in Example 13, we get
p*(n) = c(n);
for i=n-1:-1:0,
p*(i) = (x*p*(i+1)*(1+eps(i)) + c(i))*(1+delta(i))
end
with |eps(i)| <= epsilon, |delta(i)| <= epsilon
p*(n) = c(n) ... no roundoff
p*(n-1) = (x * p*(n) * (1+eps(n-1)) + c(n-1)) * (1+delta(n-1))
= [x * c(n)] * (1+eps(n-1))*(1+delta(n-1))
+ c(n-1) * (1+delta(n-1))
p*(n-2) = ( (x * p*(n-1)) * (1+eps(n-2)) + c(n-2)) * (1+delta(n-2))
= [x^2 * c*(n)] *
(1+eps(n-1))*(1+eps(n-2))*(1+delta(n-1))*(1+delta(n-2))
+[x * c(n-1)] *
(1+eps(n-2))*(1+delta(n-1))*(1+delta(n-2))
+ c(n-2) *
(1+delta(n-2))
p*(n-k) = [x^k * c(n)] *
              (1+eps(n-1))*...*(1+eps(n-k)) *
              (1+delta(n-1))*...*(1+delta(n-k))
          + [x^(k-1) * c(n-1)] *
              (1+eps(n-2))*...*(1+eps(n-k)) *
              (1+delta(n-1))*...*(1+delta(n-k))
          + [x^(k-2) * c(n-2)] *
              (1+eps(n-3))*...*(1+eps(n-k)) *
              (1+delta(n-2))*...*(1+delta(n-k))
          ...
          + [c(n-k)] * (1+delta(n-k))
Note that the maximum number of terms of the form (1+tiny) multiplied
together is 2*n. Thus the final computed value p*(0) = sum from i=0 to n of c*(i)*x^i where
c*(i) = c(i)*(1+r(i)) where |r(i)| <= 2*n*epsilon
That is, the computed value p* of the polynomial is
the exact value of another polynomial evaluated at the same point x,
but with different coefficients c*(i), each of which
differs from the original c(i) by a small relative change r(i).
Furthermore, we can write the error bound as
| p* - p | = | sum from i=0 to n c*(i)*x^i - c(i)*x^i |
= | sum from i=0 to n [c*(i) - c(i)]*x^i |
= | sum from i=0 to n [r(i)*c(i)]*x^i |
<= sum from i=0 to n |r(i)*c(i)*x^i|
because r(i) can be positive or negative,
<= 2*n*epsilon * sum from i=0 to n |c(i)|*|x|^i
= 2*n*epsilon * value of a polynomial with
coefficients |c(i)| at |x|,
i.e. all positive numbers, without cancellation
In matlab, if p = polyval(coef,x), then
| p* - p | <= 2*n*epsilon * polyval(abs(coef),abs(x))
In practice, factor 2*n is pessimistic, and often omitted.
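    A routine like polyvalbnd can be sketched as follows (the course's
    polyvalbnd.m may differ; the bound is the one derived above, with
    epsilon = eps/2 = 2^(-53)):
        function [val, errorbnd] = polyvalbnd_sketch(coef, x)
        % Evaluate the polynomial by Horner's rule (polyval) and return a bound
        % on the absolute roundoff error: 2*n*epsilon * sum_i |c(i)|*|x|^i.
        n = length(coef) - 1;                                  % degree of the polynomial
        val = polyval(coef, x);
        errorbnd = 2*n*(eps/2) * polyval(abs(coef), abs(x));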
Example: [val,errorbnd]=polyvalbnd(coef,x)
[x,val,errorbnd,errorbnd./val] for x=(1:.2:3)'
when relative error bound = errorbnd./val > 1, can't trust
sign in zero finder
Example: [xout,HornerVal,TrueVal,AbsErrorBound,TrueAbsError]=
polyplot(roots,x)
[...]=polyplot(2*ones(13,1),1:.001:3);
[...]=polyplot(1:20,-5:.001:25);
Using this error bound in a zero finder, like bisection:
1) compute both f(xmid) and an error bound on f(xmid)
2) stop when error bound > |f(xmid)|,
because then sign of f(xmid) in doubt, might as well be zero
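    Inside the bisection loop the extra test is just a few lines (a sketch,
    using polyvalbnd as above):
        [fmid, errorbnd] = polyvalbnd(coef, xmid);
        if errorbnd > abs(fmid)
            break    % sign of f(xmid) is in doubt; xmid is as good a root as we can certify
        end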
End of floating point arithmetic details and error analysis, for now:
Big picture to remember:
value of polynomial computed with roundoff is the exact
value of a slightly perturbed polynomial; this is an
example of a general phenomenon called "backward stability"
that is widely used to analyze the impact of roundoff
on an algorithm
(Warning: this is the first of several uses of the word
"stability" in this course. It has differetn meanings
in different contexts, so be aware!)