Notes for Ma221, Lecture 2, Sep 3 2014
Summary of the Syllabus:
Ax=b, least squares, eigenproblems, Singular Value Decomposition (SVD) and variations
Direct methods (aka matrix factorizations: LU, Cholesky, QR, Schur form etc.)
Iterative methods (eg Jacobi, Gauss-Seidel, Conjugate Gradients, Multigrid, etc.)
Deterministic and randomized algorithms
Shared themes:
exploiting mathematical structure (eg symmetry, sparsity, ...)
numerical stability / condition numbers - how accurate is my answer?
efficiency (minimizing flops and communication = data movement)
finding good existing software
Goals: Floating point arithmetic
Roundoff error analysis for polynomial evaluation
Beyond basic error analysis:
exceptions, high precision arithmetic, interval arithmetic,
exploiting mathematical structure to get accuracy without high precision
Example: Polynomial Evaluation, and polynomial zero finding
EX: review how bisection to find a root of f(x)=0 works:
start with an interval [x1,x2] where f changes sign: f(x1)*f(x2) < 0
evaluate at midpoint: f((x1+x2)/2)
keep bisecting subinterval where f changes sign
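A minimal sketch of such a bisection routine in Matlab (the actual bisect.m posted on the
class web page, used in the demos below, may differ; the argument list is guessed from those calls):
    function xmid = bisect(coeff, x1, x2, tol)
    % Find a zero of the polynomial with coefficients coeff inside [x1,x2],
    % assuming polyval(coeff,x1) and polyval(coeff,x2) have opposite signs.
      f1 = polyval(coeff, x1);
      while abs(x2 - x1) > tol
        xmid = (x1 + x2)/2;
        fmid = polyval(coeff, xmid);
        if sign(fmid) == sign(f1)    % sign change is in the right half
          x1 = xmid; f1 = fmid;
        else                         % sign change is in the left half
          x2 = xmid;
        end
      end
      xmid = (x1 + x2)/2;
    end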
try it on (x-2)(x-3)(x-4) = x^3 - 9*x^2 + 26*x - 24
rts = [2,3,4]
coeff = poly(rts)
help bisect (on web page)
bisect(coeff,2.6,3.1,1e-12)
bisect(coeff,2.2,3.2,1e-12)
... no surprises, get small intervals around 3
now try it on (x-2)^13
rts = 2*ones(1,13)
coeff = poly(rts)
bisect(coeff,1.7,2.4,1e-12)
bisect(coeff,1.7,2.2,1e-12)
bisect(coeff,1.9,2.2,1e-12) ... huh? a very different answer each time?
Horner's rule to (x-2)^13 (Matlab demo) - what is the real graph?
x = 1.7:1e-4:2.3;
y = polyval(coeff,x);
yacc = (x-2).^13;
plot(x,y,'k.',x,yacc,'r')
axis([1.8,2.2,-1e-8,1e-8])
... can we explain this?
To summarize: Try evaluating (x-2)^13 two ways:
as (x-2)^13 - smooth, monotonic curve, as expected
as x^13 - 26*x^12 + ... - 8192, with Horner's rule:
for x in the range [1.8,2.3], basically get random numbers
in the range [-1e-8,1e-8]
why the big difference? need to understand floating point arithmetic
Floating Point - How real numbers are represented in a computer
Long ago, computers did floating point in many different ways, making it hard to
understand bugs and write portable code. Fortunately Prof. Kahan led an IEEE standards
committee that convinced all the computer manufacturers to agree on one way to do it,
called the IEEE Floating Point Standard, for which he won the Turing Award. This was
in 1985; the standard was updated in 2008. See his web page for much more
information: www.cs.berkeley.edu/~wkahan
Scientific Notation: +- d.ddd x radix^e
Floating point usually uses radix=2 (10 sometimes, 16 historically)
so you need to store sign bit (+-), exponent (e), and mantissa (d.ddd).
Both p = #digits and the exponent range are limited, to fit into 32 bits, 64 bits,
etc. For simplicity, we will briefly ignore the limit on exponent range,
i.e. assume no overflow or underflow.
Normalization: We use 3.1000e0 not 0.0031e3 - i.e. the leading digit is nonzero.
Normalization gives uniqueness of representations, which is useful.
Def: rnd(x) = nearest floating point number to x
Relative Representation Error (RRE): | x - rnd(x) | / | rnd(x) |
Maximum Relative Representation Error = max_x RRE(x)
aka machine epsilon, macheps,
= half distance from 1 to next larger number 1+radix^(1-p)
= .5 * radix^(1-p) = | (1+.5*radix^(1-p)) - 1 | / 1
= 2^(-p) in binary
Note: eps in Matlab = 2^(-52) = 2*macheps
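A quick Matlab check of these definitions (a minimal illustration):
    macheps = 2^(-53)    % for IEEE double; Matlab's eps = 2^(-52) is the gap from 1 to the next double
    1 + macheps == 1     % true: 1 + 2^(-53) is a tie and rounds back to 1 (ties go to the even neighbor)
    1 + eps > 1          % true: 1 + 2^(-52) is the next floating point number after 1
    (1 + eps) - 1        % returns eps exactly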
Roundoff error model, assuming no over/underflow:
fl(a op b) = rnd(a op b) = true result rounded to nearest
= (a op b)(1 + delta), |delta| <= macheps
where op may be add, subtract, multiply or divide
Use throughout course, all you need for most algorithms
true for complex arithmetic (see Q 1.12 for details, use bigger macheps)
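A small sanity check of this model in Matlab, using single precision so that double
arithmetic can stand in for the exact result (a minimal illustration):
    a = single(1); b = single(3);
    c = a/b;                            % fl(a/b), rounded to nearest single
    delta = (double(c) - 1/3)/(1/3);    % observed relative error (1/3 in double as the "exact" value)
    abs(delta) <= 2^(-24)               % true: |delta| <= macheps for single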
IEEE single(S)/double(D)/quad (Q) formats : radix = 2
S: 1+8+23 bits, -126 <= e <= 127, p = 24, macheps = 2^(-24) ~ 6e-8
overflow threshold (OV) ~ 2^128 ~ 1e38,
underflow threshold (UN) = 2^(-126) ~ 1e-38
D: 1+11+52 bits, -1022 <= e <= 1023, p=53, macheps = 2^(-53) ~ 1e-16
OV ~ 2^1024 ~ 1e308,
UN = 2^(-1022) ~ 1e-308
Q: 1+15+112 bits, -16382 <= e <= 16383, p=113, macheps = 2^(-113) ~ 1e-34
OV ~ 2^16384 ~ 1e4932,
UN = 2^(-16382) ~ 1e-4932
Note: S and D available in hardware in most platforms.
Q is usually in software (added to 2008 standard).
E(xtended), an 80-bit format on Intel x86 architectures, was
in the old IEEE standard from 1985, but is now deprecated.
Higher precision is available via software simulation (see ARPREC, GMP on web page)
2008 standard also added decimal arithmetic (for Excel...)
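These constants are available directly in Matlab (double is the default; recall Matlab's eps = 2*macheps):
    realmax, realmin, eps                                % ~1.8e308, ~2.2e-308, 2^(-52)
    realmax('single'), realmin('single'), eps('single')  % ~3.4e38, ~1.2e-38, 2^(-23)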
That's enough to understand the plot of (x-2)^13, but more about floating point later.
Analyze Horner's Rule for evaluating p(x):
simplest expression:
p = sum_{i=0 to d} a_i x^i
algorithm:
p = a_d, for i=d-1:-1:0, p = x*p + a_i
label intermediate results (no roundoff yet):
p_d = a_d, for i=d-1:-1:0, p_i = x*p_{i+1} + a_i
introduce roundoff terms:
p_d = a_d, for i=d-1:-1:0, p_i = [x*p_{i+1}*(1+d_i) + a_i]*(1+d'_i)
where |d_i| <= macheps and |d'_i| <= macheps
p_0 = sum_{i=0:d-1} [(1+d'_i) * prod_{j=0:i-1} (1+d_j)*(1+d'_j)] * a_i*x^i
      + [prod_{j=0:d-1} (1+d_j)*(1+d'_j)] * a_d*x^d
    = sum_{i=0:d-1} [product of 2i+1 terms like (1+d)] * a_i*x^i
      + [product of 2d terms like (1+d)] * a_d*x^d
    = sum_{i=0:d} a'_i * x^i
In words: Horner is backward stable: the computed value p_0 is the exact value, at x, of
a polynomial whose coefficients a'_i are slightly changed from the input coefficients a_i
how to simplify to get error bound
prod_{i=1:n} (1 + delta_i)
<= prod_{i=1:n} (1 + macheps)
= 1 + n*macheps + O(macheps^2) ... usually ignore (macheps^2)
<= 1 + n*macheps/(1-n*macheps) if n*macheps < 1 ... (lemma left to students)
similarly
prod_{i=1:n} (1 + delta_i)
>= prod_{i=1:n} (1 - macheps)
= 1 - n*macheps + O(macheps^2) ... usually ignore (macheps^2)
>= 1 - n*macheps/(1-n*macheps) if n*macheps < 1 ... (lemma left to students)
so |prod_{1 to n} (1 + delta_i) - 1| <= n*macheps ... ignoring macheps^2 terms
and thus
|computed p_0 - p(x)| <= sum_{i=0:d-1} (2i+1)*macheps*|a_i*x^i|
                          + 2*d*macheps*|a_d*x^d|
                      <= sum_{i=0:d} 2*d*macheps*|a_i*x^i|
relerr = |computed p_0 - p(x)| / |p(x)|
       <= [sum_{i=0:d} |a_i*x^i| / |p(x)|] * 2*d*macheps
= condition number * relative backward error
How many decimal digits can we trust?
roughly, k correct (decimal) digits <=> relative error ~ 10^(-k)
                                    <=> k ~ -log_10 (relative error)
How to modify Horner to compute error bound:
p = a_d, ebnd = |a_d|
for i=d-1:-1:0, p = x*p + a_i, ebnd = |x|*ebnd + |a_i|
ebnd = 2*d*macheps*ebnd
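In Matlab this might look as follows (a rough sketch; recall that poly() stores the leading
coefficient first, and note that the demo below gets the same bound by applying polyval to
abs(coeff) and abs(x)):
    function [p, ebnd] = hornererr(a, x)
    % Horner's rule plus a running error bound; a(1) is the leading coefficient,
    % as in polyval, and x is a scalar.
      d = length(a) - 1;               % degree of the polynomial
      p = a(1); ebnd = abs(a(1));
      for i = 2:d+1
        p    = x*p + a(i);
        ebnd = abs(x)*ebnd + abs(a(i));
      end
      ebnd = 2*d*(eps/2)*ebnd;         % eps/2 = macheps in double
    end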
Matlab demo:
coeff = poly(2*ones(13,1));
x = [1.6:1e-4:2.4];
y = polyval(coeff,x);
yacc = (x-2).^13;
ebnd = 13*eps*polyval(abs(coeff),abs(x));
% note eps in Matlab = 2*macheps
plot(x,y,'k.',x,yacc,'c',x,y-ebnd,'r',x,y+ebnd,'r')
axis([1.65 2.35 -1e-8 1e-8]), grid
% need to make vertical axis wider to see bounds
axis([1.65 2.35 -1e-6 1e-6]), grid
% conclusion: don't trust the sign of the computed value inside roughly [1.72, 2.33],
%   where the error bound exceeds |p(x)|
Consider Question 1.21: how could we use this error bound to
stop iterating in root finding using bisection?
... now try wider range, look at actual and estimated # correct digits
x = -1:.0001:5;
y = polyval(coeff,x);
yacc=(x-2).^13;
ebnd=13*eps*polyval(abs(coeff),abs(x));
plot(x,-log10(abs((y-yacc)./yacc)),'k.',x,-log10(ebnd./abs(yacc)),'r')
axis([-1 5 0 16]), grid
title('Number of correct digits in y')
Notice the analogy between Horner's rule and computing dot-products:
p = a_d, for i=d-1:-1:0, p = x*p + a_i
s = 0 , for i=1:d, s = x_i*y_i + s
analysis of dot products very similar (homework 1.10)
Now we briefly discuss some properties and uses of floating point that go beyond these
most basic properties:
(1) Exceptions: what happens when the computed result would have an exponent that is too big
to fit, too small to fit, or isn't mathematically defined, like 0/0?
You get an "exception": there are rules about what result is returned and how to
compute with it, plus flags you can check to see whether an exception occurred
What if answer > OV? get infinity (overflow)
fl(1e20*1e20) = inf (overflows in single), fl(1/0) = inf
fl(3+inf) = inf, fl(1/inf) = 0, fl(1/-0) = -inf
What if answer mathematically undefined? get NaN = Not a Number
fl(sqrt(-1)) = fl(0/0) = fl(inf-inf) = NaN
3 + NaN = NaN ... so you see NaN on output if one occurred
What if answer < UN? get underflow
What to do? If underflowed results were simply set to zero, then code like
if (a .ne. b) then x = 1/(a-b) ... could still divide by zero.
Instead, the IEEE standard says you get "subnormal numbers"
x = +- 0.dddd * 2^exp_min instead of +- 1.dddd * 2^exp
[Draw IEEE number line]
Without subnormals, do error analysis with
fl(a op b) = (a op b)(1 + delta) + eta, |delta| <= macheps, |eta| <= UN
With subnormals, can do error analysis with
fl(a op b) = (a op b)(1 + delta) + eta, |delta| <= macheps,
|eta| <= UN*macheps for *,/ and eta=0 for +-
Thm: With subnormals, for all floats a,b fl(a-b) = 0 iff a=b
Purpose: simplify reasoning about floating point algorithms
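Matlab inherits these IEEE rules, so they are easy to try out (a small illustration):
    1e300*1e300            % Inf: overflow
    1/0                    % Inf
    z = -0; 1/z            % -Inf: signed zero behaves as the standard requires
    0/0, Inf - Inf         % NaN in both cases
    3 + NaN                % NaN: NaNs propagate to the output
    realmin/2 > 0          % true: the result is a subnormal, not flushed to zero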
Why bother with exceptions? Why not always just stop when one occurs?
(1) Reliability: too hard to have a test before each floating point operation
to avoid an exception
Ex: control systems (see the Ariane 5 crash on the web page)
Ex Matlab: don't want to go into infinite loop because of input NaN
(caused several fixes to LAPACK, and also helped motivate an on-going CS
research project to build tools to prove that NaNs, infinities, etc cannot
cause infinite loops or analogous bugs)
(2) Speed: ditto (too slow to test before each operation)
(1) run "reckless" code that is fast but assumes no exceptions
(2) check exception flags
(3) in case of exception, rerun with slower algorithm
Ex: s = root-sum-squares = sqrt(sum_{i=1}^n x_i^2)
What could go wrong? (see Q 1.19)
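For example, the obvious one-liner can overflow even though the true answer is far below
OV; rescaling by the largest entry is one possible "careful" fallback (a minimal illustration):
    x = [3e200 4e200];              % true root-sum-squares is 5e200, well below OV
    sqrt(sum(x.^2))                 % Inf: the squares overflow
    s = max(abs(x));
    s*sqrt(sum((x/s).^2))           % 5e200: scale first, then square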
(3) Sometimes one can prove code correct with exceptions:
Ex: Current fastest algorithms to find (some) eigenvalues of
symmetric matrices depend on these, including 1/(-0) = -inf
Impact on LAPACK: One of the fastest algorithms in LAPACK for finding
eigenvalues of symmetric matrices assumes that 1/+0 = +infinity,
and 1/-0 = -infinity, as specified by the IEEE standard.
We were porting this code to an ATI GPU a couple of years ago,
and discovered that they did not handle exceptions correctly:
1/(-0) was +infinity instead of -infinity.
The code depended on getting -infinity to get the right answer,
and (until they fixed their hardware) we had to replace each
reciprocal 1/d by 1/(d+0), which turned a -0 denominator into a +0,
whose reciprocal was correctly computed.
See www.cs.berkeley.edu/~volkov for details
Impact on Matlab: In 2009 we released a new version of
a LAPACK routine used in least squares solvers and eigenvalue
routines, which could be a lot faster on band matrices, and could be
used to more easily generate random orthogonal matrices with a certain
probability distribution requested by some users. Mathworks
noticed that it was more susceptible to over/underflow, and after
careful analysis, we decided this was an intrinsic property, and we
had to remove it from most parts of later releases.
(2) Exploiting lack of roundoff in special cases to get high precision
Fact: if 1/2 < x/y < 2 then fl(x-y) = x-y ... no roundoff
Proof: Cancellation in x-y means exact result fits in p bits
Fact: Suppose abs(M) >= abs(x). Suppose we compute
S = fl(M+x), q = fl(S-M), r = fl(x-q)
In exact arithmetic, we would compute r = x-q = x-(S-M) = x-((M+x)-M) = 0.
But in floating point, S+r = M+x exactly, with r being the roundoff error,
i.e. (S,r) is the double precision sum of M+x
(Proof sketch in a special case: the only roundoff occurs in S=fl(M+x); the
other two operations are exact.)
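A quick Matlab illustration (a minimal example, where the roundoff error in S happens to be exactly 1):
    M = 1e16; x = 1;        % |M| >= |x|, and M+x needs more than 53 bits
    S = M + x               % returns 1e16: the 1 is rounded away
    q = S - M               % returns 0 (computed exactly)
    r = x - q               % returns 1 (computed exactly): the roundoff error in S
    % The pair (S,r) = (1e16,1) now represents M+x = 1e16+1 exactly.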
This trick, called "Accurate 2-Sum", can be generalized in a number of ways to get
very general algorithms available in the following software packages:
ARPREC (see class website): provides arbitrary precision arithmetic
XBLAS (see class website): provides some double-double precision Basic Linear
Algebra Subroutines (BLAS), like matrix-vector-multiplication. LAPACK uses
these routines to provide high accuracy solvers for Ax=b.
ReproBLAS (see class website): Provides bit-wise reproducible parallel
implementations of some BLAS. The challenge here is that since roundoff makes
floating point addition nonassociative, computing a sum in a different order
will usually give a (slightly) different answer. On a parallel machine, where
the number of processors (and other resources) may be scheduled dynamically,
computing a sum in a different order is likely. Since getting even slightly
different results is a challenge for debugging and (in some cases) correctness,
we and others are working on efficient algorithms that guarantee reproducibility.
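A small Matlab illustration of this nonassociativity:
    a = 1; b = 1e-16; c = -1;
    (a + b) + c             % 0: b is lost when it is first rounded into a+b
    a + (b + c)             % about 1.1e-16: same three numbers, different order, different sum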
Finally, we point out the paper "Accurate and Efficient Floating Point Summation,"
J. Demmel and Y. Hida, SIAM J. Sci. Comp, 2003, for an efficient way to get full
accuracy in a summation despite roundoff (in brief: to sum n numbers to full accuracy,
you need to use log_2(n) extra bits of precision, and sum them in decreasing order
by magnitude), and the website www.ti3.tu-harburg.de/rump/ for a variety of papers
on linear algebra with guaranteed accuracy.
(3) Guaranteed error bounds (sometimes) via interval arithmetic:
So far we have used rnd(x) to mean the nearest floating point number to x.
(Note: Ties are broken by choosing the floating point number whose last bit is zero,
also called "round to nearest even". This has the attractive property that half
the ties are rounded up and half rounded down, so there is no bias in long sums.)
But the IEEE Floating Point Standard also requires the operations
rnd_down(x) = largest floating point number <= x
rnd_up(x) = smallest floating point number >= x
Thus [rnd_down(x+y),rnd_up(x+y)] is an interval guaranteed to contain x+y.
And if [x_min,x_max] and [y_min,y_max] are two intervals guaranteed to contain
x and y, resp., then [rnd_down(x_min+y_min),rnd_up(x_max+y_max)] is guaranteed
to contain x+y. So if each floating point number x in a program is represented
by an interval [x_min,x_max], and we use rnd_down and rnd_up to get lower and
upper bounds on the result of each floating point operation (note: multiply
and division are a little trickier than addition), then we can get guaranteed
error bounds on the overall computation; this is called interval arithmetic.
Alas, naively converting all variables and operations in a program to
intervals often results in intervals whose width grows so rapidly that they
provide little useful information. So there has been much research over time in
designing new algorithms that try to compute intervals that are narrow at reasonable
cost. Google "interval arithmetic" or see the website www.ti3.tu-harburg.de/rump/
for more information.
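To make the basic idea concrete, here is a minimal sketch of interval addition in Matlab,
where (since Matlab does not expose the rounding modes) a crude outward widening by one unit
in the last place is substituted for true directed rounding; real packages such as INTLAB
use rnd_down and rnd_up instead (ivadd is a hypothetical helper name):
    function z = ivadd(x, y)
    % x = [x_min x_max], y = [y_min y_max]; returns an interval containing
    % every possible sum of a point in x and a point in y.
      lo = x(1) + y(1);                  % rounded to nearest
      hi = x(2) + y(2);
      z = [lo - eps(lo), hi + eps(hi)];  % widen outward; eps(t) = spacing at t,
                                         % which is at least the rounding error
    end
    % Example: ivadd([1 1], [0.1 0.1]) returns a (very narrow) interval containing 1.1.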
(4) Accurate Linear Algebra despite roundoff:
A natural question is which algebraic formulas have the property that there is some
way to evaluate them so that, despite roundoff, the final answer is always correct
in most of its leading digits:
Ex: general polynomial: no, not without higher precision
Ex: x^2+y^2: yes, no cancellation,
Ex: determinant of a Vandermonde matrix, V(i,j) = x_i^(j-1): the general algorithm
via Gaussian elimination can lose all digits, but the formula
det(V) = prod_{i < j} (x_j - x_i) works
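A small Matlab comparison (an illustration, with nodes chosen to be exactly representable
so that each factor x_j - x_i is computed exactly):
    x = 1 + (0:7)*2^(-17);         % 8 closely spaced, exactly representable nodes
    V = fliplr(vander(x));         % V(i,j) = x(i)^(j-1)
    d1 = det(V);                   % via Gaussian elimination: likely few or no correct digits here
    d2 = 1;
    for j = 2:8
      for i = 1:j-1
        d2 = d2*(x(j) - x(i));     % product formula: accurate to ~28 rounding errors
      end
    end
    [d1 d2]                        % exact answer is about 6e-133; expect d2 to match it closely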
A complete answer is an open question, but there are some necessary and sufficient
conditions based on algebraic and geometric properties of the formula, see article by
Demmel/Dumitriu/Holtz/Koev on "Accurate and Efficient Algorithms in Linear Algebra"
in Acta Numerica v 17, 2008: a class of linear algebra problems is identified that,
like det(Vandermonde), permits accurate solution despite roundoff. We will not discuss
this further, just say that the mathematics depends on results going back to
Hilbert's 17th problem, which asked whether positive rational (or polynomial)
functions could always be written as a sum of squares of other rational (or polynomial)
functions (answers: rational = yes, polynomial = no). For example, when
0 < x_1 < x_2 < ... in the Vandermonde matrix V, it turns out that nearly any
linear algebra operation on V can be done efficiently and to nearly full accuracy
despite roundoff, including eig(V). (This work was cited in coauthor
Prof. Olga Holtz's award of the 2008 European Math Society Prize).